Optimal reference sequence selection for genome assembly using minimum description length principle

EURASIP Journal on Bioinformatics and Systems Biology

Table 8 The proposed MDL scheme applied to the same set of reads with different sets of reference sequences

S.No.	Ref. Seq. (%)	No. of unaligned reads	Code-length (KB)	Execution time (s)	Length of new Seq.
1	1	696	128.60	0.046	14
2	2	696	128.73	0.031	47
3	5	693	128.575	0.046	113
4	10	684	127.576	0.046	229
5	25	668	126.615	0.093	565
6	50	650	126.615	0.109	650
7	100	3	14.276	0.078	2342
8	150	2	21.164	0.062	2341
9	200	2	27.808	0.124	2341
10	300	2	41.525	0.140	2341

The read set contained 3817 reads, all derived from ‘Influenza A virus (A/Puerto Rico/8/34(H1N1)) segment 1, complete sequence’. From these 3817 reads the method extracted 696 unique reads, which were then used in the proposed MDL scheme. All reference sequences were derived from the same Influenza A (H1N1) virus. Ref. Seq. 1%, used in S.No. 1, has a length equal to 1% of the actual genome; similarly, Ref. Seq. 25% has a quarter of the length of the actual genome, and the other reference sequences were derived in the same way. For example, Ref. Seq. 200% consists of two copies of the H1N1 sequence concatenated together, making it twice the length of the original. The code-length is calculated using Equation (3). The results show that the proposed MDL scheme chooses the reference sequence with the smallest code-length as determined by Equation (3). It favors neither a shorter reference sequence that leaves more reads unaligned nor a longer reference sequence that leaves only slightly fewer reads unaligned. The scheme selects Ref. Seq. 100% (S.No. 7), which has the smallest code-length (14.276 KB), as the optimal reference sequence. Because this is the sequence from which all the reads were derived, the experiment also confirms that the scheme selects the correct reference sequence.
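The selection step described above can be illustrated with a minimal sketch that takes the code-lengths reported in Table 8 as given and picks the smallest. It does not implement Equation (3), which is not reproduced in this section; the names `TABLE_8` and `select_reference` are hypothetical and chosen only for illustration.

```python
# Rows copied from Table 8:
# (S.No., reference length as % of the genome, unaligned reads, code-length in KB)
TABLE_8 = [
    (1, 1, 696, 128.60),
    (2, 2, 696, 128.73),
    (3, 5, 693, 128.575),
    (4, 10, 684, 127.576),
    (5, 25, 668, 126.615),
    (6, 50, 650, 126.615),
    (7, 100, 3, 14.276),
    (8, 150, 2, 21.164),
    (9, 200, 2, 27.808),
    (10, 300, 2, 41.525),
]


def select_reference(rows):
    """Return the row with the smallest reported code-length (hypothetical helper)."""
    return min(rows, key=lambda row: row[3])


# Selection by code-length, as the MDL scheme does.
best_by_code_length = select_reference(TABLE_8)

# For contrast: the first row with the fewest unaligned reads.
fewest_unaligned = min(TABLE_8, key=lambda row: row[2])

print("Smallest code-length:", best_by_code_length)
print("Fewest unaligned reads:", fewest_unaligned)
```

With the values in Table 8, selecting by code-length picks S.No. 7 (Ref. Seq. 100%), whereas selecting by unaligned reads alone would pick one of the longer concatenated references (S.No. 8 to 10), which leave one fewer read unaligned but have a larger code-length.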