Optimal reference sequence selection for genome assembly using minimum description length principle

EURASIP Journal on Bioinformatics and Systems Biology

Table 9 The experiment tests the proposed MDL scheme on a single set of reads against a number of reference sequences

S.No.	Ref. Seq. (%)	No. of unaligned reads	Code-length (KB)	Length of new Seq.
1	75	172	25.91	1755
2	85	148	25.10	1989
3	95	123	24.20	2223
4	100	109	23.62	2341
5	105	108	24.22	2458
6	115	107	25.50	2692
7	125	106	26.78	2926

The set of reads, 390 in total, was derived from ‘Influenza A virus (A/Puerto Rico/8/34 (H1N1)) segment 1, complete sequence’ using the ART read simulator for NGS with read length 30, standard deviation 10, and mean fragment length of 100 [79]. The reference sequences were derived from the same H1N1 virus. Ref. Seq. 75%, used in S.No. 1, has a length that is 75% of the actual genome. Similarly, Ref. Seq. 125% has a quarter of the actual genome concatenated with the complete H1N1 genome, making its total length 125% of H1N1. All other reference sequences were derived in the same way. The code-length is calculated using Equation (3). The results show that the proposed MDL scheme chooses the correct reference sequence, Ref. Seq. 100% (S.No. 4, with the minimum code-length of 23.62 KB), even when all the contending sequences are closely related to one another in terms of their genome and length.
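The selection step described above can be illustrated with a minimal Python sketch. It does not compute Equation (3); it only takes the code-lengths already reported in Table 9 and picks the reference with the smallest value. The names `TABLE_9`, `select_reference`, and `FULL_LENGTH` are hypothetical and introduced here for illustration. The sketch also checks an observation about the table, namely that each reported sequence length equals the given percentage of 2341 (the length listed for Ref. Seq. 100%) rounded down.

```python
# Rows transcribed from Table 9:
# (ref. seq. % of genome, unaligned reads, code-length in KB, length of new seq.)
TABLE_9 = [
    (75, 172, 25.91, 1755),
    (85, 148, 25.10, 1989),
    (95, 123, 24.20, 2223),
    (100, 109, 23.62, 2341),
    (105, 108, 24.22, 2458),
    (115, 107, 25.50, 2692),
    (125, 106, 26.78, 2926),
]

# Length listed in the table for Ref. Seq. 100%.
FULL_LENGTH = 2341


def select_reference(rows):
    """Return the row with the smallest code-length (the MDL choice)."""
    return min(rows, key=lambda row: row[2])


best = select_reference(TABLE_9)

# Check that each listed length is floor(pct * FULL_LENGTH / 100).
lengths_consistent = all(
    length == FULL_LENGTH * pct // 100 for pct, _, _, length in TABLE_9
)

print(f"Selected Ref. Seq. {best[0]}% with code-length {best[2]} KB")
print(f"Listed lengths match percentage of {FULL_LENGTH}: {lengths_consistent}")
```

Note that the number of unaligned reads alone would not single out the correct reference: it keeps decreasing slightly beyond 100%, whereas the code-length reaches its minimum at Ref. Seq. 100% and rises on either side.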