
Table 9 The experiment tests the proposed MDL scheme on a single set of reads but on a number of reference sequences

From: Optimal reference sequence selection for genome assembly using minimum description length principle

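| S.No. | Ref. Seq. (%) | No. of unaligned reads | Code-length (KB) | Length of new Seq. |
|-------|---------------|------------------------|------------------|--------------------|
| 1     | 75            | 172                    | 25.91            | 1755               |
| 2     | 85            | 148                    | 25.10            | 1989               |
| 3     | 95            | 123                    | 24.20            | 2223               |
| 4     | 100           | 109                    | 23.62            | 2341               |
| 5     | 105           | 108                    | 24.22            | 2458               |
| 6     | 115           | 107                    | 25.50            | 2692               |
| 7     | 125           | 106                    | 26.78            | 2926               |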

  1. The set of reads, 390 in total, was derived from ‘Influenza A virus (A/Puerto Rico/8/34 (H1N1)) segment 1, complete sequence’ using the ART read simulator for NGS with read length 30, standard deviation 10, and mean fragment length 100 [79]. The reference sequences were derived from the same H1N1 virus. Ref. Seq. 75%, used in S.No. 1, has a length that is 75% of the actual genome. Similarly, Ref. Seq. 125% consists of a quarter of the actual genome concatenated with the complete H1N1 genome, making its total length 125% of H1N1. All other reference sequences were derived in the same way. The code-length is calculated using Equation (3). The results show that the proposed MDL scheme chooses the correct reference sequence, Ref. Seq. 100% (S.No. 4, which has the smallest code-length, 23.62 KB), even when all the contending sequences are closely related to one another in terms of their genome and length.
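The selection rule described in the note can be sketched in a few lines of Python. This is only an illustration built on the values reported in Table 9: Equation (3) is not reproduced here, so the code-lengths are taken from the table as given rather than recomputed. The names `TABLE_9`, `select_reference` and `derived_length` are hypothetical. The truncating division used for the sequence lengths is an assumption that happens to reproduce every value in the last column, not a rule stated by the authors.

```python
# Rows of Table 9:
# (ref_seq_percent, unaligned_reads, code_length_kb, sequence_length)
TABLE_9 = [
    (75, 172, 25.91, 1755),
    (85, 148, 25.10, 1989),
    (95, 123, 24.20, 2223),
    (100, 109, 23.62, 2341),
    (105, 108, 24.22, 2458),
    (115, 107, 25.50, 2692),
    (125, 106, 26.78, 2926),
]


def select_reference(rows):
    """Return the row with the smallest code-length (the MDL choice)."""
    return min(rows, key=lambda row: row[2])


def derived_length(full_length, percent):
    """Length of a sequence that is `percent` % of the full genome.

    Assumption: fractional bases are truncated (integer division).
    """
    return full_length * percent // 100


best = select_reference(TABLE_9)

# Use the 100% row as the full genome length and check the other rows.
full_length = next(row[3] for row in TABLE_9 if row[0] == 100)
lengths_match = all(
    derived_length(full_length, percent) == length
    for percent, _, _, length in TABLE_9
)

print(f"Selected Ref. Seq. {best[0]}% with code-length {best[2]} KB")
print(f"Lengths consistent with truncated percentages: {lengths_match}")
```

Note that the number of unaligned reads keeps falling past the 100% reference (109, 108, 107, 106), so minimising unaligned reads alone would favour the longest sequence. The code-length is the quantity that reaches its minimum at Ref. Seq. 100%.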