Optimal reference sequence selection for genome assembly using minimum description length principle

EURASIP Journal on Bioinformatics and Systems Biology

Table 2 Summary of the experiment using three reads {ATAT, GGGG, CCAA} and three reference sequences {1, 2, 3}

S.No.	Ref. Seq.	Model given by the Data	Reads that do not align to the reference sequence	Data given the hypothesis (Bits)	Regret	Proposed scheme	Code-length (Bits)
1	ATAT CGGGG CTATA	1111011110-1-1-1-1	CCAA	12	0	ATATCGGGGCATAT>1111 0 1111 0 -1-1-1-1>CCAA	102
2	ATGGGCCCTTATTGC	000000000000000	ATAT>GGGG>CCAA	42	30	ATGGGCCCTTATTGC> 000000000000000 >ATAT>GGGG >CCAA	138
3	GGGGCCCCGGGG	1111-1-1-1-11111	ATAT>CCAA	27	15	GGGGCCCCGGGG>1111-1-1-1-11111>ATAT>CCAA	105

Regret is defined as $R_{M_{i}, X} = \operatorname{loss}(M_{i}, X) - \min_{\hat{M}} \operatorname{loss}(\hat{M}, X)$. Here the loss function, $\operatorname{loss}(M_{i}, X)$, is the code-length of the data $X$ given the model class $M_{i}$. “Data given the hypothesis” is the code-length of the “Reads that do not align to the reference sequence”. The code-length in the last column is measured according to Equation (3). The experiment shows that, under the proposed MDL scheme, Ref. 1 is the optimal choice of reference sequence.
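
The regret column and the final selection can be reproduced from the tabulated values alone. The following is a minimal sketch, assuming the "Data given the hypothesis" column is used as the loss; the dictionary layout and the names `regrets`, `regret_bits`, and `best_reference` are illustrative and do not come from the paper. It does not recompute the code-lengths from Equation (3); it only takes them from Table 2.

```python
# Values copied from Table 2, keyed by reference sequence number.
# The variable and function names are hypothetical, chosen for illustration.
data_given_hypothesis_bits = {1: 12, 2: 42, 3: 27}
total_code_length_bits = {1: 102, 2: 138, 3: 105}


def regrets(loss_by_model):
    """Return loss(M_i, X) - min over models of loss(M, X) for each model."""
    best_loss = min(loss_by_model.values())
    return {model: loss - best_loss for model, loss in loss_by_model.items()}


# Regret of each reference sequence relative to the best one.
regret_bits = regrets(data_given_hypothesis_bits)

# The reference with the shortest total code-length is the one selected.
best_reference = min(total_code_length_bits, key=total_code_length_bits.get)

print(regret_bits)
print(best_reference)
```

With the values in Table 2, the computed regrets match the Regret column, and the shortest total code-length belongs to Ref. 1.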