
Table 2 Summary of the experiment using three reads {ATAT, GGGG, CCAA} and three reference sequences {1, 2, 3}

From: Optimal reference sequence selection for genome assembly using minimum description length principle

   

| S.No. | Ref. Seq. | | Reads that do not align to the reference sequence | Data given the hypothesis (Bits) | Regret | Model given by the data (proposed scheme) | Code-length (Bits) |
|---|---|---|---|---|---|---|---|
| 1 | ATAT CGGGG CTATA | 1111011110-1-1-1-1 | CCAA | 12 | 0 | ATATCGGGGCATAT>1111 0 1111 0 -1-1-1-1>CCAA | 102 |
| 2 | ATGGGCCCTTATTGC | 000000000000000 | ATAT>GGGG>CCAA | 42 | 30 | ATGGGCCCTTATTGC>000000000000000>ATAT>GGGG>CCAA | 138 |
| 3 | GGGGCCCCGGGG | 1111-1-1-1-11111 | ATAT>CCAA | 27 | 15 | GGGGCCCCGGGG>1111-1-1-1-11111>ATAT>CCAA | 105 |

  1. Regret is defined as $R(M_i, X) = \mathrm{loss}(M_i, X) - \min_{\hat{M}} \mathrm{loss}(\hat{M}, X)$. Here the loss function, $\mathrm{loss}(M_i, X)$, is the code-length of the data $X$ given the model class $M_i$. "Data given the hypothesis" is the code-length of the "Reads that do not align to the reference sequence". The code-length in the last column is measured according to Equation (3). The experiment shows that, under the proposed MDL scheme, Ref. 1 is the optimal choice of reference sequence.
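
The Regret column can be reproduced from the "Data given the hypothesis" column with a few lines of Python. The sketch below is illustrative only: the names `data_bits`, `regret`, and `BITS_PER_SYMBOL` are hypothetical. The 3-bits-per-symbol cost, with the `>` separator counted as a symbol, is an assumption inferred from the tabulated numbers rather than something stated in the table. It does not attempt to reproduce the last column, which depends on Equation (3).

```python
# Code-lengths (bits) of the unaligned reads, copied from the table,
# keyed by reference sequence number.
data_bits = {1: 12, 2: 42, 3: 27}


def regret(losses):
    """Regret of each model: its loss minus the smallest loss over all models."""
    best = min(losses.values())
    return {model: loss - best for model, loss in losses.items()}


# Assumption (inferred from the numbers, not stated in the table):
# every symbol, including the '>' separator, costs 3 bits.
BITS_PER_SYMBOL = 3
unaligned = {1: "CCAA", 2: "ATAT>GGGG>CCAA", 3: "ATAT>CCAA"}
inferred_bits = {ref: BITS_PER_SYMBOL * len(s) for ref, s in unaligned.items()}

regrets = regret(data_bits)
print(regrets)                      # {1: 0, 2: 30, 3: 15}
print(inferred_bits == data_bits)   # True
```

Ref. 1 has zero regret because its unaligned-read code-length (12 bits) is the minimum over the three candidates.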