Length validation

So far, our application looks like this:

In other words, given a FASTA file containing sequences of the same type (mRNA or protein), the program calls BLAST to retrieve, for each predicted gene, a set of reference sequences (those most similar to it). For now, of all the information BLAST provides, we use only the lengths of the reference and predicted sequences, which is what we need to start validating the length of each prediction.

As we observed, the length distribution does not fit a bell curve. Instead, we find the majority lengths among the reference lengths with a standard hierarchical (agglomerative) clustering. We start by placing each length in its own cluster. At each step we merge the two closest clusters, and we stop as soon as one cluster contains more than 50% of the reference sequences.
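The merging procedure described above can be sketched in a few lines of Ruby. This is only an illustration, not the code from the repository: the method names `mean` and `majority_cluster` are made up here, and the distance between two clusters is assumed to be the absolute difference of their mean lengths, since the post does not say which distance is used.

```ruby
# Illustrative sketch of the clustering step (not the repository's code).
# Assumption: cluster distance = absolute difference of the cluster means.

def mean(values)
  values.sum.to_f / values.size
end

# Returns the first cluster that holds more than 50% of the lengths.
def majority_cluster(lengths)
  return [] if lengths.empty?

  # Start with one cluster per length. Sorting keeps every cluster a
  # contiguous run, so the two closest clusters are always neighbours.
  clusters = lengths.sort.map { |len| [len] }
  threshold = lengths.size / 2.0

  loop do
    biggest = clusters.max_by(&:size)
    return biggest if biggest.size > threshold

    # Index of the left cluster in the closest adjacent pair.
    i = (0...clusters.size - 1).min_by do |k|
      (mean(clusters[k + 1]) - mean(clusters[k])).abs
    end

    # Replace the two clusters with their union.
    clusters[i, 2] = [clusters[i] + clusters[i + 1]]
  end
end

reference_lengths = [100, 102, 105, 300, 310, 900, 101]
p majority_cluster(reference_lengths)  # => [100, 101, 102, 105]
```

In the example, four of the seven reference lengths end up in one cluster, so the merging stops before the outliers (300, 310, 900) are absorbed.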

The result of the clustering is shown in the histogram, with the densest cluster in red. The length of the predicted sequence is marked by a black vertical line.
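One simple way to turn this picture into a decision is to check whether the black line falls inside the range covered by the red cluster. The sketch below assumes exactly that rule. The post does not state the acceptance criterion, so both the rule and the method name `length_accepted?` are hypothetical.

```ruby
# Hypothetical acceptance rule (an assumption, not taken from the post):
# accept the prediction when its length lies within the range spanned
# by the densest cluster of reference lengths.
def length_accepted?(predicted_length, dense_cluster)
  return false if dense_cluster.empty?

  predicted_length.between?(dense_cluster.min, dense_cluster.max)
end

dense_cluster = [100, 101, 102, 105]
puts length_accepted?(103, dense_cluster) ? "ACCEPTED" : "UNACCEPTED"
puts length_accepted?(250, dense_cluster) ? "ACCEPTED" : "UNACCEPTED"
```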

Here are some outputs computed for predicted protein sequences from the ant Solenopsis invicta [1]:

1) ACCEPTED predicted sequence

2) ACCEPTED predicted sequence

3) UNACCEPTED predicted sequence

4) UNACCEPTED predicted sequence

You can try the application yourself by cloning the code from GitHub [2] and meeting the requirements (some Ruby gems and paths must be added to CLASSPATH; see the README). More histograms can be found here [3].

The next step is to add a confidence percentage for each length validation test.

Any feedback from you is welcome and valuable!

[1] http://www.antgenomes.org/downloads/
[2] https://github.com/monicadragan/gene_prediction
[3] https://github.com/monicadragan/gene_prediction/tree/master/test_dataset/plots

