Back in business with GeneValidator


It's been a while since my last post, but I'm coming back with the same enthusiasm I had at the beginning of GeneValidator. After GSoC ended I continued the project, though at a slower pace. Over the last months GeneValidator has become more robust and more efficient in terms of speed and memory management. We have also put more emphasis on testing, by enlarging both the size and the variety of the gene sets we test against.

GeneValidator has huge potential to make the work of biocurators easier and to improve the quality of modern biological data.

We measured the performance of GeneValidator by making two types of comparisons:
  • a comparison between the validation results for corresponding genes from two releases of the same genome project. Below I present the score differences between two versions of the Mus musculus genome, one from 2003 (32,910 genes) and one from 2013 (51,437 genes), for the genes that differ between the two versions (here called "curated genes"). The plot below shows that, according to the validation scores, a large number of genes have improved in the more recent release.

[Plot: validation-score differences for the curated genes between the 2003 and 2013 Mus musculus releases]

  • a comparison between the validation results for random genes from one weak and one strong database. For instance, the plot below compares the validation scores of 10,000 random genes from Swiss-Prot (a strong database, with manually annotated and reviewed sequences) and TrEMBL (a weak database, whose genes are automatically annotated and not reviewed). As expected, the statistics show better quality genes (overall scores) in Swiss-Prot (in pink on the plot). A small sketch of how such a comparison can be scripted is shown after the plots.

[Plot: overall-score distributions for 10,000 random genes from Swiss-Prot (pink) and TrEMBL]

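To give a concrete sense of how two gene sets can be compared, here is a minimal sketch in Python. It assumes you have already run GeneValidator on each set and extracted the overall scores into plain-text files with one score per line; the file names and the 90-point cutoff below are hypothetical placeholders for illustration, not part of GeneValidator itself.

```python
#!/usr/bin/env python
"""Compare two distributions of GeneValidator overall scores.

Assumes two plain-text files with one overall score (0-100) per line,
extracted from GeneValidator's output for each gene set. The file
names below are hypothetical placeholders.
"""
import statistics


def load_scores(path):
    """Read one numeric score per line, skipping blank lines."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def summarize(name, scores):
    """Print simple summary statistics for one score distribution."""
    print(f"{name}: n={len(scores)} "
          f"mean={statistics.mean(scores):.1f} "
          f"median={statistics.median(scores):.1f} "
          f"stdev={statistics.stdev(scores):.1f}")


if __name__ == "__main__":
    # Hypothetical score files, one per gene set being compared
    # (e.g. Swiss-Prot vs. TrEMBL random samples).
    strong = load_scores("swissprot_scores.txt")
    weak = load_scores("trembl_scores.txt")
    summarize("Swiss-Prot", strong)
    summarize("TrEMBL", weak)

    # Fraction of genes with a high overall score in each set; the
    # 90-point cutoff is an arbitrary illustration, not a threshold
    # defined by GeneValidator.
    for name, scores in (("Swiss-Prot", strong), ("TrEMBL", weak)):
        high = sum(1 for s in scores if s >= 90) / len(scores)
        print(f"{name}: {high:.0%} of genes score >= 90")
```

The release-to-release comparison follows the same idea: load the scores of the genes present in both releases and look at the per-gene score differences instead of the two overall distributions.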
Our tests indicate that the analysis performed by GeneValidator goes in the right direction. However, deciding whether a gene has prediction errors or is simply very specific (almost unique) remains challenging. Feedback from biocurators working on recently curated data is therefore most welcome!
