Tips for a clean GeneValidator installation

This post describes the problems you may run into when installing GeneValidator. Below are the steps I followed to set up GeneValidator on a clean installation of Ubuntu 14.04.

Install GeneValidator with sudo privilege

The first thing to do is install Ruby (>= 1.9.3) and ruby-dev (>= 1.9.1). Note that older versions of Ruby and ruby-dev cause incompatibilities with some of the Ruby gems that GeneValidator uses.

$ sudo apt-get install ruby1.9.3
$ sudo apt-get install ruby1.9.1-dev
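
Before going further, it may be worth confirming that the Ruby on your PATH really meets the minimum version. The snippet below is only a sketch: version_ge is a hypothetical helper name (not part of GeneValidator), and it assumes GNU sort with the -V option, which is what Ubuntu ships.

```shell
# version_ge A B: succeeds when version A >= version B.
# Assumption: GNU sort with -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Ask the interpreter for its version; the result is empty if ruby is not on PATH
installed="$(ruby -e 'print RUBY_VERSION' 2>/dev/null || true)"

if [ -n "$installed" ] && version_ge "$installed" "1.9.3"; then
    echo "ruby $installed is recent enough"
else
    echo "ruby is missing or older than 1.9.3"
fi
```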


Other packages you need are:
  • Git (to download the source code from GitHub):
$ sudo apt-get install git

  • Make (to run the build commands described in makefiles):
$ sudo apt-get install make

  • Rake (Make-like program implemented in Ruby):
$ sudo apt-get install rake


Then, make sure you get the latest version of GeneValidator:

$ git clone git://github.com/monicadragan/GeneValidator.git
$ cd GeneValidator

Build the ruby gem (the command below installs all the dependencies as well):

$ sudo rake

And that's it!

If you use an existing installation of Ruby, you might face compatibility problems when GeneValidator tries to install a gem called Nokogiri (used for parsing the input files). Nokogiri requires the following packages, which need to be installed manually if they are missing: libxml2-dev and libxslt1-dev.

$ sudo aptitude install libxml2-dev libxslt1-dev
$ sudo aptitude build-dep libxml2


Additionally, watch out for the following Nokogiri error message: "libxml2-2.9.0 and higher are currently known to be broken and thus not supported by nokogiri due to compatibility problems and XPath optimization bugs". If you get this message, you might need to install libxml2-2.8 from source.

Install GeneValidator in your own directory (no sudo privilege required)

You can easily install GeneValidator locally by following the instructions below (inspired by this great post). But first make sure that the following packages are installed on your system: ruby (>= 1.9.3), ruby-dev (>= 1.9.1), git, rake, make.

In the following commands replace /var/lib/gems/1.9.1 with the path of your own ruby gem installation directory. You can find it by running:

$ gem environment
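
If you only need the directory itself, RubyGems can print it directly with the gemdir argument. A small sketch: system_gem_dir is a hypothetical helper name, and the fallback value is simply the example path used in this post, not something detected on your machine.

```shell
# system_gem_dir: print the gem installation directory reported by RubyGems.
# Assumption: when gem is not on PATH we fall back to this post's example path.
system_gem_dir() {
    if command -v gem >/dev/null 2>&1; then
        gem environment gemdir
    else
        echo "/var/lib/gems/1.9.1"
    fi
}

echo "Gem directory: $(system_gem_dir)"
```

Run this before changing GEM_HOME: once GEM_HOME is set, RubyGems reports that directory instead of the system one.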

Create a local directory for gem installation:

$ mkdir ~/gems

Set up your .gemrc for gem install-time configuration:

$ cat << EOF > ~/.gemrc
gemhome: $HOME/gems
gempath:
- $HOME/gems
- /var/lib/gems/1.9.1
EOF


Set up some environment variables for run-time:

$ cat << 'EOF' >> ~/.bashrc
export GEM_HOME=$HOME/gems
export GEM_PATH=$HOME/gems:/var/lib/gems/1.9.1
export PATH=$PATH:$HOME/gems/bin
EOF
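
After opening a new terminal (or reloading ~/.bashrc), it can be useful to confirm that the settings took effect. The following is a minimal sketch; path_contains is a hypothetical helper written for this post, and the directory names match the ones used above.

```shell
# path_contains DIR: succeeds when DIR is one of the colon-separated PATH entries
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

if path_contains "$HOME/gems/bin"; then
    echo "gem executables are on PATH"
else
    echo "$HOME/gems/bin is not on PATH yet; try: source ~/.bashrc"
fi

# Show where RubyGems will install gems at run-time
echo "GEM_HOME=${GEM_HOME:-<not set>}"
```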


And now (again) the actual installation of GeneValidator:

$ source ~/.bashrc
$ git clone git://github.com/monicadragan/GeneValidator.git
$ cd GeneValidator
$ rake


And now you are done :).



Back in business with GeneValidator


It's been a while since my last post, but I'm coming back with the same enthusiasm I had at the beginning of GeneValidator. After the end of GSoC I continued the project, though at a slower pace. Over the last months GeneValidator has become more robust and more efficient in terms of speed and memory management. We also placed more emphasis on testing, by increasing the size and variety of the gene sets.

GeneValidator has a huge potential for making the work of biocurators easier and for improving the quality of modern biological data.

We measured the performance of GeneValidator by making two types of comparisons:
  • A comparison between the validation results for corresponding genes from two releases of the same genome project. Here I present the score difference between two versions of the Mus musculus genome, one from 2003 (32,910 genes) and another from 2013 (51,437 genes), for the genes that differ from one version to the other (here called "curated genes"). The plot below shows that a large number of genes have improved in the most recent release, according to the validation scores.


  • A comparison between the validation results for random genes from one weak and one strong database. For instance, a comparison between the validation scores for 10,000 random genes from Swiss-Prot (a strong database, with manually annotated and reviewed sequences) and TrEMBL (a weak database, where genes are automatically annotated and not reviewed) is presented below. As expected, the overall scores indicate better quality for the genes in Swiss-Prot (in pink on the plot).


Our tests suggest that the analysis performed by GeneValidator is going in the right direction. However, deciding whether a gene has prediction errors or is simply very specific (almost unique) is very challenging. Feedback from biocurators working on recently curated data is therefore most welcome!