Tips for a clean GeneValidator installation

The aim of this post is to state the possible challenges that one can face when installing GeneValidator. In the following I will describe the steps I did for setting up GeneValidator on a clean installation of Ubuntu 14.04.

Install GeneValidator with sudo privilege

First thing to do is to install Ruby (>= 1.9.3) and Ruby-dev (>=1.9.1). Note that older versions of ruby and ruby-dev lead to incompatibility issues with some of the ruby gems that GeneValidator uses. 

$ sudo apt-get install ruby1.9.3
$ sudo apt-get install ruby1.9.1-dev


Other packages you need are:
  • Git (to download the source code from GitHub):
$ sudo apt-get install git

  • Make (to execute shell commands files)
$ sudo apt-get install make

  • Rake (Make-like program implemented in Ruby):
$ sudo apt-get install rake


Then, make sure you get the last version of GeneValidator :

$ git clone git://github.com/monicadragan/GeneValidator.git

Build the ruby gem (the command below installs all the dependencies as well):

$ sudo rake

And that's it!

If you use a previous installation of Ruby you might face compatibility problems when GeneValidator tries to install a gem called Nokogiri (used for parsing the input files). Nokogiri requires the following packages which need to be manually installed if missing: libxml2-dev libxslt1-dev

$ sudo aptitude install libxml2-dev libxslt1-dev
$ sudo aptitude build-dep libxml2


Additionally, be careful about the following Nokogiri error message: "libxml2-2.9.0 and higher are currently known to be broken and thus not supported by nokogiri due to compatibility problems and XPath optimization bugs". If you get this message you might need to install libxml2-2.8 from the sources.  

Install GeneValidator in your own directory (no sudo privilege required)

You can easily install GeneValidator locally following the instructions below (inspired from this great post). But first make sure that the following packages are installed on your system: ruby (>=1.9.3), ruby-dev(>=1.9.1), git, rake, make.

In the following commands replace /var/lib/gems/1.9.1 with the path of your own ruby gem installation directory. You can find it by running:

$ gem environment

Create a local directory for gem installation:

$ mkdir ~/.gems

Set up your .gemrc for gem install-time configuration:

$ cat << EOF > ~/.gemrc
gemhome: $HOME/gems
gempath:
- $HOME/gems
- /var/lib/gems/1.9.1
EOF


Set up some environment variables for run-time:

$ cat << EOF >> ~/.bashrc
export GEM_HOME=$HOME/gems
export GEM_PATH=$HOME/gems:/var/lib/gems/1.9.1
export PATH=$PATH:$HOME/gems/bin
EOF


And now (again) the actual installation of GeneValidator:

$ git clone git://github.com/monicadragan/GeneValidator.git
$ rake


And now you are done :).



Back in business with GeneValidator


It's been a while since my last post. But I'm coming back with the same enthusiasm I had in the beginning of GeneValidator. After the end of GSoC I continued the project, at a lower speed though. In the last months GeneValidator have become more robust and efficient in terms of speed and memory management. We also put more accent on testing, by enlarging the size and type of the gene-sets. 

GeneValidator has a huge potential for making the work of biocurators easier and for improving the quality of modern biological data.

We measured the performance of GeneValidator by making two types of comparisons:
  • comparison between the validation results of the corresponding genes from two releases of the same genome project. In the following I present the score difference between two versions of Mus Musculus genome, one from 2003 (32,910 genes) and another from 2013 (51,437 genes) for the genes that differ from one version to the other (here called "curated genes"). From the plot below it can be easily seen that in the most recent release a large number of genes have improved, according to the validation scores.


  • comparison between the validation results for random genes from one weak and one strong database. For instance, a comparison between the validation scores for 10,000 random genes from Swiss-prot (strong database, with manually annotated and reviewed sequences) and Trembl (weak database - genes are automatically annotated and not reviewed) is presented below. As expected, the statistics show better quality genes (overall scores) for the genes in Swiss-prot (in pink on the plot).


Our tests prove that the analysis performed by GeneValidator goes in the right direction. However, deciding whether a gene has prediction errors or is very specific (almost unique) is very challenging. Thus, feedback from biocurators working on recently curated data is mostly welcome!


 



Try our Gene Validation Tool

We are on the second half of the development period and the structure of the code looks like this:

 
If you have doubts about some genes you use in your analysis or you are just curios about the progress of our project, try the steps described below:

* How to use the tool at this point:   

    1. Prerequisite: ncbi blast >=2.2 and R >= 2.14.2

    2. Get the source code
$ git clone git@github.com:monicadragan/GeneValidator.git
    3. Be sudo and build the gem

$ sudo rake
    4. Run GeneValidation
$ genevalidatior [-x xml_file_path]
[-r raw_seq_file_path] fasta_file_path

Example:
$ genevalidator -x data/all_validations_prot/all_validations_prot.xml data/all_validations_prot/all_validations_prot.fasta -r data/all_validations_prot/all_validations_prot.xml.raw_seq


This will provide validation results in yaml and html format, at the same path with your input file. In html format you can visualize the plots and histograms for certain statistics used in the validation steps.

Other things:
 
    5. Run unit tests

$ rake test

     6. Generate documentation

$ rake doc
 
* How to add a new validation (3 steps)
Steps 1 and 2: Each validation is defined by 2 extended classes:
(1)- extend ValidatioReport - stores the output of the validation test and some methods used for data representation (validation, print, color)
(2)- extend ValidationTest ('run' method must be overloaded, run must return ValidationReport class type,  some fields have to be updated: the validation name used in the header, description, plot_files)
Step 3:  Add the validation to the validations list, which is further processed for yaml/html/console visualization

Code:
validations = []
validations.push LengthClusterValidation.new(@type, prediction, hits, plot_path, plots)
 ...
validations.push YourValidationClass(@type, prediction, hits, other_arguments)

# check the class type of the elements in the list
# this will raise an error if YourValidationClass does not extend ValidationTest validations.map do |v|
  raise ValidationClassError unless v.is_a? ValidationTest
end

# run validations
validations.map{|v| v.run}

# check the class type of the validation reports
# this will raise an error if the run method of YourValidationClass does not return ValidationReport
validations.map do |v|
  raise ValidationClassError unless v.validation_report.is_a? ValidationReport
end


Hope this was helpful and we look forward to your feedback!

Thoughts on YAML


One option for our gene validation tool is to store the output results into a YAML file. Why we use YAML and other thoughts about this are exposed below. 

YAML is a data serialization language for representing data structures in a human-readable way. It's "Yet Another Markup Language", along with JSON and XML. Using JSON to serialize class instances means including type information of the instance variables into a JSON object. If the goal is serialization the Ruby object so that to be read back somewhere else, YAML is the choice. Ruby has built-in automagical YAML serialization/deserialization. Any object you create in Ruby can be serialized with pretty much no effort into YAML format.

The basic components of YAML are associative arrays (maps) and lists. Here is an example for the representation of these components in YAML, JSON and XML. The data used in the example is part of a map that links the description of the gene and the validation results.

YAML:

sp|Q197F7|003L_IIV3:
  prediction_len: 159
  length_validation_cluster:
    limits:
    - 156
    - 237
  length_validation_rank:
    percentage: 0.1
    msg: TOO_LONG


In YAML representation, data structure hierarchy is maintained by outline indentation (one or more spaces). Sequence items are denoted by a dash, and key value pairs within a map are separated by a colon. 

JSON:

{"sp|Q197F7|003L_IIV3": {
  "prediction_len": 159,
  "length_validation_cluster": {
    "limits": [
      {"value": "156"},
      {"value": "237"},     
    ]
  }
  "length_validation_rank": { 
    "percentage": 0.1
    "msg": "TOO_LONG"
  }
}}

As you can see, JSON files are also valid YAML files. YAML syntax semms to be more human accessible -  nested delimiters like {}, [], and " marks are replaced with white space indents.

XML:

<gene def="sp|Q197F7|003L_IIV3">
  <prediction_len value=159/>
  <length_validation_cluster>
    <limits value="156"/>
    <limits value="237"/>
    <prediction_len value="159"/>
  </length_validation_cluster>
  <length_validation_rank>
    <percentage value="0.1">
    <msg value="TOO_LONG"> 
  </length_validation_rank>
</gene>


Now it's very easy to add an item (class instance) to the YAML file, using the 'yaml' ruby gem.
         
      raport = ValidationRaport(prediction, hits)
      # load the existing content of the hash from the yaml file       
      hsh = YAML.load_file(file_yaml)
      # add a new key in the hash
      hsh[prediction.definition] = raport
      # update YAML file
      File.open(file_yaml, "w") do |f|
         YAML.dump(hsh, f)
      end


Part of a real output file generated by the gene validation tool looks like this:

sp|P19084|11S3:HELAN: !ruby/object:Output
  prediction_len: 400
  prediction_def: sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus
    annuus GN=HAG3 PE=3 SV=1
  nr_hits: 500
  filename: data/uniprot_curated/uniprot_curated.fasta
  idx: 3
  start_idx: 1
  image_histo_len: data/uniprot_curated/uniprot_curated.fasta_3_len_clusters.jpg
  image_plot_merge: data/uniprot_curated/uniprot_curated.fasta_3_match.jpg
  image_histo_merge: data/uniprot_curated/uniprot_curated.fasta_3_match_2d.jpg
  image_orfs: data/uniprot_curated/uniprot_curated.fasta_3_orfs.jpg

  length_validation_cluster: !ruby/object:LengthClusterValidationOutput
    limits:
    - 445
    - 542
    prediction_len: 400
  length_validation_rank: !ruby/object:LengthRankValidationOutput
    percentage: 0.4
    msg: 'YES'
  reading_frame_validation: !ruby/object:BlastRFValidationOutput
    frames_histo:
      0: 541
    msg: 0:541;
  gene_merge_validation: !ruby/object:GeneMergeValidationOutput
    slope: 0.012590073314548955
    threshold_down: 0.4
    threshold_up: 1.2
  duplication: !ruby/object:DuplciationValidationOutput
    pvalue: 1
    threshold: 0.05
  orf: !ruby/object:ValidationReport
    message: ! '-'


This file is obtained by encoding object into YAML format using Ruby.
In order make statistics on the number of positive / false positive / false negative results, we use another YAML file that stores the validation results for a list data labelled by hand:

- PB18752-RA:
     valid_length: "no"
     valid_rf: "yes"
     gene_merge: "no"
     duplication: "no"
     orf: "no"


In the statistical test, the validation output and the reference data are compared based on the gene description (this is the key).

Some aspects one has to care about when handling YAML files:

- boolean variables: note that "yes, no, true, false" are reserved words and may be avoided in the YAML file. If you use the value `yes` in the YAML this will actually be converted to `true` in the resulting object, after deserialization. To fix this, you can treat these reserved words as strings, despite their special meaning.

- indentation should be properly used. Each node must be indented further than its parent and all sibling nodes must use the exact same indentation level.

Just play a little with this YAML file example (spaces are marked with #):

---
(1)-#SI2.2.0_06267:
(2)##valid_length: "no"  # 2 spaces (#)
(3)-#SI2.2.0_13722:
(4)###valid_length: "yes" # 3 spaces (#)


Object deserialization:

require "yaml"
filename_yml_reference = "test_output/test_reference.yaml"
yml_ref = YAML.load_file(filename_yml_reference)
puts yml_ref


Look at the output and see how a single space makes the difference:

{"SI2.2.0_06267"=>nil, "valid_length"=>"no"}
{"SI2.2.0_13722"=>{"valid_length"=>"yes"}}
Moreover, if you use only one space in lines (2) and (4) you get a parse error...

   

Half of GSoC

It's been a while since my last post on the blog: one week holiday, followed by a code refactorization and bug fixing period and finaly, the GSoC mid term evaluations. Meanwhile, the application took shape, became more robust and half of the validations have been implemented. Currently, our application analysis the predicted genes and provides useful information based on the similarities to genes in public databases (e.g length validation, gene merge and sequence duplication checking, reading frame and main ORF validation). An output example is available at [1]. 

The results of the validations can be visualized in console, in yaml or html format. By choosing the html output format, you enable plot image generation. The distribution of the data specific to each kind of distribution can be graphically visualized and possible errors in the genes may be highlighted.

At the moment our deliverable is a ruby gem, decorated with unit and statistical tests and automatically generated documentation. It can be cloned from the 'rubygem' branch of the git repo. I'll come back soon with another post regarding the application code structure and how new validations can be added.

Validations were tested and labelled by hand. What I noticed is that even genes that passed through a first round of curation (e.g those on ncbi databases) are susceptible of not being accurately predicted. Further, we are looking for a list of recently curated genes (unreleased yet) in Hymenopteragenome database [2] to check if our tool makes evidence about some improvements among predictions from two different releases.

What's next? We start a new validation test based on multiple alignment in order highlight the extra regions/gaps in the prediction and check whether the conserved regions appear in the predicted sequence.

If you are a biocurator, you may want to use our tool to save your time and facilitate your work, by keeping track of the genes that are susceptible of having problems so that to be curated first or discarded. If you are a bioinformatics enthusiast, you may want to see if the genes form the databases you use have problems and evaluate how strong your gene analysis is, as long as the data you use is verified and validated.

As half of the GSoC already passed, I just want to say that I enjoy a lot the time spent on this project and the people I met on this occasion. What is cool about GSoC is that you work on the project you are keen on and manage your time as you wish. Also, working remotely involves additional challenges. Regarding Ruby, it's been several weeks since I started programming in this language and what I can say by now is that I got along very well with it. It's an awesome language and very intuitive to use for someone who once got in touch with Python/Haskell. 

I am coming back next with a post on my first experience with YAML data representation.

[1] http://swarm.cs.pub.ro/~mdragan/gsoc2013/one_direction_gene_merge/html
[2] http://hymenopteragenome.org/

First Four Validations: DONE


By now we validate 4 things about the predicted genes: the length (by clusterization and ranking - the rank of the prediction among all the hits), the reading frame, whether there is a duplication in the prediction and whether the prediction is in fact a merge between multiple genes.

We adjusted our previous approach for merge detection in order to have fewer false positives. I'll briefly explain how we do the validation now:
- we plot a 2D graph by using the start/end offsets of the matched regions in the prediction on the two axis
- we draw a line obtained by linear regression 

- a gene merge is present if the slope of the line is between 0.4 and 1.2
- these thresholds were chosen empirically and will be adjusted after analyzing a larger amount of data.


Here it is a simplified drawing explaining this:

 

The output can be provided in html format (if you use the proper command line arguments). Here you can find sume examples, grouped by validation test:
Length validation: http://swarm.cs.pub.ro/~mdragan/gsoc2013/solenopsis_length_test/
Gene merge validation: http://swarm.cs.pub.ro/~mdragan/gsoc2013/one_direction_gene_merge/
Duplication check: http://swarm.cs.pub.ro/~mdragan/gsoc2013/duplications/ 



If I convinced you, try the application yourself:
Update: please clone from the developmenet branch
git clone git@github.com:monicadragan/gene_prediction.git
ruby prediction.rb -t protein --skip_blast --outfmt html gene_merge_validation/duplications/duplications.xml

Problems with predicted genes - gene merges

Apart from deciding whether the length of the prediction is similar to the majority of the lengths of the BLAST hits, there are many other validation tests that have to be applied on the predicted data. This week we have been focusing on reading frame inconsistency, detecting gene merges and duplications.
  • Gene duplications - can occur due to an error in the replication mechanism of the cell, or simply  due to a sequencing error, which we have to report. In these situations BLAST finds more high-scoring segment pairs (hsp) from a single hit. Here's an example for PB16966 gene (of Yannick's knowledge), where a sequence of 6 exons is repeated. BLAST found the following:    <Hit_num>1</Hit_num>
            <Hsp_num>1</Hsp_num>
                      <Hsp_query-from>1715</Hsp_query-from>
                      <Hsp_query-to>3346</Hsp_query-to>
                      <Hsp_hit-from>6</Hsp_hit-from>
                      <Hsp_hit-to>1642</Hsp_hit-to>

            <Hsp_num>2</Hsp_num>
                      <Hsp_query-from>10</Hsp_query-from>
                      <Hsp_query-to>1630</Hsp_query-to>
                      <Hsp_hit-from>1</Hsp_hit-from>
                      <Hsp_hit-to>1642</Hsp_hit-to> 
                                            

    Notice that in such cases hsps from the same hit do overlap and the matched regions in the predictions do not.                                                                                                          
  • Reading frame inconsistency - this case can be retrieved from the BLAST output: for a sequence of nucleotides, we use blastx to find the most similar proteins. The query sequence is translated in six reading frames, and all the resulting six-protein sequences are compared (there's no evidence where the codon starts from, so all the 3 possibilities of starting the gene translation are considered in BLAST; also another 3 possibilities when reading the sequence in reverse). There can be a problem in the predicted sequence when the hits have various reading frames, of the same sign (sequences are read in the same direction). The sequencing errors leading to this type of errors are usually reading frame shifts. For example a nucleotide was accidentally introduced at a certain point of the mRNA extraction, causing a reading frame inconsistency between the left and the right part pf the predicted gene, but still read in the same direction. 
  •  When reading frame is consistent for all the hits, we consider 3 cases:  

    • One direction gene merge (----> ----------->)   - same sign reading frames for all hits
    • Opposite direction gene merge (----><----------)   - two opposite sign reading frames for the hits
    • otherwise we have no merge

    Our approach  for identifying gene merges is to analyze the distribution of the middles of each region in the predicted gene that matches a hit gene from BLAST. You can observe in the plots below that in case of gene merge the data have a multimodal distribution, otherwise the distribution is unimodal. We calculate a score that quantifies whether the assumption of unimodality is more appropriate than the one for multimodality using the approach described in [3] for model based clustering. More plots here [2]. 

    1) Example for merged genes (bimodal distribution)

        2) No merge (unimodal distribution)