[GSoC 2013] Identifying problems with gene predictions: Thoughts on YAML

One option for our gene validation tool is to store its results in a YAML file. The reasons for choosing YAML, along with a few related thoughts, are laid out below.

YAML is a data serialization language for representing data structures in a human-readable way. The name originally stood for "Yet Another Markup Language" and is now read as the recursive acronym "YAML Ain't Markup Language"; it is an alternative to JSON and XML. Using JSON to serialize class instances means encoding the type information of the instance variables into the JSON object yourself. If the goal is to serialize a Ruby object so that it can be read back somewhere else, YAML is the natural choice. Ruby ships with YAML serialization/deserialization in its standard library, and almost any object you create in Ruby can be serialized into YAML format with very little effort.
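
As a minimal sketch of that round trip: the class `LengthRankResult` and the helpers `to_yaml_text` / `from_yaml_text` are hypothetical names made up for this illustration, not part of the validation tool. Note that recent versions of Ruby's YAML library (Psych 4) refuse to instantiate arbitrary classes through `YAML.load`, so the sketch uses `YAML.unsafe_load` when it is available; only do that with files you trust.

```ruby
require "yaml"

# Hypothetical result class, used only for this illustration.
class LengthRankResult
  attr_reader :percentage, :msg

  def initialize(percentage, msg)
    @percentage = percentage
    @msg = msg
  end
end

# Serialize an object to a YAML string.
def to_yaml_text(obj)
  YAML.dump(obj)
end

# Read the object back. Newer Psych versions need unsafe_load to rebuild
# custom classes; older versions do this with plain load.
def from_yaml_text(text)
  YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
end

text = to_yaml_text(LengthRankResult.new(0.1, "TOO_LONG"))
puts text            # the document carries a !ruby/object tag with the class name
copy = from_yaml_text(text)
puts copy.msg
```

The `!ruby/object:...` tag in the dumped text is what lets Ruby rebuild an instance of the right class when reading the file back.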

The basic components of YAML are associative arrays (maps) and lists. Here is how these components are represented in YAML, JSON and XML. The data used in the example is part of a map that links the description of a gene to its validation results.

YAML:

sp|Q197F7|003L_IIV3: 
  prediction_len: 159
  length_validation_cluster: 
    limits:
    - 156
    - 237
  length_validation_rank: 
    percentage: 0.1
    msg: TOO_LONG

In YAML representation, data structure hierarchy is maintained by outline indentation (one or more spaces). Sequence items are denoted by a dash, and key value pairs within a map are separated by a colon.
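
To make the mapping between this notation and Ruby data structures concrete, here is a small sketch that parses the snippet above. The constant `YAML_TEXT` and the helper `parse_example` are names chosen for this illustration only.

```ruby
require "yaml"

# The example document from above, embedded as a string.
YAML_TEXT = <<~DOC
  sp|Q197F7|003L_IIV3:
    prediction_len: 159
    length_validation_cluster:
      limits:
      - 156
      - 237
    length_validation_rank:
      percentage: 0.1
      msg: TOO_LONG
DOC

# Maps become Ruby hashes, dash-prefixed items become arrays.
def parse_example(text)
  YAML.load(text)
end

gene = parse_example(YAML_TEXT)["sp|Q197F7|003L_IIV3"]
puts gene["length_validation_cluster"]["limits"].inspect
puts gene["length_validation_rank"]["msg"]
```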

JSON:

{"sp|Q197F7|003L_IIV3": {
  "prediction_len": 159,
  "length_validation_cluster": {
    "limits": [
      156,
      237
    ]
  },
  "length_validation_rank": {
    "percentage": 0.1,
    "msg": "TOO_LONG"
  }
}}

As you can see, a JSON document is, apart from a few edge cases, also a valid YAML document. YAML syntax seems to be more accessible to humans: nested delimiters such as {}, [] and quotation marks are replaced with whitespace indentation.
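
A quick way to check the "JSON is also YAML" claim is to feed the same text to both parsers. This is only a sketch: `JSON_TEXT` and `same_in_both_parsers?` are illustrative names, and the check covers this one document, not every possible JSON input.

```ruby
require "json"
require "yaml"

# A well-formed JSON version of the example data.
JSON_TEXT = <<~DOC
  {"sp|Q197F7|003L_IIV3": {
    "prediction_len": 159,
    "length_validation_cluster": {"limits": [156, 237]},
    "length_validation_rank": {"percentage": 0.1, "msg": "TOO_LONG"}
  }}
DOC

# True when the JSON parser and the YAML parser build the same structure.
def same_in_both_parsers?(text)
  JSON.parse(text) == YAML.load(text)
end

puts same_in_both_parsers?(JSON_TEXT)
```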

XML:


<gene def="sp|Q197F7|003L_IIV3">
  <prediction_len value="159"/>
  <length_validation_cluster>
    <limits value="156"/>
    <limits value="237"/>
  </length_validation_cluster>
  <length_validation_rank>
    <percentage value="0.1"/>
    <msg value="TOO_LONG"/>
  </length_validation_rank>
</gene>

Now it's very easy to add an item (a class instance) to the YAML file, using the 'yaml' library that ships with Ruby.

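      require "yaml"

      raport = ValidationRaport.new(prediction, hits)
      # load the existing content of the hash from the YAML file
      # (fall back to an empty hash if the file has no content yet;
      # on Ruby 3.1+ custom objects need YAML.unsafe_load_file instead)
      hsh = YAML.load_file(file_yaml) || {}
      # add a new key in the hash
      hsh[prediction.definition] = raport
      # update the YAML file
      File.open(file_yaml, "w") do |f|
        YAML.dump(hsh, f)
      end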

Part of a real output file generated by the gene validation tool looks like this:


sp|P19084|11S3:HELAN: !ruby/object:Output
  prediction_len: 400
  prediction_def: sp|P19084|11S3_HELAN 11S globulin seed storage protein G3 OS=Helianthus
    annuus GN=HAG3 PE=3 SV=1
  nr_hits: 500
  filename: data/uniprot_curated/uniprot_curated.fasta
  idx: 3
  start_idx: 1
  image_histo_len: data/uniprot_curated/uniprot_curated.fasta_3_len_clusters.jpg
  image_plot_merge: data/uniprot_curated/uniprot_curated.fasta_3_match.jpg
  image_histo_merge: data/uniprot_curated/uniprot_curated.fasta_3_match_2d.jpg
  image_orfs: data/uniprot_curated/uniprot_curated.fasta_3_orfs.jpg
  length_validation_cluster: !ruby/object:LengthClusterValidationOutput
    limits:
    - 445
    - 542
    prediction_len: 400
  length_validation_rank: !ruby/object:LengthRankValidationOutput
    percentage: 0.4
    msg: 'YES'
  reading_frame_validation: !ruby/object:BlastRFValidationOutput
    frames_histo:
      0: 541
    msg: 0:541;
  gene_merge_validation: !ruby/object:GeneMergeValidationOutput
    slope: 0.012590073314548955
    threshold_down: 0.4
    threshold_up: 1.2
  duplication: !ruby/object:DuplciationValidationOutput
    pvalue: 1
    threshold: 0.05
  orf: !ruby/object:ValidationReport
    message: ! '-'

This file is obtained by serializing the output objects into YAML format using Ruby.
In order to compute statistics on the number of positive / false positive / false negative results, we use another YAML file that stores the validation results for a list of data labelled by hand:


- PB18752-RA: 
     valid_length: "no"
     valid_rf: "yes"
     gene_merge: "no"
     duplication: "no"
     orf: "no"

In the statistical test, the validation output and the reference data are compared based on the gene description (this is the key).
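
The comparison could be sketched roughly as follows. This is not the tool's actual statistics code: the helpers `index_by_gene` and `confusion_counts`, the sample `predicted` hash, and the convention that "yes" counts as the positive label are all assumptions made for this illustration.

```ruby
require "yaml"

# A shortened reference file in the format shown above.
REFERENCE_YAML = <<~DOC
  - PB18752-RA:
       valid_length: "no"
       valid_rf: "yes"
DOC

# Flatten the list of one-key maps into a single hash keyed by gene.
def index_by_gene(reference_list)
  reference_list.reduce({}) { |acc, entry| acc.merge(entry) }
end

# Tally how the predicted answers agree with the hand-labelled ones.
# Assumption: "yes" is the positive label.
def confusion_counts(predicted, reference)
  counts = Hash.new(0)
  reference.each do |gene, labels|
    next unless predicted.key?(gene) # compare only genes present in both
    labels.each do |test, expected|
      actual = predicted[gene][test]
      next if actual.nil?
      kind = if actual == "yes"
               expected == "yes" ? :true_positive : :false_positive
             else
               expected == "yes" ? :false_negative : :true_negative
             end
      counts[kind] += 1
    end
  end
  counts
end

reference = index_by_gene(YAML.load(REFERENCE_YAML))
# Hypothetical validation output, reduced to plain "yes"/"no" strings.
predicted = { "PB18752-RA" => { "valid_length" => "yes", "valid_rf" => "yes" } }
p confusion_counts(predicted, reference)
```

Keying both sides by the gene description keeps the lookup simple and makes genes that are missing from either file easy to skip.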

Some things to watch out for when handling YAML files:

- boolean values: note that unquoted `yes`, `no`, `true` and `false` are read as booleans by Ruby's YAML parser, so they should be avoided as plain values in the YAML file. If you write the value `yes` in the YAML, it is converted to `true` in the resulting object after deserialization. To keep these words as strings despite their special meaning, quote them (for example "yes"), as in the reference file above.

- indentation must be used consistently. Each node must be indented further than its parent, and all sibling nodes must use exactly the same indentation level.
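
The boolean caveat from the first point can be checked with a few lines of Ruby. The helper name `load_flag` is made up for this sketch.

```ruby
require "yaml"

# Parse a one-line YAML document and return the value stored under valid_length.
def load_flag(yaml_line)
  YAML.load(yaml_line)["valid_length"]
end

p load_flag("valid_length: yes")    # unquoted: parsed as a boolean
p load_flag('valid_length: "yes"')  # quoted: stays a string
p load_flag("valid_length: no")     # unquoted: parsed as a boolean
```

Comparing the class of each result (`TrueClass`/`FalseClass` versus `String`) is a simple way to spot values that were silently converted.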

To see the indentation rule at work, play a little with this YAML file example (in lines (1) to (4), each leading # stands for one space):


---
(1)-#SI2.2.0_06267: 
(2)##valid_length: "no"  # 2 spaces (#)
(3)-#SI2.2.0_13722:
(4)###valid_length: "yes" # 3 spaces (#)

Object deserialization:


require "yaml"

filename_yml_reference = "test_output/test_reference.yaml"
yml_ref = YAML.load_file(filename_yml_reference)
puts yml_ref

Look at the output and see how a single space makes the difference:


{"SI2.2.0_06267"=>nil, "valid_length"=>"no"}
{"SI2.2.0_13722"=>{"valid_length"=>"yes"}}

Moreover, if you use only one space in lines (2) and (4) you get a parse error...
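
The three cases (one, two and three spaces) can also be reproduced without a file. In this sketch the helpers `reference_yaml` and `try_parse` are illustrative names, and the document is built as a string with a configurable indent for the nested line.

```ruby
require "yaml"

# Build a one-item version of the example, indenting the nested line by `spaces`.
def reference_yaml(spaces)
  "---\n- SI2.2.0_06267:\n#{' ' * spaces}valid_length: \"no\"\n"
end

# Return the parsed structure, or a short message if the parser rejects the text.
def try_parse(text)
  YAML.load(text)
rescue Psych::SyntaxError => e
  "parse error: #{e.message}"
end

[1, 2, 3].each do |n|
  puts "#{n} space(s): #{try_parse(reference_yaml(n)).inspect}"
end
```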
