db.rb

Path: lib/bio/db.rb
Last Update: Thu Feb 18 18:16:45 +0000 2010

bio/db.rb - common API for database parsers

Copyright: Copyright (C) 2001, 2002, 2005 Toshiaki Katayama <k@bioruby.org>
License: The Ruby License

$Id: db.rb,v 0.38 2007/05/08 17:02:13 nakao Exp $

On-demand parsing and cache

The flatfile parsers (subclasses of Bio::DB) split the original entry into a Hash and store that hash in the @orig instance variable. Detailed parsing is deferred until a method is called that needs to parse the contents of the @orig hash further. Fully parsed data is cached separately in another hash, @data.
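
The on-demand scheme described above can be sketched in plain Ruby as follows. This is only an illustration under stated assumptions, not the BioRuby implementation: the class name LazyEntry, the parse_count counter, and the parse_field helper are hypothetical, and the tag width of 12 is borrowed from the template further down.

```ruby
# Hypothetical stand-in for a Bio::DB subclass (not the real BioRuby code).
class LazyEntry
  TAGSIZE = 12 # assumed width of the tag field

  attr_reader :parse_count # illustration only: counts detailed parses

  def initialize(entry)
    @orig = {} # tag => raw lines of that field, filled at construction
    @data = {} # tag => fully parsed value, filled on demand
    @parse_count = 0
    tag = nil
    entry.each_line do |line|
      head = line[0, TAGSIZE].to_s.strip
      tag = head unless head.empty? # continuation lines keep the last tag
      next if tag.nil?
      (@orig[tag] ||= String.new) << line
    end
  end

  # Detailed parsing happens on the first call; later calls hit the cache.
  def definition
    @data["DEFINITION"] ||= parse_field("DEFINITION")
  end

  private

  def parse_field(tag)
    @parse_count += 1
    raw = @orig.fetch(tag, "")
    raw.lines.map { |l| l[TAGSIZE..-1].to_s.strip }.join(" ")
  end
end

entry = [
  "ENTRY_ID".ljust(12) + "A12345",
  "DEFINITION".ljust(12) + "Hoge gene of the",
  " " * 12 + "Pokemonia pikachuae",
  "//"
].join("\n") + "\n"

e = LazyEntry.new(entry)
e.definition # parsed now and cached in @data
e.definition # served from @data, no second parse
```

The constructor only does the cheap split into @orig, so creating many entries stays inexpensive when only a few fields are ever read.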

Guidelines for developers creating a new database class

— Bio::DB.new(entry)

The ‘new’ method should accept the entire entry as a single String and return the parsed database object.

— Bio::DB#entry_id

Database classes should implement the following methods if appropriate:

  • entry_id
  • definition

Every subclass should define the following constants if appropriate:

  • DELIMITER (RS)
    • entry separator of the database flatfile.
    • RS (record separator) is a short alias for DELIMITER.
  • TAGSIZE
    • length of the tag field in the FORTRAN-like format.
        |<- tag       ->||<- data                           ---->|
        ENTRY_ID         A12345
        DEFINITION       Hoge gene of the Pokemonia pikachuae
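
As a rough sketch of how these two constants might be used, the following splits a flatfile into entries on DELIMITER and cuts one line into its tag and data parts at TAGSIZE. The module FlatfileSketch and its two helper methods are hypothetical names introduced for this example, and the values "\n//\n" and 12 are assumptions taken from the template below.

```ruby
# Hypothetical helpers; names and constant values are assumptions.
module FlatfileSketch
  DELIMITER = RS = "\n//\n" # entry separator, with RS as the short alias
  TAGSIZE = 12              # width of the fixed tag field

  module_function

  # Split the whole flatfile text into one String per entry.
  def entries(flatfile)
    flatfile.split(DELIMITER).reject { |e| e.strip.empty? }
  end

  # Split one line into [tag, data] at the fixed tag width.
  def tag_and_data(line)
    tag  = line[0, TAGSIZE].to_s.strip
    data = line[TAGSIZE..-1].to_s.strip
    [tag, data]
  end
end

flatfile = [
  "ENTRY_ID".ljust(12) + "A12345",
  "//",
  "ENTRY_ID".ljust(12) + "B67890",
  "//"
].join("\n") + "\n"

FlatfileSketch.entries(flatfile).each do |entry|
  tag, data = FlatfileSketch.tag_and_data(entry.lines.first)
  puts "#{tag}: #{data}"
end
```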
      

Template for a subclass

  module Bio
  class Hoge < DB

    DELIMITER = RS = "\n//\n"
    TAGSIZE           = 12             # You can omit this line if not needed

    def initialize(entry)
    end

    def entry_id
    end

  end # class Hoge
  end # module Bio

Recommended method names for subclasses

In general, a method name should be singular when the method returns a single Object (including the case where the Object is a String), and plural when it returns an Array of such Objects. Which form can be used depends on the database class.

For example, GenBank has several REFERENCE fields in one entry, so define Bio::GenBank#references, which should return an Array of Reference objects. MEDLINE, on the other hand, has only one reference per entry, so define Bio::MEDLINE#reference, which should return a single Reference object.
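
The naming rule can be illustrated with a small sketch. The classes below are hypothetical stand-ins rather than Bio::GenBank, Bio::MEDLINE, or Bio::Reference; they only show the singular and plural forms side by side.

```ruby
# Hypothetical stand-in for a reference object.
SketchReference = Struct.new(:title)

# An entry format that can hold several references: plural name, Array result.
class MultiReferenceEntry
  def initialize(titles)
    @titles = titles
  end

  def references
    @titles.map { |t| SketchReference.new(t) }
  end
end

# An entry format with exactly one reference: singular name, single object.
class SingleReferenceEntry
  def initialize(title)
    @title = title
  end

  def reference
    SketchReference.new(@title)
  end
end

many = MultiReferenceEntry.new(["First paper", "Second paper"])
one  = SingleReferenceEntry.new("Only paper")

many.references.each { |r| puts r.title }
puts one.reference.title
```

With this convention a caller can tell from the method name alone whether to expect one object or to iterate.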

The method names used in subclasses should be taken from the following list if appropriate:

— entry_id #=> String

The entry identifier.

— definition #=> String

The description of the entry.

— reference #=> Bio::Reference — references #=> Array of Bio::Reference

The reference field(s) of the entry.

— dblink #=> String — dblinks #=> Array of String

The link(s) to entries in other databases.

— naseq #=> Bio::Sequence::NA

The DNA/RNA sequence of the entry.

— nalen #=> Integer

The length of the DNA/RNA sequence of the entry.

— aaseq #=> Bio::Sequence::AA

The amino acid sequence of the entry.

— aalen #=> Integer

The length of the amino acid sequence of the entry.

— seq #=> Bio::Sequence::NA or Bio::Sequence::AA

Returns an appropriate sequence object.

— position #=> String

The position of the sequence in the entry or in the genome (depends on the database).

— locations #=> Bio::Locations

Returns Bio::Locations.new(position).

— division #=> String

The subdivision name within the database.

  • Example:
    • EST, VRL etc. for GenBank
    • PATTERN, RULE etc. for PROSITE

— date #=> String

The date of the entry. Should we use Date (by ParseDate) instead of String?

— gene #=> String — genes #=> Array of String

The name(s) of the gene.

— organism #=> String

The name of the organism.

Required files

bio/sequence   bio/reference   bio/feature  
