Class Bio::Lasergene
In: lib/bio/db/lasergene.rb
Parent: Object

bio/db/lasergene.rb - Interface for DNAStar Lasergene sequence file format

Author:Trevor Wennblom <trevor@corevx.com>
Copyright:Copyright (c) 2007 Center for Biomedical Research Informatics, University of Minnesota (cbri.umn.edu)
License:The Ruby License

Description

Bio::Lasergene reads DNAStar Lasergene formatted sequence files, or +.seq+ files. It only expects to find one sequence per file.

Usage

  require 'bio'
  filename = 'MyFile.seq'
  lseq = Bio::Lasergene.new( IO.readlines(filename) )
  lseq.entry_id  # => "Contig 1"
  lseq.seq  # => ATGACGTATCCAAAGAGGCGTTACC

Comments

I‘m only aware of the following three kinds of Lasergene file formats. Feel free to send me other examples that may not currently be accounted for.

File format 1:

  ## begin ##
  "Contig 1" (1,934)
    Contig Length:                  934 bases
    Average Length/Sequence:        467 bases
    Total Sequence Length:         1869 bases
    Top Strand:                       2 sequences
    Bottom Strand:                    2 sequences
    Total:                            4 sequences
  ^^
  ATGACGTATCCAAAGAGGCGTTACCGGAGAAGAAGACACCGCCCCCGCAGTCCTCTTGGCCAGATCCTCCGCCGCCGCCCCTGGCTCGTCCACCCCCGCCACAGTTACCGCTGGAGAAGGAAAAATGGCATCTTCAWCACCCGCCTATCCCGCAYCTTCGGAWRTACTATCAAGCGAACCACAGTCAGAACGCCCTCCTGGGCGGTGGACATGATGAGATTCAATATTAATGACTTTCTTCCCCCAGGAGGGGGCTCAAACCCCCGCTCTGTGCCCTTTGAATACTACAGAATAAGAAAGGTTAAGGTTGAATTCTGGCCCTGCTCCCCGATCACCCAGGGTGACAGGGGAATGGGCTCCAGTGCTGWTATTCTAGMTGATRRCTTKGTAACAAAGRCCACAGCCCTCACCTATGACCCCTATGTAAACTTCTCCTCCCGCCATACCATAACCCAGCCCTTCTCCTACCRCTCCCGYTACTTTACCCCCAAACCTGTCCTWGATKCCACTATKGATKACTKCCAACCAAACAACAAAAGAAACCAGCTGTGGSTGAGACTACAWACTGCTGGAAATGTAGACCWCGTAGGCCTSGGCACTGCGTKCGAAAACAGTATATACGACCAGGAATACAATATCCGTGTMACCATGTATGTACAATTCAGAGAATTTAATCTTAAAGACCCCCCRCTTMACCCKTAATGAATAATAAMAACCATTACGAAGTGATAAAAWAGWCTCAGTAATTTATTYCATATGGAAATTCWSGGCATGGGGGGGAAAGGGTGACGAACKKGCCCCCTTCCTCCSTSGMYTKTTCYGTAGCATTCYTCCAMAAYACCWAGGCAGYAMTCCTCCSATCAAGAGcYTSYACAGCTGGGACAGCAGTTGAGGAGGACCATTCAAAGGGGGTCGGATTGCTGGTAATCAGA
  ## end ##

File format 2:

  ## begin ##
  ^^:                                  350,935
  Contig 1 (1,935)
    Contig Length:                  935 bases
    Average Length/Sequence:        580 bases
    Total Sequence Length:         2323 bases
    Top Strand:                       2 sequences
    Bottom Strand:                    2 sequences
    Total:                            4 sequences
  ^^
  ATGTCGGGGAAATGCTTGACCGCGGGCTACTGCTCATCATTGCTTTCTTTGTGGTATATCGTGCCGTTCTGTTTTGCTGTGCTCGTCAACGCCAGCGGCGACAGCAGCTCTCATTTTCAGTCGATTTATAACTTGACGTTATGTGAGCTGAATGGCACGAACTGGCTGGCAGACAACTTTAACTGGGCTGTGGAGACTTTTGTCATCTTCCCCGTGTTGACTCACATTGTTTCCTATGGTGCACTCACTACCAGTCATTTTCTTGACACAGTTGGTCTAGTTACTGTGTCTACCGCCGGGTTTTATCACGGGCGGTACGTCTTGAGTAGCATCTACGCGGTCTGTGCTCTGGCTGCGTTGATTTGCTTCGCCATCAGGTTTGCGAAGAACTGCATGTCCTGGCGCTACTCTTGCACTAGATACACCAACTTCCTCCTGGACACCAAGGGCAGACTCTATCGTTGGCGGTCGCCTGTCATCATAGAGAAAGGGGGTAAGGTTGAGGTCGAAGGTCATCTGATCGATCTCAAAAGAGTTGTGCTTGATGGCTCTGTGGCGACACCTTTAACCAGAGTTTCAGCGGAACAATGGGGTCGTCCCTAGACGACTTTTGCCATGATAGTACAGCCCCACAGAAGGTGCTCTTGGCGTTTTCCATCACCTACACGCCAGTGATGATATATGCCCTAAAGGTAAGCCGCGGCCGACTTTTGGGGCTTCTGCACCTTTTGATTTTTTTGAACTGTGCCTTTACTTTCGGGTACATGACATTCGTGCACTTTCGGAGCACGAACAAGGTCGCGCTCACTATGGGAGCAGTAGTCGCACTCCTTTGGGGGGTGTACTCAGCCATAGAAACCTGGAAATTCATCACCTCCAGATGCCGTTGTGCTTGCTAGGCCGCAAGTACATTCTGGCCCCTGCCCACCACGTTG
  ## end ##

File format 3 (non-standard Lasergene header):

  ## begin ##
  LOCUS       PRU87392               15411 bp    RNA     linear   VRL 17-NOV-2000
  DEFINITION  Porcine reproductive and respiratory syndrome virus strain VR-2332,
              complete genome.
  ACCESSION   U87392 AF030244 U00153
  VERSION     U87392.3  GI:11192298
  [...cut...]
       3'UTR           15261..15411
       polyA_site      15409
  ORIGIN
  ^^
  atgacgtataggtgttggctctatgccttggcatttgtattgtcaggagctgtgaccattggcacagcccaaaacttgctgcacagaaacacccttctgtgatagcctccttcaggggagcttagggtttgtccctagcaccttgcttccggagttgcactgctttacggtctctccacccctttaaccatgtctgggatacttgatcggtgcacgtgtacccccaatgccagggtgtttatggcggagggccaagtctactgcacacgatgcctcagtgcacggtctctccttcccctgaacctccaagtttctgagctcggggtgctaggcctattctacaggcccgaagagccactccggtggacgttgccacgtgcattccccactgttgagtgctcccccgccggggcctgctggctttctgcaatctttccaatcgcacgaatgaccagtggaaacctgaacttccaacaaagaatggtacgggtcgcagctgagctttacagagccggccagctcacccctgcagtcttgaaggctctacaagtttatgaacggggttgccgctggtaccccattgttggacctgtccctggagtggccgttttcgccaattccctacatgtgagtgataaacctttcccgggagcaactcacgtgttgaccaacctgccgctcccgcagagacccaagcctgaagacttttgcccctttgagtgtgctatggctactgtctatgacattggtcatgacgccgtcatgtatgtggccgaaaggaaagtctcctgggcccctcgtggcggggatgaagtgaaatttgaagctgtccccggggagttgaagttgattgcgaaccggctccgcacctccttcccgccccaccacacagtggacatgtctaagttcgccttcacagcccctgggtgtggtgtttctatgcgggtcgaacgccaacacggctgccttcccgctgacactgtccctgaaggcaactgctggtggagcttgtttgacttgcttccactggaagttcagaacaaagaaattcgccatgctaaccaatttggctaccagaccaagcatggtgtctctggcaagtacctacagcggaggctgca[...cut...]
  ## end ##

Methods

entry_id   new   process   seq   standard_comment?  

Constants

DELIMITER_1 = '^\^\^:'
DELIMITER_2 = '^\^\^'

Attributes

average_length  [R]  Average length per sequence
bottom_strand_sequences  [R]  Number of bottom strand sequences
comments  [R]  Entire header before the sequence
contig_length  [R]  Contig length, length of present sequence
name  [R]  Name of sequence
sequence  [R]  Sequence

Bio::Sequence::NA or Bio::Sequence::AA object

top_strand_sequences  [R]  Number of top strand sequences
total_length  [R]  Length of parent sequence
total_sequences  [R]  Number of sequences

Public Class methods

[Source]

     # File lib/bio/db/lasergene.rb, line 124
124:   def initialize(lines)
125:     process(lines)
126:   end

Public Instance methods

Name of sequence

[Source]

     # File lib/bio/db/lasergene.rb, line 147
147:   def entry_id
148:     @name
149:   end

Sequence

Bio::Sequence::NA or Bio::Sequence::AA object

[Source]

     # File lib/bio/db/lasergene.rb, line 141
141:   def seq
142:     @sequence
143:   end

Is the comment header recognized as standard Lasergene format?


Arguments

  • none
Returns:true or false

[Source]

     # File lib/bio/db/lasergene.rb, line 134
134:   def standard_comment?
135:     @standard_comment
136:   end

Protected Instance methods

[Source]

     # File lib/bio/db/lasergene.rb, line 155
155:   def process(lines)
156:     delimiter_1_indices = []
157:     delimiter_2_indices = []
158:     
159:     # If the data from the file is passed as one big String instead of
160:     # broken into an Array, convert lines to an Array
161:     if lines.kind_of? String
162:       lines = lines.tr("\r", '').split("\n")
163:     end
164: 
165:     lines.each_with_index do |line, index|
166:       if line.match DELIMITER_1
167:         delimiter_1_indices << index
168:       elsif line.match DELIMITER_2
169:         delimiter_2_indices << index
170:       end
171:     end
172: 
173:     raise InputError, "More than one delimiter of type '#{DELIMITER_1}'" if delimiter_1_indices.size > 1
174:     raise InputError, "More than one delimiter of type '#{DELIMITER_2}'" if delimiter_2_indices.size > 1
175:     raise InputError, "No comment to data separator of type '#{DELIMITER_2}'" if delimiter_2_indices.size < 1
176: 
177:     if !delimiter_1_indices.empty?
178:       # toss out DELIMETER_1 and anything preceding it
179:       @comments = lines[ (delimiter_1_indices[0] + 1) .. (delimiter_2_indices[0] - 1) ]
180:     else
181:       @comments = lines[ 0 .. (delimiter_2_indices[0] - 1) ]
182:     end
183: 
184:     @standard_comment = false
185:     if @comments[0] =~ %r{(.+)\s+\(\d+,\d+\)} # if we have a standard Lasergene comment
186:       @standard_comment = true
187:       @name = $1
188:       comments.each do |comment|
189:         if comment.match('Contig Length:\s+(\d+)')
190:           @contig_length = $1.to_i
191:         elsif comment.match('Average Length/Sequence:\s+(\d+)')
192:           @average_length = $1.to_i
193:         elsif comment.match('Total Sequence Length:\s+(\d+)')
194:           @total_length = $1.to_i
195:         elsif comment.match('Top Strand:\s+(\d+)')
196:           @top_strand_sequences = $1.to_i
197:         elsif comment.match('Bottom Strand:\s+(\d+)')
198:           @bottom_strand_sequences = $1.to_i
199:         elsif comment.match('Total:\s+(\d+)')
200:           @total_sequences = $1.to_i
201:         end
202:       end
203:     end
204: 
205:     @comments = @comments.join('')
206:     @sequence = Bio::Sequence.auto( lines[ (delimiter_2_indices[0] + 1) .. -1 ].join('') )
207:   end

[Validate]