Class Bio::Fastq
In: lib/bio/db/fastq.rb
Parent: Object

Bio::Fastq is a parser for FASTQ format.

Classes and Modules

Class Bio::Fastq::Error
Class Bio::Fastq::FormatData

Constants

FormatNames = { "fastq-sanger" => FormatData::FASTQ_SANGER, "fastq-solexa" => FormatData::FASTQ_SOLEXA, "fastq-illumina" => FormatData::FASTQ_ILLUMINA }   Available format names.
Formats = { :fastq_sanger => FormatData::FASTQ_SANGER, :fastq_solexa => FormatData::FASTQ_SOLEXA, :fastq_illumina => FormatData::FASTQ_ILLUMINA }   Available format name symbols.
DefaultFormatName = 'fastq-sanger'.freeze   Default format name
FLATFILE_SPLITTER = Bio::FlatFile::Splitter::LineOriented   Splitter for Bio::FlatFile

Attributes

definition  [R]  definition; ID line (begins with @)
entry_overrun  [R]  text left unconsumed after the entry was parsed (String)
header  [R]  misc lines before the entry (String or nil)
quality_string  [R]  quality as a string
sequence_string  [R]  raw sequence data as a String object

Public Class methods

new(str = nil)

Creates a new Fastq object from a formatted text string.

The format of the quality scores should be specified afterwards with the format= method.

Arguments:

  • str: Formatted string (String)

[Source]

     # File lib/bio/db/fastq.rb, line 383
     def initialize(str = nil)
       return unless str
       sc = StringScanner.new(str)
       while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
         unless add_header_line(line) then
           sc.unscan
           break
         end
       end
       while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
         unless add_line(line) then
           sc.unscan
           break
         end
       end
       @entry_overrun = sc.rest
     end
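
The scan-and-unscan loop used by the constructor can be tried in isolation. The following is a minimal sketch, not the library's code: take_lines is a hypothetical helper, and the "accept four lines" rule is a toy stand-in for add_header_line and add_line. It shows how a rejected line is handed back to the scanner so that it ends up in the unconsumed remainder, as entry_overrun does.

```ruby
require 'strscan'

# Hypothetical helper: collects lines while the block accepts them,
# then returns the accepted lines and the unconsumed remainder.
def take_lines(text)
  sc = StringScanner.new(text)
  taken = []
  while !sc.eos? && (line = sc.scan(/.*(?:\n|\r|\r\n)?/))
    unless yield(line, taken)
      sc.unscan   # give the rejected line back to the scanner
      break
    end
    taken << line
  end
  [taken, sc.rest]
end

text = "@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n"

# Toy acceptance rule (an assumption, not the library's rule):
# accept exactly four lines, then stop.
taken, rest = take_lines(text) { |_line, so_far| so_far.size < 4 }

puts taken.inspect
puts rest.inspect
```

Here taken holds the four lines of the first record and rest holds the second record untouched.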

Public Instance methods

add_header_line(line)

Adds a header line if the header data is not yet given and the given line is suitable for a header. Returns self if the line was added; otherwise returns false (the line is not added).

[Source]

     # File lib/bio/db/fastq.rb, line 324
     def add_header_line(line)
       @header ||= ""
       if line[0,1] == "@" then
         false
       else
         @header.concat line
         self
       end
     end

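add_line(line)

Adds a line to the entry if the given line is regarded as part of the current entry. Returns self if the line was added, or false if the line appears to start the next entry.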

[Source]

     # File lib/bio/db/fastq.rb, line 339
     def add_line(line)
       line = line.chomp
       if !defined? @definition then
         if line[0, 1] == "@" then
           @definition = line[1..-1]
         else
           @definition = line
           @parse_errors ||= []
           @parse_errors.push Error::No_atmark.new
         end
         return self
       end
       if defined? @definition2 then
         @quality_string ||= ''
         if line[0, 1] == "@" and
             @quality_string.size >= @sequence_string.size then
           return false
         else
           @quality_string.concat line
           return self
         end
       else
         @sequence_string ||= ''
         if line[0, 1] == '+' then
           @definition2 = line[1..-1]
         else
           @sequence_string.concat line
         end
         return self
       end
       raise "Bug: should not reach here!"
     end
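
Because "@" is itself a valid quality character, a line beginning with "@" inside the quality block is ambiguous. The source above resolves this by treating such a line as the start of the next entry only once the quality string is at least as long as the sequence. A minimal sketch of that test follows; starts_next_entry? is a hypothetical name used only for illustration.

```ruby
# Hypothetical predicate mirroring the check made in add_line while
# quality lines are being read.
def starts_next_entry?(line, sequence, quality)
  line.start_with?("@") && quality.size >= sequence.size
end

# Quality is still shorter than the sequence, so a line starting
# with "@" is read as more quality characters.
puts starts_next_entry?("@II", "ACGTAC", "III")

# Quality already covers the sequence, so "@r2" begins a new entry.
puts starts_next_entry?("@r2", "ACGT", "IIII")
```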

entry_id

Identifier of the entry. Normally, the first word of the ID line.

[Source]

     # File lib/bio/db/fastq.rb, line 432
     def entry_id
       unless defined? @entry_id then
         eid = @definition.strip.split(/\s+/)[0] || @definition
         @entry_id = eid
       end
       @entry_id
     end
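
The first-word extraction can be checked on its own. This sketch reuses the same expression as the source above; first_word is a hypothetical name and the definition strings are made-up examples.

```ruby
# Same expression as in entry_id, wrapped in a hypothetical helper.
def first_word(definition)
  definition.strip.split(/\s+/)[0] || definition
end

puts first_word("read_1 lane=3 length=36")   # first whitespace-separated word
puts first_word("  read_2  ")                # surrounding blanks are ignored
puts first_word("").inspect                  # empty definition falls back to itself
```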

error_probabilities

Estimated probability of error for each base.

Returns: (Array containing Float) error probability values

[Source]

     # File lib/bio/db/fastq.rb, line 515
     def error_probabilities
       unless defined? @error_probabilities then
         self.format ||= self.class::DefaultFormatName
         a = @format.q2p(self.quality_scores)
         @error_probabilities = a
       end
       @error_probabilities
     end
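
The q2p conversion itself is not shown in this section. As an illustration only, the standard definitions of the two score types give P = 10^(-Q/10) for PHRED scores and P = 1 / (10^(Q/10) + 1) for Solexa scores. The sketch below uses those textbook formulas; phred_to_p and solexa_to_p are hypothetical names, and the library's own implementation may differ in detail.

```ruby
# PHRED score: Q = -10 * log10(P)
def phred_to_p(q)
  10.0 ** (-q / 10.0)
end

# Solexa score: Q = -10 * log10(P / (1 - P))
def solexa_to_p(q)
  1.0 / (10.0 ** (q / 10.0) + 1.0)
end

[10, 20, 30].each do |q|
  puts "PHRED #{q}: #{phred_to_p(q)}"
end
puts "Solexa 0: #{solexa_to_p(0)}"
```

The two scales nearly coincide for high scores and diverge for low ones: a Solexa score of 0 corresponds to P = 0.5, whereas a PHRED score of 0 corresponds to P = 1.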

format

Format name. One of "fastq-sanger", "fastq-solexa", "fastq-illumina", or nil (when not specified).

Returns: (String or nil) format name

[Source]

     # File lib/bio/db/fastq.rb, line 483
     def format
       @format ? @format.name : nil
     end

format=(name)

Specifies the format. If the format is not found, raises RuntimeError.

Available formats are:

  "fastq-sanger" or :fastq_sanger
  "fastq-solexa" or :fastq_solexa
  "fastq-illumina" or :fastq_illumina

Arguments:

  • name: format name (String or Symbol)

Returns: (String) format name

[Source]

     # File lib/bio/db/fastq.rb, line 462
     def format=(name)
       if name then
         f = FormatNames[name] || Formats[name]
         if f then
           reset_state
           @format = f.instance
           self.format
         else
           raise "unknown format"
         end
       else
         reset_state
         nil
       end
     end
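
The lookup accepts either spelling because it tries the string-keyed table first and the symbol-keyed table second. A minimal standalone sketch of that pattern follows; the two hashes and resolve_format are hypothetical stand-ins, with plain symbols in place of the FormatData classes.

```ruby
# Hypothetical stand-ins for the FormatNames and Formats constants.
SKETCH_FORMAT_NAMES = {
  "fastq-sanger"   => :sanger,
  "fastq-solexa"   => :solexa,
  "fastq-illumina" => :illumina
}.freeze

SKETCH_FORMATS = {
  :fastq_sanger   => :sanger,
  :fastq_solexa   => :solexa,
  :fastq_illumina => :illumina
}.freeze

# Tries the string table, then the symbol table; an unknown name
# raises RuntimeError, as a bare `raise "..."` does in the source above.
def resolve_format(name)
  found = SKETCH_FORMAT_NAMES[name] || SKETCH_FORMATS[name]
  raise "unknown format" unless found
  found
end

puts resolve_format("fastq-solexa").inspect
puts resolve_format(:fastq_solexa).inspect
```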

nalen

Returns the length of naseq.

[Source]

     # File lib/bio/db/fastq.rb, line 419
     def nalen
       naseq.length
     end

naseq

Returns the sequence as a Bio::Sequence::NA object.

[Source]

     # File lib/bio/db/fastq.rb, line 411
     def naseq
       unless defined? @naseq then
         @naseq = Bio::Sequence::NA.new(@sequence_string)
       end
       @naseq
     end

qualities()

Alias for quality_scores.

quality_score_type

The meaning of the quality scores. It may be one of :phred, :solexa, or nil.

[Source]

     # File lib/bio/db/fastq.rb, line 490
     def quality_score_type
       self.format ||= self.class::DefaultFormatName
       @format.quality_score_type
     end

quality_scores

Quality score for each base. For "fastq-sanger" or "fastq-illumina", it is the PHRED score. For "fastq-solexa", it is the Solexa score.

Returns: (Array containing Integer) quality score values

[Source]

     # File lib/bio/db/fastq.rb, line 501
     def quality_scores
       unless defined? @quality_scores then
         self.format ||= self.class::DefaultFormatName
         s = @format.str2scores(@quality_string)
         @quality_scores = s
       end
       @quality_scores
     end
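
The str2scores method is not shown in this section. As a sketch of what such a decoding usually involves, each quality character is converted to its ASCII code and a format-specific offset is subtracted: 33 for "fastq-sanger" and 64 for "fastq-solexa" and "fastq-illumina". The table and decode_quality below are hypothetical names for illustration, not the library's API.

```ruby
# ASCII offsets of the three FASTQ variants (hypothetical table name).
SKETCH_OFFSETS = {
  "fastq-sanger"   => 33,
  "fastq-solexa"   => 64,
  "fastq-illumina" => 64
}.freeze

# Converts each quality character into an integer score.
def decode_quality(quality_string, format_name = "fastq-sanger")
  offset = SKETCH_OFFSETS.fetch(format_name)
  quality_string.unpack("C*").map { |byte| byte - offset }
end

puts decode_quality("!+5?I").inspect                 # sanger:   [0, 10, 20, 30, 40]
puts decode_quality("@h", "fastq-illumina").inspect  # illumina: [0, 40]
puts decode_quality(";@", "fastq-solexa").inspect    # solexa:   [-5, 0]
```

The same character therefore decodes to different scores under different formats, which is why the format should be set before the scores are read.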

seq

Returns the sequence as a Bio::Sequence::Generic object.

[Source]

     # File lib/bio/db/fastq.rb, line 424
     def seq
       unless defined? @seq then
         @seq = Bio::Sequence::Generic.new(@sequence_string)
       end
       @seq
     end

to_biosequence

Returns the sequence as a Bio::Sequence object.

Note: For efficiency, the returned Bio::Sequence object may share data with this Fastq object, so modifying it might (but will not always) change the sequence or definition in this Fastq object as well.

[Source]

     # File lib/bio/db/fastq.rb, line 639
     def to_biosequence
       Bio::Sequence.adapter(self, Bio::Sequence::Adapter::Fastq)
     end

validate_format(errors = nil)

Format validation.

If an array is given as the argument, error objects are pushed to the array when errors are found. Currently, the following errors may be added to the array. (All errors are under the Bio::Fastq namespace, for example, Bio::Fastq::Error::Diff_ids.)

  Error::Diff_ids: the identifiers in the two lines are different
  Error::Long_qual: the quality string is longer than the sequence
  Error::Short_qual: the quality string is shorter than the sequence
  Error::No_qual: no quality characters found
  Error::No_seq: no sequence found
  Error::Qual_char: invalid character in the quality
  Error::Seq_char: invalid character in the sequence
  Error::Qual_range: quality score value out of range
  Error::No_ids: sequence identifier not found
  Error::No_atmark: the first identifier does not begin with "@"
  Error::Skipped_unformatted_lines: the parser skipped unformatted lines that could not be recognized as FASTQ format

Arguments:

  • (optional) errors: (Array or nil) an array to which error objects are pushed. The array should be empty.

Returns: true if no error was found, false otherwise.

[Source]

     # File lib/bio/db/fastq.rb, line 548
     def validate_format(errors = nil)
       err = []

       # if header exists, the format might be broken.
       if defined? @header and @header and !@header.strip.empty? then
         err.push Error::Skipped_unformatted_lines.new
       end

       # if parse errors exist, adding them
       if defined? @parse_errors and @parse_errors then
         err.concat @parse_errors
       end

       # check if identifier exists, and identifier matches
       if !defined?(@definition) or !@definition then
         err.push Error::No_ids.new
       elsif defined?(@definition2) and
           !@definition2.to_s.empty? and
           @definition != @definition2 then
         err.push Error::Diff_ids.new
       end

       # check if sequence exists
       has_seq  = true
       if !defined?(@sequence_string) or !@sequence_string then
         err.push Error::No_seq.new
         has_seq = false
       end

       # check if quality exists
       has_qual = true
       if !defined?(@quality_string) or !@quality_string then
         err.push Error::No_qual.new
         has_qual = false
       end

       # sequence and quality length check
       if has_seq and has_qual then
         slen = @sequence_string.length
         qlen = @quality_string.length
         if slen > qlen then
           err.push Error::Short_qual.new
         elsif qlen > slen then
           err.push Error::Long_qual.new
         end
       end

       # sequence character check
       if has_seq then
         sc = StringScanner.new(@sequence_string)
         while sc.scan_until(/[ \x00-\x1f\x7f-\xff]/n)
           err.push Error::Seq_char.new(sc.pos - sc.matched_size)
         end
       end

       # quality character check
       if has_qual then
         fmt = if defined?(@format) and @format then
                 @format.name
               else
                 nil
               end
         re = case fmt
              when 'fastq-sanger'
                /[^\x21-\x7e]/n
              when 'fastq-solexa'
                /[^\x3b-\x7e]/n
              when 'fastq-illumina'
                /[^\x40-\x7e]/n
              else
                /[ \x00-\x1f\x7f-\xff]/n
              end
         sc = StringScanner.new(@quality_string)
         while sc.scan_until(re)
           err.push Error::Qual_char.new(sc.pos - sc.matched_size)
         end
       end

       # if "errors" is given, set errors
       errors.concat err if errors
       # returns true if no error; otherwise, returns false
       err.empty? ? true : false
     end
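
The collect-then-report pattern of validate_format can be exercised without the library. The sketch below is a heavily simplified, hypothetical checker: check_record is not part of Bio::Fastq, symbols stand in for the Error objects, and only the length check and a Sanger-style quality character check are reproduced. It shows how the optional array receives the details while the return value stays a plain true or false.

```ruby
require 'strscan'

# Hypothetical, simplified checker. Symbols replace the Error classes.
def check_record(sequence, quality, errors = nil)
  err = []

  # length check: quality must be exactly as long as the sequence
  if sequence.length > quality.length
    err << :short_qual
  elsif quality.length > sequence.length
    err << :long_qual
  end

  # character check: anything outside "!".."~" is invalid;
  # the position of each offending character is recorded
  sc = StringScanner.new(quality)
  while sc.scan_until(/[^\x21-\x7e]/)
    err << [:qual_char, sc.pos - sc.matched_size]
  end

  errors.concat(err) if errors
  err.empty?
end

found = []
ok = check_record("ACGT", "II I", found)
puts ok              # false: the quality contains a space
puts found.inspect   # the space sits at index 2
```

Passing a fresh, empty array keeps errors from separate records apart, which matches the advice in the argument description.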
