Class | Bio::Fastq |
In: | lib/bio/db/fastq.rb |
Parent: | Object |
Bio::Fastq is a parser for FASTQ format.
FormatNames | = | { "fastq-sanger" => FormatData::FASTQ_SANGER, "fastq-solexa" => FormatData::FASTQ_SOLEXA, "fastq-illumina" => FormatData::FASTQ_ILLUMINA } | Available format names. |
Formats | = | { :fastq_sanger => FormatData::FASTQ_SANGER, :fastq_solexa => FormatData::FASTQ_SOLEXA, :fastq_illumina => FormatData::FASTQ_ILLUMINA } | Available format name symbols. |
DefaultFormatName | = | 'fastq-sanger'.freeze | Default format name. |
FLATFILE_SPLITTER | = | Bio::FlatFile::Splitter::LineOriented | Splitter for Bio::FlatFile |
Creates a new Fastq object from a formatted text string.
The format of the quality scores should be specified afterwards with the format= method.
Arguments: | (String) str: formatted text string (optional) |
# File lib/bio/db/fastq.rb, line 383
def initialize(str = nil)
  return unless str
  sc = StringScanner.new(str)
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_header_line(line) then
      sc.unscan
      break
    end
  end
  while !sc.eos? and line = sc.scan(/.*(?:\n|\r|\r\n)?/)
    unless add_line(line) then
      sc.unscan
      break
    end
  end
  @entry_overrun = sc.rest
end
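The constructor reads lines with StringScanner until a line is rejected, steps back with unscan, and keeps the unread remainder. The following is a minimal standalone sketch of that pattern; take_lines_while is a hypothetical helper written for illustration and is not part of Bio::Fastq.

```ruby
require 'strscan'

# Hypothetical helper (not part of Bio::Fastq): consumes lines from +text+
# while the block accepts them, and returns the accepted lines together
# with the unread remainder of the input.
def take_lines_while(text)
  sc = StringScanner.new(text)
  taken = []
  while !sc.eos? && (line = sc.scan(/.*\n?/))
    unless yield(line)
      sc.unscan # step back so the rejected line stays in the remainder
      break
    end
    taken << line
  end
  [taken, sc.rest]
end

# Accept lines until one starts with "@".
taken, rest = take_lines_while("junk 1\njunk 2\n@read1\nACGT\n") do |line|
  !line.start_with?('@')
end
p taken # ["junk 1\n", "junk 2\n"]
p rest  # "@read1\nACGT\n"
```

Returning the remainder lets a caller resume parsing at the rejected line, which is the role @entry_overrun appears to play in the listing above.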
Adds a header line if the header data has not yet been given and the given line is suitable as a header. Returns self if the header line was added; otherwise returns false (the line is not added).
# File lib/bio/db/fastq.rb, line 324
def add_header_line(line)
  @header ||= ""
  if line[0,1] == "@" then
    false
  else
    @header.concat line
    self
  end
end
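Adds a line to the entry if the given line is regarded as part of the current entry. Returns self if the line was added; returns false if the line appears to begin the next entry (the line is not added).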
# File lib/bio/db/fastq.rb, line 339
def add_line(line)
  line = line.chomp
  if !defined? @definition then
    if line[0, 1] == "@" then
      @definition = line[1..-1]
    else
      @definition = line
      @parse_errors ||= []
      @parse_errors.push Error::No_atmark.new
    end
    return self
  end
  if defined? @definition2 then
    @quality_string ||= ''
    if line[0, 1] == "@" and
        @quality_string.size >= @sequence_string.size then
      return false
    else
      @quality_string.concat line
      return self
    end
  else
    @sequence_string ||= ''
    if line[0, 1] == '+' then
      @definition2 = line[1..-1]
    else
      @sequence_string.concat line
    end
    return self
  end
  raise "Bug: should not reach here!"
end
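The line-by-line logic can be tried without BioRuby using the simplified sketch below. parse_fastq_record and its Hash layout are hypothetical names chosen for illustration; the sketch omits the header handling and error recording of the real class. It shows why a line starting with "@" ends the entry only once the quality is at least as long as the sequence: "@" is itself a valid quality character.

```ruby
# Hypothetical helper (not part of Bio::Fastq): parses the first record in
# +text+, moving through the stages definition, sequence, and quality.
def parse_fastq_record(text)
  rec = { definition: nil, sequence: +'', quality: +'', separator_seen: false }
  text.each_line do |raw|
    line = raw.chomp
    if rec[:definition].nil?
      # First line: the definition, without the leading "@" if present.
      rec[:definition] = line.start_with?('@') ? line[1..-1] : line
    elsif rec[:separator_seen]
      # A line starting with "@" begins the next record only when enough
      # quality characters have already been read.
      break if line.start_with?('@') && rec[:quality].size >= rec[:sequence].size
      rec[:quality] << line
    elsif line.start_with?('+')
      rec[:separator_seen] = true
    else
      rec[:sequence] << line
    end
  end
  rec
end

text = "@read1 sample\nACGT\nTTAA\n+\n@III\nIIII\n@read2\nGG\n+\nII\n"
rec = parse_fastq_record(text)
puts rec[:definition] # read1 sample
puts rec[:sequence]   # ACGTTTAA
puts rec[:quality]    # @IIIIIII
```

In this input the quality line "@III" is kept as quality data because only 0 of the 8 expected quality characters had been read, while "@read2" ends the record.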
Identifier of the entry. Normally, the first word of the ID line.
# File lib/bio/db/fastq.rb, line 432
def entry_id
  unless defined? @entry_id then
    eid = @definition.strip.split(/\s+/)[0] || @definition
    @entry_id = eid
  end
  @entry_id
end
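As a small illustration of the "first word" rule, the sketch below applies the same expression to a plain string. first_word_id is a hypothetical name and the definition line is invented sample text.

```ruby
# Hypothetical helper (not part of Bio::Fastq): returns the first
# whitespace-delimited word of a definition line, or the definition itself
# when it contains no word at all.
def first_word_id(definition)
  definition.strip.split(/\s+/)[0] || definition
end

puts first_word_id("read_1 lane=3 length=36") # read_1
p first_word_id("")                           # ""
```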
Estimated probability of error for each base.
Returns: | (Array containing Float) error probability values |
# File lib/bio/db/fastq.rb, line 515
def error_probabilities
  unless defined? @error_probabilities then
    self.format ||= self.class::DefaultFormatName
    a = @format.q2p(self.quality_scores)
    @error_probabilities = a
  end
  @error_probabilities
end
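The conversion from scores to probabilities is delegated to the format object's q2p, whose source is not shown here. The sketch below uses the standard definitions of the two score types, PHRED Q = -10 log10(p) and Solexa Q = -10 log10(p / (1 - p)), and assumes that q2p follows them. Both method names are hypothetical.

```ruby
# Hypothetical helpers (not part of Bio::Fastq), based on the standard
# score definitions rather than on the library's q2p source.

# PHRED: Q = -10 * log10(p), so p = 10 ** (-Q / 10)
def phred_to_probability(q)
  10.0**(-q / 10.0)
end

# Solexa: Q = -10 * log10(p / (1 - p)), so p = 1 / (10 ** (Q / 10) + 1)
def solexa_to_probability(q)
  1.0 / (10.0**(q / 10.0) + 1.0)
end

[0, 10, 20, 40].each do |q|
  printf("Q=%2d  phred p=%.6f  solexa p=%.6f\n",
         q, phred_to_probability(q), solexa_to_probability(q))
end
```

The two scales differ most at low scores (a Solexa score of 0 means p = 0.5, a PHRED score of 0 means p = 1) and converge as the score grows.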
Specify the format. If the format is not found, raises RuntimeError.
Available formats are:
"fastq-sanger" or :fastq_sanger
"fastq-solexa" or :fastq_solexa
"fastq-illumina" or :fastq_illumina
Arguments: | (String or Symbol) name: format name |
Returns: | (String) format name |
# File lib/bio/db/fastq.rb, line 462
def format=(name)
  if name then
    f = FormatNames[name] || Formats[name]
    if f then
      reset_state
      @format = f.instance
      self.format
    else
      raise "unknown format"
    end
  else
    reset_state
    nil
  end
end
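The lookup accepts either a String or a Symbol by consulting two tables in turn. A minimal standalone sketch of that idea follows; the table contents, constant names, and resolve_format are hypothetical stand-ins for the FormatData classes.

```ruby
# Hypothetical tables (not part of Bio::Fastq); plain symbols stand in for
# the FormatData classes.
SKETCH_FORMAT_NAMES = {
  'fastq-sanger'   => :sanger,
  'fastq-solexa'   => :solexa,
  'fastq-illumina' => :illumina
}.freeze

SKETCH_FORMATS = {
  fastq_sanger:   :sanger,
  fastq_solexa:   :solexa,
  fastq_illumina: :illumina
}.freeze

# Resolves a format given as a String or a Symbol.
# Raises RuntimeError for an unknown name and returns nil for nil.
def resolve_format(name)
  return nil if name.nil?
  found = SKETCH_FORMAT_NAMES[name] || SKETCH_FORMATS[name]
  raise "unknown format" unless found
  found
end

p resolve_format('fastq-solexa') # :solexa
p resolve_format(:fastq_solexa)  # :solexa
```

A String key never matches the Symbol table and vice versa, so the two lookups cannot shadow each other.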
Returns the sequence as a Bio::Sequence::NA object.
# File lib/bio/db/fastq.rb, line 411
def naseq
  unless defined? @naseq then
    @naseq = Bio::Sequence::NA.new(@sequence_string)
  end
  @naseq
end
The meaning of the quality scores. It may be one of :phred, :solexa, or nil.
# File lib/bio/db/fastq.rb, line 490
def quality_score_type
  self.format ||= self.class::DefaultFormatName
  @format.quality_score_type
end
Quality score for each base. For "fastq-sanger" or "fastq-illumina", these are PHRED scores. For "fastq-solexa", they are Solexa scores.
Returns: | (Array containing Integer) quality score values |
# File lib/bio/db/fastq.rb, line 501
def quality_scores
  unless defined? @quality_scores then
    self.format ||= self.class::DefaultFormatName
    s = @format.str2scores(@quality_string)
    @quality_scores = s
  end
  @quality_scores
end
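Decoding is delegated to the format object's str2scores, whose source is not shown here. The sketch below assumes the conventional ASCII offsets of the three variants (33 for fastq-sanger, 64 for fastq-solexa and fastq-illumina); decode_quality and the offset table are hypothetical names.

```ruby
# Hypothetical table and helper (not part of Bio::Fastq), assuming the
# conventional ASCII offsets of each FASTQ variant.
SKETCH_QUALITY_OFFSETS = {
  'fastq-sanger'   => 33,
  'fastq-solexa'   => 64,
  'fastq-illumina' => 64
}.freeze

# Converts each quality character to an integer score by subtracting the
# variant's ASCII offset from the character code.
def decode_quality(quality_string, format_name = 'fastq-sanger')
  offset = SKETCH_QUALITY_OFFSETS.fetch(format_name)
  quality_string.bytes.map { |byte| byte - offset }
end

p decode_quality('!5I')                   # [0, 20, 40]
p decode_quality('@Th', 'fastq-illumina') # [0, 20, 40]
p decode_quality(';@h', 'fastq-solexa')   # [-5, 0, 40]
```

The same character therefore maps to different scores depending on the declared format.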
Returns the sequence as a Bio::Sequence::Generic object.
# File lib/bio/db/fastq.rb, line 424
def seq
  unless defined? @seq then
    @seq = Bio::Sequence::Generic.new(@sequence_string)
  end
  @seq
end
Returns the sequence as a Bio::Sequence object.
Note: If you modify the returned Bio::Sequence object, the sequence or definition in this Fastq object may also change (though not always), because data may be shared for efficiency.
# File lib/bio/db/fastq.rb, line 639
def to_biosequence
  Bio::Sequence.adapter(self, Bio::Sequence::Adapter::Fastq)
end
Format validation.
If an array is given as the argument, error objects are pushed to the array when errors are found. Currently, the following errors may be added to the array. (All errors are under the Bio::Fastq namespace, for example, Bio::Fastq::Error::Diff_ids.)
Error::Diff_ids: the identifiers in the two lines are different
Error::Long_qual: the quality is longer than the sequence
Error::Short_qual: the quality is shorter than the sequence
Error::No_qual: no quality characters found
Error::No_seq: no sequence found
Error::Qual_char: invalid character in the quality
Error::Seq_char: invalid character in the sequence
Error::Qual_range: quality score value out of range
Error::No_ids: sequence identifier not found
Error::No_atmark: the first identifier does not begin with "@"
Error::Skipped_unformatted_lines: the parser skipped unformatted lines that could not be recognized as FASTQ format
Arguments: | (Array, optional) errors: array that receives the error objects |
Returns: | true if no errors were found; false otherwise |
# File lib/bio/db/fastq.rb, line 548
def validate_format(errors = nil)
  err = []

  # if header exists, the format might be broken.
  if defined? @header and @header and !@header.strip.empty? then
    err.push Error::Skipped_unformatted_lines.new
  end

  # if parse errors exist, add them
  if defined? @parse_errors and @parse_errors then
    err.concat @parse_errors
  end

  # check if identifier exists, and identifier matches
  if !defined?(@definition) or !@definition then
    err.push Error::No_ids.new
  elsif defined?(@definition2) and
      !@definition2.to_s.empty? and
      @definition != @definition2 then
    err.push Error::Diff_ids.new
  end

  # check if sequence exists
  has_seq = true
  if !defined?(@sequence_string) or !@sequence_string then
    err.push Error::No_seq.new
    has_seq = false
  end

  # check if quality exists
  has_qual = true
  if !defined?(@quality_string) or !@quality_string then
    err.push Error::No_qual.new
    has_qual = false
  end

  # sequence and quality length check
  if has_seq and has_qual then
    slen = @sequence_string.length
    qlen = @quality_string.length
    if slen > qlen then
      err.push Error::Short_qual.new
    elsif qlen > slen then
      err.push Error::Long_qual.new
    end
  end

  # sequence character check
  if has_seq then
    sc = StringScanner.new(@sequence_string)
    while sc.scan_until(/[ \x00-\x1f\x7f-\xff]/n)
      err.push Error::Seq_char.new(sc.pos - sc.matched_size)
    end
  end

  # quality character check
  if has_qual then
    fmt = if defined?(@format) and @format then
            @format.name
          else
            nil
          end
    re = case fmt
         when 'fastq-sanger'
           /[^\x21-\x7e]/n
         when 'fastq-solexa'
           /[^\x3b-\x7e]/n
         when 'fastq-illumina'
           /[^\x40-\x7e]/n
         else
           /[ \x00-\x1f\x7f-\xff]/n
         end
    sc = StringScanner.new(@quality_string)
    while sc.scan_until(re)
      err.push Error::Qual_char.new(sc.pos - sc.matched_size)
    end
  end

  # if "errors" is given, set errors
  errors.concat err if errors
  # returns true if no error; otherwise, returns false
  err.empty? ? true : false
end
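The validator's calling convention (a boolean result plus an optional array that collects the details) can be illustrated with the reduced sketch below. check_record is a hypothetical helper that covers only the length check and a quality-character check over the printable range 0x21 to 0x7e, and it reports plain symbols rather than Bio::Fastq::Error objects.

```ruby
# Hypothetical helper (not part of Bio::Fastq): a reduced validator that
# returns true when no problem is found and, if an array is passed,
# appends a description of each problem to it.
def check_record(sequence, quality, errors = nil)
  found = []

  # Length check: the quality must be exactly as long as the sequence.
  if sequence.length > quality.length
    found << :short_qual
  elsif quality.length > sequence.length
    found << :long_qual
  end

  # Character check: record the position of every quality character
  # outside the printable ASCII range "!" (0x21) to "~" (0x7e).
  quality.each_char.with_index do |ch, i|
    found << [:qual_char, i] unless ch.ord.between?(0x21, 0x7e)
  end

  errors.concat(found) if errors
  found.empty?
end

errors = []
p check_record('ACGT', 'IIII')         # true
p check_record('ACGT', 'II I', errors) # false
p errors                               # [[:qual_char, 2]]
```

Appending to a caller-supplied array keeps the return value a simple boolean while still letting callers inspect every problem found.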