Class | Bio::ContingencyTable |
In: |
lib/bio/util/contingency_table.rb
|
Parent: | Object |
bio/util/contingency_table.rb - Statistical contingency table analysis for aligned sequences
Author: | Trevor Wennblom <trevor@corevx.com> |
Copyright: | Copyright (c) 2005-2007 Midwinter Laboratories, LLC (midwinterlabs.com) |
License: | The Ruby License |
The Bio::ContingencyTable class provides basic statistical contingency table analysis for two positions within aligned sequences.
When ContingencyTable is instantiated the set of characters in the aligned sequences may be passed to it as an array. This is important since it uses these characters to create the table‘s rows and columns. If this array is not passed it will use it‘s default of an amino acid and nucleotide alphabet in lowercase along with the clustal spacer ’-’.
To get data from the table the most used functions will be chi_square and contingency_coefficient:
ctable = Bio::ContingencyTable.new() ctable['a']['t'] += 1 # .. put more values into the table puts ctable.chi_square puts ctable.contingency_coefficient # between 0.0 and 1.0
The contingency_coefficient represents the degree of correlation of change between two sequence positions in a multiple-sequence alignment. 0.0 indicates no correlation, 1.0 is the maximum correlation.
What follows is an example of ContingencyTable in typical usage analyzing results from a clustal alignment.
require 'bio' seqs = {} max_length = 0 Bio::ClustalW::Report.new( IO.read('sample.aln') ).to_a.each do |entry| data = entry.data.strip seqs[entry.definition] = data.downcase max_length = data.size if max_length == 0 raise "Aligned sequences must be the same length!" unless data.size == max_length end VERBOSE = true puts "i\tj\tchi_square\tcontingency_coefficient" if VERBOSE correlations = {} 0.upto(max_length - 1) do |i| (i+1).upto(max_length - 1) do |j| ctable = Bio::ContingencyTable.new() seqs.each_value { |seq| ctable.table[ seq[i].chr ][ seq[j].chr ] += 1 } chi_square = ctable.chi_square contingency_coefficient = ctable.contingency_coefficient puts [(i+1), (j+1), chi_square, contingency_coefficient].join("\t") if VERBOSE correlations["#{i+1},#{j+1}"] = contingency_coefficient correlations["#{j+1},#{i+1}"] = contingency_coefficient # Both ways are accurate end end require 'yaml' File.new('results.yml', 'a+') { |f| f.puts correlations.to_yaml }
ContingencyTable returns the statistical significance of change between two positions in an alignment. If you would like to see how every possible combination of positions in your alignment compares to one another you must set this up yourself. Hopefully the provided examples will help you get started without too much trouble.
def lite_example(sequences, max_length, characters) %w{i j chi_square contingency_coefficient}.each { |x| print x.ljust(12) } puts 0.upto(max_length - 1) do |i| (i+1).upto(max_length - 1) do |j| ctable = Bio::ContingencyTable.new( characters ) sequences.each do |seq| i_char = seq[i].chr j_char = seq[j].chr ctable.table[i_char][j_char] += 1 end chi_square = ctable.chi_square contingency_coefficient = ctable.contingency_coefficient [(i+1), (j+1), chi_square, contingency_coefficient].each { |x| print x.to_s.ljust(12) } puts end end end allowed_letters = Array.new allowed_letters = 'abcdefghijk'.split('') seqs = Array.new seqs << 'abcde' seqs << 'abcde' seqs << 'aacje' seqs << 'aacae' length_of_every_sequence = seqs[0].size # 5 letters long lite_example(seqs, length_of_every_sequence, allowed_letters)
Producing the following results:
i j chi_square contingency_coefficient 1 2 0.0 0.0 1 3 0.0 0.0 1 4 0.0 0.0 1 5 0.0 0.0 2 3 0.0 0.0 2 4 4.0 0.707106781186548 2 5 0.0 0.0 3 4 0.0 0.0 3 5 0.0 0.0 4 5 0.0 0.0
The position i=2 and j=4 has a high contingency coefficient indicating that the changes at these positions are related. Note that i and j are arbitrary, this could be represented as i=4 and j=2 since they both refer to position two and position four in the alignment. Here are some more examples:
seqs = Array.new seqs << 'abcde' seqs << 'abcde' seqs << 'aacje' seqs << 'aacae' seqs << 'akcfe' seqs << 'akcfe' length_of_every_sequence = seqs[0].size # 5 letters long lite_example(seqs, length_of_every_sequence, allowed_letters)
Results:
i j chi_square contingency_coefficient 1 2 0.0 0.0 1 3 0.0 0.0 1 4 0.0 0.0 1 5 0.0 0.0 2 3 0.0 0.0 2 4 12.0 0.816496580927726 2 5 0.0 0.0 3 4 0.0 0.0 3 5 0.0 0.0 4 5 0.0 0.0
Here we can see that the strength of the correlation of change has increased when more data is added with correlated changes at the same positions.
seqs = Array.new seqs << 'abcde' seqs << 'abcde' seqs << 'kacje' # changed first letter seqs << 'aacae' seqs << 'akcfa' # changed last letter seqs << 'akcfe' length_of_every_sequence = seqs[0].size # 5 letters long lite_example(seqs, length_of_every_sequence, allowed_letters)
Results:
i j chi_square contingency_coefficient 1 2 2.4 0.534522483824849 1 3 0.0 0.0 1 4 6.0 0.707106781186548 1 5 0.24 0.196116135138184 2 3 0.0 0.0 2 4 12.0 0.816496580927726 2 5 2.4 0.534522483824849 3 4 0.0 0.0 3 5 0.0 0.0 4 5 2.4 0.534522483824849
With random changes it becomes more difficult to identify correlated changes, yet positions two and four still have the highest correlation as indicated by the contingency coefficient. The best way to improve the accuracy of your results, as is often the case with statistics, is to increase the sample size.
ContingencyTable is slow. It involves many calculations for even a seemingly small five-string data set. Even worse, it‘s very dependent on matrix traversal, and this is done with two dimensional hashes which dashes any hope of decent speed.
Finally, half of the matrix is redundant and positions could be summed with their companion position to reduce calculations. For example the positions (5,2) and (2,5) could both have their values added together and just stored in (2,5) while (5,2) could be an illegal position. Also, positions (1,1), (2,2), (3,3), etc. will never be used.
The purpose of this package is flexibility and education. The code is short and to the point in aims of achieving that purpose. If the BioRuby project moves towards C extensions in the future a professional caliber version will likely be created.
characters | [R] | |
table | [RW] |
Since we‘re making this math-notation friendly here is the layout of
@table:
|
Create a ContingencyTable that has characters_in_sequence.size rows and characters_in_sequence.size columns for each row
Arguments
Returns: | ContingencyTable object to be filled with values and calculated upon |
# File lib/bio/util/contingency_table.rb, line 256 256: def initialize(characters_in_sequences = nil) 257: @characters = ( characters_in_sequences or %w{a c d e f g h i k l m n p q r s t v w y - x u} ) 258: tmp = Hash[*@characters.collect { |v| [v, 0] }.flatten] 259: @table = Hash[*@characters.collect { |v| [v, tmp.dup] }.flatten] 260: end
Report the chi square of the entire table
Arguments
Returns: | Float chi square value |
# File lib/bio/util/contingency_table.rb, line 332 332: def chi_square 333: total = 0 334: c = @characters 335: max = c.size - 1 336: @characters.each do |i| # Loop through every row in the ContingencyTable 337: @characters.each do |j| # Loop through every column in the ContingencyTable 338: total += chi_square_element(i, j) 339: end 340: end 341: total 342: end
Report the chi-square relation of two elements in the table
Arguments
Returns: | Float chi-square of an intersection |
# File lib/bio/util/contingency_table.rb, line 351 351: def chi_square_element(i, j) 352: eij = expected(i, j) 353: return 0 if eij == 0 354: ( @table[i][j] - eij )**2 / eij 355: end
Report the sum of all values in a given column
Arguments
Returns: | Integer sum of column |
# File lib/bio/util/contingency_table.rb, line 280 280: def column_sum(j) 281: total = 0 282: @table.each { |row_key, column| total += column[j] } 283: total 284: end
Report the sum of all values in all columns.
Arguments
Returns: | Integer sum of all columns |
# File lib/bio/util/contingency_table.rb, line 294 294: def column_sum_all 295: total = 0 296: @characters.each { |j| total += column_sum(j) } 297: total 298: end
Report the contingency coefficient of the table
Arguments
Returns: | Float contingency_coefficient of the table |
# File lib/bio/util/contingency_table.rb, line 363 363: def contingency_coefficient 364: c_s = chi_square 365: Math.sqrt(c_s / (table_sum_all + c_s) ) 366: end
Report the sum of all values in a given row
Arguments
Returns: | Integer sum of row |
# File lib/bio/util/contingency_table.rb, line 268 268: def row_sum(i) 269: total = 0 270: @table[i].each { |k, v| total += v } 271: total 272: end
Report the sum of all values in all rows.
Arguments
Returns: | Integer sum of all rows |
# File lib/bio/util/contingency_table.rb, line 308 308: def row_sum_all 309: total = 0 310: @characters.each { |i| total += row_sum(i) } 311: total 312: end