Class | Bio::KEGG::Taxonomy |
In: | lib/bio/db/kegg/taxonomy.rb |
Parent: | Object |
Parses the KEGG ‘taxonomy’ file, which describes the taxonomic classification of organisms.
The KEGG ‘taxonomy’ file is available at
leaves | [R] | |
path | [R] | |
root | [RW] | |
tree | [R] |
    # File lib/bio/db/kegg/taxonomy.rb, line 26
    def initialize(filename, orgs = [])
      # Stores the taxonomic tree as a linked list (implemented in Hash), so
      # every node needs to have a unique name (key) to work correctly
      @tree = Hash.new

      # Also stores the taxonomic tree as a list of arrays (full path)
      @path = Array.new

      # Also stores all leaf nodes (organism codes) of every intermediate node
      @leaves = Hash.new

      # tentative name for the root node (use accessor to change)
      @root = 'Genes'

      hier = Array.new
      level = 0
      label = nil

      File.open(filename).each do |line|
        next if line.strip.empty?

        # line for taxonomic hierarchy (indent according to the number of # marks)
        if line[/^#/]
          level = line[/^#+/].length
          label = line[/[A-z].*/]
          hier[level] = sanitize(label)

        # line for organism name (unify different strains of a species)
        else
          tax, org, name, desc = line.chomp.split("\t")
          if orgs.nil? or orgs.empty? or orgs.include?(org)
            species, strain, = name.split('_')
            # (0) Grouping of the strains of the same species.
            #  If the name of species is the same as the previous line,
            #  add the species to the same species group.
            #   ex. Gamma/enterobacteria has a large number of organisms,
            #       so sub grouping of strains is needed for E.coli strains etc.
            #
            # However, if the species name is already used, need to avoid
            # collision of species name as the current implementation stores
            # the tree as a Hash, which may cause an infinite loop.
            #
            # (1) If species name == the intermediate node of other lineage
            #  Add '_sp' to the species name to avoid the conflict (1-1), and if
            #  'species_sp' is already taken, use 'species_strain' instead (1-2).
            #   ex.  Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
            #        Bacteria/Proteobacteria/Epsilon/T.denitrificans_ATCC33889/tdn
            #     -> Bacteria/Proteobacteria/Beta/T.denitrificans/tbd
            #        Bacteria/Proteobacteria/Epsilon/T.denitrificans_sp/tdn
            #
            # (2) If species name == the intermediate node of the same lineage
            #  Add '_sp' to the species name to avoid the conflict.
            #   ex.  Bacteria/Cyanobacteria/Cyanobacteria_CYA/cya
            #        Bacteria/Cyanobacteria/Cyanobacteria_CYB/cya
            #        Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_MC1/mgm
            #     -> Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
            #        Bacteria/Cyanobacteria/Cyanobacteria_sp/cya
            #        Bacteria/Proteobacteria/Magnetococcus/Magnetococcus_sp/mgm
            sp_group = "#{species}_sp"
            if @tree[species]
              if hier[level+1] == species
                # case (0)
              else
                # case (1-1)
                species = sp_group
                # case (1-2)
                if @tree[sp_group] and hier[level+1] != species
                  species = name
                end
              end
            else
              if hier[level] == species
                # case (2)
                species = sp_group
              end
            end
            # 'hier' is an array of the taxonomic tree + species and strain name.
            #  ex. [nil, Eukaryotes, Fungi, Ascomycetes, Saccharomycetes] +
            #      [S_cerevisiae, sce]
            hier[level+1] = species       # sanitize(species)
            hier[level+2] = org
            ary = hier[1, level+2]
            warn ary.inspect if $DEBUG
            add_to_tree(ary)
            add_to_leaves(ary)
            add_to_path(ary)
          end
        end
      end
      return tree
    end
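The two kinds of input line handled by the constructor (hierarchy lines starting with # marks, and tab-separated organism lines) can be illustrated without the library. The following is a minimal sketch, not the library's parser: the helper name parse_paths, the sample text and the identifier T00000 are made up for illustration, and the sketch leaves out sanitize, the orgs filter and the name-collision cases (0) to (2).

```ruby
# Hypothetical helper, not part of Bio::KEGG::Taxonomy: turns taxonomy-style
# text into full paths, without the name-collision handling of initialize.
def parse_paths(text)
  hier  = []
  level = 0
  paths = []
  text.each_line do |line|
    next if line.strip.empty?
    if line.start_with?('#')
      # Hierarchy line: the number of leading '#' marks is the depth.
      level = line[/^#+/].length
      hier[level] = line[/[A-Za-z].*/].strip
    else
      # Organism line: tab-separated fields; only the code and name are used.
      _tax, org, name, _desc = line.chomp.split("\t")
      species = name.split('_').first
      hier[level + 1] = species
      hier[level + 2] = org
      # Drop the unused slot 0 and keep everything down to the organism code.
      paths << hier[1, level + 2]
    end
  end
  paths
end

# Made-up sample in the shape the constructor expects.
sample = <<~TEXT
  # Eukaryotes
  ## Fungi
  ### Ascomycetes
  T00000\tsce\tS.cerevisiae\tSaccharomyces cerevisiae
TEXT

paths = parse_paths(sample)
p paths
```

Each resulting path is what the constructor would hand to add_to_tree, add_to_leaves and add_to_path.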
Adds a new path [node, subnode, subsubnode, …, leaf] under the root node and records the leaf in the Array of leaves kept for every node on the path.
    # File lib/bio/db/kegg/taxonomy.rb, line 140
    def add_to_leaves(ary)
      leaf = ary.last
      ary.each do |node|
        @leaves[node] ||= Array.new
        @leaves[node] << leaf
      end
    end
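A small sketch of the resulting structure, using a plain Hash as a stand-in for @leaves. The two paths are illustrative and not taken from a real taxonomy file.

```ruby
# Stand-in for @leaves; a missing node starts with an empty Array.
leaves = Hash.new { |hash, node| hash[node] = [] }

[
  %w[Plants Monocotyledons osa],
  %w[Plants Eudicotyledons ath]
].each do |path|
  leaf = path.last
  # Every node on the path, the leaf included, records the leaf.
  path.each { |node| leaves[node] << leaf }
end

p leaves['Plants']          # all leaves below Plants
p leaves['Monocotyledons']  # only the leaves of this branch
p leaves['osa']             # a leaf is also listed under its own name
```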
Adds a new path [node, subnode, subsubnode, …, leaf] under the root node; each intermediate node stores its child nodes as a Hash.
    # File lib/bio/db/kegg/taxonomy.rb, line 129
    def add_to_tree(ary)
      parent = @root
      ary.each do |node|
        @tree[parent] ||= Hash.new
        @tree[parent][node] = nil
        parent = node
      end
    end
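A sketch of the Hash that this produces, with local variables standing in for @root and @tree. The lambda name add_path and the two paths are hypothetical.

```ruby
root = 'Genes'   # same tentative root name that initialize sets
tree = {}        # stand-in for @tree

# Hypothetical helper mirroring add_to_tree on the local variables above.
add_path = lambda do |path|
  parent = root
  path.each do |node|
    (tree[parent] ||= {})[node] = nil
    parent = node
  end
end

add_path.call(%w[Plants Monocotyledons osa])
add_path.call(%w[Plants Eudicotyledons ath])

p tree
```

Only nodes that have children get an entry of their own, so a leaf such as osa appears solely as a key inside its parent's Hash. Because entries are keyed by node name alone, two lineages that used the same name would share one entry, which is why initialize renames colliding species.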
Compacts intermediate nodes of the resulting taxonomic tree.
- If a child node has only one child (the grandchild), the grandchild is removed and its children become the children of the child node. ex. Plants / Monocotyledons / grass family / osa --> Plants / Monocotyledons / osa
    # File lib/bio/db/kegg/taxonomy.rb, line 161
    def compact(node = root)
      # if the node has children
      if subnodes = @tree[node]
        # obtain grandchildren for each child
        subnodes.keys.each do |subnode|
          if subsubnodes = @tree[subnode]
            # if the number of grandchild node is 1
            if subsubnodes.keys.size == 1
              # obtain the name of the grandchild node
              subsubnode = subsubnodes.keys.first
              # obtain the child of the grandchild node
              if subsubsubnodes = @tree[subsubnode]
                # make the child of grandchild node as a child of child node
                @tree[subnode] = subsubsubnodes
                # delete grandchild node
                @tree[subnode].delete(subsubnode)
                warn "--- compact: #{subsubnode} is replaced by #{subsubsubnodes}" if $DEBUG
                # retry until new grandchild also needed to be compacted.
                retry
              end
            end
          end
          # repeat recursively
          compact(subnode)
        end
      end
    end
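Ruby 1.9 and later reject retry outside a rescue clause, so the listing as shown relies on the Ruby 1.8 behaviour of restarting the iterator. The following is a minimal sketch of the same compaction written with an explicit loop, assuming the tree is a plain Hash in the layout built by add_to_tree. The name compact_tree and the sample lineage are hypothetical, and unlike the listing the sketch also removes the spliced-out node's own entry from the Hash.

```ruby
# Hypothetical standalone helper, not part of Bio::KEGG::Taxonomy.
# tree maps a node name to a Hash of its children ({child => nil}).
def compact_tree(tree, node)
  children = tree[node]
  return tree unless children

  children.keys.each do |child|
    # Keep splicing while the child has exactly one child that is itself
    # an intermediate node (a leaf is left in place).
    loop do
      grandchildren = tree[child]
      break unless grandchildren && grandchildren.size == 1
      only = grandchildren.keys.first
      break unless tree[only]
      # The child adopts the grandchild's children; the grandchild goes away.
      tree[child] = tree.delete(only)
    end
    compact_tree(tree, child)
  end
  tree
end

# Illustrative tree; the second branch only keeps Plants from being spliced.
tree = {
  'Genes'          => { 'Plants' => nil },
  'Plants'         => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'grass family' => nil },
  'grass family'   => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil, 'pop' => nil }
}

compact_tree(tree, 'Genes')
p tree['Monocotyledons']
```

After the call, Monocotyledons points directly at osa, matching the example Plants / Monocotyledons / grass family / osa --> Plants / Monocotyledons / osa.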
Traverses the taxonomic tree depth-first, starting from the given (root or intermediate) node.
    # File lib/bio/db/kegg/taxonomy.rb, line 224
    def dfs(parent, &block)
      if children = @tree[parent]
        yield parent, children
        children.keys.each do |child|
          dfs(child, &block)
        end
      end
    end
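A minimal sketch of the same traversal on a plain Hash, to show the order in which nodes reach the block. The helper name walk and the sample tree are hypothetical.

```ruby
# Hypothetical standalone equivalent of dfs that takes the tree as an argument.
def walk(tree, parent, &block)
  children = tree[parent]
  return unless children

  # Visit the parent first, then descend into each child in insertion order.
  yield parent, children
  children.each_key { |child| walk(tree, child, &block) }
end

# Illustrative tree in the layout built by add_to_tree.
tree = {
  'Genes'          => { 'Plants' => nil },
  'Plants'         => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil }
}

visited = []
walk(tree, 'Genes') { |node, children| visited << [node, children.keys] }
p visited
```

Leaves have no entry of their own in the tree, so they show up only among the children passed to the block and are never yielded as a parent.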
Similar to the dfs method, but also passes the current nesting level to the block.
    # File lib/bio/db/kegg/taxonomy.rb, line 235
    def dfs_with_level(parent, &block)
      @level ||= 0
      if children = @tree[parent]
        yield parent, children, @level
        @level += 1
        children.keys.each do |child|
          dfs_with_level(child, &block)
        end
        @level -= 1
      end
    end
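The listing keeps the depth in the instance variable @level, which is only decremented when the traversal of a subtree finishes normally. As an alternative sketch (an assumption about a possible design, not the library's code), the depth can be passed as an argument so that leaving the block early cannot leave a stale counter behind. The helper name walk_with_level and the sample tree are hypothetical.

```ruby
# Hypothetical variant of dfs_with_level that carries the depth as an argument.
def walk_with_level(tree, parent, level = 0, &block)
  children = tree[parent]
  return unless children

  yield parent, children, level
  children.each_key { |child| walk_with_level(tree, child, level + 1, &block) }
end

# Illustrative tree in the layout built by add_to_tree.
tree = {
  'Genes'          => { 'Plants' => nil },
  'Plants'         => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil }
}

# Build an indented outline, two spaces per level.
outline = []
walk_with_level(tree, 'Genes') do |node, children, level|
  outline << "#{'  ' * level}#{node}: #{children.keys.join(', ')}"
end
puts outline
```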
Reduces leaf nodes of the resulting taxonomic tree.
- If a parent node has only one leaf node, the parent node is replaced with that leaf node. ex. Plants / Monocotyledons / osa --> Plants / osa
    # File lib/bio/db/kegg/taxonomy.rb, line 196
    def reduce(node = root)
      # if the node has children
      if subnodes = @tree[node]
        # obtain grandchildren for each child
        subnodes.keys.each do |subnode|
          if subsubnodes = @tree[subnode]
            # if the number of grandchild node is 1
            if subsubnodes.keys.size == 1
              # obtain the name of the grandchild node
              subsubnode = subsubnodes.keys.first
              # if the grandchild node is a leaf node
              unless @tree[subsubnode]
                # make the grandchild node as a child node
                @tree[node].update(subsubnodes)
                # delete child node
                @tree[node].delete(subnode)
                warn "--- reduce: #{subnode} is replaced by #{subsubnode}" if $DEBUG
              end
            end
          end
          # repeat recursively
          reduce(subnode)
        end
      end
    end
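A minimal sketch of the reduction on a plain Hash, assuming the layout built by add_to_tree. The name reduce_tree and the sample lineage are hypothetical. Unlike the listing, the sketch also drops the replaced node's own entry from the Hash and does not recurse into a node it has just removed.

```ruby
# Hypothetical standalone helper, not part of Bio::KEGG::Taxonomy.
def reduce_tree(tree, node)
  children = tree[node]
  return tree unless children

  children.keys.each do |child|
    grandchildren = tree[child]
    next unless grandchildren          # the child is a leaf already

    only = grandchildren.keys.first
    if grandchildren.size == 1 && tree[only].nil?
      # The child's single child is a leaf: hang the leaf directly on node.
      children[only] = nil
      children.delete(child)
      tree.delete(child)
    else
      reduce_tree(tree, child)
    end
  end
  tree
end

# Illustrative tree; Eudicotyledons has two leaves and is therefore kept.
tree = {
  'Genes'          => { 'Plants' => nil },
  'Plants'         => { 'Monocotyledons' => nil, 'Eudicotyledons' => nil },
  'Monocotyledons' => { 'osa' => nil },
  'Eudicotyledons' => { 'ath' => nil, 'pop' => nil }
}

reduce_tree(tree, 'Genes')
p tree['Plants']
```

After the call, osa hangs directly under Plants, matching the example Plants / Monocotyledons / osa --> Plants / osa.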