Merge lp:~jon-hill/supertree-toolkit/sub_in_subfile into lp:supertree-toolkit

Proposed by Jon Hill
Status: Merged
Merged at revision: 281
Proposed branch: lp:~jon-hill/supertree-toolkit/sub_in_subfile
Merge into: lp:supertree-toolkit
Diff against target: 13342 lines (+11326/-781)
44 files modified
debian/control (+1/-1)
debian/rules (+1/-0)
notes.txt (+38/-0)
stk/bzr_version.py (+5/-5)
stk/p4/NexusToken.py (+1/-0)
stk/p4/NexusToken2.py (+1/-1)
stk/p4/Tree.py (+1/-9)
stk/p4/Tree_muck.py (+4/-2)
stk/scripts/check_nomenclature.py (+0/-224)
stk/scripts/check_nomenclature.py.moved (+224/-0)
stk/scripts/create_colours_itol.py (+2/-11)
stk/scripts/create_taxonomy.py (+4/-100)
stk/scripts/fill_in_with_taxonomy.py (+711/-174)
stk/scripts/plot_character_taxa_matrix.py (+83/-1)
stk/scripts/plot_tree_taxa_matrix.py (+56/-0)
stk/scripts/remove_poorly_constrained_taxa.py (+43/-20)
stk/scripts/tree_from_taxonomy.py (+142/-0)
stk/stk (+787/-34)
stk/stk_exceptions.py (+8/-0)
stk/supertree_toolkit.py (+849/-47)
stk/test/_substitute_taxa.py (+19/-1)
stk/test/_supertree_toolkit.py (+138/-15)
stk/test/_trees.py (+13/-1)
stk/test/data/input/auto_sub.phyml (+97/-0)
stk/test/data/input/check_data_ind.phyml (+141/-0)
stk/test/data/input/check_taxonomy.phyml (+67/-0)
stk/test/data/input/check_taxonomy_fixes.phyml (+378/-0)
stk/test/data/input/create_taxonomy.csv (+6/-6)
stk/test/data/input/create_taxonomy.phyml (+67/-0)
stk/test/data/input/equivalents.csv (+5/-0)
stk/test/data/input/mrca.tre (+1/-0)
stk/test/data/input/old_stk_test_data_ind.phyml (+1324/-0)
stk/test/data/input/old_stk_test_data_tax_overlap.phyml (+627/-0)
stk/test/data/input/old_stk_test_nonmonophyl_removed.phyml (+1324/-0)
stk/test/data/input/old_stk_test_species_level.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy.csv (+334/-0)
stk/test/data/input/old_stk_test_taxonomy_check_subs.dat (+26/-0)
stk/test/data/input/old_stk_test_taxonomy_checked.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy_checker.csv (+336/-0)
stk/test/data/output/one_click_subs_output.phyml (+97/-0)
stk/test/util.py (+7/-0)
stk_gui/gui/gui.glade (+670/-124)
stk_gui/plugins/phyml/name_author.py (+4/-1)
stk_gui/stk_gui/interface.py (+36/-4)
To merge this branch: bzr merge lp:~jon-hill/supertree-toolkit/sub_in_subfile
Reviewer: Jon Hill
Review status: Approve
Review via email: mp+314598@code.launchpad.net

Description of the change

Adds taxonomic awareness and fixes a number of bugs.

322. By Jon Hill

removing file that shouldn't be there

323. By Jon Hill

removing file that shouldn't be there

Revision history for this message
Jon Hill (jon-hill) :
review: Approve

Preview Diff

1=== modified file 'debian/control'
2--- debian/control 2016-12-14 16:22:12 +0000
3+++ debian/control 2017-01-12 09:27:31 +0000
4@@ -9,7 +9,7 @@
5
6 Package: supertree-toolkit
7 Architecture: all
8-Depends: python-tk, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx
9+Depends: python-tk, python-simplejson, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx, python-argcomplete
10 Recommends: python-psyco
11 Suggests:
12 Conflicts:
13
14=== modified file 'debian/rules'
15--- debian/rules 2013-10-14 12:58:59 +0000
16+++ debian/rules 2017-01-12 09:27:31 +0000
17@@ -6,5 +6,6 @@
18
19 override_dh_auto_install:
20 python setup.py install --root=debian/supertree-toolkit --install-layout=deb --install-scripts=/usr/bin
21+ argcomplete.autocomplete(parser)
22
23 override_dh_auto_build:
24
25=== added file 'notes.txt'
26--- notes.txt 1970-01-01 00:00:00 +0000
27+++ notes.txt 2017-01-12 09:27:31 +0000
28@@ -0,0 +1,38 @@
29+Ideas:
30+
31+Collect data, remove paraphyletic
32+
33+Take taxonomy (from dbs), phyml, user's knowledge (encoded as subs file) and information on synonyms (from dbs)
34+to create a master subs file that takes the data to species level
35+
36+User needs to be able to edit taxonomy - CSV file
37+
38+User needs to choose database source - preferred source.
39+
40+
41+Taxonomic name checker:
42+
43+ - use database to get synonyms and possible misspellings
44+ - GUI is a 2-column table with green, yellow, red. User fills in red (or removes it), green is fine. Yellow - drop-down list with alternatives.
45+ - Use this to generate a two column CSV file
46+ - On CLI, generate a three column CSV. Original name, new name (or blank for unknown) and a list of possibles. Warn user they *must* fill in the second column or remove the row or the taxa will be deleted.
47+
48+For colloquial names, user adds to column 1 of taxonomy csv and then adds the Latin name in the appropriate column of the database. The subs can then generate the species list.
49+
50+Use these two csv files to generate a subs file, including replacing higher taxa and genera to create a "to species" substitution (can also output this file for later)
51+
52+Generating data to any taxonomic level can happen later - need to check each species is accounted for in the taxonomy, with correct levels - may need another parse of the taxonomy csv
53+
54+
55+Add data -> paraphyletic taxa -> taxonomy checker -> sub synonyms -> taxonomy generator -> create species level dataset
56+
57+New functions:
58+ - taxonomic name checker (this might take a while when online for a large dataset) - note that this should be a one-for-one substitution - separate function so we can check this?
59+ - Pull in taxonomy generator
60+ - Add csv file to schema
61+ - amend manual with workflow
62+ - warning on multiple subs in data in manual
63+ - generate species level subsfile from taxonomy
64+ - generate specified taxonomic level data
65+
66+
67
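The checker workflow sketched in the notes ends in a CSV of original name, candidate names, and a traffic-light status. As a minimal Python 3 sketch (the taxon names and helper are illustrative; the toolkit itself is Python 2), the three-column output could be written like this:

```python
import csv
import io

# Hypothetical checker output: original name -> ([candidate names], status).
# Statuses follow the notes: green = OK, yellow = synonyms found,
# amber = multiple database hits, red = unknown (user must fix or remove).
equivalents = {
    "Gallus_gallus": (["Gallus_gallus"], "green"),
    "Corvus_coraxx": (["Corvus_corax"], "yellow"),
    "Unknownus_taxon": (["Unknownus_taxon"], "red"),
}

def write_equivalents_csv(equivalents, fileobj):
    """Write one row per taxon: original name, ';'-joined candidates, status."""
    writer = csv.writer(fileobj)
    for taxon in sorted(equivalents):
        names, status = equivalents[taxon]
        writer.writerow([taxon, ";".join(names), status])

buf = io.StringIO()
write_equivalents_csv(equivalents, buf)
print(buf.getvalue())
```

The two-column GUI file described in the notes is the same structure minus the status column.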
68=== modified file 'stk/bzr_version.py'
69--- stk/bzr_version.py 2017-01-11 17:42:56 +0000
70+++ stk/bzr_version.py 2017-01-12 09:27:31 +0000
71@@ -4,12 +4,12 @@
72 So don't edit it. :)
73 """
74
75-version_info = {'branch_nick': u'supertree-toolkit',
76- 'build_date': '2017-01-11 17:42:27 +0000',
77+version_info = {'branch_nick': u'sub_in_subfile',
78+ 'build_date': '2017-01-11 17:48:33 +0000',
79 'clean': None,
80- 'date': '2017-01-11 17:39:43 +0000',
81- 'revision_id': 'jon.hill@imperial.ac.uk-20170111173943-88so1icr33su3afo',
82- 'revno': '279'}
83+ 'date': '2017-01-11 17:48:18 +0000',
84+ 'revision_id': 'jon.hill@imperial.ac.uk-20170111174818-9q8a9octvnawruuw',
85+ 'revno': '317'}
86
87 revisions = {}
88
89
90=== modified file 'stk/p4/NexusToken.py'
91--- stk/p4/NexusToken.py 2012-01-11 08:57:43 +0000
92+++ stk/p4/NexusToken.py 2017-01-12 09:27:31 +0000
93@@ -44,6 +44,7 @@
94 gm = ["safeNextTok(), called from %s" % caller]
95 else:
96 gm = ["safeNextTok()"]
97+ print flob
98 gm.append("Premature Death.")
99 gm.append("Ran out of understandable things to read in nexus file.")
100 raise Glitch, gm
101
102=== modified file 'stk/p4/NexusToken2.py'
103--- stk/p4/NexusToken2.py 2012-01-11 08:57:43 +0000
104+++ stk/p4/NexusToken2.py 2017-01-12 09:27:31 +0000
105@@ -88,7 +88,7 @@
106 else:
107 gm = ["safeNextTok()"]
108 gm.append("Premature Death.")
109- gm.append("Ran out of understandable things to read in nexus file.")
110+ gm.append("Ran out of understandable things to read in nexus file." + str(flob))
111 raise Glitch, gm
112 else:
113 return t
114
115=== modified file 'stk/p4/Tree.py'
116--- stk/p4/Tree.py 2013-08-25 09:24:34 +0000
117+++ stk/p4/Tree.py 2017-01-12 09:27:31 +0000
118@@ -996,17 +996,9 @@
119 if not item.name:
120 if item == self.root:
121 if var.fixRootedTrees:
122- if self.name:
123- print "Tree.initFinish() tree '%s'" % self.name
124- else:
125- print 'Tree.initFinish()'
126- print "Fixing tree to work with SuperTree scores"
127+ #print "Fixing tree to work with SuperTree scores"
128 self.removeRoot()
129 elif var.warnAboutTerminalRootWithNoName:
130- if self.name:
131- print "Tree.initFinish() tree '%s'" % self.name
132- else:
133- print 'Tree.initFinish()'
134 print ' Non-fatal warning: the root is terminal, but has no name.'
135 print ' This may be what you wanted. Or not?'
136 print ' (To get rid of this warning, turn off var.warnAboutTerminalRootWithNoName)'
137
138=== modified file 'stk/p4/Tree_muck.py'
139--- stk/p4/Tree_muck.py 2015-02-19 14:47:06 +0000
140+++ stk/p4/Tree_muck.py 2017-01-12 09:27:31 +0000
141@@ -769,6 +769,7 @@
142 else:
143 gm.append("The 2 specified nodes should have a parent-child relationship")
144 raise Glitch, gm
145+
146 if var.usePfAndNumpy:
147 self.deleteCStuff()
148
149@@ -1629,7 +1630,7 @@
150
151
152
153-def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None):
154+def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None, ignoreRootAssert=False):
155 """Add a subtree to a tree.
156
157 The nodes from theSubTree are added to self.nodes, and theSubTree
158@@ -1666,7 +1667,8 @@
159
160 assert selfNode in self.nodes
161 assert selfNode.parent
162- assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
163+ if not ignoreRootAssert:
164+ assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
165 if not subTreeTaxNames:
166 subTreeTaxNames = [n.name for n in theSubTree.iterLeavesNoRoot()]
167
168
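The addSubTree change above adds an ignoreRootAssert keyword so callers can skip the "root on a stick" check when grafting a subtree whose root has more than one child. A minimal sketch of that guarded-assert pattern (class and function names here are illustrative, not p4's):

```python
class Node:
    """Toy node mirroring the two attributes the assert inspects."""
    def __init__(self, left_child=None, sibling=None):
        self.leftChild = left_child
        self.sibling = sibling

def check_root_on_a_stick(root, ignore_root_assert=False):
    """The root must have exactly one child (a 'root on a stick')
    unless the caller explicitly opts out, as addSubTree now allows."""
    if not ignore_root_assert:
        assert root.leftChild is not None and root.leftChild.sibling is None
    return True
```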
169=== removed file 'stk/scripts/check_nomenclature.py'
170--- stk/scripts/check_nomenclature.py 2016-07-14 10:12:17 +0000
171+++ stk/scripts/check_nomenclature.py 1970-01-01 00:00:00 +0000
172@@ -1,224 +0,0 @@
173-#!/usr/bin/env python
174-#
175-# Derived from the Supertree Toolkit. Software for managing and manipulating sources
176-# trees ready for supretree construction.
177-# Copyright (C) 2015, Jon Hill, Katie Davis
178-#
179-# This program is free software: you can redistribute it and/or modify
180-# it under the terms of the GNU General Public License as published by
181-# the Free Software Foundation, either version 3 of the License, or
182-# (at your option) any later version.
183-#
184-# This program is distributed in the hope that it will be useful,
185-# but WITHOUT ANY WARRANTY; without even the implied warranty of
186-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
187-# GNU General Public License for more details.
188-#
189-# You should have received a copy of the GNU General Public License
190-# along with this program. If not, see <http://www.gnu.org/licenses/>.
191-#
192-# Jon Hill. jon.hill@york.ac.uk.
193-#
194-#
195-# This is an enitrely self-contained script that does not require the STK to be installed.
196-
197-import urllib2
198-from urllib import quote_plus
199-import simplejson as json
200-import argparse
201-import os
202-import sys
203-import csv
204-
205-def main():
206-
207- # do stuff
208- parser = argparse.ArgumentParser(
209- prog="Check nomenclature",
210- description="Check nomenclature from a tree file or list against valid names derived from EOL",
211- )
212- parser.add_argument(
213- '-v',
214- '--verbose',
215- action='store_true',
216- help="Verbose output: mainly progress reports.",
217- default=False
218- )
219- parser.add_argument(
220- '--existing',
221- help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
222- )
223- parser.add_argument(
224- 'input_file',
225- metavar='input_file',
226- nargs=1,
227- help="Your input taxa list"
228- )
229- parser.add_argument(
230- 'output_file',
231- metavar='output_file',
232- nargs=1,
233- help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
234- )
235-
236- args = parser.parse_args()
237- verbose = args.verbose
238- input_file = args.input_file[0]
239- output_file = args.output_file[0]
240- existing_data = args.existing
241-
242- if (not existing_data == None):
243- exiting_data = load_equivalents(existing_data)
244- else:
245- existing_data = None
246-
247- with open(input_file,'r') as f:
248- lines = f.read().splitlines()
249- equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
250-
251-
252- f = open(output_file,"w")
253- for taxon in sorted(equivs.keys()):
254- f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
255- f.close()
256-
257- return
258-
259-
260-def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
261- """ For each name in the database generate a database of the original name,
262- possible synonyms and if the taxon is not know, signal that. We do this by
263- using the EoL API to grab synonyms of each taxon. """
264-
265-
266- if existing_data == None:
267- equivalents = {}
268- else:
269- equivalents = existing_data
270-
271- # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
272- # if not, is there another API function to do this?
273- # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
274- # you get the original ('correct') name - shorten this to two words and you're done.
275- for t in name_list:
276- # make sure t has no spaces.
277- t = t.replace(" ","_")
278- if t in equivalents:
279- continue
280- taxon = t.replace("_"," ")
281- if (verbose):
282- print "Looking up ", taxon
283- # get the data from EOL on taxon
284- taxonq = quote_plus(taxon)
285- URL = "http://eol.org/api/search/1.0.json?q="+taxonq
286- req = urllib2.Request(URL)
287- opener = urllib2.build_opener()
288- f = opener.open(req)
289- data = json.load(f)
290- # check if there's some data
291- if len(data['results']) == 0:
292- equivalents[t] = [[t],'red']
293- continue
294- amber = False
295- if len(data['results']) > 1:
296- # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
297- # for automatic processing we'll just take the first one though
298- # colour is amber in this case
299- amber = True
300- ID = str(data['results'][0]['id']) # take first hit
301- URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
302- req = urllib2.Request(URL)
303- opener = urllib2.build_opener()
304-
305- try:
306- f = opener.open(req)
307- except urllib2.HTTPError:
308- equivalents[t] = [[t],'red']
309- continue
310- data = json.load(f)
311- if len(data['scientificName']) == 0:
312- # not found a scientific name, so set as red
313- equivalents[t] = [[t],'red']
314- continue
315- correct_name = data['scientificName'].encode("ascii","ignore")
316- # we only want the first two bits of the name, not the original author and year if any
317- temp_name = correct_name.split(' ')
318- if (len(temp_name) > 2):
319- correct_name = ' '.join(temp_name[0:2])
320- correct_name = correct_name.replace(' ','_')
321- print correct_name, t
322-
323- # build up the output dictionary - original name is key, synonyms/missing is value
324- if (correct_name == t or correct_name == taxon):
325- # if the original matches the 'correct', then it's green
326- equivalents[t] = [[t], 'green']
327- else:
328- # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
329- # 'correct' taxon at the top
330- eol_synonyms = data['synonyms']
331- synonyms = []
332- for s in eol_synonyms:
333- ts = s['synonym'].encode("ascii","ignore")
334- temp_syn = ts.split(' ')
335- if (len(temp_syn) > 2):
336- temp_syn = ' '.join(temp_syn[0:2])
337- ts = temp_syn
338- if (s['relationship'] == "synonym"):
339- ts = ts.replace(" ","_")
340- synonyms.append(ts)
341- synonyms = _uniquify(synonyms)
342- # we need to put the correct name at the top of the list now
343- if (correct_name in synonyms):
344- synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
345- elif len(synonyms) == 0:
346- synonyms.append(correct_name)
347- else:
348- synonyms.insert(0,correct_name)
349-
350- if (amber):
351- equivalents[t] = [synonyms,'amber']
352- else:
353- equivalents[t] = [synonyms,'yellow']
354- # if our search was empty, then it's red - see above
355-
356- # up to the calling funciton to do something sensible with this
357- # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
358- # Amber means we found synonyms and multilpe hits. User def needs to sort these!
359-
360- return equivalents
361-
362-def load_equivalents(equiv_csv):
363- """Load equivalents data from a csv and convert to a equivalents Dict.
364- Structure is key, with a list that is array of synonyms, followed by status ('green',
365- 'yellow', 'amber', or 'red').
366-
367- """
368-
369- import csv
370-
371- equivalents = {}
372-
373- with open(equiv_csv, 'rU') as csvfile:
374- equiv_reader = csv.reader(csvfile, delimiter=',')
375- equiv_reader.next() # skip header
376- for row in equiv_reader:
377- i = 1
378- equivalents[row[0]] = [row[1].split(';'),row[2]]
379-
380- return equivalents
381-
382-def _uniquify(l):
383- """
384- Make a list, l, contain only unique data
385- """
386- keys = {}
387- for e in l:
388- keys[e] = 1
389-
390- return keys.keys()
391-
392-if __name__ == "__main__":
393- main()
394-
395-
396-
397
398=== added file 'stk/scripts/check_nomenclature.py.moved'
399--- stk/scripts/check_nomenclature.py.moved 1970-01-01 00:00:00 +0000
400+++ stk/scripts/check_nomenclature.py.moved 2017-01-12 09:27:31 +0000
401@@ -0,0 +1,224 @@
402+#!/usr/bin/env python
403+#
404+# Derived from the Supertree Toolkit. Software for managing and manipulating source
405+# trees ready for supertree construction.
406+# Copyright (C) 2015, Jon Hill, Katie Davis
407+#
408+# This program is free software: you can redistribute it and/or modify
409+# it under the terms of the GNU General Public License as published by
410+# the Free Software Foundation, either version 3 of the License, or
411+# (at your option) any later version.
412+#
413+# This program is distributed in the hope that it will be useful,
414+# but WITHOUT ANY WARRANTY; without even the implied warranty of
415+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
416+# GNU General Public License for more details.
417+#
418+# You should have received a copy of the GNU General Public License
419+# along with this program. If not, see <http://www.gnu.org/licenses/>.
420+#
421+# Jon Hill. jon.hill@york.ac.uk.
422+#
423+#
424+# This is an entirely self-contained script that does not require the STK to be installed.
425+
426+import urllib2
427+from urllib import quote_plus
428+import simplejson as json
429+import argparse
430+import os
431+import sys
432+import csv
433+
434+def main():
435+
436+ # do stuff
437+ parser = argparse.ArgumentParser(
438+ prog="Check nomenclature",
439+ description="Check nomenclature from a tree file or list against valid names derived from EOL",
440+ )
441+ parser.add_argument(
442+ '-v',
443+ '--verbose',
444+ action='store_true',
445+ help="Verbose output: mainly progress reports.",
446+ default=False
447+ )
448+ parser.add_argument(
449+ '--existing',
450+ help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
451+ )
452+ parser.add_argument(
453+ 'input_file',
454+ metavar='input_file',
455+ nargs=1,
456+ help="Your input taxa list"
457+ )
458+ parser.add_argument(
459+ 'output_file',
460+ metavar='output_file',
461+ nargs=1,
462+ help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
463+ )
464+
465+ args = parser.parse_args()
466+ verbose = args.verbose
467+ input_file = args.input_file[0]
468+ output_file = args.output_file[0]
469+ existing_data = args.existing
470+
471+ if (not existing_data == None):
472+ existing_data = load_equivalents(existing_data)
473+ else:
474+ existing_data = None
475+
476+ with open(input_file,'r') as f:
477+ lines = f.read().splitlines()
478+ equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
479+
480+
481+ f = open(output_file,"w")
482+ for taxon in sorted(equivs.keys()):
483+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
484+ f.close()
485+
486+ return
487+
488+
489+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
490+ """ For each name in the database generate a database of the original name,
491+ possible synonyms and if the taxon is not known, signal that. We do this by
492+ using the EoL API to grab synonyms of each taxon. """
493+
494+
495+ if existing_data == None:
496+ equivalents = {}
497+ else:
498+ equivalents = existing_data
499+
500+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
501+ # if not, is there another API function to do this?
502+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
503+ # you get the original ('correct') name - shorten this to two words and you're done.
504+ for t in name_list:
505+ # make sure t has no spaces.
506+ t = t.replace(" ","_")
507+ if t in equivalents:
508+ continue
509+ taxon = t.replace("_"," ")
510+ if (verbose):
511+ print "Looking up ", taxon
512+ # get the data from EOL on taxon
513+ taxonq = quote_plus(taxon)
514+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
515+ req = urllib2.Request(URL)
516+ opener = urllib2.build_opener()
517+ f = opener.open(req)
518+ data = json.load(f)
519+ # check if there's some data
520+ if len(data['results']) == 0:
521+ equivalents[t] = [[t],'red']
522+ continue
523+ amber = False
524+ if len(data['results']) > 1:
525+ # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
526+ # for automatic processing we'll just take the first one though
527+ # colour is amber in this case
528+ amber = True
529+ ID = str(data['results'][0]['id']) # take first hit
530+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
531+ req = urllib2.Request(URL)
532+ opener = urllib2.build_opener()
533+
534+ try:
535+ f = opener.open(req)
536+ except urllib2.HTTPError:
537+ equivalents[t] = [[t],'red']
538+ continue
539+ data = json.load(f)
540+ if len(data['scientificName']) == 0:
541+ # not found a scientific name, so set as red
542+ equivalents[t] = [[t],'red']
543+ continue
544+ correct_name = data['scientificName'].encode("ascii","ignore")
545+ # we only want the first two bits of the name, not the original author and year if any
546+ temp_name = correct_name.split(' ')
547+ if (len(temp_name) > 2):
548+ correct_name = ' '.join(temp_name[0:2])
549+ correct_name = correct_name.replace(' ','_')
550+ print correct_name, t
551+
552+ # build up the output dictionary - original name is key, synonyms/missing is value
553+ if (correct_name == t or correct_name == taxon):
554+ # if the original matches the 'correct', then it's green
555+ equivalents[t] = [[t], 'green']
556+ else:
557+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
558+ # 'correct' taxon at the top
559+ eol_synonyms = data['synonyms']
560+ synonyms = []
561+ for s in eol_synonyms:
562+ ts = s['synonym'].encode("ascii","ignore")
563+ temp_syn = ts.split(' ')
564+ if (len(temp_syn) > 2):
565+ temp_syn = ' '.join(temp_syn[0:2])
566+ ts = temp_syn
567+ if (s['relationship'] == "synonym"):
568+ ts = ts.replace(" ","_")
569+ synonyms.append(ts)
570+ synonyms = _uniquify(synonyms)
571+ # we need to put the correct name at the top of the list now
572+ if (correct_name in synonyms):
573+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
574+ elif len(synonyms) == 0:
575+ synonyms.append(correct_name)
576+ else:
577+ synonyms.insert(0,correct_name)
578+
579+ if (amber):
580+ equivalents[t] = [synonyms,'amber']
581+ else:
582+ equivalents[t] = [synonyms,'yellow']
583+ # if our search was empty, then it's red - see above
584+
585+ # up to the calling function to do something sensible with this
586+ # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
587+ # Amber means we found synonyms and multiple hits. User definitely needs to sort these!
588+
589+ return equivalents
590+
591+def load_equivalents(equiv_csv):
592+ """Load equivalents data from a csv and convert to a equivalents Dict.
593+ Structure is key, with a list that is array of synonyms, followed by status ('green',
594+ 'yellow', 'amber', or 'red').
595+
596+ """
597+
598+ import csv
599+
600+ equivalents = {}
601+
602+ with open(equiv_csv, 'rU') as csvfile:
603+ equiv_reader = csv.reader(csvfile, delimiter=',')
604+ equiv_reader.next() # skip header
605+ for row in equiv_reader:
606+ i = 1
607+ equivalents[row[0]] = [row[1].split(';'),row[2]]
608+
609+ return equivalents
610+
611+def _uniquify(l):
612+ """
613+ Make a list, l, contain only unique data
614+ """
615+ keys = {}
616+ for e in l:
617+ keys[e] = 1
618+
619+ return keys.keys()
620+
621+if __name__ == "__main__":
622+ main()
623+
624+
625+
626
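The colour logic in taxonomic_checker_list above can be distilled into a small pure function: red when the search returns nothing usable, green when the resolved name matches the original, yellow when a single hit resolves to a different name (correct name first in the list), and amber when there were multiple hits. A Python 3 sketch (the function name and signature are hypothetical; the script itself queries EOL inline):

```python
def classify(original, correct_name, synonyms, n_hits):
    """Return (candidate_names, status) for one taxon, mirroring the
    traffic-light scheme used by taxonomic_checker_list."""
    if n_hits == 0 or correct_name is None:
        return [original], "red"            # nothing found: user must fix or remove
    if correct_name == original:
        return [original], "green"          # already the accepted name
    # a different accepted name: put it first, keep any other synonyms after it
    names = [correct_name] + [s for s in synonyms if s != correct_name]
    status = "amber" if n_hits > 1 else "yellow"
    return names, status
```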
627=== modified file 'stk/scripts/create_colours_itol.py'
628--- stk/scripts/create_colours_itol.py 2014-12-09 10:58:48 +0000
629+++ stk/scripts/create_colours_itol.py 2017-01-12 09:27:31 +0000
630@@ -88,17 +88,8 @@
631 saturation=0.25
632 value=0.8
633
634- index = 3 # family
635- if (level == "Superfamily"):
636- index = 4
637- elif (level == "Infraorder"):
638- index = 5
639- elif (level == "Suborder"):
640- index = 6
641- elif (level == "Order"):
642- index = 7
643- elif (level == "Genus"):
644- index = 2
645+ index = stk.taxonomy_levels.index(level.lower())+1
646+ print index
647
648 if (tree):
649 tree_data = stk.import_tree(input_file)
650
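The create_colours_itol change above replaces a hard-coded if/elif ladder with a lookup into the shared stk.taxonomy_levels list. Assuming taxonomy_levels keeps the 16-level ordering the scripts previously hard-coded, index(level) + 1 reproduces the old constants (family 3, superfamily 4, infraorder 5, suborder 6, order 7, genus 2); a sketch:

```python
# Ordering as previously hard-coded in the scripts (species first).
taxonomy_levels = ['species', 'genus', 'family', 'superfamily', 'infraorder',
                   'suborder', 'order', 'superorder', 'subclass', 'class',
                   'subphylum', 'phylum', 'superphylum', 'infrakingdom',
                   'subkingdom', 'kingdom']

def column_index(level):
    """Column for a given taxonomic level, counting species as column 1."""
    return taxonomy_levels.index(level.lower()) + 1
```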
651=== modified file 'stk/scripts/create_taxonomy.py'
652--- stk/scripts/create_taxonomy.py 2014-03-13 18:45:05 +0000
653+++ stk/scripts/create_taxonomy.py 2017-01-12 09:27:31 +0000
654@@ -16,6 +16,8 @@
655 import supertree_toolkit as stk
656 import csv
657
658+taxonomy_levels = stk.taxonomy_levels
659+
660 def main():
661
662 # do stuff
663@@ -66,13 +68,6 @@
664 f.close()
665
666 taxonomy = {}
667- # What we get from EOL
668- current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
669- # And the extra ones from ITIS
670- extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
671- # all of them in order
672- taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
673-
674
675 for taxon in taxa:
676 taxon = taxon.replace("_"," ")
677@@ -180,99 +175,8 @@
678 continue
679
680
681- # Now create the CSV output
682- with open(output_file, 'w') as f:
683- writer = csv.writer(f)
684- writer.writerow(taxonomy_levels)
685- for t in taxonomy:
686- species = t
687- try:
688- genus = taxonomy[t]['genus']
689- except KeyError:
690- genus = "-"
691- try:
692- family = taxonomy[t]['family']
693- except KeyError:
694- family = "-"
695- try:
696- superfamily = taxonomy[t]['superfamily']
697- except KeyError:
698- superfamily = "-"
699- try:
700- infraorder = taxonomy[t]['infraorder']
701- except KeyError:
702- infraorder = "-"
703- try:
704- suborder = taxonomy[t]['suborder']
705- except KeyError:
706- suborder = "-"
707- try:
708- order = taxonomy[t]['order']
709- except KeyError:
710- order = "-"
711- try:
712- superorder = taxonomy[t]['superorder']
713- except KeyError:
714- superorder = "-"
715- try:
716- subclass = taxonomy[t]['subclass']
717- except KeyError:
718- subclass = "-"
719- try:
720- tclass = taxonomy[t]['class']
721- except KeyError:
722- tclass = "-"
723- try:
724- subphylum = taxonomy[t]['subphylum']
725- except KeyError:
726- subphylum = "-"
727- try:
728- phylum = taxonomy[t]['phylum']
729- except KeyError:
730- phylum = "-"
731- try:
732- superphylum = taxonomy[t]['superphylum']
733- except KeyError:
734- superphylum = "-"
735- try:
736- infrakingdom = taxonomy[t]['infrakingdom']
737- except:
738- infrakingdom = "-"
739- try:
740- subkingdom = taxonomy[t]['subkingdom']
741- except:
742- subkingdom = "-"
743- try:
744- kingdom = taxonomy[t]['kingdom']
745- except KeyError:
746- kingdom = "-"
747- try:
748- provider = taxonomy[t]['provider']
749- except KeyError:
750- provider = "-"
751-
752-
753- this_classification = [
754- species.encode('utf-8'),
755- genus.encode('utf-8'),
756- family.encode('utf-8'),
757- superfamily.encode('utf-8'),
758- infraorder.encode('utf-8'),
759- suborder.encode('utf-8'),
760- order.encode('utf-8'),
761- superorder.encode('utf-8'),
762- subclass.encode('utf-8'),
763- tclass.encode('utf-8'),
764- subphylum.encode('utf-8'),
765- phylum.encode('utf-8'),
766- superphylum.encode('utf-8'),
767- infrakingdom.encode('utf-8'),
768- subkingdom.encode('utf-8'),
769- kingdom.encode('utf-8'),
770- provider.encode('utf-8')]
771- writer.writerow(this_classification)
772-
773-
774+ stk.save_taxonomy(taxonomy, output_file)
775+
776 def _uniquify(l):
777 """
778 Make a list, l, contain only unique data
779
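The long per-level try/except block removed above is replaced by a single stk.save_taxonomy call. A helper like that can be written far more compactly with dict.get defaults instead of one try/except per rank. A Python 3 sketch with a trimmed level list (the real stk.save_taxonomy lives in supertree_toolkit.py and uses the full 16-level ordering):

```python
import csv
import io

# Trimmed level list for the example; the toolkit uses 16 levels.
taxonomy_levels = ['species', 'genus', 'family', 'order', 'class',
                   'phylum', 'kingdom']

def save_taxonomy(taxonomy, fileobj, levels=taxonomy_levels):
    """Write one CSV row per taxon, with '-' wherever a level is missing.
    Sketch only: mirrors what the removed try/except block produced."""
    writer = csv.writer(fileobj)
    writer.writerow(levels + ['provider'])
    for species, ranks in taxonomy.items():
        row = [species] + [ranks.get(level, '-') for level in levels[1:]]
        row.append(ranks.get('provider', '-'))
        writer.writerow(row)

buf = io.StringIO()
save_taxonomy({'Corvus corax': {'genus': 'Corvus', 'family': 'Corvidae'}}, buf)
```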
780=== modified file 'stk/scripts/fill_in_with_taxonomy.py'
781--- stk/scripts/fill_in_with_taxonomy.py 2016-12-14 16:22:12 +0000
782+++ stk/scripts/fill_in_with_taxonomy.py 2017-01-12 09:27:31 +0000
783@@ -23,21 +23,90 @@
784 from urllib import quote_plus
785 import simplejson as json
786 import argparse
787+import copy
788 import os
789 import sys
790 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
791 sys.path.insert(0, stk_path)
792 import supertree_toolkit as stk
793 import csv
794-
795-# What we get from EOL
796-current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
797-# And the extra ones from ITIS
798-extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
799-# all of them in order
800-taxonomy_levels = ['species','genus','subfamily','family','tribe','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
801-
802-def get_tree_taxa_taxonomy(taxon,wsdlObjectWoRMS):
803+from ete2 import Tree
804+import tempfile
805+import re
806+
807+taxonomy_levels = stk.taxonomy_levels
808+#tlevels = ['species','genus','family','superfamily','suborder','order','class','phylum','kingdom']
809+tlevels = ['species','genus', 'subfamily', 'family','infraorder','order','class','phylum','kingdom']
810+
811+def get_tree_taxa_taxonomy_eol(taxon):
812+
813+ taxonq = quote_plus(taxon)
814+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
815+ req = urllib2.Request(URL)
816+ opener = urllib2.build_opener()
817+ f = opener.open(req)
818+ data = json.load(f)
819+
820+ if data['results'] == []:
821+ return {}
822+ ID = str(data['results'][0]['id']) # take first hit
823+ # Now look for taxonomies
824+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
825+ req = urllib2.Request(URL)
826+ opener = urllib2.build_opener()
827+ f = opener.open(req)
828+ data = json.load(f)
829+ if len(data['taxonConcepts']) == 0:
830+ return {}
831+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
832+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
833+ # loop through and get preferred one if specified
834+ # now get taxonomy
835+ for db in data['taxonConcepts']:
836+ currentdb = db['nameAccordingTo'].lower()
837+ TID = str(db['identifier'])
838+ break
839+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
840+ req = urllib2.Request(URL)
841+ opener = urllib2.build_opener()
842+ f = opener.open(req)
843+ data = json.load(f)
844+ tax_array = {}
845+ tax_array['provider'] = currentdb
846+ for a in data['ancestors']:
847+ try:
848+ if a.has_key('taxonRank') :
849+ temp_level = a['taxonRank'].encode("ascii","ignore")
850+ if (temp_level in taxonomy_levels):
851+ # note the dump into ASCII
852+ temp_name = a['scientificName'].encode("ascii","ignore")
853+ temp_name = temp_name.split(" ")
854+ if (temp_level == 'species'):
855+ tax_array[temp_level] = "_".join(temp_name[0:2])
856+
857+ else:
858+ tax_array[temp_level] = temp_name[0]
859+ except KeyError as e:
860+ logging.exception("Key not found: taxonRank")
861+ continue
862+ try:
863+ # add taxonomy in to the taxonomy!
864+ # some issues here, so let's make sure it's OK
865+ temp_name = taxon.split(" ")
866+ if data.has_key('taxonRank') :
867+ if not data['taxonRank'].lower() == 'species':
868+ tax_array[data['taxonRank'].lower()] = temp_name[0]
869+ else:
870+ tax_array[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
871+ except KeyError as e:
872+ return tax_array
873+
874+ return tax_array
875+
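The EOL lookup above reduces to filtering a list of ancestor records by recognised rank. A minimal sketch of that filtering step, assuming an already-parsed JSON `ancestors` list; `ancestors_to_taxonomy` and the shortened `RANKS` list are illustrative, not part of the toolkit:

```python
# Illustrative subset of the ranks used above; 'tribe' etc. are left out.
RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def ancestors_to_taxonomy(ancestors, levels=RANKS):
    """Turn EOL-style ancestor records into a {rank: name} dictionary,
    joining species binomials with '_' as the script does."""
    tax = {}
    for a in ancestors:
        rank = a.get('taxonRank', '').lower()
        if rank not in levels:
            continue
        parts = a.get('scientificName', '').split()
        if not parts:
            continue
        # species keep the first two name parts; other ranks keep one
        tax[rank] = '_'.join(parts[:2]) if rank == 'species' else parts[0]
    return tax
```

Keeping this logic in a pure function makes it testable without hitting the EOL API.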
876+def get_tree_taxa_taxonomy_worms(taxon):
877+
878+ from SOAPpy import WSDL
879+ wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
880
881 taxon_data = wsdlObjectWoRMS.getAphiaRecords(taxon.replace('_',' '))
882 if taxon_data == None:
883@@ -51,6 +120,8 @@
884 classification = wsdlObjectWoRMS.getAphiaClassificationByID(taxon_id)
885 # construct array
886 tax_array = {}
887+ if (classification == ""):
888+ return {}
889 # classification is a nested dictionary, so we need to iterate down it
890 current_child = classification.child
891 while True:
892@@ -60,27 +131,252 @@
893 break
894 return tax_array
895
896-
897-
898-def get_taxonomy_worms(taxonomy, start_otu):
899+def get_tree_taxa_taxonomy_itis(taxon):
900+
901+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(taxon.replace('_',' ').strip())
902+ req = urllib2.Request(URL)
903+ opener = urllib2.build_opener()
904+ f = opener.open(req)
905+ string = unicode(f.read(),"ISO-8859-1")
906+ this_item = json.loads(string)
907+ if this_item['scientificNames'] == [None]: # not found
908+ return {}
909+ tsn = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
910+ # so call another function to get any valid names
911+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+tsn
912+ req = urllib2.Request(URL)
913+ opener = urllib2.build_opener()
914+ f = opener.open(req)
915+ string = unicode(f.read(),"ISO-8859-1")
916+ this_item = json.loads(string)
917+ if not this_item['acceptedNames'] == [None]:
918+ tsn = this_item['acceptedNames'][0]['acceptedTsn']
919+
920+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
921+ req = urllib2.Request(URL)
922+ opener = urllib2.build_opener()
923+ f = opener.open(req)
924+ string = unicode(f.read(),"ISO-8859-1")
925+ data = json.loads(string)
926+ # construct array
927+ this_taxonomy = {}
928+ for level in data['hierarchyList']:
929+ if level['rankName'].lower() in taxonomy_levels:
930+ # note the dump into ASCII
931+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
932+
933+ return this_taxonomy
934+
935+
936+
937+def get_taxonomy_eol(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
938+
939+ # this is the recursive function
940+ def get_children(taxonomy, ID, aphiaIDsDone):
941+
942+ # get data
943+ URL="http://eol.org/api/hierarchy_entries/1.0/"+str(ID)+".json?common_names=false&synonyms=false&cache_ttl="
944+ req = urllib2.Request(URL)
945+ opener = urllib2.build_opener()
946+ f = opener.open(req)
947+ string = unicode(f.read(),"ISO-8859-1")
948+ this_item = json.loads(string)
949+ if this_item == None:
950+ return taxonomy
951+ if this_item['taxonRank'].lower().strip() == 'species':
952+ # add data to taxonomy dictionary
953+ taxon = this_item['scientificName'].split()[0:2] # just the first two words
954+ taxon = " ".join(taxon[0:2])
955+ # NOTE following line means existing items are *not* updated
956+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
957+ this_taxonomy = {}
958+ for level in this_item['ancestors']:
959+ if level['taxonRank'].lower() in taxonomy_levels:
960+ # note the dump into ASCII
961+ this_taxonomy[level['taxonRank'].lower().encode("ascii","ignore")] = level['scientificName'].encode("ascii","ignore")
962+ # add species:
963+ this_taxonomy['species'] = taxon.replace(" ","_")
964+ if verbose:
965+ print "\tAdding "+taxon
966+ taxonomy[taxon] = this_taxonomy
967+ if not tmpfile == None:
968+ stk.save_taxonomy(taxonomy,tmpfile)
969+ return taxonomy
970+ else:
971+ return taxonomy
972+ all_children = []
973+ for level in this_item['children']:
974+ if not level == None:
975+ all_children.append(level['taxonID'])
976+
977+ if (len(all_children) == 0):
978+ return taxonomy
979+
980+ for child in all_children:
981+            if child in aphiaIDsDone: # we get stuck sometimes
982+ continue
983+ aphiaIDsDone.append(child)
984+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
985+ return taxonomy
986+
987+
988+ # main bit of the get_taxonomy_eol function
989+ taxonq = quote_plus(start_otu)
990+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
991+ req = urllib2.Request(URL)
992+ opener = urllib2.build_opener()
993+ f = opener.open(req)
994+ data = json.load(f)
995+ start_id = str(data['results'][0]['id']) # this is the page ID. We get the species ID next
996+ URL = "http://eol.org/api/pages/1.0/"+start_id+".json"
997+ req = urllib2.Request(URL)
998+ opener = urllib2.build_opener()
999+ f = opener.open(req)
1000+ data = json.load(f)
1001+ if len(data['taxonConcepts']) == 0:
1002+        print "Error finding your start taxon. Spelling?"
1003+ return None
1004+ start_id = data['taxonConcepts'][0]['identifier']
1005+ start_taxonomy_level = data['taxonConcepts'][0]['taxonRank'].lower()
1006+
1007+ aphiaIDsDone = []
1008+ if not skip:
1009+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1010+
1011+ return taxonomy, start_taxonomy_level
1012+
1013+
1014+
1015+def get_taxonomy_itis(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1016+ import simplejson as json
1017+
1018+ # this is the recursive function
1019+ def get_children(taxonomy, ID, aphiaIDsDone):
1020+
1021+ # get data
1022+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+ID
1023+ req = urllib2.Request(URL)
1024+ opener = urllib2.build_opener()
1025+ f = opener.open(req)
1026+ string = unicode(f.read(),"ISO-8859-1")
1027+ this_item = json.loads(string)
1028+ if this_item == None:
1029+ return taxonomy
1030+ if not this_item['usage']['taxonUsageRating'].lower() == 'valid':
1031+ print "rejecting " , this_item['scientificName']['combinedName']
1032+ return taxonomy
1033+ if this_item['taxRank']['rankName'].lower().strip() == 'species':
1034+ # add data to taxonomy dictionary
1035+ taxon = this_item['scientificName']['combinedName']
1036+ # NOTE following line means existing items are *not* updated
1037+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1038+ # get the taxonomy of this species
1039+ tsn = this_item["scientificName"]["tsn"]
1040+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+tsn
1041+ req = urllib2.Request(URL)
1042+ opener = urllib2.build_opener()
1043+ f = opener.open(req)
1044+ string = unicode(f.read(),"ISO-8859-1")
1045+ data = json.loads(string)
1046+ this_taxonomy = {}
1047+ for level in data['hierarchyList']:
1048+ if level['rankName'].lower() in taxonomy_levels:
1049+ # note the dump into ASCII
1050+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
1051+ if verbose:
1052+ print "\tAdding "+taxon
1053+ taxonomy[taxon] = this_taxonomy
1054+ if not tmpfile == None:
1055+ stk.save_taxonomy(taxonomy,tmpfile)
1056+ return taxonomy
1057+ else:
1058+ return taxonomy
1059+
1060+ all_children = []
1061+ URL="http://www.itis.gov/ITISWebService/jsonservice/getHierarchyDownFromTSN?tsn="+ID
1062+ req = urllib2.Request(URL)
1063+ opener = urllib2.build_opener()
1064+ f = opener.open(req)
1065+ string = unicode(f.read(),"ISO-8859-1")
1066+ this_item = json.loads(string)
1067+ if this_item == None:
1068+ return taxonomy
1069+
1070+ for level in this_item['hierarchyList']:
1071+ if not level == None:
1072+ all_children.append(level['tsn'])
1073+
1074+ if (len(all_children) == 0):
1075+ return taxonomy
1076+
1077+ for child in all_children:
1078+            if child in aphiaIDsDone: # we get stuck sometimes
1079+ continue
1080+ aphiaIDsDone.append(child)
1081+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
1082+
1083+ return taxonomy
1084+
1085+
1086+    # main bit of the get_taxonomy_itis function
1087+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(start_otu.strip())
1088+ req = urllib2.Request(URL)
1089+ opener = urllib2.build_opener()
1090+ f = opener.open(req)
1091+ string = unicode(f.read(),"ISO-8859-1")
1092+ this_item = json.loads(string)
1093+ start_id = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
1094+ # call it again via the ID this time to make sure we've got the right one.
1095+ # so call another function to get any valid names
1096+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+start_id
1097+ req = urllib2.Request(URL)
1098+ opener = urllib2.build_opener()
1099+ f = opener.open(req)
1100+ string = unicode(f.read(),"ISO-8859-1")
1101+ this_item = json.loads(string)
1102+ if not this_item['acceptedNames'] == [None]:
1103+ start_id = this_item['acceptedNames'][0]['acceptedTsn']
1104+
1105+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+start_id
1106+ req = urllib2.Request(URL)
1107+ opener = urllib2.build_opener()
1108+ f = opener.open(req)
1109+ string = unicode(f.read(),"ISO-8859-1")
1110+ this_item = json.loads(string)
1111+ start_taxonomy_level = this_item['taxRank']['rankName'].lower()
1112+
1113+ aphiaIDsDone = []
1114+ if not skip:
1115+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1116+
1117+ return taxonomy, start_taxonomy_level
1118+
1119+
1120+
1121+
1122+def get_taxonomy_worms(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1123 """ Gets and processes a taxon from the queue to get its taxonomy."""
1124 from SOAPpy import WSDL
1125
1126 wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1127
1128 # this is the recursive function
1129- def get_children(taxonomy, ID):
1130+ def get_children(taxonomy, ID, aphiaIDsDone):
1131
1132 # get data
1133 this_item = wsdlObjectWoRMS.getAphiaRecordByID(ID)
1134 if this_item == None:
1135 return taxonomy
1136+ if not this_item['status'].lower() == 'accepted':
1137+ print "rejecting " , this_item.valid_name
1138+ return taxonomy
1139 if this_item['rank'].lower() == 'species':
1140 # add data to taxonomy dictionary
1141- # get the taxonomy of this species
1142- classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1143- taxon = this_item.scientificname
1144+ taxon = this_item.valid_name
1145+ # NOTE following line means existing items are *not* updated
1146 if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1147+ # get the taxonomy of this species
1148+ classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1149 # construct array
1150 tax_array = {}
1151 # classification is a nested dictionary, so we need to iterate down it
1152@@ -92,16 +388,36 @@
1153 current_child = current_child.child
1154 if current_child == '': # empty one is a string for some reason
1155 break
1156- taxonomy[this_item.scientificname] = tax_array
1157+ if verbose:
1158+                    print "\tAdding "+this_item.valid_name
1159+ taxonomy[this_item.valid_name] = tax_array
1160+ if not tmpfile == None:
1161+ stk.save_taxonomy(taxonomy,tmpfile)
1162 return taxonomy
1163 else:
1164 return taxonomy
1165
1166- children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, 1, False)
1167-
1168- for child in children:
1169- taxonomy = get_children(taxonomy, child['valid_AphiaID'])
1170-
1171+ all_children = []
1172+ start = 1
1173+ while True:
1174+ children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, start, False)
1175+ if (children is None or children == None):
1176+ break
1177+ if (len(children) < 50):
1178+ all_children.extend(children)
1179+ break
1180+ all_children.extend(children)
1181+ start += 50
1182+
1183+ if (len(all_children) == 0):
1184+ return taxonomy
1185+
1186+ for child in all_children:
1187+            if child['valid_AphiaID'] in aphiaIDsDone: # we get stuck sometimes
1188+ continue
1189+ aphiaIDsDone.append(child['valid_AphiaID'])
1190+ taxonomy = get_children(taxonomy, child['valid_AphiaID'], aphiaIDsDone)
1191+
1192 return taxonomy
1193
1194
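The paging loop added to the WoRMS `get_children` (fetch batches of up to 50 children until a short batch signals the end) is a general pattern. A sketch with the API call stubbed out as an injected `get_page` callable; the names are hypothetical, not the WoRMS SOAP API:

```python
def fetch_all_children(get_page, record_id, page_size=50):
    """Collect every child record from a paged API.

    `get_page(record_id, start)` stands in for a call such as
    getAphiaChildrenByID: it returns a list of at most `page_size` records
    starting at 1-based offset `start`, or an empty list/None when exhausted.
    """
    all_children = []
    start = 1
    while True:
        page = get_page(record_id, start)
        if not page:
            break
        all_children.extend(page)
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        start += page_size
    return all_children
```

Injecting the fetcher keeps the loop testable offline and reusable across the ITIS/EOL branches.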
1195@@ -111,12 +427,17 @@
1196 start_id = start_taxa[0]['valid_AphiaID'] # there might be records that aren't valid - they point to the valid one though
1197 # call it again via the ID this time to make sure we've got the right one.
1198 start_taxa = wsdlObjectWoRMS.getAphiaRecordByID(start_id)
1199- start_taxonomy_level = start_taxa['rank'].lower()
1200- except HTTPError:
1201- print "Error"
1202+ if start_taxa == None:
1203+ start_taxonomy_level = 'infraorder'
1204+ else:
1205+ start_taxonomy_level = start_taxa['rank'].lower()
1206+ except urllib2.HTTPError:
1207+ print "Error finding start_otu taxonomic level. Do you have an internet connection?"
1208 sys.exit(-1)
1209
1210- taxonomy = get_children(taxonomy,start_id)
1211+ aphiaIDsDone = []
1212+ if not skip:
1213+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1214
1215 return taxonomy, start_taxonomy_level
1216
1217@@ -136,9 +457,16 @@
1218 default=False
1219 )
1220 parser.add_argument(
1221+ '-s',
1222+ '--skip',
1223+ action='store_true',
1224+ help="Skip online checking, just use taxonomy files",
1225+ default=False
1226+ )
1227+ parser.add_argument(
1228 '--pref_db',
1229 help="Taxonomy database to use. Default is Species 2000/ITIS",
1230- choices=['itis', 'worms', 'ncbi'],
1231+ choices=['itis', 'worms', 'ncbi', 'eol'],
1232 default = 'worms'
1233 )
1234 parser.add_argument(
1235@@ -178,58 +506,250 @@
1236 top_level = args.top_level[0]
1237 save_taxonomy_file = args.save_taxonomy
1238 tree_taxonomy = args.tree_taxonomy
1239+ taxonomy = args.taxonomy_from_file
1240 pref_db = args.pref_db
1241+ skip = args.skip
1242 if (save_taxonomy_file == None):
1243 save_taxonomy = False
1244 else:
1245 save_taxonomy = True
1246+ load_tree_taxonomy = False
1247+ if (not tree_taxonomy == None):
1248+ tree_taxonomy_file = tree_taxonomy
1249+ load_tree_taxonomy = True
1250+ if skip:
1251+ if taxonomy == None:
1252+ print "Error: If you're skipping checking online, then you need to supply taxonomy files"
1253+ return
1254
1255 # grab taxa in tree
1256 tree = stk.import_tree(input_file)
1257 taxa_list = stk._getTaxaFromNewick(tree)
1258-
1259- taxonomy = {}
1260-
1261- # we're going to add the taxa in the tree to the taxonomy, to stop them
1262+
1263+ if verbose:
1264+ print "Taxa count for input tree: ", len(taxa_list)
1265+
1266+ # load in any taxonomy files - we still call the APIs as a) they may have updated data and
1267+ # b) the user may have missed some first time round (i.e. expanded the tree and not redone
1268+    # the taxonomy)
1269+ if (taxonomy == None):
1270+ taxonomy = {}
1271+ else:
1272+ taxonomy = stk.load_taxonomy(taxonomy)
1273+ tree_taxonomy = {}
1274+ # this might also have tree_taxonomy in too - let's check this
1275+ for t in taxa_list:
1276+ if t in taxonomy:
1277+ tree_taxonomy[t] = taxonomy[t]
1278+ elif t.replace("_"," ") in taxonomy:
1279+ tree_taxonomy[t] = taxonomy[t.replace("_"," ")]
1280+
1281+ if (load_tree_taxonomy): # overwrite the good work above...
1282+ tree_taxonomy = stk.load_taxonomy(tree_taxonomy_file)
1283+ if (tree_taxonomy == None):
1284+ tree_taxonomy = {}
1285+
1286+    # we're going to add the taxa in the tree to the main taxonomy, to stop them
1287 # being fetched in first place. We delete them later
1288+ # If you've loaded a taxonomy created by this script, this overwrites the tree taxa in the main taxonomy dict
1289+ # Don't worry, we put them back in before saving again!
1290 for taxon in taxa_list:
1291 taxon = taxon.replace('_',' ')
1292- taxonomy[taxon] = []
1293-
1294+ taxonomy[taxon] = {}
1295
1296 if (pref_db == 'itis'):
1297 # get taxonomy info from itis
1298- print "Sorry, ITIS is not implemented yet"
1299- pass
1300+ if (verbose):
1301+ print "Getting data from ITIS"
1302+ if (verbose):
1303+ print "Dealing with taxa in tree"
1304+ for t in taxa_list:
1305+ if verbose:
1306+ print "\t"+t
1307+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1308+ # we don't have data - NOTE we assume things are *not* updated here if we do
1309+ tree_taxonomy[t] = get_tree_taxa_taxonomy_itis(t)
1310+
1311+ if save_taxonomy:
1312+ if (verbose):
1313+ print "Saving tree taxonomy"
1314+ # note -temporary save as we overwrite this file later.
1315+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1316+
1317+        # get taxonomy from ITIS
1318+ if verbose:
1319+ print "Now dealing with all other taxa - this might take a while..."
1320+ # create a temp file so we can checkpoint and continue
1321+ tmpf, tmpfile = tempfile.mkstemp()
1322+
1323+ if os.path.isfile('.fit_lock'):
1324+ f = open('.fit_lock','r')
1325+ tf = f.read()
1326+ f.close()
1327+ if os.path.isfile(tf.strip()):
1328+ taxonomy = stk.load_taxonomy(tf.strip())
1329+ os.remove('.fit_lock')
1330+
1331+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1332+ # where we left off.
1333+ with open(".fit_lock", 'w') as f:
1334+ f.write(tmpfile)
1335+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1336+ taxonomy, start_level = get_taxonomy_itis(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1337+
1338+ # clean up
1339+ os.close(tmpf)
1340+ os.remove('.fit_lock')
1341+ try:
1342+            os.remove(tmpfile)
1343+ except OSError:
1344+ pass
1345 elif (pref_db == 'worms'):
1346+ if (verbose):
1347+ print "Getting data from WoRMS"
1348 # get tree taxonomy from worms
1349- if (tree_taxonomy == None):
1350- tree_taxonomy = {}
1351- for t in taxa_list:
1352- from SOAPpy import WSDL
1353- wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1354- tree_taxonomy[t] = get_tree_taxa_taxonomy(t,wsdlObjectWoRMS)
1355- else:
1356- tree_taxonomy = stk.load_taxonomy(tree_taxonomy)
1357+ if (verbose):
1358+ print "Dealing with taxa in tree"
1359+
1360+ for t in taxa_list:
1361+ if verbose:
1362+ print "\t"+t
1363+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1364+ # we don't have data - NOTE we assume things are *not* updated here if we do
1365+ tree_taxonomy[t] = get_tree_taxa_taxonomy_worms(t)
1366+
1367+ if save_taxonomy:
1368+ if (verbose):
1369+ print "Saving tree taxonomy"
1370+ # note -temporary save as we overwrite this file later.
1371+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1372+
1373 # get taxonomy from worms
1374- taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level)
1375+ if verbose:
1376+ print "Now dealing with all other taxa - this might take a while..."
1377+ # create a temp file so we can checkpoint and continue
1378+ tmpf, tmpfile = tempfile.mkstemp()
1379+
1380+ if os.path.isfile('.fit_lock'):
1381+ f = open('.fit_lock','r')
1382+ tf = f.read()
1383+ f.close()
1384+ if os.path.isfile(tf.strip()):
1385+ taxonomy = stk.load_taxonomy(tf.strip())
1386+ os.remove('.fit_lock')
1387+
1388+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1389+ # where we left off.
1390+ with open(".fit_lock", 'w') as f:
1391+ f.write(tmpfile)
1392+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1393+ taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1394+
1395+ # clean up
1396+ os.close(tmpf)
1397+ os.remove('.fit_lock')
1398+ try:
1399+            os.remove(tmpfile)
1400+ except OSError:
1401+ pass
1402
1403 elif (pref_db == 'ncbi'):
1404 # get taxonomy from ncbi
1405 print "Sorry, NCBI is not implemented yet"
1406 pass
1407+ elif (pref_db == 'eol'):
1408+ if (verbose):
1409+ print "Getting data from EOL"
1410+        # get tree taxonomy from EOL
1411+ if (verbose):
1412+ print "Dealing with taxa in tree"
1413+ for t in taxa_list:
1414+ if verbose:
1415+ print "\t"+t
1416+ try:
1417+ tree_taxonomy[t]
1418+ pass # we have data - NOTE we assume things are *not* updated here...
1419+ except KeyError:
1420+ try:
1421+ tree_taxonomy[t.replace('_',' ')]
1422+ except KeyError:
1423+ tree_taxonomy[t] = get_tree_taxa_taxonomy_eol(t)
1424+
1425+ if save_taxonomy:
1426+ if (verbose):
1427+ print "Saving tree taxonomy"
1428+ # note -temporary save as we overwrite this file later.
1429+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1430+
1431+        # get taxonomy from EOL
1432+ if verbose:
1433+ print "Now dealing with all other taxa - this might take a while..."
1434+ # create a temp file so we can checkpoint and continue
1435+ tmpf, tmpfile = tempfile.mkstemp()
1436+
1437+ if os.path.isfile('.fit_lock'):
1438+ f = open('.fit_lock','r')
1439+ tf = f.read()
1440+ f.close()
1441+ if os.path.isfile(tf.strip()):
1442+ taxonomy = stk.load_taxonomy(tf.strip())
1443+ os.remove('.fit_lock')
1444+
1445+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1446+ # where we left off.
1447+ with open(".fit_lock", 'w') as f:
1448+ f.write(tmpfile)
1449+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1450+ taxonomy, start_level = get_taxonomy_eol(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1451+
1452+ # clean up
1453+ os.close(tmpf)
1454+ os.remove('.fit_lock')
1455+ try:
1456+            os.remove(tmpfile)
1457+ except OSError:
1458+ pass
1459 else:
1460- print "ERROR: Didn't understand you database choice"
1461+ print "ERROR: Didn't understand your database choice"
1462 sys.exit(-1)
1463
1464 # clean up taxonomy, deleting the ones already in the tree
1465 for taxon in taxa_list:
1466- taxon = taxon.replace('_',' ')
1467- del taxonomy[taxon]
1468+ taxon = taxon.replace('_',' ')
1469+ try:
1470+ del taxonomy[taxon]
1471+ except KeyError:
1472+            pass # it's not there, so nothing to delete
1473+
1474+ # We now have 2 taxonomies:
1475+ # - for taxa in the tree
1476+ # - for all other taxa in the clade of interest
1477+
1478+ if save_taxonomy:
1479+ tot_taxonomy = taxonomy.copy()
1480+ tot_taxonomy.update(tree_taxonomy)
1481+ stk.save_taxonomy(tot_taxonomy,save_taxonomy_file)
1482+
1483+
1484+ orig_taxa_list = taxa_list
1485+
1486+ remove_higher_level = [] # for storing the higher level taxa in the original tree that need deleting
1487+ generic = []
1488+    # find all the genus-level (generic) taxa and build an internal subs file
1489+ for t in taxa_list:
1490+ t = t.replace(" ","_")
1491+ if t.find("_") == -1:
1492+ # no underscore, so just generic
1493+ generic.append(t)
1494
1495 # step up the taxonomy levels from genus, adding taxa to the correct node
1496 # as a polytomy
1497- for level in taxonomy_levels[1::]: # skip species....
1498+ start_level = start_level.encode('utf-8').strip()
1499+ if verbose:
1500+ print "I think your start OTU is at: ", start_level
1501+ for level in tlevels[1::]: # skip species....
1502+ if verbose:
1503+ print "Dealing with ",level
1504 new_taxa = []
1505 for t in taxonomy:
1506 # skip odd ones that should be in there
1507@@ -239,135 +759,61 @@
1508 except KeyError:
1509 continue # don't have this info
1510 new_taxa = _uniquify(new_taxa)
1511+
1512 for nt in new_taxa:
1513- taxa_to_add = []
1514+ taxa_to_add = {}
1515 taxa_in_clade = []
1516 for t in taxonomy:
1517 if start_level in taxonomy[t] and taxonomy[t][start_level] == top_level:
1518 try:
1519- if taxonomy[t][level] == nt:
1520- taxa_to_add.append(t.replace(' ','_'))
1521+ if taxonomy[t][level] == nt and not t in taxa_list:
1522+ taxa_to_add[t] = taxonomy[t]
1523 except KeyError:
1524 continue
1525+
1526 # add to tree
1527 for t in taxa_list:
1528 if level in tree_taxonomy[t] and tree_taxonomy[t][level] == nt:
1529 taxa_in_clade.append(t)
1530- if len(taxa_in_clade) > 0:
1531- tree = add_taxa(tree, taxa_to_add, taxa_in_clade)
1532- for t in taxa_to_add: # clean up taxonomy
1533- del taxonomy[t.replace('_',' ')]
1534-
1535-
1536+ if t in generic:
1537+ # we are appending taxa to this higher taxon, so we need to remove it
1538+ remove_higher_level.append(t)
1539+
1540+
1541+ if len(taxa_in_clade) > 0 and len(taxa_to_add) > 0:
1542+ tree = add_taxa(tree, taxa_to_add, taxa_in_clade,level)
1543+ try:
1544+ taxa_list = stk._getTaxaFromNewick(tree)
1545+ except stk.TreeParseError as e:
1546+ print taxa_to_add, taxa_in_clade, level, tree
1547+ print e.msg
1548+ return
1549+
1550+ for t in taxa_to_add:
1551+ tree_taxonomy[t.replace(' ','_')] = taxa_to_add[t]
1552+ try:
1553+ del taxonomy[t.replace('_',' ')]
1554+ except KeyError:
1555+ # It might have _ or it might not...
1556+ del taxonomy[t]
1557+
1558+
1559+    # remove singleton nodes
1560+ tree = stk._collapse_nodes(tree)
1561+ tree = stk._collapse_nodes(tree)
1562+ tree = stk._collapse_nodes(tree)
1563+
1564+ tree = stk._sub_taxa_in_tree(tree, remove_higher_level)
1565 trees = {}
1566 trees['tree_1'] = tree
1567 output = stk._amalgamate_trees(trees,format='nexus')
1568 f = open(output_file, "w")
1569 f.write(output)
1570 f.close()
1571-
1572- if not save_taxonomy_file == None:
1573- with open(save_taxonomy_file, 'w') as f:
1574- writer = csv.writer(f)
1575- headers = []
1576- headers.append("OTU")
1577- headers.extend(taxonomy_levels)
1578- headers.append("Data source")
1579- writer.writerow(headers)
1580- for t in taxonomy:
1581- otu = t
1582- try:
1583- species = taxonomy[t]['species']
1584- except KeyError:
1585- species = "-"
1586- try:
1587- genus = taxonomy[t]['genus']
1588- except KeyError:
1589- genus = "-"
1590- try:
1591- family = taxonomy[t]['family']
1592- except KeyError:
1593- family = "-"
1594- try:
1595- superfamily = taxonomy[t]['superfamily']
1596- except KeyError:
1597- superfamily = "-"
1598- try:
1599- infraorder = taxonomy[t]['infraorder']
1600- except KeyError:
1601- infraorder = "-"
1602- try:
1603- suborder = taxonomy[t]['suborder']
1604- except KeyError:
1605- suborder = "-"
1606- try:
1607- order = taxonomy[t]['order']
1608- except KeyError:
1609- order = "-"
1610- try:
1611- superorder = taxonomy[t]['superorder']
1612- except KeyError:
1613- superorder = "-"
1614- try:
1615- subclass = taxonomy[t]['subclass']
1616- except KeyError:
1617- subclass = "-"
1618- try:
1619- tclass = taxonomy[t]['class']
1620- except KeyError:
1621- tclass = "-"
1622- try:
1623- subphylum = taxonomy[t]['subphylum']
1624- except KeyError:
1625- subphylum = "-"
1626- try:
1627- phylum = taxonomy[t]['phylum']
1628- except KeyError:
1629- phylum = "-"
1630- try:
1631- superphylum = taxonomy[t]['superphylum']
1632- except KeyError:
1633- superphylum = "-"
1634- try:
1635- infrakingdom = taxonomy[t]['infrakingdom']
1636- except:
1637- infrakingdom = "-"
1638- try:
1639- subkingdom = taxonomy[t]['subkingdom']
1640- except:
1641- subkingdom = "-"
1642- try:
1643- kingdom = taxonomy[t]['kingdom']
1644- except KeyError:
1645- kingdom = "-"
1646- try:
1647- provider = taxonomy[t]['provider']
1648- except KeyError:
1649- provider = "-"
1650-
1651- if (isinstance(species, list)):
1652- species = " ".join(species)
1653- this_classification = [
1654- otu.encode('utf-8'),
1655- species.encode('utf-8'),
1656- genus.encode('utf-8'),
1657- family.encode('utf-8'),
1658- superfamily.encode('utf-8'),
1659- infraorder.encode('utf-8'),
1660- suborder.encode('utf-8'),
1661- order.encode('utf-8'),
1662- superorder.encode('utf-8'),
1663- subclass.encode('utf-8'),
1664- tclass.encode('utf-8'),
1665- subphylum.encode('utf-8'),
1666- phylum.encode('utf-8'),
1667- superphylum.encode('utf-8'),
1668- infrakingdom.encode('utf-8'),
1669- subkingdom.encode('utf-8'),
1670- kingdom.encode('utf-8'),
1671- provider.encode('utf-8')]
1672- writer.writerow(this_classification)
1673-
1674+ taxa_list = stk._getTaxaFromNewick(tree)
1675+
1676+ print "Final taxa count:", len(taxa_list)
1677+
1678
1679 def _uniquify(l):
1680 """
1681@@ -379,28 +825,119 @@
1682
1683 return keys.keys()
1684
1685-def add_taxa(tree, new_taxa, taxa_in_clade):
1686+def add_taxa(tree, new_taxa, taxa_in_clade, level):
1687
1688 # create new tree of the new taxa
1689- #tree_string = "(" + ",".join(new_taxa) + ");"
1690- #additionalTaxa = stk._parse_tree(tree_string)
1691+ additionalTaxa = tree_from_taxonomy(level,new_taxa)
1692
1693 # find mrca parent
1694 treeobj = stk._parse_tree(tree)
1695 mrca = stk.get_mrca(tree,taxa_in_clade)
1696- mrca_parent = treeobj.node(mrca).parent
1697-
1698- # insert a node into the tree between the MRCA and it's parent (p4.addNodeBetweenNodes)
1699- newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1700-
1701- # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1702- #treeobj.addSubTree(newNode, additionalTaxa)
1703- for t in new_taxa:
1704- treeobj.addSibLeaf(newNode,t)
1705-
1706- # return new tree
1707+ if (mrca == 0):
1708+ # we need to make a new tree! The additional taxa are being placed at the root of the tree
1709+ t = Tree()
1710+ A = t.add_child()
1711+ B = t.add_child()
1712+ t1 = Tree(additionalTaxa)
1713+ t2 = Tree(tree)
1714+ A.add_child(t1)
1715+ B.add_child(t2)
1716+ return t.write(format=9)
1717+ else:
1718+ mrca = treeobj.nodes[mrca]
1719+ additionalTaxa = stk._parse_tree(additionalTaxa)
1720+
1721+ if len(taxa_in_clade) == 1:
1722+ taxon = treeobj.node(taxa_in_clade[0])
1723+ mrca = treeobj.addNodeBetweenNodes(taxon,mrca)
1724+
1725+
1726+        # insert a node into the tree between the MRCA and its parent (p4.addNodeBetweenNodes)
1727+ # newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1728+
1729+ # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1730+ treeobj.addSubTree(mrca, additionalTaxa, ignoreRootAssert=True)
1731+
1732 return treeobj.writeNewick(fName=None,toString=True).strip()
1733
1734+
1735+
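When `get_mrca` returns the root (`mrca == 0`), `add_taxa` above builds a new root with ete2 and hangs both trees beneath it. At the Newick-string level the same join can be sketched without ete2; `join_at_root` is a hypothetical helper that ignores branch lengths, support values and root labels:

```python
def join_at_root(newick_a, newick_b):
    """Join two rooted Newick trees as the two children of a new root.

    String-level simplification: assumes plain topologies with no branch
    lengths or internal labels.
    """
    a = newick_a.strip().rstrip(';')
    b = newick_b.strip().rstrip(';')
    return '(%s,%s);' % (a, b)
```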
1736+def tree_from_taxonomy(top_level, tree_taxonomy):
1737+
1738+ start_level = taxonomy_levels.index(top_level)
1739+ new_taxa = tree_taxonomy.keys()
1740+
1741+ tl_types = []
1742+ for tt in tree_taxonomy:
1743+ tl_types.append(tree_taxonomy[tt][top_level])
1744+
1745+ tl_types = _uniquify(tl_types)
1746+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
1747+
1748+ t = Tree()
1749+ nodes = {}
1750+ nodes[top_level] = []
1751+ for tl in tl_types:
1752+ n = t.add_child(name=tl)
1753+ nodes[top_level].append({tl:n})
1754+
1755+ for l in levels_to_worry_about[-2::-1]:
1756+ names = []
1757+ nodes[l] = []
1758+ ci = levels_to_worry_about.index(l)
1759+ for tt in tree_taxonomy:
1760+ try:
1761+ names.append(tree_taxonomy[tt][l])
1762+ except KeyError:
1763+ pass
1764+ names = _uniquify(names)
1765+ for n in names:
1766+ # find my parent
1767+ parent = None
1768+ for tt in tree_taxonomy:
1769+ try:
1770+ if tree_taxonomy[tt][l] == n:
1771+ try:
1772+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
1773+ level = ci+1
1774+ except KeyError:
1775+ try:
1776+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+2]]
1777+ level = ci+2
1778+ except KeyError:
1779+ try:
1780+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+3]]
1781+ level = ci+3
1782+ except KeyError:
1783+ print "ERROR: tried to find some taxonomic info for "+tt+" from the tree_taxonomy file/downloaded data and I went three levels up, but failed to find any. Looked at:\n"
1784+ print "\t"+levels_to_worry_about[ci+1]
1785+ print "\t"+levels_to_worry_about[ci+2]
1786+ print "\t"+levels_to_worry_about[ci+3]
1787+ print "This is the taxonomy info I have for "+tt
1788+ print tree_taxonomy[tt]
1789+ sys.exit(1)
1790+
1791+ k = []
1792+ for nd in nodes[levels_to_worry_about[level]]:
1793+ k.extend(nd.keys())
1794+ i = 0
1795+ for kk in k:
1796+ if kk == parent:
1797+ break
1798+ i += 1
1799+ parent_id = i
1800+ break
1801+ except KeyError:
1802+ pass # no data at this level for this beastie
1803+ # find out where to attach it
1804+ node_id = nodes[levels_to_worry_about[level]][parent_id][parent]
1805+ nd = node_id.add_child(name=n.replace(" ","_"))
1806+ nodes[l].append({n:nd})
1807+
1808+ tree = t.write(format=9)
1809+
1810+ return tree
1811+
1812 if __name__ == "__main__":
1813 main()
1814
1815
1816=== modified file 'stk/scripts/plot_character_taxa_matrix.py'
1817--- stk/scripts/plot_character_taxa_matrix.py 2014-12-10 08:55:43 +0000
1818+++ stk/scripts/plot_character_taxa_matrix.py 2017-01-12 09:27:31 +0000
1819@@ -42,6 +42,18 @@
1820 default=False
1821 )
1822 parser.add_argument(
1823+ '-t',
1824+ '--taxonomy',
1825+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1826+ )
1827+ parser.add_argument(
1828+ '--level',
1829+ choices=['family','superfamily','infraorder','suborder','order'],
1830+ default='family',
1831+ help="""What level to group the taxonomy at. Default is family.
1832+ Note that data for a particular level may be missing from the taxonomy."""
1833+ )
1834+ parser.add_argument(
1835 'input_file',
1836 metavar='input_file',
1837 nargs=1,
1838@@ -59,14 +71,58 @@
1839 verbose = args.verbose
1840 input_file = args.input_file[0]
1841 output_file = args.output_file[0]
1842+ taxonomy = args.taxonomy
1843+ level = args.level
1844
1845 XML = stk.load_phyml(input_file)
1846+ if not taxonomy == None:
1847+ taxonomy = stk.load_taxonomy(taxonomy)
1848+
1849 all_taxa = stk.get_all_taxa(XML)
1850 all_chars_d = stk.get_all_characters(XML)
1851 all_chars = []
1852 for c in all_chars_d:
1853 all_chars.extend(all_chars_d[c])
1854
1855+ if not taxonomy == None:
1856+ tax_data = {}
1857+ new_all_taxa = []
1858+ for t in all_taxa:
1859+ taxon = t.replace("_"," ")
1860+ try:
1861+ if taxonomy[taxon][level] == "":
1862+ # skip this
1863+ continue
1864+ tax_data[t] = taxonomy[taxon][level]
1865+ except KeyError:
1866+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1867+ tax_data[t] = 'zzzzz' # it's at the end...
1868+
1869+ from sets import Set
1870+ unique = set(tax_data.values())
1871+ unique = list(unique)
1872+ unique.sort()
1873+ print "Groups are:"
1874+ print unique
1875+ counts = []
1876+ for u in unique:
1877+ count = 0
1878+ for t in tax_data:
1879+ if tax_data[t] == u:
1880+ count += 1
1881+ new_all_taxa.append(t)
1882+ counts.append(count)
1883+
1884+ all_taxa = new_all_taxa
1885+ # cumulate counts
1886+ count_cumulate = []
1887+ count_cumulate.append(counts[0])
1888+ for c in counts[1::]:
1889+ count_cumulate.append(c+count_cumulate[-1])
1890+
1891+ print count_cumulate
1892+
1893+
1894 taxa_character_matrix = {}
1895 for t in all_taxa:
1896 taxa_character_matrix[t] = []
1897@@ -77,7 +133,8 @@
1898 taxa = stk.get_taxa_from_tree(XML,t, sort=True)
1899 for taxon in taxa:
1900 taxon = taxon.replace(" ","_")
1901- taxa_character_matrix[taxon].extend(chars)
1902+ if taxon in all_taxa:
1903+ taxa_character_matrix[taxon].extend(chars)
1904
1905 for t in taxa_character_matrix:
1906 array = taxa_character_matrix[t]
1907@@ -92,6 +149,31 @@
1908 x.append(i)
1909 y.append(j)
1910
1911+
1912+ i = 0
1913+ for j in all_chars:
1914+ # do a substitution of character names to tidy things up
1915+ if j.lower().startswith('mitochondrial carrier; adenine nucleotide translocator'):
1916+ j = "ANT"
1917+ if j.lower().startswith('mitochondrially encoded 12s'):
1918+ j = '12S'
1919+ if j.lower().startswith('complete mitochondrial genome'):
1920+ j = 'Mitogenome'
1921+ if j.lower().startswith('mtdna'):
1922+ j = "mtDNA restriction sites"
1923+ if j.lower().startswith('h3 histone'):
1924+ j = 'H3'
1925+ if j.lower().startswith('mitochondrially encoded cytochrome'):
1926+ j = 'COI'
1927+ if j.lower().startswith('rna, 28s'):
1928+ j = '28S'
1929+ if j.lower().startswith('rna, 18s'):
1930+ j = '18S'
1931+ if j.lower().startswith('mitochondrially encoded 16s'):
1932+ j = '16S'
1933+ all_chars[i] = j
1934+ i += 1
1935+
1936 fig=figure(figsize=(22,17),dpi=90)
1937 fig.subplots_adjust(left=0.3)
1938 ax = fig.add_subplot(1,1,1)
1939
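The taxonomy-grouping block added to this plotting script (and mirrored in plot_tree_taxa_matrix.py below) follows one recipe: bucket each taxon by its value at the chosen level, push taxa missing from the taxonomy to the end with a 'zzzzz' sentinel, then turn per-group counts into cumulative boundaries for the axis. A minimal Python 3 sketch of that logic; the taxon and family names here are made up for illustration:

```python
# Hypothetical taxon -> group mapping at the chosen level; 'zzzzz' marks
# a taxon that could not be found in the taxonomy, so it sorts last.
tax_data = {
    "Aus_bus": "Fam_A",
    "Cus_dus": "Fam_B",
    "Eus_fus": "Fam_A",
    "Gus_hus": "zzzzz",
}

unique = sorted(set(tax_data.values()))  # one entry per group, sorted
new_all_taxa = []
counts = []
for u in unique:
    members = [t for t in tax_data if tax_data[t] == u]
    new_all_taxa.extend(members)   # taxa reordered group by group
    counts.append(len(members))

# cumulative boundaries: where each group ends along the axis
count_cumulate = []
for c in counts:
    count_cumulate.append(c + (count_cumulate[-1] if count_cumulate else 0))

print(count_cumulate)  # [2, 3, 4]
```

The cumulative totals are what the scripts print so the user can place group separators on the matrix axis; the last entry always equals the number of taxa kept.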
1940=== modified file 'stk/scripts/plot_tree_taxa_matrix.py'
1941--- stk/scripts/plot_tree_taxa_matrix.py 2014-12-10 08:55:43 +0000
1942+++ stk/scripts/plot_tree_taxa_matrix.py 2017-01-12 09:27:31 +0000
1943@@ -43,6 +43,18 @@
1944 default=False
1945 )
1946 parser.add_argument(
1947+ '-t',
1948+ '--taxonomy',
1949+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1950+ )
1951+ parser.add_argument(
1952+ '--level',
1953+ choices=['family','superfamily','infraorder','suborder','order'],
1954+ default='family',
1955+ help="""What level to group the taxonomy at. Default is family.
1956+ Note that data for a particular level may be missing from the taxonomy."""
1957+ )
1958+ parser.add_argument(
1959 'input_file',
1960 metavar='input_file',
1961 nargs=1,
1962@@ -60,13 +72,57 @@
1963 verbose = args.verbose
1964 input_file = args.input_file[0]
1965 output_file = args.output_file[0]
1966+ taxonomy = args.taxonomy
1967+ level = args.level
1968
1969 XML = stk.load_phyml(input_file)
1970+ if not taxonomy == None:
1971+ taxonomy = stk.load_taxonomy(taxonomy)
1972+
1973 all_taxa = stk.get_all_taxa(XML)
1974
1975 taxa_tree_matrix = {}
1976 for t in all_taxa:
1977 taxa_tree_matrix[t] = []
1978+
1979+ if not taxonomy == None:
1980+ tax_data = {}
1981+ new_all_taxa = []
1982+ for t in all_taxa:
1983+ taxon = t.replace("_"," ")
1984+ try:
1985+ if taxonomy[taxon][level] == "":
1986+ # skip this
1987+ continue
1988+ tax_data[t] = taxonomy[taxon][level]
1989+ except KeyError:
1990+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1991+ tax_data[t] = 'zzzzz' # it's at the end...
1992+
1993+ from sets import Set
1994+ unique = set(tax_data.values())
1995+ unique = list(unique)
1996+ unique.sort()
1997+ print "Groups are:"
1998+ print unique
1999+ counts = []
2000+ for u in unique:
2001+ count = 0
2002+ for t in tax_data:
2003+ if tax_data[t] == u:
2004+ count += 1
2005+ new_all_taxa.append(t)
2006+ counts.append(count)
2007+
2008+ all_taxa = new_all_taxa
2009+ # cumulate counts
2010+ count_cumulate = []
2011+ count_cumulate.append(counts[0])
2012+ for c in counts[1::]:
2013+ count_cumulate.append(c+count_cumulate[-1])
2014+
2015+ print count_cumulate
2016+
2017
2018 trees = stk.obtain_trees(XML)
2019 i = 0
2020
2021=== modified file 'stk/scripts/remove_poorly_constrained_taxa.py'
2022--- stk/scripts/remove_poorly_constrained_taxa.py 2014-04-18 11:57:14 +0000
2023+++ stk/scripts/remove_poorly_constrained_taxa.py 2017-01-12 09:27:31 +0000
2024@@ -12,8 +12,8 @@
2025
2026 # do stuff
2027 parser = argparse.ArgumentParser(
2028- prog="convert tree from specific to generic",
2029- description="""Converts a tree at specific level to generic level""",
2030+ prog="remove poorly constrained taxa",
2031+ description="""Remove taxa that appear in one source tree only.""",
2032 )
2033 parser.add_argument(
2034 '-v',
2035@@ -34,6 +34,13 @@
2036 " to removal those in polytomies *and* only in one other tree."
2037 )
2038 parser.add_argument(
2039+ '--tree_only',
2040+ default=False,
2041+ action='store_true',
2042+ help="Restrict removal to taxa that occur in only one source tree. Default"+
2043+ " is to remove those in polytomies *and* in only one other tree."
2044+ )
2045+ parser.add_argument(
2046 'input_phyml',
2047 metavar='input_phyml',
2048 nargs=1,
2049@@ -43,13 +50,13 @@
2050 'input_tree',
2051 metavar='input_tree',
2052 nargs=1,
2053- help="Your tree"
2054+ help="Your tree - can be NULL or None"
2055 )
2056 parser.add_argument(
2057 'output_tree',
2058 metavar='output_tree',
2059 nargs=1,
2060- help="Your output tree"
2061+ help="Your output tree or Phyml - if input_tree is None, this is the output Phyml"
2062 )
2063
2064
2065@@ -62,14 +69,20 @@
2066 dl = True
2067 poly_only = args.poly_only
2068 input_tree = args.input_tree[0]
2069- output_tree = args.output_tree[0]
2070+ if input_tree == 'NULL' or input_tree == 'None':
2071+ input_tree = None
2072+ output_file = args.output_tree[0]
2073 input_phyml = args.input_phyml[0]
2074
2075 XML = stk.load_phyml(input_phyml)
2076 # load tree
2077- supertree = stk.import_tree(input_tree)
2078+ if (not input_tree == None):
2079+ supertree = stk.import_tree(input_tree)
2080+ taxa = stk._getTaxaFromNewick(supertree)
2081+ else:
2082+ supertree = None
2083+ taxa = stk.get_all_taxa(XML)
2084 # grab taxa
2085- taxa = stk._getTaxaFromNewick(supertree)
2086 delete_list = []
2087
2088 # loop over taxa in supertree and get some stats
2089@@ -115,19 +128,29 @@
2090
2091 print "Taxa: "+str(len(taxa))
2092 print "Deleting: "+str(len(delete_list))
2093- # done, so delete the problem taxa from the supertree
2094- for t in delete_list:
2095- # remove taxa from supertree
2096- supertree = stk._sub_taxa_in_tree(supertree,t)
2097-
2098- # save supertree
2099- tree = {}
2100- tree['Tree_1'] = supertree
2101- output = stk._amalgamate_trees(tree,format='nexus')
2102- # write file
2103- f = open(output_tree,"w")
2104- f.write(output)
2105- f.close()
2106+
2107+ if not supertree == None:
2108+ # done, so delete the problem taxa from the supertree
2109+ for t in delete_list:
2110+ # remove taxa from supertree
2111+ supertree = stk._sub_taxa_in_tree(supertree,t)
2112+
2113+ # save supertree
2114+ tree = {}
2115+ tree['Tree_1'] = supertree
2116+ output = stk._amalgamate_trees(tree,format='nexus')
2117+ # write file
2118+ f = open(output_file,"w")
2119+ f.write(output)
2120+ f.close()
2121+ else:
2122+ new_phyml = stk.substitute_taxa(XML,delete_list)
2123+ # write file
2124+ f = open(output_file,"w")
2125+ f.write(new_phyml)
2126+ f.close()
2127+
2128+
2129
2130 if (dl):
2131 # write file
2132
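In its simplest (tree-only) mode, the selection rule above amounts to: count how many source trees each taxon appears in, and flag those that appear in exactly one. A Python 3 sketch under that assumption; the `source_trees` mapping is invented here, whereas the real script collects taxa via `stk.obtain_trees`/`stk.get_taxa_from_tree` and also inspects polytomies:

```python
from collections import Counter

# Hypothetical source trees and their taxa
source_trees = {
    "tree_1": ["A", "B", "C"],
    "tree_2": ["A", "C", "D"],
    "tree_3": ["A", "B"],
}

# count occurrences of each taxon across all source trees
occurrences = Counter(t for taxa in source_trees.values() for t in taxa)

# flag taxa seen in only one tree for removal
delete_list = [t for t, n in occurrences.items() if n == 1]

print(sorted(delete_list))  # ['D']
```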
2133=== added file 'stk/scripts/tree_from_taxonomy.py'
2134--- stk/scripts/tree_from_taxonomy.py 1970-01-01 00:00:00 +0000
2135+++ stk/scripts/tree_from_taxonomy.py 2017-01-12 09:27:31 +0000
2136@@ -0,0 +1,142 @@
2137+# trees ready for supertree construction.
2138+# Copyright (C) 2015, Jon Hill, Katie Davis
2139+#
2140+# This program is free software: you can redistribute it and/or modify
2141+# it under the terms of the GNU General Public License as published by
2142+# the Free Software Foundation, either version 3 of the License, or
2143+# (at your option) any later version.
2144+#
2145+# This program is distributed in the hope that it will be useful,
2146+# but WITHOUT ANY WARRANTY; without even the implied warranty of
2147+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
2148+# GNU General Public License for more details.
2149+#
2150+# You should have received a copy of the GNU General Public License
2151+# along with this program. If not, see <http://www.gnu.org/licenses/>.
2152+#
2153+# Jon Hill. jon.hill@york.ac.uk
2154+
2155+import argparse
2156+import copy
2157+import os
2158+import sys
2159+stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
2160+sys.path.insert(0, stk_path)
2161+import supertree_toolkit as stk
2162+import csv
2163+from ete2 import Tree
2164+
2165+taxonomy_levels = ['species','subgenus','genus','subfamily','family','superfamily','subsection','section','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
2166+tlevels = ['species','genus','family','order','class','phylum','kingdom']
2167+
2168+
2169+def main():
2170+
2171+ # do stuff
2172+ parser = argparse.ArgumentParser(
2173+ prog="create a tree from a taxonomy file",
2174+ description="Create a taxonomic tree",
2175+ )
2176+ parser.add_argument(
2177+ '-v',
2178+ '--verbose',
2179+ action='store_true',
2180+ help="Verbose output: mainly progress reports.",
2181+ default=False
2182+ )
2183+ parser.add_argument(
2184+ 'top_level',
2185+ nargs=1,
2186+ help="The top level group to start with, e.g. family"
2187+ )
2188+ parser.add_argument(
2189+ 'input_file',
2190+ metavar='input_file',
2191+ nargs=1,
2192+ help="Your taxonomy file"
2193+ )
2194+ parser.add_argument(
2195+ 'output_file',
2196+ metavar='output_file',
2197+ nargs=1,
2198+ help="Your new tree file"
2199+ )
2200+
2201+ args = parser.parse_args()
2202+ verbose = args.verbose
2203+ input_file = args.input_file[0]
2204+ output_file = args.output_file[0]
2205+ top_level = args.top_level[0]
2206+
2207+ start_level = taxonomy_levels.index(top_level)
2208+ tree_taxonomy = stk.load_taxonomy(input_file)
2209+ new_taxa = tree_taxonomy.keys()
2210+
2211+ tl_types = []
2212+ for tt in tree_taxonomy:
2213+ tl_types.append(tree_taxonomy[tt][top_level])
2214+
2215+ tl_types = _uniquify(tl_types)
2216+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
2217+
2218+ #print levels_to_worry_about[-2::-1]
2219+
2220+ t = Tree()
2221+ nodes = {}
2222+ nodes[top_level] = []
2223+ for tl in tl_types:
2224+ n = t.add_child(name=tl)
2225+ nodes[top_level].append({tl:n})
2226+
2227+ for l in levels_to_worry_about[-2::-1]:
2228+ #print t
2229+ names = []
2230+ nodes[l] = []
2231+ ci = levels_to_worry_about.index(l)
2232+ for tt in tree_taxonomy:
2233+ names.append(tree_taxonomy[tt][l])
2234+ names = _uniquify(names)
2235+ for n in names:
2236+ #print n
2237+ # find my parent
2238+ parent = None
2239+ for tt in tree_taxonomy:
2240+ if tree_taxonomy[tt][l] == n:
2241+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
2242+ k = []
2243+ for nd in nodes[levels_to_worry_about[ci+1]]:
2244+ k.extend(nd.keys())
2245+ i = 0
2246+ for kk in k:
2248+ if kk == parent:
2249+ break
2250+ i += 1
2251+ parent_id = i
2252+ break
2253+ # find out where to attach it
2254+ node_id = nodes[levels_to_worry_about[ci+1]][parent_id][parent]
2255+ nd = node_id.add_child(name=n.replace(" ","_"))
2256+ nodes[l].append({n:nd})
2257+
2258+ tree = t.write(format=9)
2259+ tree = stk._collapse_nodes(tree)
2260+ tree = stk._collapse_nodes(tree)
2261+ open(output_file, "w").write(tree)
2262+
2263+
2264+def _uniquify(l):
2265+ """
2266+ Make a list, l, contain only unique data
2267+ """
2268+ keys = {}
2269+ for e in l:
2270+ keys[e] = 1
2271+
2272+ return keys.keys()
2273+
2274+if __name__ == "__main__":
2275+ main()
2276+
2277+
2278+
2279
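The core idea of tree_from_taxonomy.py is to nest taxa by successive ranks and emit the nesting as Newick. A pared-down Python 3 sketch of that idea with just two levels (genus within family) and no ete2 dependency; `newick_from_taxonomy` and its example data are hypothetical, and the real script walks an arbitrary level list and handles missing ranks:

```python
def newick_from_taxonomy(taxonomy):
    """Build a family -> genus -> species Newick string.

    taxonomy maps a species name to {'genus': ..., 'family': ...}.
    """
    # group species under genera, genera under families
    families = {}
    for sp, ranks in taxonomy.items():
        fam = families.setdefault(ranks["family"], {})
        fam.setdefault(ranks["genus"], []).append(sp.replace(" ", "_"))

    # emit the nesting, sorted for a deterministic result
    clades = []
    for fam in sorted(families):
        genera = [
            "(" + ",".join(sorted(families[fam][g])) + ")" + g
            for g in sorted(families[fam])
        ]
        clades.append("(" + ",".join(genera) + ")" + fam)
    return "(" + ",".join(clades) + ");"

tax = {
    "Aus bus": {"genus": "Aus", "family": "Fam1"},
    "Aus cus": {"genus": "Aus", "family": "Fam1"},
    "Dus eus": {"genus": "Dus", "family": "Fam2"},
}
print(newick_from_taxonomy(tax))
# (((Aus_bus,Aus_cus)Aus)Fam1,((Dus_eus)Dus)Fam2);
```

The single-child clades this produces (e.g. a genus containing one species) are why the script runs the result through `stk._collapse_nodes` afterwards.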
2280=== modified file 'stk/stk'
2281--- stk/stk 2014-12-09 10:58:48 +0000
2282+++ stk/stk 2017-01-12 09:27:31 +0000
2283@@ -23,6 +23,7 @@
2284 import sys
2285 import argparse
2286 import traceback
2287+import time
2288 try:
2289 __file__
2290 except NameError:
2291@@ -41,6 +42,10 @@
2292 import string
2293 import stk.p4 as p4
2294 import lxml
2295+import csv
2296+import tempfile
2297+from subprocess import check_call, CalledProcessError, call
2298+
2299 import stk.bzr_version as bzr_version
2300 d = bzr_version.version_info
2301 build = d.get('revno','<unknown revno>')
2302@@ -366,7 +371,7 @@
2303
2304 # Clean data
2305 parser_cm = subparsers.add_parser('clean_data',
2306- help='Remove errant taxa, uninformative trees and empty sources.'
2307+ help='Renames all sources and trees sensibly. Removes errant taxa, uninformative trees and empty sources.'
2308 )
2309 parser_cm.add_argument('input',
2310 help='The input phyml file')
2311@@ -488,7 +493,81 @@
2312 parser_cm.add_argument('subs',
2313 help='The subs file')
2314 parser_cm.set_defaults(func=check_subs)
2315-
2316+
2317+ # taxonomic name checker
2318+ parser_cm = subparsers.add_parser('check_otus',
2319+ help='Check your OTUs against EoL.'
2320+ )
2321+ parser_cm.add_argument('input',
2322+ help='The input Phyml. Also accepts tree files or a simple list')
2323+ parser_cm.add_argument('output',
2324+ help='The output CSV file. Taxon, synonyms, status')
2325+ parser_cm.add_argument('--overwrite',
2326+ action='store_true',
2327+ default=False,
2328+ help="Overwrite the existing file without asking for confirmation")
2329+ parser_cm.set_defaults(func=check_otus)
2330+
2331+ # create taxonomy csv file
2332+ parser_cm = subparsers.add_parser('create_taxonomy',
2333+ help='Create a taxonomy file in CSV for you to then augment.'
2334+ )
2335+ parser_cm.add_argument('input',
2336+ help='The input Phyml. Also accepts tree files or a simple list')
2337+ parser_cm.add_argument('output',
2338+ help='The output CSV file. Name, followed by classification and source')
2339+ parser_cm.add_argument('--overwrite',
2340+ action='store_true',
2341+ default=False,
2342+ help="Overwrite the existing file without asking for confirmation")
2343+ parser_cm.add_argument('--taxonomy',
2344+ help="Give a starting taxonomy file, e.g. one you ran earlier",)
2345+ parser_cm.set_defaults(func=create_taxonomy)
2346+
2347+
2348+ # do the subs in a one go using taxonomy
2349+ parser_cm = subparsers.add_parser('auto_subs',
2350+ help='Using a taxonomy, generate a species level version of your data in one go.'
2351+ )
2352+ parser_cm.add_argument('input',
2353+ help='The input Phyml')
2354+ parser_cm.add_argument('taxonomy',
2355+ help='Your taxonomy file',
2356+ )
2357+ parser_cm.add_argument('output',
2358+ help='The output phyml')
2359+ parser_cm.add_argument('--overwrite',
2360+ action='store_true',
2361+ default=False,
2362+ help="Overwrite the existing file without asking for confirmation")
2363+ #parser_cm.add_argument('--level',
2364+ # choices=supertree_toolkit.taxonomy_levels,
2365+ # help="Taxonomic level to output at",)
2366+ parser_cm.set_defaults(func=auto_subs)
2367+
2368+
2369+ # attempt to process the data into a matrix all automatically
2370+ parser_cm = subparsers.add_parser('process',
2372+ help='Generate a species-level matrix, and do all the checks and processing automatically. Note this creates a taxonomy and does all the processing, but the result will not be perfect (as taxonomies are not perfect)'
2372+ )
2373+ parser_cm.add_argument('input',
2374+ help='The input Phyml')
2375+ parser_cm.add_argument('output',
2376+ help='The output matrix')
2377+ parser_cm.add_argument('--taxonomy_file',
2378+ help='Existing taxonomy file to prevent redownloading data. Any taxa not in the file will be checked online, so partially complete files are OK.')
2379+ parser_cm.add_argument('--equivalents_file',
2380+ help='Existing equivalents file from a taxonomic name check. Any taxa not in the file will be checked online, so partially complete files are OK.')
2381+ parser_cm.add_argument('--overwrite',
2382+ action='store_true',
2383+ default=False,
2384+ help="Overwrite the existing file without asking for confirmation")
2385+ parser_cm.add_argument('--no_store',
2386+ action="store_true",
2387+ default=False,
2388+ help="Do not store intermediate files -- not recommended")
2389+ parser_cm.set_defaults(func=process)
2390+
2391
2392 # before we let argparse work its magic, check for --version
2393 if "--version" in sys.argv:
2394@@ -602,7 +681,7 @@
2395 # check if output files are there
2396 if (output_file and os.path.exists(output_file) and not overwrite):
2397 print "Output file exists. Either remove the file or use the --overwrite flag."
2398- print "Do you wish to continue? [Y/n]"
2399+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2400 while True:
2401 k=inkey()
2402 if k.lower() == 'n':
2403@@ -612,7 +691,7 @@
2404 break
2405 if (not newphyml == None and os.path.exists(newphyml) and not overwrite):
2406 print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2407- print "Do you wish to continue? [Y/n]"
2408+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2409 while True:
2410 k=inkey()
2411 if k.lower() == 'n':
2412@@ -624,9 +703,9 @@
2413 XML = supertree_toolkit.load_phyml(input_file)
2414 try:
2415 if (newphyml == None):
2416- data_independence = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2417+ data_independence, subsets = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2418 else:
2419- data_independence, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2420+ data_independence, subsets, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2421 except NotUniqueError as detail:
2422 msg = "***Error: Failed to check independence.\n"+detail.msg
2423 print msg
2424@@ -644,7 +723,7 @@
2425 print msg
2426 return
2427 except:
2428- msg = "***Error: failed to check independence due to unknown error."
2429+ msg = "***Error: failed to check independence due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2430 print msg
2431 traceback.print_exc()
2432 return
2433@@ -653,16 +732,14 @@
2434 data_ind = ""
2435 #column headers
2436 data_ind = "Source trees that are subsets of others\n"
2437- data_ind = data_ind + "Flagged tree, is a subset of:\n"
2438- for name in data_independence:
2439- if ( data_independence[name][1] == supertree_toolkit.SUBSET ):
2440- data_ind += name + "," + data_independence[name][0] + "\n"
2441+ data_ind = data_ind + "Flagged tree(s), is/are subset(s) of:\n"
2442+ for names in subsets:
2443+ data_ind += ";".join(names[1:]) + "," + names[0] + "\n"
2444
2445 data_ind += "\n\nSource trees that are identical to others\n"
2446- data_ind = data_ind + "Flagged tree, is identical to:\n"
2447- for name in data_independence:
2448- if ( data_independence[name][1] == supertree_toolkit.IDENTICAL ):
2449- data_ind += name + "," + data_independence[name][0] + "\n"
2450+ data_ind = data_ind + "Flagged tree(s), is/are identical to:\n"
2451+ for names in data_independence:
2452+ data_ind += ";".join(names[1:]) + "," + names[0] + "\n"
2453
2454
2455 if (output_file == False or
2456@@ -762,7 +839,7 @@
2457 # Does the output file already exist?
2458 if (os.path.exists(output_file) and not overwrite):
2459 print "Output file exists. Either remove the file or use the --overwrite flag."
2460- print "Do you wish to continue? [Y/n]"
2461+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2462 while True:
2463 k=inkey()
2464 if k.lower() == 'n':
2465@@ -771,6 +848,7 @@
2466 if k.lower() == 'y':
2467 break
2468 try:
2469+
2470 XML = supertree_toolkit.load_phyml(input_file)
2471 input_is_xml = True
2472 except:
2473@@ -896,7 +974,7 @@
2474 # Does the output file already exist?
2475 if (os.path.exists(output_file) and not overwrite):
2476 print "Output file exists. Either remove the file or use the --overwrite flag."
2477- print "Do you wish to continue? [Y/n]"
2478+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2479 while True:
2480 k=inkey()
2481 if k.lower() == 'n':
2482@@ -942,7 +1020,7 @@
2483 print msg
2484 return
2485 except:
2486- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2487+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2488 print msg
2489 traceback.print_exc()
2490 return
2491@@ -983,7 +1061,7 @@
2492
2493 if (os.path.exists(output_file) and not overwrite):
2494 print "Output file exists. Either remove the file or use the --overwrite flag."
2495- print "Do you wish to continue? [Y/n]"
2496+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2497 while True:
2498 k=inkey()
2499 if k.lower() == 'n':
2500@@ -1013,7 +1091,7 @@
2501 print msg
2502 return
2503 except:
2504- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2505+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2506 print msg
2507 traceback.print_exc()
2508 return
2509@@ -1060,7 +1138,7 @@
2510 print msg
2511 return
2512 except:
2513- msg = "***Error: Failed to export data due to unknown error.\n"
2514+ msg = "***Error: Failed to export data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2515 print msg
2516 traceback.print_exc()
2517 return
2518@@ -1115,7 +1193,7 @@
2519 print msg
2520 return
2521 except:
2522- msg = "***Error: Failed to check overlap due to unknown error.\n"
2523+ msg = "***Error: Failed to check overlap due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2524 print msg
2525 traceback.print_exc()
2526 return
2527@@ -1161,7 +1239,7 @@
2528 # check if output files are there
2529 if (output_file and os.path.exists(output_file) and not overwrite):
2530 print "Output file exists. Either remove the file or use the --overwrite flag."
2531- print "Do you wish to continue? [Y/n]"
2532+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2533 while True:
2534 k=inkey()
2535 if k.lower() == 'n':
2536@@ -1191,7 +1269,7 @@
2537 print msg
2538 return
2539 except:
2540- msg = "***Error: Failed to export trees due to unknown error.\n"
2541+ msg = "***Error: Failed to export trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2542 print msg
2543 traceback.print_exc()
2544 return
2545@@ -1220,7 +1298,7 @@
2546 # check if output files are there
2547 if (output_file and os.path.exists(output_file) and not overwrite):
2548 print "Output file exists. Either remove the file or use the --overwrite flag."
2549- print "Do you wish to continue? [Y/n]"
2550+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2551 while True:
2552 k=inkey()
2553 if k.lower() == 'n':
2554@@ -1309,7 +1387,7 @@
2555 print msg
2556 return
2557 except:
2558- msg = "***Error: Failed to permute trees due to unknown error.\n"
2559+ msg = "***Error: Failed to permute trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2560 print msg
2561 traceback.print_exc()
2562 return
2563@@ -1347,7 +1425,7 @@
2564 # check if output files are there
2565 if (os.path.exists(output_file) and not overwrite):
2566 print "Output file exists. Either remove the file or use the --overwrite flag."
2567- print "Do you wish to continue? [Y/n]"
2568+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2569 while True:
2570 k=inkey()
2571 if k.lower() == 'n':
2572@@ -1376,7 +1454,7 @@
2573 print msg
2574 return
2575 except:
2576- msg = "***Error: Failed to clean data due to unknown error.\n"
2577+ msg = "***Error: Failed to clean data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2578 print msg
2579 traceback.print_exc()
2580 return
2581@@ -1404,7 +1482,7 @@
2582 # check if output files are there
2583 if (os.path.exists(output_file) and not overwrite):
2584 print "Output file exists. Either remove the file or use the --overwrite flag."
2585- print "Do you wish to continue? [Y/n]"
2586+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2587 while True:
2588 k=inkey()
2589 if k.lower() == 'n':
2590@@ -1433,7 +1511,7 @@
2591 print msg
2592 return
2593 except:
2594- msg = "***Error: Failed to replace genera due to unknown error.\n"
2595+ msg = "***Error: Failed to replace genera due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2596 print msg
2597 traceback.print_exc()
2598 return
2599@@ -1488,7 +1566,7 @@
2600 new_trees = {}
2601 i = 1
2602 for t in trees:
2603- new_trees['tree_'+str(i)] = t
2604+ new_trees['tree_'+str(i)] = supertree_toolkit._collapse_nodes(t)
2605 i += 1
2606 output = supertree_toolkit._amalgamate_trees(new_trees,format=output_format)
2607 except TreeParseError as detail:
2608@@ -1503,7 +1581,7 @@
2609 # check if output files are there
2610 if (os.path.exists(output_file) and not overwrite):
2611 print "Output file exists. Either remove the file or use the --overwrite flag."
2612- print "Do you wish to continue? [Y/n]"
2613+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2614 while True:
2615 k=inkey()
2616 if k.lower() == 'n':
2617@@ -1540,7 +1618,7 @@
2618 # check if output files are there
2619 if (os.path.exists(output_file) and not overwrite):
2620 print "Output file exists. Either remove the file or use the --overwrite flag."
2621- print "Do you wish to continue? [Y/n]"
2622+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2623 while True:
2624 k=inkey()
2625 if k.lower() == 'n':
2626@@ -1589,7 +1667,7 @@
2627 print msg
2628 return
2629 except:
2630- msg = "***Error: Failed to create subset due to unknown error.\n"
2631+ msg = "***Error: Failed to create subset due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2632 print msg
2633 traceback.print_exc()
2634 return
2635@@ -1637,6 +1715,681 @@
2636 print "**************************************************************\n"
2637
2638
2639+def check_otus(args):
2640+ """check out the OTUs in the Phyml - are they considered valid?"""
2641+
2642+ verbose = args.verbose
2643+ input_file = args.input
2644+ output_file = args.output
2645+
2646+ print input_file
2647+ if (input_file.endswith(".phyml")):
2648+ XML = supertree_toolkit.load_phyml(input_file)
2649+ try:
2650+ equivs = supertree_toolkit.taxonomic_checker(XML, verbose=verbose)
2651+ except NotUniqueError as detail:
2652+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2653+ print msg
2654+ return
2655+ except InvalidSTKData as detail:
2656+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2657+ print msg
2658+ return
2659+ except UninformativeTreeError as detail:
2660+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2661+ print msg
2662+ return
2663+ except TreeParseError as detail:
2664+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2665+ print msg
2666+ return
2667+ except:
2668+ # what about no internet connection? What error does that throw?
2669+ msg = "***Error: failed to check OTUs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2670+ print msg
2671+ traceback.print_exc()
2672+ return
2673+ elif (input_file.endswith(".txt") or input_file.endswith('.dat')):
2674+ # read file - assume one taxa per line
2675+ with open(input_file,'r') as f:
2676+ lines = f.read().splitlines()
2677+ equivs = supertree_toolkit.taxonomic_checker_list(lines, verbose=verbose)
2678+ else:
2679+ # assume a tree!
2680+ equivs = supertree_toolkit.taxonomic_checker_tree(input_file, verbose=verbose)
2681+
2682+
2683+
2684+ f = open(output_file,"w")
2685+ for taxon in sorted(equivs.keys()):
2686+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
2687+ f.close()
2688+
2689+
2690+
2691+def create_taxonomy(args):
2692+ """create a taxonomic hierarchy for each OTU in the Phyml"""
2693+
2694+ verbose = args.verbose
2695+ input_file = args.input
2696+ output_file = args.output
2697+ existing_taxonomy = args.taxonomy
2698+ ignoreWarnings = args.ignoreWarnings
2699+
2700+ XML = supertree_toolkit.load_phyml(input_file)
2701+ if (not existing_taxonomy == None):
2702+ existing_taxonomy = supertree_toolkit.load_taxonomy(existing_taxonomy) # load it in and create the dictionary
2703+ pass
2704+
2705+ try:
2706+ taxonomy = supertree_toolkit.create_taxonomy(XML,existing_taxonomy=existing_taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2707+ except NotUniqueError as detail:
2708+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2709+ print msg
2710+ return
2711+ except InvalidSTKData as detail:
2712+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2713+ print msg
2714+ return
2715+ except UninformativeTreeError as detail:
2716+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2717+ print msg
2718+ return
2719+ except TreeParseError as detail:
2720+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2721+ print msg
2722+ return
2723+ except:
2724+ # what about no internet connection? What error does that throw?
2725+ msg = "***Error: failed to create taxonomy due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2726+ print msg
2727+ traceback.print_exc()
2728+ return
2729+
2730+ # Now create the CSV output
2731+ with open(output_file, 'w') as f:
2732+ writer = csv.writer(f)
2733+ headers = []
2734+ headers.append("OTU")
2735+ headers.extend(supertree_toolkit.taxonomy_levels)
2736+ headers.append("Data source")
2737+ writer.writerow(headers)
2738+ for t in taxonomy:
2739+ otu = t
2740+ try:
2741+ species = taxonomy[t]['species']
2742+ except KeyError:
2743+ species = "-"
2744+ try:
2745+ genus = taxonomy[t]['genus']
2746+ except KeyError:
2747+ genus = "-"
2748+ try:
2749+ family = taxonomy[t]['family']
2750+ except KeyError:
2751+ family = "-"
2752+ try:
2753+ superfamily = taxonomy[t]['superfamily']
2754+ except KeyError:
2755+ superfamily = "-"
2756+ try:
2757+ infraorder = taxonomy[t]['infraorder']
2758+ except KeyError:
2759+ infraorder = "-"
2760+ try:
2761+ suborder = taxonomy[t]['suborder']
2762+ except KeyError:
2763+ suborder = "-"
2764+ try:
2765+ order = taxonomy[t]['order']
2766+ except KeyError:
2767+ order = "-"
2768+ try:
2769+ superorder = taxonomy[t]['superorder']
2770+ except KeyError:
2771+ superorder = "-"
2772+ try:
2773+ subclass = taxonomy[t]['subclass']
2774+ except KeyError:
2775+ subclass = "-"
2776+ try:
2777+ tclass = taxonomy[t]['class']
2778+ except KeyError:
2779+ tclass = "-"
2780+ try:
2781+ subphylum = taxonomy[t]['subphylum']
2782+ except KeyError:
2783+ subphylum = "-"
2784+ try:
2785+ phylum = taxonomy[t]['phylum']
2786+ except KeyError:
2787+ phylum = "-"
2788+ try:
2789+ superphylum = taxonomy[t]['superphylum']
2790+ except KeyError:
2791+ superphylum = "-"
2792+ try:
2793+ infrakingdom = taxonomy[t]['infrakingdom']
2794+ except KeyError:
2795+ infrakingdom = "-"
2796+ try:
2797+ subkingdom = taxonomy[t]['subkingdom']
2798+ except KeyError:
2799+ subkingdom = "-"
2800+ try:
2801+ kingdom = taxonomy[t]['kingdom']
2802+ except KeyError:
2803+ kingdom = "-"
2804+ try:
2805+ provider = taxonomy[t]['provider']
2806+ except KeyError:
2807+ provider = "-"
2808+
2809+ if (isinstance(species, list)):
2810+ species = " ".join(species)
2811+ this_classification = [
2812+ otu.encode('utf-8'),
2813+ species.encode('utf-8'),
2814+ genus.encode('utf-8'),
2815+ family.encode('utf-8'),
2816+ superfamily.encode('utf-8'),
2817+ infraorder.encode('utf-8'),
2818+ suborder.encode('utf-8'),
2819+ order.encode('utf-8'),
2820+ superorder.encode('utf-8'),
2821+ subclass.encode('utf-8'),
2822+ tclass.encode('utf-8'),
2823+ subphylum.encode('utf-8'),
2824+ phylum.encode('utf-8'),
2825+ superphylum.encode('utf-8'),
2826+ infrakingdom.encode('utf-8'),
2827+ subkingdom.encode('utf-8'),
2828+ kingdom.encode('utf-8'),
2829+ provider.encode('utf-8')]
2830+ writer.writerow(this_classification)
2831+
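The long run of near-identical try/except KeyError blocks above could be collapsed by iterating over the rank names and using `dict.get`. A minimal sketch of that pattern, under the assumption that each taxonomy entry is a plain dict keyed by rank name (the `taxonomy_levels` list here is a shortened stand-in for the module's own, and `classification_row` is a hypothetical helper):

```python
# Shortened stand-in for supertree_toolkit.taxonomy_levels
taxonomy_levels = ['species', 'genus', 'family', 'order', 'kingdom']

def classification_row(otu, ranks):
    """Return [otu, one value per rank, provider], with '-' for missing ranks."""
    row = [otu]
    for level in taxonomy_levels:
        value = ranks.get(level, '-')   # replaces one try/except per rank
        if isinstance(value, list):     # e.g. species occasionally stored as a list
            value = ' '.join(value)
        row.append(value)
    row.append(ranks.get('provider', '-'))
    return row

row = classification_row('Gallus_gallus',
                         {'species': 'Gallus gallus',
                          'genus': 'Gallus',
                          'provider': 'EOL'})
```

Each such row can then go straight to `csv.writer.writerow`, exactly as the code above does with its hand-built list.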
2832+def auto_subs(args):
2833+ """Get all OTUs to the same taxonomic level"""
2834+
2835+
2836+ verbose = args.verbose
2837+ input_file = args.input
2838+ output = args.output
2839+ taxonomy = args.taxonomy
2840+ ignoreWarnings = args.ignoreWarnings
2841+ overwrite = args.overwrite
2841+
2842+ if (os.path.exists(output) and not overwrite):
2843+ print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2844+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2845+ while True:
2846+ k=inkey()
2847+ if k.lower() == 'n':
2848+ print "Exiting..."
2849+ sys.exit(0)
2850+ if k.lower() == 'y':
2851+ break
2852+
2853+ XML = supertree_toolkit.load_phyml(input_file)
2854+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy) # load it in and create the dictionary
2855+
2856+ try:
2857+ newXML = supertree_toolkit.generate_species_level_data(XML,taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2858+ except NotUniqueError as detail:
2859+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2860+ print msg
2861+ return
2862+ except InvalidSTKData as detail:
2863+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2864+ print msg
2865+ return
2866+ except UninformativeTreeError as detail:
2867+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2868+ print msg
2869+ return
2870+ except TreeParseError as detail:
2871+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2872+ print msg
2873+ return
2874+ except NoneCompleteTaxonomy as detail:
2875+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2876+ print msg
2877+ return
2878+ except:
2879+ # what about no internet connection? What error does that throw?
2880+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2881+ print msg
2882+ traceback.print_exc()
2883+ return
2884+
2885+ f = open(output,"w")
2886+ f.write(newXML)
2887+ f.close()
2888+
2889+def process(args):
2890+
2891+ verbose = args.verbose
2892+ input_file = args.input
2893+ output = args.output
2894+ no_store = args.no_store
2895+ ignoreWarnings = args.ignoreWarnings
2896+ taxonomy_file = args.taxonomy_file
2897+ equivalents_file = args.equivalents_file
2898+ overwrite = args.overwrite
2899+
2900+ if (os.path.exists(output) and not overwrite):
2901+ print "Output matrix file exists. Either remove the file or use the --overwrite flag."
2902+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2903+ while True:
2904+ k=inkey()
2905+ if k.lower() == 'n':
2906+ print "Exiting..."
2907+ sys.exit(0)
2908+ if k.lower() == 'y':
2909+ break
2910+
2911+ filename = os.path.basename(input_file)
2912+ dirname = os.path.dirname(input_file)
2913+
2914+ if verbose:
2915+ print "Loading and checking your data"
2916+ # 0) load and check data
2917+ try:
2918+ phyml = supertree_toolkit.load_phyml(input_file)
2919+ project_name = supertree_toolkit.get_project_name(phyml)
2920+ supertree_toolkit._check_data(phyml)
2921+ except NotUniqueError as detail:
2922+ msg = "***Error: Failed to load data.\n"+detail.msg
2923+ print msg
2924+ return
2925+ except InvalidSTKData as detail:
2926+ msg = "***Error: Failed to load data.\n"+detail.msg
2927+ print msg
2928+ return
2929+ except UninformativeTreeError as detail:
2930+ msg = "***Error: Failed to load data.\n"+detail.msg
2931+ print msg
2932+ return
2933+ except TreeParseError as detail:
2934+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2935+ print msg
2936+ return
2937+ except:
2938+ msg = "***Error: Failed to load input due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2939+ print msg
2940+ traceback.print_exc()
2941+ return
2942+
2943+ if verbose:
2944+ print "Checking taxa against online databases"
2945+ # 1) taxonomy checker with autoreplace
2946+ # Load existing data if any:
2947+ if (not equivalents_file == None):
2948+ equivalents = supertree_toolkit.load_equivalents(equivalents_file)
2949+ else:
2950+ equivalents = None
2951+ equivalents = supertree_toolkit.taxonomic_checker(phyml,existing_data=equivalents,verbose=verbose)
2952+ # save the equivalents for later (as CSV and as sub file)
2953+ data_string_csv = _equivalents_to_csv(equivalents)
2954+ data_string_subs = _equivalents_to_subs(equivalents)
2955+ f = open(os.path.join(dirname,project_name+"_taxonomy_checker.csv"), "w")
2956+ f.write(data_string_csv)
2957+ f.close()
2958+ f = open(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"), "w")
2959+ f.write(data_string_subs)
2960+ f.close()
2961+
2962+ # now do the replacements - we use the subs file :)
2963+ if verbose:
2964+ print "Swapping in the corrected taxa names"
2965+ try:
2966+ old_taxa, new_taxa = supertree_toolkit.parse_subs_file(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"))
2967+ except UnableToParseSubsFile as e:
2968+ print e.msg
2969+ sys.exit(-1)
2970+ try:
2971+ phyml = supertree_toolkit.substitute_taxa(phyml,old_taxa,new_taxa,only_existing=False,verbose=verbose)
2972+ except NotUniqueError as detail:
2973+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2974+ print msg
2975+ return
2976+ except InvalidSTKData as detail:
2977+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2978+ print msg
2979+ return
2980+ except UninformativeTreeError as detail:
2981+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2982+ print msg
2983+ return
2984+ except TreeParseError as detail:
2985+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2986+ print msg
2987+ return
2988+ except:
2989+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2990+ print msg
2991+ traceback.print_exc()
2992+ return
2993+ # save phyml as intermediate step
2994+ f = open(os.path.join(dirname,project_name+"_taxonomy_checked.phyml"), "w")
2995+ f.write(phyml)
2996+ f.close()
2997+
2998+
2999+ if verbose:
3000+ print "Creating taxonomic information"
3001+ # 2) create taxonomy
3002+ if (not taxonomy_file == None):
3003+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy_file)
3004+ else:
3005+ taxonomy = None
3006+ taxonomy = supertree_toolkit.create_taxonomy(phyml,existing_taxonomy=taxonomy,verbose=verbose)
3007+ # save the taxonomy for later
3008+ # Now create the CSV output - separate this out into a function in STK (used several times)
3009+ with open(os.path.join(dirname,project_name+"_taxonomy.csv"), 'w') as f:
3010+ writer = csv.writer(f)
3011+ headers = []
3012+ headers.append("OTU")
3013+ headers.extend(supertree_toolkit.taxonomy_levels)
3014+ headers.append("Data source")
3015+ writer.writerow(headers)
3016+ for t in taxonomy:
3017+ otu = t
3018+ try:
3019+ species = taxonomy[t]['species']
3020+ except KeyError:
3021+ species = "-"
3022+ try:
3023+ subgenus = taxonomy[t]['subgenus']
3024+ except KeyError:
3025+ subgenus = "-"
3026+ try:
3027+ genus = taxonomy[t]['genus']
3028+ except KeyError:
3029+ genus = "-"
3030+ try:
3031+ subfamily = taxonomy[t]['subfamily']
3032+ except KeyError:
3033+ subfamily = "-"
3034+ try:
3035+ family = taxonomy[t]['family']
3036+ except KeyError:
3037+ family = "-"
3038+ try:
3039+ superfamily = taxonomy[t]['superfamily']
3040+ except KeyError:
3041+ superfamily = "-"
3042+ try:
3043+ subsection = taxonomy[t]['subsection']
3044+ except KeyError:
3045+ subsection = "-"
3046+ try:
3047+ section = taxonomy[t]['section']
3048+ except KeyError:
3049+ section = "-"
3050+ try:
3051+ infraorder = taxonomy[t]['infraorder']
3052+ except KeyError:
3053+ infraorder = "-"
3054+ try:
3055+ suborder = taxonomy[t]['suborder']
3056+ except KeyError:
3057+ suborder = "-"
3058+ try:
3059+ order = taxonomy[t]['order']
3060+ except KeyError:
3061+ order = "-"
3062+ try:
3063+ superorder = taxonomy[t]['superorder']
3064+ except KeyError:
3065+ superorder = "-"
3066+ try:
3067+ subclass = taxonomy[t]['subclass']
3068+ except KeyError:
3069+ subclass = "-"
3070+ try:
3071+ tclass = taxonomy[t]['class']
3072+ except KeyError:
3073+ tclass = "-"
3074+ try:
3075+ superclass = taxonomy[t]['superclass']
3076+ except KeyError:
3077+ superclass = "-"
3078+ try:
3079+ subphylum = taxonomy[t]['subphylum']
3080+ except KeyError:
3081+ subphylum = "-"
3082+ try:
3083+ phylum = taxonomy[t]['phylum']
3084+ except KeyError:
3085+ phylum = "-"
3086+ try:
3087+ superphylum = taxonomy[t]['superphylum']
3088+ except KeyError:
3089+ superphylum = "-"
3090+ try:
3091+ infrakingdom = taxonomy[t]['infrakingdom']
3092+ except KeyError:
3093+ infrakingdom = "-"
3094+ try:
3095+ subkingdom = taxonomy[t]['subkingdom']
3096+ except KeyError:
3097+ subkingdom = "-"
3098+ try:
3099+ kingdom = taxonomy[t]['kingdom']
3100+ except KeyError:
3101+ kingdom = "-"
3102+ try:
3103+ provider = taxonomy[t]['provider']
3104+ except KeyError:
3105+ provider = "-"
3106+ this_classification = [
3107+ otu.encode('utf-8'),
3108+ species.encode('utf-8'),
3109+ subgenus.encode('utf-8'),
3110+ genus.encode('utf-8'),
3111+ subfamily.encode('utf-8'),
3112+ family.encode('utf-8'),
3113+ superfamily.encode('utf-8'),
3114+ subsection.encode('utf-8'),
3115+ section.encode('utf-8'),
3116+ infraorder.encode('utf-8'),
3117+ suborder.encode('utf-8'),
3118+ order.encode('utf-8'),
3119+ superorder.encode('utf-8'),
3120+ subclass.encode('utf-8'),
3121+ tclass.encode('utf-8'),
3122+ superclass.encode('utf-8'),
3123+ subphylum.encode('utf-8'),
3124+ phylum.encode('utf-8'),
3125+ superphylum.encode('utf-8'),
3126+ infrakingdom.encode('utf-8'),
3127+ subkingdom.encode('utf-8'),
3128+ kingdom.encode('utf-8'),
3129+ provider.encode('utf-8')]
3130+ writer.writerow(this_classification)
3131+
3132+ # 3) create species level dataset
3133+ if verbose:
3134+ print "Converting data to species level"
3135+ try:
3136+ phyml = supertree_toolkit.generate_species_level_data(phyml,taxonomy,verbose=verbose)
3137+ except NotUniqueError as detail:
3138+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3139+ print msg
3140+ return
3141+ except InvalidSTKData as detail:
3142+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3143+ print msg
3144+ return
3145+ except UninformativeTreeError as detail:
3146+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3147+ print msg
3148+ return
3149+ except TreeParseError as detail:
3150+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3151+ print msg
3152+ return
3153+ except NoneCompleteTaxonomy as detail:
3154+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3155+ print msg
3156+ return
3157+ except:
3158+ # what about no internet connection? What error does that throw?
3159+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
3160+ print msg
3161+ traceback.print_exc()
3162+ return
3163+ # save the phyml as intermediate step
3164+ f = open(os.path.join(dirname,project_name+"_species_level.phyml"), "w")
3165+ f.write(phyml)
3166+ f.close()
3167+
3168+ # 4) Remove non-monophyletic taxa (requires TNT to be installed)
3169+ if verbose:
3170+ print "Removing non-monophyletic taxa via mini-supertree method"
3171+ tree_list = supertree_toolkit._find_trees_for_permuting(phyml)
3172+ try:
3173+ for t in tree_list:
3174+ # permute
3175+ output_string = supertree_toolkit.permute_tree(tree_list[t],matrix='hennig',treefile=None,verbose=verbose)
3176+ #save
3177+ if (not output_string == ""):
3178+ file_name = os.path.basename(filename)
3179+ dirname = os.path.dirname(filename)
3180+ new_output = os.path.join(dirname,t,t+"_matrix.tnt")
3181+ try:
3182+ os.makedirs(os.path.join(dirname,t))
3183+ except OSError:
3184+ if not os.path.isdir(os.path.join(dirname,t)):
3185+ raise
3186+ f = open(new_output,'w',0)
3187+ f.write(output_string)
3188+ f.close()
3189+ time.sleep(1)
3190+
3191+ # now create the tnt command to deal with this
3192+ # create a tmp file for the output tree
3193+ temp_file_handle, temp_file = tempfile.mkstemp(suffix=".tnt")
3194+ tnt_command = "tnt mxram 512,run "+new_output+",echo= ,timeout 00:10:00,rseed0,rseed*,hold 1000,xmult= level 0,taxname=,nelsen *,tsave *"+temp_file+",save /,quit"
3195+ #tnt_command = "tnt run "+new_output+",ienum,taxname=,nelsen*,tsave *"+temp_file+",save /,quit"
3196+ # run tnt, grab the output and store back in the data
3197+ #try:
3198+ call(tnt_command, shell=True)
3199+ #except CalledProcessError as e:
3200+ # msg = "***Error: Failed to run TNT. Is it installed correctly?\n"+e.msg
3201+ # print msg
3202+ # return
3203+ #ret = os.system(tnt_command)
3204+ #if (not ret == 0):
3205+ # print "error running tnt"
3206+ # return
3207+
3208+ new_tree = supertree_toolkit.import_tree(temp_file)
3209+ phyml = supertree_toolkit._swap_tree_in_XML(phyml,new_tree,t)
3210+
3211+ except TreeParseError as e:
3212+ msg = "***Error permuting trees.\n"+e.msg
3213+ print msg
3214+ return
3215+
3216+ #4.5) remove MRP_Outgroups
3217+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_Outgroup')
3218+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOutgroup')
3219+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_outgroup')
3220+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPoutgroup')
3221+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOUTGROUP')
3222+
3223+ # save intermediate phyml
3224+ f = open(os.path.join(dirname,project_name+"_nonmonophyl_removed.phyml"), "w")
3225+ f.write(phyml)
3226+ f.close()
3227+
3228+
3229+ # 5) Remove common names
3230+ # no function to do this yet...
3231+
3232+ # 6) Data independance
3233+ if verbose:
3234+ print "Checking data independence"
3235+ data_ind,subsets,phyml = supertree_toolkit.data_independence(phyml,make_new_xml=True)
3236+ # save phyml
3237+ f = open(os.path.join(dirname,project_name+"_data_ind.phyml"), "w")
3238+ f.write(phyml)
3239+ f.close()
3240+
3241+ # 7) Data overlap
3242+ if verbose:
3243+ print "Checking data overlap"
3244+ sufficient_overlap, key_list = supertree_toolkit.data_overlap(phyml,verbose=verbose)
3245+ # process the key_list to remove the unconnected trees
3246+ if not sufficient_overlap:
3247+ # we don't have enough overlap, so remove all but the largest group.
3248+ # the key contains a list, with the largest group first (thanks networkX!)
3249+ # we can therefore just remove trees from everything but the first in the list
3250+ delete_me = []
3251+ for t in key_list[1::]: # skip 0
3252+ delete_me.extend(t)
3253+ for tree in delete_me:
3254+ phyml = supertree_toolkit._swap_tree_in_XML(phyml, None, tree, delete=True) # delete the tree and clean the data as we go
3255+ # save phyml
3256+ f = open(os.path.join(dirname,project_name+"_data_tax_overlap.phyml"), "w")
3257+ f.write(phyml)
3258+ f.close()
3259+
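Step 7 above keeps only the largest connected group of trees: `key_list` arrives sorted largest-group-first, so everything after index 0 is scheduled for deletion. A minimal sketch of that selection, with hypothetical tree names (the real code then calls `_swap_tree_in_XML` with `delete=True` for each name):

```python
def trees_to_delete(key_list):
    """key_list: groups of tree names, largest connected group first.
    Returns the names in every group except the first (the keepers)."""
    delete_me = []
    for group in key_list[1:]:  # keep group 0, drop the rest
        delete_me.extend(group)
    return delete_me

doomed = trees_to_delete([['t1', 't2', 't3'], ['t4'], ['t5', 't6']])
```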
3260+
3261+ # 8) Create matrix
3262+ if verbose:
3263+ print "Creating matrix"
3264+ try:
3265+ matrix = supertree_toolkit.create_matrix(phyml)
3266+ except NotUniqueError as detail:
3267+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3268+ print msg
3269+ return
3270+ except InvalidSTKData as detail:
3271+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3272+ print msg
3273+ return
3274+ except UninformativeTreeError as detail:
3275+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3276+ print msg
3277+ return
3278+ except TreeParseError as detail:
3279+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3280+ print msg
3281+ return
3282+ except:
3283+ msg = "***Error: Failed to create matrix due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
3284+ print msg
3285+ traceback.print_exc()
3286+ return
3287+
3288+ f = open(output, "w")
3289+ f.write(matrix)
3290+ f.close()
3291+
3292+ return
3293+
3294+
3295+def _equivalents_to_csv(equivalents):
3296+
3297+ output_string = 'Taxa,Equivalents,Status\n'
3298+
3299+ for taxon in sorted(equivalents):
3300+ output_string += taxon + "," + ';'.join(equivalents[taxon][0]) + "," + equivalents[taxon][1] + "\n"
3301+
3302+ return output_string
3303+
3304+
3305+def _equivalents_to_subs(equivalents):
3306+ """Only corrects the yellow ones. Red and green are left alone"""
3307+
3308+ output_string = ""
3309+ for taxon in sorted(equivalents):
3310+ if (equivalents[taxon][1] == 'yellow'):
3311+ # the first name is always the correct one
3312+ output_string += taxon + " = "+equivalents[taxon][0][0]+"\n"
3313+ return output_string
3314
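`_equivalents_to_subs` emits one `old = new` line per yellow taxon, taking the first synonym as the accepted name. A short sketch of the round trip, using a made-up equivalents dict in the same shape (`taxon -> [[synonyms], status]`):

```python
def equivalents_to_subs(equivalents):
    """Only yellow entries are auto-corrected; red and green are left alone."""
    out = ""
    for taxon in sorted(equivalents):
        if equivalents[taxon][1] == 'yellow':
            # the first name in the synonym list is always the accepted one
            out += taxon + " = " + equivalents[taxon][0][0] + "\n"
    return out

equivalents = {
    'Felis_concolor': [['Puma_concolor', 'Felis_concolor'], 'yellow'],
    'Puma_concolor': [['Puma_concolor'], 'green'],
    'Madeupus_taxon': [['Madeupus_taxon'], 'red'],
}
subs = equivalents_to_subs(equivalents)
```

The resulting string is exactly the subs-file format that `parse_subs_file` reads back in during `process()`.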
3315 if __name__ == "__main__":
3316 main()
3317
3318=== modified file 'stk/stk_exceptions.py'
3319--- stk/stk_exceptions.py 2013-10-22 08:26:54 +0000
3320+++ stk/stk_exceptions.py 2017-01-12 09:27:31 +0000
3321@@ -134,4 +134,12 @@
3322 def __init__(self, msg):
3323 self.msg = msg
3324
3325+class NoneCompleteTaxonomy(Error):
3326+ """Exception raised when a taxonomy is not complete for these data
3327+ Attributes:
3328+ msg -- explanation of error
3329+ """
3330+
3331+ def __init__(self, msg):
3332+ self.msg = msg
3333
3334
3335=== modified file 'stk/supertree_toolkit.py'
3336--- stk/supertree_toolkit.py 2017-01-11 15:16:21 +0000
3337+++ stk/supertree_toolkit.py 2017-01-12 09:27:31 +0000
3338@@ -44,15 +44,49 @@
3339 import unicodedata
3340 from stk_internals import *
3341 from copy import deepcopy
3342+import Queue
3343+import threading
3344+import urllib2
3345+from urllib import quote_plus
3346+import simplejson as json
3347+import time
3348 import types
3349
3350 #plt.ion()
3351
3352+sys.setrecursionlimit(50000)
3353 # GLOBAL VARIABLES
3354 IDENTICAL = 0
3355 SUBSET = 1
3356 PLATFORM = sys.platform
3357-taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3358+#Logging
3359+import logging
3360+logging.basicConfig(filename='supertreetoolkit.log', level=logging.DEBUG, format='%(asctime)s %(levelname)s:%(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
3361+
3362+# taxonomy levels
3363+# What we get from EOL
3364+current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
3365+# And the extra ones from ITIS
3366+extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
3367+# all of them in order
3368+taxonomy_levels = ['species','subgenus','genus','tribe','subfamily','family','superfamily','subsection','section','parvorder','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3369+
3370+SPECIES = taxonomy_levels[0]
3371+GENUS = taxonomy_levels[2]
3372+FAMILY = taxonomy_levels[5]
3373+SUPERFAMILY = taxonomy_levels[6]
3374+INFRAORDER = taxonomy_levels[10]
3375+SUBORDER = taxonomy_levels[11]
3376+ORDER = taxonomy_levels[12]
3377+SUPERORDER = taxonomy_levels[13]
3378+SUBCLASS = taxonomy_levels[14]
3379+CLASS = taxonomy_levels[15]
3380+SUBPHYLUM = taxonomy_levels[17]
3381+PHYLUM = taxonomy_levels[18]
3382+SUPERPHYLUM = taxonomy_levels[19]
3383+INFRAKINGDOM = taxonomy_levels[20]
3384+SUBKINGDOM = taxonomy_levels[21]
3385+KINGDOM = taxonomy_levels[22]
3386
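The block of positional constants above is fragile: every hard-coded index has to track the growing `taxonomy_levels` list by hand, and inserting a new rank shifts every constant below it. A minimal sketch of a name-based lookup instead (shortened list, illustrative names only):

```python
# Shortened stand-in for the module's taxonomy_levels
taxonomy_levels = ['species', 'subgenus', 'genus', 'tribe', 'subfamily', 'family']

# Map each rank name to its position once; use this for any ordering comparison.
RANK = {name: i for i, name in enumerate(taxonomy_levels)}

GENUS = taxonomy_levels[RANK['genus']]  # always 'genus', wherever it sits
```

`RANK` also makes "is rank A above rank B?" a simple integer comparison, which is what the auto-subs logic ultimately needs.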
3387 # supertree_toolkit is the backend for the STK. Loaded by both the GUI and
3388 # CLI, this contains all the functions to actually *do* something
3389@@ -60,6 +94,17 @@
3390 # All functions take XML and a list of other arguments, process the data and return
3391 # it back to the user interface handler to save it somewhere
3392
3393+
3394+def get_project_name(XML):
3395+ """
3396+ Get the name of the dataset currently being worked on
3397+ """
3398+
3399+ xml_root = _parse_xml(XML)
3400+
3401+ return xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3402+
3403+
3404 def create_name(authors, year, append=''):
3405 """
3406 Construct a sensible from a list of authors and a year for a
3407@@ -161,6 +206,22 @@
3408
3409 return names
3410
3411+def get_all_tree_names(XML):
3412+ """ From a full XML-PHYML string, extract all tree names.
3413+ """
3414+
3415+ xml_root = _parse_xml(XML)
3416+ find = etree.XPath("//source")
3417+ sources = find(xml_root)
3418+ names = []
3419+ for s in sources:
3420+ for st in s.xpath("source_tree"):
3421+ if 'name' in st.attrib and not st.attrib['name'] == "":
3422+ names.append(st.attrib['name'])
3423+
3424+ return names
3425+
3426+
3427 def set_unique_names(XML):
3428 """ Ensures all sources have unique names.
3429 """
3430@@ -249,9 +310,17 @@
3431 if (ele.tag == "source"):
3432 sources.append(ele)
3433
3434+ if overwrite:
3435+ # remove all the names first
3436+ for s in sources:
3437+ for st in s.xpath("source_tree"):
3438+ if 'name' in st.attrib:
3439+ del st.attrib['name']
3440+
3441+
3442 for s in sources:
3443 for st in s.xpath("source_tree"):
3444- if overwrite or not 'name' in st.attrib:
3445+ if not 'name' in st.attrib:
3446 tree_name = create_tree_name(XML,st)
3447 st.attrib['name'] = tree_name
3448
3449@@ -339,7 +408,7 @@
3450 taxa = etree.SubElement(s_tree,"taxa_data")
3451 taxa.tail="\n "
3452 # Note: we do not add all elements as otherwise they get set to some option
3453- # rather than remaining blank (and hence blue int he interface)
3454+ # rather than remaining blank (and hence blue in the interface)
3455
3456 # append our new source to the main tree
3457 # if sources has no valid source, overwrite,
3458@@ -877,7 +946,7 @@
3459 # Need to add checks on the file. Problems include:
3460 # TNT: outputs Phyllip format or something - basically a Newick
3461 # string without commas, so add 'em back in
3462- m = re.search(r'proc-;', content)
3463+ m = re.search(r'proc.;', content)
3464 if (m != None):
3465 # TNT output tree
3466 # Done on a Mac? Replace ^M with a newline
3467@@ -1402,6 +1471,36 @@
3468
3469 return _amalgamate_trees(trees,format,anonymous)
3470
3471+def get_taxa_from_tree_for_taxonomy(tree, pretty=False, ignoreErrors=False):
3472+ """Returns a list of all taxa available for the tree passed as argument.
3473+ :param tree: string with the data for the tree in Newick format.
3474+ :type tree: string
3475+ :param pretty: defines if '_' in taxa names should be replaced with spaces.
3476+ :type pretty: boolean
3477+ :param ignoreErrors: should execution continue on error?
3478+ :type ignoreErrors: boolean
3479+ :returns: list of strings with the taxa names, sorted alphabetically
3480+ :rtype: list
3481+ """
3482+ taxa_list = []
3483+
3484+ try:
3485+ taxa_list.extend(_getTaxaFromNewick(tree))
3486+ except TreeParseError as detail:
3487+ if (ignoreErrors):
3488+ logging.warning(detail.msg)
3489+ pass
3490+ else:
3491+ raise TreeParseError( detail.msg )
3492+
3493+ # now uniquify the list of taxa
3494+ taxa_list = _uniquify(taxa_list)
3495+ taxa_list.sort()
3496+
3497+ if (pretty):
3498+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3499+
3500+ return taxa_list
3501
3502 def get_all_taxa(XML, pretty=False, ignoreErrors=False):
3503 """ Produce a taxa list by scanning all trees within
3504@@ -1422,21 +1521,17 @@
3505 taxa_list.extend(_getTaxaFromNewick(t))
3506 except TreeParseError as detail:
3507 if (ignoreErrors):
3508+ logging.warning(detail.msg)
3509 pass
3510 else:
3511 raise TreeParseError( detail.msg )
3512
3513-
3514-
3515 # now uniquify the list of taxa
3516 taxa_list = _uniquify(taxa_list)
3517 taxa_list.sort()
3518
3519- if (pretty):
3520- unpretty_tl = taxa_list
3521- taxa_list = []
3522- for t in unpretty_tl:
3523- taxa_list.append(t.replace('_',' '))
3524+ if (pretty): #Remove underscores from names
3525+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3526
3527 return taxa_list
3528
3529@@ -1508,7 +1603,7 @@
3530 return outgroups
3531
3532
3533-def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False):
3534+def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False, verbose=False):
3535 """ From all trees in the XML, create a matrix
3536 """
3537
3538@@ -1553,7 +1648,7 @@
3539 taxa.sort()
3540 taxa.insert(0,"MRP_Outgroup")
3541
3542- return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights)
3543+ return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights,verbose=verbose)
3544
3545
3546 def create_matrix_from_trees(trees,format="hennig"):
3547@@ -1925,7 +2020,7 @@
3548 _check_data(XML)
3549
3550 xml_root = _parse_xml(XML)
3551- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3552+ proj_name = get_project_name(XML)
3553
3554 output_string = "======================\n"
3555 output_string += " Data summary of: " + proj_name + "\n"
3556@@ -1989,6 +2084,188 @@
3557
3558 return output_string
3559
3560+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
3561+ """ For each name in the database generate a database of the original name,
3562+ possible synonyms and if the taxon is not known, signal that. We do this by
3563+ using the EoL API to grab synonyms of each taxon. """
3564+
3565+ import urllib2
3566+ from urllib import quote_plus
3567+ import simplejson as json
3568+
3569+ if existing_data == None:
3570+ equivalents = {}
3571+ else:
3572+ equivalents = existing_data
3573+
3574+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
3575+ # if not, is there another API function to do this?
3576+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
3577+ # you get the original ('correct') name - shorten this to two words and you're done.
3578+ for t in name_list:
3579+ if t in equivalents:
3580+ continue
3581+ taxon = t.replace("_"," ")
3582+ if (verbose):
3583+ print "Looking up ", taxon
3584+ # get the data from EOL on taxon
3585+ taxonq = quote_plus(taxon)
3586+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3587+ req = urllib2.Request(URL)
3588+ opener = urllib2.build_opener()
3589+ f = opener.open(req)
3590+ data = json.load(f)
3591+ # check if there's some data
3592+ if len(data['results']) == 0:
3593+ equivalents[t] = [[t],'red']
3594+ continue
3595+ amber = False
3596+ if len(data['results']) > 1:
3597+ # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
3598+ # for automatic processing we'll just take the first one though
3599+ # colour is amber in this case
3600+ amber = True
3601+ ID = str(data['results'][0]['id']) # take first hit
3602+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=0&videos=0&sounds=0&maps=0&text=0&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
3603+ req = urllib2.Request(URL)
3604+ opener = urllib2.build_opener()
3605+
3606+ try:
3607+ f = opener.open(req)
3608+ except urllib2.HTTPError:
3609+ equivalents[t] = [[t],'red']
3610+ continue
3611+ data = json.load(f)
3612+ if len(data['scientificName']) == 0:
3613+ # not found a scientific name, so set as red
3614+ equivalents[t] = [[t],'red']
3615+ continue
3616+ correct_name = data['scientificName'].encode("ascii","ignore")
3617+ # we only want the first two bits of the name, not the original author and year if any
3618+ temp_name = correct_name.split(' ')
3619+ if (len(temp_name) > 2):
3620+ correct_name = ' '.join(temp_name[0:2])
3621+ correct_name = correct_name.replace(' ','_')
3622+
3623+ # build up the output dictionary - original name is key, synonyms/missing is value
3624+ if (correct_name == t):
3625+ # if the original matches the 'correct', then it's green
3626+ equivalents[t] = [[t], 'green']
3627+ else:
3628+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
3629+ # 'correct' taxon at the top
3630+ eol_synonyms = data['synonyms']
3631+ synonyms = []
3632+ for s in eol_synonyms:
3633+ ts = s['synonym'].encode("ascii","ignore")
3634+ temp_syn = ts.split(' ')
3635+ if (len(temp_syn) > 2):
3636+ temp_syn = ' '.join(temp_syn[0:2])
3637+ ts = temp_syn
3638+ if (s['relationship'] == "synonym"):
3639+ ts = ts.replace(" ","_")
3640+ synonyms.append(ts)
3641+ synonyms = _uniquify(synonyms)
3642+ # we need to put the correct name at the top of the list now
3643+ if (correct_name in synonyms):
3644+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
3645+ elif len(synonyms) == 0:
3646+ synonyms.append(correct_name)
3647+ else:
3648+ synonyms.insert(0,correct_name)
3649+
3650+ if (amber):
3651+ equivalents[t] = [synonyms,'amber']
3652+ else:
3653+ equivalents[t] = [synonyms,'yellow']
3654+ # if our search was empty, then it's red - see above
3655+
3656+ # up to the calling function to do something sensible with this
3657+ # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow or red.
3658+ # Amber means we found synonyms and multiple hits. The user needs to sort these out!
3659+
3660+ return equivalents
3661+
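The `equivalents` structure handed back to the caller can be illustrated with a short sketch (hypothetical taxon names; written for Python 3 for brevity, whereas the toolkit itself targets Python 2):

```python
# A minimal illustration of the 'equivalents' structure returned above:
# key = original name, value = [list of synonyms (preferred name first), status].
equivalents = {
    "Ardea_goliath": [["Ardea_goliath"], "green"],                        # exact match
    "Gallus_gallus": [["Gallus_gallus", "Gallus_domesticus"], "yellow"],  # synonyms found
    "Madeupus_taxon": [["Madeupus_taxon"], "red"],                        # not found at all
}

# Typical post-processing: pull out the names that need user attention.
needs_attention = sorted(t for t, (syns, status) in equivalents.items()
                         if status in ("yellow", "amber", "red"))
print(needs_attention)
```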
3662+def taxonomic_checker_tree(tree_file,existing_data=None,verbose=False):
3663+ """ For each name in the database generate a database of the original name,
3664+ possible synonyms and if the taxon is not known, signal that. We do this by
3665+ using the EoL API to grab synonyms of each taxon. """
3666+
3667+ tree = import_tree(tree_file)
3668+ p4tree = _parse_tree(tree)
3669+ taxa = p4tree.getAllLeafNames(p4tree.root)
3670+ if existing_data is None:
3671+ equivalents = {}
3672+ else:
3673+ equivalents = existing_data
3674+
3675+ equivalents = taxonomic_checker_list(taxa,equivalents,verbose)
3676+ return equivalents
3677+
3678+def taxonomic_checker(XML,existing_data=None,verbose=False):
3679+ """ For each name in the database generate a database of the original name,
3680+ possible synonyms and if the taxon is not known, signal that. We do this by
3681+ using the EoL API to grab synonyms of each taxon. """
3682+
3683+ # grab all taxa
3684+ taxa = get_all_taxa(XML)
3685+
3686+ if existing_data is None:
3687+ equivalents = {}
3688+ else:
3689+ equivalents = existing_data
3690+
3691+ equivalents = taxonomic_checker_list(taxa,equivalents,verbose)
3692+ return equivalents
3693+
3694+
3695+def load_equivalents(equiv_csv):
3696+ """Load equivalents data from a csv and convert to a equivalents Dict.
3697+ Structure is key, with a list that is array of synonyms, followed by status ('green',
3698+ 'yellow' or 'red').
3699+
3700+ """
3701+
3702+ import csv
3703+
3704+ equivalents = {}
3705+
3706+ with open(equiv_csv, 'rU') as csvfile:
3707+ equiv_reader = csv.reader(csvfile, delimiter=',')
3708+ equiv_reader.next() # skip header
3709+ for row in equiv_reader:
3710+ equivalents[row[0]] = [row[1].split(';'),row[2]]
3712+
3713+ return equivalents
3714+
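The CSV layout `load_equivalents` assumes (name, semicolon-separated synonyms, status, after a header row) can be sketched with the standard-library `csv` module; Python 3 shown here (`next(reader)` rather than the Python 2 `reader.next()` used in the diff):

```python
import csv
import io

# The column layout assumed by load_equivalents: header row, then
# name, semicolon-separated synonyms, status.
csv_text = "name,synonyms,status\nGallus_gallus,Gallus_gallus;Gallus_domesticus,yellow\n"

equivalents = {}
reader = csv.reader(io.StringIO(csv_text), delimiter=',')
next(reader)  # skip the header row
for row in reader:
    equivalents[row[0]] = [row[1].split(';'), row[2]]

print(equivalents["Gallus_gallus"])
```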
3715+def save_taxonomy(taxonomy, output_file):
3716+ """Save a taxonomy dictionary to CSV: one row per OTU, one column per taxonomy level, plus the provider."""
3717+
3717+ import csv
3718+
3719+ with open(output_file, 'w') as f:
3720+ writer = csv.writer(f)
3721+ row = ['OTU']
3722+ row.extend(taxonomy_levels)
3723+ row.append('Provider')
3724+ writer.writerow(row)
3725+ for t in taxonomy:
3726+ species = t
3727+ row = []
3728+ row.append(t.encode('utf-8'))
3729+ for l in taxonomy_levels:
3730+ try:
3731+ g = taxonomy[t][l]
3732+ except KeyError:
3733+ g = '-'
3734+ row.append(g.encode('utf-8'))
3735+ try:
3736+ provider = taxonomy[t]['provider']
3737+ except KeyError:
3738+ provider = "-"
3739+ row.append(provider)
3740+
3741+ writer.writerow(row)
3742
3743
3744 def load_taxonomy(taxonomy_csv):
3745@@ -2000,20 +2277,443 @@
3746
3747 with open(taxonomy_csv, 'rU') as csvfile:
3748 tax_reader = csv.reader(csvfile, delimiter=',')
3749- tax_reader.next()
3750- for row in tax_reader:
3751- current_taxonomy = {}
3752- i = 1
3753- for t in taxonomy_levels:
3754- if not row[i] == '-':
3755- current_taxonomy[t] = row[i]
3756- i = i+ 1
3757-
3758- current_taxonomy['provider'] = row[17] # data source
3759- taxonomy[row[0]] = current_taxonomy
3760-
3761- return taxonomy
3762-
3763+ try:
3764+ j = 0
3765+ for row in tax_reader:
3766+ if j == 0:
3767+ tax_levels = row[1:-1]
3768+ j += 1
3769+ continue
3770+ i = 1
3771+ current_taxonomy = {}
3772+ for t in tax_levels:
3773+ if not row[i] == '-':
3774+ current_taxonomy[t] = row[i]
3775+ i = i + 1
3776+ current_taxonomy['provider'] = row[-1] # data source
3777+ taxonomy[row[0].replace(" ","_")] = current_taxonomy
3778+ j += 1
3779+ except (IndexError, KeyError):
3780+ # skip malformed rows rather than aborting the whole load
3781+ pass
3781+
3782+ return taxonomy
3783+
3784+
3785+class TaxonomyFetcher(threading.Thread):
3786+ """ Class to provide the taxonomy fetching functionality as a threaded function to be used individually or working with a pool.
3787+ """
3788+
3789+ def __init__(self, taxonomy, lock, queue, id=0, pref_db=None, verbose=False, ignoreWarnings=False):
3790+ """ Constructor for the threaded model.
3791+ :param taxonomy: previous taxonomy (if available) or an empty dictionary to store the results.
3792+ :type taxonomy: dictionary
3793+ :param lock: lock to keep the taxonomy threadsafe.
3794+ :type lock: Lock
3795+ :param queue: queue where the taxa are kept to be processed.
3796+ :type queue: Queue of strings
3797+ :param id: id for the thread to use if messages need to be printed.
3798+ :type id: int
3799+ :param pref_db: Preferred database; when set, its taxonomy is used if it appears in the results.
3800+ :type pref_db: string
3801+ :param verbose: Show verbose messages during execution, will also define level of logging. True will set logging level to INFO.
3802+ :type verbose: boolean
3803+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors will be logged with ERROR level on the logging output.
3804+ :type ignoreWarnings: boolean
3805+ """
3806+
3807+ threading.Thread.__init__(self)
3808+ self.taxonomy = taxonomy
3809+ self.lock = lock
3810+ self.queue = queue
3811+ self.id = id
3812+ self.verbose = verbose
3813+ self.pref_db = pref_db
3814+ self.ignoreWarnings = ignoreWarnings
3815+
3816+ def run(self):
3817+ """ Gets and processes a taxon from the queue to get its taxonomy."""
3818+ while True :
3819+ if self.verbose :
3820+ logging.getLogger().setLevel(logging.INFO)
3821+ #get taxon from queue
3822+ taxon = self.queue.get()
3823+
3824+ logging.debug("Starting {} with thread #{} remaining ~{}".format(taxon,str(self.id),str(self.queue.qsize())))
3825+
3826+ #Lock access to the taxonomy
3827+ self.lock.acquire()
3828+ if not taxon in self.taxonomy: # is a new taxon, not previously in the taxonomy
3829+ #Release access to the taxonomy
3830+ self.lock.release()
3831+ if (self.verbose):
3832+ print "Looking up ", taxon
3833+ logging.info("Looking up taxon: {}".format(str(taxon)))
3834+ try:
3835+ # get the data from EOL on taxon
3836+ taxonq = quote_plus(taxon)
3837+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3838+ req = urllib2.Request(URL)
3839+ opener = urllib2.build_opener()
3840+ f = opener.open(req)
3841+ data = json.load(f)
3842+ # check if there's some data
3843+ if len(data['results']) == 0:
3844+ # try PBDB as it might be a fossil
3845+ URL = "http://paleobiodb.org/data1.1/taxa/single.json?name="+taxonq+"&show=phylo&vocab=pbdb"
3846+ req = urllib2.Request(URL)
3847+ opener = urllib2.build_opener()
3848+ f = opener.open(req)
3849+ datapbdb = json.load(f)
3850+ if (len(datapbdb['records']) == 0):
3851+ # no idea!
3852+ with self.lock:
3853+ self.taxonomy[taxon] = {}
3854+ self.queue.task_done()
3855+ continue
3856+ # otherwise, let's fill in info here - only if extinct!
3857+ if datapbdb['records'][0]['is_extant'] == 0:
3858+ this_taxonomy = {}
3859+ this_taxonomy['provider'] = 'PBDB'
3860+ for level in taxonomy_levels:
3861+ try:
3862+ if datapbdb.has_key('records'):
3863+ pbdb_lev = datapbdb['records'][0][level]
3864+ temp_lev = pbdb_lev.split(" ")
3865+ # they might have the author on the end, so strip it off
3866+ if (level == 'species'):
3867+ this_taxonomy[level] = ' '.join(temp_lev[0:2])
3868+ else:
3869+ this_taxonomy[level] = temp_lev[0]
3870+ except KeyError as e:
3871+ logging.exception("Key not found records")
3872+ continue
3873+ # add the taxon at right level too
3874+ try:
3875+ if datapbdb.has_key('records'):
3876+ current_level = datapbdb['records'][0]['rank']
3877+ this_taxonomy[current_level] = datapbdb['records'][0]['taxon_name']
3878+ except KeyError as e:
3879+ self.queue.task_done()
3880+ logging.exception("Key not found records")
3881+ continue
3882+ with self.lock:
3883+ self.taxonomy[taxon] = this_taxonomy
3884+ self.queue.task_done()
3885+ continue
3886+ else:
3887+ # extant, but not in EoL - leave the user to sort this one out
3888+ with self.lock:
3889+ self.taxonomy[taxon] = {}
3890+ self.queue.task_done()
3891+ continue
3892+
3893+
3894+ ID = str(data['results'][0]['id']) # take first hit
3895+ # Now look for taxonomies
3896+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
3897+ req = urllib2.Request(URL)
3898+ opener = urllib2.build_opener()
3899+ f = opener.open(req)
3900+ data = json.load(f)
3901+ if len(data['taxonConcepts']) == 0:
3902+ with self.lock:
3903+ self.taxonomy[taxon] = {}
3904+ self.queue.task_done()
3905+ continue
3906+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
3907+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
3908+ # loop through and get preferred one if specified
3909+ # now get taxonomy
3910+ if (not self.pref_db is None):
3911+ for db in data['taxonConcepts']:
3912+ currentdb = db['nameAccordingTo'].lower()
3913+ if (self.pref_db.lower() in currentdb):
3914+ TID = str(db['identifier'])
3915+ break
3916+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
3917+ req = urllib2.Request(URL)
3918+ opener = urllib2.build_opener()
3919+ f = opener.open(req)
3920+ data = json.load(f)
3921+ this_taxonomy = {}
3922+ this_taxonomy['provider'] = currentdb
3923+ for a in data['ancestors']:
3924+ try:
3925+ if a.has_key('taxonRank') :
3926+ temp_level = a['taxonRank'].encode("ascii","ignore")
3927+ if (temp_level in taxonomy_levels):
3928+ # note the dump into ASCII
3929+ temp_name = a['scientificName'].encode("ascii","ignore")
3930+ temp_name = temp_name.split(" ")
3931+ if (temp_level == 'species'):
3932+ this_taxonomy[temp_level] = ' '.join(temp_name[0:2])
3933+
3934+ else:
3935+ this_taxonomy[temp_level] = temp_name[0]
3936+ except KeyError as e:
3937+ logging.exception("Key not found: taxonRank")
3938+ continue
3939+ try:
3940+ # add taxonomy in to the taxonomy!
3941+ # some issues here, so let's make sure it's OK
3942+ temp_name = taxon.split(" ")
3943+ if data.has_key('taxonRank') :
3944+ if not data['taxonRank'].lower() == 'species':
3945+ this_taxonomy[data['taxonRank'].lower()] = temp_name[0]
3946+ else:
3947+ this_taxonomy[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
3948+ except KeyError as e:
3949+ self.queue.task_done()
3950+ logging.exception("Key not found: taxonRank")
3951+ continue
3952+ with self.lock:
3953+ #Send result to dictionary
3954+ self.taxonomy[taxon] = this_taxonomy
3955+ except (urllib2.HTTPError, urllib2.URLError):
3956+ print("Network error when processing {}".format(taxon))
3957+ logging.info("Network error when processing {}".format(taxon))
3958+ self.queue.task_done()
3959+ continue
3965+ else :
3966+ #Nothing to do release the lock on taxonomy
3967+ self.lock.release()
3968+ #Mark task as done
3969+ self.queue.task_done()
3970+
3971+def create_taxonomy_from_taxa(taxa, taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False, threadNumber=5):
3972+ """Uses the taxa provided to generate a taxonomy for all of them.
3973+ :param taxa: list of the taxa.
3974+ :type taxa : list
3975+ :param taxonomy: previous taxonomy available (if available) or an empty
3976+ dictionary to store the results. If None will be init to an empty dictionary
3977+ :type taxonomy: dictionary
3978+ :param pref_db: Preferred database; passed through to the taxonomy fetchers.
3979+ :type pref_db: string
3980+ :param verbose: Show verbose messages during execution, will also define
3981+ level of logging. True will set logging level to INFO.
3982+ :type verbose: boolean
3983+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors
3984+ will be logged with ERROR level on the logging output.
3985+ :type ignoreWarnings: boolean
3986+ :param threadNumber: Maximum number of threads to use for taxonomy processing.
3987+ :type threadNumber: int
3988+ :returns: dictionary with resulting taxonomy for each taxon (keys)
3989+ :rtype: dictionary
3990+ """
3991+ if verbose :
3992+ logging.getLogger().setLevel(logging.INFO)
3993+ if taxonomy is None:
3994+ taxonomy = {}
3995+
3996+ lock = threading.Lock()
3997+ queue = Queue.Queue()
3998+
3999+ #Starting a few threads as daemons checking the queue
4000+ for i in range(threadNumber) :
4001+ t = TaxonomyFetcher(taxonomy, lock, queue, i, pref_db, verbose, ignoreWarnings)
4002+ t.setDaemon(True)
4003+ t.start()
4004+
4005+ # Populate the queue with the taxa.
4006+ for taxon in taxa :
4007+ queue.put(taxon)
4008+
4009+ #Wait till everyone finishes
4010+ queue.join()
4011+ logging.getLogger().setLevel(logging.WARNING)
4012+
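The worker-pool pattern used by `create_taxonomy_from_taxa` (daemon threads draining a shared queue, writing into a lock-guarded dict, `queue.join()` to wait) can be stripped down to a sketch. This uses the Python 3 module names (`queue`, `Thread(daemon=True)`); the toolkit itself uses the Python 2 `Queue` module and `setDaemon`. The lookup itself is replaced by a stand-in:

```python
import queue
import threading

# Each worker pulls taxa from a shared queue and writes results into a dict
# guarded by a lock, mirroring the TaxonomyFetcher structure above.
def worker(results, lock, q):
    while True:
        taxon = q.get()
        looked_up = {"genus": taxon.split("_")[0]}  # stand-in for the EoL/PBDB lookup
        with lock:
            results[taxon] = looked_up
        q.task_done()

results, lock, q = {}, threading.Lock(), queue.Queue()
for _ in range(3):
    threading.Thread(target=worker, args=(results, lock, q), daemon=True).start()

for taxon in ["Gallus_gallus", "Ardea_goliath"]:
    q.put(taxon)
q.join()  # blocks until every task_done() has been called
print(sorted(results))
```

Note the design choice this mirrors: daemon threads never exit their loop; the program relies on `q.join()` for completion and thread teardown at interpreter exit.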
4013+def create_taxonomy_from_tree(tree, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4014+ """ Generates the taxonomy from a tree. Uses a similar method to the XML version but works directly on a string with the tree.
4015+ :param tree: the tree, as a newick string.
4016+ :type tree: string
4017+ :param existing_taxonomy: previously built taxonomy (if any).
4018+ :type existing_taxonomy: dictionary
4019+ :param pref_db: Preferred database; passed through to the taxonomy fetchers.
4020+ :type pref_db: string
4021+ :param verbose: Flag for verbosity.
4022+ :type verbose: boolean
4023+ :param ignoreWarnings: Flag for exception processing.
4024+ :type ignoreWarnings: boolean
4025+ :returns: the modified taxonomy
4026+ :rtype: dictionary
4027+ """
4028+ starttime = time.time()
4029+
4030+ if(existing_taxonomy is None) :
4031+ taxonomy = {}
4032+ else :
4033+ taxonomy = existing_taxonomy
4034+
4035+ taxa = get_taxa_from_tree_for_taxonomy(tree, pretty=True)
4036+
4037+ create_taxonomy_from_taxa(taxa, taxonomy, pref_db, verbose, ignoreWarnings)
4038+
4039+ taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4040+
4041+ return taxonomy
4042+
4043+def create_taxonomy(XML, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4044+ """Generates a taxonomy of the data from EoL data. This is stored as a
4045+ dictionary of taxonomy for each taxon in the dataset. Missing data are
4046+ encoded as '' (blank string). It's up to the calling function to store this
4047+ data to file or display it."""
4048+
4049+ starttime = time.time()
4050+
4051+ if not ignoreWarnings:
4052+ _check_data(XML)
4053+
4054+ if (existing_taxonomy is None):
4055+ taxonomy = {}
4056+ else:
4057+ taxonomy = existing_taxonomy
4058+ taxa = get_all_taxa(XML, pretty=True)
4059+ create_taxonomy_from_taxa(taxa, taxonomy, pref_db, verbose, ignoreWarnings)
4060+ #taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4061+ return taxonomy
4062+
4063+def create_extended_taxonomy(taxonomy, starttime, verbose=False, ignoreWarnings=False):
4064+ """Bring in extra taxonomy terms from other databases; shared method for completing the taxonomy
4065+ both for trees coming from XML or directly from tree strings.
4066+ :param taxonomy: Dictionary with the relationship for taxa and taxonomy terms.
4067+ :type taxonomy: dictionary
4068+ :param starttime: time to keep track of processing time.
4069+ :type starttime: long
4070+ :param verbose: Flag for verbosity.
4071+ :type verbose: boolean
4072+ :param ignoreWarnings: Flag for exception processing.
4073+ :type ignoreWarnings: boolean
4074+ :returns: the modified taxonomy
4075+ :rtype: dictionary
4076+ """
4077+
4078+ if (verbose):
4079+ logging.info('Done basic taxonomy, getting more info from ITIS')
4080+ print("Time elapsed {}".format(str(time.time() - starttime)))
4081+ print "Done basic taxonomy, getting more info from ITIS"
4082+ # fill in the rest of the taxonomy
4083+ # get all genera
4084+ genera = []
4085+ for t in taxonomy:
4086+ if GENUS in taxonomy[t]:
4087+ genera.append(taxonomy[t][GENUS])
4089+ genera = _uniquify(genera)
4090+ # We then use ITIS to fill in missing info based on the genera only - that saves us a species level search
4091+ # and we can fill in most of the EoL missing data
4092+ for g in genera:
4093+ if (verbose):
4094+ print "Looking up ", g
4095+ logging.info("Looking up {}".format(str(g)))
4096+ try:
4097+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(g.strip())
4098+ except:
4099+ continue
4100+ req = urllib2.Request(URL)
4101+ opener = urllib2.build_opener()
4102+ try:
4103+ f = opener.open(req)
4104+ except urllib2.HTTPError:
4105+ continue
4106+ string = unicode(f.read(),"ISO-8859-1")
4107+ data = json.loads(string)
4108+ if data['scientificNames'][0] is None:
4109+ continue
4110+ tsn = data["scientificNames"][0]["tsn"]
4111+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
4112+ req = urllib2.Request(URL)
4113+ opener = urllib2.build_opener()
4114+ f = opener.open(req)
4115+ try:
4116+ string = unicode(f.read(),"ISO-8859-1")
4117+ except:
4118+ continue
4119+ data = json.loads(string)
4120+ this_taxonomy = {}
4121+ for level in data['hierarchyList']:
4122+ if not level['rankName'].lower() in current_taxonomy_levels:
4123+ # note the dump into ASCII
4124+ if level['rankName'].lower() == 'species':
4125+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = ' '.join(level['taxonName'].split(' ')[0:2]).encode("ascii","ignore")
4126+ else:
4127+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
4128+
4129+ for t in taxonomy:
4130+ if GENUS in taxonomy[t]:
4131+ if taxonomy[t][GENUS] == g:
4132+ taxonomy[t].update(this_taxonomy)
4134+
4135+ return taxonomy
4136+
4137+def generate_species_level_data(XML, taxonomy, ignoreWarnings=False, verbose=False):
4138+ """ Based on a taxonomy data set, amend the data to be at species level as
4139+ far as possible. This function creates an internal 'subs file' and calls
4140+ the standard substitution functions. The internal subs are generated by
4141+ looping over the taxa and if not at species-level, working out which level
4142+ they are at and then adding species already in the dataset to replace it
4143+ via a polytomy. This has to be done in one step to avoid adding spurious
4144+ structure to the phylogenies """
4145+
4146+ if not ignoreWarnings:
4147+ _check_data(XML)
4148+
4149+ # if taxonomic checker not done, warn
4150+ if (not taxonomy):
4151+ raise NoneCompleteTaxonomy("Taxonomy is empty. Create a taxonomy first. You'll probably need to hand-edit the file to complete it.")
4153+
4154+ # if missing data in taxonomy, warn
4155+ taxa = get_all_taxa(XML)
4156+ keys = taxonomy.keys()
4157+ if (not ignoreWarnings):
4158+ for t in taxa:
4159+ t = t.replace("_"," ")
4160+ if not t in keys:
4161+ # This idea here is that the caller will catch this, then re-run with ignoreWarnings set to True
4162+ raise NoneCompleteTaxonomy("Taxonomy is not complete. I will soldier on anyway, but this might not work as intended")
4163+
4164+ # get all taxa - see above!
4165+ # for each taxa, if not at species level
4166+ new_taxa = []
4167+ old_taxa = []
4168+ for t in taxa:
4169+ subs = []
4170+ t = t.replace("_"," ")
4171+ if (not SPECIES in taxonomy[t]): # the current taxon is not a species, but a higher-level taxon
4172+ # work out which level - should we encode this in the data to start with?
4173+ for tl in taxonomy_levels:
4174+ try:
4175+ tax_data = taxonomy[t][tl]
4176+ except KeyError:
4177+ continue
4178+ if (t == taxonomy[t][tl]):
4179+ current_level = tl
4180+ # find all species in the taxonomy that match this level
4181+ for taxon in taxa:
4182+ taxon = taxon.replace("_"," ")
4183+ if (SPECIES in taxonomy[taxon]):
4184+ try:
4185+ if taxonomy[taxon][current_level] == t: # our current taxon
4186+ subs.append(taxon.replace(" ","_"))
4187+ except KeyError:
4188+ continue
4189+
4190+ # create the sub
4191+ if len(subs) > 0:
4192+ old_taxa.append(t.replace(" ","_"))
4193+ new_taxa.append(','.join(subs))
4194+
4195+ # call the sub
4196+ new_XML = substitute_taxa(XML, old_taxa, new_taxa, verbose=verbose)
4197+ new_XML = clean_data(new_XML)
4198+
4199+ return new_XML
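The internal "subs file" built above maps each higher-level taxon onto a polytomy of species already in the dataset that fall under it. A tiny self-contained illustration of that loop (hypothetical taxa; genus-level only, whereas the real function walks all taxonomy levels):

```python
# Build old_taxa -> new_taxa substitution pairs, as generate_species_level_data does.
taxa = ["Gallus", "Gallus_gallus", "Gallus_varius"]
taxonomy = {
    "Gallus":        {"genus": "Gallus"},
    "Gallus gallus": {"genus": "Gallus", "species": "Gallus gallus"},
    "Gallus varius": {"genus": "Gallus", "species": "Gallus varius"},
}

old_taxa, new_taxa = [], []
for t in (x.replace("_", " ") for x in taxa):
    if "species" not in taxonomy[t]:  # a higher-level taxon: replace with its species
        subs = [other.replace(" ", "_") for other in taxonomy
                if "species" in taxonomy[other]
                and taxonomy[other]["genus"] == t]
        if subs:
            old_taxa.append(t.replace(" ", "_"))
            new_taxa.append(",".join(sorted(subs)))

print(old_taxa, new_taxa)
```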
4200
4201 def data_overlap(XML, overlap_amount=2, filename=None, detailed=False, show=False, verbose=False, ignoreWarnings=False):
4202 """ Calculate the amount of taxonomic overlap between source trees.
4203@@ -2024,7 +2724,7 @@
4204 If filename is None, no graphic is generated. Otherwise a simple
4205 graphic is generated showing the number of cluster. If detailed is set to
4206 true, a graphic is generated showing *all* trees. For data containing >200
4207- source tres this could be very big and take along time. More likely, you'll run
4208+ source trees this could be very big and take along time. More likely, you'll run
4209 out of memory.
4210 """
4211 import matplotlib
4212@@ -2103,6 +2803,7 @@
4213 sufficient_overlap = True
4214
4215 # The above list actually contains which components are seperate from each other
4216+ key_list = connected_components
4217
4218 if (not filename == None or show):
4219 if (verbose):
4220@@ -2266,7 +2967,9 @@
4221 prev_char = None
4222 prev_taxa = None
4223 prev_name = None
4224- non_ind = {}
4225+ subsets = []
4226+ identical = []
4227+ is_identical = False
4228 for data in data_ind:
4229 name = data[0]
4230 char = data[1]
4231@@ -2275,22 +2978,71 @@
4232 # when sorted, the longer list comes first
4233 if set(taxa).issubset(set(prev_taxa)):
4234 if (taxa == prev_taxa):
4235- non_ind[name] = [prev_name,IDENTICAL]
4236+ if (is_identical):
4237+ identical[-1].append(name)
4238+ else:
4239+ identical.append([name,prev_name])
4240+ is_identical = True
4241+
4242 else:
4243- non_ind[name] = [prev_name,SUBSET]
4244+ subsets.append([prev_name, name])
4245+ prev_name = name
4246+ is_identical = False
4247+ else:
4248+ prev_name = name
4249+ is_identical = False
4250+ else:
4251+ prev_name = name
4252+ is_identical = False
4253+
4254 prev_char = char
4255 prev_taxa = taxa
4256- prev_name = name
4257-
4258+
4259 if (make_new_xml):
4260 new_xml = XML
4261- for name in non_ind:
4262- if (non_ind[name][1] == SUBSET):
4263- new_xml = _swap_tree_in_XML(new_xml,None,name)
4264+ # deal with subsets
4265+ for s in subsets:
4266+ new_xml = _swap_tree_in_XML(new_xml,None,s[1])
4267 new_xml = clean_data(new_xml)
4268- return non_ind, new_xml
4269+ # deal with identical trees - weight them; if there are three identical trees each
4270+ # gets a weight of 1/3, i.e. weights are 1/(number of identical trees)
4271+ for i in identical:
4272+ weight = 1.0 / float(len(i))
4273+ new_xml = add_weights(new_xml, i, weight)
4274+
4275+ return identical, subsets, new_xml
4276 else:
4277- return non_ind
4278+ return identical, subsets
4279+
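The identical-tree weighting can be stated in a few lines: each group of identical trees shares a total weight of 1, so each member gets 1/(group size). Hypothetical tree names:

```python
# identical: groups of source trees found to be identical, as returned above.
identical = [["tree_a", "tree_b"], ["tree_c", "tree_d", "tree_e"]]

weights = {}
for group in identical:
    w = 1.0 / len(group)
    for name in group:
        weights[name] = w

print(weights)
```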
4280+
4281+def add_weights(XML, names, weight):
4282+ """ Add a weight to each named tree: supply a list of source tree names and a weight,
4283+ and the weight element is set on each. Returns a new XML.
4284+ """
4285+
4286+ xml_root = _parse_xml(XML)
4287+ # By getting source, we can then loop over each source_tree
4288+ find = etree.XPath("//source_tree")
4289+ sources = find(xml_root)
4290+ for s in sources:
4291+ s_name = s.attrib['name']
4292+ for n in names:
4293+ if s_name == n:
4294+ if s.xpath("tree/weight/real_value") == []:
4295+ # add weights
4296+ weights_element = etree.Element("weight")
4297+ weights_element.tail="\n"
4298+ real_value = etree.SubElement(weights_element,'real_value')
4299+ real_value.attrib['rank'] = '0'
4300+ real_value.tail = '\n'
4301+ real_value.text = str(weight)
4302+ t = s.xpath("tree")[0]
4303+ t.append(weights_element)
4304+ else:
4305+ s.xpath("tree/weight/real_value")[0].text = str(weight)
4306+
4307+ return etree.tostring(xml_root,pretty_print=True)
4308+
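The create-or-update logic in `add_weights` (add a `weight/real_value` element under `tree` if absent, otherwise overwrite its text) can be sketched with the standard-library `ElementTree` rather than lxml; the XML snippet is a minimal stand-in for the phyml structure:

```python
import xml.etree.ElementTree as etree

xml = '<sources><source_tree name="t1"><tree/></source_tree></sources>'
root = etree.fromstring(xml)

def add_weight(root, name, weight):
    # Find the named source_tree and set <tree><weight><real_value>.
    for s in root.iter('source_tree'):
        if s.attrib.get('name') != name:
            continue
        existing = s.findall('tree/weight/real_value')
        if not existing:
            w = etree.SubElement(s.find('tree'), 'weight')
            rv = etree.SubElement(w, 'real_value')
            rv.set('rank', '0')
            rv.text = str(weight)
        else:
            existing[0].text = str(weight)  # overwrite an existing weight

add_weight(root, 't1', 0.5)
print(root.find('source_tree/tree/weight/real_value').text)
```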
4309
4310 def add_historical_event(XML, event_description):
4311 """
4312@@ -2380,8 +3132,15 @@
4313 # check trees are informative
4314 XML = _check_informative_trees(XML,delete=True)
4315
4316+
4317 # check sources
4318 XML = _check_sources(XML,delete=True)
4319+ XML = all_sourcenames(XML)
4320+
4321+ # fix tree names
4322+ XML = set_unique_names(XML)
4323+ XML = set_all_tree_names(XML,overwrite=True)
4324+
4325
4326 # unpermutable trees
4327 permutable_trees = _find_trees_for_permuting(XML)
4328@@ -2659,7 +3418,7 @@
4329 s.getparent().remove(s)
4330
4331 # edit name (append _subset)
4332- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
4333+ proj_name = get_project_name(XML)
4334 proj_name += "_subset"
4335 xml_root.xpath('/phylo_storage/project_name/string_value')[0].text = proj_name
4336
4337@@ -2928,6 +3687,37 @@
4338
4339 return mrca
4340
4341+
4342+def tree_from_taxonomy(taxonomy, end_level, end_rank):
4343+ """Create a tree (newick string) from a taxonomy data structure.
4344+ Not the most efficient approach, but works well enough. Currently an outline only.
4345+ """
4346+
4347+ # Grab data only for the end_level classification
4348+ required_taxonomy = {}
4349+ for t in taxonomy:
4350+ if (end_level in t):
4351+ required_taxonomy[t] = taxonomy[t]
4352+
4353+ rank_index = taxonomy_levels.index(end_rank)
4354+
4355+ # create basic string
4356+
4357+ # get unique otus
4358+
4359+ # sort by the subfamily
4360+
4361+ # for each genus create a newick string
4362+
4363+ # if it's the same grouping as previous, add as sister clade (i.e. ,)
4364+ # else, prepend a (, append a ) and add new clade (ie. ,)
4365+
4366+
4367+ # return tree
4368+
4369+
4370+
4371+
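One way the comment outline above could be fleshed out: group the species of interest by a chosen rank and emit nested polytomies as a newick string. This is a sketch under assumed inputs (hypothetical data, genus rank only), not the function's final design:

```python
from collections import defaultdict

taxonomy = {
    "Gallus gallus": {"genus": "Gallus", "family": "Phasianidae"},
    "Gallus varius": {"genus": "Gallus", "family": "Phasianidae"},
    "Ardea goliath": {"genus": "Ardea",  "family": "Ardeidae"},
}

# Group species under their genus; sorting keeps the output deterministic.
clades = defaultdict(list)
for species, ranks in sorted(taxonomy.items()):
    clades[ranks["genus"]].append(species.replace(" ", "_"))

# Each genus becomes a polytomy; the genera sit as sister clades at the root.
newick = "(" + ",".join(
    "(" + ",".join(members) + ")" for _, members in sorted(clades.items())
) + ");"
print(newick)
```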
4372 ################ PRIVATE FUNCTIONS ########################
4373
4374 def _uniquify(l):
4375@@ -2975,13 +3765,25 @@
4376 "The source names in the dataset are not unique. Please run the auto-name function on these data. Name: "+name+"\n"
4377 last_name = name
4378
4379+ # do same for tree names:
4380+ names = get_all_tree_names(XML)
4381+ names.sort()
4382+ last_name = "" # This will actually throw a non-unique error if a name is empty -
4383+ # not great, but still an error!
4384+ for name in names:
4385+ if name == last_name:
4386+ # if non-unique throw exception
4387+ message = message + \
4388+ "The tree names in the dataset are not unique. Please run the auto-name function on these data with replace or edit by hand. Name: "+name+"\n"
4389+ last_name = name
4390+
4391 if (not message == ""):
4392 raise NotUniqueError(message)
4393
4394 return
4395
4396
4397-def _assemble_tree_matrix(tree_string):
4398+def _assemble_tree_matrix(tree_string, verbose=False):
4399 """ Assembles the MRP matrix for an individual tree
4400
4401 returns: matrix (2D numpy array: taxa on i, nodes on j)
4402@@ -3009,7 +3811,7 @@
4403 for i in range(0,len(names)):
4404 adjmat.append([1])
4405 adjmat = numpy.array(adjmat)
4406-
4407+ if verbose:
4408 print "Warning: Found uninformative tree in data. Including it in the matrix anyway"
4409
4410 return adjmat, names
4411@@ -3020,7 +3822,7 @@
4412
4413 If the new_taxa array is missing, simply delete the old_taxa
4414 """
4415-
4416+
4417 tree = _correctly_quote_taxa(tree)
4418 # are the input values lists or simple strings?
4419 if (isinstance(old_taxa,str)):
4420@@ -3564,7 +4366,7 @@
4421
4422 return permute_trees
4423
4424-def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None):
4425+def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None, verbose=False):
4426 """
4427 Does the hard work on creating a matrix
4428 """
4429@@ -3585,7 +4387,7 @@
4430 if (not weights == None):
4431 weight = weights[key]
4432 names.append(key)
4433- submatrix, tree_taxa = _assemble_tree_matrix(trees[key])
4434+ submatrix, tree_taxa = _assemble_tree_matrix(trees[key], verbose=verbose)
4435 nChars = len(submatrix[0,:])
4436 # loop over characters in the submatrix
4437 for i in range(1,nChars):
4438@@ -3637,7 +4439,7 @@
4439 matrix_string += string + "\n"
4440 i += 1
4441
4442- matrix_string += "\t;\n"
4443+ matrix_string += "\n"
4444 if (not weights == None):
4445 # get unique weights
4446 unique_weights = _uniquify(weights)
4447@@ -3652,7 +4454,7 @@
4448 matrix_string += " " + str(i)
4449 i += 1
4450 matrix_string += ";\n"
4451- matrix_string += "procedure /;"
4452+ matrix_string += "proc /;"
4453 elif (format == 'nexus'):
4454 matrix_string = "#nexus\n\nbegin data;\n"
4455 matrix_string += "\tdimensions ntax = "+str(len(taxa)) +" nchar = "+str(last_char)+";\n"
4456
4457=== modified file 'stk/test/_substitute_taxa.py'
4458--- stk/test/_substitute_taxa.py 2016-07-14 10:12:17 +0000
4459+++ stk/test/_substitute_taxa.py 2017-01-12 09:27:31 +0000
4460@@ -10,6 +10,7 @@
4461 from stk.supertree_toolkit import check_subs, _tree_contains, _correctly_quote_taxa, _remove_single_poly_taxa
4462 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_all_taxa, _parse_tree, _delete_taxon
4463 from stk.supertree_toolkit import _collapse_nodes, import_tree, subs_from_csv, _getTaxaFromNewick, obtain_trees
4464+from stk.supertree_toolkit import generate_species_level_data
4465 from lxml import etree
4466 from util import *
4467 from stk.stk_exceptions import *
4468@@ -776,7 +777,24 @@
4469 new_tree = _sub_taxa_in_tree(tree2,"Thereuopodina",sub_in,skip_existing=True);
4470 self.assert_(answer2, new_tree)
4471
4472-
4473+
4474+ def test_auto_subs_taxonomy(self):
4475+ """test the automatic subs function with a simple test"""
4476+ XML = etree.tostring(etree.parse('data/input/auto_sub.phyml',parser),pretty_print=True)
4477+ taxonomy = {'Ardea goliath': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea goliath'},
4478+ 'Pelecaniformes': {'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Pelecaniformes', 'class': 'Aves', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013'}, 'Gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes'},
4479+ 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4480+ 'Platalea leucorodia': {'kingdom': 'Animalia', 'subfamily': 'Plataleinae', 'family': 'Threskiornithidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Platalea', 'order': 'Pelecaniformes', 'species': 'Platalea leucorodia'},
4481+ 'Gallus lafayetii': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus lafayetii'},
4482+ 'Ardea humbloti': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea humbloti'},
4483+ 'Gallus varius': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus varius'}}
4484+ XML = generate_species_level_data(XML, taxonomy)
4485+ expected_XML = etree.tostring(etree.parse('data/output/one_click_subs_output.phyml',parser),pretty_print=True)
4486+ trees = obtain_trees(XML)
4487+ expected_trees = obtain_trees(expected_XML)
4488+ for t in trees:
4489+ self.assert_(_trees_equal(trees[t], expected_trees[t]))
4490+
4491 def test_parrot_edge_case(self):
4492 """Random edge case where the tree disappeared..."""
4493 trees = ["(((((((Agapornis_lilianae, Agapornis_nigrigenis), Agapornis_personata, Agapornis_fischeri), Agapornis_roseicollis), (Agapornis_pullaria, Agapornis_taranta)), Agapornis_cana), Loriculus_galgulus), Geopsittacus_occidentalis);"]
4494
4495=== modified file 'stk/test/_supertree_toolkit.py'
4496--- stk/test/_supertree_toolkit.py 2015-03-26 09:58:58 +0000
4497+++ stk/test/_supertree_toolkit.py 2017-01-12 09:27:31 +0000
4498@@ -7,12 +7,13 @@
4499 import os
4500 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir, os.pardir )
4501 sys.path.insert(0, stk_path)
4502-from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence
4503+from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence, add_weights
4504 from stk.supertree_toolkit import get_fossil_taxa, get_publication_years, data_summary, get_character_numbers, get_analyses_used
4505 from stk.supertree_toolkit import data_overlap, read_matrix, subs_file_from_str, clean_data, obtain_trees, get_all_source_names
4506 from stk.supertree_toolkit import add_historical_event, _sort_data, _parse_xml, _check_sources, _swap_tree_in_XML, replace_genera
4507 from stk.supertree_toolkit import get_all_taxa, _get_all_siblings, _parse_tree, get_characters_used, _trees_equal, get_weights
4508-from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, load_taxonomy
4509+from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, taxonomic_checker, load_taxonomy, load_equivalents
4510+from stk.supertree_toolkit import create_taxonomy, create_taxonomy_from_tree, get_all_tree_names
4511 from lxml import etree
4512 from util import *
4513 from stk.stk_exceptions import *
4514@@ -268,19 +269,52 @@
4515
4516 def test_data_independence(self):
4517 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4518- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4519- non_ind = data_independence(XML)
4520- self.assertDictEqual(expected_dict, non_ind)
4521+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4522+ non_ind,subsets = data_independence(XML)
4523+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4524+ self.assertListEqual(expected_subsets, subsets)
4525+ self.assertListEqual(expected_idents, non_ind)
4526
4527- def test_data_independence(self):
4528+ def test_data_independence_2(self):
4529 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4530- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4531- non_ind, new_xml = data_independence(XML,make_new_xml=True)
4532- self.assertDictEqual(expected_dict, non_ind)
4533+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4534+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4535+ non_ind, subset, new_xml = data_independence(XML,make_new_xml=True)
4536+ self.assertListEqual(expected_idents, non_ind)
4537+ self.assertListEqual(expected_subsets, subset)
4538 # check the second tree has not been removed
4539 self.assertRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4540 # check that the first tree is removed
4541 self.assertNotRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,(F:1.00000,E:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4542+
4543+ def test_add_weights(self):
4544+ """Add weights to a bunch of trees"""
4545+ XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4546+ # see above
4547+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4548+ # so the first should end up with a weight of 0.33333 and the second with 0.5
4549+ for ei in expected_idents:
4550+ weight = 1.0/float(len(ei))
4551+ XML = add_weights(XML, ei, weight)
4552+
4553+ expected_weights = [str(1.0/3.0), str(1.0/3.0), str(1.0/3.0), str(0.5), str(0.5)]
4554+ weights_in_xml = []
4555+ # now check weights have been added to the correct part of the tree
4556+ xml_root = _parse_xml(XML)
4557+ i = 0
4558+ for ei in expected_idents:
4559+ for tree in ei:
4560+ find = etree.XPath("//source_tree")
4561+ trees = find(xml_root)
4562+ for t in trees:
4563+ if t.attrib['name'] == tree:
4564+ # found the matching source_tree; record its weight
4565+ weights_in_xml.append(t.xpath("tree/weight/real_value")[0].text)
4566+
4567+ self.assertListEqual(expected_weights,weights_in_xml)
4568+
4569+
4570+
4571
4572 def test_overlap(self):
4573 XML = etree.tostring(etree.parse('data/input/check_overlap_ok.phyml',parser),pretty_print=True)
4574@@ -438,7 +472,7 @@
4575 XML = clean_data(XML)
4576 trees = obtain_trees(XML)
4577 self.assert_(len(trees) == 2)
4578- expected_trees = {'Hill_2011_4': '(A,B,(C,D,E));', 'Hill_2011_2': '(A, B, C, (D, E, F));'}
4579+ expected_trees = {'Hill_2011_2': '(A,B,(C,D,E));', 'Hill_2011_1': '(A, B, C, (D, E, F));'}
4580 for t in trees:
4581 self.assert_(_trees_equal(trees[t],expected_trees[t]))
4582
4583@@ -558,18 +592,78 @@
4584 self.assert_(c in expected_characters)
4585 self.assert_(len(characters) == len(expected_characters))
4586
4587+ def test_create_taxonomy(self):
4588+ XML = etree.tostring(etree.parse('data/input/create_taxonomy.phyml',parser),pretty_print=True)
4589+ # Tested on 11/01/17 and EOL have changed the output
4590+ # old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}}
4591+ expected = {'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}}
4592+ if (internet_on()):
4593+ taxonomy = create_taxonomy(XML)
4594+ self.maxDiff = None
4595+ self.assertDictEqual(taxonomy, expected)
4596+ else:
4597+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy function"
4598+ return
4599+
4600+ def test_create_taxonomy_from_tree(self):
4601+ """Tests if taxonomy from tree works. Uses same data for normal XML test but goes directly for the tree instead of parsing the XML """
4602+ # Tested on 11/01/17 and this no longer worked, but is correct! EOL returned something different.
4603+ #old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4604+ expected = {'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4605+ tree = "(Archaeopteryx_lithographica, (Gallus_gallus, (Thalassarche_melanophris, Egretta_tricolor)));"
4606+ if (internet_on()):
4607+ taxonomy = create_taxonomy_from_tree(tree)
4608+ self.maxDiff = None
4609+ self.assertDictEqual(taxonomy, expected)
4610+ else:
4611+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy_from_tree function"
4612+ return
4613+
4614+ def test_taxonomy_checker(self):
4615+ expected = {'Thalassarche_melanophrys': [['Thalassarche_melanophris', 'Thalassarche_melanophrys', 'Diomedea_melanophris', 'Thalassarche_[melanophrys', 'Diomedea_melanophrys'], 'amber'], 'Egretta_tricolor': [['Egretta_tricolor'], 'green'], 'Gallus_gallus': [['Gallus_gallus'], 'green']}
4616+ XML = etree.tostring(etree.parse('data/input/check_taxonomy.phyml',parser),pretty_print=True)
4617+ if (internet_on()):
4618+ equivs = taxonomic_checker(XML)
4619+ self.maxDiff = None
4620+ self.assertDictEqual(equivs, expected)
4621+ else:
4622+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4623+ return
4624+
4625+ def test_taxonomy_checker2(self):
4626+ XML = etree.tostring(etree.parse('data/input/check_taxonomy_fixes.phyml',parser),pretty_print=True)
4627+ if (internet_on()):
4628+ # This test is a bit dodgy as it depends on EOL's server speed. Run it a few times before deciding it's broken.
4629+ equivs = taxonomic_checker(XML,verbose=False)
4630+ self.maxDiff = None
4631+ self.assert_(equivs['Agathamera_crassa'][0][0] == 'Agathemera_crassa')
4632+ self.assert_(equivs['Celatoblatta_brunni'][0][0] == 'Maoriblatta_brunni')
4633+ self.assert_(equivs['Blatta_lateralis'][1] == 'amber')
4634+ else:
4635+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4636+ return
4637+
4638+
4639 def test_load_taxonomy(self):
4640 csv_file = "data/input/create_taxonomy.csv"
4641- expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'},
4642- 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'subclass': 'Neoloricata', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'suborder': 'Ischnochitonina', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes', 'species': 'Egretta tricolor'},
4643- 'Gallus gallus': {'kingdom': 'Animalia', 'infrakingdom': 'Protostomia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus gallus'},
4644- 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4645- 'Jeletzkytes criptonodosus': {'kingdom': 'Metazoa', 'family': 'Scaphitidae', 'order': 'Ammonoidea', 'phylum': 'Mollusca', 'provider': 'PBDB', 'species': 'Jeletzkytes criptonodosus', 'class': 'Cephalopoda'}}
4646+ expected = {'Jeletzkytes_criptonodosus': {'kingdom': 'Metazoa', 'subclass': 'Cephalopoda', 'species': 'Jeletzkytes criptonodosus', 'suborder': 'Ammonoidea', 'provider': 'PBDB', 'subfamily': 'Scaphitidae', 'class': 'Mollusca'}, 'Archaeopteryx_lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta_tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'subclass': 'Neoloricata', 'species': 'Egretta tricolor', 'phylum': 'Chordata', 'suborder': 'Ischnochitonina', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus_gallus': {'kingdom': 'Animalia', 'superorder': 'Galliformes', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'class': 'Aves'}, 'Thalassarche_melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'species': 'Thalassarche melanophris', 'order': 'Procellariiformes', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'class': 'Aves'}}
4647 taxonomy = load_taxonomy(csv_file)
4648 self.maxDiff = None
4649
4650 self.assertDictEqual(taxonomy, expected)
4651
4652+
4653+ def test_load_equivalents(self):
4654+ csv_file = "data/input/equivalents.csv"
4655+ expected = {'Turnix_sylvatica': [['Turnix_sylvaticus','Tetrao_sylvaticus','Tetrao_sylvatica','Turnix_sylvatica'],'yellow'],
4656+ 'Xiphorhynchus_pardalotus':[['Xiphorhynchus_pardalotus'],'green'],
4657+ 'Phaenicophaeus_curvirostris':[['Zanclostomus_curvirostris','Rhamphococcyx_curvirostris','Phaenicophaeus_curvirostris','Rhamphococcyx_curvirostr'],'yellow'],
4658+ 'Megalapteryx_benhami':[['Megalapteryx_benhami'],'red']
4659+ }
4660+ equivalents = load_equivalents(csv_file)
4661+ self.assertDictEqual(equivalents, expected)
4662+
4663+
4664 def test_name_tree(self):
4665 XML = etree.tostring(etree.parse('data/input/single_source_no_names.phyml',parser),pretty_print=True)
4666 xml_root = _parse_xml(XML)
4667@@ -583,6 +677,35 @@
4668 XML = etree.tostring(etree.parse('data/input/single_source.phyml',parser),pretty_print=True)
4669 self.assert_(isEqualXML(new_xml,XML))
4670
4671+ def test_all_rename_tree(self):
4672+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4673+ new_xml = set_all_tree_names(XML,overwrite=True)
4674+ XML = etree.tostring(etree.parse('data/output/single_source_same_tree_name.phyml',parser),pretty_print=True)
4675+ self.assert_(isEqualXML(new_xml,XML))
4676+
4677+ def test_get_all_tree_names(self):
4678+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4679+ names = get_all_tree_names(XML)
4680+ self.assertListEqual(names,['Hill_2011_2','Hill_2011_2'])
4681+
4682+
4683+def internet_on(host="8.8.8.8", port=443, timeout=5):
4684+ import socket
4685+
4686+ """
4687+ Host: 8.8.8.8 (google-public-dns-a.google.com)
4766+ OpenPort: 443/tcp
4767+ Service: https (TCP)
4690+ """
4691+ try:
4692+ socket.setdefaulttimeout(timeout)
4693+ socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect((host, port))
4694+ return True
4695+ except Exception as ex:
4696+ print ex.message
4697+ return False
4698+
4699+
4700
4701 if __name__ == '__main__':
4702 unittest.main()
4703
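The new `test_add_weights` test above expects each group of non-independent trees to share a total weight of 1.0, so every member gets `1/len(group)` (hence the asserted 1/3 and 0.5 values). A standalone sketch of that arithmetic, using a hypothetical helper name that is not part of the toolkit:

```python
# Hypothetical helper: each group of non-independent trees shares a total
# weight of 1.0, so every member is down-weighted to 1.0 / len(group).
def weights_for_groups(groups):
    weights = {}
    for group in groups:
        w = 1.0 / len(group)
        for name in group:
            weights[name] = w
    return weights

# The two groups the test derives from check_data_ind.phyml.
groups = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'],
          ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
print(weights_for_groups(groups))
```

The real `add_weights` writes these values into the `tree/weight/real_value` elements that the test then reads back with XPath.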
4704=== modified file 'stk/test/_trees.py'
4705--- stk/test/_trees.py 2015-03-26 09:58:58 +0000
4706+++ stk/test/_trees.py 2017-01-12 09:27:31 +0000
4707@@ -5,7 +5,7 @@
4708 sys.path.insert(0,"../../")
4709 from stk.supertree_toolkit import import_tree, obtain_trees, get_all_taxa, _assemble_tree_matrix, create_matrix, _delete_taxon, _sub_taxon,_tree_contains
4710 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_taxa_from_tree, get_characters_from_tree, amalgamate_trees, _uniquify
4711-from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick
4712+from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick, _parse_tree
4713 from stk.supertree_toolkit import get_mrca
4714 import os
4715 from lxml import etree
4716@@ -215,6 +215,18 @@
4717 mrca = get_mrca(tree,["A","I", "L"])
4718 self.assert_(mrca == 8)
4719
4720+ def test_get_mrca_2(self):
4721+ tree = "(B,(C,(D,(E,((A,F),((I,(G,H)),(J,(K,L))))))));"
4722+ mrca = get_mrca(tree,["A","F"])
4723+ print mrca
4724+ #self.assert_(mrca == 8)
4725+ to = _parse_tree('(X,Y,Z,(Q,W));')
4726+ treeobj = _parse_tree(tree)
4727+ newnode = treeobj.addNodeBetweenNodes(10,9)
4728+ treeobj.addSubTree(newnode, to, ignoreRootAssert=True)
4729+ treeobj.draw()
4730+
4731+
4732 def test_get_all_trees(self):
4733 XML = etree.tostring(etree.parse(single_source_input,parser),pretty_print=True)
4734 tree = obtain_trees(XML)
4735
4736=== added file 'stk/test/data/input/auto_sub.phyml'
4737--- stk/test/data/input/auto_sub.phyml 1970-01-01 00:00:00 +0000
4738+++ stk/test/data/input/auto_sub.phyml 2017-01-12 09:27:31 +0000
4739@@ -0,0 +1,97 @@
4740+<?xml version='1.0' encoding='utf-8'?>
4741+<phylo_storage>
4742+ <project_name>
4743+ <string_value lines="1">Test</string_value>
4744+ </project_name>
4745+ <sources>
4746+ <source name="Hill_2011">
4747+ <bibliographic_information>
4748+ <article>
4749+ <authors>
4750+ <author>
4751+ <surname>
4752+ <string_value lines="1">Hill</string_value>
4753+ </surname>
4754+ <other_names>
4755+ <string_value lines="1">Jon</string_value>
4756+ </other_names>
4757+ </author>
4758+ </authors>
4759+ <title>
4760+ <string_value lines="1">A great paper</string_value>
4761+ </title>
4762+ <year>
4763+ <integer_value rank="0">2011</integer_value>
4764+ </year>
4765+ <journal>
4766+ <string_value lines="1">Nature</string_value>
4767+ </journal>
4768+ <pages>
4769+ <string_value lines="1">1-12</string_value>
4770+ </pages>
4771+ </article>
4772+ </bibliographic_information>
4773+ <source_tree name="Hill_2011_1">
4774+ <tree>
4775+ <tree_string>
4776+ <string_value lines="1">(Thalassarche_melanophris, Pelecaniformes, (Gallus, Gallus_varius));</string_value>
4777+ </tree_string>
4778+ <figure_legend>
4779+ <string_value lines="1">NA</string_value>
4780+ </figure_legend>
4781+ <figure_number>
4782+ <string_value lines="1">1</string_value>
4783+ </figure_number>
4784+ <page_number>
4785+ <string_value lines="1">1</string_value>
4786+ </page_number>
4787+ <tree_inference>
4788+ <optimality_criterion name="Maximum Parsimony"/>
4789+ </tree_inference>
4790+ <topology>
4791+ <outgroup>
4792+ <string_value lines="1">A</string_value>
4793+ </outgroup>
4794+ </topology>
4795+ </tree>
4796+ <taxa_data>
4797+ <all_extant/>
4798+ </taxa_data>
4799+ <character_data>
4800+ <character type="molecular" name="12S"/>
4801+ </character_data>
4802+ </source_tree>
4803+ <source_tree name="Hill_2011_2">
4804+ <tree>
4805+ <tree_string>
4806+ <string_value lines="1">(Gallus_lafayetii, (Platalea_leucorodia, (Ardea_humbloti, Ardea_goliath)));</string_value>
4807+ </tree_string>
4808+ <figure_legend>
4809+ <string_value lines="1">NA</string_value>
4810+ </figure_legend>
4811+ <figure_number>
4812+ <string_value lines="1">1</string_value>
4813+ </figure_number>
4814+ <page_number>
4815+ <string_value lines="1">1</string_value>
4816+ </page_number>
4817+ <tree_inference>
4818+ <optimality_criterion name="Maximum Parsimony"/>
4819+ </tree_inference>
4820+ <topology>
4821+ <outgroup>
4822+ <string_value lines="1">A</string_value>
4823+ </outgroup>
4824+ </topology>
4825+ </tree>
4826+ <taxa_data>
4827+ <all_extant/>
4828+ </taxa_data>
4829+ <character_data>
4830+ <character type="molecular" name="12S"/>
4831+ </character_data>
4832+ </source_tree>
4833+ </source>
4834+ </sources>
4835+ <history/>
4836+</phylo_storage>
4837
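The `auto_sub.phyml` fixture added above follows the `phylo_storage` layout the tests query: each `source_tree` carries its Newick string under `tree/tree_string/string_value`. A minimal sketch of extracting those pairs, assuming only the element paths visible in the fixture; it uses the stdlib `xml.etree.ElementTree`, whereas the test suite itself uses lxml's richer `xpath` calls:

```python
import xml.etree.ElementTree as ET

# Toy fragment mirroring the structure of auto_sub.phyml.
PHYML = """<phylo_storage>
  <sources>
    <source name="Hill_2011">
      <source_tree name="Hill_2011_1">
        <tree>
          <tree_string>
            <string_value lines="1">(A, (B, C));</string_value>
          </tree_string>
        </tree>
      </source_tree>
    </source>
  </sources>
</phylo_storage>"""

root = ET.fromstring(PHYML)
# Walk every source_tree and pull its name and Newick string.
for st in root.iter("source_tree"):
    name = st.get("name")
    newick = st.find("tree/tree_string/string_value").text
    print(name, newick)
```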
4838=== modified file 'stk/test/data/input/check_data_ind.phyml'
4839--- stk/test/data/input/check_data_ind.phyml 2014-10-09 09:33:21 +0000
4840+++ stk/test/data/input/check_data_ind.phyml 2017-01-12 09:27:31 +0000
4841@@ -249,6 +249,147 @@
4842 <character type="molecular" name="12S"/>
4843 </character_data>
4844 </source_tree>
4845+ <source_tree name="Hill_Davis_2011_3">
4846+ <tree>
4847+ <tree_string>
4848+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4849+ </tree_string>
4850+ <figure_legend>
4851+ <string_value lines="1">NA</string_value>
4852+ </figure_legend>
4853+ <figure_number>
4854+ <string_value lines="1">0</string_value>
4855+ </figure_number>
4856+ <page_number>
4857+ <string_value lines="1">0</string_value>
4858+ </page_number>
4859+ <tree_inference>
4860+ <optimality_criterion name="Maximum Parsimony"/>
4861+ </tree_inference>
4862+ <topology>
4863+ <outgroup>
4864+ <string_value lines="1">A</string_value>
4865+ </outgroup>
4866+ </topology>
4867+ </tree>
4868+ <taxa_data>
4869+ <mixed_fossil_and_extant>
4870+ <taxon name="A">
4871+ <fossil/>
4872+ </taxon>
4873+ <taxon name="B">
4874+ <fossil/>
4875+ </taxon>
4876+ </mixed_fossil_and_extant>
4877+ </taxa_data>
4878+ <character_data>
4879+ <character type="molecular" name="12S"/>
4880+ </character_data>
4881+ </source_tree>
4882+ </source>
4883+ <source name="Hill_Davis_2013">
4884+ <bibliographic_information>
4885+ <article>
4886+ <authors>
4887+ <author>
4888+ <surname>
4889+ <string_value lines="1">Hill</string_value>
4890+ </surname>
4891+ <other_names>
4892+ <string_value lines="1">Jon</string_value>
4893+ </other_names>
4894+ </author>
4895+ <author>
4896+ <surname>
4897+ <string_value lines="1">Davis</string_value>
4898+ </surname>
4899+ <other_names>
4900+ <string_value lines="1">Katie</string_value>
4901+ </other_names>
4902+ </author>
4903+ </authors>
4904+ <title>
4905+ <string_value lines="1">Another superb paper</string_value>
4906+ </title>
4907+ <year>
4908+ <integer_value rank="0">2013</integer_value>
4909+ </year>
4910+ </article>
4911+ </bibliographic_information>
4912+ <source_tree name="Hill_Davis_2013_1">
4913+ <tree>
4914+ <tree_string>
4915+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4916+ </tree_string>
4917+ <figure_legend>
4918+ <string_value lines="1">NA</string_value>
4919+ </figure_legend>
4920+ <figure_number>
4921+ <string_value lines="1">0</string_value>
4922+ </figure_number>
4923+ <page_number>
4924+ <string_value lines="1">0</string_value>
4925+ </page_number>
4926+ <tree_inference>
4927+ <optimality_criterion name="Maximum Parsimony"/>
4928+ </tree_inference>
4929+ <topology>
4930+ <outgroup>
4931+ <string_value lines="1">A</string_value>
4932+ </outgroup>
4933+ </topology>
4934+ </tree>
4935+ <taxa_data>
4936+ <mixed_fossil_and_extant>
4937+ <taxon name="A">
4938+ <fossil/>
4939+ </taxon>
4940+ <taxon name="B">
4941+ <fossil/>
4942+ </taxon>
4943+ </mixed_fossil_and_extant>
4944+ </taxa_data>
4945+ <character_data>
4946+ <character type="molecular" name="12S"/>
4947+ </character_data>
4948+ </source_tree>
4949+ <source_tree name="Hill_Davis_2013_2">
4950+ <tree>
4951+ <tree_string>
4952+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4953+ </tree_string>
4954+ <figure_legend>
4955+ <string_value lines="1">NA</string_value>
4956+ </figure_legend>
4957+ <figure_number>
4958+ <string_value lines="1">0</string_value>
4959+ </figure_number>
4960+ <page_number>
4961+ <string_value lines="1">0</string_value>
4962+ </page_number>
4963+ <tree_inference>
4964+ <optimality_criterion name="Maximum Parsimony"/>
4965+ </tree_inference>
4966+ <topology>
4967+ <outgroup>
4968+ <string_value lines="1">A</string_value>
4969+ </outgroup>
4970+ </topology>
4971+ </tree>
4972+ <taxa_data>
4973+ <mixed_fossil_and_extant>
4974+ <taxon name="A">
4975+ <fossil/>
4976+ </taxon>
4977+ <taxon name="B">
4978+ <fossil/>
4979+ </taxon>
4980+ </mixed_fossil_and_extant>
4981+ </taxa_data>
4982+ <character_data>
4983+ <character type="molecular" name="12S"/>
4984+ </character_data>
4985+ </source_tree>
4986 </source>
4987 </sources>
4988 <history/>
4989
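The two `Hill_Davis_2013` trees added to `check_data_ind.phyml` above are byte-identical and share the same character, which is exactly the situation the updated `data_independence` test flags as a non-independent group. An illustrative grouping by (tree string, character set), not the toolkit's actual implementation, which also detects taxon subsets like the `Hill_2011` pair:

```python
from collections import defaultdict

# Illustrative only: bucket source trees whose Newick string and character
# set are identical; any bucket with more than one member is non-independent.
def group_identical(sources):
    groups = defaultdict(list)
    for name, tree, chars in sources:
        groups[(tree, frozenset(chars))].append(name)
    return [g for g in groups.values() if len(g) > 1]

# Simplified versions of the fixture's trees (branch lengths dropped).
sources = [
    ("Hill_Davis_2013_1", "((A,B),F,E,(G,Z));", ["12S"]),
    ("Hill_Davis_2013_2", "((A,B),F,E,(G,Z));", ["12S"]),
    ("Hill_Davis_2011_3", "((A,B),F,E,(G,H));", ["12S"]),
]
print(group_identical(sources))
```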
4990=== added file 'stk/test/data/input/check_taxonomy.phyml'
4991--- stk/test/data/input/check_taxonomy.phyml 1970-01-01 00:00:00 +0000
4992+++ stk/test/data/input/check_taxonomy.phyml 2017-01-12 09:27:31 +0000
4993@@ -0,0 +1,67 @@
4994+<?xml version='1.0' encoding='utf-8'?>
4995+<phylo_storage>
4996+ <project_name>
4997+ <string_value lines="1">Test</string_value>
4998+ </project_name>
4999+ <sources>
5000+ <source name="Hill_2011">
The diff has been truncated for viewing.
