Merge lp:~jon-hill/supertree-toolkit/sub_in_subfile into lp:supertree-toolkit

Proposed by Jon Hill on 2017-01-12
Status: Merged
Merged at revision: 281
Proposed branch: lp:~jon-hill/supertree-toolkit/sub_in_subfile
Merge into: lp:supertree-toolkit
Diff against target: 13342 lines (+11326/-781)
44 files modified
debian/control (+1/-1)
debian/rules (+1/-0)
notes.txt (+38/-0)
stk/bzr_version.py (+5/-5)
stk/p4/NexusToken.py (+1/-0)
stk/p4/NexusToken2.py (+1/-1)
stk/p4/Tree.py (+1/-9)
stk/p4/Tree_muck.py (+4/-2)
stk/scripts/check_nomenclature.py (+0/-224)
stk/scripts/check_nomenclature.py.moved (+224/-0)
stk/scripts/create_colours_itol.py (+2/-11)
stk/scripts/create_taxonomy.py (+4/-100)
stk/scripts/fill_in_with_taxonomy.py (+711/-174)
stk/scripts/plot_character_taxa_matrix.py (+83/-1)
stk/scripts/plot_tree_taxa_matrix.py (+56/-0)
stk/scripts/remove_poorly_constrained_taxa.py (+43/-20)
stk/scripts/tree_from_taxonomy.py (+142/-0)
stk/stk (+787/-34)
stk/stk_exceptions.py (+8/-0)
stk/supertree_toolkit.py (+849/-47)
stk/test/_substitute_taxa.py (+19/-1)
stk/test/_supertree_toolkit.py (+138/-15)
stk/test/_trees.py (+13/-1)
stk/test/data/input/auto_sub.phyml (+97/-0)
stk/test/data/input/check_data_ind.phyml (+141/-0)
stk/test/data/input/check_taxonomy.phyml (+67/-0)
stk/test/data/input/check_taxonomy_fixes.phyml (+378/-0)
stk/test/data/input/create_taxonomy.csv (+6/-6)
stk/test/data/input/create_taxonomy.phyml (+67/-0)
stk/test/data/input/equivalents.csv (+5/-0)
stk/test/data/input/mrca.tre (+1/-0)
stk/test/data/input/old_stk_test_data_ind.phyml (+1324/-0)
stk/test/data/input/old_stk_test_data_tax_overlap.phyml (+627/-0)
stk/test/data/input/old_stk_test_nonmonophyl_removed.phyml (+1324/-0)
stk/test/data/input/old_stk_test_species_level.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy.csv (+334/-0)
stk/test/data/input/old_stk_test_taxonomy_check_subs.dat (+26/-0)
stk/test/data/input/old_stk_test_taxonomy_checked.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy_checker.csv (+336/-0)
stk/test/data/output/one_click_subs_output.phyml (+97/-0)
stk/test/util.py (+7/-0)
stk_gui/gui/gui.glade (+670/-124)
stk_gui/plugins/phyml/name_author.py (+4/-1)
stk_gui/stk_gui/interface.py (+36/-4)
To merge this branch: bzr merge lp:~jon-hill/supertree-toolkit/sub_in_subfile
Reviewer Review Type Date Requested Status
Jon Hill Approve on 2017-01-12
Review via email: mp+314598@code.launchpad.net

Description of the change

Adding taxonomic awareness and fixing a lot of bugs

322. By Jon Hill on 2017-01-12

removing file that shouldn't be there

323. By Jon Hill on 2017-01-12

removing file that shouldn't be there

Jon Hill (jon-hill) :
review: Approve

Preview Diff

1=== modified file 'debian/control'
2--- debian/control 2016-12-14 16:22:12 +0000
3+++ debian/control 2017-01-12 09:27:31 +0000
4@@ -9,7 +9,7 @@
5
6 Package: supertree-toolkit
7 Architecture: all
8-Depends: python-tk, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx
9+Depends: python-tk, python-simplejson, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx, python-argcomplete
10 Recommends: python-psyco
11 Suggests:
12 Conflicts:
13
14=== modified file 'debian/rules'
15--- debian/rules 2013-10-14 12:58:59 +0000
16+++ debian/rules 2017-01-12 09:27:31 +0000
17@@ -6,5 +6,6 @@
18
19 override_dh_auto_install:
20 python setup.py install --root=debian/supertree-toolkit --install-layout=deb --install-scripts=/usr/bin
21+ argcomplete.autocomplete(parser)
22
23 override_dh_auto_build:
24
25=== added file 'notes.txt'
26--- notes.txt 1970-01-01 00:00:00 +0000
27+++ notes.txt 2017-01-12 09:27:31 +0000
28@@ -0,0 +1,38 @@
29+Ideas:
30+
31+Collect data, remove paraphyletic
32+
33+Take taxonomy (from dbs), phyml, user's knowledge (encoded as subs file) and information on synonyms (from dbs)
34+to create a master subs file that takes the data to species level
35+
36+User needs to be able to edit taxonomy - CSV file
37+
38+User needs to choose database source - preferred source.
39+
40+
41+Taxonomic name checker:
42+
43+ - use database to get synonyms and possible misspellings
44+ - GUI is a 2-column table with green, yellow, red. User fills in red (or removes it); green is fine. Yellow - drop-down list with alternatives.
45+ - Use this to generate a two-column CSV file
46+ - On CLI, generate a three-column CSV: original name, new name (or blank for unknown) and a list of possibles. Warn the user that they *must* fill in the second column or remove the row, or the taxa will be deleted.
47+
48+For colloquial names, the user adds them to column 1 of the taxonomy CSV and then adds the Latin name in the appropriate column of the database. The subs can then generate the species list.
49+
50+Use these two CSV files to generate a subs file, including replacing higher taxa and genera to create a "to species" substitution (can also output this file for later)
51+
52+Generating data to any taxonomic level can happen later - need to check each species is accounted for in the taxonomy, with correct levels - may need another parse of the taxonomy csv
53+
54+
55+Add data -> paraphyletic taxa -> taxonomy checker -> sub synonyms -> taxonomy generator -> create species level dataset
56+
57+New functions:
58+ - taxonomic name checker (this might take a while online for a large dataset) - note that this should be a one-for-one substitution - separate function so we can check this?
59+ - Pull in taxonomy generator
60+ - Add csv file to schema
61+ - amend manual with workflow
62+ - warning on multiple subs in data in manual
63+ - generate species level subsfile from taxonomy
64+ - generate specified taxonomic level data
65+
66+
67
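The three-column checker CSV sketched in the notes above can be illustrated with a small Python helper. This is a hypothetical sketch (`write_checker_csv` and the sample names are not part of the toolkit), assuming the equivalents structure used elsewhere in this branch: each original name maps to a list of candidate names plus a status colour.

```python
import csv
import io

def write_checker_csv(equivalents, fileobj):
    # Hypothetical three-column output: original name, suggested
    # replacement (blank when unknown), and candidate synonyms.
    writer = csv.writer(fileobj)
    writer.writerow(["original", "replacement", "candidates"])
    for name in sorted(equivalents):
        synonyms, status = equivalents[name]
        # 'red' means no match was found, so the replacement stays blank
        replacement = "" if status == "red" else synonyms[0]
        writer.writerow([name, replacement, ";".join(synonyms)])

equivs = {
    "Gallus_gallus": (["Gallus_gallus"], "green"),
    "Madeupus_taxon": (["Madeupus_taxon"], "red"),
}
buf = io.StringIO()
write_checker_csv(equivs, buf)
print(buf.getvalue())
```

A user would then fill in or delete the blank-replacement rows before the subs file is generated, as the note warns.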
68=== modified file 'stk/bzr_version.py'
69--- stk/bzr_version.py 2017-01-11 17:42:56 +0000
70+++ stk/bzr_version.py 2017-01-12 09:27:31 +0000
71@@ -4,12 +4,12 @@
72 So don't edit it. :)
73 """
74
75-version_info = {'branch_nick': u'supertree-toolkit',
76- 'build_date': '2017-01-11 17:42:27 +0000',
77+version_info = {'branch_nick': u'sub_in_subfile',
78+ 'build_date': '2017-01-11 17:48:33 +0000',
79 'clean': None,
80- 'date': '2017-01-11 17:39:43 +0000',
81- 'revision_id': 'jon.hill@imperial.ac.uk-20170111173943-88so1icr33su3afo',
82- 'revno': '279'}
83+ 'date': '2017-01-11 17:48:18 +0000',
84+ 'revision_id': 'jon.hill@imperial.ac.uk-20170111174818-9q8a9octvnawruuw',
85+ 'revno': '317'}
86
87 revisions = {}
88
89
90=== modified file 'stk/p4/NexusToken.py'
91--- stk/p4/NexusToken.py 2012-01-11 08:57:43 +0000
92+++ stk/p4/NexusToken.py 2017-01-12 09:27:31 +0000
93@@ -44,6 +44,7 @@
94 gm = ["safeNextTok(), called from %s" % caller]
95 else:
96 gm = ["safeNextTok()"]
97+ print flob
98 gm.append("Premature Death.")
99 gm.append("Ran out of understandable things to read in nexus file.")
100 raise Glitch, gm
101
102=== modified file 'stk/p4/NexusToken2.py'
103--- stk/p4/NexusToken2.py 2012-01-11 08:57:43 +0000
104+++ stk/p4/NexusToken2.py 2017-01-12 09:27:31 +0000
105@@ -88,7 +88,7 @@
106 else:
107 gm = ["safeNextTok()"]
108 gm.append("Premature Death.")
109- gm.append("Ran out of understandable things to read in nexus file.")
110+ gm.append("Ran out of understandable things to read in nexus file." + str(flob))
111 raise Glitch, gm
112 else:
113 return t
114
115=== modified file 'stk/p4/Tree.py'
116--- stk/p4/Tree.py 2013-08-25 09:24:34 +0000
117+++ stk/p4/Tree.py 2017-01-12 09:27:31 +0000
118@@ -996,17 +996,9 @@
119 if not item.name:
120 if item == self.root:
121 if var.fixRootedTrees:
122- if self.name:
123- print "Tree.initFinish() tree '%s'" % self.name
124- else:
125- print 'Tree.initFinish()'
126- print "Fixing tree to work with SuperTree scores"
127+ #print "Fixing tree to work with SuperTree scores"
128 self.removeRoot()
129 elif var.warnAboutTerminalRootWithNoName:
130- if self.name:
131- print "Tree.initFinish() tree '%s'" % self.name
132- else:
133- print 'Tree.initFinish()'
134 print ' Non-fatal warning: the root is terminal, but has no name.'
135 print ' This may be what you wanted. Or not?'
136 print ' (To get rid of this warning, turn off var.warnAboutTerminalRootWithNoName)'
137
138=== modified file 'stk/p4/Tree_muck.py'
139--- stk/p4/Tree_muck.py 2015-02-19 14:47:06 +0000
140+++ stk/p4/Tree_muck.py 2017-01-12 09:27:31 +0000
141@@ -769,6 +769,7 @@
142 else:
143 gm.append("The 2 specified nodes should have a parent-child relationship")
144 raise Glitch, gm
145+
146 if var.usePfAndNumpy:
147 self.deleteCStuff()
148
149@@ -1629,7 +1630,7 @@
150
151
152
153-def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None):
154+def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None, ignoreRootAssert=False):
155 """Add a subtree to a tree.
156
157 The nodes from theSubTree are added to self.nodes, and theSubTree
158@@ -1666,7 +1667,8 @@
159
160 assert selfNode in self.nodes
161 assert selfNode.parent
162- assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
163+ if not ignoreRootAssert:
164+ assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
165 if not subTreeTaxNames:
166 subTreeTaxNames = [n.name for n in theSubTree.iterLeavesNoRoot()]
167
168
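The `ignoreRootAssert` keyword added to `addSubTree` above follows a common pattern: keep a safety assertion on by default, but let callers that knowingly violate the precondition opt out explicitly. A minimal sketch of the pattern (the function and names here are illustrative, not p4's API):

```python
def graft(children, ignore_root_assert=False):
    # The check runs by default; callers that know their input violates
    # it can bypass it explicitly rather than losing the check everywhere.
    if not ignore_root_assert:
        # "a root on a stick" has exactly one child
        assert len(children) == 1, "subtree root must have exactly one child"
    return len(children)

print(graft(["only_child"]))                        # passes the check
print(graft(["a", "b"], ignore_root_assert=True))   # check bypassed
```

This keeps the invariant documented and enforced for ordinary callers while unblocking the new code paths in this branch.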
169=== removed file 'stk/scripts/check_nomenclature.py'
170--- stk/scripts/check_nomenclature.py 2016-07-14 10:12:17 +0000
171+++ stk/scripts/check_nomenclature.py 1970-01-01 00:00:00 +0000
172@@ -1,224 +0,0 @@
173-#!/usr/bin/env python
174-#
175-# Derived from the Supertree Toolkit. Software for managing and manipulating sources
176-# trees ready for supretree construction.
177-# Copyright (C) 2015, Jon Hill, Katie Davis
178-#
179-# This program is free software: you can redistribute it and/or modify
180-# it under the terms of the GNU General Public License as published by
181-# the Free Software Foundation, either version 3 of the License, or
182-# (at your option) any later version.
183-#
184-# This program is distributed in the hope that it will be useful,
185-# but WITHOUT ANY WARRANTY; without even the implied warranty of
186-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
187-# GNU General Public License for more details.
188-#
189-# You should have received a copy of the GNU General Public License
190-# along with this program. If not, see <http://www.gnu.org/licenses/>.
191-#
192-# Jon Hill. jon.hill@york.ac.uk.
193-#
194-#
195-# This is an enitrely self-contained script that does not require the STK to be installed.
196-
197-import urllib2
198-from urllib import quote_plus
199-import simplejson as json
200-import argparse
201-import os
202-import sys
203-import csv
204-
205-def main():
206-
207- # do stuff
208- parser = argparse.ArgumentParser(
209- prog="Check nomenclature",
210- description="Check nomenclature from a tree file or list against valid names derived from EOL",
211- )
212- parser.add_argument(
213- '-v',
214- '--verbose',
215- action='store_true',
216- help="Verbose output: mainly progress reports.",
217- default=False
218- )
219- parser.add_argument(
220- '--existing',
221- help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
222- )
223- parser.add_argument(
224- 'input_file',
225- metavar='input_file',
226- nargs=1,
227- help="Your input taxa list"
228- )
229- parser.add_argument(
230- 'output_file',
231- metavar='output_file',
232- nargs=1,
233- help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
234- )
235-
236- args = parser.parse_args()
237- verbose = args.verbose
238- input_file = args.input_file[0]
239- output_file = args.output_file[0]
240- existing_data = args.existing
241-
242- if (not existing_data == None):
243- exiting_data = load_equivalents(existing_data)
244- else:
245- existing_data = None
246-
247- with open(input_file,'r') as f:
248- lines = f.read().splitlines()
249- equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
250-
251-
252- f = open(output_file,"w")
253- for taxon in sorted(equivs.keys()):
254- f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
255- f.close()
256-
257- return
258-
259-
260-def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
261- """ For each name in the database generate a database of the original name,
262- possible synonyms and if the taxon is not know, signal that. We do this by
263- using the EoL API to grab synonyms of each taxon. """
264-
265-
266- if existing_data == None:
267- equivalents = {}
268- else:
269- equivalents = existing_data
270-
271- # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
272- # if not, is there another API function to do this?
273- # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
274- # you get the original ('correct') name - shorten this to two words and you're done.
275- for t in name_list:
276- # make sure t has no spaces.
277- t = t.replace(" ","_")
278- if t in equivalents:
279- continue
280- taxon = t.replace("_"," ")
281- if (verbose):
282- print "Looking up ", taxon
283- # get the data from EOL on taxon
284- taxonq = quote_plus(taxon)
285- URL = "http://eol.org/api/search/1.0.json?q="+taxonq
286- req = urllib2.Request(URL)
287- opener = urllib2.build_opener()
288- f = opener.open(req)
289- data = json.load(f)
290- # check if there's some data
291- if len(data['results']) == 0:
292- equivalents[t] = [[t],'red']
293- continue
294- amber = False
295- if len(data['results']) > 1:
296- # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
297- # for automatic processing we'll just take the first one though
298- # colour is amber in this case
299- amber = True
300- ID = str(data['results'][0]['id']) # take first hit
301- URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
302- req = urllib2.Request(URL)
303- opener = urllib2.build_opener()
304-
305- try:
306- f = opener.open(req)
307- except urllib2.HTTPError:
308- equivalents[t] = [[t],'red']
309- continue
310- data = json.load(f)
311- if len(data['scientificName']) == 0:
312- # not found a scientific name, so set as red
313- equivalents[t] = [[t],'red']
314- continue
315- correct_name = data['scientificName'].encode("ascii","ignore")
316- # we only want the first two bits of the name, not the original author and year if any
317- temp_name = correct_name.split(' ')
318- if (len(temp_name) > 2):
319- correct_name = ' '.join(temp_name[0:2])
320- correct_name = correct_name.replace(' ','_')
321- print correct_name, t
322-
323- # build up the output dictionary - original name is key, synonyms/missing is value
324- if (correct_name == t or correct_name == taxon):
325- # if the original matches the 'correct', then it's green
326- equivalents[t] = [[t], 'green']
327- else:
328- # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
329- # 'correct' taxon at the top
330- eol_synonyms = data['synonyms']
331- synonyms = []
332- for s in eol_synonyms:
333- ts = s['synonym'].encode("ascii","ignore")
334- temp_syn = ts.split(' ')
335- if (len(temp_syn) > 2):
336- temp_syn = ' '.join(temp_syn[0:2])
337- ts = temp_syn
338- if (s['relationship'] == "synonym"):
339- ts = ts.replace(" ","_")
340- synonyms.append(ts)
341- synonyms = _uniquify(synonyms)
342- # we need to put the correct name at the top of the list now
343- if (correct_name in synonyms):
344- synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
345- elif len(synonyms) == 0:
346- synonyms.append(correct_name)
347- else:
348- synonyms.insert(0,correct_name)
349-
350- if (amber):
351- equivalents[t] = [synonyms,'amber']
352- else:
353- equivalents[t] = [synonyms,'yellow']
354- # if our search was empty, then it's red - see above
355-
356- # up to the calling funciton to do something sensible with this
357- # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
358- # Amber means we found synonyms and multilpe hits. User def needs to sort these!
359-
360- return equivalents
361-
362-def load_equivalents(equiv_csv):
363- """Load equivalents data from a csv and convert to a equivalents Dict.
364- Structure is key, with a list that is array of synonyms, followed by status ('green',
365- 'yellow', 'amber', or 'red').
366-
367- """
368-
369- import csv
370-
371- equivalents = {}
372-
373- with open(equiv_csv, 'rU') as csvfile:
374- equiv_reader = csv.reader(csvfile, delimiter=',')
375- equiv_reader.next() # skip header
376- for row in equiv_reader:
377- i = 1
378- equivalents[row[0]] = [row[1].split(';'),row[2]]
379-
380- return equivalents
381-
382-def _uniquify(l):
383- """
384- Make a list, l, contain only unique data
385- """
386- keys = {}
387- for e in l:
388- keys[e] = 1
389-
390- return keys.keys()
391-
392-if __name__ == "__main__":
393- main()
394-
395-
396-
397
398=== added file 'stk/scripts/check_nomenclature.py.moved'
399--- stk/scripts/check_nomenclature.py.moved 1970-01-01 00:00:00 +0000
400+++ stk/scripts/check_nomenclature.py.moved 2017-01-12 09:27:31 +0000
401@@ -0,0 +1,224 @@
402+#!/usr/bin/env python
403+#
404+# Derived from the Supertree Toolkit. Software for managing and manipulating sources
405+# trees ready for supretree construction.
406+# Copyright (C) 2015, Jon Hill, Katie Davis
407+#
408+# This program is free software: you can redistribute it and/or modify
409+# it under the terms of the GNU General Public License as published by
410+# the Free Software Foundation, either version 3 of the License, or
411+# (at your option) any later version.
412+#
413+# This program is distributed in the hope that it will be useful,
414+# but WITHOUT ANY WARRANTY; without even the implied warranty of
415+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
416+# GNU General Public License for more details.
417+#
418+# You should have received a copy of the GNU General Public License
419+# along with this program. If not, see <http://www.gnu.org/licenses/>.
420+#
421+# Jon Hill. jon.hill@york.ac.uk.
422+#
423+#
424+# This is an entirely self-contained script that does not require the STK to be installed.
425+
426+import urllib2
427+from urllib import quote_plus
428+import simplejson as json
429+import argparse
430+import os
431+import sys
432+import csv
433+
434+def main():
435+
436+ # do stuff
437+ parser = argparse.ArgumentParser(
438+ prog="Check nomenclature",
439+ description="Check nomenclature from a tree file or list against valid names derived from EOL",
440+ )
441+ parser.add_argument(
442+ '-v',
443+ '--verbose',
444+ action='store_true',
445+ help="Verbose output: mainly progress reports.",
446+ default=False
447+ )
448+ parser.add_argument(
449+ '--existing',
450+ help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
451+ )
452+ parser.add_argument(
453+ 'input_file',
454+ metavar='input_file',
455+ nargs=1,
456+ help="Your input taxa list"
457+ )
458+ parser.add_argument(
459+ 'output_file',
460+ metavar='output_file',
461+ nargs=1,
462+ help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
463+ )
464+
465+ args = parser.parse_args()
466+ verbose = args.verbose
467+ input_file = args.input_file[0]
468+ output_file = args.output_file[0]
469+ existing_data = args.existing
470+
471+ if (existing_data is not None):
472+ existing_data = load_equivalents(existing_data)
473+ else:
474+ existing_data = None
475+
476+ with open(input_file,'r') as f:
477+ lines = f.read().splitlines()
478+ equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
479+
480+
481+ f = open(output_file,"w")
482+ for taxon in sorted(equivs.keys()):
483+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
484+ f.close()
485+
486+ return
487+
488+
489+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
490+ """ For each name in the database generate a database of the original name,
491+ possible synonyms and if the taxon is not know, signal that. We do this by
492+ using the EoL API to grab synonyms of each taxon. """
493+
494+
495+ if existing_data == None:
496+ equivalents = {}
497+ else:
498+ equivalents = existing_data
499+
500+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
501+ # if not, is there another API function to do this?
502+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
503+ # you get the original ('correct') name - shorten this to two words and you're done.
504+ for t in name_list:
505+ # make sure t has no spaces.
506+ t = t.replace(" ","_")
507+ if t in equivalents:
508+ continue
509+ taxon = t.replace("_"," ")
510+ if (verbose):
511+ print "Looking up ", taxon
512+ # get the data from EOL on taxon
513+ taxonq = quote_plus(taxon)
514+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
515+ req = urllib2.Request(URL)
516+ opener = urllib2.build_opener()
517+ f = opener.open(req)
518+ data = json.load(f)
519+ # check if there's some data
520+ if len(data['results']) == 0:
521+ equivalents[t] = [[t],'red']
522+ continue
523+ amber = False
524+ if len(data['results']) > 1:
525+ # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
526+ # for automatic processing we'll just take the first one though
527+ # colour is amber in this case
528+ amber = True
529+ ID = str(data['results'][0]['id']) # take first hit
530+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
531+ req = urllib2.Request(URL)
532+ opener = urllib2.build_opener()
533+
534+ try:
535+ f = opener.open(req)
536+ except urllib2.HTTPError:
537+ equivalents[t] = [[t],'red']
538+ continue
539+ data = json.load(f)
540+ if len(data['scientificName']) == 0:
541+ # not found a scientific name, so set as red
542+ equivalents[t] = [[t],'red']
543+ continue
544+ correct_name = data['scientificName'].encode("ascii","ignore")
545+ # we only want the first two bits of the name, not the original author and year if any
546+ temp_name = correct_name.split(' ')
547+ if (len(temp_name) > 2):
548+ correct_name = ' '.join(temp_name[0:2])
549+ correct_name = correct_name.replace(' ','_')
550+ print correct_name, t
551+
552+ # build up the output dictionary - original name is key, synonyms/missing is value
553+ if (correct_name == t or correct_name == taxon):
554+ # if the original matches the 'correct', then it's green
555+ equivalents[t] = [[t], 'green']
556+ else:
557+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
558+ # 'correct' taxon at the top
559+ eol_synonyms = data['synonyms']
560+ synonyms = []
561+ for s in eol_synonyms:
562+ ts = s['synonym'].encode("ascii","ignore")
563+ temp_syn = ts.split(' ')
564+ if (len(temp_syn) > 2):
565+ temp_syn = ' '.join(temp_syn[0:2])
566+ ts = temp_syn
567+ if (s['relationship'] == "synonym"):
568+ ts = ts.replace(" ","_")
569+ synonyms.append(ts)
570+ synonyms = _uniquify(synonyms)
571+ # we need to put the correct name at the top of the list now
572+ if (correct_name in synonyms):
573+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
574+ elif len(synonyms) == 0:
575+ synonyms.append(correct_name)
576+ else:
577+ synonyms.insert(0,correct_name)
578+
579+ if (amber):
580+ equivalents[t] = [synonyms,'amber']
581+ else:
582+ equivalents[t] = [synonyms,'yellow']
583+ # if our search was empty, then it's red - see above
584+
585+ # up to the calling function to do something sensible with this
586+ # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
587+ # Amber means we found synonyms and multiple hits. The user definitely needs to sort these!
588+
589+ return equivalents
590+
591+def load_equivalents(equiv_csv):
592+ """Load equivalents data from a csv and convert to a equivalents Dict.
593+ Structure is key, with a list that is array of synonyms, followed by status ('green',
594+ 'yellow', 'amber', or 'red').
595+
596+ """
597+
598+ import csv
599+
600+ equivalents = {}
601+
602+ with open(equiv_csv, 'rU') as csvfile:
603+ equiv_reader = csv.reader(csvfile, delimiter=',')
604+ equiv_reader.next() # skip header
605+ for row in equiv_reader:
606+ i = 1
607+ equivalents[row[0]] = [row[1].split(';'),row[2]]
608+
609+ return equivalents
610+
611+def _uniquify(l):
612+ """
613+ Make a list, l, contain only unique data
614+ """
615+ keys = {}
616+ for e in l:
617+ keys[e] = 1
618+
619+ return keys.keys()
620+
621+if __name__ == "__main__":
622+ main()
623+
624+
625+
626
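A note on `_uniquify` above: building a plain dict of keys discards order under Python 2, which is why the script must re-insert `correct_name` at the head of the list afterwards. An order-preserving variant, sketched here under Python 3 semantics (`dict.fromkeys` preserves insertion order from 3.7 on):

```python
def uniquify_ordered(items):
    # Order-preserving de-duplication: keeps the first occurrence of each
    # name, so a "correct" name placed first in the list stays first.
    return list(dict.fromkeys(items))

print(uniquify_ordered(["Aus_bus", "Cus_dus", "Aus_bus", "Cus_dus"]))
# → ['Aus_bus', 'Cus_dus']
```

With this variant the caller could append the correct name first and skip the pop/insert shuffle entirely.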
627=== modified file 'stk/scripts/create_colours_itol.py'
628--- stk/scripts/create_colours_itol.py 2014-12-09 10:58:48 +0000
629+++ stk/scripts/create_colours_itol.py 2017-01-12 09:27:31 +0000
630@@ -88,17 +88,8 @@
631 saturation=0.25
632 value=0.8
633
634- index = 3 # family
635- if (level == "Superfamily"):
636- index = 4
637- elif (level == "Infraorder"):
638- index = 5
639- elif (level == "Suborder"):
640- index = 6
641- elif (level == "Order"):
642- index = 7
643- elif (level == "Genus"):
644- index = 2
645+ index = stk.taxonomy_levels.index(level.lower())+1
646+ print index
647
648 if (tree):
649 tree_data = stk.import_tree(input_file)
650
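The hunk above replaces a hard-coded if/elif chain with a lookup into `stk.taxonomy_levels`. The idea can be sketched with a simplified levels list (the real ordering of `stk.taxonomy_levels` may differ, so the list below is an assumption for illustration). Note that `list.index` raises `ValueError` for an unrecognised level, where the old chain silently fell back to the family index:

```python
# Table-driven rank lookup; the levels list here is illustrative only.
taxonomy_levels = ['species', 'genus', 'family', 'superfamily',
                   'infraorder', 'suborder', 'order']

def level_index(level, default=3):
    try:
        return taxonomy_levels.index(level.lower()) + 1
    except ValueError:
        return default  # unknown level: keep the old fallback behaviour

print(level_index("Order"))    # 7 with this list, matching the old chain
print(level_index("Unknown"))  # 3, the old default (family)
```

A guard like this would preserve the old behaviour if a caller passes a level outside the table.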
651=== modified file 'stk/scripts/create_taxonomy.py'
652--- stk/scripts/create_taxonomy.py 2014-03-13 18:45:05 +0000
653+++ stk/scripts/create_taxonomy.py 2017-01-12 09:27:31 +0000
654@@ -16,6 +16,8 @@
655 import supertree_toolkit as stk
656 import csv
657
658+taxonomy_levels = stk.taxonomy_levels
659+
660 def main():
661
662 # do stuff
663@@ -66,13 +68,6 @@
664 f.close()
665
666 taxonomy = {}
667- # What we get from EOL
668- current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
669- # And the extra ones from ITIS
670- extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
671- # all of them in order
672- taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
673-
674
675 for taxon in taxa:
676 taxon = taxon.replace("_"," ")
677@@ -180,99 +175,8 @@
678 continue
679
680
681- # Now create the CSV output
682- with open(output_file, 'w') as f:
683- writer = csv.writer(f)
684- writer.writerow(taxonomy_levels)
685- for t in taxonomy:
686- species = t
687- try:
688- genus = taxonomy[t]['genus']
689- except KeyError:
690- genus = "-"
691- try:
692- family = taxonomy[t]['family']
693- except KeyError:
694- family = "-"
695- try:
696- superfamily = taxonomy[t]['superfamily']
697- except KeyError:
698- superfamily = "-"
699- try:
700- infraorder = taxonomy[t]['infraorder']
701- except KeyError:
702- infraorder = "-"
703- try:
704- suborder = taxonomy[t]['suborder']
705- except KeyError:
706- suborder = "-"
707- try:
708- order = taxonomy[t]['order']
709- except KeyError:
710- order = "-"
711- try:
712- superorder = taxonomy[t]['superorder']
713- except KeyError:
714- superorder = "-"
715- try:
716- subclass = taxonomy[t]['subclass']
717- except KeyError:
718- subclass = "-"
719- try:
720- tclass = taxonomy[t]['class']
721- except KeyError:
722- tclass = "-"
723- try:
724- subphylum = taxonomy[t]['subphylum']
725- except KeyError:
726- subphylum = "-"
727- try:
728- phylum = taxonomy[t]['phylum']
729- except KeyError:
730- phylum = "-"
731- try:
732- superphylum = taxonomy[t]['superphylum']
733- except KeyError:
734- superphylum = "-"
735- try:
736- infrakingdom = taxonomy[t]['infrakingdom']
737- except:
738- infrakingdom = "-"
739- try:
740- subkingdom = taxonomy[t]['subkingdom']
741- except:
742- subkingdom = "-"
743- try:
744- kingdom = taxonomy[t]['kingdom']
745- except KeyError:
746- kingdom = "-"
747- try:
748- provider = taxonomy[t]['provider']
749- except KeyError:
750- provider = "-"
751-
752-
753- this_classification = [
754- species.encode('utf-8'),
755- genus.encode('utf-8'),
756- family.encode('utf-8'),
757- superfamily.encode('utf-8'),
758- infraorder.encode('utf-8'),
759- suborder.encode('utf-8'),
760- order.encode('utf-8'),
761- superorder.encode('utf-8'),
762- subclass.encode('utf-8'),
763- tclass.encode('utf-8'),
764- subphylum.encode('utf-8'),
765- phylum.encode('utf-8'),
766- superphylum.encode('utf-8'),
767- infrakingdom.encode('utf-8'),
768- subkingdom.encode('utf-8'),
769- kingdom.encode('utf-8'),
770- provider.encode('utf-8')]
771- writer.writerow(this_classification)
772-
773-
774+ stk.save_taxonomy(taxonomy, output_file)
775+
776 def _uniquify(l):
777 """
778 Make a list, l, contain only unique data
779
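The deleted block above was sixteen near-identical try/except KeyError lookups, one per rank; `stk.save_taxonomy` now does the writing. The same collapse can be sketched with `dict.get` (the `levels` list and `taxonomy_row` helper are illustrative, not the toolkit's actual API):

```python
# One dict.get per rank replaces a try/except KeyError block per rank.
levels = ['genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def taxonomy_row(species, entry):
    # missing ranks get the "-" placeholder the old code used
    return [species] + [entry.get(level, "-") for level in levels]

entry = {'genus': 'Gallus', 'family': 'Phasianidae', 'kingdom': 'Animalia'}
print(taxonomy_row('Gallus gallus', entry))
```

Centralising the writer also means the column order is defined in one place instead of being repeated in every script.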
780=== modified file 'stk/scripts/fill_in_with_taxonomy.py'
781--- stk/scripts/fill_in_with_taxonomy.py 2016-12-14 16:22:12 +0000
782+++ stk/scripts/fill_in_with_taxonomy.py 2017-01-12 09:27:31 +0000
783@@ -23,21 +23,90 @@
784 from urllib import quote_plus
785 import simplejson as json
786 import argparse
787+import copy
788 import os
789 import sys
790 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
791 sys.path.insert(0, stk_path)
792 import supertree_toolkit as stk
793 import csv
794-
795-# What we get from EOL
796-current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
797-# And the extra ones from ITIS
798-extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
799-# all of them in order
800-taxonomy_levels = ['species','genus','subfamily','family','tribe','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
801-
802-def get_tree_taxa_taxonomy(taxon,wsdlObjectWoRMS):
803+from ete2 import Tree
804+import tempfile
805+import re
806+
807+taxonomy_levels = stk.taxonomy_levels
808+#tlevels = ['species','genus','family','superfamily','suborder','order','class','phylum','kingdom']
809+tlevels = ['species','genus', 'subfamily', 'family','infraorder','order','class','phylum','kingdom']
810+
811+def get_tree_taxa_taxonomy_eol(taxon):
812+
813+ taxonq = quote_plus(taxon)
814+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
815+ req = urllib2.Request(URL)
816+ opener = urllib2.build_opener()
817+ f = opener.open(req)
818+ data = json.load(f)
819+
820+ if data['results'] == []:
821+ return {}
822+ ID = str(data['results'][0]['id']) # take first hit
823+ # Now look for taxonomies
824+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
825+ req = urllib2.Request(URL)
826+ opener = urllib2.build_opener()
827+ f = opener.open(req)
828+ data = json.load(f)
829+ if len(data['taxonConcepts']) == 0:
830+ return {}
831+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
832+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
833+ # loop through and get preferred one if specified
834+ # now get taxonomy
835+ for db in data['taxonConcepts']:
836+ currentdb = db['nameAccordingTo'].lower()
837+ TID = str(db['identifier'])
838+ break
839+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
840+ req = urllib2.Request(URL)
841+ opener = urllib2.build_opener()
842+ f = opener.open(req)
843+ data = json.load(f)
844+ tax_array = {}
845+ tax_array['provider'] = currentdb
846+ for a in data['ancestors']:
847+ try:
848+ if a.has_key('taxonRank') :
849+ temp_level = a['taxonRank'].encode("ascii","ignore")
850+ if (temp_level in taxonomy_levels):
851+ # note the dump into ASCII
852+ temp_name = a['scientificName'].encode("ascii","ignore")
853+ temp_name = temp_name.split(" ")
854+ if (temp_level == 'species'):
855+ tax_array[temp_level] = "_".join(temp_name[0:2])
856+
857+ else:
858+ tax_array[temp_level] = temp_name[0]
859+ except KeyError as e:
860+ logging.exception("Key not found: taxonRank")
861+ continue
862+ try:
863+ # add taxonomy in to the taxonomy!
864+ # some issues here, so let's make sure it's OK
865+ temp_name = taxon.split(" ")
866+ if data.has_key('taxonRank') :
867+ if not data['taxonRank'].lower() == 'species':
868+ tax_array[data['taxonRank'].lower()] = temp_name[0]
869+ else:
870+ tax_array[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
871+ except KeyError as e:
872+ return tax_array
873+
874+ return tax_array
875+
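The EOL lookup above walks the `ancestors` list and keeps only recognised ranks, underscore-joining binomials. That filtering step can be sketched stand-alone (a sketch only; the record layout mirrors EOL's `ancestors` entries, and the field names are taken from the code above, not verified against the live API):

```python
# Sketch: reduce a list of ancestor records to a flat {rank: name} dict,
# keeping only recognised ranks, as the EOL lookup above does.
TAXONOMY_LEVELS = ['species', 'genus', 'family', 'order',
                   'class', 'phylum', 'kingdom']

def ancestors_to_taxonomy(ancestors, provider="eol"):
    tax = {'provider': provider}
    for rec in ancestors:
        rank = rec.get('taxonRank', '').lower()
        if rank not in TAXONOMY_LEVELS:
            continue  # skip ranks we don't track
        words = rec['scientificName'].split()
        if rank == 'species':
            # binomial: keep genus + epithet, underscore-joined
            tax[rank] = "_".join(words[:2])
        else:
            tax[rank] = words[0]
    return tax
```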
876+def get_tree_taxa_taxonomy_worms(taxon):
877+
878+ from SOAPpy import WSDL
879+ wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
880
881 taxon_data = wsdlObjectWoRMS.getAphiaRecords(taxon.replace('_',' '))
882 if taxon_data == None:
883@@ -51,6 +120,8 @@
884 classification = wsdlObjectWoRMS.getAphiaClassificationByID(taxon_id)
885 # construct array
886 tax_array = {}
887+ if (classification == ""):
888+ return {}
889 # classification is a nested dictionary, so we need to iterate down it
890 current_child = classification.child
891 while True:
892@@ -60,27 +131,252 @@
893 break
894 return tax_array
895
896-
897-
898-def get_taxonomy_worms(taxonomy, start_otu):
899+def get_tree_taxa_taxonomy_itis(taxon):
900+
901+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(taxon.replace('_',' ').strip())
902+ req = urllib2.Request(URL)
903+ opener = urllib2.build_opener()
904+ f = opener.open(req)
905+ string = unicode(f.read(),"ISO-8859-1")
906+ this_item = json.loads(string)
907+ if this_item['scientificNames'] == [None]: # not found
908+ return {}
909+ tsn = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
910+ # so call another function to get any valid names
911+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+tsn
912+ req = urllib2.Request(URL)
913+ opener = urllib2.build_opener()
914+ f = opener.open(req)
915+ string = unicode(f.read(),"ISO-8859-1")
916+ this_item = json.loads(string)
917+ if not this_item['acceptedNames'] == [None]:
918+ tsn = this_item['acceptedNames'][0]['acceptedTsn']
919+
920+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
921+ req = urllib2.Request(URL)
922+ opener = urllib2.build_opener()
923+ f = opener.open(req)
924+ string = unicode(f.read(),"ISO-8859-1")
925+ data = json.loads(string)
926+ # construct array
927+ this_taxonomy = {}
928+ for level in data['hierarchyList']:
929+ if level['rankName'].lower() in taxonomy_levels:
930+ # note the dump into ASCII
931+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
932+
933+ return this_taxonomy
934+
935+
936+
937+def get_taxonomy_eol(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
938+
939+ # this is the recursive function
940+ def get_children(taxonomy, ID, aphiaIDsDone):
941+
942+ # get data
943+ URL="http://eol.org/api/hierarchy_entries/1.0/"+str(ID)+".json?common_names=false&synonyms=false&cache_ttl="
944+ req = urllib2.Request(URL)
945+ opener = urllib2.build_opener()
946+ f = opener.open(req)
947+ string = unicode(f.read(),"ISO-8859-1")
948+ this_item = json.loads(string)
949+ if this_item == None:
950+ return taxonomy
951+ if this_item['taxonRank'].lower().strip() == 'species':
952+ # add data to taxonomy dictionary
953+ taxon = this_item['scientificName'].split()[0:2] # just the first two words
954+ taxon = " ".join(taxon[0:2])
955+ # NOTE following line means existing items are *not* updated
956+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
957+ this_taxonomy = {}
958+ for level in this_item['ancestors']:
959+ if level['taxonRank'].lower() in taxonomy_levels:
960+ # note the dump into ASCII
961+ this_taxonomy[level['taxonRank'].lower().encode("ascii","ignore")] = level['scientificName'].encode("ascii","ignore")
962+ # add species:
963+ this_taxonomy['species'] = taxon.replace(" ","_")
964+ if verbose:
965+ print "\tAdding "+taxon
966+ taxonomy[taxon] = this_taxonomy
967+ if not tmpfile == None:
968+ stk.save_taxonomy(taxonomy,tmpfile)
969+ return taxonomy
970+ else:
971+ return taxonomy
972+ all_children = []
973+ for level in this_item['children']:
974+ if not level == None:
975+ all_children.append(level['taxonID'])
976+
977+ if (len(all_children) == 0):
978+ return taxonomy
979+
980+ for child in all_children:
981+ if child in aphiaIDsDone: # we get stuck sometimes
982+ continue
983+ aphiaIDsDone.append(child)
984+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
985+ return taxonomy
986+
987+
988+ # main bit of the get_taxonomy_eol function
989+ taxonq = quote_plus(start_otu)
990+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
991+ req = urllib2.Request(URL)
992+ opener = urllib2.build_opener()
993+ f = opener.open(req)
994+ data = json.load(f)
995+ start_id = str(data['results'][0]['id']) # this is the page ID. We get the species ID next
996+ URL = "http://eol.org/api/pages/1.0/"+start_id+".json"
997+ req = urllib2.Request(URL)
998+ opener = urllib2.build_opener()
999+ f = opener.open(req)
1000+ data = json.load(f)
1001+ if len(data['taxonConcepts']) == 0:
1002+ print "Error finding your start taxon. Check the spelling?"
1003+ return None
1004+ start_id = data['taxonConcepts'][0]['identifier']
1005+ start_taxonomy_level = data['taxonConcepts'][0]['taxonRank'].lower()
1006+
1007+ aphiaIDsDone = []
1008+ if not skip:
1009+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1010+
1011+ return taxonomy, start_taxonomy_level
1012+
1013+
1014+
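Each provider's `get_children` above recurses down the hierarchy and carries an `aphiaIDsDone` list so a repeated or cyclic ID cannot trap the walk ("we get stuck sometimes"). The traversal pattern, detached from any web service (the toy `children`/`ranks` mappings are assumptions standing in for API responses):

```python
def collect_species(children, ranks, node, taxonomy, done=None):
    """Depth-first walk of an ID -> child-IDs mapping, recording
    species-rank nodes and skipping IDs already visited."""
    if done is None:
        done = set()
    if ranks.get(node) == 'species':
        taxonomy.setdefault(node, {'species': node})
        return taxonomy
    for child in children.get(node, []):
        if child in done:   # guard against cycles and repeats
            continue
        done.add(child)
        collect_species(children, ranks, child, taxonomy, done)
    return taxonomy
```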
1015+def get_taxonomy_itis(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1016+ import simplejson as json
1017+
1018+ # this is the recursive function
1019+ def get_children(taxonomy, ID, aphiaIDsDone):
1020+
1021+ # get data
1022+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+ID
1023+ req = urllib2.Request(URL)
1024+ opener = urllib2.build_opener()
1025+ f = opener.open(req)
1026+ string = unicode(f.read(),"ISO-8859-1")
1027+ this_item = json.loads(string)
1028+ if this_item == None:
1029+ return taxonomy
1030+ if not this_item['usage']['taxonUsageRating'].lower() == 'valid':
1031+ print "rejecting " , this_item['scientificName']['combinedName']
1032+ return taxonomy
1033+ if this_item['taxRank']['rankName'].lower().strip() == 'species':
1034+ # add data to taxonomy dictionary
1035+ taxon = this_item['scientificName']['combinedName']
1036+ # NOTE following line means existing items are *not* updated
1037+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1038+ # get the taxonomy of this species
1039+ tsn = this_item["scientificName"]["tsn"]
1040+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+tsn
1041+ req = urllib2.Request(URL)
1042+ opener = urllib2.build_opener()
1043+ f = opener.open(req)
1044+ string = unicode(f.read(),"ISO-8859-1")
1045+ data = json.loads(string)
1046+ this_taxonomy = {}
1047+ for level in data['hierarchyList']:
1048+ if level['rankName'].lower() in taxonomy_levels:
1049+ # note the dump into ASCII
1050+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
1051+ if verbose:
1052+ print "\tAdding "+taxon
1053+ taxonomy[taxon] = this_taxonomy
1054+ if not tmpfile == None:
1055+ stk.save_taxonomy(taxonomy,tmpfile)
1056+ return taxonomy
1057+ else:
1058+ return taxonomy
1059+
1060+ all_children = []
1061+ URL="http://www.itis.gov/ITISWebService/jsonservice/getHierarchyDownFromTSN?tsn="+ID
1062+ req = urllib2.Request(URL)
1063+ opener = urllib2.build_opener()
1064+ f = opener.open(req)
1065+ string = unicode(f.read(),"ISO-8859-1")
1066+ this_item = json.loads(string)
1067+ if this_item == None:
1068+ return taxonomy
1069+
1070+ for level in this_item['hierarchyList']:
1071+ if not level == None:
1072+ all_children.append(level['tsn'])
1073+
1074+ if (len(all_children) == 0):
1075+ return taxonomy
1076+
1077+ for child in all_children:
1078+ if child in aphiaIDsDone: # we get stuck sometimes
1079+ continue
1080+ aphiaIDsDone.append(child)
1081+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
1082+
1083+ return taxonomy
1084+
1085+
1086+ # main bit of the get_taxonomy_itis function
1087+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(start_otu.strip())
1088+ req = urllib2.Request(URL)
1089+ opener = urllib2.build_opener()
1090+ f = opener.open(req)
1091+ string = unicode(f.read(),"ISO-8859-1")
1092+ this_item = json.loads(string)
1093+ start_id = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
1094+ # call it again via the ID this time to make sure we've got the right one.
1095+ # so call another function to get any valid names
1096+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+start_id
1097+ req = urllib2.Request(URL)
1098+ opener = urllib2.build_opener()
1099+ f = opener.open(req)
1100+ string = unicode(f.read(),"ISO-8859-1")
1101+ this_item = json.loads(string)
1102+ if not this_item['acceptedNames'] == [None]:
1103+ start_id = this_item['acceptedNames'][0]['acceptedTsn']
1104+
1105+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+start_id
1106+ req = urllib2.Request(URL)
1107+ opener = urllib2.build_opener()
1108+ f = opener.open(req)
1109+ string = unicode(f.read(),"ISO-8859-1")
1110+ this_item = json.loads(string)
1111+ start_taxonomy_level = this_item['taxRank']['rankName'].lower()
1112+
1113+ aphiaIDsDone = []
1114+ if not skip:
1115+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1116+
1117+ return taxonomy, start_taxonomy_level
1118+
1119+
1120+
1121+
1122+def get_taxonomy_worms(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1123 """ Gets and processes a taxon from the queue to get its taxonomy."""
1124 from SOAPpy import WSDL
1125
1126 wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1127
1128 # this is the recursive function
1129- def get_children(taxonomy, ID):
1130+ def get_children(taxonomy, ID, aphiaIDsDone):
1131
1132 # get data
1133 this_item = wsdlObjectWoRMS.getAphiaRecordByID(ID)
1134 if this_item == None:
1135 return taxonomy
1136+ if not this_item['status'].lower() == 'accepted':
1137+ print "rejecting " , this_item.valid_name
1138+ return taxonomy
1139 if this_item['rank'].lower() == 'species':
1140 # add data to taxonomy dictionary
1141- # get the taxonomy of this species
1142- classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1143- taxon = this_item.scientificname
1144+ taxon = this_item.valid_name
1145+ # NOTE following line means existing items are *not* updated
1146 if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1147+ # get the taxonomy of this species
1148+ classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1149 # construct array
1150 tax_array = {}
1151 # classification is a nested dictionary, so we need to iterate down it
1152@@ -92,16 +388,36 @@
1153 current_child = current_child.child
1154 if current_child == '': # empty one is a string for some reason
1155 break
1156- taxonomy[this_item.scientificname] = tax_array
1157+ if verbose:
1158+ print "\tAdding "+this_item.scientificname
1159+ taxonomy[this_item.valid_name] = tax_array
1160+ if not tmpfile == None:
1161+ stk.save_taxonomy(taxonomy,tmpfile)
1162 return taxonomy
1163 else:
1164 return taxonomy
1165
1166- children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, 1, False)
1167-
1168- for child in children:
1169- taxonomy = get_children(taxonomy, child['valid_AphiaID'])
1170-
1171+ all_children = []
1172+ start = 1
1173+ while True:
1174+ children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, start, False)
1175+ if children is None:
1176+ break
1177+ if (len(children) < 50):
1178+ all_children.extend(children)
1179+ break
1180+ all_children.extend(children)
1181+ start += 50
1182+
1183+ if (len(all_children) == 0):
1184+ return taxonomy
1185+
1186+ for child in all_children:
1187+ if child['valid_AphiaID'] in aphiaIDsDone: # we get stuck sometimes
1188+ continue
1189+ aphiaIDsDone.append(child['valid_AphiaID'])
1190+ taxonomy = get_children(taxonomy, child['valid_AphiaID'], aphiaIDsDone)
1191+
1192 return taxonomy
1193
1194
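The WoRMS branch above pages through `getAphiaChildrenByID` in blocks of 50, advancing the offset until a short or empty page arrives. The same accumulation pattern with the SOAP proxy stubbed out (a sketch; `fetch_page(node_id, start)` stands in for the real `wsdlObjectWoRMS` call):

```python
PAGE_SIZE = 50

def fetch_all_children(fetch_page, node_id):
    """Accumulate paged results until a short or empty page signals the end."""
    all_children = []
    start = 1                      # WoRMS offsets are 1-based
    while True:
        page = fetch_page(node_id, start)
        if not page:
            break
        all_children.extend(page)
        if len(page) < PAGE_SIZE:  # short page: nothing more to fetch
            break
        start += PAGE_SIZE
    return all_children
```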
1195@@ -111,12 +427,17 @@
1196 start_id = start_taxa[0]['valid_AphiaID'] # there might be records that aren't valid - they point to the valid one though
1197 # call it again via the ID this time to make sure we've got the right one.
1198 start_taxa = wsdlObjectWoRMS.getAphiaRecordByID(start_id)
1199- start_taxonomy_level = start_taxa['rank'].lower()
1200- except HTTPError:
1201- print "Error"
1202+ if start_taxa == None:
1203+ start_taxonomy_level = 'infraorder'
1204+ else:
1205+ start_taxonomy_level = start_taxa['rank'].lower()
1206+ except urllib2.HTTPError:
1207+ print "Error finding start_otu taxonomic level. Do you have an internet connection?"
1208 sys.exit(-1)
1209
1210- taxonomy = get_children(taxonomy,start_id)
1211+ aphiaIDsDone = []
1212+ if not skip:
1213+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1214
1215 return taxonomy, start_taxonomy_level
1216
1217@@ -136,9 +457,16 @@
1218 default=False
1219 )
1220 parser.add_argument(
1221+ '-s',
1222+ '--skip',
1223+ action='store_true',
1224+ help="Skip online checking, just use taxonomy files",
1225+ default=False
1226+ )
1227+ parser.add_argument(
1228 '--pref_db',
1229 help="Taxonomy database to use. Default is Species 2000/ITIS",
1230- choices=['itis', 'worms', 'ncbi'],
1231+ choices=['itis', 'worms', 'ncbi', 'eol'],
1232 default = 'worms'
1233 )
1234 parser.add_argument(
1235@@ -178,58 +506,250 @@
1236 top_level = args.top_level[0]
1237 save_taxonomy_file = args.save_taxonomy
1238 tree_taxonomy = args.tree_taxonomy
1239+ taxonomy = args.taxonomy_from_file
1240 pref_db = args.pref_db
1241+ skip = args.skip
1242 if (save_taxonomy_file == None):
1243 save_taxonomy = False
1244 else:
1245 save_taxonomy = True
1246+ load_tree_taxonomy = False
1247+ if (not tree_taxonomy == None):
1248+ tree_taxonomy_file = tree_taxonomy
1249+ load_tree_taxonomy = True
1250+ if skip:
1251+ if taxonomy == None:
1252+ print "Error: If you're skipping checking online, then you need to supply taxonomy files"
1253+ return
1254
1255 # grab taxa in tree
1256 tree = stk.import_tree(input_file)
1257 taxa_list = stk._getTaxaFromNewick(tree)
1258-
1259- taxonomy = {}
1260-
1261- # we're going to add the taxa in the tree to the taxonomy, to stop them
1262+
1263+ if verbose:
1264+ print "Taxa count for input tree: ", len(taxa_list)
1265+
1266+ # load in any taxonomy files - we still call the APIs as a) they may have updated data and
1267+ # b) the user may have missed some first time round (i.e. expanded the tree and not redone
1268+ # the taxonomy
1269+ if (taxonomy == None):
1270+ taxonomy = {}
1271+ else:
1272+ taxonomy = stk.load_taxonomy(taxonomy)
1273+ tree_taxonomy = {}
1274+ # this might also have tree_taxonomy in too - let's check this
1275+ for t in taxa_list:
1276+ if t in taxonomy:
1277+ tree_taxonomy[t] = taxonomy[t]
1278+ elif t.replace("_"," ") in taxonomy:
1279+ tree_taxonomy[t] = taxonomy[t.replace("_"," ")]
1280+
1281+ if (load_tree_taxonomy): # overwrite the good work above...
1282+ tree_taxonomy = stk.load_taxonomy(tree_taxonomy_file)
1283+ if (tree_taxonomy == None):
1284+ tree_taxonomy = {}
1285+
1286+ # we're going to add the taxa in the tree to the main WORMS taxonomy, to stop them
1287 # being fetched in first place. We delete them later
1288+ # If you've loaded a taxonomy created by this script, this overwrites the tree taxa in the main taxonomy dict
1289+ # Don't worry, we put them back in before saving again!
1290 for taxon in taxa_list:
1291 taxon = taxon.replace('_',' ')
1292- taxonomy[taxon] = []
1293-
1294+ taxonomy[taxon] = {}
1295
1296 if (pref_db == 'itis'):
1297 # get taxonomy info from itis
1298- print "Sorry, ITIS is not implemented yet"
1299- pass
1300+ if (verbose):
1301+ print "Getting data from ITIS"
1302+ if (verbose):
1303+ print "Dealing with taxa in tree"
1304+ for t in taxa_list:
1305+ if verbose:
1306+ print "\t"+t
1307+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1308+ # we don't have data - NOTE we assume things are *not* updated here if we do
1309+ tree_taxonomy[t] = get_tree_taxa_taxonomy_itis(t)
1310+
1311+ if save_taxonomy:
1312+ if (verbose):
1313+ print "Saving tree taxonomy"
1314+ # note -temporary save as we overwrite this file later.
1315+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1316+
1317+ # get taxonomy from ITIS
1318+ if verbose:
1319+ print "Now dealing with all other taxa - this might take a while..."
1320+ # create a temp file so we can checkpoint and continue
1321+ tmpf, tmpfile = tempfile.mkstemp()
1322+
1323+ if os.path.isfile('.fit_lock'):
1324+ f = open('.fit_lock','r')
1325+ tf = f.read()
1326+ f.close()
1327+ if os.path.isfile(tf.strip()):
1328+ taxonomy = stk.load_taxonomy(tf.strip())
1329+ os.remove('.fit_lock')
1330+
1331+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1332+ # where we left off.
1333+ with open(".fit_lock", 'w') as f:
1334+ f.write(tmpfile)
1335+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1336+ taxonomy, start_level = get_taxonomy_itis(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1337+
1338+ # clean up
1339+ os.close(tmpf)
1340+ os.remove('.fit_lock')
1341+ try:
1342+ os.remove(tmpfile)
1343+ except OSError:
1344+ pass
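The `.fit_lock` dance above (repeated once per provider branch) checkpoints the growing taxonomy to a temp file so an interrupted run can resume where it left off. A stripped-down version of the same resume logic, using JSON in place of `stk.save_taxonomy` (file name and serialisation format are illustrative only):

```python
import json
import os
import tempfile

LOCK = '.fit_lock_demo'  # hypothetical lock-file name for this sketch

def resume_or_start():
    """If the lock file points at a readable checkpoint, load it; else start fresh."""
    if os.path.isfile(LOCK):
        with open(LOCK) as f:
            path = f.read().strip()
        if os.path.isfile(path):
            with open(path) as f:
                return json.load(f)
    return {}

def checkpoint(taxonomy):
    """Write the current state and record its location in the lock file."""
    fd, path = tempfile.mkstemp(suffix='.json')
    with os.fdopen(fd, 'w') as f:
        json.dump(taxonomy, f)
    with open(LOCK, 'w') as f:
        f.write(path)
    return path

def finish(path):
    """Normal completion: remove both checkpoint and lock."""
    os.remove(LOCK)
    os.remove(path)
```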
1345 elif (pref_db == 'worms'):
1346+ if (verbose):
1347+ print "Getting data from WoRMS"
1348 # get tree taxonomy from worms
1349- if (tree_taxonomy == None):
1350- tree_taxonomy = {}
1351- for t in taxa_list:
1352- from SOAPpy import WSDL
1353- wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1354- tree_taxonomy[t] = get_tree_taxa_taxonomy(t,wsdlObjectWoRMS)
1355- else:
1356- tree_taxonomy = stk.load_taxonomy(tree_taxonomy)
1357+ if (verbose):
1358+ print "Dealing with taxa in tree"
1359+
1360+ for t in taxa_list:
1361+ if verbose:
1362+ print "\t"+t
1363+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1364+ # we don't have data - NOTE we assume things are *not* updated here if we do
1365+ tree_taxonomy[t] = get_tree_taxa_taxonomy_worms(t)
1366+
1367+ if save_taxonomy:
1368+ if (verbose):
1369+ print "Saving tree taxonomy"
1370+ # note -temporary save as we overwrite this file later.
1371+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1372+
1373 # get taxonomy from worms
1374- taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level)
1375+ if verbose:
1376+ print "Now dealing with all other taxa - this might take a while..."
1377+ # create a temp file so we can checkpoint and continue
1378+ tmpf, tmpfile = tempfile.mkstemp()
1379+
1380+ if os.path.isfile('.fit_lock'):
1381+ f = open('.fit_lock','r')
1382+ tf = f.read()
1383+ f.close()
1384+ if os.path.isfile(tf.strip()):
1385+ taxonomy = stk.load_taxonomy(tf.strip())
1386+ os.remove('.fit_lock')
1387+
1388+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1389+ # where we left off.
1390+ with open(".fit_lock", 'w') as f:
1391+ f.write(tmpfile)
1392+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1393+ taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1394+
1395+ # clean up
1396+ os.close(tmpf)
1397+ os.remove('.fit_lock')
1398+ try:
1399+ os.remove(tmpfile)
1400+ except OSError:
1401+ pass
1402
1403 elif (pref_db == 'ncbi'):
1404 # get taxonomy from ncbi
1405 print "Sorry, NCBI is not implemented yet"
1406 pass
1407+ elif (pref_db == 'eol'):
1408+ if (verbose):
1409+ print "Getting data from EOL"
1410+ # get tree taxonomy from EOL
1411+ if (verbose):
1412+ print "Dealing with taxa in tree"
1413+ for t in taxa_list:
1414+ if verbose:
1415+ print "\t"+t
1416+ try:
1417+ tree_taxonomy[t]
1418+ pass # we have data - NOTE we assume things are *not* updated here...
1419+ except KeyError:
1420+ try:
1421+ tree_taxonomy[t.replace('_',' ')]
1422+ except KeyError:
1423+ tree_taxonomy[t] = get_tree_taxa_taxonomy_eol(t)
1424+
1425+ if save_taxonomy:
1426+ if (verbose):
1427+ print "Saving tree taxonomy"
1428+ # note -temporary save as we overwrite this file later.
1429+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1430+
1431+ # get taxonomy from EOL
1432+ if verbose:
1433+ print "Now dealing with all other taxa - this might take a while..."
1434+ # create a temp file so we can checkpoint and continue
1435+ tmpf, tmpfile = tempfile.mkstemp()
1436+
1437+ if os.path.isfile('.fit_lock'):
1438+ f = open('.fit_lock','r')
1439+ tf = f.read()
1440+ f.close()
1441+ if os.path.isfile(tf.strip()):
1442+ taxonomy = stk.load_taxonomy(tf.strip())
1443+ os.remove('.fit_lock')
1444+
1445+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1446+ # where we left off.
1447+ with open(".fit_lock", 'w') as f:
1448+ f.write(tmpfile)
1449+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1450+ taxonomy, start_level = get_taxonomy_eol(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1451+
1452+ # clean up
1453+ os.close(tmpf)
1454+ os.remove('.fit_lock')
1455+ try:
1456+ os.remove(tmpfile)
1457+ except OSError:
1458+ pass
1459 else:
1460- print "ERROR: Didn't understand you database choice"
1461+ print "ERROR: Didn't understand your database choice"
1462 sys.exit(-1)
1463
1464 # clean up taxonomy, deleting the ones already in the tree
1465 for taxon in taxa_list:
1466- taxon = taxon.replace('_',' ')
1467- del taxonomy[taxon]
1468+ taxon = taxon.replace('_',' ')
1469+ try:
1470+ del taxonomy[taxon]
1471+ except KeyError:
1472+ pass # it's not there anyway, so nothing to delete
1473+
1474+ # We now have 2 taxonomies:
1475+ # - for taxa in the tree
1476+ # - for all other taxa in the clade of interest
1477+
1478+ if save_taxonomy:
1479+ tot_taxonomy = taxonomy.copy()
1480+ tot_taxonomy.update(tree_taxonomy)
1481+ stk.save_taxonomy(tot_taxonomy,save_taxonomy_file)
1482+
1483+
1484+ orig_taxa_list = taxa_list
1485+
1486+ remove_higher_level = [] # for storing the higher level taxa in the original tree that need deleting
1487+ generic = []
1488+ # find all the generic and build an internal subs file
1489+ for t in taxa_list:
1490+ t = t.replace(" ","_")
1491+ if t.find("_") == -1:
1492+ # no underscore, so just generic
1493+ generic.append(t)
1494
1495 # step up the taxonomy levels from genus, adding taxa to the correct node
1496 # as a polytomy
1497- for level in taxonomy_levels[1::]: # skip species....
1498+ start_level = start_level.encode('utf-8').strip()
1499+ if verbose:
1500+ print "I think your start OTU is at: ", start_level
1501+ for level in tlevels[1::]: # skip species....
1502+ if verbose:
1503+ print "Dealing with ",level
1504 new_taxa = []
1505 for t in taxonomy:
1506 # skip odd ones that should be in there
1507@@ -239,135 +759,61 @@
1508 except KeyError:
1509 continue # don't have this info
1510 new_taxa = _uniquify(new_taxa)
1511+
1512 for nt in new_taxa:
1513- taxa_to_add = []
1514+ taxa_to_add = {}
1515 taxa_in_clade = []
1516 for t in taxonomy:
1517 if start_level in taxonomy[t] and taxonomy[t][start_level] == top_level:
1518 try:
1519- if taxonomy[t][level] == nt:
1520- taxa_to_add.append(t.replace(' ','_'))
1521+ if taxonomy[t][level] == nt and not t in taxa_list:
1522+ taxa_to_add[t] = taxonomy[t]
1523 except KeyError:
1524 continue
1525+
1526 # add to tree
1527 for t in taxa_list:
1528 if level in tree_taxonomy[t] and tree_taxonomy[t][level] == nt:
1529 taxa_in_clade.append(t)
1530- if len(taxa_in_clade) > 0:
1531- tree = add_taxa(tree, taxa_to_add, taxa_in_clade)
1532- for t in taxa_to_add: # clean up taxonomy
1533- del taxonomy[t.replace('_',' ')]
1534-
1535-
1536+ if t in generic:
1537+ # we are appending taxa to this higher taxon, so we need to remove it
1538+ remove_higher_level.append(t)
1539+
1540+
1541+ if len(taxa_in_clade) > 0 and len(taxa_to_add) > 0:
1542+ tree = add_taxa(tree, taxa_to_add, taxa_in_clade,level)
1543+ try:
1544+ taxa_list = stk._getTaxaFromNewick(tree)
1545+ except stk.TreeParseError as e:
1546+ print taxa_to_add, taxa_in_clade, level, tree
1547+ print e.msg
1548+ return
1549+
1550+ for t in taxa_to_add:
1551+ tree_taxonomy[t.replace(' ','_')] = taxa_to_add[t]
1552+ try:
1553+ del taxonomy[t.replace('_',' ')]
1554+ except KeyError:
1555+ # It might have _ or it might not...
1556+ del taxonomy[t]
1557+
1558+
1559+ # remove singleton nodes
1560+ tree = stk._collapse_nodes(tree)
1561+ tree = stk._collapse_nodes(tree)
1562+ tree = stk._collapse_nodes(tree)
1563+
1564+ tree = stk._sub_taxa_in_tree(tree, remove_higher_level)
1565 trees = {}
1566 trees['tree_1'] = tree
1567 output = stk._amalgamate_trees(trees,format='nexus')
1568 f = open(output_file, "w")
1569 f.write(output)
1570 f.close()
1571-
1572- if not save_taxonomy_file == None:
1573- with open(save_taxonomy_file, 'w') as f:
1574- writer = csv.writer(f)
1575- headers = []
1576- headers.append("OTU")
1577- headers.extend(taxonomy_levels)
1578- headers.append("Data source")
1579- writer.writerow(headers)
1580- for t in taxonomy:
1581- otu = t
1582- try:
1583- species = taxonomy[t]['species']
1584- except KeyError:
1585- species = "-"
1586- try:
1587- genus = taxonomy[t]['genus']
1588- except KeyError:
1589- genus = "-"
1590- try:
1591- family = taxonomy[t]['family']
1592- except KeyError:
1593- family = "-"
1594- try:
1595- superfamily = taxonomy[t]['superfamily']
1596- except KeyError:
1597- superfamily = "-"
1598- try:
1599- infraorder = taxonomy[t]['infraorder']
1600- except KeyError:
1601- infraorder = "-"
1602- try:
1603- suborder = taxonomy[t]['suborder']
1604- except KeyError:
1605- suborder = "-"
1606- try:
1607- order = taxonomy[t]['order']
1608- except KeyError:
1609- order = "-"
1610- try:
1611- superorder = taxonomy[t]['superorder']
1612- except KeyError:
1613- superorder = "-"
1614- try:
1615- subclass = taxonomy[t]['subclass']
1616- except KeyError:
1617- subclass = "-"
1618- try:
1619- tclass = taxonomy[t]['class']
1620- except KeyError:
1621- tclass = "-"
1622- try:
1623- subphylum = taxonomy[t]['subphylum']
1624- except KeyError:
1625- subphylum = "-"
1626- try:
1627- phylum = taxonomy[t]['phylum']
1628- except KeyError:
1629- phylum = "-"
1630- try:
1631- superphylum = taxonomy[t]['superphylum']
1632- except KeyError:
1633- superphylum = "-"
1634- try:
1635- infrakingdom = taxonomy[t]['infrakingdom']
1636- except:
1637- infrakingdom = "-"
1638- try:
1639- subkingdom = taxonomy[t]['subkingdom']
1640- except:
1641- subkingdom = "-"
1642- try:
1643- kingdom = taxonomy[t]['kingdom']
1644- except KeyError:
1645- kingdom = "-"
1646- try:
1647- provider = taxonomy[t]['provider']
1648- except KeyError:
1649- provider = "-"
1650-
1651- if (isinstance(species, list)):
1652- species = " ".join(species)
1653- this_classification = [
1654- otu.encode('utf-8'),
1655- species.encode('utf-8'),
1656- genus.encode('utf-8'),
1657- family.encode('utf-8'),
1658- superfamily.encode('utf-8'),
1659- infraorder.encode('utf-8'),
1660- suborder.encode('utf-8'),
1661- order.encode('utf-8'),
1662- superorder.encode('utf-8'),
1663- subclass.encode('utf-8'),
1664- tclass.encode('utf-8'),
1665- subphylum.encode('utf-8'),
1666- phylum.encode('utf-8'),
1667- superphylum.encode('utf-8'),
1668- infrakingdom.encode('utf-8'),
1669- subkingdom.encode('utf-8'),
1670- kingdom.encode('utf-8'),
1671- provider.encode('utf-8')]
1672- writer.writerow(this_classification)
1673-
1674+ taxa_list = stk._getTaxaFromNewick(tree)
1675+
1676+ print "Final taxa count:", len(taxa_list)
1677+
1678
1679 def _uniquify(l):
1680 """
1681@@ -379,28 +825,119 @@
1682
1683 return keys.keys()
1684
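`_uniquify` deduplicates via dictionary keys and returns `keys.keys()`, so under Python 2 the result order is not guaranteed. An order-preserving equivalent (a sketch):

```python
def uniquify(items):
    """Remove duplicates while keeping first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out
```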
1685-def add_taxa(tree, new_taxa, taxa_in_clade):
1686+def add_taxa(tree, new_taxa, taxa_in_clade, level):
1687
1688 # create new tree of the new taxa
1689- #tree_string = "(" + ",".join(new_taxa) + ");"
1690- #additionalTaxa = stk._parse_tree(tree_string)
1691+ additionalTaxa = tree_from_taxonomy(level,new_taxa)
1692
1693 # find mrca parent
1694 treeobj = stk._parse_tree(tree)
1695 mrca = stk.get_mrca(tree,taxa_in_clade)
1696- mrca_parent = treeobj.node(mrca).parent
1697-
1698- # insert a node into the tree between the MRCA and it's parent (p4.addNodeBetweenNodes)
1699- newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1700-
1701- # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1702- #treeobj.addSubTree(newNode, additionalTaxa)
1703- for t in new_taxa:
1704- treeobj.addSibLeaf(newNode,t)
1705-
1706- # return new tree
1707+ if (mrca == 0):
1708+ # we need to make a new tree! The additional taxa are being placed at the root of the tree
1709+ t = Tree()
1710+ A = t.add_child()
1711+ B = t.add_child()
1712+ t1 = Tree(additionalTaxa)
1713+ t2 = Tree(tree)
1714+ A.add_child(t1)
1715+ B.add_child(t2)
1716+ return t.write(format=9)
1717+ else:
1718+ mrca = treeobj.nodes[mrca]
1719+ additionalTaxa = stk._parse_tree(additionalTaxa)
1720+
1721+ if len(taxa_in_clade) == 1:
1722+ taxon = treeobj.node(taxa_in_clade[0])
1723+ mrca = treeobj.addNodeBetweenNodes(taxon,mrca)
1724+
1725+
1726+ # insert a node into the tree between the MRCA and it's parent (p4.addNodeBetweenNodes)
1727+ # newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1728+
1729+ # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1730+ treeobj.addSubTree(mrca, additionalTaxa, ignoreRootAssert=True)
1731+
1732 return treeobj.writeNewick(fName=None,toString=True).strip()
1733
1734+
1735+
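`add_taxa` above grafts the new subtree at the MRCA of the clade's existing taxa. The MRCA computation itself is simple enough to sketch stand-alone given only child-to-parent pointers (a sketch; the real code delegates to `stk.get_mrca` and p4's tree objects):

```python
def mrca(parents, taxa):
    """Most recent common ancestor of `taxa` in a child -> parent mapping."""
    def path_to_root(node):
        path = [node]
        while node in parents:
            node = parents[node]
            path.append(node)
        return path

    # intersect the root paths of all taxa
    common = set(path_to_root(taxa[0]))
    for t in taxa[1:]:
        common &= set(path_to_root(t))
    # deepest shared node = first common node on any taxon's root path
    for node in path_to_root(taxa[0]):
        if node in common:
            return node
```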
1736+def tree_from_taxonomy(top_level, tree_taxonomy):
1737+
1738+ start_level = taxonomy_levels.index(top_level)
1739+ new_taxa = tree_taxonomy.keys()
1740+
1741+ tl_types = []
1742+ for tt in tree_taxonomy:
1743+ tl_types.append(tree_taxonomy[tt][top_level])
1744+
1745+ tl_types = _uniquify(tl_types)
1746+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
1747+
1748+ t = Tree()
1749+ nodes = {}
1750+ nodes[top_level] = []
1751+ for tl in tl_types:
1752+ n = t.add_child(name=tl)
1753+ nodes[top_level].append({tl:n})
1754+
1755+ for l in levels_to_worry_about[-2::-1]:
1756+ names = []
1757+ nodes[l] = []
1758+ ci = levels_to_worry_about.index(l)
1759+ for tt in tree_taxonomy:
1760+ try:
1761+ names.append(tree_taxonomy[tt][l])
1762+ except KeyError:
1763+ pass
1764+ names = _uniquify(names)
1765+ for n in names:
1766+ # find my parent
1767+ parent = None
1768+ for tt in tree_taxonomy:
1769+ try:
1770+ if tree_taxonomy[tt][l] == n:
1771+ try:
1772+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
1773+ level = ci+1
1774+ except KeyError:
1775+ try:
1776+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+2]]
1777+ level = ci+2
1778+ except KeyError:
1779+ try:
1780+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+3]]
1781+ level = ci+3
1782+ except KeyError:
1783+ print "ERROR: tried to find some taxonomic info for "+tt+" from tree_taxonomy file/downloaded data and I went three levels up, but failed to find any. Looked at:\n"
1784+ print "\t"+levels_to_worry_about[ci+1]
1785+ print "\t"+levels_to_worry_about[ci+2]
1786+ print "\t"+levels_to_worry_about[ci+3]
1787+ print "This is the taxonomy info I have for "+tt
1788+ print tree_taxonomy[tt]
1789+ sys.exit(1)
1790+
1791+ k = []
1792+ for nd in nodes[levels_to_worry_about[level]]:
1793+ k.extend(nd.keys())
1794+ i = 0
1795+ for kk in k:
1796+ if kk == parent:
1797+ break
1798+ i += 1
1799+ parent_id = i
1800+ break
1801+ except KeyError:
1802+ pass # no data at this level for this beastie
1803+ # find out where to attach it
1804+ node_id = nodes[levels_to_worry_about[level]][parent_id][parent]
1805+ nd = node_id.add_child(name=n.replace(" ","_"))
1806+ nodes[l].append({n:nd})
1807+
1808+ tree = t.write(format=9)
1809+
1810+ return tree
1811+
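The `tree_from_taxonomy` routine added above nests each taxonomic level's names under their parent at the next level up, then emits the result as Newick. A minimal, dependency-free sketch of that nesting logic (the toolkit itself builds the tree with ete2 and additionally copes with missing intermediate levels; the example taxonomy below is invented for illustration):

```python
# Dependency-free sketch of the nesting performed by tree_from_taxonomy.
# Not the toolkit's implementation: names and the example data are made up.
def tree_from_taxonomy_sketch(top_level, tree_taxonomy,
                              tlevels=('species', 'genus', 'family')):
    levels = list(tlevels)[:list(tlevels).index(top_level) + 1]
    children = {}  # parent name -> ordered list of child names
    roots = []
    for info in tree_taxonomy.values():
        if info[top_level] not in roots:
            roots.append(info[top_level])
    for ci, l in enumerate(levels[:-1]):  # attach each level to the one above
        for info in tree_taxonomy.values():
            if l in info and levels[ci + 1] in info:
                kids = children.setdefault(info[levels[ci + 1]], [])
                if info[l] not in kids:
                    kids.append(info[l])

    def newick(name):
        kids = children.get(name, [])
        label = name.replace(" ", "_")
        if not kids:
            return label
        return "(" + ",".join(newick(k) for k in kids) + ")" + label

    return "(" + ",".join(newick(r) for r in roots) + ");"

taxonomy = {
    'Gallus gallus': {'species': 'Gallus gallus', 'genus': 'Gallus',
                      'family': 'Phasianidae'},
    'Pavo cristatus': {'species': 'Pavo cristatus', 'genus': 'Pavo',
                       'family': 'Phasianidae'},
}
print(tree_from_taxonomy_sketch('family', taxonomy))
# (((Gallus_gallus)Gallus,(Pavo_cristatus)Pavo)Phasianidae);
```

The real code keeps a per-level list of `{name: node}` dictionaries and searches up to three levels for a parent; the sketch only shows the core parent-child bucketing.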
1812 if __name__ == "__main__":
1813 main()
1814
1815
1816=== modified file 'stk/scripts/plot_character_taxa_matrix.py'
1817--- stk/scripts/plot_character_taxa_matrix.py 2014-12-10 08:55:43 +0000
1818+++ stk/scripts/plot_character_taxa_matrix.py 2017-01-12 09:27:31 +0000
1819@@ -42,6 +42,18 @@
1820 default=False
1821 )
1822 parser.add_argument(
1823+ '-t',
1824+ '--taxonomy',
1825+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1826+ )
1827+ parser.add_argument(
1828+ '--level',
1829+ choices=['family','superfamily','infraorder','suborder','order'],
1830+ default='family',
1831+ help="""What level to group the taxonomy at. Default is family.
1832+ Note data for a particular level may be missing in the taxonomy."""
1833+ )
1834+ parser.add_argument(
1835 'input_file',
1836 metavar='input_file',
1837 nargs=1,
1838@@ -59,14 +71,58 @@
1839 verbose = args.verbose
1840 input_file = args.input_file[0]
1841 output_file = args.output_file[0]
1842+ taxonomy = args.taxonomy
1843+ level = args.level
1844
1845 XML = stk.load_phyml(input_file)
1846+ if not taxonomy == None:
1847+ taxonomy = stk.load_taxonomy(taxonomy)
1848+
1849 all_taxa = stk.get_all_taxa(XML)
1850 all_chars_d = stk.get_all_characters(XML)
1851 all_chars = []
1852 for c in all_chars_d:
1853 all_chars.extend(all_chars_d[c])
1854
1855+ if not taxonomy == None:
1856+ tax_data = {}
1857+ new_all_taxa = []
1858+ for t in all_taxa:
1859+ taxon = t.replace("_"," ")
1860+ try:
1861+ if taxonomy[taxon][level] == "":
1862+ # skip this
1863+ continue
1864+ tax_data[t] = taxonomy[taxon][level]
1865+ except KeyError:
1866+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1867+ tax_data[t] = 'zzzzz' # it's at the end...
1868+
1869+ from sets import Set
1870+ unique = set(tax_data.values())
1871+ unique = list(unique)
1872+ unique.sort()
1873+ print "Groups are:"
1874+ print unique
1875+ counts = []
1876+ for u in unique:
1877+ count = 0
1878+ for t in tax_data:
1879+ if tax_data[t] == u:
1880+ count += 1
1881+ new_all_taxa.append(t)
1882+ counts.append(count)
1883+
1884+ all_taxa = new_all_taxa
1885+ # cumulate counts
1886+ count_cumulate = []
1887+ count_cumulate.append(counts[0])
1888+ for c in counts[1::]:
1889+ count_cumulate.append(c+count_cumulate[-1])
1890+
1891+ print count_cumulate
1892+
1893+
1894 taxa_character_matrix = {}
1895 for t in all_taxa:
1896 taxa_character_matrix[t] = []
1897@@ -77,7 +133,8 @@
1898 taxa = stk.get_taxa_from_tree(XML,t, sort=True)
1899 for taxon in taxa:
1900 taxon = taxon.replace(" ","_")
1901- taxa_character_matrix[taxon].extend(chars)
1902+ if taxon in all_taxa:
1903+ taxa_character_matrix[taxon].extend(chars)
1904
1905 for t in taxa_character_matrix:
1906 array = taxa_character_matrix[t]
1907@@ -92,6 +149,31 @@
1908 x.append(i)
1909 y.append(j)
1910
1911+
1912+ i = 0
1913+ for j in all_chars:
1914+ # do a substitution of character names to tidy things up
1915+ if j.lower().startswith('mitochondrial carrier; adenine nucleotide translocator'):
1916+ j = "ANT"
1917+ if j.lower().startswith('mitochondrially encoded 12s'):
1918+ j = '12S'
1919+ if j.lower().startswith('complete mitochondrial genome'):
1920+ j = 'Mitogenome'
1921+ if j.lower().startswith('mtdna'):
1922+ j = "mtDNA restriction sites"
1923+ if j.lower().startswith('h3 histone'):
1924+ j = 'H3'
1925+ if j.lower().startswith('mitochondrially encoded cytochrome'):
1926+ j = 'COI'
1927+ if j.lower().startswith('rna, 28s'):
1928+ j = '28S'
1929+ if j.lower().startswith('rna, 18s'):
1930+ j = '18S'
1931+ if j.lower().startswith('mitochondrially encoded 16s'):
1932+ j = '16S'
1933+ all_chars[i] = j
1934+ i += 1
1935+
1936 fig=figure(figsize=(22,17),dpi=90)
1937 fig.subplots_adjust(left=0.3)
1938 ax = fig.add_subplot(1,1,1)
1939
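Both plotting scripts gain the same taxonomy-grouping block: taxa are bucketed by their group at the chosen level, taxa missing from the taxonomy sink to the end of the axis (the `'zzzzz'` sentinel), and the cumulative counts give the axis positions of group boundaries. A standalone sketch of that logic (the group and taxon names below are invented):

```python
# Mirror of the grouping block added to both plotting scripts above.
# Illustrative only: the function name and example data are assumptions.
def group_taxa(all_taxa, taxonomy, level):
    """Return taxa sorted by group, the group names, and cumulative counts."""
    tax_data = {}
    for t in all_taxa:
        taxon = t.replace("_", " ")
        try:
            if taxonomy[taxon][level] == "":
                continue  # no data at this level: drop the taxon
            tax_data[t] = taxonomy[taxon][level]
        except KeyError:
            tax_data[t] = 'zzzzz'  # unknown taxa sort to the end of the axis
    groups = sorted(set(tax_data.values()))
    new_all_taxa, counts = [], []
    for g in groups:
        members = [t for t in tax_data if tax_data[t] == g]
        new_all_taxa.extend(members)
        counts.append(len(members))
    count_cumulate = []
    for c in counts:
        count_cumulate.append(c + (count_cumulate[-1] if count_cumulate else 0))
    return new_all_taxa, groups, count_cumulate

taxonomy = {'Gallus gallus': {'family': 'Phasianidae'},
            'Pavo cristatus': {'family': 'Phasianidae'},
            'Anas crecca': {'family': 'Anatidae'}}
taxa, groups, edges = group_taxa(
    ['Gallus_gallus', 'Anas_crecca', 'Pavo_cristatus', 'Dodo_ignotus'],
    taxonomy, 'family')
# groups: ['Anatidae', 'Phasianidae', 'zzzzz']; edges: [1, 3, 4]
```

The cumulative counts (`edges` here, `count_cumulate` in the scripts) are what let the plot draw a separator after each taxonomic group.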
1940=== modified file 'stk/scripts/plot_tree_taxa_matrix.py'
1941--- stk/scripts/plot_tree_taxa_matrix.py 2014-12-10 08:55:43 +0000
1942+++ stk/scripts/plot_tree_taxa_matrix.py 2017-01-12 09:27:31 +0000
1943@@ -43,6 +43,18 @@
1944 default=False
1945 )
1946 parser.add_argument(
1947+ '-t',
1948+ '--taxonomy',
1949+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1950+ )
1951+ parser.add_argument(
1952+ '--level',
1953+ choices=['family','superfamily','infraorder','suborder','order'],
1954+ default='family',
1955+ help="""What level to group the taxonomy at. Default is family.
1956+ Note data for a particular level may be missing in the taxonomy."""
1957+ )
1958+ parser.add_argument(
1959 'input_file',
1960 metavar='input_file',
1961 nargs=1,
1962@@ -60,13 +72,57 @@
1963 verbose = args.verbose
1964 input_file = args.input_file[0]
1965 output_file = args.output_file[0]
1966+ taxonomy = args.taxonomy
1967+ level = args.level
1968
1969 XML = stk.load_phyml(input_file)
1970+ if not taxonomy == None:
1971+ taxonomy = stk.load_taxonomy(taxonomy)
1972+
1973 all_taxa = stk.get_all_taxa(XML)
1974
1975 taxa_tree_matrix = {}
1976 for t in all_taxa:
1977 taxa_tree_matrix[t] = []
1978+
1979+ if not taxonomy == None:
1980+ tax_data = {}
1981+ new_all_taxa = []
1982+ for t in all_taxa:
1983+ taxon = t.replace("_"," ")
1984+ try:
1985+ if taxonomy[taxon][level] == "":
1986+ # skip this
1987+ continue
1988+ tax_data[t] = taxonomy[taxon][level]
1989+ except KeyError:
1990+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1991+ tax_data[t] = 'zzzzz' # it's at the end...
1992+
1993+ from sets import Set
1994+ unique = set(tax_data.values())
1995+ unique = list(unique)
1996+ unique.sort()
1997+ print "Groups are:"
1998+ print unique
1999+ counts = []
2000+ for u in unique:
2001+ count = 0
2002+ for t in tax_data:
2003+ if tax_data[t] == u:
2004+ count += 1
2005+ new_all_taxa.append(t)
2006+ counts.append(count)
2007+
2008+ all_taxa = new_all_taxa
2009+ # cumulate counts
2010+ count_cumulate = []
2011+ count_cumulate.append(counts[0])
2012+ for c in counts[1::]:
2013+ count_cumulate.append(c+count_cumulate[-1])
2014+
2015+ print count_cumulate
2016+
2017
2018 trees = stk.obtain_trees(XML)
2019 i = 0
2020
2021=== modified file 'stk/scripts/remove_poorly_constrained_taxa.py'
2022--- stk/scripts/remove_poorly_constrained_taxa.py 2014-04-18 11:57:14 +0000
2023+++ stk/scripts/remove_poorly_constrained_taxa.py 2017-01-12 09:27:31 +0000
2024@@ -12,8 +12,8 @@
2025
2026 # do stuff
2027 parser = argparse.ArgumentParser(
2028- prog="convert tree from specific to generic",
2029- description="""Converts a tree at specific level to generic level""",
2030+ prog="remove poorly constrained taxa",
2031+ description="""Remove taxa that appear in one source tree only.""",
2032 )
2033 parser.add_argument(
2034 '-v',
2035@@ -34,6 +34,13 @@
2036 " to removal those in polytomies *and* only in one other tree."
2037 )
2038 parser.add_argument(
2039+ '--tree_only',
2040+ default=False,
2041+ action='store_true',
2042+ help="Restrict removal to taxa that only occur in one source tree. Defaults"+
2043+ " to removing those in polytomies *and* only in one other tree."
2044+ )
2045+ parser.add_argument(
2046 'input_phyml',
2047 metavar='input_phyml',
2048 nargs=1,
2049@@ -43,13 +50,13 @@
2050 'input_tree',
2051 metavar='input_tree',
2052 nargs=1,
2053- help="Your tree"
2054+ help="Your tree - can be NULL or None"
2055 )
2056 parser.add_argument(
2057 'output_tree',
2058 metavar='output_tree',
2059 nargs=1,
2060- help="Your output tree"
2061+ help="Your output tree or Phyml - if input_tree is None, this is the Phyml output"
2062 )
2063
2064
2065@@ -62,14 +69,20 @@
2066 dl = True
2067 poly_only = args.poly_only
2068 input_tree = args.input_tree[0]
2069- output_tree = args.output_tree[0]
2070+ if input_tree == 'NULL' or input_tree == 'None':
2071+ input_tree = None
2072+ output_file = args.output_tree[0]
2073 input_phyml = args.input_phyml[0]
2074
2075 XML = stk.load_phyml(input_phyml)
2076 # load tree
2077- supertree = stk.import_tree(input_tree)
2078+ if (not input_tree == None):
2079+ supertree = stk.import_tree(input_tree)
2080+ taxa = stk._getTaxaFromNewick(supertree)
2081+ else:
2082+ supertree = None
2083+ taxa = stk.get_all_taxa(XML)
2084 # grab taxa
2085- taxa = stk._getTaxaFromNewick(supertree)
2086 delete_list = []
2087
2088 # loop over taxa in supertree and get some stats
2089@@ -115,19 +128,29 @@
2090
2091 print "Taxa: "+str(len(taxa))
2092 print "Deleting: "+str(len(delete_list))
2093- # done, so delete the problem taxa from the supertree
2094- for t in delete_list:
2095- # remove taxa from supertree
2096- supertree = stk._sub_taxa_in_tree(supertree,t)
2097-
2098- # save supertree
2099- tree = {}
2100- tree['Tree_1'] = supertree
2101- output = stk._amalgamate_trees(tree,format='nexus')
2102- # write file
2103- f = open(output_tree,"w")
2104- f.write(output)
2105- f.close()
2106+
2107+ if not supertree == None:
2108+ # done, so delete the problem taxa from the supertree
2109+ for t in delete_list:
2110+ # remove taxa from supertree
2111+ supertree = stk._sub_taxa_in_tree(supertree,t)
2112+
2113+ # save supertree
2114+ tree = {}
2115+ tree['Tree_1'] = supertree
2116+ output = stk._amalgamate_trees(tree,format='nexus')
2117+ # write file
2118+ f = open(output_file,"w")
2119+ f.write(output)
2120+ f.close()
2121+ else:
2122+ new_phyml = stk.substitute_taxa(XML,delete_list)
2123+ # write file
2124+ f = open(output_file,"w")
2125+ f.write(new_phyml)
2126+ f.close()
2127+
2128+
2129
2130 if (dl):
2131 # write file
2132
2133=== added file 'stk/scripts/tree_from_taxonomy.py'
2134--- stk/scripts/tree_from_taxonomy.py 1970-01-01 00:00:00 +0000
2135+++ stk/scripts/tree_from_taxonomy.py 2017-01-12 09:27:31 +0000
2136@@ -0,0 +1,142 @@
2137+# trees ready for supertree construction.
2138+# Copyright (C) 2015, Jon Hill, Katie Davis
2139+#
2140+# This program is free software: you can redistribute it and/or modify
2141+# it under the terms of the GNU General Public License as published by
2142+# the Free Software Foundation, either version 3 of the License, or
2143+# (at your option) any later version.
2144+#
2145+# This program is distributed in the hope that it will be useful,
2146+# but WITHOUT ANY WARRANTY; without even the implied warranty of
2147+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
2148+# GNU General Public License for more details.
2149+#
2150+# You should have received a copy of the GNU General Public License
2151+# along with this program. If not, see <http://www.gnu.org/licenses/>.
2152+#
2153+# Jon Hill. jon.hill@york.ac.uk
2154+
2155+import argparse
2156+import copy
2157+import os
2158+import sys
2159+stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
2160+sys.path.insert(0, stk_path)
2161+import supertree_toolkit as stk
2162+import csv
2163+from ete2 import Tree
2164+
2165+taxonomy_levels = ['species','subgenus','genus','subfamily','family','superfamily','subsection','section','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
2166+tlevels = ['species','genus','family','order','class','phylum','kingdom']
2167+
2168+
2169+def main():
2170+
2171+ # do stuff
2172+ parser = argparse.ArgumentParser(
2173+ prog="create a tree from a taxonomy file",
2174+ description="Create a taxonomic tree",
2175+ )
2176+ parser.add_argument(
2177+ '-v',
2178+ '--verbose',
2179+ action='store_true',
2180+ help="Verbose output: mainly progress reports.",
2181+ default=False
2182+ )
2183+ parser.add_argument(
2184+ 'top_level',
2185+ nargs=1,
2186+ help="The top level group to start with, e.g. family"
2187+ )
2188+ parser.add_argument(
2189+ 'input_file',
2190+ metavar='input_file',
2191+ nargs=1,
2192+ help="Your taxonomy file"
2193+ )
2194+ parser.add_argument(
2195+ 'output_file',
2196+ metavar='output_file',
2197+ nargs=1,
2198+ help="Your new tree file"
2199+ )
2200+
2201+ args = parser.parse_args()
2202+ verbose = args.verbose
2203+ input_file = args.input_file[0]
2204+ output_file = args.output_file[0]
2205+ top_level = args.top_level[0]
2206+
2207+ start_level = taxonomy_levels.index(top_level)
2208+ tree_taxonomy = stk.load_taxonomy(input_file)
2209+ new_taxa = tree_taxonomy.keys()
2210+
2211+ tl_types = []
2212+ for tt in tree_taxonomy:
2213+ tl_types.append(tree_taxonomy[tt][top_level])
2214+
2215+ tl_types = _uniquify(tl_types)
2216+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
2217+
2218+ #print levels_to_worry_about[-2::-1]
2219+
2220+ t = Tree()
2221+ nodes = {}
2222+ nodes[top_level] = []
2223+ for tl in tl_types:
2224+ n = t.add_child(name=tl)
2225+ nodes[top_level].append({tl:n})
2226+
2227+ for l in levels_to_worry_about[-2::-1]:
2228+ #print t
2229+ names = []
2230+ nodes[l] = []
2231+ ci = levels_to_worry_about.index(l)
2232+ for tt in tree_taxonomy:
2233+ names.append(tree_taxonomy[tt][l])
2234+ names = _uniquify(names)
2235+ for n in names:
2236+ #print n
2237+ # find my parent
2238+ parent = None
2239+ for tt in tree_taxonomy:
2240+ if tree_taxonomy[tt][l] == n:
2241+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
2242+ k = []
2243+ for nd in nodes[levels_to_worry_about[ci+1]]:
2244+ k.extend(nd.keys())
2245+ i = 0
2246+ for kk in k:
2247+ #print kk
2248+ if kk == parent:
2249+ break
2250+ i += 1
2251+ parent_id = i
2252+ break
2253+ # find out where to attach it
2254+ node_id = nodes[levels_to_worry_about[ci+1]][parent_id][parent]
2255+ nd = node_id.add_child(name=n.replace(" ","_"))
2256+ nodes[l].append({n:nd})
2257+
2258+ tree = t.write(format=9)
2259+ tree = stk._collapse_nodes(tree)
2260+ tree = stk._collapse_nodes(tree)
2261+ print tree
2262+ open(output_file, "w").write(tree)
2263+
2264+def _uniquify(l):
2265+ """
2266+ Make a list, l, contain only unique data
2267+ """
2268+ keys = {}
2269+ for e in l:
2270+ keys[e] = 1
2271+
2272+ return keys.keys()
2273+
2274+if __name__ == "__main__":
2275+ main()
2276+
2277+
2278+
2279
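The new script's `_uniquify` helper builds a dictionary and returns its `keys()`, which under Python 2 does not preserve the input ordering. An order-preserving variant (illustrative, not part of the toolkit) keeps sibling ordering in the output tree tied to the input file rather than to hash order:

```python
# Order-preserving alternative to the script's _uniquify (which returns
# dict keys and so, under Python 2, can scramble the input order).
def uniquify(l):
    seen = set()
    out = []
    for e in l:
        if e not in seen:
            seen.add(e)
            out.append(e)
    return out
```

Either version suffices for correctness here, since parent lookups index the same per-level lists, but the stable variant makes runs reproducible.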
2280=== modified file 'stk/stk'
2281--- stk/stk 2014-12-09 10:58:48 +0000
2282+++ stk/stk 2017-01-12 09:27:31 +0000
2283@@ -23,6 +23,7 @@
2284 import sys
2285 import argparse
2286 import traceback
2287+import time
2288 try:
2289 __file__
2290 except NameError:
2291@@ -41,6 +42,10 @@
2292 import string
2293 import stk.p4 as p4
2294 import lxml
2295+import csv
2296+import tempfile
2297+from subprocess import check_call, CalledProcessError, call
2298+
2299 import stk.bzr_version as bzr_version
2300 d = bzr_version.version_info
2301 build = d.get('revno','<unknown revno>')
2302@@ -366,7 +371,7 @@
2303
2304 # Clean data
2305 parser_cm = subparsers.add_parser('clean_data',
2306- help='Remove errant taxa, uninformative trees and empty sources.'
2307+ help='Renames all sources and trees sensibly. Removes errant taxa, uninformative trees and empty sources.'
2308 )
2309 parser_cm.add_argument('input',
2310 help='The input phyml file')
2311@@ -488,7 +493,81 @@
2312 parser_cm.add_argument('subs',
2313 help='The subs file')
2314 parser_cm.set_defaults(func=check_subs)
2315-
2316+
2317+ # taxonomic name checker
2318+ parser_cm = subparsers.add_parser('check_otus',
2319+ help='Check your OTUs against EoL.'
2320+ )
2321+ parser_cm.add_argument('input',
2322+ help='The input Phyml. Also accepts tree files or a simple list')
2323+ parser_cm.add_argument('output',
2324+ help='The output CSV file. Taxon, synonyms, status')
2325+ parser_cm.add_argument('--overwrite',
2326+ action='store_true',
2327+ default=False,
2328+ help="Overwrite the existing file without asking for confirmation")
2329+ parser_cm.set_defaults(func=check_otus)
2330+
2331+ # create taxonomy csv file
2332+ parser_cm = subparsers.add_parser('create_taxonomy',
2333+ help='Create a taxonomy file in CSV for you to then augment.'
2334+ )
2335+ parser_cm.add_argument('input',
2336+ help='The input Phyml. Also accepts tree files or a simple list')
2337+ parser_cm.add_argument('output',
2338+ help='The output CSV file. Name, followed by classification and source')
2339+ parser_cm.add_argument('--overwrite',
2340+ action='store_true',
2341+ default=False,
2342+ help="Overwrite the existing file without asking for confirmation")
2343+ parser_cm.add_argument('--taxonomy',
2344+ help="Give a starting taxonomy file, e.g. one you ran earlier",)
2345+ parser_cm.set_defaults(func=create_taxonomy)
2346+
2347+
2348+ # do the subs in a one go using taxonomy
2349+ parser_cm = subparsers.add_parser('auto_subs',
2350+ help='Using a taxonomy, generate a species level version of your data in one go.'
2351+ )
2352+ parser_cm.add_argument('input',
2353+ help='The input Phyml')
2354+ parser_cm.add_argument('taxonomy',
2355+ help='Your taxonomy file',
2356+ )
2357+ parser_cm.add_argument('output',
2358+ help='The output phyml')
2359+ parser_cm.add_argument('--overwrite',
2360+ action='store_true',
2361+ default=False,
2362+ help="Overwrite the existing file without asking for confirmation")
2363+ #parser_cm.add_argument('--level',
2364+ # choices=supertree_toolkit.taxonomy_levels,
2365+ # help="Taxonomic level to output at",)
2366+ parser_cm.set_defaults(func=auto_subs)
2367+
2368+
2369+ # attempt to process the data into a matrix all automatically
2370+ parser_cm = subparsers.add_parser('process',
2371+ help='Generate a species-level matrix, doing all the checks and processing automatically. Note this creates a taxonomy as it goes, so the result will not be perfect (as taxonomies are not perfect)'
2372+ )
2373+ parser_cm.add_argument('input',
2374+ help='The input Phyml')
2375+ parser_cm.add_argument('output',
2376+ help='The output matrix')
2377+ parser_cm.add_argument('--taxonomy_file',
2378+ help='Existing taxonomy file to prevent redownloading data. Any taxa not in the file will be checked online, so partially complete files are OK.')
2379+ parser_cm.add_argument('--equivalents_file',
2380+ help='Existing equivalents file from a taxonomic name check. Any taxa not in the file will be checked online, so partially complete files are OK.')
2381+ parser_cm.add_argument('--overwrite',
2382+ action='store_true',
2383+ default=False,
2384+ help="Overwrite the existing file without asking for confirmation")
2385+ parser_cm.add_argument('--no_store',
2386+ action="store_true",
2387+ default=False,
2388+ help="Do not store intermediate files -- not recommended")
2389+ parser_cm.set_defaults(func=process)
2390+
2391
2392 # before we let argparse work its magic, check for --version
2393 if "--version" in sys.argv:
2394@@ -602,7 +681,7 @@
2395 # check if output files are there
2396 if (output_file and os.path.exists(output_file) and not overwrite):
2397 print "Output file exists. Either remove the file or use the --overwrite flag."
2398- print "Do you wish to continue? [Y/n]"
2399+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2400 while True:
2401 k=inkey()
2402 if k.lower() == 'n':
2403@@ -612,7 +691,7 @@
2404 break
2405 if (not newphyml == None and os.path.exists(newphyml) and not overwrite):
2406 print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2407- print "Do you wish to continue? [Y/n]"
2408+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2409 while True:
2410 k=inkey()
2411 if k.lower() == 'n':
2412@@ -624,9 +703,9 @@
2413 XML = supertree_toolkit.load_phyml(input_file)
2414 try:
2415 if (newphyml == None):
2416- data_independence = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2417+ data_independence, subsets = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2418 else:
2419- data_independence, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2420+ data_independence, subsets, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2421 except NotUniqueError as detail:
2422 msg = "***Error: Failed to check independence.\n"+detail.msg
2423 print msg
2424@@ -644,7 +723,7 @@
2425 print msg
2426 return
2427 except:
2428- msg = "***Error: failed to check independence due to unknown error."
2429+ msg = "***Error: failed to check independence due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2430 print msg
2431 traceback.print_exc()
2432 return
2433@@ -653,16 +732,14 @@
2434 data_ind = ""
2435 #column headers
2436 data_ind = "Source trees that are subsets of others\n"
2437- data_ind = data_ind + "Flagged tree, is a subset of:\n"
2438- for name in data_independence:
2439- if ( data_independence[name][1] == supertree_toolkit.SUBSET ):
2440- data_ind += name + "," + data_independence[name][0] + "\n"
2441+ data_ind = data_ind + "Flagged tree(s), is/are subset(s) of:\n"
2442+ for names in subsets:
2443+ data_ind += names[1:] + "," + names[0] + "\n"
2444
2445 data_ind += "\n\nSource trees that are identical to others\n"
2446- data_ind = data_ind + "Flagged tree, is identical to:\n"
2447- for name in data_independence:
2448- if ( data_independence[name][1] == supertree_toolkit.IDENTICAL ):
2449- data_ind += name + "," + data_independence[name][0] + "\n"
2450+ data_ind = data_ind + "Flagged tree(s), is/are identical to:\n"
2451+ for names in data_independence:
2452+ data_ind += names[1:] + "," + names[0] + "\n"
2453
2454
2455 if (output_file == False or
2456@@ -762,7 +839,7 @@
2457 # Does the output file already exist?
2458 if (os.path.exists(output_file) and not overwrite):
2459 print "Output file exists. Either remove the file or use the --overwrite flag."
2460- print "Do you wish to continue? [Y/n]"
2461+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2462 while True:
2463 k=inkey()
2464 if k.lower() == 'n':
2465@@ -771,6 +848,7 @@
2466 if k.lower() == 'y':
2467 break
2468 try:
2469+
2470 XML = supertree_toolkit.load_phyml(input_file)
2471 input_is_xml = True
2472 except:
2473@@ -896,7 +974,7 @@
2474 # Does the output file already exist?
2475 if (os.path.exists(output_file) and not overwrite):
2476 print "Output file exists. Either remove the file or use the --overwrite flag."
2477+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2478+ print "Do you wish to continue and overwrite the file anyway?? [Y/n]"
2479 while True:
2480 k=inkey()
2481 if k.lower() == 'n':
2482@@ -942,7 +1020,7 @@
2483 print msg
2484 return
2485 except:
2486- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2487+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2488 print msg
2489 traceback.print_exc()
2490 return
2491@@ -983,7 +1061,7 @@
2492
2493 if (os.path.exists(output_file) and not overwrite):
2494 print "Output file exists. Either remove the file or use the --overwrite flag."
2495- print "Do you wish to continue? [Y/n]"
2496+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2497 while True:
2498 k=inkey()
2499 if k.lower() == 'n':
2500@@ -1013,7 +1091,7 @@
2501 print msg
2502 return
2503 except:
2504- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2505+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2506 print msg
2507 traceback.print_exc()
2508 return
2509@@ -1060,7 +1138,7 @@
2510 print msg
2511 return
2512 except:
2513- msg = "***Error: Failed to export data due to unknown error.\n"
2514+ msg = "***Error: Failed to export data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2515 print msg
2516 traceback.print_exc()
2517 return
2518@@ -1115,7 +1193,7 @@
2519 print msg
2520 return
2521 except:
2522- msg = "***Error: Failed to check overlap due to unknown error.\n"
2523+ msg = "***Error: Failed to check overlap due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2524 print msg
2525 traceback.print_exc()
2526 return
2527@@ -1161,7 +1239,7 @@
2528 # check if output files are there
2529 if (output_file and os.path.exists(output_file) and not overwrite):
2530 print "Output file exists. Either remove the file or use the --overwrite flag."
2531- print "Do you wish to continue? [Y/n]"
2532+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2533 while True:
2534 k=inkey()
2535 if k.lower() == 'n':
2536@@ -1191,7 +1269,7 @@
2537 print msg
2538 return
2539 except:
2540- msg = "***Error: Failed to export trees due to unknown error.\n"
2541+ msg = "***Error: Failed to export trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2542 print msg
2543 traceback.print_exc()
2544 return
2545@@ -1220,7 +1298,7 @@
2546 # check if output files are there
2547 if (output_file and os.path.exists(output_file) and not overwrite):
2548 print "Output file exists. Either remove the file or use the --overwrite flag."
2549- print "Do you wish to continue? [Y/n]"
2550+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2551 while True:
2552 k=inkey()
2553 if k.lower() == 'n':
2554@@ -1309,7 +1387,7 @@
2555 print msg
2556 return
2557 except:
2558- msg = "***Error: Failed to permute trees due to unknown error.\n"
2559+ msg = "***Error: Failed to permute trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2560 print msg
2561 traceback.print_exc()
2562 return
2563@@ -1347,7 +1425,7 @@
2564 # check if output files are there
2565 if (os.path.exists(output_file) and not overwrite):
2566 print "Output file exists. Either remove the file or use the --overwrite flag."
2567- print "Do you wish to continue? [Y/n]"
2568+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2569 while True:
2570 k=inkey()
2571 if k.lower() == 'n':
2572@@ -1376,7 +1454,7 @@
2573 print msg
2574 return
2575 except:
2576- msg = "***Error: Failed to clean data due to unknown error.\n"
2577+ msg = "***Error: Failed to clean data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2578 print msg
2579 traceback.print_exc()
2580 return
2581@@ -1404,7 +1482,7 @@
2582 # check if output files are there
2583 if (os.path.exists(output_file) and not overwrite):
2584 print "Output file exists. Either remove the file or use the --overwrite flag."
2585- print "Do you wish to continue? [Y/n]"
2586+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2587 while True:
2588 k=inkey()
2589 if k.lower() == 'n':
2590@@ -1433,7 +1511,7 @@
2591 print msg
2592 return
2593 except:
2594- msg = "***Error: Failed to replace genera due to unknown error.\n"
2595+ msg = "***Error: Failed to replace genera due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2596 print msg
2597 traceback.print_exc()
2598 return
2599@@ -1488,7 +1566,7 @@
2600 new_trees = {}
2601 i = 1
2602 for t in trees:
2603- new_trees['tree_'+str(i)] = t
2604+ new_trees['tree_'+str(i)] = supertree_toolkit._collapse_nodes(t)
2605 i += 1
2606 output = supertree_toolkit._amalgamate_trees(new_trees,format=output_format)
2607 except TreeParseError as detail:
2608@@ -1503,7 +1581,7 @@
2609 # check if output files are there
2610 if (os.path.exists(output_file) and not overwrite):
2611 print "Output file exists. Either remove the file or use the --overwrite flag."
2613+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2613+ print "Do you wish to continue and overwrite the file anyway?? [Y/n]"
2614 while True:
2615 k=inkey()
2616 if k.lower() == 'n':
2617@@ -1540,7 +1618,7 @@
2618 # check if output files are there
2619 if (os.path.exists(output_file) and not overwrite):
2620 print "Output file exists. Either remove the file or use the --overwrite flag."
2621- print "Do you wish to continue? [Y/n]"
2622+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2623 while True:
2624 k=inkey()
2625 if k.lower() == 'n':
2626@@ -1589,7 +1667,7 @@
2627 print msg
2628 return
2629 except:
2630- msg = "***Error: Failed to create subset due to unknown error.\n"
2631+ msg = "***Error: Failed to create subset due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2632 print msg
2633 traceback.print_exc()
2634 return
2635@@ -1637,6 +1715,681 @@
2636 print "**************************************************************\n"
2637
2638
2639+def check_otus(args):
2640+ """check out the OTUs in the Phyml - are they considered valid?"""
2641+
2642+ verbose = args.verbose
2643+ input_file = args.input
2644+ output_file = args.output
2645+
2646+ print input_file
2647+ if (input_file.endswith(".phyml")):
2648+ XML = supertree_toolkit.load_phyml(input_file)
2649+ try:
2650+ equivs = supertree_toolkit.taxonomic_checker(XML, verbose=verbose)
2651+ except NotUniqueError as detail:
2652+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2653+ print msg
2654+ return
2655+ except InvalidSTKData as detail:
2656+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2657+ print msg
2658+ return
2659+ except UninformativeTreeError as detail:
2660+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2661+ print msg
2662+ return
2663+ except TreeParseError as detail:
2664+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2665+ print msg
2666+ return
2667+ except:
2668+ # what about no internet connection? What error does that throw?
2669+ msg = "***Error: failed to check OTUs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2670+ print msg
2671+ traceback.print_exc()
2672+ return
2673+ elif (input_file.endswith(".txt") or input_file.endswith('.dat')):
2674+ # read file - assume one taxa per line
2675+ with open(input_file,'r') as f:
2676+ lines = f.read().splitlines()
2677+ equivs = supertree_toolkit.taxonomic_checker_list(lines, verbose=verbose)
2678+ else:
2679+ # assume a tree!
2680+ equivs = supertree_toolkit.taxonomic_checker_tree(input_file, verbose=verbose)
2681+
2682+
2683+
2684+ f = open(output_file,"w")
2685+ for taxon in sorted(equivs.keys()):
2686+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
2687+ f.close()
2688+
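`check_otus` writes one CSV row per OTU: the taxon, its synonyms joined by semicolons, and a status flag. Assuming `equivs` maps taxon to `([synonyms], status)` as the write loop implies (the status labels below are invented placeholders, not values the toolkit guarantees), the row format can be reproduced and parsed like this:

```python
# Reproduce the check_otus output row format:  taxon,syn1;syn2,status
# The equivs structure and status labels here are assumptions inferred
# from the write loop above.
equivs = {
    'Gallus gallus': (['Gallus gallus', 'Gallus domesticus'], 'amber'),
    'Pavo cristatus': (['Pavo cristatus'], 'green'),
}

rows = []
for taxon in sorted(equivs.keys()):
    rows.append(taxon + "," + ";".join(equivs[taxon][0]) + "," + equivs[taxon][1])

# A row splits back into three comma-delimited fields, since taxon names
# contain no commas; synonyms split on the semicolon.
taxon, syns, status = rows[0].split(",")
synonyms = syns.split(";")
```

Sorting the keys before writing keeps the output stable between runs, which makes diffing successive checks straightforward.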
2689+
2690+
2691+def create_taxonomy(args):
2692+ """create a taxonomic hierarchy for each OTU in the Phyml"""
2693+
2694+ verbose = args.verbose
2695+ input_file = args.input
2696+ output_file = args.output
2697+ existing_taxonomy = args.taxonomy
2698+ ignoreWarnings = args.ignoreWarnings
2699+
2700+ XML = supertree_toolkit.load_phyml(input_file)
2701+ if (not existing_taxonomy == None):
2702+ existing_taxonomy = supertree_toolkit.load_taxonomy(existing_taxonomy) # load it in and create the dictionary
2703+ pass
2704+
2705+ try:
2706+ taxonomy = supertree_toolkit.create_taxonomy(XML,existing_taxonomy=existing_taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2707+ except NotUniqueError as detail:
2708+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2709+ print msg
2710+ return
2711+ except InvalidSTKData as detail:
2712+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2713+ print msg
2714+ return
2715+ except UninformativeTreeError as detail:
2716+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2717+ print msg
2718+ return
2719+ except TreeParseError as detail:
2720+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2721+ print msg
2722+ return
2723+ except:
2724+        # what about no internet connection? What error does that throw?
2725+ msg = "***Error: failed to create taxonomy due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2726+ print msg
2727+ traceback.print_exc()
2728+ return
2729+
2730+ # Now create the CSV output
2731+ with open(output_file, 'w') as f:
2732+ writer = csv.writer(f)
2733+ headers = []
2734+ headers.append("OTU")
2735+ headers.extend(supertree_toolkit.taxonomy_levels)
2736+ headers.append("Data source")
2737+ writer.writerow(headers)
2738+ for t in taxonomy:
2739+ otu = t
2740+ try:
2741+ species = taxonomy[t]['species']
2742+ except KeyError:
2743+ species = "-"
2744+ try:
2745+ genus = taxonomy[t]['genus']
2746+ except KeyError:
2747+ genus = "-"
2748+ try:
2749+ family = taxonomy[t]['family']
2750+ except KeyError:
2751+ family = "-"
2752+ try:
2753+ superfamily = taxonomy[t]['superfamily']
2754+ except KeyError:
2755+ superfamily = "-"
2756+ try:
2757+ infraorder = taxonomy[t]['infraorder']
2758+ except KeyError:
2759+ infraorder = "-"
2760+ try:
2761+ suborder = taxonomy[t]['suborder']
2762+ except KeyError:
2763+ suborder = "-"
2764+ try:
2765+ order = taxonomy[t]['order']
2766+ except KeyError:
2767+ order = "-"
2768+ try:
2769+ superorder = taxonomy[t]['superorder']
2770+ except KeyError:
2771+ superorder = "-"
2772+ try:
2773+ subclass = taxonomy[t]['subclass']
2774+ except KeyError:
2775+ subclass = "-"
2776+ try:
2777+ tclass = taxonomy[t]['class']
2778+ except KeyError:
2779+ tclass = "-"
2780+ try:
2781+ subphylum = taxonomy[t]['subphylum']
2782+ except KeyError:
2783+ subphylum = "-"
2784+ try:
2785+ phylum = taxonomy[t]['phylum']
2786+ except KeyError:
2787+ phylum = "-"
2788+ try:
2789+ superphylum = taxonomy[t]['superphylum']
2790+ except KeyError:
2791+ superphylum = "-"
2792+ try:
2793+ infrakingdom = taxonomy[t]['infrakingdom']
2794+ except:
2795+ infrakingdom = "-"
2796+ try:
2797+ subkingdom = taxonomy[t]['subkingdom']
2798+ except:
2799+ subkingdom = "-"
2800+ try:
2801+ kingdom = taxonomy[t]['kingdom']
2802+ except KeyError:
2803+ kingdom = "-"
2804+ try:
2805+ provider = taxonomy[t]['provider']
2806+ except KeyError:
2807+ provider = "-"
2808+
2809+ if (isinstance(species, list)):
2810+ species = " ".join(species)
2811+ this_classification = [
2812+ otu.encode('utf-8'),
2813+ species.encode('utf-8'),
2814+ genus.encode('utf-8'),
2815+ family.encode('utf-8'),
2816+ superfamily.encode('utf-8'),
2817+ infraorder.encode('utf-8'),
2818+ suborder.encode('utf-8'),
2819+ order.encode('utf-8'),
2820+ superorder.encode('utf-8'),
2821+ subclass.encode('utf-8'),
2822+ tclass.encode('utf-8'),
2823+ subphylum.encode('utf-8'),
2824+ phylum.encode('utf-8'),
2825+ superphylum.encode('utf-8'),
2826+ infrakingdom.encode('utf-8'),
2827+ subkingdom.encode('utf-8'),
2828+ kingdom.encode('utf-8'),
2829+ provider.encode('utf-8')]
2830+ writer.writerow(this_classification)
2831+
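The long run of per-rank try/except blocks in `create_taxonomy` above can be collapsed into a loop using `dict.get`. A minimal, behaviour-equivalent sketch (the short `taxonomy_levels` list and the helper name `classification_row` are stand-ins for illustration, not part of the branch):

```python
# Stand-in for supertree_toolkit.taxonomy_levels (shortened for the sketch).
taxonomy_levels = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def classification_row(otu, entry):
    """Build one CSV row: [OTU, rank values..., provider], '-' for missing ranks."""
    row = [otu]
    for level in taxonomy_levels:
        value = entry.get(level, "-")      # dict.get replaces the try/except ladder
        if isinstance(value, list):        # 'species' can come back as a list
            value = " ".join(value)
        row.append(value)
    row.append(entry.get('provider', "-"))
    return row
```

Each row can then be handed straight to `csv.writer.writerow`, and adding a new rank only means extending `taxonomy_levels`.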
2832+def auto_subs(args):
2833+ """Get all OTUs to the same taxonomic level"""
2834+
2835+
2836+ verbose = args.verbose
2837+ input_file = args.input
2838+ output = args.output
2839+ taxonomy = args.taxonomy
2840+    ignoreWarnings = args.ignoreWarnings
2840+    overwrite = args.overwrite
2841+
2842+ if (os.path.exists(output) and not overwrite):
2843+ print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2844+        print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2845+ while True:
2846+ k=inkey()
2847+ if k.lower() == 'n':
2848+ print "Exiting..."
2849+ sys.exit(0)
2850+ if k.lower() == 'y':
2851+ break
2852+
2853+ XML = supertree_toolkit.load_phyml(input_file)
2854+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy) # load it in and create the dictionary
2855+
2856+ try:
2857+ newXML = supertree_toolkit.generate_species_level_data(XML,taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2858+ except NotUniqueError as detail:
2859+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2860+ print msg
2861+ return
2862+ except InvalidSTKData as detail:
2863+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2864+ print msg
2865+ return
2866+ except UninformativeTreeError as detail:
2867+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2868+ print msg
2869+ return
2870+ except TreeParseError as detail:
2871+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2872+ print msg
2873+ return
2874+ except NoneCompleteTaxonomy as detail:
2875+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2876+ print msg
2877+ return
2878+ except:
2879+        # what about no internet connection? What error does that throw?
2880+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2881+ print msg
2882+ traceback.print_exc()
2883+ return
2884+
2885+ f = open(output,"w")
2886+ f.write(newXML)
2887+ f.close()
2888+
2889+def process(args):
2890+
2891+ verbose = args.verbose
2892+ input_file = args.input
2893+ output = args.output
2894+ no_store = args.no_store
2895+ ignoreWarnings = args.ignoreWarnings
2896+ taxonomy_file = args.taxonomy_file
2897+ equivalents_file = args.equivalents_file
2898+ overwrite = args.overwrite
2899+
2900+ if (os.path.exists(output) and not overwrite):
2901+ print "Output matrix file exists. Either remove the file or use the --overwrite flag."
2902+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2903+ while True:
2904+ k=inkey()
2905+ if k.lower() == 'n':
2906+ print "Exiting..."
2907+ sys.exit(0)
2908+ if k.lower() == 'y':
2909+ break
2910+
2911+ filename = os.path.basename(input_file)
2912+ dirname = os.path.dirname(input_file)
2913+
2914+ if verbose:
2915+ print "Loading and checking your data"
2916+ # 0) load and check data
2917+ try:
2918+ phyml = supertree_toolkit.load_phyml(input_file)
2919+ project_name = supertree_toolkit.get_project_name(phyml)
2920+ supertree_toolkit._check_data(phyml)
2921+ except NotUniqueError as detail:
2922+ msg = "***Error: Failed to load data.\n"+detail.msg
2923+ print msg
2924+ return
2925+ except InvalidSTKData as detail:
2926+ msg = "***Error: Failed to load data.\n"+detail.msg
2927+ print msg
2928+ return
2929+ except UninformativeTreeError as detail:
2930+ msg = "***Error: Failed to load data.\n"+detail.msg
2931+ print msg
2932+ return
2933+ except TreeParseError as detail:
2934+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2935+ print msg
2936+ return
2937+ except:
2938+ msg = "***Error: Failed to load input due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2939+ print msg
2940+ traceback.print_exc()
2941+ return
2942+
2943+ if verbose:
2944+        print "Checking taxa against online databases"
2945+ # 1) taxonomy checker with autoreplace
2946+ # Load existing data if any:
2947+ if (not equivalents_file == None):
2948+ equivalents = supertree_toolkit.load_equivalents(equivalents_file)
2949+ else:
2950+ equivalents = None
2951+ equivalents = supertree_toolkit.taxonomic_checker(phyml,existing_data=equivalents,verbose=verbose)
2952+ # save the equivalents for later (as CSV and as sub file)
2953+ data_string_csv = _equivalents_to_csv(equivalents)
2954+ data_string_subs = _equivalents_to_subs(equivalents)
2955+ f = open(os.path.join(dirname,project_name+"_taxonomy_checker.csv"), "w")
2956+ f.write(data_string_csv)
2957+ f.close()
2958+ f = open(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"), "w")
2959+ f.write(data_string_subs)
2960+ f.close()
2961+
2962+ # now do the replacements - we use the subs file :)
2963+ if verbose:
2964+ print "Swapping in the corrected taxa names"
2965+ try:
2966+ old_taxa, new_taxa = supertree_toolkit.parse_subs_file(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"))
2967+ except UnableToParseSubsFile as e:
2968+ print e.msg
2969+ sys.exit(-1)
2970+ try:
2971+ phyml = supertree_toolkit.substitute_taxa(phyml,old_taxa,new_taxa,only_existing=False,verbose=verbose)
2972+ except NotUniqueError as detail:
2973+        msg = "***Error: Failed to substitute taxa.\n"+detail.msg
2974+ print msg
2975+ return
2976+ except InvalidSTKData as detail:
2977+        msg = "***Error: Failed to substitute taxa.\n"+detail.msg
2978+ print msg
2979+ return
2980+ except UninformativeTreeError as detail:
2981+        msg = "***Error: Failed to substitute taxa.\n"+detail.msg
2982+ print msg
2983+ return
2984+ except TreeParseError as detail:
2985+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2986+ print msg
2987+ return
2988+ except:
2989+        msg = "***Error: Failed to substitute taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2990+ print msg
2991+ traceback.print_exc()
2992+ return
2993+ # save phyml as intermediate step
2994+ f = open(os.path.join(dirname,project_name+"_taxonomy_checked.phyml"), "w")
2995+ f.write(phyml)
2996+ f.close()
2997+
2998+
2999+ if verbose:
3000+ print "Creating taxonomic information"
3001+ # 2) create taxonomy
3002+ if (not taxonomy_file == None):
3003+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy_file)
3004+ else:
3005+ taxonomy = None
3006+ taxonomy = supertree_toolkit.create_taxonomy(phyml,existing_taxonomy=taxonomy,verbose=verbose)
3007+ # save the taxonomy for later
3008+    # Now create the CSV output - separate out into a function in STK (used several times)
3009+ with open(os.path.join(dirname,project_name+"_taxonomy.csv"), 'w') as f:
3010+ writer = csv.writer(f)
3011+ headers = []
3012+ headers.append("OTU")
3013+ headers.extend(supertree_toolkit.taxonomy_levels)
3014+ headers.append("Data source")
3015+ writer.writerow(headers)
3016+ for t in taxonomy:
3017+ otu = t
3018+ try:
3019+ species = taxonomy[t]['species']
3020+ except KeyError:
3021+ species = "-"
3022+ try:
3023+ subgenus = taxonomy[t]['subgenus']
3024+ except KeyError:
3025+ subgenus = "-"
3026+ try:
3027+ genus = taxonomy[t]['genus']
3028+ except KeyError:
3029+ genus = "-"
3030+ try:
3031+ subfamily = taxonomy[t]['subfamily']
3032+ except KeyError:
3033+ subfamily = "-"
3034+ try:
3035+ family = taxonomy[t]['family']
3036+ except KeyError:
3037+ family = "-"
3038+ try:
3039+ superfamily = taxonomy[t]['superfamily']
3040+ except KeyError:
3041+ superfamily = "-"
3042+ try:
3043+ subsection = taxonomy[t]['subsection']
3044+ except KeyError:
3045+ subsection = "-"
3046+ try:
3047+ section = taxonomy[t]['section']
3048+ except KeyError:
3049+ section = "-"
3050+ try:
3051+ infraorder = taxonomy[t]['infraorder']
3052+ except KeyError:
3053+ infraorder = "-"
3054+ try:
3055+ suborder = taxonomy[t]['suborder']
3056+ except KeyError:
3057+ suborder = "-"
3058+ try:
3059+ order = taxonomy[t]['order']
3060+ except KeyError:
3061+ order = "-"
3062+ try:
3063+ superorder = taxonomy[t]['superorder']
3064+ except KeyError:
3065+ superorder = "-"
3066+ try:
3067+ subclass = taxonomy[t]['subclass']
3068+ except KeyError:
3069+ subclass = "-"
3070+ try:
3071+ tclass = taxonomy[t]['class']
3072+ except KeyError:
3073+ tclass = "-"
3074+ try:
3075+ superclass = taxonomy[t]['superclass']
3076+ except KeyError:
3077+ superclass = "-"
3078+ try:
3079+ subphylum = taxonomy[t]['subphylum']
3080+ except KeyError:
3081+ subphylum = "-"
3082+ try:
3083+ phylum = taxonomy[t]['phylum']
3084+ except KeyError:
3085+ phylum = "-"
3086+ try:
3087+ superphylum = taxonomy[t]['superphylum']
3088+ except KeyError:
3089+ superphylum = "-"
3090+ try:
3091+ infrakingdom = taxonomy[t]['infrakingdom']
3092+ except:
3093+ infrakingdom = "-"
3094+ try:
3095+ subkingdom = taxonomy[t]['subkingdom']
3096+ except:
3097+ subkingdom = "-"
3098+ try:
3099+ kingdom = taxonomy[t]['kingdom']
3100+ except KeyError:
3101+ kingdom = "-"
3102+ try:
3103+ provider = taxonomy[t]['provider']
3104+ except KeyError:
3105+ provider = "-"
3106+ this_classification = [
3107+ otu.encode('utf-8'),
3108+ species.encode('utf-8'),
3109+ subgenus.encode('utf-8'),
3110+ genus.encode('utf-8'),
3111+ subfamily.encode('utf-8'),
3112+ family.encode('utf-8'),
3113+ superfamily.encode('utf-8'),
3114+ subsection.encode('utf-8'),
3115+ section.encode('utf-8'),
3116+ infraorder.encode('utf-8'),
3117+ suborder.encode('utf-8'),
3118+ order.encode('utf-8'),
3119+ superorder.encode('utf-8'),
3120+ subclass.encode('utf-8'),
3121+ tclass.encode('utf-8'),
3122+ superclass.encode('utf-8'),
3123+ subphylum.encode('utf-8'),
3124+ phylum.encode('utf-8'),
3125+ superphylum.encode('utf-8'),
3126+ infrakingdom.encode('utf-8'),
3127+ subkingdom.encode('utf-8'),
3128+ kingdom.encode('utf-8'),
3129+ provider.encode('utf-8')]
3130+ writer.writerow(this_classification)
3131+
3132+ # 3) create species level dataset
3133+ if verbose:
3134+ print "Converting data to species level"
3135+ try:
3136+ phyml = supertree_toolkit.generate_species_level_data(phyml,taxonomy,verbose=verbose)
3137+ except NotUniqueError as detail:
3138+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3139+ print msg
3140+ return
3141+ except InvalidSTKData as detail:
3142+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3143+ print msg
3144+ return
3145+ except UninformativeTreeError as detail:
3146+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3147+ print msg
3148+ return
3149+ except TreeParseError as detail:
3150+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3151+ print msg
3152+ return
3153+ except NoneCompleteTaxonomy as detail:
3154+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3155+ print msg
3156+ return
3157+ except:
3158+        # what about no internet connection? What error does that throw?
3159+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
3160+ print msg
3161+ traceback.print_exc()
3162+ return
3163+ # save the phyml as intermediate step
3164+ f = open(os.path.join(dirname,project_name+"_species_level.phyml"), "w")
3165+ f.write(phyml)
3166+ f.close()
3167+
3168+ # 4) Remove non-monophyletic taxa (requires TNT to be installed)
3169+ if verbose:
3170+ print "Removing non-monophyletic taxa via mini-supertree method"
3171+ tree_list = supertree_toolkit._find_trees_for_permuting(phyml)
3172+ try:
3173+ for t in tree_list:
3174+ # permute
3175+ output_string = supertree_toolkit.permute_tree(tree_list[t],matrix='hennig',treefile=None,verbose=verbose)
3176+ #save
3177+ if (not output_string == ""):
3178+ file_name = os.path.basename(filename)
3179+ dirname = os.path.dirname(filename)
3180+ new_output = os.path.join(dirname,t,t+"_matrix.tnt")
3181+ try:
3182+ os.makedirs(os.path.join(dirname,t))
3183+ except OSError:
3184+ if not os.path.isdir(os.path.join(dirname,t)):
3185+ raise
3186+ f = open(new_output,'w',0)
3187+ f.write(output_string)
3188+                f.close()
3189+ time.sleep(1)
3190+
3191+ # now create the tnt command to deal with this
3192+ # create a tmp file for the output tree
3193+ temp_file_handle, temp_file = tempfile.mkstemp(suffix=".tnt")
3194+ tnt_command = "tnt mxram 512,run "+new_output+",echo= ,timeout 00:10:00,rseed0,rseed*,hold 1000,xmult= level 0,taxname=,nelsen *,tsave *"+temp_file+",save /,quit"
3195+ #tnt_command = "tnt run "+new_output+",ienum,taxname=,nelsen*,tsave *"+temp_file+",save /,quit"
3196+ # run tnt, grab the output and store back in the data
3197+ #try:
3198+ call(tnt_command, shell=True)
3199+ #except CalledProcessError as e:
3200+            #    msg = "***Error: Failed to run TNT. Is it installed correctly?\n"+e.msg
3201+ # print msg
3202+ # return
3203+ #ret = os.system(tnt_command)
3204+ #if (not ret == 0):
3205+ # print "error running tnt"
3206+ # return
3207+
3208+ new_tree = supertree_toolkit.import_tree(temp_file)
3209+ phyml = supertree_toolkit._swap_tree_in_XML(phyml,new_tree,t)
3210+
3211+ except TreeParseError as e:
3212+ msg = "***Error permuting trees.\n"+e.msg
3213+ print msg
3214+ return
3215+
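The commented-out `CalledProcessError` handling above suggests the intent to check whether the external TNT run actually succeeded. A minimal sketch of checking the exit status of a shell command (the `tnt` binary and its argument string are taken from the script; `run_external` is a hypothetical helper name):

```python
import subprocess

def run_external(command):
    """Run an external shell command; True if it exited with status 0."""
    try:
        return subprocess.call(command, shell=True) == 0
    except OSError:
        # e.g. no shell available at all
        return False
```

With this, the TNT step could bail out early (`if not run_external(tnt_command): return`) instead of silently importing a possibly missing tree file.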
3216+ #4.5) remove MRP_Outgroups
3217+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_Outgroup')
3218+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOutgroup')
3219+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_outgroup')
3220+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPoutgroup')
3221+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOUTGROUP')
3222+
3223+ # save intermediate phyml
3224+ f = open(os.path.join(dirname,project_name+"_nonmonophyl_removed.phyml"), "w")
3225+ f.write(phyml)
3226+ f.close()
3227+
3228+
3229+ # 5) Remove common names
3230+ # no function to do this yet...
3231+
3232+ # 6) Data independance
3233+ if verbose:
3234+ print "Checking data independence"
3235+ data_ind,subsets,phyml = supertree_toolkit.data_independence(phyml,make_new_xml=True)
3236+ # save phyml
3237+ f = open(os.path.join(dirname,project_name+"_data_ind.phyml"), "w")
3238+ f.write(phyml)
3239+ f.close()
3240+
3241+ # 7) Data overlap
3242+ if verbose:
3243+ print "Checking data overlap"
3244+ sufficient_overlap, key_list = supertree_toolkit.data_overlap(phyml,verbose=verbose)
3245+ # process the key_list to remove the unconnected trees
3246+ if not sufficient_overlap:
3247+        # if we don't have enough overlap, remove all but the largest group.
3248+ # the key contains a list, with the largest group first (thanks networkX!)
3249+ # we can therefore just remove trees from everything but the first in the list
3250+ delete_me = []
3251+ for t in key_list[1::]: # skip 0
3252+ delete_me.extend(t)
3253+ for tree in delete_me:
3254+ phyml = supertree_toolkit._swap_tree_in_XML(phyml, None, tree, delete=True) # delete the tree and clean the data as we go
3255+ # save phyml
3256+ f = open(os.path.join(dirname,project_name+"_data_tax_overlap.phyml"), "w")
3257+ f.write(phyml)
3258+ f.close()
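The pruning logic in step 7 relies on `key_list` holding groups of tree names with the largest group first (as returned via networkX): every tree outside that first group gets deleted. A standalone sketch of deriving the deletion list (`trees_to_delete` is an illustrative name, not a function in the branch):

```python
def trees_to_delete(key_list):
    """Given groups sorted largest-first, return the trees in every group
    except the first (largest) one."""
    delete_me = []
    for group in key_list[1:]:   # skip group 0, the largest
        delete_me.extend(group)
    return delete_me
```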
3259+
3260+
3261+ # 8) Create matrix
3262+ if verbose:
3263+ print "Creating matrix"
3264+ try:
3265+ matrix = supertree_toolkit.create_matrix(phyml)
3266+ except NotUniqueError as detail:
3267+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3268+ print msg
3269+ return
3270+ except InvalidSTKData as detail:
3271+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3272+ print msg
3273+ return
3274+ except UninformativeTreeError as detail:
3275+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3276+ print msg
3277+ return
3278+ except TreeParseError as detail:
3279+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3280+ print msg
3281+ return
3282+ except:
3283+ msg = "***Error: Failed to create matrix due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
3284+ print msg
3285+ traceback.print_exc()
3286+ return
3287+
3288+ f = open(output, "w")
3289+ f.write(matrix)
3290+ f.close()
3291+
3292+ return
3293+
3294+
3295+def _equivalents_to_csv(equivalents):
3296+
3297+ output_string = 'Taxa,Equivalents,Status\n'
3298+
3299+ for taxon in sorted(equivalents):
3300+ output_string += taxon + "," + ';'.join(equivalents[taxon][0]) + "," + equivalents[taxon][1] + "\n"
3301+
3302+ return output_string
3303+
3304+
3305+def _equivalents_to_subs(equivalents):
3306+ """Only corrects the yellow ones. Red and green are left alone"""
3307+
3308+ output_string = ""
3309+ for taxon in sorted(equivalents):
3310+ if (equivalents[taxon][1] == 'yellow'):
3311+ # the first name is always the correct one
3312+ output_string += taxon + " = "+equivalents[taxon][0][0]+"\n"
3313+ return output_string
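The `equivalents` dictionary used throughout maps each original taxon to `[synonym_list, status]`, and only 'yellow' taxa yield a substitution line, with the first synonym taken as the corrected name. A self-contained restatement of `_equivalents_to_subs` showing that shape:

```python
def equivalents_to_subs(equivalents):
    """Build a subs-file string; only 'yellow' taxa are corrected,
    'red' and 'green' are left alone."""
    lines = []
    for taxon in sorted(equivalents):
        synonyms, status = equivalents[taxon]
        if status == 'yellow':
            # the first name in the synonym list is always the correct one
            lines.append(taxon + " = " + synonyms[0])
    return "\n".join(lines)
```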
3314
3315 if __name__ == "__main__":
3316 main()
3317
3318=== modified file 'stk/stk_exceptions.py'
3319--- stk/stk_exceptions.py 2013-10-22 08:26:54 +0000
3320+++ stk/stk_exceptions.py 2017-01-12 09:27:31 +0000
3321@@ -134,4 +134,12 @@
3322 def __init__(self, msg):
3323 self.msg = msg
3324
3325+class NoneCompleteTaxonomy(Error):
3326+ """Exception raised when a taxonomy is not complete for these data
3327+ Attributes:
3328+        msg -- explanation of error
3329+ """
3330+
3331+ def __init__(self, msg):
3332+ self.msg = msg
3333
3334
3335=== modified file 'stk/supertree_toolkit.py'
3336--- stk/supertree_toolkit.py 2017-01-11 15:16:21 +0000
3337+++ stk/supertree_toolkit.py 2017-01-12 09:27:31 +0000
3338@@ -44,15 +44,49 @@
3339 import unicodedata
3340 from stk_internals import *
3341 from copy import deepcopy
3342+import Queue
3343+import threading
3344+import urllib2
3345+from urllib import quote_plus
3346+import simplejson as json
3347+import time
3348 import types
3349
3350 #plt.ion()
3351
3352+sys.setrecursionlimit(50000)
3353 # GLOBAL VARIABLES
3354 IDENTICAL = 0
3355 SUBSET = 1
3356 PLATFORM = sys.platform
3357-taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3358+#Logging
3359+import logging
3360+logging.basicConfig(filename='supertreetoolkit.log', level=logging.DEBUG, format='%(asctime)s %(levelname)s:%(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
3361+
3362+# taxonomy levels
3363+# What we get from EOL
3364+current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
3365+# And the extra ones from ITIS
3366+extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
3367+# all of them in order
3368+taxonomy_levels = ['species','subgenus','genus','tribe','subfamily','family','superfamily','subsection','section','parvorder','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3369+
3370+SPECIES = taxonomy_levels[0]
3371+GENUS = taxonomy_levels[2]
3372+FAMILY = taxonomy_levels[5]
3373+SUPERFAMILY = taxonomy_levels[6]
3374+INFRAORDER = taxonomy_levels[10]
3375+SUBORDER = taxonomy_levels[11]
3376+ORDER = taxonomy_levels[12]
3377+SUPERORDER = taxonomy_levels[13]
3378+SUBCLASS = taxonomy_levels[14]
3379+CLASS = taxonomy_levels[15]
3380+SUBPHYLUM = taxonomy_levels[17]
3381+PHYLUM = taxonomy_levels[18]
3382+SUPERPHYLUM = taxonomy_levels[19]
3383+INFRAKINGDOM = taxonomy_levels[20]
3384+SUBKINGDOM = taxonomy_levels[21]
3385+KINGDOM = taxonomy_levels[22]
3386
3387 # supertree_toolkit is the backend for the STK. Loaded by both the GUI and
3388 # CLI, this contains all the functions to actually *do* something
3389@@ -60,6 +94,17 @@
3390 # All functions take XML and a list of other arguments, process the data and return
3391 # it back to the user interface handler to save it somewhere
3392
3393+
3394+def get_project_name(XML):
3395+ """
3396+ Get the name of the dataset currently being worked on
3397+ """
3398+
3399+ xml_root = _parse_xml(XML)
3400+
3401+ return xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3402+
3403+
3404 def create_name(authors, year, append=''):
3405 """
3406 Construct a sensible from a list of authors and a year for a
3407@@ -161,6 +206,22 @@
3408
3409 return names
3410
3411+def get_all_tree_names(XML):
3412+ """ From a full XML-PHYML string, extract all tree names.
3413+ """
3414+
3415+ xml_root = _parse_xml(XML)
3416+ find = etree.XPath("//source")
3417+ sources = find(xml_root)
3418+ names = []
3419+ for s in sources:
3420+ for st in s.xpath("source_tree"):
3421+ if 'name' in st.attrib and not st.attrib['name'] == "":
3422+ names.append(st.attrib['name'])
3423+
3424+ return names
3425+
3426+
3427 def set_unique_names(XML):
3428 """ Ensures all sources have unique names.
3429 """
3430@@ -249,9 +310,17 @@
3431 if (ele.tag == "source"):
3432 sources.append(ele)
3433
3434+ if overwrite:
3435+ # remove all the names first
3436+ for s in sources:
3437+ for st in s.xpath("source_tree"):
3438+ if 'name' in st.attrib:
3439+ del st.attrib['name']
3440+
3441+
3442 for s in sources:
3443 for st in s.xpath("source_tree"):
3444- if overwrite or not 'name' in st.attrib:
3445+            if not 'name' in st.attrib:
3446 tree_name = create_tree_name(XML,st)
3447 st.attrib['name'] = tree_name
3448
3449@@ -339,7 +408,7 @@
3450 taxa = etree.SubElement(s_tree,"taxa_data")
3451 taxa.tail="\n "
3452 # Note: we do not add all elements as otherwise they get set to some option
3453- # rather than remaining blank (and hence blue int he interface)
3454+ # rather than remaining blank (and hence blue in the interface)
3455
3456 # append our new source to the main tree
3457 # if sources has no valid source, overwrite,
3458@@ -877,7 +946,7 @@
3459 # Need to add checks on the file. Problems include:
3460 # TNT: outputs Phyllip format or something - basically a Newick
3461 # string without commas, so add 'em back in
3462- m = re.search(r'proc-;', content)
3463+ m = re.search(r'proc.;', content)
3464 if (m != None):
3465 # TNT output tree
3466 # Done on a Mac? Replace ^M with a newline
3467@@ -1402,6 +1471,36 @@
3468
3469 return _amalgamate_trees(trees,format,anonymous)
3470
3471+def get_taxa_from_tree_for_taxonomy(tree, pretty=False, ignoreErrors=False):
3472+ """Returns a list of all taxa available for the tree passed as argument.
3473+ :param tree: string with the data for the tree in Newick format.
3474+ :type tree: string
3475+ :param pretty: defines if '_' in taxa names should be replaced with spaces.
3476+ :type pretty: boolean
3477+ :param ignoreErrors: should execution continue on error?
3478+ :type ignoreErrors: boolean
3479+ :returns: list of strings with the taxa names, sorted alphabetically
3480+ :rtype: list
3481+ """
3482+ taxa_list = []
3483+
3484+ try:
3485+ taxa_list.extend(_getTaxaFromNewick(tree))
3486+ except TreeParseError as detail:
3487+ if (ignoreErrors):
3488+ logging.warning(detail.msg)
3489+ pass
3490+ else:
3491+ raise TreeParseError( detail.msg )
3492+
3493+ # now uniquify the list of taxa
3494+ taxa_list = _uniquify(taxa_list)
3495+ taxa_list.sort()
3496+
3497+ if (pretty):
3498+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3499+
3500+ return taxa_list
3501
3502 def get_all_taxa(XML, pretty=False, ignoreErrors=False):
3503 """ Produce a taxa list by scanning all trees within
3504@@ -1422,21 +1521,17 @@
3505 taxa_list.extend(_getTaxaFromNewick(t))
3506 except TreeParseError as detail:
3507 if (ignoreErrors):
3508+ logging.warning(detail.msg)
3509 pass
3510 else:
3511 raise TreeParseError( detail.msg )
3512
3513-
3514-
3515 # now uniquify the list of taxa
3516 taxa_list = _uniquify(taxa_list)
3517 taxa_list.sort()
3518
3519- if (pretty):
3520- unpretty_tl = taxa_list
3521- taxa_list = []
3522- for t in unpretty_tl:
3523- taxa_list.append(t.replace('_',' '))
3524+ if (pretty): #Remove underscores from names
3525+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3526
3527 return taxa_list
3528
3529@@ -1508,7 +1603,7 @@
3530 return outgroups
3531
3532
3533-def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False):
3534+def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False, verbose=False):
3535 """ From all trees in the XML, create a matrix
3536 """
3537
3538@@ -1553,7 +1648,7 @@
3539 taxa.sort()
3540 taxa.insert(0,"MRP_Outgroup")
3541
3542- return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights)
3543+ return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights,verbose=verbose)
3544
3545
3546 def create_matrix_from_trees(trees,format="hennig"):
3547@@ -1925,7 +2020,7 @@
3548 _check_data(XML)
3549
3550 xml_root = _parse_xml(XML)
3551- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3552+ proj_name = get_project_name(XML)
3553
3554 output_string = "======================\n"
3555 output_string += " Data summary of: " + proj_name + "\n"
3556@@ -1989,6 +2084,188 @@
3557
3558 return output_string
3559
3560+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
3561+ """ For each name in the database generate a database of the original name,
3562+ possible synonyms and if the taxon is not know, signal that. We do this by
3563+    possible synonyms and if the taxon is not known, signal that. We do this by
3564+
3565+ import urllib2
3566+ from urllib import quote_plus
3567+ import simplejson as json
3568+
3569+ if existing_data == None:
3570+ equivalents = {}
3571+ else:
3572+ equivalents = existing_data
3573+
3574+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
3575+ # if not, is there another API function to do this?
3576+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
3577+ # you get the original ('correct') name - shorten this to two words and you're done.
3578+ for t in name_list:
3579+ if t in equivalents:
3580+ continue
3581+ taxon = t.replace("_"," ")
3582+ if (verbose):
3583+ print "Looking up ", taxon
3584+ # get the data from EOL on taxon
3585+ taxonq = quote_plus(taxon)
3586+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3587+ req = urllib2.Request(URL)
3588+ opener = urllib2.build_opener()
3589+ f = opener.open(req)
3590+ data = json.load(f)
3591+ # check if there's some data
3592+ if len(data['results']) == 0:
3593+ equivalents[t] = [[t],'red']
3594+ continue
3595+ amber = False
3596+ if len(data['results']) > 1:
3597+            # this is not great - we have multiple hits for this taxon - the user needs to go back and check this
3598+ # for automatic processing we'll just take the first one though
3599+ # colour is amber in this case
3600+ amber = True
3601+ ID = str(data['results'][0]['id']) # take first hit
3602+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=0&videos=0&sounds=0&maps=0&text=0&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
3603+ req = urllib2.Request(URL)
3604+ opener = urllib2.build_opener()
3605+
3606+ try:
3607+ f = opener.open(req)
3608+ except urllib2.HTTPError:
3609+ equivalents[t] = [[t],'red']
3610+ continue
3611+ data = json.load(f)
3612+ if len(data['scientificName']) == 0:
3613+ # not found a scientific name, so set as red
3614+ equivalents[t] = [[t],'red']
3615+ continue
3616+ correct_name = data['scientificName'].encode("ascii","ignore")
3617+ # we only want the first two bits of the name, not the original author and year if any
3618+ temp_name = correct_name.split(' ')
3619+ if (len(temp_name) > 2):
3620+ correct_name = ' '.join(temp_name[0:2])
3621+ correct_name = correct_name.replace(' ','_')
3622+
3623+ # build up the output dictionary - original name is key, synonyms/missing is value
3624+ if (correct_name == t):
3625+ # if the original matches the 'correct', then it's green
3626+ equivalents[t] = [[t], 'green']
3627+ else:
3628+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
3629+ # 'correct' taxon at the top
3630+ eol_synonyms = data['synonyms']
3631+ synonyms = []
3632+ for s in eol_synonyms:
3633+ ts = s['synonym'].encode("ascii","ignore")
3634+ temp_syn = ts.split(' ')
3635+ if (len(temp_syn) > 2):
3636+ temp_syn = ' '.join(temp_syn[0:2])
3637+ ts = temp_syn
3638+ if (s['relationship'] == "synonym"):
3639+ ts = ts.replace(" ","_")
3640+ synonyms.append(ts)
3641+ synonyms = _uniquify(synonyms)
3642+ # we need to put the correct name at the top of the list now
3643+ if (correct_name in synonyms):
3644+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
3645+ elif len(synonyms) == 0:
3646+ synonyms.append(correct_name)
3647+ else:
3648+ synonyms.insert(0,correct_name)
3649+
3650+ if (amber):
3651+ equivalents[t] = [synonyms,'amber']
3652+ else:
3653+ equivalents[t] = [synonyms,'yellow']
3654+ # if our search was empty, then it's red - see above
3655+
3656+    # up to the calling function to do something sensible with this
3657+    # we build a dictionary of names, then a list of synonyms or the original name, then a tag: green, yellow or red.
3658+    # Amber means we found synonyms and multiple hits. The user needs to sort these out!
3659+
3660+ return equivalents
3661+
3662+def taxonomic_checker_tree(tree_file,existing_data=None,verbose=False):
3663+    """ For each name in the tree, generate a database of the original name and
3664+    possible synonyms; if the taxon is not known, signal that. We do this by
3665+    using the EoL API to grab synonyms of each taxon. """
3666+
3667+ tree = import_tree(tree_file)
3668+ p4tree = _parse_tree(tree)
3669+ taxa = p4tree.getAllLeafNames(p4tree.root)
3670+ if existing_data == None:
3671+ equivalents = {}
3672+ else:
3673+ equivalents = existing_data
3674+
3675+ equivalents = taxonomic_checker_list(taxa,existing_data,verbose)
3676+ return equivalents
3677+
3678+def taxonomic_checker(XML,existing_data=None,verbose=False):
3679+    """ For each name in the database, generate a database of the original name and
3680+    possible synonyms; if the taxon is not known, signal that. We do this by
3681+    using the EoL API to grab synonyms of each taxon. """
3682+
3683+ # grab all taxa
3684+ taxa = get_all_taxa(XML)
3685+
3686+ if existing_data == None:
3687+ equivalents = {}
3688+ else:
3689+ equivalents = existing_data
3690+
3691+ equivalents = taxonomic_checker_list(taxa,existing_data,verbose)
3692+ return equivalents
3693+
3694+
3695+def load_equivalents(equiv_csv):
3696+    """Load equivalents data from a csv and convert to an equivalents dict.
3697+    The key is the taxon name; the value is a list containing an array of synonyms,
3698+    followed by a status ('green', 'yellow' or 'red').
3699+
3700+ """
3701+
3702+ import csv
3703+
3704+ equivalents = {}
3705+
3706+ with open(equiv_csv, 'rU') as csvfile:
3707+ equiv_reader = csv.reader(csvfile, delimiter=',')
3708+ equiv_reader.next() # skip header
3709+ for row in equiv_reader:
3710+ i = 1
3711+ equivalents[row[0]] = [row[1].split(';'),row[2]]
3712+
3713+ return equivalents
3714+
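For illustration, the CSV round-trip that `load_equivalents` performs can be sketched stand-alone (a Python 3 sketch; the column headers and example taxa here are assumptions, not taken from the toolkit's test data):

```python
import csv
import io

def load_equivalents_sketch(csv_text):
    """Parse an equivalents CSV: name, semicolon-separated synonyms, status."""
    equivalents = {}
    reader = csv.reader(io.StringIO(csv_text), delimiter=',')
    next(reader)  # skip the header row
    for row in reader:
        equivalents[row[0]] = [row[1].split(';'), row[2]]
    return equivalents

example = ("Taxon,Synonyms,Status\n"
           "Gallus_gallus,Gallus_gallus,green\n"
           "Ardea_sp,Ardea_goliath;Ardea_humbloti,yellow\n")
print(load_equivalents_sketch(example))
```

A 'green' row keeps its own name as the sole synonym; a 'yellow' row carries the candidate list with the preferred name first, mirroring the structure `taxonomic_checker_list` builds.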
3715+def save_taxonomy(taxonomy, output_file):
3716+
3717+ import csv
3718+
3719+ with open(output_file, 'w') as f:
3720+ writer = csv.writer(f)
3721+ row = ['OTU']
3722+ row.extend(taxonomy_levels)
3723+ row.append('Provider')
3724+ writer.writerow(row)
3725+ for t in taxonomy:
3726+ species = t
3727+ row = []
3728+ row.append(t.encode('utf-8'))
3729+ for l in taxonomy_levels:
3730+ try:
3731+ g = taxonomy[t][l]
3732+ except KeyError:
3733+ g = '-'
3734+ row.append(g.encode('utf-8'))
3735+ try:
3736+ provider = taxonomy[t]['provider']
3737+ except KeyError:
3738+ provider = "-"
3739+ row.append(provider)
3740+
3741+ writer.writerow(row)
3742
3743
3744 def load_taxonomy(taxonomy_csv):
3745@@ -2000,20 +2277,443 @@
3746
3747 with open(taxonomy_csv, 'rU') as csvfile:
3748 tax_reader = csv.reader(csvfile, delimiter=',')
3749- tax_reader.next()
3750- for row in tax_reader:
3751- current_taxonomy = {}
3752- i = 1
3753- for t in taxonomy_levels:
3754- if not row[i] == '-':
3755- current_taxonomy[t] = row[i]
3756- i = i+ 1
3757-
3758- current_taxonomy['provider'] = row[17] # data source
3759- taxonomy[row[0]] = current_taxonomy
3760-
3761- return taxonomy
3762-
3763+ try:
3764+ j = 0
3765+ for row in tax_reader:
3766+ if j == 0:
3767+ tax_levels = row[1:-1]
3768+ j += 1
3769+ continue
3770+ i = 1
3771+ current_taxonomy = {}
3772+ for t in tax_levels:
3773+ if not row[i] == '-':
3774+ current_taxonomy[t] = row[i]
3775+ i = i+ 1
3776+ current_taxonomy['provider'] = row[-1] # data source
3777+ taxonomy[row[0].replace(" ","_")] = current_taxonomy
3778+ j += 1
3779+        except IndexError:
3780+            pass # skip malformed (too-short) rows rather than aborting the load
3781+
3782+ return taxonomy
3783+
3784+
3785+class TaxonomyFetcher(threading.Thread):
3786+    """ Class providing the taxonomy fetching functionality as a thread, to be used individually or as part of a pool.
3787+ """
3788+
3789+ def __init__(self, taxonomy, lock, queue, id=0, pref_db=None, verbose=False, ignoreWarnings=False):
3790+ """ Constructor for the threaded model.
3791+        :param taxonomy: previous taxonomy (if available) or an empty dictionary to store the results.
3792+ :type taxonomy: dictionary
3793+ :param lock: lock to keep the taxonomy threadsafe.
3794+ :type lock: Lock
3795+ :param queue: queue where the taxa are kept to be processed.
3796+ :type queue: Queue of strings
3797+ :param id: id for the thread to use if messages need to be printed.
3798+ :type id: int
3799+        :param pref_db: Preferred taxonomy database to prioritise (currently unused).
3800+        :type pref_db: string
3801+ :param verbose: Show verbose messages during execution, will also define level of logging. True will set logging level to INFO.
3802+ :type verbose: boolean
3803+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors will be logged with ERROR level on the logging output.
3804+ :type ignoreWarnings: boolean
3805+ """
3806+
3807+ threading.Thread.__init__(self)
3808+ self.taxonomy = taxonomy
3809+ self.lock = lock
3810+ self.queue = queue
3811+ self.id = id
3812+ self.verbose = verbose
3813+ self.pref_db = pref_db
3814+ self.ignoreWarnings = ignoreWarnings
3815+
3816+ def run(self):
3817+ """ Gets and processes a taxon from the queue to get its taxonomy."""
3818+ while True :
3819+ if self.verbose :
3820+ logging.getLogger().setLevel(logging.INFO)
3821+ #get taxon from queue
3822+ taxon = self.queue.get()
3823+
3824+ logging.debug("Starting {} with thread #{} remaining ~{}".format(taxon,str(self.id),str(self.queue.qsize())))
3825+
3826+ #Lock access to the taxonomy
3827+ self.lock.acquire()
3828+ if not taxon in self.taxonomy: # is a new taxon, not previously in the taxonomy
3829+ #Release access to the taxonomy
3830+ self.lock.release()
3831+ if (self.verbose):
3832+ print "Looking up ", taxon
3833+                    logging.info("Looking up taxon: {}".format(str(taxon)))
3834+ try:
3835+ # get the data from EOL on taxon
3836+ taxonq = quote_plus(taxon)
3837+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3838+ req = urllib2.Request(URL)
3839+ opener = urllib2.build_opener()
3840+ f = opener.open(req)
3841+ data = json.load(f)
3842+ # check if there's some data
3843+ if len(data['results']) == 0:
3844+ # try PBDB as it might be a fossil
3845+ URL = "http://paleobiodb.org/data1.1/taxa/single.json?name="+taxonq+"&show=phylo&vocab=pbdb"
3846+ req = urllib2.Request(URL)
3847+ opener = urllib2.build_opener()
3848+ f = opener.open(req)
3849+ datapbdb = json.load(f)
3850+ if (len(datapbdb['records']) == 0):
3851+ # no idea!
3852+ with self.lock:
3853+ self.taxonomy[taxon] = {}
3854+ self.queue.task_done()
3855+ continue
3856+ # otherwise, let's fill in info here - only if extinct!
3857+ if datapbdb['records'][0]['is_extant'] == 0:
3858+ this_taxonomy = {}
3859+ this_taxonomy['provider'] = 'PBDB'
3860+ for level in taxonomy_levels:
3861+ try:
3862+ if datapbdb.has_key('records'):
3863+ pbdb_lev = datapbdb['records'][0][level]
3864+ temp_lev = pbdb_lev.split(" ")
3865+ # they might have the author on the end, so strip it off
3866+ if (level == 'species'):
3867+ this_taxonomy[level] = ' '.join(temp_lev[0:2])
3868+ else:
3869+ this_taxonomy[level] = temp_lev[0]
3870+ except KeyError as e:
3871+ logging.exception("Key not found records")
3872+ continue
3873+ # add the taxon at right level too
3874+ try:
3875+ if datapbdb.has_key('records'):
3876+ current_level = datapbdb['records'][0]['rank']
3877+ this_taxonomy[current_level] = datapbdb['records'][0]['taxon_name']
3878+ except KeyError as e:
3879+ self.queue.task_done()
3880+ logging.exception("Key not found records")
3881+ continue
3882+ with self.lock:
3883+ self.taxonomy[taxon] = this_taxonomy
3884+ self.queue.task_done()
3885+ continue
3886+ else:
3887+ # extant, but not in EoL - leave the user to sort this one out
3888+ with self.lock:
3889+ self.taxonomy[taxon] = {}
3890+ self.queue.task_done()
3891+ continue
3892+
3893+
3894+ ID = str(data['results'][0]['id']) # take first hit
3895+ # Now look for taxonomies
3896+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
3897+ req = urllib2.Request(URL)
3898+ opener = urllib2.build_opener()
3899+ f = opener.open(req)
3900+ data = json.load(f)
3901+ if len(data['taxonConcepts']) == 0:
3902+ with self.lock:
3903+ self.taxonomy[taxon] = {}
3904+ self.queue.task_done()
3905+ continue
3906+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
3907+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
3908+ # loop through and get preferred one if specified
3909+ # now get taxonomy
3910+ if (not self.pref_db is None):
3911+ for db in data['taxonConcepts']:
3912+ currentdb = db['nameAccordingTo'].lower()
3913+ if (self.pref_db.lower() in currentdb):
3914+ TID = str(db['identifier'])
3915+ break
3916+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
3917+ req = urllib2.Request(URL)
3918+ opener = urllib2.build_opener()
3919+ f = opener.open(req)
3920+ data = json.load(f)
3921+ this_taxonomy = {}
3922+ this_taxonomy['provider'] = currentdb
3923+ for a in data['ancestors']:
3924+ try:
3925+ if a.has_key('taxonRank') :
3926+ temp_level = a['taxonRank'].encode("ascii","ignore")
3927+ if (temp_level in taxonomy_levels):
3928+ # note the dump into ASCII
3929+ temp_name = a['scientificName'].encode("ascii","ignore")
3930+ temp_name = temp_name.split(" ")
3931+ if (temp_level == 'species'):
3932+                                    this_taxonomy[temp_level] = ' '.join(temp_name[0:2])
3933+
3934+ else:
3935+ this_taxonomy[temp_level] = temp_name[0]
3936+ except KeyError as e:
3937+ logging.exception("Key not found: taxonRank")
3938+ continue
3939+ try:
3940+ # add taxonomy in to the taxonomy!
3941+ # some issues here, so let's make sure it's OK
3942+ temp_name = taxon.split(" ")
3943+ if data.has_key('taxonRank') :
3944+ if not data['taxonRank'].lower() == 'species':
3945+ this_taxonomy[data['taxonRank'].lower()] = temp_name[0]
3946+ else:
3947+ this_taxonomy[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
3948+ except KeyError as e:
3949+ self.queue.task_done()
3950+ logging.exception("Key not found: taxonRank")
3951+ continue
3952+ with self.lock:
3953+ #Send result to dictionary
3954+ self.taxonomy[taxon] = this_taxonomy
3955+ except urllib2.HTTPError:
3956+ print("Network error when processing {} ".format(taxon,))
3957+ logging.info("Network error when processing {} ".format(taxon,))
3958+ self.queue.task_done()
3959+ continue
3960+ except urllib2.URLError:
3961+ print("Network error when processing {} ".format(taxon,))
3962+ logging.info("Network error when processing {} ".format(taxon,))
3963+ self.queue.task_done()
3964+ continue
3965+ else :
3966+ #Nothing to do release the lock on taxonomy
3967+ self.lock.release()
3968+ #Mark task as done
3969+ self.queue.task_done()
3970+
3971+def create_taxonomy_from_taxa(taxa, taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False, threadNumber=5):
3972+ """Uses the taxa provided to generate a taxonomy for all the taxon available.
3973+ :param taxa: list of the taxa.
3974+ :type taxa : list
3975+    :param taxonomy: previous taxonomy (if available) or an empty
3976+    dictionary to store the results; if None, it is initialised to an empty dictionary.
3977+    :type taxonomy: dictionary
3978+    :param pref_db: Preferred taxonomy database to prioritise (currently unused).
3979+    :type pref_db: string
3980+ :param verbose: Show verbose messages during execution, will also define
3981+ level of logging. True will set logging level to INFO.
3982+ :type verbose: boolean
3983+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors
3984+ will be logged with ERROR level on the logging output.
3985+ :type ignoreWarnings: boolean
3986+ :param threadNumber: Maximum number of threads to use for taxonomy processing.
3987+ :type threadNumber: int
3988+    :returns: None; the passed-in taxonomy dictionary is updated in place with the
3989+        resulting taxonomy for each taxon (keys).
3990+ """
3991+ if verbose :
3992+ logging.getLogger().setLevel(logging.INFO)
3993+ if taxonomy is None:
3994+ taxonomy = {}
3995+
3996+ lock = threading.Lock()
3997+ queue = Queue.Queue()
3998+
3999+ #Starting a few threads as daemons checking the queue
4000+ for i in range(threadNumber) :
4001+ t = TaxonomyFetcher(taxonomy, lock, queue, i, pref_db, verbose, ignoreWarnings)
4002+ t.setDaemon(True)
4003+ t.start()
4004+
4005+    #Populate the queue with the taxa.
4006+ for taxon in taxa :
4007+ queue.put(taxon)
4008+
4009+ #Wait till everyone finishes
4010+ queue.join()
4011+ logging.getLogger().setLevel(logging.WARNING)
4012+
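The producer/consumer pattern used by `TaxonomyFetcher` and `create_taxonomy_from_taxa` (daemon workers draining a queue, a lock guarding a shared dictionary, `queue.join()` to wait) can be sketched self-contained in Python 3; `fetch_all` and the trivial `fetch` callable are hypothetical stand-ins for the EoL lookup:

```python
import queue
import threading

def fetch_all(taxa, fetch, thread_count=5):
    """Run fetch(taxon) across a daemon thread pool; return results keyed by taxon."""
    results = {}
    lock = threading.Lock()
    work = queue.Queue()

    def worker():
        while True:
            taxon = work.get()
            value = fetch(taxon)
            with lock:  # keep the shared dict thread-safe
                results[taxon] = value
            work.task_done()

    for _ in range(thread_count):
        threading.Thread(target=worker, daemon=True).start()
    for taxon in taxa:
        work.put(taxon)
    work.join()  # block until every queued taxon has been processed
    return results

print(fetch_all(["Gallus_gallus", "Ardea_goliath"], lambda t: len(t)))
```

Because the workers are daemons, they die with the main thread once `queue.join()` returns, which is why the real code never explicitly stops them.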
4013+def create_taxonomy_from_tree(tree, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4014+ """ Generates the taxonomy from a tree. Uses a similar method to the XML version but works directly on a string with the tree.
4015+    :param tree: the tree as a newick string.
4016+    :type tree: string
4017+    :param existing_taxonomy: previous taxonomy (if available) to use as a starting point.
4018+    :type existing_taxonomy: dictionary
4019+    :param pref_db: Preferred taxonomy database to prioritise (currently unused).
4020+    :type pref_db: string
4021+ :param verbose: Flag for verbosity.
4022+ :type verbose: boolean
4023+ :param ignoreWarnings: Flag for exception processing.
4024+ :type ignoreWarnings: boolean
4025+ :returns: the modified taxonomy
4026+ :rtype: dictionary
4027+ """
4028+ starttime = time.time()
4029+
4030+ if(existing_taxonomy is None) :
4031+ taxonomy = {}
4032+ else :
4033+ taxonomy = existing_taxonomy
4034+
4035+ taxa = get_taxa_from_tree_for_taxonomy(tree, pretty=True)
4036+
4037+ create_taxonomy_from_taxa(taxa, taxonomy)
4038+
4039+ taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4040+
4041+ return taxonomy
4042+
4043+def create_taxonomy(XML, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4044+ """Generates a taxonomy of the data from EoL data. This is stored as a
4045+ dictionary of taxonomy for each taxon in the dataset. Missing data are
4046+ encoded as '' (blank string). It's up to the calling function to store this
4047+ data to file or display it."""
4048+
4049+ starttime = time.time()
4050+
4051+ if not ignoreWarnings:
4052+ _check_data(XML)
4053+
4054+ if (existing_taxonomy is None):
4055+ taxonomy = {}
4056+ else:
4057+ taxonomy = existing_taxonomy
4058+ taxa = get_all_taxa(XML, pretty=True)
4059+ create_taxonomy_from_taxa(taxa, taxonomy)
4060+ #taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4061+ return taxonomy
4062+
4063+def create_extended_taxonomy(taxonomy, starttime, verbose=False, ignoreWarnings=False):
4064+    """Bring in extra taxonomy terms from other databases; shared method for completing the taxonomy
4065+    both for trees coming from XML or directly from trees.
4066+ :param taxonomy: Dictionary with the relationship for taxa and taxonomy terms.
4067+ :type taxonomy: dictionary
4068+ :param starttime: time to keep track of processing time.
4069+ :type starttime: long
4070+ :param verbose: Flag for verbosity.
4071+ :type verbose: boolean
4072+ :param ignoreWarnings: Flag for exception processing.
4073+ :type ignoreWarnings: boolean
4074+ :returns: the modified taxonomy
4075+ :rtype: dictionary
4076+ """
4077+
4078+ if (verbose):
4079+ logging.info('Done basic taxonomy, getting more info from ITIS')
4080+ print("Time elapsed {}".format(str(time.time() - starttime)))
4081+ print "Done basic taxonomy, getting more info from ITIS"
4082+ # fill in the rest of the taxonomy
4083+ # get all genera
4084+ genera = []
4085+ for t in taxonomy:
4086+ if t in taxonomy:
4087+ if GENUS in taxonomy[t]:
4088+ genera.append(taxonomy[t][GENUS])
4089+ genera = _uniquify(genera)
4090+ # We then use ITIS to fill in missing info based on the genera only - that saves us a species level search
4091+ # and we can fill in most of the EoL missing data
4092+ for g in genera:
4093+ if (verbose):
4094+ print "Looking up ", g
4095+ logging.info("Looking up {}".format(str(g)))
4096+ try:
4097+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(g.strip())
4098+ except:
4099+ continue
4100+ req = urllib2.Request(URL)
4101+ opener = urllib2.build_opener()
4102+ try:
4103+ f = opener.open(req)
4104+ except urllib2.HTTPError:
4105+ continue
4106+ string = unicode(f.read(),"ISO-8859-1")
4107+ data = json.loads(string)
4108+ if data['scientificNames'][0] == None:
4109+ continue
4110+ tsn = data["scientificNames"][0]["tsn"]
4111+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
4112+ req = urllib2.Request(URL)
4113+ opener = urllib2.build_opener()
4114+ f = opener.open(req)
4115+ try:
4116+ string = unicode(f.read(),"ISO-8859-1")
4117+ except:
4118+ continue
4119+ data = json.loads(string)
4120+ this_taxonomy = {}
4121+ for level in data['hierarchyList']:
4122+ if not level['rankName'].lower() in current_taxonomy_levels:
4123+ # note the dump into ASCII
4124+ if level['rankName'].lower() == 'species':
4125+                this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = ' '.join(level['taxonName'].split(' ')[0:2]).encode("ascii","ignore")
4126+ else:
4127+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
4128+
4129+ for t in taxonomy:
4130+ if t in taxonomy:
4131+ if GENUS in taxonomy[t]:
4132+ if taxonomy[t][GENUS] == g:
4133+ taxonomy[t].update(this_taxonomy)
4134+
4135+ return taxonomy
4136+
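The author/year-stripping idiom that recurs throughout the lookups above (keep only the first two whitespace-separated words of a scientific name) can be factored as a small helper; `strip_author` is a hypothetical name for illustration:

```python
def strip_author(name):
    """Keep only the first two words (genus, species), dropping any author/year."""
    parts = name.split(' ')
    return ' '.join(parts[0:2])

print(strip_author("Ardea goliath Cretzschmar, 1829"))
```

Note that slicing the word list (rather than the string, as in the bug fixed above) is what makes this safe for names shorter than two words.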
4137+def generate_species_level_data(XML, taxonomy, ignoreWarnings=False, verbose=False):
4138+ """ Based on a taxonomy data set, amend the data to be at species level as
4139+ far as possible. This function creates an internal 'subs file' and calls
4140+ the standard substitution functions. The internal subs are generated by
4141+ looping over the taxa and if not at species-level, working out which level
4142+ they are at and then adding species already in the dataset to replace it
4143+ via a polytomy. This has to be done in one step to avoid adding spurious
4144+ structure to the phylogenies """
4145+
4146+ if not ignoreWarnings:
4147+ _check_data(XML)
4148+
4149+ # if taxonomic checker not done, warn
4150+ if (not taxonomy):
4151+        raise NoneCompleteTaxonomy("Taxonomy is empty. Create a taxonomy first. You'll probably need to hand-edit the file to complete it.")
4153+
4154+ # if missing data in taxonomy, warn
4155+ taxa = get_all_taxa(XML)
4156+ keys = taxonomy.keys()
4157+ if (not ignoreWarnings):
4158+ for t in taxa:
4159+ t = t.replace("_"," ")
4160+ if not t in keys:
4161+ # This idea here is that the caller will catch this, then re-run with ignoreWarnings set to True
4162+ raise NoneCompleteTaxonomy("Taxonomy is not complete. I will soldier on anyway, but this might not work as intended")
4163+
4164+ # get all taxa - see above!
4165+ # for each taxa, if not at species level
4166+ new_taxa = []
4167+ old_taxa = []
4168+ for t in taxa:
4169+ subs = []
4170+ t = t.replace("_"," ")
4171+ if (not SPECIES in taxonomy[t]): # the current taxon is not a species, but higher level taxon
4172+ # work out which level - should we encode this in the data to start with?
4173+ for tl in taxonomy_levels:
4174+ try:
4175+ tax_data = taxonomy[t][tl]
4176+ except KeyError:
4177+ continue
4178+ if (t == taxonomy[t][tl]):
4179+ current_level = tl
4180+ # find all species in the taxonomy that match this level
4181+ for taxon in taxa:
4182+ taxon = taxon.replace("_"," ")
4183+ if (SPECIES in taxonomy[taxon]):
4184+ try:
4185+ if taxonomy[taxon][current_level] == t: # our current taxon
4186+ subs.append(taxon.replace(" ","_"))
4187+ except KeyError:
4188+ continue
4189+
4190+ # create the sub
4191+ if len(subs) > 0:
4192+ old_taxa.append(t.replace(" ","_"))
4193+ new_taxa.append(','.join(subs))
4194+
4195+ # call the sub
4196+ new_XML = substitute_taxa(XML, old_taxa, new_taxa, verbose=verbose)
4197+ new_XML = clean_data(new_XML)
4198+
4199+ return new_XML
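The internal subs generation that `generate_species_level_data` describes (find each non-species taxon's rank, then gather the dataset's species under that rank into a polytomy) can be sketched as follows; `species_level_subs` is a hypothetical helper over a simplified taxonomy dict, not part of the toolkit:

```python
def species_level_subs(taxa, taxonomy):
    """For each non-species taxon, collect the dataset species that fall under it."""
    old_taxa, new_taxa = [], []
    for t in taxa:
        if 'species' in taxonomy[t]:
            continue  # already at species level
        # work out which rank this taxon sits at (its own name appears at that rank)
        level = next((l for l, v in taxonomy[t].items() if v == t), None)
        if level is None:
            continue
        subs = [s.replace(' ', '_') for s in taxa
                if 'species' in taxonomy[s] and taxonomy[s].get(level) == t]
        if subs:
            old_taxa.append(t.replace(' ', '_'))
            new_taxa.append(','.join(subs))
    return old_taxa, new_taxa

taxonomy = {"Gallus": {"genus": "Gallus"},
            "Gallus varius": {"genus": "Gallus", "species": "Gallus varius"},
            "Gallus lafayetii": {"genus": "Gallus", "species": "Gallus lafayetii"}}
print(species_level_subs(["Gallus", "Gallus varius", "Gallus lafayetii"], taxonomy))
```

The comma-joined `new_taxa` entry is what makes the replacement a single polytomy in one substitution step, avoiding the spurious structure the docstring warns about.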
4200
4201 def data_overlap(XML, overlap_amount=2, filename=None, detailed=False, show=False, verbose=False, ignoreWarnings=False):
4202 """ Calculate the amount of taxonomic overlap between source trees.
4203@@ -2024,7 +2724,7 @@
4204 If filename is None, no graphic is generated. Otherwise a simple
4205 graphic is generated showing the number of cluster. If detailed is set to
4206 true, a graphic is generated showing *all* trees. For data containing >200
4207- source tres this could be very big and take along time. More likely, you'll run
4208+    source trees this could be very big and take a long time. More likely, you'll run
4209 out of memory.
4210 """
4211 import matplotlib
4212@@ -2103,6 +2803,7 @@
4213 sufficient_overlap = True
4214
4215 # The above list actually contains which components are seperate from each other
4216+ key_list = connected_components
4217
4218 if (not filename == None or show):
4219 if (verbose):
4220@@ -2266,7 +2967,9 @@
4221 prev_char = None
4222 prev_taxa = None
4223 prev_name = None
4224- non_ind = {}
4225+ subsets = []
4226+ identical = []
4227+ is_identical = False
4228 for data in data_ind:
4229 name = data[0]
4230 char = data[1]
4231@@ -2275,22 +2978,71 @@
4232 # when sorted, the longer list comes first
4233 if set(taxa).issubset(set(prev_taxa)):
4234 if (taxa == prev_taxa):
4235- non_ind[name] = [prev_name,IDENTICAL]
4236+ if (is_identical):
4237+ identical[-1].append(name)
4238+ else:
4239+ identical.append([name,prev_name])
4240+ is_identical = True
4241+
4242 else:
4243- non_ind[name] = [prev_name,SUBSET]
4244+ subsets.append([prev_name, name])
4245+ prev_name = name
4246+ is_identical = False
4247+ else:
4248+ prev_name = name
4249+ is_identical = False
4250+ else:
4251+ prev_name = name
4252+ is_identical = False
4253+
4254 prev_char = char
4255 prev_taxa = taxa
4256- prev_name = name
4257-
4258+
4259 if (make_new_xml):
4260 new_xml = XML
4261- for name in non_ind:
4262- if (non_ind[name][1] == SUBSET):
4263- new_xml = _swap_tree_in_XML(new_xml,None,name)
4264+ # deal with subsets
4265+ for s in subsets:
4266+ new_xml = _swap_tree_in_XML(new_xml,None,s[1])
4267 new_xml = clean_data(new_xml)
4268- return non_ind, new_xml
4269+        # deal with identical trees - weight them: if there are 3 identical
4270+        # trees, each gets weight 1/3, i.e. weight = 1/(number of identical trees)
4271+ for i in identical:
4272+ weight = 1.0 / float(len(i))
4273+ new_xml = add_weights(new_xml, i, weight)
4274+
4275+ return identical, subsets, new_xml
4276 else:
4277- return non_ind
4278+ return identical, subsets
4279+
4280+
4281+def add_weights(XML, names, weight):
4282+    """ Set the weight on each named tree: supply an array of tree names and a weight, and each tree gets that weight.
4283+    Returns a new XML string.
4284+    """
4285+
4286+ xml_root = _parse_xml(XML)
4287+ # By getting source, we can then loop over each source_tree
4288+ find = etree.XPath("//source_tree")
4289+ sources = find(xml_root)
4290+ for s in sources:
4291+ s_name = s.attrib['name']
4292+ for n in names:
4293+ if s_name == n:
4294+ if s.xpath("tree/weight/real_value") == []:
4295+ # add weights
4296+ weights_element = etree.Element("weight")
4297+ weights_element.tail="\n"
4298+ real_value = etree.SubElement(weights_element,'real_value')
4299+ real_value.attrib['rank'] = '0'
4300+ real_value.tail = '\n'
4301+ real_value.text = str(weight)
4302+ t = s.xpath("tree")[0]
4303+ t.append(weights_element)
4304+ else:
4305+ s.xpath("tree/weight/real_value")[0].text = str(weight)
4306+
4307+ return etree.tostring(xml_root,pretty_print=True)
4308+
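The 1/n weighting applied to each group of identical trees before calling `add_weights` can be sketched as a pure function (`identical_tree_weights` is a hypothetical helper name):

```python
def identical_tree_weights(identical_groups):
    """Each tree in a group of n identical trees gets weight 1/n."""
    weights = {}
    for group in identical_groups:
        w = 1.0 / len(group)
        for name in group:
            weights[name] = w
    return weights

print(identical_tree_weights([["a", "b", "c"], ["d", "e"]]))
```

This keeps a set of duplicated source trees from contributing more than one tree's worth of signal to the supertree analysis.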
4309
4310 def add_historical_event(XML, event_description):
4311 """
4312@@ -2380,8 +3132,15 @@
4313 # check trees are informative
4314 XML = _check_informative_trees(XML,delete=True)
4315
4316+
4317 # check sources
4318 XML = _check_sources(XML,delete=True)
4319+ XML = all_sourcenames(XML)
4320+
4321+ # fix tree names
4322+ XML = set_unique_names(XML)
4323+ XML = set_all_tree_names(XML,overwrite=True)
4324+
4325
4326 # unpermutable trees
4327 permutable_trees = _find_trees_for_permuting(XML)
4328@@ -2659,7 +3418,7 @@
4329 s.getparent().remove(s)
4330
4331 # edit name (append _subset)
4332- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
4333+ proj_name = get_project_name(XML)
4334 proj_name += "_subset"
4335 xml_root.xpath('/phylo_storage/project_name/string_value')[0].text = proj_name
4336
4337@@ -2928,6 +3687,37 @@
4338
4339 return mrca
4340
4341+
4342+def tree_from_taxonomy(taxonomy, end_level, end_rank):
4343+ """Create a tree from a taxonomy data structure.
4344+ This is not the most efficient way, but works OK
4345+ """
4346+
4347+ # Grab data only for the end_level classification
4348+ required_taxonomy = {}
4349+ for t in taxonomy:
4350+ if (end_level in t):
4351+ required_taxonomy[t] = taxonomy[t]
4352+
4353+ rank_index = taxonomy_levels.index(end_rank)
4354+
4355+ # create basic string
4356+
4357+ # get unique otus
4358+
4359+ # sort by the subfamily
4360+
4361+ # for each genus create a newick string
4362+
4363+ # if it's the same grouping as previous, add as sister clade (i.e. ,)
4364+ # else, prepend a (, append a ) and add new clade (ie. ,)
4365+
4366+
4367+ # return tree
4368+
4369+
4370+
4371+
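A minimal sketch of what `tree_from_taxonomy`'s comment outline describes, grouping taxa into per-rank polytomies and emitting a newick string, assuming a simplified taxonomy dict (the function name, signature, and grouping rank here are illustrative only, not the toolkit's implementation):

```python
def tree_from_taxonomy_sketch(taxonomy, rank='genus'):
    """Group species into one polytomy per value of `rank`; return a newick string."""
    groups = {}
    for taxon, levels in sorted(taxonomy.items()):
        if 'species' not in levels or rank not in levels:
            continue  # only place fully-classified species
        groups.setdefault(levels[rank], []).append(taxon.replace(' ', '_'))
    # one parenthesised clade per group, sister clades comma-separated
    clades = ['(%s)' % ','.join(members) for _, members in sorted(groups.items())]
    return '(%s);' % ','.join(clades)

tax = {'Gallus varius':    {'genus': 'Gallus', 'species': 'Gallus varius'},
       'Gallus lafayetii': {'genus': 'Gallus', 'species': 'Gallus lafayetii'},
       'Ardea goliath':    {'genus': 'Ardea',  'species': 'Ardea goliath'}}
print(tree_from_taxonomy_sketch(tax))
```

Sorting first means taxa sharing a rank value arrive adjacently, which is the "same grouping as previous → add as sister clade" step in the outline above.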
4372 ################ PRIVATE FUNCTIONS ########################
4373
4374 def _uniquify(l):
4375@@ -2975,13 +3765,25 @@
4376 "The source names in the dataset are not unique. Please run the auto-name function on these data. Name: "+name+"\n"
4377 last_name = name
4378
4379+ # do same for tree names:
4380+ names = get_all_tree_names(XML)
4381+ names.sort()
4382+ last_name = "" # This will actually throw an non-unique error if a name is empty
4383+ # not great, but still an error!
4384+ for name in names:
4385+ if name == last_name:
4386+ # if non-unique throw exception
4387+ message = message + \
4388+ "The tree names in the dataset are not unique. Please run the auto-name function on these data with replace or edit by hand. Name: "+name+"\n"
4389+ last_name = name
4390+
4391 if (not message == ""):
4392 raise NotUniqueError(message)
4393
4394 return
4395
4396
4397-def _assemble_tree_matrix(tree_string):
4398+def _assemble_tree_matrix(tree_string, verbose=False):
4399 """ Assembles the MRP matrix for an individual tree
4400
4401 returns: matrix (2D numpy array: taxa on i, nodes on j)
4402@@ -3009,7 +3811,7 @@
4403 for i in range(0,len(names)):
4404 adjmat.append([1])
4405 adjmat = numpy.array(adjmat)
4406-
4407+ if verbose:
4408 print "Warning: Found uninformative tree in data. Including it in the matrix anyway"
4409
4410 return adjmat, names
4411@@ -3020,7 +3822,7 @@
4412
4413 If the new_taxa array is missing, simply delete the old_taxa
4414 """
4415-
4416+
4417 tree = _correctly_quote_taxa(tree)
4418 # are the input values lists or simple strings?
4419 if (isinstance(old_taxa,str)):
4420@@ -3564,7 +4366,7 @@
4421
4422 return permute_trees
4423
4424-def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None):
4425+def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None, verbose=False):
4426 """
4427 Does the hard work on creating a matrix
4428 """
4429@@ -3585,7 +4387,7 @@
4430 if (not weights == None):
4431 weight = weights[key]
4432 names.append(key)
4433- submatrix, tree_taxa = _assemble_tree_matrix(trees[key])
4434+ submatrix, tree_taxa = _assemble_tree_matrix(trees[key], verbose=verbose)
4435 nChars = len(submatrix[0,:])
4436 # loop over characters in the submatrix
4437 for i in range(1,nChars):
4438@@ -3637,7 +4439,7 @@
4439 matrix_string += string + "\n"
4440 i += 1
4441
4442- matrix_string += "\t;\n"
4443+ matrix_string += "\n"
4444 if (not weights == None):
4445 # get unique weights
4446 unique_weights = _uniquify(weights)
4447@@ -3652,7 +4454,7 @@
4448 matrix_string += " " + str(i)
4449 i += 1
4450 matrix_string += ";\n"
4451- matrix_string += "procedure /;"
4452+ matrix_string += "proc /;"
4453 elif (format == 'nexus'):
4454 matrix_string = "#nexus\n\nbegin data;\n"
4455 matrix_string += "\tdimensions ntax = "+str(len(taxa)) +" nchar = "+str(last_char)+";\n"
4456
4457=== modified file 'stk/test/_substitute_taxa.py'
4458--- stk/test/_substitute_taxa.py 2016-07-14 10:12:17 +0000
4459+++ stk/test/_substitute_taxa.py 2017-01-12 09:27:31 +0000
4460@@ -10,6 +10,7 @@
4461 from stk.supertree_toolkit import check_subs, _tree_contains, _correctly_quote_taxa, _remove_single_poly_taxa
4462 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_all_taxa, _parse_tree, _delete_taxon
4463 from stk.supertree_toolkit import _collapse_nodes, import_tree, subs_from_csv, _getTaxaFromNewick, obtain_trees
4464+from stk.supertree_toolkit import generate_species_level_data
4465 from lxml import etree
4466 from util import *
4467 from stk.stk_exceptions import *
4468@@ -776,7 +777,24 @@
4469 new_tree = _sub_taxa_in_tree(tree2,"Thereuopodina",sub_in,skip_existing=True);
4470 self.assert_(answer2, new_tree)
4471
4472-
4473+
4474+ def test_auto_subs_taxonomy(self):
4475+ """test the automatic subs function with a simple test"""
4476+ XML = etree.tostring(etree.parse('data/input/auto_sub.phyml',parser),pretty_print=True)
4477+ taxonomy = {'Ardea goliath': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea goliath'},
4478+ 'Pelecaniformes': {'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Pelecaniformes', 'class': 'Aves', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013'}, 'Gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes'},
4479+ 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4480+ 'Platalea leucorodia': {'kingdom': 'Animalia', 'subfamily': 'Plataleinae', 'family': 'Threskiornithidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Platalea', 'order': 'Pelecaniformes', 'species': 'Platalea leucorodia'},
4481+ 'Gallus lafayetii': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus lafayetii'},
4482+ 'Ardea humbloti': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea humbloti'},
4483+ 'Gallus varius': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus varius'}}
4484+ XML = generate_species_level_data(XML, taxonomy)
4485+ expected_XML = etree.tostring(etree.parse('data/output/one_click_subs_output.phyml',parser),pretty_print=True)
4486+ trees = obtain_trees(XML)
4487+ expected_trees = obtain_trees(expected_XML)
4488+ for t in trees:
4489+ self.assert_(_trees_equal(trees[t], expected_trees[t]))
4490+
4491 def test_parrot_edge_case(self):
4492 """Random edge case where the tree dissappeared..."""
4493 trees = ["(((((((Agapornis_lilianae, Agapornis_nigrigenis), Agapornis_personata, Agapornis_fischeri), Agapornis_roseicollis), (Agapornis_pullaria, Agapornis_taranta)), Agapornis_cana), Loriculus_galgulus), Geopsittacus_occidentalis);"]
4494
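The assertions above compare trees with `_trees_equal`, which tests topological rather than textual equality. A minimal sketch of that idea, with no STK dependency (the toy parser below handles only plain Newick with no branch lengths, and is illustrative rather than the toolkit's implementation):

```python
def clade_sets(newick):
    """Collect the leaf-set of every clade in a plain Newick string."""
    stack = [[]]   # one list of child clades per open parenthesis
    clades = set()
    name = ""
    for ch in newick:
        if ch == "(":
            stack.append([])
        elif ch in ",);":
            if name.strip():
                leaf = frozenset([name.strip()])
                stack[-1].append(leaf)
                clades.add(leaf)
                name = ""
            if ch == ")":
                # close the current clade and merge its children
                merged = frozenset().union(*stack.pop())
                stack[-1].append(merged)
                clades.add(merged)
        else:
            name += ch
    return clades

def trees_equal(tree1, tree2):
    """Two trees are topologically equal iff they contain the same clades."""
    return clade_sets(tree1) == clade_sets(tree2)
```

On this definition `"(A,(B,C));"` equals `"((C,B),A);"` (same clades, different ordering) but differs from `"((A,B),C);"`.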
4495=== modified file 'stk/test/_supertree_toolkit.py'
4496--- stk/test/_supertree_toolkit.py 2015-03-26 09:58:58 +0000
4497+++ stk/test/_supertree_toolkit.py 2017-01-12 09:27:31 +0000
4498@@ -7,12 +7,13 @@
4499 import os
4500 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir, os.pardir )
4501 sys.path.insert(0, stk_path)
4502-from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence
4503+from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence, add_weights
4504 from stk.supertree_toolkit import get_fossil_taxa, get_publication_years, data_summary, get_character_numbers, get_analyses_used
4505 from stk.supertree_toolkit import data_overlap, read_matrix, subs_file_from_str, clean_data, obtain_trees, get_all_source_names
4506 from stk.supertree_toolkit import add_historical_event, _sort_data, _parse_xml, _check_sources, _swap_tree_in_XML, replace_genera
4507 from stk.supertree_toolkit import get_all_taxa, _get_all_siblings, _parse_tree, get_characters_used, _trees_equal, get_weights
4508-from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, load_taxonomy
4509+from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, taxonomic_checker, load_taxonomy, load_equivalents
4510+from stk.supertree_toolkit import create_taxonomy, create_taxonomy_from_tree, get_all_tree_names
4511 from lxml import etree
4512 from util import *
4513 from stk.stk_exceptions import *
4514@@ -268,19 +269,52 @@
4515
4516 def test_data_independence(self):
4517 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4518- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4519- non_ind = data_independence(XML)
4520- self.assertDictEqual(expected_dict, non_ind)
4521+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4522+ non_ind,subsets = data_independence(XML)
4523+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4524+ self.assertListEqual(expected_subsets, subsets)
4525+ self.assertListEqual(expected_idents, non_ind)
4526
4527- def test_data_independence(self):
4528+ def test_data_independence_2(self):
4529 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4530- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4531- non_ind, new_xml = data_independence(XML,make_new_xml=True)
4532- self.assertDictEqual(expected_dict, non_ind)
4533+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4534+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4535+ non_ind, subset, new_xml = data_independence(XML,make_new_xml=True)
4536+ self.assertListEqual(expected_idents, non_ind)
4537+ self.assertListEqual(expected_subsets, subset)
4538 # check the second tree has not been removed
4539 self.assertRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4540 # check that the first tree is removed
4541 self.assertNotRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,(F:1.00000,E:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4542+
4543+ def test_add_weights(self):
4544+ """Add weights to a bunch of trees"""
4545+ XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4546+ # see above
4547+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4548+ # so the first should end up with a weight of 0.33333 and the second with 0.5
4549+ for ei in expected_idents:
4550+ weight = 1.0/float(len(ei))
4551+ XML = add_weights(XML, ei, weight)
4552+
4553+ expected_weights = [str(1.0/3.0), str(1.0/3.0), str(1.0/3.0), str(0.5), str(0.5)]
4554+ weights_in_xml = []
4555+ # now check weights have been added to the correct part of the tree
4556+ xml_root = _parse_xml(XML)
4557+ i = 0
4558+ for ei in expected_idents:
4559+ for tree in ei:
4560+ find = etree.XPath("//source_tree")
4561+ trees = find(xml_root)
4562+ for t in trees:
4563+ if t.attrib['name'] == tree:
4564+                    # record the weight stored for this source tree
4565+ weights_in_xml.append(t.xpath("tree/weight/real_value")[0].text)
4566+
4567+ self.assertListEqual(expected_weights,weights_in_xml)
4568+
4569+
4570+
4571
4572 def test_overlap(self):
4573 XML = etree.tostring(etree.parse('data/input/check_overlap_ok.phyml',parser),pretty_print=True)
4574@@ -438,7 +472,7 @@
4575 XML = clean_data(XML)
4576 trees = obtain_trees(XML)
4577 self.assert_(len(trees) == 2)
4578- expected_trees = {'Hill_2011_4': '(A,B,(C,D,E));', 'Hill_2011_2': '(A, B, C, (D, E, F));'}
4579+ expected_trees = {'Hill_2011_2': '(A,B,(C,D,E));', 'Hill_2011_1': '(A, B, C, (D, E, F));'}
4580 for t in trees:
4581 self.assert_(_trees_equal(trees[t],expected_trees[t]))
4582
4583@@ -558,18 +592,78 @@
4584 self.assert_(c in expected_characters)
4585 self.assert_(len(characters) == len(expected_characters))
4586
4587+ def test_create_taxonomy(self):
4588+ XML = etree.tostring(etree.parse('data/input/create_taxonomy.phyml',parser),pretty_print=True)
4589+        # Retested on 11/01/17: EOL has changed its output; the old values are kept below for reference
4590+ # old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}}
4591+ expected = {'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}}
4592+ if (internet_on()):
4593+ taxonomy = create_taxonomy(XML)
4594+ self.maxDiff = None
4595+ self.assertDictEqual(taxonomy, expected)
4596+ else:
4597+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4598+ return
4599+
4600+ def test_create_taxonomy_from_tree(self):
4601+ """Tests if taxonomy from tree works. Uses same data for normal XML test but goes directly for the tree instead of parsing the XML """
4602+        # Retested on 11/01/17: EOL now returns different data, so the expected values below were updated
4603+ #old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4604+ expected = {'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4605+ tree = "(Archaeopteryx_lithographica, (Gallus_gallus, (Thalassarche_melanophris, Egretta_tricolor)));"
4606+ if (internet_on()):
4607+ taxonomy = create_taxonomy_from_tree(tree)
4608+ self.maxDiff = None
4609+ self.assertDictEqual(taxonomy, expected)
4610+ else:
4611+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy function"
4612+ return
4613+
4614+ def test_taxonomy_checker(self):
4615+ expected = {'Thalassarche_melanophrys': [['Thalassarche_melanophris', 'Thalassarche_melanophrys', 'Diomedea_melanophris', 'Thalassarche_[melanophrys', 'Diomedea_melanophrys'], 'amber'], 'Egretta_tricolor': [['Egretta_tricolor'], 'green'], 'Gallus_gallus': [['Gallus_gallus'], 'green']}
4616+ XML = etree.tostring(etree.parse('data/input/check_taxonomy.phyml',parser),pretty_print=True)
4617+ if (internet_on()):
4618+ equivs = taxonomic_checker(XML)
4619+ self.maxDiff = None
4620+ self.assertDictEqual(equivs, expected)
4621+ else:
4622+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4623+ return
4624+
4625+ def test_taxonomy_checker2(self):
4626+ XML = etree.tostring(etree.parse('data/input/check_taxonomy_fixes.phyml',parser),pretty_print=True)
4627+ if (internet_on()):
4628+ # This test is a bit dodgy as it depends on EOL's server speed. Run it a few times before deciding it's broken.
4629+ equivs = taxonomic_checker(XML,verbose=False)
4630+ self.maxDiff = None
4631+ self.assert_(equivs['Agathamera_crassa'][0][0] == 'Agathemera_crassa')
4632+ self.assert_(equivs['Celatoblatta_brunni'][0][0] == 'Maoriblatta_brunni')
4633+ self.assert_(equivs['Blatta_lateralis'][1] == 'amber')
4634+ else:
4635+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4636+ return
4637+
4638+
4639 def test_load_taxonomy(self):
4640 csv_file = "data/input/create_taxonomy.csv"
4641- expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'},
4642- 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'subclass': 'Neoloricata', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'suborder': 'Ischnochitonina', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes', 'species': 'Egretta tricolor'},
4643- 'Gallus gallus': {'kingdom': 'Animalia', 'infrakingdom': 'Protostomia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus gallus'},
4644- 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4645- 'Jeletzkytes criptonodosus': {'kingdom': 'Metazoa', 'family': 'Scaphitidae', 'order': 'Ammonoidea', 'phylum': 'Mollusca', 'provider': 'PBDB', 'species': 'Jeletzkytes criptonodosus', 'class': 'Cephalopoda'}}
4646+ expected = {'Jeletzkytes_criptonodosus': {'kingdom': 'Metazoa', 'subclass': 'Cephalopoda', 'species': 'Jeletzkytes criptonodosus', 'suborder': 'Ammonoidea', 'provider': 'PBDB', 'subfamily': 'Scaphitidae', 'class': 'Mollusca'}, 'Archaeopteryx_lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta_tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'subclass': 'Neoloricata', 'species': 'Egretta tricolor', 'phylum': 'Chordata', 'suborder': 'Ischnochitonina', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus_gallus': {'kingdom': 'Animalia', 'superorder': 'Galliformes', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'class': 'Aves'}, 'Thalassarche_melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'species': 'Thalassarche melanophris', 'order': 'Procellariiformes', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'class': 'Aves'}}
4647 taxonomy = load_taxonomy(csv_file)
4648 self.maxDiff = None
4649
4650 self.assertDictEqual(taxonomy, expected)
4651
4652+
4653+ def test_load_equivalents(self):
4654+ csv_file = "data/input/equivalents.csv"
4655+ expected = {'Turnix_sylvatica': [['Turnix_sylvaticus','Tetrao_sylvaticus','Tetrao_sylvatica','Turnix_sylvatica'],'yellow'],
4656+ 'Xiphorhynchus_pardalotus':[['Xiphorhynchus_pardalotus'],'green'],
4657+ 'Phaenicophaeus_curvirostris':[['Zanclostomus_curvirostris','Rhamphococcyx_curvirostris','Phaenicophaeus_curvirostris','Rhamphococcyx_curvirostr'],'yellow'],
4658+ 'Megalapteryx_benhami':[['Megalapteryx_benhami'],'red']
4659+ }
4660+ equivalents = load_equivalents(csv_file)
4661+ self.assertDictEqual(equivalents, expected)
4662+
4663+
4664 def test_name_tree(self):
4665 XML = etree.tostring(etree.parse('data/input/single_source_no_names.phyml',parser),pretty_print=True)
4666 xml_root = _parse_xml(XML)
4667@@ -583,6 +677,35 @@
4668 XML = etree.tostring(etree.parse('data/input/single_source.phyml',parser),pretty_print=True)
4669 self.assert_(isEqualXML(new_xml,XML))
4670
4671+ def test_all_rename_tree(self):
4672+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4673+ new_xml = set_all_tree_names(XML,overwrite=True)
4674+ XML = etree.tostring(etree.parse('data/output/single_source_same_tree_name.phyml',parser),pretty_print=True)
4675+ self.assert_(isEqualXML(new_xml,XML))
4676+
4677+ def test_get_all_tree_names(self):
4678+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4679+ names = get_all_tree_names(XML)
4680+ self.assertListEqual(names,['Hill_2011_2','Hill_2011_2'])
4681+
4682+
4683+def internet_on(host="8.8.8.8", port=443, timeout=5):
4684+ import socket
4685+
4686+ """
4687+ Host: 8.8.8.8 (google-public-dns-a.google.com)
4688+  OpenPort: 443/tcp (the function's default below, not DNS port 53)
4689+  Service: plain TCP connect to test reachability
4690+ """
4691+ try:
4692+ socket.setdefaulttimeout(timeout)
4693+ socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect((host, port))
4694+ return True
4695+ except Exception as ex:
4696+     print ex
4697+ return False
4698+
4699+
4700
4701 if __name__ == '__main__':
4702 unittest.main()
4703
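`test_add_weights` above encodes the rule that every member of a group of non-independent trees is down-weighted to 1/(group size). Stripped of the STK XML machinery, the rule is simply the following (`downweight` is a hypothetical helper for illustration, not part of the toolkit's API):

```python
def downweight(groups):
    """Map each tree name to 1/len(group) for its non-independence group."""
    weights = {}
    for group in groups:
        w = 1.0 / len(group)
        for name in group:
            weights[name] = w
    return weights

# the two groups used in the test above
groups = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'],
          ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
```

So the three 2011 trees each receive 1/3 and the two 2013 trees each receive 0.5, matching `expected_weights` in the test.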
4704=== modified file 'stk/test/_trees.py'
4705--- stk/test/_trees.py 2015-03-26 09:58:58 +0000
4706+++ stk/test/_trees.py 2017-01-12 09:27:31 +0000
4707@@ -5,7 +5,7 @@
4708 sys.path.insert(0,"../../")
4709 from stk.supertree_toolkit import import_tree, obtain_trees, get_all_taxa, _assemble_tree_matrix, create_matrix, _delete_taxon, _sub_taxon,_tree_contains
4710 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_taxa_from_tree, get_characters_from_tree, amalgamate_trees, _uniquify
4711-from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick
4712+from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick, _parse_tree
4713 from stk.supertree_toolkit import get_mrca
4714 import os
4715 from lxml import etree
4716@@ -215,6 +215,18 @@
4717 mrca = get_mrca(tree,["A","I", "L"])
4718 self.assert_(mrca == 8)
4719
4720+    def test_get_mrca_2(self):
4721+ tree = "(B,(C,(D,(E,((A,F),((I,(G,H)),(J,(K,L))))))));"
4722+ mrca = get_mrca(tree,["A","F"])
4723+ print mrca
4724+ #self.assert_(mrca == 8)
4725+ to = _parse_tree('(X,Y,Z,(Q,W));')
4726+ treeobj = _parse_tree(tree)
4727+ newnode = treeobj.addNodeBetweenNodes(10,9)
4728+ treeobj.addSubTree(newnode, to, ignoreRootAssert=True)
4729+ treeobj.draw()
4730+
4731+
4732 def test_get_all_trees(self):
4733 XML = etree.tostring(etree.parse(single_source_input,parser),pretty_print=True)
4734 tree = obtain_trees(XML)
4735
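`get_mrca` above is exercised through p4 node numbers. The underlying operation, given only child-to-parent links, is the classic ancestor-set walk; a sketch under invented node labels, not p4's implementation:

```python
def mrca(parent, a, b):
    """Most recent common ancestor in a child -> parent mapping (root maps to None)."""
    # collect every ancestor of a, including a itself
    ancestors = set()
    while a is not None:
        ancestors.add(a)
        a = parent.get(a)
    # walk up from b until we hit that set
    while b not in ancestors:
        b = parent.get(b)
    return b

# A fragment of the test tree ((A,F),((I,(G,H)),...)) with made-up node names:
parent = {"A": "n_AF", "F": "n_AF", "G": "n_GH", "H": "n_GH",
          "I": "n_IGH", "n_GH": "n_IGH", "n_AF": "root",
          "n_IGH": "root", "root": None}
```

With this fragment, the MRCA of A and F is their immediate parent, while A and H only meet at the root.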
4736=== added file 'stk/test/data/input/auto_sub.phyml'
4737--- stk/test/data/input/auto_sub.phyml 1970-01-01 00:00:00 +0000
4738+++ stk/test/data/input/auto_sub.phyml 2017-01-12 09:27:31 +0000
4739@@ -0,0 +1,97 @@
4740+<?xml version='1.0' encoding='utf-8'?>
4741+<phylo_storage>
4742+ <project_name>
4743+ <string_value lines="1">Test</string_value>
4744+ </project_name>
4745+ <sources>
4746+ <source name="Hill_2011">
4747+ <bibliographic_information>
4748+ <article>
4749+ <authors>
4750+ <author>
4751+ <surname>
4752+ <string_value lines="1">Hill</string_value>
4753+ </surname>
4754+ <other_names>
4755+ <string_value lines="1">Jon</string_value>
4756+ </other_names>
4757+ </author>
4758+ </authors>
4759+ <title>
4760+ <string_value lines="1">A great paper</string_value>
4761+ </title>
4762+ <year>
4763+ <integer_value rank="0">2011</integer_value>
4764+ </year>
4765+ <journal>
4766+ <string_value lines="1">Nature</string_value>
4767+ </journal>
4768+ <pages>
4769+ <string_value lines="1">1-12</string_value>
4770+ </pages>
4771+ </article>
4772+ </bibliographic_information>
4773+ <source_tree name="Hill_2011_1">
4774+ <tree>
4775+ <tree_string>
4776+ <string_value lines="1">(Thalassarche_melanophris, Pelecaniformes, (Gallus, Gallus_varius));</string_value>
4777+ </tree_string>
4778+ <figure_legend>
4779+ <string_value lines="1">NA</string_value>
4780+ </figure_legend>
4781+ <figure_number>
4782+ <string_value lines="1">1</string_value>
4783+ </figure_number>
4784+ <page_number>
4785+ <string_value lines="1">1</string_value>
4786+ </page_number>
4787+ <tree_inference>
4788+ <optimality_criterion name="Maximum Parsimony"/>
4789+ </tree_inference>
4790+ <topology>
4791+ <outgroup>
4792+ <string_value lines="1">A</string_value>
4793+ </outgroup>
4794+ </topology>
4795+ </tree>
4796+ <taxa_data>
4797+ <all_extant/>
4798+ </taxa_data>
4799+ <character_data>
4800+ <character type="molecular" name="12S"/>
4801+ </character_data>
4802+ </source_tree>
4803+ <source_tree name="Hill_2011_2">
4804+ <tree>
4805+ <tree_string>
4806+ <string_value lines="1">(Gallus_lafayetii, (Platalea_leucorodia, (Ardea_humbloti, Ardea_goliath)));</string_value>
4807+ </tree_string>
4808+ <figure_legend>
4809+ <string_value lines="1">NA</string_value>
4810+ </figure_legend>
4811+ <figure_number>
4812+ <string_value lines="1">1</string_value>
4813+ </figure_number>
4814+ <page_number>
4815+ <string_value lines="1">1</string_value>
4816+ </page_number>
4817+ <tree_inference>
4818+ <optimality_criterion name="Maximum Parsimony"/>
4819+ </tree_inference>
4820+ <topology>
4821+ <outgroup>
4822+ <string_value lines="1">A</string_value>
4823+ </outgroup>
4824+ </topology>
4825+ </tree>
4826+ <taxa_data>
4827+ <all_extant/>
4828+ </taxa_data>
4829+ <character_data>
4830+ <character type="molecular" name="12S"/>
4831+ </character_data>
4832+ </source_tree>
4833+ </source>
4834+ </sources>
4835+ <history/>
4836+</phylo_storage>
4837
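The tests load fixtures like `auto_sub.phyml` with lxml and pull trees out via XPath. The same structure can be navigated with nothing but the standard library; the snippet below uses stdlib `ElementTree` instead of the toolkit's lxml, and a cut-down fixture rather than the full file:

```python
import xml.etree.ElementTree as ET

# minimal excerpt of the auto_sub.phyml structure added above
PHYML = """<phylo_storage><sources><source name="Hill_2011">
  <source_tree name="Hill_2011_1"><tree><tree_string>
    <string_value lines="1">(Thalassarche_melanophris, Pelecaniformes, (Gallus, Gallus_varius));</string_value>
  </tree_string></tree></source_tree>
</source></sources></phylo_storage>"""

def tree_strings(xml_text):
    """Map each source_tree name to its Newick string."""
    root = ET.fromstring(xml_text)
    out = {}
    for st in root.iter("source_tree"):
        sv = st.find("./tree/tree_string/string_value")
        out[st.get("name")] = sv.text.strip()
    return out
```

This mirrors the `//source_tree` XPath plus `tree/tree_string/string_value` lookups used in `test_add_weights`.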
4838=== modified file 'stk/test/data/input/check_data_ind.phyml'
4839--- stk/test/data/input/check_data_ind.phyml 2014-10-09 09:33:21 +0000
4840+++ stk/test/data/input/check_data_ind.phyml 2017-01-12 09:27:31 +0000
4841@@ -249,6 +249,147 @@
4842 <character type="molecular" name="12S"/>
4843 </character_data>
4844 </source_tree>
4845+ <source_tree name="Hill_Davis_2011_3">
4846+ <tree>
4847+ <tree_string>
4848+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4849+ </tree_string>
4850+ <figure_legend>
4851+ <string_value lines="1">NA</string_value>
4852+ </figure_legend>
4853+ <figure_number>
4854+ <string_value lines="1">0</string_value>
4855+ </figure_number>
4856+ <page_number>
4857+ <string_value lines="1">0</string_value>
4858+ </page_number>
4859+ <tree_inference>
4860+ <optimality_criterion name="Maximum Parsimony"/>
4861+ </tree_inference>
4862+ <topology>
4863+ <outgroup>
4864+ <string_value lines="1">A</string_value>
4865+ </outgroup>
4866+ </topology>
4867+ </tree>
4868+ <taxa_data>
4869+ <mixed_fossil_and_extant>
4870+ <taxon name="A">
4871+ <fossil/>
4872+ </taxon>
4873+ <taxon name="B">
4874+ <fossil/>
4875+ </taxon>
4876+ </mixed_fossil_and_extant>
4877+ </taxa_data>
4878+ <character_data>
4879+ <character type="molecular" name="12S"/>
4880+ </character_data>
4881+ </source_tree>
4882+ </source>
4883+ <source name="Hill_Davis_2013">
4884+ <bibliographic_information>
4885+ <article>
4886+ <authors>
4887+ <author>
4888+ <surname>
4889+ <string_value lines="1">Hill</string_value>
4890+ </surname>
4891+ <other_names>
4892+ <string_value lines="1">Jon</string_value>
4893+ </other_names>
4894+ </author>
4895+ <author>
4896+ <surname>
4897+ <string_value lines="1">Davis</string_value>
4898+ </surname>
4899+ <other_names>
4900+ <string_value lines="1">Katie</string_value>
4901+ </other_names>
4902+ </author>
4903+ </authors>
4904+ <title>
4905+ <string_value lines="1">Another superb paper</string_value>
4906+ </title>
4907+ <year>
4908+ <integer_value rank="0">2013</integer_value>
4909+ </year>
4910+ </article>
4911+ </bibliographic_information>
4912+ <source_tree name="Hill_Davis_2013_1">
4913+ <tree>
4914+ <tree_string>
4915+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4916+ </tree_string>
4917+ <figure_legend>
4918+ <string_value lines="1">NA</string_value>
4919+ </figure_legend>
4920+ <figure_number>
4921+ <string_value lines="1">0</string_value>
4922+ </figure_number>
4923+ <page_number>
4924+ <string_value lines="1">0</string_value>
4925+ </page_number>
4926+ <tree_inference>
4927+ <optimality_criterion name="Maximum Parsimony"/>
4928+ </tree_inference>
4929+ <topology>
4930+ <outgroup>
4931+ <string_value lines="1">A</string_value>
4932+ </outgroup>
4933+ </topology>
4934+ </tree>
4935+ <taxa_data>
4936+ <mixed_fossil_and_extant>
4937+ <taxon name="A">
4938+ <fossil/>
4939+ </taxon>
4940+ <taxon name="B">
4941+ <fossil/>
4942+ </taxon>
4943+ </mixed_fossil_and_extant>
4944+ </taxa_data>
4945+ <character_data>
4946+ <character type="molecular" name="12S"/>
4947+ </character_data>
4948+ </source_tree>
4949+ <source_tree name="Hill_Davis_2013_2">
4950+ <tree>
4951+ <tree_string>
4952+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4953+ </tree_string>
4954+ <figure_legend>
4955+ <string_value lines="1">NA</string_value>
4956+ </figure_legend>
4957+ <figure_number>
4958+ <string_value lines="1">0</string_value>
4959+ </figure_number>
4960+ <page_number>
4961+ <string_value lines="1">0</string_value>
4962+ </page_number>
4963+ <tree_inference>
4964+ <optimality_criterion name="Maximum Parsimony"/>
4965+ </tree_inference>
4966+ <topology>
4967+ <outgroup>
4968+ <string_value lines="1">A</string_value>
4969+ </outgroup>
4970+ </topology>
4971+ </tree>
4972+ <taxa_data>
4973+ <mixed_fossil_and_extant>
4974+ <taxon name="A">
4975+ <fossil/>
4976+ </taxon>
4977+ <taxon name="B">
4978+ <fossil/>
4979+ </taxon>
4980+ </mixed_fossil_and_extant>
4981+ </taxa_data>
4982+ <character_data>
4983+ <character type="molecular" name="12S"/>
4984+ </character_data>
4985+ </source_tree>
4986 </source>
4987 </sources>
4988 <history/>
4989
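The `check_data_ind.phyml` fixture above deliberately contains byte-identical source trees (`Hill_Davis_2013_1`/`_2`) and a subset pair, which `data_independence` must group. Reduced to taxon sets, the grouping logic is roughly the following sketch (not the toolkit's code, which also compares character data):

```python
def find_dependence(trees):
    """trees: name -> frozenset of taxa.
    Returns (groups of identical trees, [superset, subset] name pairs)."""
    by_taxa = {}
    for name, taxa in sorted(trees.items()):
        by_taxa.setdefault(taxa, []).append(name)
    idents = [names for names in by_taxa.values() if len(names) > 1]
    subsets = [[big, small]
               for small, ts in sorted(trees.items())
               for big, tb in sorted(trees.items())
               if ts < tb]   # strict subset of taxa
    return idents, subsets

# toy data shaped like the fixture: one identical pair, one subset pair
trees = {"Hill_2011_1": frozenset("ABCDEF"),
         "Hill_2011_2": frozenset("ABCD"),
         "Hill_Davis_2013_1": frozenset("ABFEGZ"),
         "Hill_Davis_2013_2": frozenset("ABFEGZ")}
```

Identical trees are flagged for down-weighting; subset trees are candidates for removal, matching the `non_ind`/`subsets` return values tested above.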
4990=== added file 'stk/test/data/input/check_taxonomy.phyml'
4991--- stk/test/data/input/check_taxonomy.phyml 1970-01-01 00:00:00 +0000
4992+++ stk/test/data/input/check_taxonomy.phyml 2017-01-12 09:27:31 +0000
4993@@ -0,0 +1,67 @@
4994+<?xml version='1.0' encoding='utf-8'?>
4995+<phylo_storage>
4996+ <project_name>
4997+ <string_value lines="1">Test</string_value>
4998+ </project_name>
4999+ <sources>
5000+ <source name="Hill_2011">
The diff has been truncated for viewing.
