Merge lp:~jon-hill/supertree-toolkit/sub_in_subfile into lp:supertree-toolkit
Proposed by: Jon Hill
Status: Merged
Merged at revision: 281
Proposed branch: lp:~jon-hill/supertree-toolkit/sub_in_subfile
Merge into: lp:supertree-toolkit
Diff against target: 13342 lines (+11326/-781), 44 files modified:
  debian/control (+1/-1)
  debian/rules (+1/-0)
  notes.txt (+38/-0)
  stk/bzr_version.py (+5/-5)
  stk/p4/NexusToken.py (+1/-0)
  stk/p4/NexusToken2.py (+1/-1)
  stk/p4/Tree.py (+1/-9)
  stk/p4/Tree_muck.py (+4/-2)
  stk/scripts/check_nomenclature.py (+0/-224)
  stk/scripts/check_nomenclature.py.moved (+224/-0)
  stk/scripts/create_colours_itol.py (+2/-11)
  stk/scripts/create_taxonomy.py (+4/-100)
  stk/scripts/fill_in_with_taxonomy.py (+711/-174)
  stk/scripts/plot_character_taxa_matrix.py (+83/-1)
  stk/scripts/plot_tree_taxa_matrix.py (+56/-0)
  stk/scripts/remove_poorly_constrained_taxa.py (+43/-20)
  stk/scripts/tree_from_taxonomy.py (+142/-0)
  stk/stk (+787/-34)
  stk/stk_exceptions.py (+8/-0)
  stk/supertree_toolkit.py (+849/-47)
  stk/test/_substitute_taxa.py (+19/-1)
  stk/test/_supertree_toolkit.py (+138/-15)
  stk/test/_trees.py (+13/-1)
  stk/test/data/input/auto_sub.phyml (+97/-0)
  stk/test/data/input/check_data_ind.phyml (+141/-0)
  stk/test/data/input/check_taxonomy.phyml (+67/-0)
  stk/test/data/input/check_taxonomy_fixes.phyml (+378/-0)
  stk/test/data/input/create_taxonomy.csv (+6/-6)
  stk/test/data/input/create_taxonomy.phyml (+67/-0)
  stk/test/data/input/equivalents.csv (+5/-0)
  stk/test/data/input/mrca.tre (+1/-0)
  stk/test/data/input/old_stk_test_data_ind.phyml (+1324/-0)
  stk/test/data/input/old_stk_test_data_tax_overlap.phyml (+627/-0)
  stk/test/data/input/old_stk_test_nonmonophyl_removed.phyml (+1324/-0)
  stk/test/data/input/old_stk_test_species_level.phyml (+1324/-0)
  stk/test/data/input/old_stk_test_taxonomy.csv (+334/-0)
  stk/test/data/input/old_stk_test_taxonomy_check_subs.dat (+26/-0)
  stk/test/data/input/old_stk_test_taxonomy_checked.phyml (+1324/-0)
  stk/test/data/input/old_stk_test_taxonomy_checker.csv (+336/-0)
  stk/test/data/output/one_click_subs_output.phyml (+97/-0)
  stk/test/util.py (+7/-0)
  stk_gui/gui/gui.glade (+670/-124)
  stk_gui/plugins/phyml/name_author.py (+4/-1)
  stk_gui/stk_gui/interface.py (+36/-4)
To merge this branch: bzr merge lp:~jon-hill/supertree-toolkit/sub_in_subfile
Related bugs: (none listed)
Reviewer: Jon Hill (Approve)
Review via email: mp+314598@code.launchpad.net
Commit message
Description of the change
Adds taxonomic awareness and fixes a large number of bugs.
Revision history for this message
Jon Hill (jon-hill):
review: Approve
Preview Diff
1 | === modified file 'debian/control' | |||
2 | --- debian/control 2016-12-14 16:22:12 +0000 | |||
3 | +++ debian/control 2017-01-12 09:27:31 +0000 | |||
4 | @@ -9,7 +9,7 @@ | |||
5 | 9 | 9 | ||
6 | 10 | Package: supertree-toolkit | 10 | Package: supertree-toolkit |
7 | 11 | Architecture: all | 11 | Architecture: all |
9 | 12 | Depends: python-tk, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx | 12 | Depends: python-tk, python-simplejson, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx, python-argcomplete |
10 | 13 | Recommends: python-psyco | 13 | Recommends: python-psyco |
11 | 14 | Suggests: | 14 | Suggests: |
12 | 15 | Conflicts: | 15 | Conflicts: |
13 | 16 | 16 | ||
14 | === modified file 'debian/rules' | |||
15 | --- debian/rules 2013-10-14 12:58:59 +0000 | |||
16 | +++ debian/rules 2017-01-12 09:27:31 +0000 | |||
17 | @@ -6,5 +6,6 @@ | |||
18 | 6 | 6 | ||
19 | 7 | override_dh_auto_install: | 7 | override_dh_auto_install: |
20 | 8 | python setup.py install --root=debian/supertree-toolkit --install-layout=deb --install-scripts=/usr/bin | 8 | python setup.py install --root=debian/supertree-toolkit --install-layout=deb --install-scripts=/usr/bin |
21 | 9 | argcomplete.autocomplete(parser) | ||
22 | 9 | 10 | ||
23 | 10 | override_dh_auto_build: | 11 | override_dh_auto_build: |
24 | 11 | 12 | ||
25 | === added file 'notes.txt' | |||
26 | --- notes.txt 1970-01-01 00:00:00 +0000 | |||
27 | +++ notes.txt 2017-01-12 09:27:31 +0000 | |||
28 | @@ -0,0 +1,38 @@ | |||
29 | 1 | Ideas: | ||
30 | 2 | |||
31 | 3 | Collect data, remove paraphyletic | ||
32 | 4 | |||
33 | 5 | Take taxonomy (from dbs), phyml, users knowledge (encoded as subs file) and information on synonyms (from dbs) | ||
34 | 6 | to create a master subs file that takes the dat to species level | ||
35 | 7 | |||
36 | 8 | User needs to be able to edit taxonomy - CSV file | ||
37 | 9 | |||
38 | 10 | User needs to choose database source - preferred source. | ||
39 | 11 | |||
40 | 12 | |||
41 | 13 | Taxonomic name checker: | ||
42 | 14 | |||
43 | 15 | - use database to get synonyms and possible mispellings | ||
44 | 16 | - Gui is a 2 column table with green, yellow, red. User filles in red (or removes it), green is fine. Yellow - drop down list with alternatives. | ||
45 | 17 | - Use this to generate a two column CSV file | ||
46 | 18 | - On CLI, generate a three column CSV. Original name, new name (or blank for unknown) and a list of possibles. Warn user they *must* fill in the second column or remove the row or the taxa will be deleted. | ||
47 | 19 | |||
48 | 20 | For colloqual names, user adds to column 1 of taxonomy csv and then adds the latin name in the approriate column of the database. The subs can then generate the species list. | ||
49 | 21 | |||
50 | 22 | Use these two csv files to generate a subs file, including replacing higher taxa and genera to create a "to species" substtution (can also output this file for later) | ||
51 | 23 | |||
52 | 24 | Generating data to any taxonomic level can happen later - need to check each species is accounted for in the taxonomy, with correct levels - may need another parse of the taxonomy csv | ||
53 | 25 | |||
54 | 26 | |||
55 | 27 | Add data -> paraphyletic taxa -> taxonomy checker -> sub synonyms -> taxonomy generator -> create species level dataset | ||
56 | 28 | |||
57 | 29 | New functions: | ||
58 | 30 | - taxonomic name checker (this might take a while when online for large dataset) - note that this should be a one for one substitution - seperate function so we can check this? | ||
59 | 31 | - Pull in taxonomy generator | ||
60 | 32 | - Add csv file to schema | ||
61 | 33 | - amaend manual with workflow | ||
62 | 34 | - warning on multiple subs in data in manual | ||
63 | 35 | - generate species level subsfile from taxonomy | ||
64 | 36 | - generate specified taxonomic level data | ||
65 | 37 | |||
66 | 38 | |||
67 | 0 | 39 | ||
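The notes above sketch the name-checker output: a CSV keyed on the original taxon name, followed by a semicolon-separated list of candidate names and a status colour (green, yellow, amber or red). Below is a minimal, illustrative parser for that layout, modelled on the load_equivalents helper further down this diff; the file name is a placeholder, not part of the branch.

import csv

def load_name_checker_csv(path):
    # Each row: original_name, synonym1;synonym2;..., status
    # where status is one of 'green', 'yellow', 'amber' or 'red'.
    equivalents = {}
    with open(path, 'r') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader)  # skip the header row
        for row in reader:
            equivalents[row[0]] = [row[1].split(';'), row[2]]
    return equivalents

if __name__ == '__main__':
    # 'taxonomy_checker.csv' is a hypothetical name for the checker output.
    equivs = load_name_checker_csv('taxonomy_checker.csv')
    for name, (synonyms, status) in sorted(equivs.items()):
        if status == 'red':
            # As the notes warn: unresolved (red) taxa are dropped unless the
            # user supplies a replacement or removes the row.
            print(name + ' is unresolved and will be dropped unless fixed')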
68 | === modified file 'stk/bzr_version.py' | |||
69 | --- stk/bzr_version.py 2017-01-11 17:42:56 +0000 | |||
70 | +++ stk/bzr_version.py 2017-01-12 09:27:31 +0000 | |||
71 | @@ -4,12 +4,12 @@ | |||
72 | 4 | So don't edit it. :) | 4 | So don't edit it. :) |
73 | 5 | """ | 5 | """ |
74 | 6 | 6 | ||
77 | 7 | version_info = {'branch_nick': u'supertree-toolkit', | 7 | version_info = {'branch_nick': u'sub_in_subfile', |
78 | 8 | 'build_date': '2017-01-11 17:42:27 +0000', | 8 | 'build_date': '2017-01-11 17:48:33 +0000', |
79 | 9 | 'clean': None, | 9 | 'clean': None, |
83 | 10 | 'date': '2017-01-11 17:39:43 +0000', | 10 | 'date': '2017-01-11 17:48:18 +0000', |
84 | 11 | 'revision_id': 'jon.hill@imperial.ac.uk-20170111173943-88so1icr33su3afo', | 11 | 'revision_id': 'jon.hill@imperial.ac.uk-20170111174818-9q8a9octvnawruuw', |
85 | 12 | 'revno': '279'} | 12 | 'revno': '317'} |
86 | 13 | 13 | ||
87 | 14 | revisions = {} | 14 | revisions = {} |
88 | 15 | 15 | ||
89 | 16 | 16 | ||
90 | === modified file 'stk/p4/NexusToken.py' | |||
91 | --- stk/p4/NexusToken.py 2012-01-11 08:57:43 +0000 | |||
92 | +++ stk/p4/NexusToken.py 2017-01-12 09:27:31 +0000 | |||
93 | @@ -44,6 +44,7 @@ | |||
94 | 44 | gm = ["safeNextTok(), called from %s" % caller] | 44 | gm = ["safeNextTok(), called from %s" % caller] |
95 | 45 | else: | 45 | else: |
96 | 46 | gm = ["safeNextTok()"] | 46 | gm = ["safeNextTok()"] |
97 | 47 | print flob | ||
98 | 47 | gm.append("Premature Death.") | 48 | gm.append("Premature Death.") |
99 | 48 | gm.append("Ran out of understandable things to read in nexus file.") | 49 | gm.append("Ran out of understandable things to read in nexus file.") |
100 | 49 | raise Glitch, gm | 50 | raise Glitch, gm |
101 | 50 | 51 | ||
102 | === modified file 'stk/p4/NexusToken2.py' | |||
103 | --- stk/p4/NexusToken2.py 2012-01-11 08:57:43 +0000 | |||
104 | +++ stk/p4/NexusToken2.py 2017-01-12 09:27:31 +0000 | |||
105 | @@ -88,7 +88,7 @@ | |||
106 | 88 | else: | 88 | else: |
107 | 89 | gm = ["safeNextTok()"] | 89 | gm = ["safeNextTok()"] |
108 | 90 | gm.append("Premature Death.") | 90 | gm.append("Premature Death.") |
110 | 91 | gm.append("Ran out of understandable things to read in nexus file.") | 91 | gm.append("Ran out of understandable things to read in nexus file." + str(flob)) |
111 | 92 | raise Glitch, gm | 92 | raise Glitch, gm |
112 | 93 | else: | 93 | else: |
113 | 94 | return t | 94 | return t |
114 | 95 | 95 | ||
115 | === modified file 'stk/p4/Tree.py' | |||
116 | --- stk/p4/Tree.py 2013-08-25 09:24:34 +0000 | |||
117 | +++ stk/p4/Tree.py 2017-01-12 09:27:31 +0000 | |||
118 | @@ -996,17 +996,9 @@ | |||
119 | 996 | if not item.name: | 996 | if not item.name: |
120 | 997 | if item == self.root: | 997 | if item == self.root: |
121 | 998 | if var.fixRootedTrees: | 998 | if var.fixRootedTrees: |
127 | 999 | if self.name: | 999 | #print "Fixing tree to work with SuperTree scores" |
123 | 1000 | print "Tree.initFinish() tree '%s'" % self.name | ||
124 | 1001 | else: | ||
125 | 1002 | print 'Tree.initFinish()' | ||
126 | 1003 | print "Fixing tree to work with SuperTree scores" | ||
128 | 1004 | self.removeRoot() | 1000 | self.removeRoot() |
129 | 1005 | elif var.warnAboutTerminalRootWithNoName: | 1001 | elif var.warnAboutTerminalRootWithNoName: |
130 | 1006 | if self.name: | ||
131 | 1007 | print "Tree.initFinish() tree '%s'" % self.name | ||
132 | 1008 | else: | ||
133 | 1009 | print 'Tree.initFinish()' | ||
134 | 1010 | print ' Non-fatal warning: the root is terminal, but has no name.' | 1002 | print ' Non-fatal warning: the root is terminal, but has no name.' |
135 | 1011 | print ' This may be what you wanted. Or not?' | 1003 | print ' This may be what you wanted. Or not?' |
136 | 1012 | print ' (To get rid of this warning, turn off var.warnAboutTerminalRootWithNoName)' | 1004 | print ' (To get rid of this warning, turn off var.warnAboutTerminalRootWithNoName)' |
137 | 1013 | 1005 | ||
138 | === modified file 'stk/p4/Tree_muck.py' | |||
139 | --- stk/p4/Tree_muck.py 2015-02-19 14:47:06 +0000 | |||
140 | +++ stk/p4/Tree_muck.py 2017-01-12 09:27:31 +0000 | |||
141 | @@ -769,6 +769,7 @@ | |||
142 | 769 | else: | 769 | else: |
143 | 770 | gm.append("The 2 specified nodes should have a parent-child relationship") | 770 | gm.append("The 2 specified nodes should have a parent-child relationship") |
144 | 771 | raise Glitch, gm | 771 | raise Glitch, gm |
145 | 772 | |||
146 | 772 | if var.usePfAndNumpy: | 773 | if var.usePfAndNumpy: |
147 | 773 | self.deleteCStuff() | 774 | self.deleteCStuff() |
148 | 774 | 775 | ||
149 | @@ -1629,7 +1630,7 @@ | |||
150 | 1629 | 1630 | ||
151 | 1630 | 1631 | ||
152 | 1631 | 1632 | ||
154 | 1632 | def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None): | 1633 | def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None, ignoreRootAssert=False): |
155 | 1633 | """Add a subtree to a tree. | 1634 | """Add a subtree to a tree. |
156 | 1634 | 1635 | ||
157 | 1635 | The nodes from theSubTree are added to self.nodes, and theSubTree | 1636 | The nodes from theSubTree are added to self.nodes, and theSubTree |
158 | @@ -1666,7 +1667,8 @@ | |||
159 | 1666 | 1667 | ||
160 | 1667 | assert selfNode in self.nodes | 1668 | assert selfNode in self.nodes |
161 | 1668 | assert selfNode.parent | 1669 | assert selfNode.parent |
163 | 1669 | assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick | 1670 | if not ignoreRootAssert: |
164 | 1671 | assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick | ||
165 | 1670 | if not subTreeTaxNames: | 1672 | if not subTreeTaxNames: |
166 | 1671 | subTreeTaxNames = [n.name for n in theSubTree.iterLeavesNoRoot()] | 1673 | subTreeTaxNames = [n.name for n in theSubTree.iterLeavesNoRoot()] |
167 | 1672 | 1674 | ||
168 | 1673 | 1675 | ||
169 | === removed file 'stk/scripts/check_nomenclature.py' | |||
170 | --- stk/scripts/check_nomenclature.py 2016-07-14 10:12:17 +0000 | |||
171 | +++ stk/scripts/check_nomenclature.py 1970-01-01 00:00:00 +0000 | |||
172 | @@ -1,224 +0,0 @@ | |||
173 | 1 | #!/usr/bin/env python | ||
174 | 2 | # | ||
175 | 3 | # Derived from the Supertree Toolkit. Software for managing and manipulating sources | ||
176 | 4 | # trees ready for supretree construction. | ||
177 | 5 | # Copyright (C) 2015, Jon Hill, Katie Davis | ||
178 | 6 | # | ||
179 | 7 | # This program is free software: you can redistribute it and/or modify | ||
180 | 8 | # it under the terms of the GNU General Public License as published by | ||
181 | 9 | # the Free Software Foundation, either version 3 of the License, or | ||
182 | 10 | # (at your option) any later version. | ||
183 | 11 | # | ||
184 | 12 | # This program is distributed in the hope that it will be useful, | ||
185 | 13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
186 | 14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
187 | 15 | # GNU General Public License for more details. | ||
188 | 16 | # | ||
189 | 17 | # You should have received a copy of the GNU General Public License | ||
190 | 18 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
191 | 19 | # | ||
192 | 20 | # Jon Hill. jon.hill@york.ac.uk. | ||
193 | 21 | # | ||
194 | 22 | # | ||
195 | 23 | # This is an enitrely self-contained script that does not require the STK to be installed. | ||
196 | 24 | |||
197 | 25 | import urllib2 | ||
198 | 26 | from urllib import quote_plus | ||
199 | 27 | import simplejson as json | ||
200 | 28 | import argparse | ||
201 | 29 | import os | ||
202 | 30 | import sys | ||
203 | 31 | import csv | ||
204 | 32 | |||
205 | 33 | def main(): | ||
206 | 34 | |||
207 | 35 | # do stuff | ||
208 | 36 | parser = argparse.ArgumentParser( | ||
209 | 37 | prog="Check nomenclature", | ||
210 | 38 | description="Check nomenclature from a tree file or list against valid names derived from EOL", | ||
211 | 39 | ) | ||
212 | 40 | parser.add_argument( | ||
213 | 41 | '-v', | ||
214 | 42 | '--verbose', | ||
215 | 43 | action='store_true', | ||
216 | 44 | help="Verbose output: mainly progress reports.", | ||
217 | 45 | default=False | ||
218 | 46 | ) | ||
219 | 47 | parser.add_argument( | ||
220 | 48 | '--existing', | ||
221 | 49 | help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name." | ||
222 | 50 | ) | ||
223 | 51 | parser.add_argument( | ||
224 | 52 | 'input_file', | ||
225 | 53 | metavar='input_file', | ||
226 | 54 | nargs=1, | ||
227 | 55 | help="Your input taxa list" | ||
228 | 56 | ) | ||
229 | 57 | parser.add_argument( | ||
230 | 58 | 'output_file', | ||
231 | 59 | metavar='output_file', | ||
232 | 60 | nargs=1, | ||
233 | 61 | help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)." | ||
234 | 62 | ) | ||
235 | 63 | |||
236 | 64 | args = parser.parse_args() | ||
237 | 65 | verbose = args.verbose | ||
238 | 66 | input_file = args.input_file[0] | ||
239 | 67 | output_file = args.output_file[0] | ||
240 | 68 | existing_data = args.existing | ||
241 | 69 | |||
242 | 70 | if (not existing_data == None): | ||
243 | 71 | exiting_data = load_equivalents(existing_data) | ||
244 | 72 | else: | ||
245 | 73 | existing_data = None | ||
246 | 74 | |||
247 | 75 | with open(input_file,'r') as f: | ||
248 | 76 | lines = f.read().splitlines() | ||
249 | 77 | equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose) | ||
250 | 78 | |||
251 | 79 | |||
252 | 80 | f = open(output_file,"w") | ||
253 | 81 | for taxon in sorted(equivs.keys()): | ||
254 | 82 | f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n") | ||
255 | 83 | f.close() | ||
256 | 84 | |||
257 | 85 | return | ||
258 | 86 | |||
259 | 87 | |||
260 | 88 | def taxonomic_checker_list(name_list,existing_data=None,verbose=False): | ||
261 | 89 | """ For each name in the database generate a database of the original name, | ||
262 | 90 | possible synonyms and if the taxon is not know, signal that. We do this by | ||
263 | 91 | using the EoL API to grab synonyms of each taxon. """ | ||
264 | 92 | |||
265 | 93 | |||
266 | 94 | if existing_data == None: | ||
267 | 95 | equivalents = {} | ||
268 | 96 | else: | ||
269 | 97 | equivalents = existing_data | ||
270 | 98 | |||
271 | 99 | # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result? | ||
272 | 100 | # if not, is there another API function to do this? | ||
273 | 101 | # search for the taxon and grab the name - if you search for a recognised synonym on EoL then | ||
274 | 102 | # you get the original ('correct') name - shorten this to two words and you're done. | ||
275 | 103 | for t in name_list: | ||
276 | 104 | # make sure t has no spaces. | ||
277 | 105 | t = t.replace(" ","_") | ||
278 | 106 | if t in equivalents: | ||
279 | 107 | continue | ||
280 | 108 | taxon = t.replace("_"," ") | ||
281 | 109 | if (verbose): | ||
282 | 110 | print "Looking up ", taxon | ||
283 | 111 | # get the data from EOL on taxon | ||
284 | 112 | taxonq = quote_plus(taxon) | ||
285 | 113 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
286 | 114 | req = urllib2.Request(URL) | ||
287 | 115 | opener = urllib2.build_opener() | ||
288 | 116 | f = opener.open(req) | ||
289 | 117 | data = json.load(f) | ||
290 | 118 | # check if there's some data | ||
291 | 119 | if len(data['results']) == 0: | ||
292 | 120 | equivalents[t] = [[t],'red'] | ||
293 | 121 | continue | ||
294 | 122 | amber = False | ||
295 | 123 | if len(data['results']) > 1: | ||
296 | 124 | # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this | ||
297 | 125 | # for automatic processing we'll just take the first one though | ||
298 | 126 | # colour is amber in this case | ||
299 | 127 | amber = True | ||
300 | 128 | ID = str(data['results'][0]['id']) # take first hit | ||
301 | 129 | URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0" | ||
302 | 130 | req = urllib2.Request(URL) | ||
303 | 131 | opener = urllib2.build_opener() | ||
304 | 132 | |||
305 | 133 | try: | ||
306 | 134 | f = opener.open(req) | ||
307 | 135 | except urllib2.HTTPError: | ||
308 | 136 | equivalents[t] = [[t],'red'] | ||
309 | 137 | continue | ||
310 | 138 | data = json.load(f) | ||
311 | 139 | if len(data['scientificName']) == 0: | ||
312 | 140 | # not found a scientific name, so set as red | ||
313 | 141 | equivalents[t] = [[t],'red'] | ||
314 | 142 | continue | ||
315 | 143 | correct_name = data['scientificName'].encode("ascii","ignore") | ||
316 | 144 | # we only want the first two bits of the name, not the original author and year if any | ||
317 | 145 | temp_name = correct_name.split(' ') | ||
318 | 146 | if (len(temp_name) > 2): | ||
319 | 147 | correct_name = ' '.join(temp_name[0:2]) | ||
320 | 148 | correct_name = correct_name.replace(' ','_') | ||
321 | 149 | print correct_name, t | ||
322 | 150 | |||
323 | 151 | # build up the output dictionary - original name is key, synonyms/missing is value | ||
324 | 152 | if (correct_name == t or correct_name == taxon): | ||
325 | 153 | # if the original matches the 'correct', then it's green | ||
326 | 154 | equivalents[t] = [[t], 'green'] | ||
327 | 155 | else: | ||
328 | 156 | # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the | ||
329 | 157 | # 'correct' taxon at the top | ||
330 | 158 | eol_synonyms = data['synonyms'] | ||
331 | 159 | synonyms = [] | ||
332 | 160 | for s in eol_synonyms: | ||
333 | 161 | ts = s['synonym'].encode("ascii","ignore") | ||
334 | 162 | temp_syn = ts.split(' ') | ||
335 | 163 | if (len(temp_syn) > 2): | ||
336 | 164 | temp_syn = ' '.join(temp_syn[0:2]) | ||
337 | 165 | ts = temp_syn | ||
338 | 166 | if (s['relationship'] == "synonym"): | ||
339 | 167 | ts = ts.replace(" ","_") | ||
340 | 168 | synonyms.append(ts) | ||
341 | 169 | synonyms = _uniquify(synonyms) | ||
342 | 170 | # we need to put the correct name at the top of the list now | ||
343 | 171 | if (correct_name in synonyms): | ||
344 | 172 | synonyms.insert(0, synonyms.pop(synonyms.index(correct_name))) | ||
345 | 173 | elif len(synonyms) == 0: | ||
346 | 174 | synonyms.append(correct_name) | ||
347 | 175 | else: | ||
348 | 176 | synonyms.insert(0,correct_name) | ||
349 | 177 | |||
350 | 178 | if (amber): | ||
351 | 179 | equivalents[t] = [synonyms,'amber'] | ||
352 | 180 | else: | ||
353 | 181 | equivalents[t] = [synonyms,'yellow'] | ||
354 | 182 | # if our search was empty, then it's red - see above | ||
355 | 183 | |||
356 | 184 | # up to the calling funciton to do something sensible with this | ||
357 | 185 | # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red. | ||
358 | 186 | # Amber means we found synonyms and multilpe hits. User def needs to sort these! | ||
359 | 187 | |||
360 | 188 | return equivalents | ||
361 | 189 | |||
362 | 190 | def load_equivalents(equiv_csv): | ||
363 | 191 | """Load equivalents data from a csv and convert to a equivalents Dict. | ||
364 | 192 | Structure is key, with a list that is array of synonyms, followed by status ('green', | ||
365 | 193 | 'yellow', 'amber', or 'red'). | ||
366 | 194 | |||
367 | 195 | """ | ||
368 | 196 | |||
369 | 197 | import csv | ||
370 | 198 | |||
371 | 199 | equivalents = {} | ||
372 | 200 | |||
373 | 201 | with open(equiv_csv, 'rU') as csvfile: | ||
374 | 202 | equiv_reader = csv.reader(csvfile, delimiter=',') | ||
375 | 203 | equiv_reader.next() # skip header | ||
376 | 204 | for row in equiv_reader: | ||
377 | 205 | i = 1 | ||
378 | 206 | equivalents[row[0]] = [row[1].split(';'),row[2]] | ||
379 | 207 | |||
380 | 208 | return equivalents | ||
381 | 209 | |||
382 | 210 | def _uniquify(l): | ||
383 | 211 | """ | ||
384 | 212 | Make a list, l, contain only unique data | ||
385 | 213 | """ | ||
386 | 214 | keys = {} | ||
387 | 215 | for e in l: | ||
388 | 216 | keys[e] = 1 | ||
389 | 217 | |||
390 | 218 | return keys.keys() | ||
391 | 219 | |||
392 | 220 | if __name__ == "__main__": | ||
393 | 221 | main() | ||
394 | 222 | |||
395 | 223 | |||
396 | 224 | |||
397 | 225 | 0 | ||
398 | === added file 'stk/scripts/check_nomenclature.py.moved' | |||
399 | --- stk/scripts/check_nomenclature.py.moved 1970-01-01 00:00:00 +0000 | |||
400 | +++ stk/scripts/check_nomenclature.py.moved 2017-01-12 09:27:31 +0000 | |||
401 | @@ -0,0 +1,224 @@ | |||
402 | 1 | #!/usr/bin/env python | ||
403 | 2 | # | ||
404 | 3 | # Derived from the Supertree Toolkit. Software for managing and manipulating sources | ||
405 | 4 | # trees ready for supretree construction. | ||
406 | 5 | # Copyright (C) 2015, Jon Hill, Katie Davis | ||
407 | 6 | # | ||
408 | 7 | # This program is free software: you can redistribute it and/or modify | ||
409 | 8 | # it under the terms of the GNU General Public License as published by | ||
410 | 9 | # the Free Software Foundation, either version 3 of the License, or | ||
411 | 10 | # (at your option) any later version. | ||
412 | 11 | # | ||
413 | 12 | # This program is distributed in the hope that it will be useful, | ||
414 | 13 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
415 | 14 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
416 | 15 | # GNU General Public License for more details. | ||
417 | 16 | # | ||
418 | 17 | # You should have received a copy of the GNU General Public License | ||
419 | 18 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
420 | 19 | # | ||
421 | 20 | # Jon Hill. jon.hill@york.ac.uk. | ||
422 | 21 | # | ||
423 | 22 | # | ||
424 | 23 | # This is an enitrely self-contained script that does not require the STK to be installed. | ||
425 | 24 | |||
426 | 25 | import urllib2 | ||
427 | 26 | from urllib import quote_plus | ||
428 | 27 | import simplejson as json | ||
429 | 28 | import argparse | ||
430 | 29 | import os | ||
431 | 30 | import sys | ||
432 | 31 | import csv | ||
433 | 32 | |||
434 | 33 | def main(): | ||
435 | 34 | |||
436 | 35 | # do stuff | ||
437 | 36 | parser = argparse.ArgumentParser( | ||
438 | 37 | prog="Check nomenclature", | ||
439 | 38 | description="Check nomenclature from a tree file or list against valid names derived from EOL", | ||
440 | 39 | ) | ||
441 | 40 | parser.add_argument( | ||
442 | 41 | '-v', | ||
443 | 42 | '--verbose', | ||
444 | 43 | action='store_true', | ||
445 | 44 | help="Verbose output: mainly progress reports.", | ||
446 | 45 | default=False | ||
447 | 46 | ) | ||
448 | 47 | parser.add_argument( | ||
449 | 48 | '--existing', | ||
450 | 49 | help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name." | ||
451 | 50 | ) | ||
452 | 51 | parser.add_argument( | ||
453 | 52 | 'input_file', | ||
454 | 53 | metavar='input_file', | ||
455 | 54 | nargs=1, | ||
456 | 55 | help="Your input taxa list" | ||
457 | 56 | ) | ||
458 | 57 | parser.add_argument( | ||
459 | 58 | 'output_file', | ||
460 | 59 | metavar='output_file', | ||
461 | 60 | nargs=1, | ||
462 | 61 | help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)." | ||
463 | 62 | ) | ||
464 | 63 | |||
465 | 64 | args = parser.parse_args() | ||
466 | 65 | verbose = args.verbose | ||
467 | 66 | input_file = args.input_file[0] | ||
468 | 67 | output_file = args.output_file[0] | ||
469 | 68 | existing_data = args.existing | ||
470 | 69 | |||
471 | 70 | if (not existing_data == None): | ||
472 | 71 | exiting_data = load_equivalents(existing_data) | ||
473 | 72 | else: | ||
474 | 73 | existing_data = None | ||
475 | 74 | |||
476 | 75 | with open(input_file,'r') as f: | ||
477 | 76 | lines = f.read().splitlines() | ||
478 | 77 | equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose) | ||
479 | 78 | |||
480 | 79 | |||
481 | 80 | f = open(output_file,"w") | ||
482 | 81 | for taxon in sorted(equivs.keys()): | ||
483 | 82 | f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n") | ||
484 | 83 | f.close() | ||
485 | 84 | |||
486 | 85 | return | ||
487 | 86 | |||
488 | 87 | |||
489 | 88 | def taxonomic_checker_list(name_list,existing_data=None,verbose=False): | ||
490 | 89 | """ For each name in the database generate a database of the original name, | ||
491 | 90 | possible synonyms and if the taxon is not know, signal that. We do this by | ||
492 | 91 | using the EoL API to grab synonyms of each taxon. """ | ||
493 | 92 | |||
494 | 93 | |||
495 | 94 | if existing_data == None: | ||
496 | 95 | equivalents = {} | ||
497 | 96 | else: | ||
498 | 97 | equivalents = existing_data | ||
499 | 98 | |||
500 | 99 | # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result? | ||
501 | 100 | # if not, is there another API function to do this? | ||
502 | 101 | # search for the taxon and grab the name - if you search for a recognised synonym on EoL then | ||
503 | 102 | # you get the original ('correct') name - shorten this to two words and you're done. | ||
504 | 103 | for t in name_list: | ||
505 | 104 | # make sure t has no spaces. | ||
506 | 105 | t = t.replace(" ","_") | ||
507 | 106 | if t in equivalents: | ||
508 | 107 | continue | ||
509 | 108 | taxon = t.replace("_"," ") | ||
510 | 109 | if (verbose): | ||
511 | 110 | print "Looking up ", taxon | ||
512 | 111 | # get the data from EOL on taxon | ||
513 | 112 | taxonq = quote_plus(taxon) | ||
514 | 113 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
515 | 114 | req = urllib2.Request(URL) | ||
516 | 115 | opener = urllib2.build_opener() | ||
517 | 116 | f = opener.open(req) | ||
518 | 117 | data = json.load(f) | ||
519 | 118 | # check if there's some data | ||
520 | 119 | if len(data['results']) == 0: | ||
521 | 120 | equivalents[t] = [[t],'red'] | ||
522 | 121 | continue | ||
523 | 122 | amber = False | ||
524 | 123 | if len(data['results']) > 1: | ||
525 | 124 | # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this | ||
526 | 125 | # for automatic processing we'll just take the first one though | ||
527 | 126 | # colour is amber in this case | ||
528 | 127 | amber = True | ||
529 | 128 | ID = str(data['results'][0]['id']) # take first hit | ||
530 | 129 | URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0" | ||
531 | 130 | req = urllib2.Request(URL) | ||
532 | 131 | opener = urllib2.build_opener() | ||
533 | 132 | |||
534 | 133 | try: | ||
535 | 134 | f = opener.open(req) | ||
536 | 135 | except urllib2.HTTPError: | ||
537 | 136 | equivalents[t] = [[t],'red'] | ||
538 | 137 | continue | ||
539 | 138 | data = json.load(f) | ||
540 | 139 | if len(data['scientificName']) == 0: | ||
541 | 140 | # not found a scientific name, so set as red | ||
542 | 141 | equivalents[t] = [[t],'red'] | ||
543 | 142 | continue | ||
544 | 143 | correct_name = data['scientificName'].encode("ascii","ignore") | ||
545 | 144 | # we only want the first two bits of the name, not the original author and year if any | ||
546 | 145 | temp_name = correct_name.split(' ') | ||
547 | 146 | if (len(temp_name) > 2): | ||
548 | 147 | correct_name = ' '.join(temp_name[0:2]) | ||
549 | 148 | correct_name = correct_name.replace(' ','_') | ||
550 | 149 | print correct_name, t | ||
551 | 150 | |||
552 | 151 | # build up the output dictionary - original name is key, synonyms/missing is value | ||
553 | 152 | if (correct_name == t or correct_name == taxon): | ||
554 | 153 | # if the original matches the 'correct', then it's green | ||
555 | 154 | equivalents[t] = [[t], 'green'] | ||
556 | 155 | else: | ||
557 | 156 | # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the | ||
558 | 157 | # 'correct' taxon at the top | ||
559 | 158 | eol_synonyms = data['synonyms'] | ||
560 | 159 | synonyms = [] | ||
561 | 160 | for s in eol_synonyms: | ||
562 | 161 | ts = s['synonym'].encode("ascii","ignore") | ||
563 | 162 | temp_syn = ts.split(' ') | ||
564 | 163 | if (len(temp_syn) > 2): | ||
565 | 164 | temp_syn = ' '.join(temp_syn[0:2]) | ||
566 | 165 | ts = temp_syn | ||
567 | 166 | if (s['relationship'] == "synonym"): | ||
568 | 167 | ts = ts.replace(" ","_") | ||
569 | 168 | synonyms.append(ts) | ||
570 | 169 | synonyms = _uniquify(synonyms) | ||
571 | 170 | # we need to put the correct name at the top of the list now | ||
572 | 171 | if (correct_name in synonyms): | ||
573 | 172 | synonyms.insert(0, synonyms.pop(synonyms.index(correct_name))) | ||
574 | 173 | elif len(synonyms) == 0: | ||
575 | 174 | synonyms.append(correct_name) | ||
576 | 175 | else: | ||
577 | 176 | synonyms.insert(0,correct_name) | ||
578 | 177 | |||
579 | 178 | if (amber): | ||
580 | 179 | equivalents[t] = [synonyms,'amber'] | ||
581 | 180 | else: | ||
582 | 181 | equivalents[t] = [synonyms,'yellow'] | ||
583 | 182 | # if our search was empty, then it's red - see above | ||
584 | 183 | |||
585 | 184 | # up to the calling funciton to do something sensible with this | ||
586 | 185 | # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red. | ||
587 | 186 | # Amber means we found synonyms and multilpe hits. User def needs to sort these! | ||
588 | 187 | |||
589 | 188 | return equivalents | ||
590 | 189 | |||
591 | 190 | def load_equivalents(equiv_csv): | ||
592 | 191 | """Load equivalents data from a csv and convert to a equivalents Dict. | ||
593 | 192 | Structure is key, with a list that is array of synonyms, followed by status ('green', | ||
594 | 193 | 'yellow', 'amber', or 'red'). | ||
595 | 194 | |||
596 | 195 | """ | ||
597 | 196 | |||
598 | 197 | import csv | ||
599 | 198 | |||
600 | 199 | equivalents = {} | ||
601 | 200 | |||
602 | 201 | with open(equiv_csv, 'rU') as csvfile: | ||
603 | 202 | equiv_reader = csv.reader(csvfile, delimiter=',') | ||
604 | 203 | equiv_reader.next() # skip header | ||
605 | 204 | for row in equiv_reader: | ||
606 | 205 | i = 1 | ||
607 | 206 | equivalents[row[0]] = [row[1].split(';'),row[2]] | ||
608 | 207 | |||
609 | 208 | return equivalents | ||
610 | 209 | |||
611 | 210 | def _uniquify(l): | ||
612 | 211 | """ | ||
613 | 212 | Make a list, l, contain only unique data | ||
614 | 213 | """ | ||
615 | 214 | keys = {} | ||
616 | 215 | for e in l: | ||
617 | 216 | keys[e] = 1 | ||
618 | 217 | |||
619 | 218 | return keys.keys() | ||
620 | 219 | |||
621 | 220 | if __name__ == "__main__": | ||
622 | 221 | main() | ||
623 | 222 | |||
624 | 223 | |||
625 | 224 | |||
626 | 0 | 225 | ||
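For orientation, the self-contained script preserved above can also be driven from Python rather than the command line. The following is only a rough usage sketch: it assumes the module is importable as check_nomenclature, the taxon names and output path are illustrative, and each returned value is a [synonyms, status] pair as documented in taxonomic_checker_list.

# Illustrative only: requires the script on the Python path and network
# access to the EoL API.
from check_nomenclature import taxonomic_checker_list

names = ['Gorilla gorilla', 'Pan troglodytes', 'Pongo pygmaeus']
equivs = taxonomic_checker_list(names, existing_data=None, verbose=True)

# Write the three-column CSV format described above: name, synonyms, status.
with open('equivalents.csv', 'w') as f:
    for taxon in sorted(equivs):
        synonyms, status = equivs[taxon]
        f.write(taxon + ',' + ';'.join(synonyms) + ',' + status + '\n')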
627 | === modified file 'stk/scripts/create_colours_itol.py' | |||
628 | --- stk/scripts/create_colours_itol.py 2014-12-09 10:58:48 +0000 | |||
629 | +++ stk/scripts/create_colours_itol.py 2017-01-12 09:27:31 +0000 | |||
630 | @@ -88,17 +88,8 @@ | |||
631 | 88 | saturation=0.25 | 88 | saturation=0.25 |
632 | 89 | value=0.8 | 89 | value=0.8 |
633 | 90 | 90 | ||
645 | 91 | index = 3 # family | 91 | index = stk.taxonomy_levels.index(level.lower())+1 |
646 | 92 | if (level == "Superfamily"): | 92 | print index |
636 | 93 | index = 4 | ||
637 | 94 | elif (level == "Infraorder"): | ||
638 | 95 | index = 5 | ||
639 | 96 | elif (level == "Suborder"): | ||
640 | 97 | index = 6 | ||
641 | 98 | elif (level == "Order"): | ||
642 | 99 | index = 7 | ||
643 | 100 | elif (level == "Genus"): | ||
644 | 101 | index = 2 | ||
647 | 102 | 93 | ||
648 | 103 | if (tree): | 94 | if (tree): |
649 | 104 | tree_data = stk.import_tree(input_file) | 95 | tree_data = stk.import_tree(input_file) |
650 | 105 | 96 | ||
651 | === modified file 'stk/scripts/create_taxonomy.py' | |||
652 | --- stk/scripts/create_taxonomy.py 2014-03-13 18:45:05 +0000 | |||
653 | +++ stk/scripts/create_taxonomy.py 2017-01-12 09:27:31 +0000 | |||
654 | @@ -16,6 +16,8 @@ | |||
655 | 16 | import supertree_toolkit as stk | 16 | import supertree_toolkit as stk |
656 | 17 | import csv | 17 | import csv |
657 | 18 | 18 | ||
658 | 19 | taxonomy_levels = stk.taxonomy_levels | ||
659 | 20 | |||
660 | 19 | def main(): | 21 | def main(): |
661 | 20 | 22 | ||
662 | 21 | # do stuff | 23 | # do stuff |
663 | @@ -66,13 +68,6 @@ | |||
664 | 66 | f.close() | 68 | f.close() |
665 | 67 | 69 | ||
666 | 68 | taxonomy = {} | 70 | taxonomy = {} |
667 | 69 | # What we get from EOL | ||
668 | 70 | current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom'] | ||
669 | 71 | # And the extra ones from ITIS | ||
670 | 72 | extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom'] | ||
671 | 73 | # all of them in order | ||
672 | 74 | taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom'] | ||
673 | 75 | |||
674 | 76 | 71 | ||
675 | 77 | for taxon in taxa: | 72 | for taxon in taxa: |
676 | 78 | taxon = taxon.replace("_"," ") | 73 | taxon = taxon.replace("_"," ") |
677 | @@ -180,99 +175,8 @@ | |||
678 | 180 | continue | 175 | continue |
679 | 181 | 176 | ||
680 | 182 | 177 | ||
774 | 183 | # Now create the CSV output | 178 | stk.save_taxonomy(taxonomy, output_file) |
775 | 184 | with open(output_file, 'w') as f: | 179 | |
683 | 185 | writer = csv.writer(f) | ||
684 | 186 | writer.writerow(taxonomy_levels) | ||
685 | 187 | for t in taxonomy: | ||
686 | 188 | species = t | ||
687 | 189 | try: | ||
688 | 190 | genus = taxonomy[t]['genus'] | ||
689 | 191 | except KeyError: | ||
690 | 192 | genus = "-" | ||
691 | 193 | try: | ||
692 | 194 | family = taxonomy[t]['family'] | ||
693 | 195 | except KeyError: | ||
694 | 196 | family = "-" | ||
695 | 197 | try: | ||
696 | 198 | superfamily = taxonomy[t]['superfamily'] | ||
697 | 199 | except KeyError: | ||
698 | 200 | superfamily = "-" | ||
699 | 201 | try: | ||
700 | 202 | infraorder = taxonomy[t]['infraorder'] | ||
701 | 203 | except KeyError: | ||
702 | 204 | infraorder = "-" | ||
703 | 205 | try: | ||
704 | 206 | suborder = taxonomy[t]['suborder'] | ||
705 | 207 | except KeyError: | ||
706 | 208 | suborder = "-" | ||
707 | 209 | try: | ||
708 | 210 | order = taxonomy[t]['order'] | ||
709 | 211 | except KeyError: | ||
710 | 212 | order = "-" | ||
711 | 213 | try: | ||
712 | 214 | superorder = taxonomy[t]['superorder'] | ||
713 | 215 | except KeyError: | ||
714 | 216 | superorder = "-" | ||
715 | 217 | try: | ||
716 | 218 | subclass = taxonomy[t]['subclass'] | ||
717 | 219 | except KeyError: | ||
718 | 220 | subclass = "-" | ||
719 | 221 | try: | ||
720 | 222 | tclass = taxonomy[t]['class'] | ||
721 | 223 | except KeyError: | ||
722 | 224 | tclass = "-" | ||
723 | 225 | try: | ||
724 | 226 | subphylum = taxonomy[t]['subphylum'] | ||
725 | 227 | except KeyError: | ||
726 | 228 | subphylum = "-" | ||
727 | 229 | try: | ||
728 | 230 | phylum = taxonomy[t]['phylum'] | ||
729 | 231 | except KeyError: | ||
730 | 232 | phylum = "-" | ||
731 | 233 | try: | ||
732 | 234 | superphylum = taxonomy[t]['superphylum'] | ||
733 | 235 | except KeyError: | ||
734 | 236 | superphylum = "-" | ||
735 | 237 | try: | ||
736 | 238 | infrakingdom = taxonomy[t]['infrakingdom'] | ||
737 | 239 | except: | ||
738 | 240 | infrakingdom = "-" | ||
739 | 241 | try: | ||
740 | 242 | subkingdom = taxonomy[t]['subkingdom'] | ||
741 | 243 | except: | ||
742 | 244 | subkingdom = "-" | ||
743 | 245 | try: | ||
744 | 246 | kingdom = taxonomy[t]['kingdom'] | ||
745 | 247 | except KeyError: | ||
746 | 248 | kingdom = "-" | ||
747 | 249 | try: | ||
748 | 250 | provider = taxonomy[t]['provider'] | ||
749 | 251 | except KeyError: | ||
750 | 252 | provider = "-" | ||
751 | 253 | |||
752 | 254 | |||
753 | 255 | this_classification = [ | ||
754 | 256 | species.encode('utf-8'), | ||
755 | 257 | genus.encode('utf-8'), | ||
756 | 258 | family.encode('utf-8'), | ||
757 | 259 | superfamily.encode('utf-8'), | ||
758 | 260 | infraorder.encode('utf-8'), | ||
759 | 261 | suborder.encode('utf-8'), | ||
760 | 262 | order.encode('utf-8'), | ||
761 | 263 | superorder.encode('utf-8'), | ||
762 | 264 | subclass.encode('utf-8'), | ||
763 | 265 | tclass.encode('utf-8'), | ||
764 | 266 | subphylum.encode('utf-8'), | ||
765 | 267 | phylum.encode('utf-8'), | ||
766 | 268 | superphylum.encode('utf-8'), | ||
767 | 269 | infrakingdom.encode('utf-8'), | ||
768 | 270 | subkingdom.encode('utf-8'), | ||
769 | 271 | kingdom.encode('utf-8'), | ||
770 | 272 | provider.encode('utf-8')] | ||
771 | 273 | writer.writerow(this_classification) | ||
772 | 274 | |||
773 | 275 | |||
776 | 276 | def _uniquify(l): | 180 | def _uniquify(l): |
777 | 277 | """ | 181 | """ |
778 | 278 | Make a list, l, contain only unique data | 182 | Make a list, l, contain only unique data |
779 | 279 | 183 | ||
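The change above collapses the long per-rank try/except ladder into a single stk.save_taxonomy call. As a rough illustration of the dictionary that call consumes: it is keyed by species name, with rank names plus 'provider' as inner keys, following the removed code; the species and rank values below are made up, and the sketch does not claim to reproduce save_taxonomy itself.

# Shape of the taxonomy dictionary built by create_taxonomy.py; the values
# are illustrative, not real lookups.
taxonomy = {
    'Panthera leo': {
        'genus': 'Panthera',
        'family': 'Felidae',
        'order': 'Carnivora',
        'class': 'Mammalia',
        'phylum': 'Chordata',
        'kingdom': 'Animalia',
        'provider': 'ITIS',  # which database supplied the classification
    },
}

# Ranks a lookup did not return are simply absent; the removed inline CSV
# writer substituted '-' for them, along these lines:
row = [taxonomy['Panthera leo'].get(level, '-')
       for level in ['genus', 'family', 'superfamily', 'order', 'kingdom']]
# stk.save_taxonomy(taxonomy, output_file) now takes care of writing the CSV.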
780 | === modified file 'stk/scripts/fill_in_with_taxonomy.py' | |||
781 | --- stk/scripts/fill_in_with_taxonomy.py 2016-12-14 16:22:12 +0000 | |||
782 | +++ stk/scripts/fill_in_with_taxonomy.py 2017-01-12 09:27:31 +0000 | |||
783 | @@ -23,21 +23,90 @@ | |||
784 | 23 | from urllib import quote_plus | 23 | from urllib import quote_plus |
785 | 24 | import simplejson as json | 24 | import simplejson as json |
786 | 25 | import argparse | 25 | import argparse |
787 | 26 | import copy | ||
788 | 26 | import os | 27 | import os |
789 | 27 | import sys | 28 | import sys |
790 | 28 | stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir ) | 29 | stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir ) |
791 | 29 | sys.path.insert(0, stk_path) | 30 | sys.path.insert(0, stk_path) |
792 | 30 | import supertree_toolkit as stk | 31 | import supertree_toolkit as stk |
793 | 31 | import csv | 32 | import csv |
803 | 32 | 33 | from ete2 import Tree | |
804 | 33 | # What we get from EOL | 34 | import tempfile |
805 | 34 | current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom'] | 35 | import re |
806 | 35 | # And the extra ones from ITIS | 36 | |
807 | 36 | extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom'] | 37 | taxonomy_levels = stk.taxonomy_levels |
808 | 37 | # all of them in order | 38 | #tlevels = ['species','genus','family','superfamily','suborder','order','class','phylum','kingdom'] |
809 | 38 | taxonomy_levels = ['species','genus','subfamily','family','tribe','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom'] | 39 | tlevels = ['species','genus', 'subfamily', 'family','infraorder','order','class','phylum','kingdom'] |
810 | 39 | 40 | ||
811 | 40 | def get_tree_taxa_taxonomy(taxon,wsdlObjectWoRMS): | 41 | def get_tree_taxa_taxonomy_eol(taxon): |
812 | 42 | |||
813 | 43 | taxonq = quote_plus(taxon) | ||
814 | 44 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
815 | 45 | req = urllib2.Request(URL) | ||
816 | 46 | opener = urllib2.build_opener() | ||
817 | 47 | f = opener.open(req) | ||
818 | 48 | data = json.load(f) | ||
819 | 49 | |||
820 | 50 | if data['results'] == []: | ||
821 | 51 | return {} | ||
822 | 52 | ID = str(data['results'][0]['id']) # take first hit | ||
823 | 53 | # Now look for taxonomies | ||
824 | 54 | URL = "http://eol.org/api/pages/1.0/"+ID+".json" | ||
825 | 55 | req = urllib2.Request(URL) | ||
826 | 56 | opener = urllib2.build_opener() | ||
827 | 57 | f = opener.open(req) | ||
828 | 58 | data = json.load(f) | ||
829 | 59 | if len(data['taxonConcepts']) == 0: | ||
830 | 60 | return {} | ||
831 | 61 | TID = str(data['taxonConcepts'][0]['identifier']) # take first hit | ||
832 | 62 | currentdb = str(data['taxonConcepts'][0]['nameAccordingTo']) | ||
833 | 63 | # loop through and get preferred one if specified | ||
834 | 64 | # now get taxonomy | ||
835 | 65 | for db in data['taxonConcepts']: | ||
836 | 66 | currentdb = db['nameAccordingTo'].lower() | ||
837 | 67 | TID = str(db['identifier']) | ||
838 | 68 | break | ||
839 | 69 | URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json" | ||
840 | 70 | req = urllib2.Request(URL) | ||
841 | 71 | opener = urllib2.build_opener() | ||
842 | 72 | f = opener.open(req) | ||
843 | 73 | data = json.load(f) | ||
844 | 74 | tax_array = {} | ||
845 | 75 | tax_array['provider'] = currentdb | ||
846 | 76 | for a in data['ancestors']: | ||
847 | 77 | try: | ||
848 | 78 | if a.has_key('taxonRank') : | ||
849 | 79 | temp_level = a['taxonRank'].encode("ascii","ignore") | ||
850 | 80 | if (temp_level in taxonomy_levels): | ||
851 | 81 | # note the dump into ASCII | ||
852 | 82 | temp_name = a['scientificName'].encode("ascii","ignore") | ||
853 | 83 | temp_name = temp_name.split(" ") | ||
854 | 84 | if (temp_level == 'species'): | ||
855 | 85 | tax_array[temp_level] = "_".join(temp_name[0:2]) | ||
856 | 86 | |||
857 | 87 | else: | ||
858 | 88 | tax_array[temp_level] = temp_name[0] | ||
859 | 89 | except KeyError as e: | ||
860 | 90 | logging.exception("Key not found: taxonRank") | ||
861 | 91 | continue | ||
862 | 92 | try: | ||
863 | 93 | # add taxonomy in to the taxonomy! | ||
864 | 94 | # some issues here, so let's make sure it's OK | ||
865 | 95 | temp_name = taxon.split(" ") | ||
866 | 96 | if data.has_key('taxonRank') : | ||
867 | 97 | if not data['taxonRank'].lower() == 'species': | ||
868 | 98 | tax_array[data['taxonRank'].lower()] = temp_name[0] | ||
869 | 99 | else: | ||
870 | 100 | tax_array[data['taxonRank'].lower()] = ' '.join(temp_name[0:2]) | ||
871 | 101 | except KeyError as e: | ||
872 | 102 | return tax_array | ||
873 | 103 | |||
874 | 104 | return tax_array | ||
875 | 105 | |||
876 | 106 | def get_tree_taxa_taxonomy_worms(taxon): | ||
877 | 107 | |||
878 | 108 | from SOAPpy import WSDL | ||
879 | 109 | wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1') | ||
880 | 41 | 110 | ||
881 | 42 | taxon_data = wsdlObjectWoRMS.getAphiaRecords(taxon.replace('_',' ')) | 111 | taxon_data = wsdlObjectWoRMS.getAphiaRecords(taxon.replace('_',' ')) |
882 | 43 | if taxon_data == None: | 112 | if taxon_data == None: |
883 | @@ -51,6 +120,8 @@ | |||
884 | 51 | classification = wsdlObjectWoRMS.getAphiaClassificationByID(taxon_id) | 120 | classification = wsdlObjectWoRMS.getAphiaClassificationByID(taxon_id) |
885 | 52 | # construct array | 121 | # construct array |
886 | 53 | tax_array = {} | 122 | tax_array = {} |
887 | 123 | if (classification == ""): | ||
888 | 124 | return {} | ||
889 | 54 | # classification is a nested dictionary, so we need to iterate down it | 125 | # classification is a nested dictionary, so we need to iterate down it |
890 | 55 | current_child = classification.child | 126 | current_child = classification.child |
891 | 56 | while True: | 127 | while True: |
892 | @@ -60,27 +131,252 @@ | |||
893 | 60 | break | 131 | break |
894 | 61 | return tax_array | 132 | return tax_array |
895 | 62 | 133 | ||
899 | 63 | 134 | def get_tree_taxa_taxonomy_itis(taxon): | |
900 | 64 | 135 | ||
901 | 65 | def get_taxonomy_worms(taxonomy, start_otu): | 136 | URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(taxon.replace('_',' ').strip()) |
902 | 137 | req = urllib2.Request(URL) | ||
903 | 138 | opener = urllib2.build_opener() | ||
904 | 139 | f = opener.open(req) | ||
905 | 140 | string = unicode(f.read(),"ISO-8859-1") | ||
906 | 141 | this_item = json.loads(string) | ||
907 | 142 | if this_item['scientificNames'] == [None]: # not found | ||
908 | 143 | return {} | ||
909 | 144 | tsn = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though | ||
910 | 145 | # so call another function to get any valid names | ||
911 | 146 | URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+tsn | ||
912 | 147 | req = urllib2.Request(URL) | ||
913 | 148 | opener = urllib2.build_opener() | ||
914 | 149 | f = opener.open(req) | ||
915 | 150 | string = unicode(f.read(),"ISO-8859-1") | ||
916 | 151 | this_item = json.loads(string) | ||
917 | 152 | if not this_item['acceptedNames'] == [None]: | ||
918 | 153 | tsn = this_item['acceptedNames'][0]['acceptedTsn'] | ||
919 | 154 | |||
920 | 155 | URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn) | ||
921 | 156 | req = urllib2.Request(URL) | ||
922 | 157 | opener = urllib2.build_opener() | ||
923 | 158 | f = opener.open(req) | ||
924 | 159 | string = unicode(f.read(),"ISO-8859-1") | ||
925 | 160 | data = json.loads(string) | ||
926 | 161 | # construct array | ||
927 | 162 | this_taxonomy = {} | ||
928 | 163 | for level in data['hierarchyList']: | ||
929 | 164 | if level['rankName'].lower() in taxonomy_levels: | ||
930 | 165 | # note the dump into ASCII | ||
931 | 166 | this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore") | ||
932 | 167 | |||
933 | 168 | return this_taxonomy | ||
934 | 169 | |||
935 | 170 | |||
936 | 171 | |||
937 | 172 | def get_taxonomy_eol(taxonomy, start_otu, verbose,tmpfile=None,skip=False): | ||
938 | 173 | |||
939 | 174 | # this is the recursive function | ||
940 | 175 | def get_children(taxonomy, ID, aphiaIDsDone): | ||
941 | 176 | |||
942 | 177 | # get data | ||
943 | 178 | URL="http://eol.org/api/hierarchy_entries/1.0/"+str(ID)+".json?common_names=false&synonyms=false&cache_ttl=" | ||
944 | 179 | req = urllib2.Request(URL) | ||
945 | 180 | opener = urllib2.build_opener() | ||
946 | 181 | f = opener.open(req) | ||
947 | 182 | string = unicode(f.read(),"ISO-8859-1") | ||
948 | 183 | this_item = json.loads(string) | ||
949 | 184 | if this_item == None: | ||
950 | 185 | return taxonomy | ||
951 | 186 | if this_item['taxonRank'].lower().strip() == 'species': | ||
952 | 187 | # add data to taxonomy dictionary | ||
953 | 188 | taxon = this_item['scientificName'].split()[0:2] # just the first two words | ||
954 | 189 | taxon = " ".join(taxon[0:2]) | ||
955 | 190 | # NOTE following line means existing items are *not* updated | ||
956 | 191 | if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy | ||
957 | 192 | this_taxonomy = {} | ||
958 | 193 | for level in this_item['ancestors']: | ||
959 | 194 | if level['taxonRank'].lower() in taxonomy_levels: | ||
960 | 195 | # note the dump into ASCII | ||
961 | 196 | this_taxonomy[level['taxonRank'].lower().encode("ascii","ignore")] = level['scientificName'].encode("ascii","ignore") | ||
962 | 197 | # add species: | ||
963 | 198 | this_taxonomy['species'] = taxon.replace(" ","_") | ||
964 | 199 | if verbose: | ||
965 | 200 | print "\tAdding "+taxon | ||
966 | 201 | taxonomy[taxon] = this_taxonomy | ||
967 | 202 | if not tmpfile == None: | ||
968 | 203 | stk.save_taxonomy(taxonomy,tmpfile) | ||
969 | 204 | return taxonomy | ||
970 | 205 | else: | ||
971 | 206 | return taxonomy | ||
972 | 207 | all_children = [] | ||
973 | 208 | for level in this_item['children']: | ||
974 | 209 | if not level == None: | ||
975 | 210 | all_children.append(level['taxonID']) | ||
976 | 211 | |||
977 | 212 | if (len(all_children) == 0): | ||
978 | 213 | return taxonomy | ||
979 | 214 | |||
980 | 215 | for child in all_children: | ||
981 | 216 | if child in aphiaIDsDone: # we get stuck sometime | ||
982 | 217 | continue | ||
983 | 218 | aphiaIDsDone.append(child) | ||
984 | 219 | taxonomy = get_children(taxonomy, child, aphiaIDsDone) | ||
985 | 220 | return taxonomy | ||
986 | 221 | |||
987 | 222 | |||
988 | 223 | # main bit of the get_taxonomy_eol function | ||
989 | 224 | taxonq = quote_plus(start_otu) | ||
990 | 225 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
991 | 226 | req = urllib2.Request(URL) | ||
992 | 227 | opener = urllib2.build_opener() | ||
993 | 228 | f = opener.open(req) | ||
994 | 229 | data = json.load(f) | ||
995 | 230 | start_id = str(data['results'][0]['id']) # this is the page ID. We get the species ID next | ||
996 | 231 | URL = "http://eol.org/api/pages/1.0/"+start_id+".json" | ||
997 | 232 | req = urllib2.Request(URL) | ||
998 | 233 | opener = urllib2.build_opener() | ||
999 | 234 | f = opener.open(req) | ||
1000 | 235 | data = json.load(f) | ||
1001 | 236 | if len(data['taxonConcepts']) == 0: | ||
1002 | 237 | print "Error finding you start taxa. Spelling?" | ||
1003 | 238 | return None | ||
1004 | 239 | start_id = data['taxonConcepts'][0]['identifier'] | ||
1005 | 240 | start_taxonomy_level = data['taxonConcepts'][0]['taxonRank'].lower() | ||
1006 | 241 | |||
1007 | 242 | aphiaIDsDone = [] | ||
1008 | 243 | if not skip: | ||
1009 | 244 | taxonomy = get_children(taxonomy,start_id,aphiaIDsDone) | ||
1010 | 245 | |||
1011 | 246 | return taxonomy, start_taxonomy_level | ||
1012 | 247 | |||
1013 | 248 | |||
1014 | 249 | |||
1015 | 250 | def get_taxonomy_itis(taxonomy, start_otu, verbose,tmpfile=None,skip=False): | ||
1016 | 251 | import simplejson as json | ||
1017 | 252 | |||
1018 | 253 | # this is the recursive function | ||
1019 | 254 | def get_children(taxonomy, ID, aphiaIDsDone): | ||
1020 | 255 | |||
1021 | 256 | # get data | ||
1022 | 257 | URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+ID | ||
1023 | 258 | req = urllib2.Request(URL) | ||
1024 | 259 | opener = urllib2.build_opener() | ||
1025 | 260 | f = opener.open(req) | ||
1026 | 261 | string = unicode(f.read(),"ISO-8859-1") | ||
1027 | 262 | this_item = json.loads(string) | ||
1028 | 263 | if this_item == None: | ||
1029 | 264 | return taxonomy | ||
1030 | 265 | if not this_item['usage']['taxonUsageRating'].lower() == 'valid': | ||
1031 | 266 | print "rejecting " , this_item['scientificName']['combinedName'] | ||
1032 | 267 | return taxonomy | ||
1033 | 268 | if this_item['taxRank']['rankName'].lower().strip() == 'species': | ||
1034 | 269 | # add data to taxonomy dictionary | ||
1035 | 270 | taxon = this_item['scientificName']['combinedName'] | ||
1036 | 271 | # NOTE following line means existing items are *not* updated | ||
1037 | 272 | if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy | ||
1038 | 273 | # get the taxonomy of this species | ||
1039 | 274 | tsn = this_item["scientificName"]["tsn"] | ||
1040 | 275 | URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+tsn | ||
1041 | 276 | req = urllib2.Request(URL) | ||
1042 | 277 | opener = urllib2.build_opener() | ||
1043 | 278 | f = opener.open(req) | ||
1044 | 279 | string = unicode(f.read(),"ISO-8859-1") | ||
1045 | 280 | data = json.loads(string) | ||
1046 | 281 | this_taxonomy = {} | ||
1047 | 282 | for level in data['hierarchyList']: | ||
1048 | 283 | if level['rankName'].lower() in taxonomy_levels: | ||
1049 | 284 | # note the dump into ASCII | ||
1050 | 285 | this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore") | ||
1051 | 286 | if verbose: | ||
1052 | 287 | print "\tAdding "+taxon | ||
1053 | 288 | taxonomy[taxon] = this_taxonomy | ||
1054 | 289 | if not tmpfile == None: | ||
1055 | 290 | stk.save_taxonomy(taxonomy,tmpfile) | ||
1056 | 291 | return taxonomy | ||
1057 | 292 | else: | ||
1058 | 293 | return taxonomy | ||
1059 | 294 | |||
1060 | 295 | all_children = [] | ||
1061 | 296 | URL="http://www.itis.gov/ITISWebService/jsonservice/getHierarchyDownFromTSN?tsn="+ID | ||
1062 | 297 | req = urllib2.Request(URL) | ||
1063 | 298 | opener = urllib2.build_opener() | ||
1064 | 299 | f = opener.open(req) | ||
1065 | 300 | string = unicode(f.read(),"ISO-8859-1") | ||
1066 | 301 | this_item = json.loads(string) | ||
1067 | 302 | if this_item == None: | ||
1068 | 303 | return taxonomy | ||
1069 | 304 | |||
1070 | 305 | for level in this_item['hierarchyList']: | ||
1071 | 306 | if not level == None: | ||
1072 | 307 | all_children.append(level['tsn']) | ||
1073 | 308 | |||
1074 | 309 | if (len(all_children) == 0): | ||
1075 | 310 | return taxonomy | ||
1076 | 311 | |||
1077 | 312 | for child in all_children: | ||
1078 | 313 | if child in aphiaIDsDone: # we get stuck sometime | ||
1079 | 314 | continue | ||
1080 | 315 | aphiaIDsDone.append(child) | ||
1081 | 316 | taxonomy = get_children(taxonomy, child, aphiaIDsDone) | ||
1082 | 317 | |||
1083 | 318 | return taxonomy | ||
1084 | 319 | |||
1085 | 320 | |||
1086 | 321 | # main bit of the get_taxonomy_worms function | ||
1087 | 322 | URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(start_otu.strip()) | ||
1088 | 323 | req = urllib2.Request(URL) | ||
1089 | 324 | opener = urllib2.build_opener() | ||
1090 | 325 | f = opener.open(req) | ||
1091 | 326 | string = unicode(f.read(),"ISO-8859-1") | ||
1092 | 327 | this_item = json.loads(string) | ||
1093 | 328 | start_id = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though | ||
1094 | 329 | # call it again via the ID this time to make sure we've got the right one. | ||
1095 | 330 | # so call another function to get any valid names | ||
1096 | 331 | URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+start_id | ||
1097 | 332 | req = urllib2.Request(URL) | ||
1098 | 333 | opener = urllib2.build_opener() | ||
1099 | 334 | f = opener.open(req) | ||
1100 | 335 | string = unicode(f.read(),"ISO-8859-1") | ||
1101 | 336 | this_item = json.loads(string) | ||
1102 | 337 | if not this_item['acceptedNames'] == [None]: | ||
1103 | 338 | start_id = this_item['acceptedNames'][0]['acceptedTsn'] | ||
1104 | 339 | |||
1105 | 340 | URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+start_id | ||
1106 | 341 | req = urllib2.Request(URL) | ||
1107 | 342 | opener = urllib2.build_opener() | ||
1108 | 343 | f = opener.open(req) | ||
1109 | 344 | string = unicode(f.read(),"ISO-8859-1") | ||
1110 | 345 | this_item = json.loads(string) | ||
1111 | 346 | start_taxonomy_level = this_item['taxRank']['rankName'].lower() | ||
1112 | 347 | |||
1113 | 348 | aphiaIDsDone = [] | ||
1114 | 349 | if not skip: | ||
1115 | 350 | taxonomy = get_children(taxonomy,start_id,aphiaIDsDone) | ||
1116 | 351 | |||
1117 | 352 | return taxonomy, start_taxonomy_level | ||
1118 | 353 | |||
1119 | 354 | |||
1120 | 355 | |||
1121 | 356 | |||
1122 | 357 | def get_taxonomy_worms(taxonomy, start_otu, verbose,tmpfile=None,skip=False): | ||
1123 | 66 | """ Gets and processes a taxon from the queue to get its taxonomy.""" | 358 | """ Gets and processes a taxon from the queue to get its taxonomy.""" |
1124 | 67 | from SOAPpy import WSDL | 359 | from SOAPpy import WSDL |
1125 | 68 | 360 | ||
1126 | 69 | wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1') | 361 | wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1') |
1127 | 70 | 362 | ||
1128 | 71 | # this is the recursive function | 363 | # this is the recursive function |
1130 | 72 | def get_children(taxonomy, ID): | 364 | def get_children(taxonomy, ID, aphiaIDsDone): |
1131 | 73 | 365 | ||
1132 | 74 | # get data | 366 | # get data |
1133 | 75 | this_item = wsdlObjectWoRMS.getAphiaRecordByID(ID) | 367 | this_item = wsdlObjectWoRMS.getAphiaRecordByID(ID) |
1134 | 76 | if this_item == None: | 368 | if this_item == None: |
1135 | 77 | return taxonomy | 369 | return taxonomy |
1136 | 370 | if not this_item['status'].lower() == 'accepted': | ||
1137 | 371 | print "rejecting " , this_item.valid_name | ||
1138 | 372 | return taxonomy | ||
1139 | 78 | if this_item['rank'].lower() == 'species': | 373 | if this_item['rank'].lower() == 'species': |
1140 | 79 | # add data to taxonomy dictionary | 374 | # add data to taxonomy dictionary |
1144 | 80 | # get the taxonomy of this species | 375 | taxon = this_item.valid_name |
1145 | 81 | classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID) | 376 | # NOTE following line means existing items are *not* updated |
1143 | 82 | taxon = this_item.scientificname | ||
1146 | 83 | if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy | 377 | if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy |
1147 | 378 | # get the taxonomy of this species | ||
1148 | 379 | classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID) | ||
1149 | 84 | # construct array | 380 | # construct array |
1150 | 85 | tax_array = {} | 381 | tax_array = {} |
1151 | 86 | # classification is a nested dictionary, so we need to iterate down it | 382 | # classification is a nested dictionary, so we need to iterate down it |
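The ITIS branch above builds every request the same way: assemble a jsonservice URL, open it with urllib2, decode the ISO-8859-1 payload and json.loads it. A small helper capturing that sequence (a sketch only - itis_get is not part of the toolkit; the endpoint names are the ones already used above):

    import json
    import urllib2
    from urllib import quote_plus

    ITIS_BASE = "http://www.itis.gov/ITISWebService/jsonservice/"

    def itis_get(endpoint, **params):
        """Fetch one ITIS jsonservice endpoint and return the decoded JSON object."""
        query = "&".join("%s=%s" % (k, quote_plus(str(v))) for k, v in params.items())
        req = urllib2.Request(ITIS_BASE + endpoint + "?" + query)
        f = urllib2.build_opener().open(req)
        return json.loads(unicode(f.read(), "ISO-8859-1"))

    # e.g. hierarchy = itis_get("getFullHierarchyFromTSN", tsn=tsn)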
1152 | @@ -92,16 +388,36 @@ | |||
1153 | 92 | current_child = current_child.child | 388 | current_child = current_child.child |
1154 | 93 | if current_child == '': # empty one is a string for some reason | 389 | if current_child == '': # empty one is a string for some reason |
1155 | 94 | break | 390 | break |
1157 | 95 | taxonomy[this_item.scientificname] = tax_array | 391 | if verbose: |
1158 | 392 | print "\tAdding "+this_item.scientificname | ||
1159 | 393 | taxonomy[this_item.valid_name] = tax_array | ||
1160 | 394 | if not tmpfile == None: | ||
1161 | 395 | stk.save_taxonomy(taxonomy,tmpfile) | ||
1162 | 96 | return taxonomy | 396 | return taxonomy |
1163 | 97 | else: | 397 | else: |
1164 | 98 | return taxonomy | 398 | return taxonomy |
1165 | 99 | 399 | ||
1171 | 100 | children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, 1, False) | 400 | all_children = [] |
1172 | 101 | 401 | start = 1 | |
1173 | 102 | for child in children: | 402 | while True: |
1174 | 103 | taxonomy = get_children(taxonomy, child['valid_AphiaID']) | 403 | children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, start, False) |
1175 | 104 | 404 | if (children is None or children == None): | |
1176 | 405 | break | ||
1177 | 406 | if (len(children) < 50): | ||
1178 | 407 | all_children.extend(children) | ||
1179 | 408 | break | ||
1180 | 409 | all_children.extend(children) | ||
1181 | 410 | start += 50 | ||
1182 | 411 | |||
1183 | 412 | if (len(all_children) == 0): | ||
1184 | 413 | return taxonomy | ||
1185 | 414 | |||
1186 | 415 | for child in all_children: | ||
1187 | 416 | if child['valid_AphiaID'] in aphiaIDsDone: # we get stuck sometimes | ||
1188 | 417 | continue | ||
1189 | 418 | aphiaIDsDone.append(child['valid_AphiaID']) | ||
1190 | 419 | taxonomy = get_children(taxonomy, child['valid_AphiaID'], aphiaIDsDone) | ||
1191 | 420 | |||
1192 | 105 | return taxonomy | 421 | return taxonomy |
1193 | 106 | 422 | ||
1194 | 107 | 423 | ||
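getAphiaChildrenByID returns at most 50 records per call, so the loop above keeps asking for the next offset until it gets a short or empty page. The same pattern in isolation, with fetch_page standing in for the SOAP call (a hypothetical helper, not toolkit code):

    def fetch_all_children(fetch_page, parent_id, page_size=50):
        """Collect every child record by requesting successive pages."""
        all_children = []
        offset = 1
        while True:
            page = fetch_page(parent_id, offset)   # like getAphiaChildrenByID(ID, offset, False)
            if page is None:
                break
            all_children.extend(page)
            if len(page) < page_size:
                break                              # a short page means we have reached the end
            offset += page_size
        return all_children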
1195 | @@ -111,12 +427,17 @@ | |||
1196 | 111 | start_id = start_taxa[0]['valid_AphiaID'] # there might be records that aren't valid - they point to the valid one though | 427 | start_id = start_taxa[0]['valid_AphiaID'] # there might be records that aren't valid - they point to the valid one though |
1197 | 112 | # call it again via the ID this time to make sure we've got the right one. | 428 | # call it again via the ID this time to make sure we've got the right one. |
1198 | 113 | start_taxa = wsdlObjectWoRMS.getAphiaRecordByID(start_id) | 429 | start_taxa = wsdlObjectWoRMS.getAphiaRecordByID(start_id) |
1202 | 114 | start_taxonomy_level = start_taxa['rank'].lower() | 430 | if start_taxa == None: |
1203 | 115 | except HTTPError: | 431 | start_taxonomy_level = 'infraorder' |
1204 | 116 | print "Error" | 432 | else: |
1205 | 433 | start_taxonomy_level = start_taxa['rank'].lower() | ||
1206 | 434 | except urllib2.HTTPError: | ||
1207 | 435 | print "Error finding start_otu taxonomic level. Do you have an internet connection?" | ||
1208 | 117 | sys.exit(-1) | 436 | sys.exit(-1) |
1209 | 118 | 437 | ||
1211 | 119 | taxonomy = get_children(taxonomy,start_id) | 438 | aphiaIDsDone = [] |
1212 | 439 | if not skip: | ||
1213 | 440 | taxonomy = get_children(taxonomy,start_id,aphiaIDsDone) | ||
1214 | 120 | 441 | ||
1215 | 121 | return taxonomy, start_taxonomy_level | 442 | return taxonomy, start_taxonomy_level |
1216 | 122 | 443 | ||
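Both services can hand back a child whose record points at something already visited, so get_children now carries the list of IDs it has seen (aphiaIDsDone) and skips repeats. The guard on its own, written as a generic depth-first walk (a sketch under that assumption, not toolkit code):

    def walk(node_id, children_of, visit, seen=None):
        """Depth-first walk that never revisits an ID, so cyclic references
        in the upstream data cannot send the recursion round in circles."""
        if seen is None:
            seen = set()
        visit(node_id)
        for child_id in children_of(node_id):
            if child_id in seen:
                continue
            seen.add(child_id)
            walk(child_id, children_of, visit, seen)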
1217 | @@ -136,9 +457,16 @@ | |||
1218 | 136 | default=False | 457 | default=False |
1219 | 137 | ) | 458 | ) |
1220 | 138 | parser.add_argument( | 459 | parser.add_argument( |
1221 | 460 | '-s', | ||
1222 | 461 | '--skip', | ||
1223 | 462 | action='store_true', | ||
1224 | 463 | help="Skip online checking, just use taxonomy files", | ||
1225 | 464 | default=False | ||
1226 | 465 | ) | ||
1227 | 466 | parser.add_argument( | ||
1228 | 139 | '--pref_db', | 467 | '--pref_db', |
1229 | 140 | help="Taxonomy database to use. Default is Species 2000/ITIS", | 468 | help="Taxonomy database to use. Default is Species 2000/ITIS", |
1231 | 141 | choices=['itis', 'worms', 'ncbi'], | 469 | choices=['itis', 'worms', 'ncbi', 'eol'], |
1232 | 142 | default = 'worms' | 470 | default = 'worms' |
1233 | 143 | ) | 471 | ) |
1234 | 144 | parser.add_argument( | 472 | parser.add_argument( |
1235 | @@ -178,58 +506,250 @@ | |||
1236 | 178 | top_level = args.top_level[0] | 506 | top_level = args.top_level[0] |
1237 | 179 | save_taxonomy_file = args.save_taxonomy | 507 | save_taxonomy_file = args.save_taxonomy |
1238 | 180 | tree_taxonomy = args.tree_taxonomy | 508 | tree_taxonomy = args.tree_taxonomy |
1239 | 509 | taxonomy = args.taxonomy_from_file | ||
1240 | 181 | pref_db = args.pref_db | 510 | pref_db = args.pref_db |
1241 | 511 | skip = args.skip | ||
1242 | 182 | if (save_taxonomy_file == None): | 512 | if (save_taxonomy_file == None): |
1243 | 183 | save_taxonomy = False | 513 | save_taxonomy = False |
1244 | 184 | else: | 514 | else: |
1245 | 185 | save_taxonomy = True | 515 | save_taxonomy = True |
1246 | 516 | load_tree_taxonomy = False | ||
1247 | 517 | if (not tree_taxonomy == None): | ||
1248 | 518 | tree_taxonomy_file = tree_taxonomy | ||
1249 | 519 | load_tree_taxonomy = True | ||
1250 | 520 | if skip: | ||
1251 | 521 | if taxonomy == None: | ||
1252 | 522 | print "Error: If you're skipping checking online, then you need to supply taxonomy files" | ||
1253 | 523 | return | ||
1254 | 186 | 524 | ||
1255 | 187 | # grab taxa in tree | 525 | # grab taxa in tree |
1256 | 188 | tree = stk.import_tree(input_file) | 526 | tree = stk.import_tree(input_file) |
1257 | 189 | taxa_list = stk._getTaxaFromNewick(tree) | 527 | taxa_list = stk._getTaxaFromNewick(tree) |
1262 | 190 | 528 | ||
1263 | 191 | taxonomy = {} | 529 | if verbose: |
1264 | 192 | 530 | print "Taxa count for input tree: ", len(taxa_list) | |
1265 | 193 | # we're going to add the taxa in the tree to the taxonomy, to stop them | 531 | |
1266 | 532 | # load in any taxonomy files - we still call the APIs as a) they may have updated data and | ||
1267 | 533 | # b) the user may have missed some first time round (i.e. expanded the tree and not redone | ||
1268 | 534 | # the taxonomy) | ||
1269 | 535 | if (taxonomy == None): | ||
1270 | 536 | taxonomy = {} | ||
1271 | 537 | else: | ||
1272 | 538 | taxonomy = stk.load_taxonomy(taxonomy) | ||
1273 | 539 | tree_taxonomy = {} | ||
1274 | 540 | # this might also have tree_taxonomy in too - let's check this | ||
1275 | 541 | for t in taxa_list: | ||
1276 | 542 | if t in taxonomy: | ||
1277 | 543 | tree_taxonomy[t] = taxonomy[t] | ||
1278 | 544 | elif t.replace("_"," ") in taxonomy: | ||
1279 | 545 | tree_taxonomy[t] = taxonomy[t.replace("_"," ")] | ||
1280 | 546 | |||
1281 | 547 | if (load_tree_taxonomy): # overwrite the good work above... | ||
1282 | 548 | tree_taxonomy = stk.load_taxonomy(tree_taxonomy_file) | ||
1283 | 549 | if (tree_taxonomy == None): | ||
1284 | 550 | tree_taxonomy = {} | ||
1285 | 551 | |||
1286 | 552 | # we're going to add the taxa in the tree to the main WORMS taxonomy, to stop them | ||
1287 | 194 | # being fetched in first place. We delete them later | 553 | # being fetched in first place. We delete them later |
1288 | 554 | # If you've loaded a taxonomy created by this script, this overwrites the tree taxa in the main taxonomy dict | ||
1289 | 555 | # Don't worry, we put them back in before saving again! | ||
1290 | 195 | for taxon in taxa_list: | 556 | for taxon in taxa_list: |
1291 | 196 | taxon = taxon.replace('_',' ') | 557 | taxon = taxon.replace('_',' ') |
1294 | 197 | taxonomy[taxon] = [] | 558 | taxonomy[taxon] = {} |
1293 | 198 | |||
1295 | 199 | 559 | ||
1296 | 200 | if (pref_db == 'itis'): | 560 | if (pref_db == 'itis'): |
1297 | 201 | # get taxonomy info from itis | 561 | # get taxonomy info from itis |
1300 | 202 | print "Sorry, ITIS is not implemented yet" | 562 | if (verbose): |
1301 | 203 | pass | 563 | print "Getting data from ITIS" |
1302 | 564 | if (verbose): | ||
1303 | 565 | print "Dealing with taxa in tree" | ||
1304 | 566 | for t in taxa_list: | ||
1305 | 567 | if verbose: | ||
1306 | 568 | print "\t"+t | ||
1307 | 569 | if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy): | ||
1308 | 570 | # we don't have data - NOTE we assume things are *not* updated here if we do | ||
1309 | 571 | tree_taxonomy[t] = get_tree_taxa_taxonomy_itis(t) | ||
1310 | 572 | |||
1311 | 573 | if save_taxonomy: | ||
1312 | 574 | if (verbose): | ||
1313 | 575 | print "Saving tree taxonomy" | ||
1314 | 576 | # note -temporary save as we overwrite this file later. | ||
1315 | 577 | stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv') | ||
1316 | 578 | |||
1317 | 579 | # get taxonomy from ITIS | ||
1318 | 580 | if verbose: | ||
1319 | 581 | print "Now dealing with all other taxa - this might take a while..." | ||
1320 | 582 | # create a temp file so we can checkpoint and continue | ||
1321 | 583 | tmpf, tmpfile = tempfile.mkstemp() | ||
1322 | 584 | |||
1323 | 585 | if os.path.isfile('.fit_lock'): | ||
1324 | 586 | f = open('.fit_lock','r') | ||
1325 | 587 | tf = f.read() | ||
1326 | 588 | f.close() | ||
1327 | 589 | if os.path.isfile(tf.strip()): | ||
1328 | 590 | taxonomy = stk.load_taxonomy(tf.strip()) | ||
1329 | 591 | os.remove('.fit_lock') | ||
1330 | 592 | |||
1331 | 593 | # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue | ||
1332 | 594 | # where we left off. | ||
1333 | 595 | with open(".fit_lock", 'w') as f: | ||
1334 | 596 | f.write(tmpfile) | ||
1335 | 597 | # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function | ||
1336 | 598 | taxonomy, start_level = get_taxonomy_itis(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there | ||
1337 | 599 | |||
1338 | 600 | # clean up | ||
1339 | 601 | os.close(tmpf) | ||
1340 | 602 | os.remove('.fit_lock') | ||
1341 | 603 | try: | ||
1342 | 604 | os.remove(tmpfile) | ||
1343 | 605 | except OSError: | ||
1344 | 606 | pass | ||
1345 | 204 | elif (pref_db == 'worms'): | 607 | elif (pref_db == 'worms'): |
1346 | 608 | if (verbose): | ||
1347 | 609 | print "Getting data from WoRMS" | ||
1348 | 205 | # get tree taxonomy from worms | 610 | # get tree taxonomy from worms |
1357 | 206 | if (tree_taxonomy == None): | 611 | if (verbose): |
1358 | 207 | tree_taxonomy = {} | 612 | print "Dealing with taxa in tree" |
1359 | 208 | for t in taxa_list: | 613 | |
1360 | 209 | from SOAPpy import WSDL | 614 | for t in taxa_list: |
1361 | 210 | wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1') | 615 | if verbose: |
1362 | 211 | tree_taxonomy[t] = get_tree_taxa_taxonomy(t,wsdlObjectWoRMS) | 616 | print "\t"+t |
1363 | 212 | else: | 617 | if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy): |
1364 | 213 | tree_taxonomy = stk.load_taxonomy(tree_taxonomy) | 618 | # we don't have data - NOTE we assume things are *not* updated here if we do |
1365 | 619 | tree_taxonomy[t] = get_tree_taxa_taxonomy_worms(t) | ||
1366 | 620 | |||
1367 | 621 | if save_taxonomy: | ||
1368 | 622 | if (verbose): | ||
1369 | 623 | print "Saving tree taxonomy" | ||
1370 | 624 | # note -temporary save as we overwrite this file later. | ||
1371 | 625 | stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv') | ||
1372 | 626 | |||
1373 | 214 | # get taxonomy from worms | 627 | # get taxonomy from worms |
1375 | 215 | taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level) | 628 | if verbose: |
1376 | 629 | print "Now dealing with all other taxa - this might take a while..." | ||
1377 | 630 | # create a temp file so we can checkpoint and continue | ||
1378 | 631 | tmpf, tmpfile = tempfile.mkstemp() | ||
1379 | 632 | |||
1380 | 633 | if os.path.isfile('.fit_lock'): | ||
1381 | 634 | f = open('.fit_lock','r') | ||
1382 | 635 | tf = f.read() | ||
1383 | 636 | f.close() | ||
1384 | 637 | if os.path.isfile(tf.strip()): | ||
1385 | 638 | taxonomy = stk.load_taxonomy(tf.strip()) | ||
1386 | 639 | os.remove('.fit_lock') | ||
1387 | 640 | |||
1388 | 641 | # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue | ||
1389 | 642 | # where we left off. | ||
1390 | 643 | with open(".fit_lock", 'w') as f: | ||
1391 | 644 | f.write(tmpfile) | ||
1392 | 645 | # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function | ||
1393 | 646 | taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there | ||
1394 | 647 | |||
1395 | 648 | # clean up | ||
1396 | 649 | os.close(tmpf) | ||
1397 | 650 | os.remove('.fit_lock') | ||
1398 | 651 | try: | ||
1399 | 652 | os.remove(tmpfile) | ||
1400 | 653 | except OSError: | ||
1401 | 654 | pass | ||
1402 | 216 | 655 | ||
1403 | 217 | elif (pref_db == 'ncbi'): | 656 | elif (pref_db == 'ncbi'): |
1404 | 218 | # get taxonomy from ncbi | 657 | # get taxonomy from ncbi |
1405 | 219 | print "Sorry, NCBI is not implemented yet" | 658 | print "Sorry, NCBI is not implemented yet" |
1406 | 220 | pass | 659 | pass |
1407 | 660 | elif (pref_db == 'eol'): | ||
1408 | 661 | if (verbose): | ||
1409 | 662 | print "Getting data from EOL" | ||
1410 | 663 | # get tree taxonomy from EOL | ||
1411 | 664 | if (verbose): | ||
1412 | 665 | print "Dealing with taxa in tree" | ||
1413 | 666 | for t in taxa_list: | ||
1414 | 667 | if verbose: | ||
1415 | 668 | print "\t"+t | ||
1416 | 669 | try: | ||
1417 | 670 | tree_taxonomy[t] | ||
1418 | 671 | pass # we have data - NOTE we assume things are *not* updated here... | ||
1419 | 672 | except KeyError: | ||
1420 | 673 | try: | ||
1421 | 674 | tree_taxonomy[t.replace('_',' ')] | ||
1422 | 675 | except KeyError: | ||
1423 | 676 | tree_taxonomy[t] = get_tree_taxa_taxonomy_eol(t) | ||
1424 | 677 | |||
1425 | 678 | if save_taxonomy: | ||
1426 | 679 | if (verbose): | ||
1427 | 680 | print "Saving tree taxonomy" | ||
1428 | 681 | # note -temporary save as we overwrite this file later. | ||
1429 | 682 | stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv') | ||
1430 | 683 | |||
1431 | 684 | # get taxonomy from EOL | ||
1432 | 685 | if verbose: | ||
1433 | 686 | print "Now dealing with all other taxa - this might take a while..." | ||
1434 | 687 | # create a temp file so we can checkpoint and continue | ||
1435 | 688 | tmpf, tmpfile = tempfile.mkstemp() | ||
1436 | 689 | |||
1437 | 690 | if os.path.isfile('.fit_lock'): | ||
1438 | 691 | f = open('.fit_lock','r') | ||
1439 | 692 | tf = f.read() | ||
1440 | 693 | f.close() | ||
1441 | 694 | if os.path.isfile(tf.strip()): | ||
1442 | 695 | taxonomy = stk.load_taxonomy(tf.strip()) | ||
1443 | 696 | os.remove('.fit_lock') | ||
1444 | 697 | |||
1445 | 698 | # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue | ||
1446 | 699 | # where we left off. | ||
1447 | 700 | with open(".fit_lock", 'w') as f: | ||
1448 | 701 | f.write(tmpfile) | ||
1449 | 702 | # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function | ||
1450 | 703 | taxonomy, start_level = get_taxonomy_eol(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there | ||
1451 | 704 | |||
1452 | 705 | # clean up | ||
1453 | 706 | os.close(tmpf) | ||
1454 | 707 | os.remove('.fit_lock') | ||
1455 | 708 | try: | ||
1456 | 709 | os.remove(tmpfile) | ||
1457 | 710 | except OSError: | ||
1458 | 711 | pass | ||
1459 | 221 | else: | 712 | else: |
1461 | 222 | print "ERROR: Didn't understand you database choice" | 713 | print "ERROR: Didn't understand your database choice" |
1462 | 223 | sys.exit(-1) | 714 | sys.exit(-1) |
1463 | 224 | 715 | ||
1464 | 225 | # clean up taxonomy, deleting the ones already in the tree | 716 | # clean up taxonomy, deleting the ones already in the tree |
1465 | 226 | for taxon in taxa_list: | 717 | for taxon in taxa_list: |
1468 | 227 | taxon = taxon.replace('_',' ') | 718 | taxon = taxon.replace('_',' ') |
1469 | 228 | del taxonomy[taxon] | 719 | try: |
1470 | 720 | del taxonomy[taxon] | ||
1471 | 721 | except KeyError: | ||
1472 | 722 | pass # if it's not there, do we care? | ||
1473 | 723 | |||
1474 | 724 | # We now have 2 taxonomies: | ||
1475 | 725 | # - for taxa in the tree | ||
1476 | 726 | # - for all other taxa in the clade of interest | ||
1477 | 727 | |||
1478 | 728 | if save_taxonomy: | ||
1479 | 729 | tot_taxonomy = taxonomy.copy() | ||
1480 | 730 | tot_taxonomy.update(tree_taxonomy) | ||
1481 | 731 | stk.save_taxonomy(tot_taxonomy,save_taxonomy_file) | ||
1482 | 732 | |||
1483 | 733 | |||
1484 | 734 | orig_taxa_list = taxa_list | ||
1485 | 735 | |||
1486 | 736 | remove_higher_level = [] # for storing the higher level taxa in the original tree that need deleting | ||
1487 | 737 | generic = [] | ||
1488 | 738 | # find all the generic and build an internal subs file | ||
1489 | 739 | for t in taxa_list: | ||
1490 | 740 | t = t.replace(" ","_") | ||
1491 | 741 | if t.find("_") == -1: | ||
1492 | 742 | # no underscore, so just generic | ||
1493 | 743 | generic.append(t) | ||
1494 | 229 | 744 | ||
1495 | 230 | # step up the taxonomy levels from genus, adding taxa to the correct node | 745 | # step up the taxonomy levels from genus, adding taxa to the correct node |
1496 | 231 | # as a polytomy | 746 | # as a polytomy |
1498 | 232 | for level in taxonomy_levels[1::]: # skip species.... | 747 | start_level = start_level.encode('utf-8').strip() |
1499 | 748 | if verbose: | ||
1500 | 749 | print "I think your start OTU is at: ", start_level | ||
1501 | 750 | for level in tlevels[1::]: # skip species.... | ||
1502 | 751 | if verbose: | ||
1503 | 752 | print "Dealing with ",level | ||
1504 | 233 | new_taxa = [] | 753 | new_taxa = [] |
1505 | 234 | for t in taxonomy: | 754 | for t in taxonomy: |
1506 | 235 | # skip odd ones that should be in there | 755 | # skip odd ones that should be in there |
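All three database branches above use the same checkpoint scheme: the growing taxonomy is saved to a temporary file as it is fetched, that file's path is recorded in .fit_lock, and if .fit_lock is found on the next run the partial taxonomy is reloaded so the downloads carry on where they stopped. The idea in isolation (a sketch; load stands in for stk.load_taxonomy):

    import os
    import tempfile

    LOCK = ".fit_lock"

    def resume_taxonomy(load):
        """Return a previously checkpointed taxonomy, or {} if there is none."""
        taxonomy = {}
        if os.path.isfile(LOCK):
            with open(LOCK) as f:
                checkpoint = f.read().strip()
            if os.path.isfile(checkpoint):
                taxonomy = load(checkpoint)
            os.remove(LOCK)
        return taxonomy

    def start_checkpoint():
        """Create a fresh checkpoint file and record its path in the lock file."""
        handle, path = tempfile.mkstemp()
        with open(LOCK, "w") as f:
            f.write(path)
        return handle, path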
1507 | @@ -239,135 +759,61 @@ | |||
1508 | 239 | except KeyError: | 759 | except KeyError: |
1509 | 240 | continue # don't have this info | 760 | continue # don't have this info |
1510 | 241 | new_taxa = _uniquify(new_taxa) | 761 | new_taxa = _uniquify(new_taxa) |
1511 | 762 | |||
1512 | 242 | for nt in new_taxa: | 763 | for nt in new_taxa: |
1514 | 243 | taxa_to_add = [] | 764 | taxa_to_add = {} |
1515 | 244 | taxa_in_clade = [] | 765 | taxa_in_clade = [] |
1516 | 245 | for t in taxonomy: | 766 | for t in taxonomy: |
1517 | 246 | if start_level in taxonomy[t] and taxonomy[t][start_level] == top_level: | 767 | if start_level in taxonomy[t] and taxonomy[t][start_level] == top_level: |
1518 | 247 | try: | 768 | try: |
1521 | 248 | if taxonomy[t][level] == nt: | 769 | if taxonomy[t][level] == nt and not t in taxa_list: |
1522 | 249 | taxa_to_add.append(t.replace(' ','_')) | 770 | taxa_to_add[t] = taxonomy[t] |
1523 | 250 | except KeyError: | 771 | except KeyError: |
1524 | 251 | continue | 772 | continue |
1525 | 773 | |||
1526 | 252 | # add to tree | 774 | # add to tree |
1527 | 253 | for t in taxa_list: | 775 | for t in taxa_list: |
1528 | 254 | if level in tree_taxonomy[t] and tree_taxonomy[t][level] == nt: | 776 | if level in tree_taxonomy[t] and tree_taxonomy[t][level] == nt: |
1529 | 255 | taxa_in_clade.append(t) | 777 | taxa_in_clade.append(t) |
1536 | 256 | if len(taxa_in_clade) > 0: | 778 | if t in generic: |
1537 | 257 | tree = add_taxa(tree, taxa_to_add, taxa_in_clade) | 779 | # we are appending taxa to this higher taxon, so we need to remove it |
1538 | 258 | for t in taxa_to_add: # clean up taxonomy | 780 | remove_higher_level.append(t) |
1539 | 259 | del taxonomy[t.replace('_',' ')] | 781 | |
1540 | 260 | 782 | ||
1541 | 261 | 783 | if len(taxa_in_clade) > 0 and len(taxa_to_add) > 0: | |
1542 | 784 | tree = add_taxa(tree, taxa_to_add, taxa_in_clade,level) | ||
1543 | 785 | try: | ||
1544 | 786 | taxa_list = stk._getTaxaFromNewick(tree) | ||
1545 | 787 | except stk.TreeParseError as e: | ||
1546 | 788 | print taxa_to_add, taxa_in_clade, level, tree | ||
1547 | 789 | print e.msg | ||
1548 | 790 | return | ||
1549 | 791 | |||
1550 | 792 | for t in taxa_to_add: | ||
1551 | 793 | tree_taxonomy[t.replace(' ','_')] = taxa_to_add[t] | ||
1552 | 794 | try: | ||
1553 | 795 | del taxonomy[t.replace('_',' ')] | ||
1554 | 796 | except KeyError: | ||
1555 | 797 | # It might have _ or it might not... | ||
1556 | 798 | del taxonomy[t] | ||
1557 | 799 | |||
1558 | 800 | |||
1559 | 801 | # remove singleton nodes | ||
1560 | 802 | tree = stk._collapse_nodes(tree) | ||
1561 | 803 | tree = stk._collapse_nodes(tree) | ||
1562 | 804 | tree = stk._collapse_nodes(tree) | ||
1563 | 805 | |||
1564 | 806 | tree = stk._sub_taxa_in_tree(tree, remove_higher_level) | ||
1565 | 262 | trees = {} | 807 | trees = {} |
1566 | 263 | trees['tree_1'] = tree | 808 | trees['tree_1'] = tree |
1567 | 264 | output = stk._amalgamate_trees(trees,format='nexus') | 809 | output = stk._amalgamate_trees(trees,format='nexus') |
1568 | 265 | f = open(output_file, "w") | 810 | f = open(output_file, "w") |
1569 | 266 | f.write(output) | 811 | f.write(output) |
1570 | 267 | f.close() | 812 | f.close() |
1674 | 268 | 813 | taxa_list = stk._getTaxaFromNewick(tree) | |
1675 | 269 | if not save_taxonomy_file == None: | 814 | |
1676 | 270 | with open(save_taxonomy_file, 'w') as f: | 815 | print "Final taxa count:", len(taxa_list) |
1677 | 271 | writer = csv.writer(f) | 816 | |
1575 | 272 | headers = [] | ||
1576 | 273 | headers.append("OTU") | ||
1577 | 274 | headers.extend(taxonomy_levels) | ||
1578 | 275 | headers.append("Data source") | ||
1579 | 276 | writer.writerow(headers) | ||
1580 | 277 | for t in taxonomy: | ||
1581 | 278 | otu = t | ||
1582 | 279 | try: | ||
1583 | 280 | species = taxonomy[t]['species'] | ||
1584 | 281 | except KeyError: | ||
1585 | 282 | species = "-" | ||
1586 | 283 | try: | ||
1587 | 284 | genus = taxonomy[t]['genus'] | ||
1588 | 285 | except KeyError: | ||
1589 | 286 | genus = "-" | ||
1590 | 287 | try: | ||
1591 | 288 | family = taxonomy[t]['family'] | ||
1592 | 289 | except KeyError: | ||
1593 | 290 | family = "-" | ||
1594 | 291 | try: | ||
1595 | 292 | superfamily = taxonomy[t]['superfamily'] | ||
1596 | 293 | except KeyError: | ||
1597 | 294 | superfamily = "-" | ||
1598 | 295 | try: | ||
1599 | 296 | infraorder = taxonomy[t]['infraorder'] | ||
1600 | 297 | except KeyError: | ||
1601 | 298 | infraorder = "-" | ||
1602 | 299 | try: | ||
1603 | 300 | suborder = taxonomy[t]['suborder'] | ||
1604 | 301 | except KeyError: | ||
1605 | 302 | suborder = "-" | ||
1606 | 303 | try: | ||
1607 | 304 | order = taxonomy[t]['order'] | ||
1608 | 305 | except KeyError: | ||
1609 | 306 | order = "-" | ||
1610 | 307 | try: | ||
1611 | 308 | superorder = taxonomy[t]['superorder'] | ||
1612 | 309 | except KeyError: | ||
1613 | 310 | superorder = "-" | ||
1614 | 311 | try: | ||
1615 | 312 | subclass = taxonomy[t]['subclass'] | ||
1616 | 313 | except KeyError: | ||
1617 | 314 | subclass = "-" | ||
1618 | 315 | try: | ||
1619 | 316 | tclass = taxonomy[t]['class'] | ||
1620 | 317 | except KeyError: | ||
1621 | 318 | tclass = "-" | ||
1622 | 319 | try: | ||
1623 | 320 | subphylum = taxonomy[t]['subphylum'] | ||
1624 | 321 | except KeyError: | ||
1625 | 322 | subphylum = "-" | ||
1626 | 323 | try: | ||
1627 | 324 | phylum = taxonomy[t]['phylum'] | ||
1628 | 325 | except KeyError: | ||
1629 | 326 | phylum = "-" | ||
1630 | 327 | try: | ||
1631 | 328 | superphylum = taxonomy[t]['superphylum'] | ||
1632 | 329 | except KeyError: | ||
1633 | 330 | superphylum = "-" | ||
1634 | 331 | try: | ||
1635 | 332 | infrakingdom = taxonomy[t]['infrakingdom'] | ||
1636 | 333 | except: | ||
1637 | 334 | infrakingdom = "-" | ||
1638 | 335 | try: | ||
1639 | 336 | subkingdom = taxonomy[t]['subkingdom'] | ||
1640 | 337 | except: | ||
1641 | 338 | subkingdom = "-" | ||
1642 | 339 | try: | ||
1643 | 340 | kingdom = taxonomy[t]['kingdom'] | ||
1644 | 341 | except KeyError: | ||
1645 | 342 | kingdom = "-" | ||
1646 | 343 | try: | ||
1647 | 344 | provider = taxonomy[t]['provider'] | ||
1648 | 345 | except KeyError: | ||
1649 | 346 | provider = "-" | ||
1650 | 347 | |||
1651 | 348 | if (isinstance(species, list)): | ||
1652 | 349 | species = " ".join(species) | ||
1653 | 350 | this_classification = [ | ||
1654 | 351 | otu.encode('utf-8'), | ||
1655 | 352 | species.encode('utf-8'), | ||
1656 | 353 | genus.encode('utf-8'), | ||
1657 | 354 | family.encode('utf-8'), | ||
1658 | 355 | superfamily.encode('utf-8'), | ||
1659 | 356 | infraorder.encode('utf-8'), | ||
1660 | 357 | suborder.encode('utf-8'), | ||
1661 | 358 | order.encode('utf-8'), | ||
1662 | 359 | superorder.encode('utf-8'), | ||
1663 | 360 | subclass.encode('utf-8'), | ||
1664 | 361 | tclass.encode('utf-8'), | ||
1665 | 362 | subphylum.encode('utf-8'), | ||
1666 | 363 | phylum.encode('utf-8'), | ||
1667 | 364 | superphylum.encode('utf-8'), | ||
1668 | 365 | infrakingdom.encode('utf-8'), | ||
1669 | 366 | subkingdom.encode('utf-8'), | ||
1670 | 367 | kingdom.encode('utf-8'), | ||
1671 | 368 | provider.encode('utf-8')] | ||
1672 | 369 | writer.writerow(this_classification) | ||
1673 | 370 | |||
1678 | 371 | 817 | ||
1679 | 372 | def _uniquify(l): | 818 | def _uniquify(l): |
1680 | 373 | """ | 819 | """ |
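_uniquify builds a dict keyed by the list's elements and returns its keys, so the order of the result is arbitrary. Where a stable order would help - for instance keeping group labels in first-seen order between runs - an order-preserving variant is simple (a sketch, not part of the toolkit):

    def _uniquify_ordered(l):
        """Return the unique elements of l in the order they first appear."""
        seen = set()
        out = []
        for e in l:
            if e not in seen:
                seen.add(e)
                out.append(e)
        return out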
1681 | @@ -379,28 +825,119 @@ | |||
1682 | 379 | 825 | ||
1683 | 380 | return keys.keys() | 826 | return keys.keys() |
1684 | 381 | 827 | ||
1686 | 382 | def add_taxa(tree, new_taxa, taxa_in_clade): | 828 | def add_taxa(tree, new_taxa, taxa_in_clade, level): |
1687 | 383 | 829 | ||
1688 | 384 | # create new tree of the new taxa | 830 | # create new tree of the new taxa |
1691 | 385 | #tree_string = "(" + ",".join(new_taxa) + ");" | 831 | additionalTaxa = tree_from_taxonomy(level,new_taxa) |
1690 | 386 | #additionalTaxa = stk._parse_tree(tree_string) | ||
1692 | 387 | 832 | ||
1693 | 388 | # find mrca parent | 833 | # find mrca parent |
1694 | 389 | treeobj = stk._parse_tree(tree) | 834 | treeobj = stk._parse_tree(tree) |
1695 | 390 | mrca = stk.get_mrca(tree,taxa_in_clade) | 835 | mrca = stk.get_mrca(tree,taxa_in_clade) |
1707 | 391 | mrca_parent = treeobj.node(mrca).parent | 836 | if (mrca == 0): |
1708 | 392 | 837 | # we need to make a new tree! The additional taxa are being placed at the root of the tree | |
1709 | 393 | # insert a node into the tree between the MRCA and it's parent (p4.addNodeBetweenNodes) | 838 | t = Tree() |
1710 | 394 | newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent) | 839 | A = t.add_child() |
1711 | 395 | 840 | B = t.add_child() | |
1712 | 396 | # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None) | 841 | t1 = Tree(additionalTaxa) |
1713 | 397 | #treeobj.addSubTree(newNode, additionalTaxa) | 842 | t2 = Tree(tree) |
1714 | 398 | for t in new_taxa: | 843 | A.add_child(t1) |
1715 | 399 | treeobj.addSibLeaf(newNode,t) | 844 | B.add_child(t2) |
1716 | 400 | 845 | return t.write(format=9) | |
1717 | 401 | # return new tree | 846 | else: |
1718 | 847 | mrca = treeobj.nodes[mrca] | ||
1719 | 848 | additionalTaxa = stk._parse_tree(additionalTaxa) | ||
1720 | 849 | |||
1721 | 850 | if len(taxa_in_clade) == 1: | ||
1722 | 851 | taxon = treeobj.node(taxa_in_clade[0]) | ||
1723 | 852 | mrca = treeobj.addNodeBetweenNodes(taxon,mrca) | ||
1724 | 853 | |||
1725 | 854 | |||
1726 | 855 | # insert a node into the tree between the MRCA and its parent (p4.addNodeBetweenNodes) | ||
1727 | 856 | # newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent) | ||
1728 | 857 | |||
1729 | 858 | # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None) | ||
1730 | 859 | treeobj.addSubTree(mrca, additionalTaxa, ignoreRootAssert=True) | ||
1731 | 860 | |||
1732 | 402 | return treeobj.writeNewick(fName=None,toString=True).strip() | 861 | return treeobj.writeNewick(fName=None,toString=True).strip() |
1733 | 403 | 862 | ||
1734 | 863 | |||
1735 | 864 | |||
1736 | 865 | def tree_from_taxonomy(top_level, tree_taxonomy): | ||
1737 | 866 | |||
1738 | 867 | start_level = taxonomy_levels.index(top_level) | ||
1739 | 868 | new_taxa = tree_taxonomy.keys() | ||
1740 | 869 | |||
1741 | 870 | tl_types = [] | ||
1742 | 871 | for tt in tree_taxonomy: | ||
1743 | 872 | tl_types.append(tree_taxonomy[tt][top_level]) | ||
1744 | 873 | |||
1745 | 874 | tl_types = _uniquify(tl_types) | ||
1746 | 875 | levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1] | ||
1747 | 876 | |||
1748 | 877 | t = Tree() | ||
1749 | 878 | nodes = {} | ||
1750 | 879 | nodes[top_level] = [] | ||
1751 | 880 | for tl in tl_types: | ||
1752 | 881 | n = t.add_child(name=tl) | ||
1753 | 882 | nodes[top_level].append({tl:n}) | ||
1754 | 883 | |||
1755 | 884 | for l in levels_to_worry_about[-2::-1]: | ||
1756 | 885 | names = [] | ||
1757 | 886 | nodes[l] = [] | ||
1758 | 887 | ci = levels_to_worry_about.index(l) | ||
1759 | 888 | for tt in tree_taxonomy: | ||
1760 | 889 | try: | ||
1761 | 890 | names.append(tree_taxonomy[tt][l]) | ||
1762 | 891 | except KeyError: | ||
1763 | 892 | pass | ||
1764 | 893 | names = _uniquify(names) | ||
1765 | 894 | for n in names: | ||
1766 | 895 | # find my parent | ||
1767 | 896 | parent = None | ||
1768 | 897 | for tt in tree_taxonomy: | ||
1769 | 898 | try: | ||
1770 | 899 | if tree_taxonomy[tt][l] == n: | ||
1771 | 900 | try: | ||
1772 | 901 | parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]] | ||
1773 | 902 | level = ci+1 | ||
1774 | 903 | except KeyError: | ||
1775 | 904 | try: | ||
1776 | 905 | parent = tree_taxonomy[tt][levels_to_worry_about[ci+2]] | ||
1777 | 906 | level = ci+2 | ||
1778 | 907 | except KeyError: | ||
1779 | 908 | try: | ||
1780 | 909 | parent = tree_taxonomy[tt][levels_to_worry_about[ci+3]] | ||
1781 | 910 | level = ci+3 | ||
1782 | 911 | except KeyError: | ||
1783 | 912 | print "ERROR: tried to find some taxonomic info for "+tt+" from tree_taxonomy file/downloaded data and I went two levels up, but failed find any. Looked at:\n" | ||
1784 | 913 | print "\t"+levels_to_worry_about[ci+1] | ||
1785 | 914 | print "\t"+levels_to_worry_about[ci+2] | ||
1786 | 915 | print "\t"+levels_to_worry_about[ci+3] | ||
1787 | 916 | print "This is the taxonomy info I have for "+tt | ||
1788 | 917 | print tree_taxonomy[tt] | ||
1789 | 918 | sys.exit(1) | ||
1790 | 919 | |||
1791 | 920 | k = [] | ||
1792 | 921 | for nd in nodes[levels_to_worry_about[level]]: | ||
1793 | 922 | k.extend(nd.keys()) | ||
1794 | 923 | i = 0 | ||
1795 | 924 | for kk in k: | ||
1796 | 925 | if kk == parent: | ||
1797 | 926 | break | ||
1798 | 927 | i += 1 | ||
1799 | 928 | parent_id = i | ||
1800 | 929 | break | ||
1801 | 930 | except KeyError: | ||
1802 | 931 | pass # no data at this level for this beastie | ||
1803 | 932 | # find out where to attach it | ||
1804 | 933 | node_id = nodes[levels_to_worry_about[level]][parent_id][parent] | ||
1805 | 934 | nd = node_id.add_child(name=n.replace(" ","_")) | ||
1806 | 935 | nodes[l].append({n:nd}) | ||
1807 | 936 | |||
1808 | 937 | tree = t.write(format=9) | ||
1809 | 938 | |||
1810 | 939 | return tree | ||
1811 | 940 | |||
1812 | 404 | if __name__ == "__main__": | 941 | if __name__ == "__main__": |
1813 | 405 | main() | 942 | main() |
1814 | 406 | 943 | ||
1815 | 407 | 944 | ||
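tree_from_taxonomy above grows the scaffold level by level with ete2, looking up each name's parent one, two or three ranks up when the taxonomy has gaps. Stripped of that gap handling, the core of turning parent/child relationships into a Newick scaffold looks like this (a simplified sketch using the same ete2 dependency; the example mapping is made up):

    from ete2 import Tree

    def tree_from_parent_map(children_of, root):
        """Build a tree from a {parent: [children]} mapping and return Newick."""
        t = Tree()
        t.name = root
        stack = [(root, t)]
        while stack:
            name, node = stack.pop()
            for child in children_of.get(name, []):
                stack.append((child, node.add_child(name=child)))
        return t.write(format=9)

    # e.g. tree_from_parent_map({'Decapoda': ['Brachyura', 'Anomura'],
    #                            'Brachyura': ['Cancer_pagurus']}, 'Decapoda')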
1816 | === modified file 'stk/scripts/plot_character_taxa_matrix.py' | |||
1817 | --- stk/scripts/plot_character_taxa_matrix.py 2014-12-10 08:55:43 +0000 | |||
1818 | +++ stk/scripts/plot_character_taxa_matrix.py 2017-01-12 09:27:31 +0000 | |||
1819 | @@ -42,6 +42,18 @@ | |||
1820 | 42 | default=False | 42 | default=False |
1821 | 43 | ) | 43 | ) |
1822 | 44 | parser.add_argument( | 44 | parser.add_argument( |
1823 | 45 | '-t', | ||
1824 | 46 | '--taxonomy', | ||
1825 | 47 | help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file", | ||
1826 | 48 | ) | ||
1827 | 49 | parser.add_argument( | ||
1828 | 50 | '--level', | ||
1829 | 51 | choices=['family','superfamily','infraorder','suborder','order'], | ||
1830 | 52 | default='family', | ||
1831 | 53 | help="""What level to group the taxonomy at. Default is family. | ||
1832 | 54 | Note data for a particular level may be missing in taxonomy.""" | ||
1833 | 55 | ) | ||
1834 | 56 | parser.add_argument( | ||
1835 | 45 | 'input_file', | 57 | 'input_file', |
1836 | 46 | metavar='input_file', | 58 | metavar='input_file', |
1837 | 47 | nargs=1, | 59 | nargs=1, |
1838 | @@ -59,14 +71,58 @@ | |||
1839 | 59 | verbose = args.verbose | 71 | verbose = args.verbose |
1840 | 60 | input_file = args.input_file[0] | 72 | input_file = args.input_file[0] |
1841 | 61 | output_file = args.output_file[0] | 73 | output_file = args.output_file[0] |
1842 | 74 | taxonomy = args.taxonomy | ||
1843 | 75 | level = args.level | ||
1844 | 62 | 76 | ||
1845 | 63 | XML = stk.load_phyml(input_file) | 77 | XML = stk.load_phyml(input_file) |
1846 | 78 | if not taxonomy == None: | ||
1847 | 79 | taxonomy = stk.load_taxonomy(taxonomy) | ||
1848 | 80 | |||
1849 | 64 | all_taxa = stk.get_all_taxa(XML) | 81 | all_taxa = stk.get_all_taxa(XML) |
1850 | 65 | all_chars_d = stk.get_all_characters(XML) | 82 | all_chars_d = stk.get_all_characters(XML) |
1851 | 66 | all_chars = [] | 83 | all_chars = [] |
1852 | 67 | for c in all_chars_d: | 84 | for c in all_chars_d: |
1853 | 68 | all_chars.extend(all_chars_d[c]) | 85 | all_chars.extend(all_chars_d[c]) |
1854 | 69 | 86 | ||
1855 | 87 | if not taxonomy == None: | ||
1856 | 88 | tax_data = {} | ||
1857 | 89 | new_all_taxa = [] | ||
1858 | 90 | for t in all_taxa: | ||
1859 | 91 | taxon = t.replace("_"," ") | ||
1860 | 92 | try: | ||
1861 | 93 | if taxonomy[taxon][level] == "": | ||
1862 | 94 | # skip this | ||
1863 | 95 | continue | ||
1864 | 96 | tax_data[t] = taxonomy[taxon][level] | ||
1865 | 97 | except KeyError: | ||
1866 | 98 | print "Couldn't find "+t+" in taxonomy. Adding as null data" | ||
1867 | 99 | tax_data[t] = 'zzzzz' # it's at the end... | ||
1868 | 100 | |||
1869 | 101 | from sets import Set | ||
1870 | 102 | unique = set(tax_data.values()) | ||
1871 | 103 | unique = list(unique) | ||
1872 | 104 | unique.sort() | ||
1873 | 105 | print "Groups are:" | ||
1874 | 106 | print unique | ||
1875 | 107 | counts = [] | ||
1876 | 108 | for u in unique: | ||
1877 | 109 | count = 0 | ||
1878 | 110 | for t in tax_data: | ||
1879 | 111 | if tax_data[t] == u: | ||
1880 | 112 | count += 1 | ||
1881 | 113 | new_all_taxa.append(t) | ||
1882 | 114 | counts.append(count) | ||
1883 | 115 | |||
1884 | 116 | all_taxa = new_all_taxa | ||
1885 | 117 | # cumulate counts | ||
1886 | 118 | count_cumulate = [] | ||
1887 | 119 | count_cumulate.append(counts[0]) | ||
1888 | 120 | for c in counts[1::]: | ||
1889 | 121 | count_cumulate.append(c+count_cumulate[-1]) | ||
1890 | 122 | |||
1891 | 123 | print count_cumulate | ||
1892 | 124 | |||
1893 | 125 | |||
1894 | 70 | taxa_character_matrix = {} | 126 | taxa_character_matrix = {} |
1895 | 71 | for t in all_taxa: | 127 | for t in all_taxa: |
1896 | 72 | taxa_character_matrix[t] = [] | 128 | taxa_character_matrix[t] = [] |
1897 | @@ -77,7 +133,8 @@ | |||
1898 | 77 | taxa = stk.get_taxa_from_tree(XML,t, sort=True) | 133 | taxa = stk.get_taxa_from_tree(XML,t, sort=True) |
1899 | 78 | for taxon in taxa: | 134 | for taxon in taxa: |
1900 | 79 | taxon = taxon.replace(" ","_") | 135 | taxon = taxon.replace(" ","_") |
1902 | 80 | taxa_character_matrix[taxon].extend(chars) | 136 | if taxon in all_taxa: |
1903 | 137 | taxa_character_matrix[taxon].extend(chars) | ||
1904 | 81 | 138 | ||
1905 | 82 | for t in taxa_character_matrix: | 139 | for t in taxa_character_matrix: |
1906 | 83 | array = taxa_character_matrix[t] | 140 | array = taxa_character_matrix[t] |
1907 | @@ -92,6 +149,31 @@ | |||
1908 | 92 | x.append(i) | 149 | x.append(i) |
1909 | 93 | y.append(j) | 150 | y.append(j) |
1910 | 94 | 151 | ||
1911 | 152 | |||
1912 | 153 | i = 0 | ||
1913 | 154 | for j in all_chars: | ||
1914 | 155 | # do a substitution of character names to tidy things up | ||
1915 | 156 | if j.lower().startswith('mitochondrial carrier; adenine nucleotide translocator'): | ||
1916 | 157 | j = "ANT" | ||
1917 | 158 | if j.lower().startswith('mitochondrially encoded 12s'): | ||
1918 | 159 | j = '12S' | ||
1919 | 160 | if j.lower().startswith('complete mitochondrial genome'): | ||
1920 | 161 | j = 'Mitogenome' | ||
1921 | 162 | if j.lower().startswith('mtdna'): | ||
1922 | 163 | j = "mtDNA restriction sites" | ||
1923 | 164 | if j.lower().startswith('h3 histone'): | ||
1924 | 165 | j = 'H3' | ||
1925 | 166 | if j.lower().startswith('mitochondrially encoded cytochrome'): | ||
1926 | 167 | j = 'COI' | ||
1927 | 168 | if j.lower().startswith('rna, 28s'): | ||
1928 | 169 | j = '28S' | ||
1929 | 170 | if j.lower().startswith('rna, 18s'): | ||
1930 | 171 | j = '18S' | ||
1931 | 172 | if j.lower().startswith('mitochondrially encoded 16s'): | ||
1932 | 173 | j = '16S' | ||
1933 | 174 | all_chars[i] = j | ||
1934 | 175 | i += 1 | ||
1935 | 176 | |||
1936 | 95 | fig=figure(figsize=(22,17),dpi=90) | 177 | fig=figure(figsize=(22,17),dpi=90) |
1937 | 96 | fig.subplots_adjust(left=0.3) | 178 | fig.subplots_adjust(left=0.3) |
1938 | 97 | ax = fig.add_subplot(1,1,1) | 179 | ax = fig.add_subplot(1,1,1) |
1939 | 98 | 180 | ||
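The run of if-statements above maps verbose character names onto short axis labels. The same mapping can be kept as a prefix table, which is easier to extend when new character names turn up (an equivalent sketch, not a change to the script; the prefixes are the ones tested above):

    CHAR_LABELS = [
        ('mitochondrial carrier; adenine nucleotide translocator', 'ANT'),
        ('mitochondrially encoded 12s', '12S'),
        ('complete mitochondrial genome', 'Mitogenome'),
        ('mtdna', 'mtDNA restriction sites'),
        ('h3 histone', 'H3'),
        ('mitochondrially encoded cytochrome', 'COI'),
        ('rna, 28s', '28S'),
        ('rna, 18s', '18S'),
        ('mitochondrially encoded 16s', '16S'),
    ]

    def short_label(character):
        """Return the short label for the first matching prefix, else the name."""
        for prefix, label in CHAR_LABELS:
            if character.lower().startswith(prefix):
                return label
        return character

    all_chars = [short_label(c) for c in all_chars]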
1940 | === modified file 'stk/scripts/plot_tree_taxa_matrix.py' | |||
1941 | --- stk/scripts/plot_tree_taxa_matrix.py 2014-12-10 08:55:43 +0000 | |||
1942 | +++ stk/scripts/plot_tree_taxa_matrix.py 2017-01-12 09:27:31 +0000 | |||
1943 | @@ -43,6 +43,18 @@ | |||
1944 | 43 | default=False | 43 | default=False |
1945 | 44 | ) | 44 | ) |
1946 | 45 | parser.add_argument( | 45 | parser.add_argument( |
1947 | 46 | '-t', | ||
1948 | 47 | '--taxonomy', | ||
1949 | 48 | help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file", | ||
1950 | 49 | ) | ||
1951 | 50 | parser.add_argument( | ||
1952 | 51 | '--level', | ||
1953 | 52 | choices=['family','superfamily','infraorder','suborder','order'], | ||
1954 | 53 | default='family', | ||
1955 | 54 | help="""What level to group the taxonomy at. Default is family. | ||
1956 | 55 | Note data for a particular level may be missing in taxonomy.""" | ||
1957 | 56 | ) | ||
1958 | 57 | parser.add_argument( | ||
1959 | 46 | 'input_file', | 58 | 'input_file', |
1960 | 47 | metavar='input_file', | 59 | metavar='input_file', |
1961 | 48 | nargs=1, | 60 | nargs=1, |
1962 | @@ -60,13 +72,57 @@ | |||
1963 | 60 | verbose = args.verbose | 72 | verbose = args.verbose |
1964 | 61 | input_file = args.input_file[0] | 73 | input_file = args.input_file[0] |
1965 | 62 | output_file = args.output_file[0] | 74 | output_file = args.output_file[0] |
1966 | 75 | taxonomy = args.taxonomy | ||
1967 | 76 | level = args.level | ||
1968 | 63 | 77 | ||
1969 | 64 | XML = stk.load_phyml(input_file) | 78 | XML = stk.load_phyml(input_file) |
1970 | 79 | if not taxonomy == None: | ||
1971 | 80 | taxonomy = stk.load_taxonomy(taxonomy) | ||
1972 | 81 | |||
1973 | 65 | all_taxa = stk.get_all_taxa(XML) | 82 | all_taxa = stk.get_all_taxa(XML) |
1974 | 66 | 83 | ||
1975 | 67 | taxa_tree_matrix = {} | 84 | taxa_tree_matrix = {} |
1976 | 68 | for t in all_taxa: | 85 | for t in all_taxa: |
1977 | 69 | taxa_tree_matrix[t] = [] | 86 | taxa_tree_matrix[t] = [] |
1978 | 87 | |||
1979 | 88 | if not taxonomy == None: | ||
1980 | 89 | tax_data = {} | ||
1981 | 90 | new_all_taxa = [] | ||
1982 | 91 | for t in all_taxa: | ||
1983 | 92 | taxon = t.replace("_"," ") | ||
1984 | 93 | try: | ||
1985 | 94 | if taxonomy[taxon][level] == "": | ||
1986 | 95 | # skip this | ||
1987 | 96 | continue | ||
1988 | 97 | tax_data[t] = taxonomy[taxon][level] | ||
1989 | 98 | except KeyError: | ||
1990 | 99 | print "Couldn't find "+t+" in taxonomy. Adding as null data" | ||
1991 | 100 | tax_data[t] = 'zzzzz' # it's at the end... | ||
1992 | 101 | |||
1993 | 102 | from sets import Set | ||
1994 | 103 | unique = set(tax_data.values()) | ||
1995 | 104 | unique = list(unique) | ||
1996 | 105 | unique.sort() | ||
1997 | 106 | print "Groups are:" | ||
1998 | 107 | print unique | ||
1999 | 108 | counts = [] | ||
2000 | 109 | for u in unique: | ||
2001 | 110 | count = 0 | ||
2002 | 111 | for t in tax_data: | ||
2003 | 112 | if tax_data[t] == u: | ||
2004 | 113 | count += 1 | ||
2005 | 114 | new_all_taxa.append(t) | ||
2006 | 115 | counts.append(count) | ||
2007 | 116 | |||
2008 | 117 | all_taxa = new_all_taxa | ||
2009 | 118 | # cumulate counts | ||
2010 | 119 | count_cumulate = [] | ||
2011 | 120 | count_cumulate.append(counts[0]) | ||
2012 | 121 | for c in counts[1::]: | ||
2013 | 122 | count_cumulate.append(c+count_cumulate[-1]) | ||
2014 | 123 | |||
2015 | 124 | print count_cumulate | ||
2016 | 125 | |||
2017 | 70 | 126 | ||
2018 | 71 | trees = stk.obtain_trees(XML) | 127 | trees = stk.obtain_trees(XML) |
2019 | 72 | i = 0 | 128 | i = 0 |
2020 | 73 | 129 | ||
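Both plotting scripts share the block above: taxa are bucketed by their group at the chosen taxonomy level, taxa missing from the taxonomy are pushed to the end under a dummy 'zzzzz' group, and the per-group counts are accumulated so group boundaries can be drawn on the axis. The same transform as one function (a sketch of the shared logic, not a drop-in replacement):

    def group_and_cumulate(all_taxa, taxonomy, level):
        """Return taxa ordered by group plus the running group totals."""
        group_of = {}
        for t in all_taxa:
            entry = taxonomy.get(t.replace("_", " "))
            if entry is None:
                group_of[t] = 'zzzzz'              # not in the taxonomy: sorts last
            elif entry.get(level):
                group_of[t] = entry[level]         # empty at this level: taxon dropped
        ordered, cumulative, running = [], [], 0
        for group in sorted(set(group_of.values())):
            members = [t for t in group_of if group_of[t] == group]
            ordered.extend(members)
            running += len(members)
            cumulative.append(running)
        return ordered, cumulative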
2021 | === modified file 'stk/scripts/remove_poorly_constrained_taxa.py' | |||
2022 | --- stk/scripts/remove_poorly_constrained_taxa.py 2014-04-18 11:57:14 +0000 | |||
2023 | +++ stk/scripts/remove_poorly_constrained_taxa.py 2017-01-12 09:27:31 +0000 | |||
2024 | @@ -12,8 +12,8 @@ | |||
2025 | 12 | 12 | ||
2026 | 13 | # do stuff | 13 | # do stuff |
2027 | 14 | parser = argparse.ArgumentParser( | 14 | parser = argparse.ArgumentParser( |
2030 | 15 | prog="convert tree from specific to generic", | 15 | prog="remove poorly constrained taxa",
2031 | 16 | description="""Converts a tree at specific level to generic level""", | 16 | description="""Remove taxa that appear in one source tree only.""",
2032 | 17 | ) | 17 | ) |
2033 | 18 | parser.add_argument( | 18 | parser.add_argument( |
2034 | 19 | '-v', | 19 | '-v', |
2035 | @@ -34,6 +34,13 @@ | |||
2036 | 34 | " to removal those in polytomies *and* only in one other tree." | 34 | " to removal those in polytomies *and* only in one other tree." |
2037 | 35 | ) | 35 | ) |
2038 | 36 | parser.add_argument( | 36 | parser.add_argument( |
2039 | 37 | '--tree_only', | ||
2040 | 38 | default=False, | ||
2041 | 39 | action='store_true', | ||
2042 | 40 | help="Restrict removal of taxa that only occur in one source tree. Default"+ | ||
2043 | 41 | " to removal those in polytomies *and* only in one other tree." | ||
2044 | 42 | ) | ||
2045 | 43 | parser.add_argument( | ||
2046 | 37 | 'input_phyml', | 44 | 'input_phyml', |
2047 | 38 | metavar='input_phyml', | 45 | metavar='input_phyml', |
2048 | 39 | nargs=1, | 46 | nargs=1, |
2049 | @@ -43,13 +50,13 @@ | |||
2050 | 43 | 'input_tree', | 50 | 'input_tree', |
2051 | 44 | metavar='input_tree', | 51 | metavar='input_tree', |
2052 | 45 | nargs=1, | 52 | nargs=1, |
2054 | 46 | help="Your tree" | 53 | help="Your tree - can be NULL or None" |
2055 | 47 | ) | 54 | ) |
2056 | 48 | parser.add_argument( | 55 | parser.add_argument( |
2057 | 49 | 'output_tree', | 56 | 'output_tree', |
2058 | 50 | metavar='output_tree', | 57 | metavar='output_tree', |
2059 | 51 | nargs=1, | 58 | nargs=1, |
2061 | 52 | help="Your output tree" | 59 | help="Your output tree or phyml - if input_tree is none, this is the Phyml" |
2062 | 53 | ) | 60 | ) |
2063 | 54 | 61 | ||
2064 | 55 | 62 | ||
2065 | @@ -62,14 +69,20 @@ | |||
2066 | 62 | dl = True | 69 | dl = True |
2067 | 63 | poly_only = args.poly_only | 70 | poly_only = args.poly_only |
2068 | 64 | input_tree = args.input_tree[0] | 71 | input_tree = args.input_tree[0] |
2070 | 65 | output_tree = args.output_tree[0] | 72 | if input_tree == 'NULL' or input_tree == 'None': |
2071 | 73 | input_tree = None | ||
2072 | 74 | output_file = args.output_tree[0] | ||
2073 | 66 | input_phyml = args.input_phyml[0] | 75 | input_phyml = args.input_phyml[0] |
2074 | 67 | 76 | ||
2075 | 68 | XML = stk.load_phyml(input_phyml) | 77 | XML = stk.load_phyml(input_phyml) |
2076 | 69 | # load tree | 78 | # load tree |
2078 | 70 | supertree = stk.import_tree(input_tree) | 79 | if (not input_tree == None): |
2079 | 80 | supertree = stk.import_tree(input_tree) | ||
2080 | 81 | taxa = stk._getTaxaFromNewick(supertree) | ||
2081 | 82 | else: | ||
2082 | 83 | supertree = None | ||
2083 | 84 | taxa = stk.get_all_taxa(XML) | ||
2084 | 71 | # grab taxa | 85 | # grab taxa |
2085 | 72 | taxa = stk._getTaxaFromNewick(supertree) | ||
2086 | 73 | delete_list = [] | 86 | delete_list = [] |
2087 | 74 | 87 | ||
2088 | 75 | # loop over taxa in supertree and get some stats | 88 | # loop over taxa in supertree and get some stats |
2089 | @@ -115,19 +128,29 @@ | |||
2090 | 115 | 128 | ||
2091 | 116 | print "Taxa: "+str(len(taxa)) | 129 | print "Taxa: "+str(len(taxa)) |
2092 | 117 | print "Deleting: "+str(len(delete_list)) | 130 | print "Deleting: "+str(len(delete_list)) |
2106 | 118 | # done, so delete the problem taxa from the supertree | 131 | |
2107 | 119 | for t in delete_list: | 132 | if not supertree == None: |
2108 | 120 | # remove taxa from supertree | 133 | # done, so delete the problem taxa from the supertree |
2109 | 121 | supertree = stk._sub_taxa_in_tree(supertree,t) | 134 | for t in delete_list: |
2110 | 122 | 135 | # remove taxa from supertree | |
2111 | 123 | # save supertree | 136 | supertree = stk._sub_taxa_in_tree(supertree,t) |
2112 | 124 | tree = {} | 137 | |
2113 | 125 | tree['Tree_1'] = supertree | 138 | # save supertree |
2114 | 126 | output = stk._amalgamate_trees(tree,format='nexus') | 139 | tree = {} |
2115 | 127 | # write file | 140 | tree['Tree_1'] = supertree |
2116 | 128 | f = open(output_tree,"w") | 141 | output = stk._amalgamate_trees(tree,format='nexus') |
2117 | 129 | f.write(output) | 142 | # write file |
2118 | 130 | f.close() | 143 | f = open(output_file,"w") |
2119 | 144 | f.write(output) | ||
2120 | 145 | f.close() | ||
2121 | 146 | else: | ||
2122 | 147 | new_phyml = stk.substitute_taxa(XML,delete_list) | ||
2123 | 148 | # write file | ||
2124 | 149 | f = open(output_file,"w") | ||
2125 | 150 | f.write(new_phyml) | ||
2126 | 151 | f.close() | ||
2127 | 152 | |||
2128 | 153 | |||
2129 | 131 | 154 | ||
2130 | 132 | if (dl): | 155 | if (dl): |
2131 | 133 | # write file | 156 | # write file |
2132 | 134 | 157 | ||
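With the changes above the script has two output modes: prune the supplied supertree and write it as nexus, or, when the tree argument is NULL/None, write a Phyml with the poorly constrained taxa substituted out. The branch condensed into one helper (a sketch using the same stk calls as the diff):

    def write_result(delete_list, supertree, XML, output_file):
        """Prune either the supertree or the Phyml and write the result."""
        if supertree is not None:
            for t in delete_list:
                supertree = stk._sub_taxa_in_tree(supertree, t)
            output = stk._amalgamate_trees({'Tree_1': supertree}, format='nexus')
        else:
            output = stk.substitute_taxa(XML, delete_list)
        with open(output_file, "w") as f:
            f.write(output)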
2133 | === added file 'stk/scripts/tree_from_taxonomy.py' | |||
2134 | --- stk/scripts/tree_from_taxonomy.py 1970-01-01 00:00:00 +0000 | |||
2135 | +++ stk/scripts/tree_from_taxonomy.py 2017-01-12 09:27:31 +0000 | |||
2136 | @@ -0,0 +1,142 @@ | |||
2137 | 1 | # trees ready for supertree construction. | ||
2138 | 2 | # Copyright (C) 2015, Jon Hill, Katie Davis | ||
2139 | 3 | # | ||
2140 | 4 | # This program is free software: you can redistribute it and/or modify | ||
2141 | 5 | # it under the terms of the GNU General Public License as published by | ||
2142 | 6 | # the Free Software Foundation, either version 3 of the License, or | ||
2143 | 7 | # (at your option) any later version. | ||
2144 | 8 | # | ||
2145 | 9 | # This program is distributed in the hope that it will be useful, | ||
2146 | 10 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
2147 | 11 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
2148 | 12 | # GNU General Public License for more details. | ||
2149 | 13 | # | ||
2150 | 14 | # You should have received a copy of the GNU General Public License | ||
2151 | 15 | # along with this program. If not, see <http://www.gnu.org/licenses/>. | ||
2152 | 16 | # | ||
2153 | 17 | # Jon Hill. jon.hill@york.ac.uk | ||
2154 | 18 | |||
2155 | 19 | import argparse | ||
2156 | 20 | import copy | ||
2157 | 21 | import os | ||
2158 | 22 | import sys | ||
2159 | 23 | stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir ) | ||
2160 | 24 | sys.path.insert(0, stk_path) | ||
2161 | 25 | import supertree_toolkit as stk | ||
2162 | 26 | import csv | ||
2163 | 27 | from ete2 import Tree | ||
2164 | 28 | |||
2165 | 29 | taxonomy_levels = ['species','subgenus','genus','subfamily','family','superfamily','subsection','section','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom'] | ||
2166 | 30 | tlevels = ['species','genus','family','order','class','phylum','kingdom'] | ||
2167 | 31 | |||
2168 | 32 | |||
2169 | 33 | def main(): | ||
2170 | 34 | |||
2171 | 35 | # do stuff | ||
2172 | 36 | parser = argparse.ArgumentParser( | ||
2173 | 37 | prog="create a tree from a taxonomy file", | ||
2174 | 38 | description="Create a taxonomic tree", | ||
2175 | 39 | ) | ||
2176 | 40 | parser.add_argument( | ||
2177 | 41 | '-v', | ||
2178 | 42 | '--verbose', | ||
2179 | 43 | action='store_true', | ||
2180 | 44 | help="Verbose output: mainly progress reports.", | ||
2181 | 45 | default=False | ||
2182 | 46 | ) | ||
2183 | 47 | parser.add_argument( | ||
2184 | 48 | 'top_level', | ||
2185 | 49 | nargs=1, | ||
2186 | 50 | help="The top level group to start with, e.g. family" | ||
2187 | 51 | ) | ||
2188 | 52 | parser.add_argument( | ||
2189 | 53 | 'input_file', | ||
2190 | 54 | metavar='input_file', | ||
2191 | 55 | nargs=1, | ||
2192 | 56 | help="Your taxonomy file" | ||
2193 | 57 | ) | ||
2194 | 58 | parser.add_argument( | ||
2195 | 59 | 'output_file', | ||
2196 | 60 | metavar='output_file', | ||
2197 | 61 | nargs=1, | ||
2198 | 62 | help="Your new tree file" | ||
2199 | 63 | ) | ||
2200 | 64 | |||
2201 | 65 | args = parser.parse_args() | ||
2202 | 66 | verbose = args.verbose | ||
2203 | 67 | input_file = args.input_file[0] | ||
2204 | 68 | output_file = args.output_file[0] | ||
2205 | 69 | top_level = args.top_level[0] | ||
2206 | 70 | |||
2207 | 71 | start_level = taxonomy_levels.index(top_level) | ||
2208 | 72 | tree_taxonomy = stk.load_taxonomy(input_file) | ||
2209 | 73 | new_taxa = tree_taxonomy.keys() | ||
2210 | 74 | |||
2211 | 75 | tl_types = [] | ||
2212 | 76 | for tt in tree_taxonomy: | ||
2213 | 77 | tl_types.append(tree_taxonomy[tt][top_level]) | ||
2214 | 78 | |||
2215 | 79 | tl_types = _uniquify(tl_types) | ||
2216 | 80 | levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1] | ||
2217 | 81 | |||
2218 | 82 | #print levels_to_worry_about[-2::-1] | ||
2219 | 83 | |||
2220 | 84 | t = Tree() | ||
2221 | 85 | nodes = {} | ||
2222 | 86 | nodes[top_level] = [] | ||
2223 | 87 | for tl in tl_types: | ||
2224 | 88 | n = t.add_child(name=tl) | ||
2225 | 89 | nodes[top_level].append({tl:n}) | ||
2226 | 90 | |||
2227 | 91 | for l in levels_to_worry_about[-2::-1]: | ||
2228 | 92 | #print t | ||
2229 | 93 | names = [] | ||
2230 | 94 | nodes[l] = [] | ||
2231 | 95 | ci = levels_to_worry_about.index(l) | ||
2232 | 96 | for tt in tree_taxonomy: | ||
2233 | 97 | names.append(tree_taxonomy[tt][l]) | ||
2234 | 98 | names = _uniquify(names) | ||
2235 | 99 | for n in names: | ||
2236 | 100 | #print n | ||
2237 | 101 | # find my parent | ||
2238 | 102 | parent = None | ||
2239 | 103 | for tt in tree_taxonomy: | ||
2240 | 104 | if tree_taxonomy[tt][l] == n: | ||
2241 | 105 | parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]] | ||
2242 | 106 | k = [] | ||
2243 | 107 | for nd in nodes[levels_to_worry_about[ci+1]]: | ||
2244 | 108 | k.extend(nd.keys()) | ||
2245 | 109 | i = 0 | ||
2246 | 110 | for kk in k: | ||
2247 | 111 | print kk | ||
2248 | 112 | if kk == parent: | ||
2249 | 113 | break | ||
2250 | 114 | i += 1 | ||
2251 | 115 | parent_id = i | ||
2252 | 116 | break | ||
2253 | 117 | # find out where to attach it | ||
2254 | 118 | node_id = nodes[levels_to_worry_about[ci+1]][parent_id][parent] | ||
2255 | 119 | nd = node_id.add_child(name=n.replace(" ","_")) | ||
2256 | 120 | nodes[l].append({n:nd}) | ||
2257 | 121 | |||
2258 | 122 | tree = t.write(format=9) | ||
2259 | 123 | tree = stk._collapse_nodes(tree) | ||
2260 | 124 | tree = stk._collapse_nodes(tree) | ||
2261 | 125 | print tree | ||
2262 | 126 | |||
2263 | 127 | |||
2264 | 128 | def _uniquify(l): | ||
2265 | 129 | """ | ||
2266 | 130 | Make a list, l, contain only unique data | ||
2267 | 131 | """ | ||
2268 | 132 | keys = {} | ||
2269 | 133 | for e in l: | ||
2270 | 134 | keys[e] = 1 | ||
2271 | 135 | |||
2272 | 136 | return keys.keys() | ||
2273 | 137 | |||
2274 | 138 | if __name__ == "__main__": | ||
2275 | 139 | main() | ||
2276 | 140 | |||
2277 | 141 | |||
2278 | 142 | |||
2279 | 0 | 143 | ||
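The level walk in tree_from_taxonomy.py hinges on two slices: tlevels[0:tlevels.index(top_level)+1] keeps every rank up to and including the chosen top level, and [-2::-1] then visits those ranks from the second-highest down to species. A worked example with the tlevels list defined above:

    tlevels = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']
    top_level = 'order'
    levels_to_worry_about = tlevels[0:tlevels.index(top_level) + 1]
    # -> ['species', 'genus', 'family', 'order']
    print levels_to_worry_about[-2::-1]
    # -> ['family', 'genus', 'species']
    # i.e. attach families under the order nodes first, then genera under
    # families, then species under genera.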
2280 | === modified file 'stk/stk' | |||
2281 | --- stk/stk 2014-12-09 10:58:48 +0000 | |||
2282 | +++ stk/stk 2017-01-12 09:27:31 +0000 | |||
2283 | @@ -23,6 +23,7 @@ | |||
2284 | 23 | import sys | 23 | import sys |
2285 | 24 | import argparse | 24 | import argparse |
2286 | 25 | import traceback | 25 | import traceback |
2287 | 26 | import time | ||
2288 | 26 | try: | 27 | try: |
2289 | 27 | __file__ | 28 | __file__ |
2290 | 28 | except NameError: | 29 | except NameError: |
2291 | @@ -41,6 +42,10 @@ | |||
2292 | 41 | import string | 42 | import string |
2293 | 42 | import stk.p4 as p4 | 43 | import stk.p4 as p4 |
2294 | 43 | import lxml | 44 | import lxml |
2295 | 45 | import csv | ||
2296 | 46 | import tempfile | ||
2297 | 47 | from subprocess import check_call, CalledProcessError, call | ||
2298 | 48 | |||
2299 | 44 | import stk.bzr_version as bzr_version | 49 | import stk.bzr_version as bzr_version |
2300 | 45 | d = bzr_version.version_info | 50 | d = bzr_version.version_info |
2301 | 46 | build = d.get('revno','<unknown revno>') | 51 | build = d.get('revno','<unknown revno>') |
2302 | @@ -366,7 +371,7 @@ | |||
2303 | 366 | 371 | ||
2304 | 367 | # Clean data | 372 | # Clean data |
2305 | 368 | parser_cm = subparsers.add_parser('clean_data', | 373 | parser_cm = subparsers.add_parser('clean_data', |
2307 | 369 | help='Remove errant taxa, uninformative trees and empty sources.' | 374 | help='Renames all sources and trees sensibly. Removes errant taxa, uninformative trees and empty sources.' |
2308 | 370 | ) | 375 | ) |
2309 | 371 | parser_cm.add_argument('input', | 376 | parser_cm.add_argument('input', |
2310 | 372 | help='The input phyml file') | 377 | help='The input phyml file') |
2311 | @@ -488,7 +493,81 @@ | |||
2312 | 488 | parser_cm.add_argument('subs', | 493 | parser_cm.add_argument('subs', |
2313 | 489 | help='The subs file') | 494 | help='The subs file') |
2314 | 490 | parser_cm.set_defaults(func=check_subs) | 495 | parser_cm.set_defaults(func=check_subs) |
2316 | 491 | 496 | ||
2317 | 497 | # taxonomic name checker | ||
2318 | 498 | parser_cm = subparsers.add_parser('check_otus', | ||
2319 | 499 | help='Check your OTUs against EoL.' | ||
2320 | 500 | ) | ||
2321 | 501 | parser_cm.add_argument('input', | ||
2322 | 502 | help='The input Phyml. Also accepts tree files or a simple list') | ||
2323 | 503 | parser_cm.add_argument('output', | ||
2324 | 504 | help='The output CSV file. Taxon, synonyms, status') | ||
2325 | 505 | parser_cm.add_argument('--overwrite', | ||
2326 | 506 | action='store_true', | ||
2327 | 507 | default=False, | ||
2328 | 508 | help="Overwrite the existing file without asking for confirmation") | ||
2329 | 509 | parser_cm.set_defaults(func=check_otus) | ||
2330 | 510 | |||
2331 | 511 | # create taxonomy csv file | ||
2332 | 512 | parser_cm = subparsers.add_parser('create_taxonomy', | ||
2333 | 513 | help='Create a taxonomy file in CSV for you to then augment.' | ||
2334 | 514 | ) | ||
2335 | 515 | parser_cm.add_argument('input', | ||
2336 | 516 | help='The input Phyml. Also accepts tree files or a simple list') | ||
2337 | 517 | parser_cm.add_argument('output', | ||
2338 | 518 | help='The output CSV file. Name, followed by classification and source') | ||
2339 | 519 | parser_cm.add_argument('--overwrite', | ||
2340 | 520 | action='store_true', | ||
2341 | 521 | default=False, | ||
2342 | 522 | help="Overwrite the existing file without asking for confirmation") | ||
2343 | 523 | parser_cm.add_argument('--taxonomy', | ||
2344 | 524 | help="Give a starting taxonomy file, e.g. one you ran earlier",) | ||
2345 | 525 | parser_cm.set_defaults(func=create_taxonomy) | ||
2346 | 526 | |||
2347 | 527 | |||
2348 | 528 | # do the subs in one go using taxonomy | ||
2349 | 529 | parser_cm = subparsers.add_parser('auto_subs', | ||
2350 | 530 | help='Using a taxonomy, generate a species level version of your data in one go.' | ||
2351 | 531 | ) | ||
2352 | 532 | parser_cm.add_argument('input', | ||
2353 | 533 | help='The input Phyml') | ||
2354 | 534 | parser_cm.add_argument('taxonomy', | ||
2355 | 535 | help='Your taxonomy file', | ||
2356 | 536 | ) | ||
2357 | 537 | parser_cm.add_argument('output', | ||
2358 | 538 | help='The output phyml') | ||
2359 | 539 | parser_cm.add_argument('--overwrite', | ||
2360 | 540 | action='store_true', | ||
2361 | 541 | default=False, | ||
2362 | 542 | help="Overwrite the existing file without asking for confirmation") | ||
2363 | 543 | #parser_cm.add_argument('--level', | ||
2364 | 544 | # choices=supertree_toolkit.taxonomy_levels, | ||
2365 | 545 | # help="Taxonomic level to output at",) | ||
2366 | 546 | parser_cm.set_defaults(func=auto_subs) | ||
2367 | 547 | |||
2368 | 548 | |||
2369 | 549 | # attempt to process the data into a matrix all automatically | ||
2370 | 550 | parser_cm = subparsers.add_parser('process', | ||
2371 | 551 | help='Generate a species-level matrix, and do all the checks and processing automatically. Note this creates a taxonomy and does all the processing, but will not be perfect (as taxonomies are not perfect)' | ||
2372 | 552 | ) | ||
2373 | 553 | parser_cm.add_argument('input', | ||
2374 | 554 | help='The input Phyml') | ||
2375 | 555 | parser_cm.add_argument('output', | ||
2376 | 556 | help='The output matrix') | ||
2377 | 557 | parser_cm.add_argument('--taxonomy_file', | ||
2378 | 558 | help='Existing taxonomy file to prevent redownloading data. Any taxa not in the file will be checked online, so partial complete file are OK.') | ||
2379 | 559 | parser_cm.add_argument('--equivalents_file', | ||
2380 | 560 | help='Existing equivalents file from a taxonomic name check. Any taxa not in the file will be checked online, so partially complete files are OK.') | ||
2381 | 561 | parser_cm.add_argument('--overwrite', | ||
2382 | 562 | action='store_true', | ||
2383 | 563 | default=False, | ||
2384 | 564 | help="Overwrite the existing file without asking for confirmation") | ||
2385 | 565 | parser_cm.add_argument('--no_store', | ||
2386 | 566 | action="store_true", | ||
2387 | 567 | default=False, | ||
2388 | 568 | help="Do not store intermediate files -- not recommended") | ||
2389 | 569 | parser_cm.set_defaults(func=process) | ||
2390 | 570 | |||
2391 | 492 | 571 | ||
2392 | 493 | # before we let argparse work its magic, check for --version | 572 | # before we let argparse work its magic, check for --version |
2393 | 494 | if "--version" in sys.argv: | 573 | if "--version" in sys.argv: |
2394 | @@ -602,7 +681,7 @@ | |||
2395 | 602 | # check if output files are there | 681 | # check if output files are there |
2396 | 603 | if (output_file and os.path.exists(output_file) and not overwrite): | 682 | if (output_file and os.path.exists(output_file) and not overwrite): |
2397 | 604 | print "Output file exists. Either remove the file or use the --overwrite flag." | 683 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2399 | 605 | print "Do you wish to continue? [Y/n]" | 684 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2400 | 606 | while True: | 685 | while True: |
2401 | 607 | k=inkey() | 686 | k=inkey() |
2402 | 608 | if k.lower() == 'n': | 687 | if k.lower() == 'n': |
2403 | @@ -612,7 +691,7 @@ | |||
2404 | 612 | break | 691 | break |
2405 | 613 | if (not newphyml == None and os.path.exists(newphyml) and not overwrite): | 692 | if (not newphyml == None and os.path.exists(newphyml) and not overwrite): |
2406 | 614 | print "Output Phyml file exists. Either remove the file or use the --overwrite flag." | 693 | print "Output Phyml file exists. Either remove the file or use the --overwrite flag." |
2408 | 615 | print "Do you wish to continue? [Y/n]" | 694 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2409 | 616 | while True: | 695 | while True: |
2410 | 617 | k=inkey() | 696 | k=inkey() |
2411 | 618 | if k.lower() == 'n': | 697 | if k.lower() == 'n': |
2412 | @@ -624,9 +703,9 @@ | |||
2413 | 624 | XML = supertree_toolkit.load_phyml(input_file) | 703 | XML = supertree_toolkit.load_phyml(input_file) |
2414 | 625 | try: | 704 | try: |
2415 | 626 | if (newphyml == None): | 705 | if (newphyml == None): |
2417 | 627 | data_independence = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings) | 706 | data_independence, subsets = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings) |
2418 | 628 | else: | 707 | else: |
2420 | 629 | data_independence, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings) | 708 | data_independence, subsets, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings) |
2421 | 630 | except NotUniqueError as detail: | 709 | except NotUniqueError as detail: |
2422 | 631 | msg = "***Error: Failed to check independence.\n"+detail.msg | 710 | msg = "***Error: Failed to check independence.\n"+detail.msg |
2423 | 632 | print msg | 711 | print msg |
2424 | @@ -644,7 +723,7 @@ | |||
2425 | 644 | print msg | 723 | print msg |
2426 | 645 | return | 724 | return |
2427 | 646 | except: | 725 | except: |
2429 | 647 | msg = "***Error: failed to check independence due to unknown error." | 726 | msg = "***Error: failed to check independence due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit" |
2430 | 648 | print msg | 727 | print msg |
2431 | 649 | traceback.print_exc() | 728 | traceback.print_exc() |
2432 | 650 | return | 729 | return |
2433 | @@ -653,16 +732,14 @@ | |||
2434 | 653 | data_ind = "" | 732 | data_ind = "" |
2435 | 654 | #column headers | 733 | #column headers |
2436 | 655 | data_ind = "Source trees that are subsets of others\n" | 734 | data_ind = "Source trees that are subsets of others\n" |
2441 | 656 | data_ind = data_ind + "Flagged tree, is a subset of:\n" | 735 | data_ind = data_ind + "Flagged tree(s), is/are subset(s) of:\n" |
2442 | 657 | for name in data_independence: | 736 | for names in subsets: |
2443 | 658 | if ( data_independence[name][1] == supertree_toolkit.SUBSET ): | 737 | data_ind += names[1:] + "," + names[0] + "\n" |
2440 | 659 | data_ind += name + "," + data_independence[name][0] + "\n" | ||
2444 | 660 | 738 | ||
2445 | 661 | data_ind += "\n\nSource trees that are identical to others\n" | 739 | data_ind += "\n\nSource trees that are identical to others\n" |
2450 | 662 | data_ind = data_ind + "Flagged tree, is identical to:\n" | 740 | data_ind = data_ind + "Flagged tree(s), is/are identical to:\n" |
2451 | 663 | for name in data_independence: | 741 | for names in data_independence: |
2452 | 664 | if ( data_independence[name][1] == supertree_toolkit.IDENTICAL ): | 742 | data_ind += names[1:] + "," + names[0] + "\n" |
2449 | 665 | data_ind += name + "," + data_independence[name][0] + "\n" | ||
2453 | 666 | 743 | ||
2454 | 667 | 744 | ||
2455 | 668 | if (output_file == False or | 745 | if (output_file == False or |
2456 | @@ -762,7 +839,7 @@ | |||
2457 | 762 | # Does the output file already exist? | 839 | # Does the output file already exist? |
2458 | 763 | if (os.path.exists(output_file) and not overwrite): | 840 | if (os.path.exists(output_file) and not overwrite): |
2459 | 764 | print "Output file exists. Either remove the file or use the --overwrite flag." | 841 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2461 | 765 | print "Do you wish to continue? [Y/n]" | 842 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2462 | 766 | while True: | 843 | while True: |
2463 | 767 | k=inkey() | 844 | k=inkey() |
2464 | 768 | if k.lower() == 'n': | 845 | if k.lower() == 'n': |
2465 | @@ -771,6 +848,7 @@ | |||
2466 | 771 | if k.lower() == 'y': | 848 | if k.lower() == 'y': |
2467 | 772 | break | 849 | break |
2468 | 773 | try: | 850 | try: |
2469 | 851 | |||
2470 | 774 | XML = supertree_toolkit.load_phyml(input_file) | 852 | XML = supertree_toolkit.load_phyml(input_file) |
2471 | 775 | input_is_xml = True | 853 | input_is_xml = True |
2472 | 776 | except: | 854 | except: |
2473 | @@ -896,7 +974,7 @@ | |||
2474 | 896 | # Does the output file already exist? | 974 | # Does the output file already exist? |
2475 | 897 | if (os.path.exists(output_file) and not overwrite): | 975 | if (os.path.exists(output_file) and not overwrite): |
2476 | 898 | print "Output file exists. Either remove the file or use the --overwrite flag." | 976 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2478 | 899 | print "Do you wish to continue? [Y/n]" | 977 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2479 | 900 | while True: | 978 | while True: |
2480 | 901 | k=inkey() | 979 | k=inkey() |
2481 | 902 | if k.lower() == 'n': | 980 | if k.lower() == 'n': |
2482 | @@ -942,7 +1020,7 @@ | |||
2483 | 942 | print msg | 1020 | print msg |
2484 | 943 | return | 1021 | return |
2485 | 944 | except: | 1022 | except: |
2487 | 945 | msg = "***Error: Failed sbstituting taxa due to unknown error.\n" | 1023 | msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2488 | 946 | print msg | 1024 | print msg |
2489 | 947 | traceback.print_exc() | 1025 | traceback.print_exc() |
2490 | 948 | return | 1026 | return |
2491 | @@ -983,7 +1061,7 @@ | |||
2492 | 983 | 1061 | ||
2493 | 984 | if (os.path.exists(output_file) and not overwrite): | 1062 | if (os.path.exists(output_file) and not overwrite): |
2494 | 985 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1063 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2496 | 986 | print "Do you wish to continue? [Y/n]" | 1064 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2497 | 987 | while True: | 1065 | while True: |
2498 | 988 | k=inkey() | 1066 | k=inkey() |
2499 | 989 | if k.lower() == 'n': | 1067 | if k.lower() == 'n': |
2500 | @@ -1013,7 +1091,7 @@ | |||
2501 | 1013 | print msg | 1091 | print msg |
2502 | 1014 | return | 1092 | return |
2503 | 1015 | except: | 1093 | except: |
2505 | 1016 | msg = "***Error: Failed sbstituting taxa due to unknown error.\n" | 1094 | msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2506 | 1017 | print msg | 1095 | print msg |
2507 | 1018 | traceback.print_exc() | 1096 | traceback.print_exc() |
2508 | 1019 | return | 1097 | return |
2509 | @@ -1060,7 +1138,7 @@ | |||
2510 | 1060 | print msg | 1138 | print msg |
2511 | 1061 | return | 1139 | return |
2512 | 1062 | except: | 1140 | except: |
2514 | 1063 | msg = "***Error: Failed to export data due to unknown error.\n" | 1141 | msg = "***Error: Failed to export data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2515 | 1064 | print msg | 1142 | print msg |
2516 | 1065 | traceback.print_exc() | 1143 | traceback.print_exc() |
2517 | 1066 | return | 1144 | return |
2518 | @@ -1115,7 +1193,7 @@ | |||
2519 | 1115 | print msg | 1193 | print msg |
2520 | 1116 | return | 1194 | return |
2521 | 1117 | except: | 1195 | except: |
2523 | 1118 | msg = "***Error: Failed to check overlap due to unknown error.\n" | 1196 | msg = "***Error: Failed to check overlap due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2524 | 1119 | print msg | 1197 | print msg |
2525 | 1120 | traceback.print_exc() | 1198 | traceback.print_exc() |
2526 | 1121 | return | 1199 | return |
2527 | @@ -1161,7 +1239,7 @@ | |||
2528 | 1161 | # check if output files are there | 1239 | # check if output files are there |
2529 | 1162 | if (output_file and os.path.exists(output_file) and not overwrite): | 1240 | if (output_file and os.path.exists(output_file) and not overwrite): |
2530 | 1163 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1241 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2532 | 1164 | print "Do you wish to continue? [Y/n]" | 1242 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2533 | 1165 | while True: | 1243 | while True: |
2534 | 1166 | k=inkey() | 1244 | k=inkey() |
2535 | 1167 | if k.lower() == 'n': | 1245 | if k.lower() == 'n': |
2536 | @@ -1191,7 +1269,7 @@ | |||
2537 | 1191 | print msg | 1269 | print msg |
2538 | 1192 | return | 1270 | return |
2539 | 1193 | except: | 1271 | except: |
2541 | 1194 | msg = "***Error: Failed to export trees due to unknown error.\n" | 1272 | msg = "***Error: Failed to export trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2542 | 1195 | print msg | 1273 | print msg |
2543 | 1196 | traceback.print_exc() | 1274 | traceback.print_exc() |
2544 | 1197 | return | 1275 | return |
2545 | @@ -1220,7 +1298,7 @@ | |||
2546 | 1220 | # check if output files are there | 1298 | # check if output files are there |
2547 | 1221 | if (output_file and os.path.exists(output_file) and not overwrite): | 1299 | if (output_file and os.path.exists(output_file) and not overwrite): |
2548 | 1222 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1300 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2550 | 1223 | print "Do you wish to continue? [Y/n]" | 1301 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2551 | 1224 | while True: | 1302 | while True: |
2552 | 1225 | k=inkey() | 1303 | k=inkey() |
2553 | 1226 | if k.lower() == 'n': | 1304 | if k.lower() == 'n': |
2554 | @@ -1309,7 +1387,7 @@ | |||
2555 | 1309 | print msg | 1387 | print msg |
2556 | 1310 | return | 1388 | return |
2557 | 1311 | except: | 1389 | except: |
2559 | 1312 | msg = "***Error: Failed to permute trees due to unknown error.\n" | 1390 | msg = "***Error: Failed to permute trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2560 | 1313 | print msg | 1391 | print msg |
2561 | 1314 | traceback.print_exc() | 1392 | traceback.print_exc() |
2562 | 1315 | return | 1393 | return |
2563 | @@ -1347,7 +1425,7 @@ | |||
2564 | 1347 | # check if output files are there | 1425 | # check if output files are there |
2565 | 1348 | if (os.path.exists(output_file) and not overwrite): | 1426 | if (os.path.exists(output_file) and not overwrite): |
2566 | 1349 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1427 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2568 | 1350 | print "Do you wish to continue? [Y/n]" | 1428 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2569 | 1351 | while True: | 1429 | while True: |
2570 | 1352 | k=inkey() | 1430 | k=inkey() |
2571 | 1353 | if k.lower() == 'n': | 1431 | if k.lower() == 'n': |
2572 | @@ -1376,7 +1454,7 @@ | |||
2573 | 1376 | print msg | 1454 | print msg |
2574 | 1377 | return | 1455 | return |
2575 | 1378 | except: | 1456 | except: |
2577 | 1379 | msg = "***Error: Failed to clean data due to unknown error.\n" | 1457 | msg = "***Error: Failed to clean data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2578 | 1380 | print msg | 1458 | print msg |
2579 | 1381 | traceback.print_exc() | 1459 | traceback.print_exc() |
2580 | 1382 | return | 1460 | return |
2581 | @@ -1404,7 +1482,7 @@ | |||
2582 | 1404 | # check if output files are there | 1482 | # check if output files are there |
2583 | 1405 | if (os.path.exists(output_file) and not overwrite): | 1483 | if (os.path.exists(output_file) and not overwrite): |
2584 | 1406 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1484 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2586 | 1407 | print "Do you wish to continue? [Y/n]" | 1485 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2587 | 1408 | while True: | 1486 | while True: |
2588 | 1409 | k=inkey() | 1487 | k=inkey() |
2589 | 1410 | if k.lower() == 'n': | 1488 | if k.lower() == 'n': |
2590 | @@ -1433,7 +1511,7 @@ | |||
2591 | 1433 | print msg | 1511 | print msg |
2592 | 1434 | return | 1512 | return |
2593 | 1435 | except: | 1513 | except: |
2595 | 1436 | msg = "***Error: Failed to replace genera due to unknown error.\n" | 1514 | msg = "***Error: Failed to replace genera due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2596 | 1437 | print msg | 1515 | print msg |
2597 | 1438 | traceback.print_exc() | 1516 | traceback.print_exc() |
2598 | 1439 | return | 1517 | return |
2599 | @@ -1488,7 +1566,7 @@ | |||
2600 | 1488 | new_trees = {} | 1566 | new_trees = {} |
2601 | 1489 | i = 1 | 1567 | i = 1 |
2602 | 1490 | for t in trees: | 1568 | for t in trees: |
2604 | 1491 | new_trees['tree_'+str(i)] = t | 1569 | new_trees['tree_'+str(i)] = supertree_toolkit._collapse_nodes(t) |
2605 | 1492 | i += 1 | 1570 | i += 1 |
2606 | 1493 | output = supertree_toolkit._amalgamate_trees(new_trees,format=output_format) | 1571 | output = supertree_toolkit._amalgamate_trees(new_trees,format=output_format) |
2607 | 1494 | except TreeParseError as detail: | 1572 | except TreeParseError as detail: |
2608 | @@ -1503,7 +1581,7 @@ | |||
2609 | 1503 | # check if output files are there | 1581 | # check if output files are there |
2610 | 1504 | if (os.path.exists(output_file) and not overwrite): | 1582 | if (os.path.exists(output_file) and not overwrite): |
2611 | 1505 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1583 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2613 | 1506 | print "Do you wish to continue? [Y/n]" | 1584 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2614 | 1507 | while True: | 1585 | while True: |
2615 | 1508 | k=inkey() | 1586 | k=inkey() |
2616 | 1509 | if k.lower() == 'n': | 1587 | if k.lower() == 'n': |
2617 | @@ -1540,7 +1618,7 @@ | |||
2618 | 1540 | # check if output files are there | 1618 | # check if output files are there |
2619 | 1541 | if (os.path.exists(output_file) and not overwrite): | 1619 | if (os.path.exists(output_file) and not overwrite): |
2620 | 1542 | print "Output file exists. Either remove the file or use the --overwrite flag." | 1620 | print "Output file exists. Either remove the file or use the --overwrite flag." |
2622 | 1543 | print "Do you wish to continue? [Y/n]" | 1621 | print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2623 | 1544 | while True: | 1622 | while True: |
2624 | 1545 | k=inkey() | 1623 | k=inkey() |
2625 | 1546 | if k.lower() == 'n': | 1624 | if k.lower() == 'n': |
2626 | @@ -1589,7 +1667,7 @@ | |||
2627 | 1589 | print msg | 1667 | print msg |
2628 | 1590 | return | 1668 | return |
2629 | 1591 | except: | 1669 | except: |
2631 | 1592 | msg = "***Error: Failed to create subset due to unknown error.\n" | 1670 | msg = "***Error: Failed to create subset due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" |
2632 | 1593 | print msg | 1671 | print msg |
2633 | 1594 | traceback.print_exc() | 1672 | traceback.print_exc() |
2634 | 1595 | return | 1673 | return |
2635 | @@ -1637,6 +1715,681 @@ | |||
2636 | 1637 | print "**************************************************************\n" | 1715 | print "**************************************************************\n" |
2637 | 1638 | 1716 | ||
2638 | 1639 | 1717 | ||
2639 | 1718 | def check_otus(args): | ||
2640 | 1719 | """check out the OTUs in the Phyml - are they considered valid?""" | ||
2641 | 1720 | |||
2642 | 1721 | verbose = args.verbose | ||
2643 | 1722 | input_file = args.input | ||
2644 | 1723 | output_file = args.output | ||
2645 | 1724 | |||
2646 | 1725 | print input_file | ||
2647 | 1726 | if (input_file.endswith(".phyml")): | ||
2648 | 1727 | XML = supertree_toolkit.load_phyml(input_file) | ||
2649 | 1728 | try: | ||
2650 | 1729 | equivs = supertree_toolkit.taxonomic_checker(XML, verbose=verbose) | ||
2651 | 1730 | except NotUniqueError as detail: | ||
2652 | 1731 | msg = "***Error: Failed to check OTUs.\n"+detail.msg | ||
2653 | 1732 | print msg | ||
2654 | 1733 | return | ||
2655 | 1734 | except InvalidSTKData as detail: | ||
2656 | 1735 | msg = "***Error: Failed to check OTUs.\n"+detail.msg | ||
2657 | 1736 | print msg | ||
2658 | 1737 | return | ||
2659 | 1738 | except UninformativeTreeError as detail: | ||
2660 | 1739 | msg = "***Error: Failed to check OTUs.\n"+detail.msg | ||
2661 | 1740 | print msg | ||
2662 | 1741 | return | ||
2663 | 1742 | except TreeParseError as detail: | ||
2664 | 1743 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
2665 | 1744 | print msg | ||
2666 | 1745 | return | ||
2667 | 1746 | except: | ||
2668 | 1747 | # what about no internet connection? What error does that throw? | ||
2669 | 1748 | msg = "***Error: failed to check OTUs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit" | ||
2670 | 1749 | print msg | ||
2671 | 1750 | traceback.print_exc() | ||
2672 | 1751 | return | ||
2673 | 1752 | elif (input_file.endswith(".txt") or input_file.endswith('.dat')): | ||
2674 | 1753 | # read file - assume one taxon per line | ||
2675 | 1754 | with open(input_file,'r') as f: | ||
2676 | 1755 | lines = f.read().splitlines() | ||
2677 | 1756 | equivs = supertree_toolkit.taxonomic_checker_list(lines, verbose=verbose) | ||
2678 | 1757 | else: | ||
2679 | 1758 | # assume a tree! | ||
2680 | 1759 | equivs = supertree_toolkit.taxonomic_checker_tree(input_file, verbose=verbose) | ||
2681 | 1760 | |||
2682 | 1761 | |||
2683 | 1762 | |||
2684 | 1763 | f = open(output_file,"w") | ||
2685 | 1764 | for taxon in sorted(equivs.keys()): | ||
2686 | 1765 | f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n") | ||
2687 | 1766 | f.close() | ||
2688 | 1767 | |||
2689 | 1768 | |||
2690 | 1769 | |||
2691 | 1770 | def create_taxonomy(args): | ||
2692 | 1771 | """create a taxonomic heirachy for each OTU in the Phyml""" | ||
2693 | 1772 | |||
2694 | 1773 | verbose = args.verbose | ||
2695 | 1774 | input_file = args.input | ||
2696 | 1775 | output_file = args.output | ||
2697 | 1776 | existing_taxonomy = args.taxonomy | ||
2698 | 1777 | ignoreWarnings = args.ignoreWarnings | ||
2699 | 1778 | |||
2700 | 1779 | XML = supertree_toolkit.load_phyml(input_file) | ||
2701 | 1780 | if (not existing_taxonomy == None): | ||
2702 | 1781 | existing_taxonomy = supertree_toolkit.load_taxonomy(existing_taxonomy) # load it in and create the dictionary | ||
2703 | 1782 | pass | ||
2704 | 1783 | |||
2705 | 1784 | try: | ||
2706 | 1785 | taxonomy = supertree_toolkit.create_taxonomy(XML,existing_taxonomy=existing_taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings) | ||
2707 | 1786 | except NotUniqueError as detail: | ||
2708 | 1787 | msg = "***Error: Failed to create taxonomy.\n"+detail.msg | ||
2709 | 1788 | print msg | ||
2710 | 1789 | return | ||
2711 | 1790 | except InvalidSTKData as detail: | ||
2712 | 1791 | msg = "***Error: Failed to create taxonomy.\n"+detail.msg | ||
2713 | 1792 | print msg | ||
2714 | 1793 | return | ||
2715 | 1794 | except UninformativeTreeError as detail: | ||
2716 | 1795 | msg = "***Error: Failed to create taxonomy.\n"+detail.msg | ||
2717 | 1796 | print msg | ||
2718 | 1797 | return | ||
2719 | 1798 | except TreeParseError as detail: | ||
2720 | 1799 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
2721 | 1800 | print msg | ||
2722 | 1801 | return | ||
2723 | 1802 | except: | ||
2724 | 1803 | # what about no internet connection? What error does that throw? | ||
2725 | 1804 | msg = "***Error: failed to create taxonomy due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit" | ||
2726 | 1805 | print msg | ||
2727 | 1806 | traceback.print_exc() | ||
2728 | 1807 | return | ||
2729 | 1808 | |||
2730 | 1809 | # Now create the CSV output | ||
2731 | 1810 | with open(output_file, 'w') as f: | ||
2732 | 1811 | writer = csv.writer(f) | ||
2733 | 1812 | headers = [] | ||
2734 | 1813 | headers.append("OTU") | ||
2735 | 1814 | headers.extend(supertree_toolkit.taxonomy_levels) | ||
2736 | 1815 | headers.append("Data source") | ||
2737 | 1816 | writer.writerow(headers) | ||
2738 | 1817 | for t in taxonomy: | ||
2739 | 1818 | otu = t | ||
2740 | 1819 | try: | ||
2741 | 1820 | species = taxonomy[t]['species'] | ||
2742 | 1821 | except KeyError: | ||
2743 | 1822 | species = "-" | ||
2744 | 1823 | try: | ||
2745 | 1824 | genus = taxonomy[t]['genus'] | ||
2746 | 1825 | except KeyError: | ||
2747 | 1826 | genus = "-" | ||
2748 | 1827 | try: | ||
2749 | 1828 | family = taxonomy[t]['family'] | ||
2750 | 1829 | except KeyError: | ||
2751 | 1830 | family = "-" | ||
2752 | 1831 | try: | ||
2753 | 1832 | superfamily = taxonomy[t]['superfamily'] | ||
2754 | 1833 | except KeyError: | ||
2755 | 1834 | superfamily = "-" | ||
2756 | 1835 | try: | ||
2757 | 1836 | infraorder = taxonomy[t]['infraorder'] | ||
2758 | 1837 | except KeyError: | ||
2759 | 1838 | infraorder = "-" | ||
2760 | 1839 | try: | ||
2761 | 1840 | suborder = taxonomy[t]['suborder'] | ||
2762 | 1841 | except KeyError: | ||
2763 | 1842 | suborder = "-" | ||
2764 | 1843 | try: | ||
2765 | 1844 | order = taxonomy[t]['order'] | ||
2766 | 1845 | except KeyError: | ||
2767 | 1846 | order = "-" | ||
2768 | 1847 | try: | ||
2769 | 1848 | superorder = taxonomy[t]['superorder'] | ||
2770 | 1849 | except KeyError: | ||
2771 | 1850 | superorder = "-" | ||
2772 | 1851 | try: | ||
2773 | 1852 | subclass = taxonomy[t]['subclass'] | ||
2774 | 1853 | except KeyError: | ||
2775 | 1854 | subclass = "-" | ||
2776 | 1855 | try: | ||
2777 | 1856 | tclass = taxonomy[t]['class'] | ||
2778 | 1857 | except KeyError: | ||
2779 | 1858 | tclass = "-" | ||
2780 | 1859 | try: | ||
2781 | 1860 | subphylum = taxonomy[t]['subphylum'] | ||
2782 | 1861 | except KeyError: | ||
2783 | 1862 | subphylum = "-" | ||
2784 | 1863 | try: | ||
2785 | 1864 | phylum = taxonomy[t]['phylum'] | ||
2786 | 1865 | except KeyError: | ||
2787 | 1866 | phylum = "-" | ||
2788 | 1867 | try: | ||
2789 | 1868 | superphylum = taxonomy[t]['superphylum'] | ||
2790 | 1869 | except KeyError: | ||
2791 | 1870 | superphylum = "-" | ||
2792 | 1871 | try: | ||
2793 | 1872 | infrakingdom = taxonomy[t]['infrakingdom'] | ||
2794 | 1873 | except: | ||
2795 | 1874 | infrakingdom = "-" | ||
2796 | 1875 | try: | ||
2797 | 1876 | subkingdom = taxonomy[t]['subkingdom'] | ||
2798 | 1877 | except: | ||
2799 | 1878 | subkingdom = "-" | ||
2800 | 1879 | try: | ||
2801 | 1880 | kingdom = taxonomy[t]['kingdom'] | ||
2802 | 1881 | except KeyError: | ||
2803 | 1882 | kingdom = "-" | ||
2804 | 1883 | try: | ||
2805 | 1884 | provider = taxonomy[t]['provider'] | ||
2806 | 1885 | except KeyError: | ||
2807 | 1886 | provider = "-" | ||
2808 | 1887 | |||
2809 | 1888 | if (isinstance(species, list)): | ||
2810 | 1889 | species = " ".join(species) | ||
2811 | 1890 | this_classification = [ | ||
2812 | 1891 | otu.encode('utf-8'), | ||
2813 | 1892 | species.encode('utf-8'), | ||
2814 | 1893 | genus.encode('utf-8'), | ||
2815 | 1894 | family.encode('utf-8'), | ||
2816 | 1895 | superfamily.encode('utf-8'), | ||
2817 | 1896 | infraorder.encode('utf-8'), | ||
2818 | 1897 | suborder.encode('utf-8'), | ||
2819 | 1898 | order.encode('utf-8'), | ||
2820 | 1899 | superorder.encode('utf-8'), | ||
2821 | 1900 | subclass.encode('utf-8'), | ||
2822 | 1901 | tclass.encode('utf-8'), | ||
2823 | 1902 | subphylum.encode('utf-8'), | ||
2824 | 1903 | phylum.encode('utf-8'), | ||
2825 | 1904 | superphylum.encode('utf-8'), | ||
2826 | 1905 | infrakingdom.encode('utf-8'), | ||
2827 | 1906 | subkingdom.encode('utf-8'), | ||
2828 | 1907 | kingdom.encode('utf-8'), | ||
2829 | 1908 | provider.encode('utf-8')] | ||
2830 | 1909 | writer.writerow(this_classification) | ||
2831 | 1910 | |||
2832 | 1911 | def auto_subs(args): | ||
2833 | 1912 | """Get all OTUs to the same taxonomic level""" | ||
2834 | 1913 | |||
2835 | 1914 | |||
2836 | 1915 | verbose = args.verbose | ||
2837 | 1916 | input_file = args.input | ||
2838 | 1917 | output = args.output | ||
2839 | 1918 | taxonomy = args.taxonomy | ||
2840 | 1919 | ignoreWarnings = args.ignoreWarnings | ||
2841 | 1920 | overwrite = args.overwrite | ||
2842 | 1921 | if (os.path.exists(output) and not overwrite): | ||
2843 | 1922 | print "Output Phyml file exists. Either remove the file or use the --overwrite flag." | ||
2844 | 1923 | print "Do you wish to continue and overwrite the file anyway? [Y/n]" | ||
2845 | 1924 | while True: | ||
2846 | 1925 | k=inkey() | ||
2847 | 1926 | if k.lower() == 'n': | ||
2848 | 1927 | print "Exiting..." | ||
2849 | 1928 | sys.exit(0) | ||
2850 | 1929 | if k.lower() == 'y': | ||
2851 | 1930 | break | ||
2852 | 1931 | |||
2853 | 1932 | XML = supertree_toolkit.load_phyml(input_file) | ||
2854 | 1933 | taxonomy = supertree_toolkit.load_taxonomy(taxonomy) # load it in and create the dictionary | ||
2855 | 1934 | |||
2856 | 1935 | try: | ||
2857 | 1936 | newXML = supertree_toolkit.generate_species_level_data(XML,taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings) | ||
2858 | 1937 | except NotUniqueError as detail: | ||
2859 | 1938 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
2860 | 1939 | print msg | ||
2861 | 1940 | return | ||
2862 | 1941 | except InvalidSTKData as detail: | ||
2863 | 1942 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
2864 | 1943 | print msg | ||
2865 | 1944 | return | ||
2866 | 1945 | except UninformativeTreeError as detail: | ||
2867 | 1946 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
2868 | 1947 | print msg | ||
2869 | 1948 | return | ||
2870 | 1949 | except TreeParseError as detail: | ||
2871 | 1950 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
2872 | 1951 | print msg | ||
2873 | 1952 | return | ||
2874 | 1953 | except NoneCompleteTaxonomy as detail: | ||
2875 | 1954 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
2876 | 1955 | print msg | ||
2877 | 1956 | return | ||
2878 | 1957 | except: | ||
2879 | 1958 | # what about no internet connection? What error does that throw? | ||
2880 | 1959 | msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit" | ||
2881 | 1960 | print msg | ||
2882 | 1961 | traceback.print_exc() | ||
2883 | 1962 | return | ||
2884 | 1963 | |||
2885 | 1964 | f = open(output,"w") | ||
2886 | 1965 | f.write(newXML) | ||
2887 | 1966 | f.close() | ||
2888 | 1967 | |||
2889 | 1968 | def process(args): | ||
2890 | 1969 | |||
2891 | 1970 | verbose = args.verbose | ||
2892 | 1971 | input_file = args.input | ||
2893 | 1972 | output = args.output | ||
2894 | 1973 | no_store = args.no_store | ||
2895 | 1974 | ignoreWarnings = args.ignoreWarnings | ||
2896 | 1975 | taxonomy_file = args.taxonomy_file | ||
2897 | 1976 | equivalents_file = args.equivalents_file | ||
2898 | 1977 | overwrite = args.overwrite | ||
2899 | 1978 | |||
2900 | 1979 | if (os.path.exists(output) and not overwrite): | ||
2901 | 1980 | print "Output matrix file exists. Either remove the file or use the --overwrite flag." | ||
2902 | 1981 | print "Do you wish to continue and overwrite the file anyway? [Y/n]" | ||
2903 | 1982 | while True: | ||
2904 | 1983 | k=inkey() | ||
2905 | 1984 | if k.lower() == 'n': | ||
2906 | 1985 | print "Exiting..." | ||
2907 | 1986 | sys.exit(0) | ||
2908 | 1987 | if k.lower() == 'y': | ||
2909 | 1988 | break | ||
2910 | 1989 | |||
2911 | 1990 | filename = os.path.basename(input_file) | ||
2912 | 1991 | dirname = os.path.dirname(input_file) | ||
2913 | 1992 | |||
2914 | 1993 | if verbose: | ||
2915 | 1994 | print "Loading and checking your data" | ||
2916 | 1995 | # 0) load and check data | ||
2917 | 1996 | try: | ||
2918 | 1997 | phyml = supertree_toolkit.load_phyml(input_file) | ||
2919 | 1998 | project_name = supertree_toolkit.get_project_name(phyml) | ||
2920 | 1999 | supertree_toolkit._check_data(phyml) | ||
2921 | 2000 | except NotUniqueError as detail: | ||
2922 | 2001 | msg = "***Error: Failed to load data.\n"+detail.msg | ||
2923 | 2002 | print msg | ||
2924 | 2003 | return | ||
2925 | 2004 | except InvalidSTKData as detail: | ||
2926 | 2005 | msg = "***Error: Failed to load data.\n"+detail.msg | ||
2927 | 2006 | print msg | ||
2928 | 2007 | return | ||
2929 | 2008 | except UninformativeTreeError as detail: | ||
2930 | 2009 | msg = "***Error: Failed to load data.\n"+detail.msg | ||
2931 | 2010 | print msg | ||
2932 | 2011 | return | ||
2933 | 2012 | except TreeParseError as detail: | ||
2934 | 2013 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
2935 | 2014 | print msg | ||
2936 | 2015 | return | ||
2937 | 2016 | except: | ||
2938 | 2017 | msg = "***Error: Failed to load input due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" | ||
2939 | 2018 | print msg | ||
2940 | 2019 | traceback.print_exc() | ||
2941 | 2020 | return | ||
2942 | 2021 | |||
2943 | 2022 | if verbose: | ||
2944 | 2023 | print "Checking taxa against online databases" | ||
2945 | 2024 | # 1) taxonomy checker with autoreplace | ||
2946 | 2025 | # Load existing data if any: | ||
2947 | 2026 | if (not equivalents_file == None): | ||
2948 | 2027 | equivalents = supertree_toolkit.load_equivalents(equivalents_file) | ||
2949 | 2028 | else: | ||
2950 | 2029 | equivalents = None | ||
2951 | 2030 | equivalents = supertree_toolkit.taxonomic_checker(phyml,existing_data=equivalents,verbose=verbose) | ||
2952 | 2031 | # save the equivalents for later (as CSV and as sub file) | ||
2953 | 2032 | data_string_csv = _equivalents_to_csv(equivalents) | ||
2954 | 2033 | data_string_subs = _equivalents_to_subs(equivalents) | ||
2955 | 2034 | f = open(os.path.join(dirname,project_name+"_taxonomy_checker.csv"), "w") | ||
2956 | 2035 | f.write(data_string_csv) | ||
2957 | 2036 | f.close() | ||
2958 | 2037 | f = open(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"), "w") | ||
2959 | 2038 | f.write(data_string_subs) | ||
2960 | 2039 | f.close() | ||
2961 | 2040 | |||
2962 | 2041 | # now do the replacements - we use the subs file :) | ||
2963 | 2042 | if verbose: | ||
2964 | 2043 | print "Swapping in the corrected taxa names" | ||
2965 | 2044 | try: | ||
2966 | 2045 | old_taxa, new_taxa = supertree_toolkit.parse_subs_file(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat")) | ||
2967 | 2046 | except UnableToParseSubsFile as e: | ||
2968 | 2047 | print e.msg | ||
2969 | 2048 | sys.exit(-1) | ||
2970 | 2049 | try: | ||
2971 | 2050 | phyml = supertree_toolkit.substitute_taxa(phyml,old_taxa,new_taxa,only_existing=False,verbose=verbose) | ||
2972 | 2051 | except NotUniqueError as detail: | ||
2973 | 2052 | msg = "***Error: Failed to substitute taxa.\n"+detail.msg | ||
2974 | 2053 | print msg | ||
2975 | 2054 | return | ||
2976 | 2055 | except InvalidSTKData as detail: | ||
2977 | 2056 | msg = "***Error: Failed substituting taxa.\n"+detail.msg | ||
2978 | 2057 | print msg | ||
2979 | 2058 | return | ||
2980 | 2059 | except UninformativeTreeError as detail: | ||
2981 | 2060 | msg = "***Error: Failed to substitute taxa.\n"+detail.msg | ||
2982 | 2061 | print msg | ||
2983 | 2062 | return | ||
2984 | 2063 | except TreeParseError as detail: | ||
2985 | 2064 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
2986 | 2065 | print msg | ||
2987 | 2066 | return | ||
2988 | 2067 | except: | ||
2989 | 2068 | msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" | ||
2990 | 2069 | print msg | ||
2991 | 2070 | traceback.print_exc() | ||
2992 | 2071 | return | ||
2993 | 2072 | # save phyml as intermediate step | ||
2994 | 2073 | f = open(os.path.join(dirname,project_name+"_taxonomy_checked.phyml"), "w") | ||
2995 | 2074 | f.write(phyml) | ||
2996 | 2075 | f.close() | ||
2997 | 2076 | |||
2998 | 2077 | |||
2999 | 2078 | if verbose: | ||
3000 | 2079 | print "Creating taxonomic information" | ||
3001 | 2080 | # 2) create taxonomy | ||
3002 | 2081 | if (not taxonomy_file == None): | ||
3003 | 2082 | taxonomy = supertree_toolkit.load_taxonomy(taxonomy_file) | ||
3004 | 2083 | else: | ||
3005 | 2084 | taxonomy = None | ||
3006 | 2085 | taxonomy = supertree_toolkit.create_taxonomy(phyml,existing_taxonomy=taxonomy,verbose=verbose) | ||
3007 | 2086 | # save the taxonomy for later | ||
3008 | 2087 | # Now create the CSV output - separate out into a function in STK (used several times) | ||
3009 | 2088 | with open(os.path.join(dirname,project_name+"_taxonomy.csv"), 'w') as f: | ||
3010 | 2089 | writer = csv.writer(f) | ||
3011 | 2090 | headers = [] | ||
3012 | 2091 | headers.append("OTU") | ||
3013 | 2092 | headers.extend(supertree_toolkit.taxonomy_levels) | ||
3014 | 2093 | headers.append("Data source") | ||
3015 | 2094 | writer.writerow(headers) | ||
3016 | 2095 | for t in taxonomy: | ||
3017 | 2096 | otu = t | ||
3018 | 2097 | try: | ||
3019 | 2098 | species = taxonomy[t]['species'] | ||
3020 | 2099 | except KeyError: | ||
3021 | 2100 | species = "-" | ||
3022 | 2101 | try: | ||
3023 | 2102 | subgenus = taxonomy[t]['subgenus'] | ||
3024 | 2103 | except KeyError: | ||
3025 | 2104 | subgenus = "-" | ||
3026 | 2105 | try: | ||
3027 | 2106 | genus = taxonomy[t]['genus'] | ||
3028 | 2107 | except KeyError: | ||
3029 | 2108 | genus = "-" | ||
3030 | 2109 | try: | ||
3031 | 2110 | subfamily = taxonomy[t]['subfamily'] | ||
3032 | 2111 | except KeyError: | ||
3033 | 2112 | subfamily = "-" | ||
3034 | 2113 | try: | ||
3035 | 2114 | family = taxonomy[t]['family'] | ||
3036 | 2115 | except KeyError: | ||
3037 | 2116 | family = "-" | ||
3038 | 2117 | try: | ||
3039 | 2118 | superfamily = taxonomy[t]['superfamily'] | ||
3040 | 2119 | except KeyError: | ||
3041 | 2120 | superfamily = "-" | ||
3042 | 2121 | try: | ||
3043 | 2122 | subsection = taxonomy[t]['subsection'] | ||
3044 | 2123 | except KeyError: | ||
3045 | 2124 | subsection = "-" | ||
3046 | 2125 | try: | ||
3047 | 2126 | section = taxonomy[t]['section'] | ||
3048 | 2127 | except KeyError: | ||
3049 | 2128 | section = "-" | ||
3050 | 2129 | try: | ||
3051 | 2130 | infraorder = taxonomy[t]['infraorder'] | ||
3052 | 2131 | except KeyError: | ||
3053 | 2132 | infraorder = "-" | ||
3054 | 2133 | try: | ||
3055 | 2134 | suborder = taxonomy[t]['suborder'] | ||
3056 | 2135 | except KeyError: | ||
3057 | 2136 | suborder = "-" | ||
3058 | 2137 | try: | ||
3059 | 2138 | order = taxonomy[t]['order'] | ||
3060 | 2139 | except KeyError: | ||
3061 | 2140 | order = "-" | ||
3062 | 2141 | try: | ||
3063 | 2142 | superorder = taxonomy[t]['superorder'] | ||
3064 | 2143 | except KeyError: | ||
3065 | 2144 | superorder = "-" | ||
3066 | 2145 | try: | ||
3067 | 2146 | subclass = taxonomy[t]['subclass'] | ||
3068 | 2147 | except KeyError: | ||
3069 | 2148 | subclass = "-" | ||
3070 | 2149 | try: | ||
3071 | 2150 | tclass = taxonomy[t]['class'] | ||
3072 | 2151 | except KeyError: | ||
3073 | 2152 | tclass = "-" | ||
3074 | 2153 | try: | ||
3075 | 2154 | superclass = taxonomy[t]['superclass'] | ||
3076 | 2155 | except KeyError: | ||
3077 | 2156 | superclass = "-" | ||
3078 | 2157 | try: | ||
3079 | 2158 | subphylum = taxonomy[t]['subphylum'] | ||
3080 | 2159 | except KeyError: | ||
3081 | 2160 | subphylum = "-" | ||
3082 | 2161 | try: | ||
3083 | 2162 | phylum = taxonomy[t]['phylum'] | ||
3084 | 2163 | except KeyError: | ||
3085 | 2164 | phylum = "-" | ||
3086 | 2165 | try: | ||
3087 | 2166 | superphylum = taxonomy[t]['superphylum'] | ||
3088 | 2167 | except KeyError: | ||
3089 | 2168 | superphylum = "-" | ||
3090 | 2169 | try: | ||
3091 | 2170 | infrakingdom = taxonomy[t]['infrakingdom'] | ||
3092 | 2171 | except: | ||
3093 | 2172 | infrakingdom = "-" | ||
3094 | 2173 | try: | ||
3095 | 2174 | subkingdom = taxonomy[t]['subkingdom'] | ||
3096 | 2175 | except: | ||
3097 | 2176 | subkingdom = "-" | ||
3098 | 2177 | try: | ||
3099 | 2178 | kingdom = taxonomy[t]['kingdom'] | ||
3100 | 2179 | except KeyError: | ||
3101 | 2180 | kingdom = "-" | ||
3102 | 2181 | try: | ||
3103 | 2182 | provider = taxonomy[t]['provider'] | ||
3104 | 2183 | except KeyError: | ||
3105 | 2184 | provider = "-" | ||
3106 | 2185 | this_classification = [ | ||
3107 | 2186 | otu.encode('utf-8'), | ||
3108 | 2187 | species.encode('utf-8'), | ||
3109 | 2188 | subgenus.encode('utf-8'), | ||
3110 | 2189 | genus.encode('utf-8'), | ||
3111 | 2190 | subfamily.encode('utf-8'), | ||
3112 | 2191 | family.encode('utf-8'), | ||
3113 | 2192 | superfamily.encode('utf-8'), | ||
3114 | 2193 | subsection.encode('utf-8'), | ||
3115 | 2194 | section.encode('utf-8'), | ||
3116 | 2195 | infraorder.encode('utf-8'), | ||
3117 | 2196 | suborder.encode('utf-8'), | ||
3118 | 2197 | order.encode('utf-8'), | ||
3119 | 2198 | superorder.encode('utf-8'), | ||
3120 | 2199 | subclass.encode('utf-8'), | ||
3121 | 2200 | tclass.encode('utf-8'), | ||
3122 | 2201 | superclass.encode('utf-8'), | ||
3123 | 2202 | subphylum.encode('utf-8'), | ||
3124 | 2203 | phylum.encode('utf-8'), | ||
3125 | 2204 | superphylum.encode('utf-8'), | ||
3126 | 2205 | infrakingdom.encode('utf-8'), | ||
3127 | 2206 | subkingdom.encode('utf-8'), | ||
3128 | 2207 | kingdom.encode('utf-8'), | ||
3129 | 2208 | provider.encode('utf-8')] | ||
3130 | 2209 | writer.writerow(this_classification) | ||
3131 | 2210 | |||
3132 | 2211 | # 3) create species level dataset | ||
3133 | 2212 | if verbose: | ||
3134 | 2213 | print "Converting data to species level" | ||
3135 | 2214 | try: | ||
3136 | 2215 | phyml = supertree_toolkit.generate_species_level_data(phyml,taxonomy,verbose=verbose) | ||
3137 | 2216 | except NotUniqueError as detail: | ||
3138 | 2217 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
3139 | 2218 | print msg | ||
3140 | 2219 | return | ||
3141 | 2220 | except InvalidSTKData as detail: | ||
3142 | 2221 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
3143 | 2222 | print msg | ||
3144 | 2223 | return | ||
3145 | 2224 | except UninformativeTreeError as detail: | ||
3146 | 2225 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
3147 | 2226 | print msg | ||
3148 | 2227 | return | ||
3149 | 2228 | except TreeParseError as detail: | ||
3150 | 2229 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
3151 | 2230 | print msg | ||
3152 | 2231 | return | ||
3153 | 2232 | except NoneCompleteTaxonomy as detail: | ||
3154 | 2233 | msg = "***Error: Failed to carry out auto subs.\n"+detail.msg | ||
3155 | 2234 | print msg | ||
3156 | 2235 | return | ||
3157 | 2236 | except: | ||
3158 | 2237 | # what about no internet connection? What error does that throw? | ||
3159 | 2238 | msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit" | ||
3160 | 2239 | print msg | ||
3161 | 2240 | traceback.print_exc() | ||
3162 | 2241 | return | ||
3163 | 2242 | # save the phyml as intermediate step | ||
3164 | 2243 | f = open(os.path.join(dirname,project_name+"_species_level.phyml"), "w") | ||
3165 | 2244 | f.write(phyml) | ||
3166 | 2245 | f.close() | ||
3167 | 2246 | |||
3168 | 2247 | # 4) Remove non-monophyletic taxa (requires TNT to be installed) | ||
3169 | 2248 | if verbose: | ||
3170 | 2249 | print "Removing non-monophyletic taxa via mini-supertree method" | ||
3171 | 2250 | tree_list = supertree_toolkit._find_trees_for_permuting(phyml) | ||
3172 | 2251 | try: | ||
3173 | 2252 | for t in tree_list: | ||
3174 | 2253 | # permute | ||
3175 | 2254 | output_string = supertree_toolkit.permute_tree(tree_list[t],matrix='hennig',treefile=None,verbose=verbose) | ||
3176 | 2255 | #save | ||
3177 | 2256 | if (not output_string == ""): | ||
3178 | 2257 | file_name = os.path.basename(filename) | ||
3179 | 2258 | dirname = os.path.dirname(input_file) | ||
3180 | 2259 | new_output = os.path.join(dirname,t,t+"_matrix.tnt") | ||
3181 | 2260 | try: | ||
3182 | 2261 | os.makedirs(os.path.join(dirname,t)) | ||
3183 | 2262 | except OSError: | ||
3184 | 2263 | if not os.path.isdir(os.path.join(dirname,t)): | ||
3185 | 2264 | raise | ||
3186 | 2265 | f = open(new_output,'w',0) | ||
3187 | 2266 | f.write(output_string) | ||
3188 | 2267 | f.close() | ||
3189 | 2268 | time.sleep(1) | ||
3190 | 2269 | |||
3191 | 2270 | # now create the tnt command to deal with this | ||
3192 | 2271 | # create a tmp file for the output tree | ||
3193 | 2272 | temp_file_handle, temp_file = tempfile.mkstemp(suffix=".tnt") | ||
3194 | 2273 | tnt_command = "tnt mxram 512,run "+new_output+",echo= ,timeout 00:10:00,rseed0,rseed*,hold 1000,xmult= level 0,taxname=,nelsen *,tsave *"+temp_file+",save /,quit" | ||
3195 | 2274 | #tnt_command = "tnt run "+new_output+",ienum,taxname=,nelsen*,tsave *"+temp_file+",save /,quit" | ||
3196 | 2275 | # run tnt, grab the output and store back in the data | ||
3197 | 2276 | #try: | ||
3198 | 2277 | call(tnt_command, shell=True) | ||
3199 | 2278 | #except CalledProcessError as e: | ||
3200 | 2279 | # msg = "***Error: Failed to run TNT. Is it installed correctly?\n"+e.msg | ||
3201 | 2280 | # print msg | ||
3202 | 2281 | # return | ||
3203 | 2282 | #ret = os.system(tnt_command) | ||
3204 | 2283 | #if (not ret == 0): | ||
3205 | 2284 | # print "error running tnt" | ||
3206 | 2285 | # return | ||
3207 | 2286 | |||
3208 | 2287 | new_tree = supertree_toolkit.import_tree(temp_file) | ||
3209 | 2288 | phyml = supertree_toolkit._swap_tree_in_XML(phyml,new_tree,t) | ||
3210 | 2289 | |||
3211 | 2290 | except TreeParseError as e: | ||
3212 | 2291 | msg = "***Error permuting trees.\n"+e.msg | ||
3213 | 2292 | print msg | ||
3214 | 2293 | return | ||
3215 | 2294 | |||
3216 | 2295 | #4.5) remove MRP_Outgroups | ||
3217 | 2296 | phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_Outgroup') | ||
3218 | 2297 | phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOutgroup') | ||
3219 | 2298 | phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_outgroup') | ||
3220 | 2299 | phyml = supertree_toolkit.substitute_taxa(phyml,'MRPoutgroup') | ||
3221 | 2300 | phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOUTGROUP') | ||
3222 | 2301 | |||
3223 | 2302 | # save intermediate phyml | ||
3224 | 2303 | f = open(os.path.join(dirname,project_name+"_nonmonophyl_removed.phyml"), "w") | ||
3225 | 2304 | f.write(phyml) | ||
3226 | 2305 | f.close() | ||
3227 | 2306 | |||
3228 | 2307 | |||
3229 | 2308 | # 5) Remove common names | ||
3230 | 2309 | # no function to do this yet... | ||
3231 | 2310 | |||
3232 | 2311 | # 6) Data independence | ||
3233 | 2312 | if verbose: | ||
3234 | 2313 | print "Checking data independence" | ||
3235 | 2314 | data_ind,subsets,phyml = supertree_toolkit.data_independence(phyml,make_new_xml=True) | ||
3236 | 2315 | # save phyml | ||
3237 | 2316 | f = open(os.path.join(dirname,project_name+"_data_ind.phyml"), "w") | ||
3238 | 2317 | f.write(phyml) | ||
3239 | 2318 | f.close() | ||
3240 | 2319 | |||
3241 | 2320 | # 7) Data overlap | ||
3242 | 2321 | if verbose: | ||
3243 | 2322 | print "Checking data overlap" | ||
3244 | 2323 | sufficient_overlap, key_list = supertree_toolkit.data_overlap(phyml,verbose=verbose) | ||
3245 | 2324 | # process the key_list to remove the unconnected trees | ||
3246 | 2325 | if not sufficient_overlap: | ||
3247 | 2326 | # we don't have enough, so remove all but the largest group. | ||
3248 | 2327 | # the key contains a list, with the largest group first (thanks networkX!) | ||
3249 | 2328 | # we can therefore just remove trees from everything but the first in the list | ||
3250 | 2329 | delete_me = [] | ||
3251 | 2330 | for t in key_list[1::]: # skip 0 | ||
3252 | 2331 | delete_me.extend(t) | ||
3253 | 2332 | for tree in delete_me: | ||
3254 | 2333 | phyml = supertree_toolkit._swap_tree_in_XML(phyml, None, tree, delete=True) # delete the tree and clean the data as we go | ||
3255 | 2334 | # save phyml | ||
3256 | 2335 | f = open(os.path.join(dirname,project_name+"_data_tax_overlap.phyml"), "w") | ||
3257 | 2336 | f.write(phyml) | ||
3258 | 2337 | f.close() | ||
3259 | 2338 | |||
3260 | 2339 | |||
3261 | 2340 | # 8) Create matrix | ||
3262 | 2341 | if verbose: | ||
3263 | 2342 | print "Creating matrix" | ||
3264 | 2343 | try: | ||
3265 | 2344 | matrix = supertree_toolkit.create_matrix(phyml) | ||
3266 | 2345 | except NotUniqueError as detail: | ||
3267 | 2346 | msg = "***Error: Failed to create matrix.\n"+detail.msg | ||
3268 | 2347 | print msg | ||
3269 | 2348 | return | ||
3270 | 2349 | except InvalidSTKData as detail: | ||
3271 | 2350 | msg = "***Error: Failed to create matrix.\n"+detail.msg | ||
3272 | 2351 | print msg | ||
3273 | 2352 | return | ||
3274 | 2353 | except UninformativeTreeError as detail: | ||
3275 | 2354 | msg = "***Error: Failed to create matrix.\n"+detail.msg | ||
3276 | 2355 | print msg | ||
3277 | 2356 | return | ||
3278 | 2357 | except TreeParseError as detail: | ||
3279 | 2358 | msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg | ||
3280 | 2359 | print msg | ||
3281 | 2360 | return | ||
3282 | 2361 | except: | ||
3283 | 2362 | msg = "***Error: Failed to create matrix due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n" | ||
3284 | 2363 | print msg | ||
3285 | 2364 | traceback.print_exc() | ||
3286 | 2365 | return | ||
3287 | 2366 | |||
3288 | 2367 | f = open(output, "w") | ||
3289 | 2368 | f.write(matrix) | ||
3290 | 2369 | f.close() | ||
3291 | 2370 | |||
3292 | 2371 | return | ||
3293 | 2372 | |||
3294 | 2373 | |||
3295 | 2374 | def _equivalents_to_csv(equivalents): | ||
3296 | 2375 | |||
3297 | 2376 | output_string = 'Taxa,Equivalents,Status\n' | ||
3298 | 2377 | |||
3299 | 2378 | for taxon in sorted(equivalents): | ||
3300 | 2379 | output_string += taxon + "," + ';'.join(equivalents[taxon][0]) + "," + equivalents[taxon][1] + "\n" | ||
3301 | 2380 | |||
3302 | 2381 | return output_string | ||
3303 | 2382 | |||
3304 | 2383 | |||
3305 | 2384 | def _equivalents_to_subs(equivalents): | ||
3306 | 2385 | """Only corrects the yellow ones. Red and green are left alone""" | ||
3307 | 2386 | |||
3308 | 2387 | output_string = "" | ||
3309 | 2388 | for taxon in sorted(equivalents): | ||
3310 | 2389 | if (equivalents[taxon][1] == 'yellow'): | ||
3311 | 2390 | # the first name is always the correct one | ||
3312 | 2391 | output_string += taxon + " = "+equivalents[taxon][0][0]+"\n" | ||
3313 | 2392 | return output_string | ||
3314 | 1640 | 2393 | ||
3315 | 1641 | if __name__ == "__main__": | 2394 | if __name__ == "__main__": |
3316 | 1642 | main() | 2395 | main() |
3317 | 1643 | 2396 | ||
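The new process subcommand above chains the taxonomy-aware steps end to end: name checking, taxonomy creation, species-level substitution, independence and overlap checks, then matrix export. A condensed sketch of the same pipeline as direct library calls, skipping the name substitutions, intermediate file writes and TNT-based pruning of non-monophyletic taxa; the import alias and the file names data.phyml / matrix.tnt are illustrative only:

    import stk.supertree_toolkit as supertree_toolkit

    phyml = supertree_toolkit.load_phyml("data.phyml")
    equivalents = supertree_toolkit.taxonomic_checker(phyml, verbose=True)     # OTU names vs EoL; feeds the auto-generated subs file
    taxonomy = supertree_toolkit.create_taxonomy(phyml, existing_taxonomy=None, verbose=True)
    phyml = supertree_toolkit.generate_species_level_data(phyml, taxonomy, verbose=True)
    data_ind, subsets, phyml = supertree_toolkit.data_independence(phyml, make_new_xml=True)
    overlap_ok, groups = supertree_toolkit.data_overlap(phyml, verbose=True)   # enough taxonomic overlap between trees?
    matrix = supertree_toolkit.create_matrix(phyml)
    with open("matrix.tnt", "w") as f:
        f.write(matrix)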
3318 | === modified file 'stk/stk_exceptions.py' | |||
3319 | --- stk/stk_exceptions.py 2013-10-22 08:26:54 +0000 | |||
3320 | +++ stk/stk_exceptions.py 2017-01-12 09:27:31 +0000 | |||
3321 | @@ -134,4 +134,12 @@ | |||
3322 | 134 | def __init__(self, msg): | 134 | def __init__(self, msg): |
3323 | 135 | self.msg = msg | 135 | self.msg = msg |
3324 | 136 | 136 | ||
3325 | 137 | class NoneCompleteTaxonomy(Error): | ||
3326 | 138 | """Exception raised when a taxonomy is not complete for these data | ||
3327 | 139 | Attributes: | ||
3328 | 140 | msg -- explanation of error | ||
3329 | 141 | """ | ||
3330 | 142 | |||
3331 | 143 | def __init__(self, msg): | ||
3332 | 144 | self.msg = msg | ||
3333 | 137 | 145 | ||
3334 | 138 | 146 | ||
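The new NoneCompleteTaxonomy exception follows the pattern of the existing error classes: it carries only a msg attribute and is caught by name in the stk/stk handlers above. A small usage sketch; the helper function, import path and message text are illustrative only, not part of this patch:

    from stk.stk_exceptions import NoneCompleteTaxonomy

    def require_full_taxonomy(taxonomy, taxa):
        # hypothetical helper: refuse to continue if any taxon lacks a taxonomy entry
        missing = [t for t in taxa if t not in taxonomy]
        if missing:
            raise NoneCompleteTaxonomy("No taxonomy entry for: " + ", ".join(missing))

    try:
        require_full_taxonomy({"Falco_peregrinus": {}}, ["Falco_peregrinus", "Buteo_buteo"])
    except NoneCompleteTaxonomy as detail:
        print "***Error: Failed to carry out auto subs.\n" + detail.msg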
3335 | === modified file 'stk/supertree_toolkit.py' | |||
3336 | --- stk/supertree_toolkit.py 2017-01-11 15:16:21 +0000 | |||
3337 | +++ stk/supertree_toolkit.py 2017-01-12 09:27:31 +0000 | |||
3338 | @@ -44,15 +44,49 @@ | |||
3339 | 44 | import unicodedata | 44 | import unicodedata |
3340 | 45 | from stk_internals import * | 45 | from stk_internals import * |
3341 | 46 | from copy import deepcopy | 46 | from copy import deepcopy |
3342 | 47 | import Queue | ||
3343 | 48 | import threading | ||
3344 | 49 | import urllib2 | ||
3345 | 50 | from urllib import quote_plus | ||
3346 | 51 | import simplejson as json | ||
3347 | 52 | import time | ||
3348 | 47 | import types | 53 | import types |
3349 | 48 | 54 | ||
3350 | 49 | #plt.ion() | 55 | #plt.ion() |
3351 | 50 | 56 | ||
3352 | 57 | sys.setrecursionlimit(50000) | ||
3353 | 51 | # GLOBAL VARIABLES | 58 | # GLOBAL VARIABLES |
3354 | 52 | IDENTICAL = 0 | 59 | IDENTICAL = 0 |
3355 | 53 | SUBSET = 1 | 60 | SUBSET = 1 |
3356 | 54 | PLATFORM = sys.platform | 61 | PLATFORM = sys.platform |
3358 | 55 | taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom'] | 62 | #Logging |
3359 | 63 | import logging | ||
3360 | 64 | logging.basicConfig(filename='supertreetoolkit.log', level=logging.DEBUG, format='%(asctime)s %(levelname)s:%(message)s', datefmt='%m/%d/%Y %I:%M:%S %p') | ||
3361 | 65 | |||
3362 | 66 | # taxonomy levels | ||
3363 | 67 | # What we get from EOL | ||
3364 | 68 | current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom'] | ||
3365 | 69 | # And the extra ones from ITIS | ||
3366 | 70 | extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom'] | ||
3367 | 71 | # all of them in order | ||
3368 | 72 | taxonomy_levels = ['species','subgenus','genus','tribe','subfamily','family','superfamily','subsection','section','parvorder','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom'] | ||
3369 | 73 | |||
3370 | 74 | SPECIES = taxonomy_levels[0] | ||
3371 | 75 | GENUS = taxonomy_levels[2] | ||
3372 | 76 | FAMILY = taxonomy_levels[5] | ||
3373 | 77 | SUPERFAMILY = taxonomy_levels[6] | ||
3374 | 78 | INFRAORDER = taxonomy_levels[10] | ||
3375 | 79 | SUBORDER = taxonomy_levels[11] | ||
3376 | 80 | ORDER = taxonomy_levels[12] | ||
3377 | 81 | SUPERORDER = taxonomy_levels[13] | ||
3378 | 82 | SUBCLASS = taxonomy_levels[14] | ||
3379 | 83 | CLASS = taxonomy_levels[15] | ||
3380 | 84 | SUBPHYLUM = taxonomy_levels[17] | ||
3381 | 85 | PHYLUM = taxonomy_levels[18] | ||
3382 | 86 | SUPERPHYLUM = taxonomy_levels[19] | ||
3383 | 87 | INFRAKINGDOM = taxonomy_levels[20] | ||
3384 | 88 | SUBKINGDOM = taxonomy_levels[21] | ||
3385 | 89 | KINGDOM = taxonomy_levels[22] | ||
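The numeric indices above have to track the expanded taxonomy_levels list; a quick check at a Python prompt confirms where a couple of the ranks now sit:

    >>> taxonomy_levels.index('genus')
    2
    >>> taxonomy_levels.index('family')
    5
    >>> taxonomy_levels.index('kingdom')
    22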
3386 | 56 | 90 | ||
3387 | 57 | # supertree_toolkit is the backend for the STK. Loaded by both the GUI and | 91 | # supertree_toolkit is the backend for the STK. Loaded by both the GUI and |
3388 | 58 | # CLI, this contains all the functions to actually *do* something | 92 | # CLI, this contains all the functions to actually *do* something |
3389 | @@ -60,6 +94,17 @@ | |||
3390 | 60 | # All functions take XML and a list of other arguments, process the data and return | 94 | # All functions take XML and a list of other arguments, process the data and return |
3391 | 61 | # it back to the user interface handler to save it somewhere | 95 | # it back to the user interface handler to save it somewhere |
3392 | 62 | 96 | ||
3393 | 97 | |||
3394 | 98 | def get_project_name(XML): | ||
3395 | 99 | """ | ||
3396 | 100 | Get the name of the dataset currently being worked on | ||
3397 | 101 | """ | ||
3398 | 102 | |||
3399 | 103 | xml_root = _parse_xml(XML) | ||
3400 | 104 | |||
3401 | 105 | return xml_root.xpath('/phylo_storage/project_name/string_value')[0].text | ||
3402 | 106 | |||
3403 | 107 | |||
3404 | 63 | def create_name(authors, year, append=''): | 108 | def create_name(authors, year, append=''): |
3405 | 64 | """ | 109 | """ |
3406 | 65 | Construct a sensible from a list of authors and a year for a | 110 | Construct a sensible from a list of authors and a year for a |
3407 | @@ -161,6 +206,22 @@ | |||
3408 | 161 | 206 | ||
3409 | 162 | return names | 207 | return names |
3410 | 163 | 208 | ||
3411 | 209 | def get_all_tree_names(XML): | ||
3412 | 210 | """ From a full XML-PHYML string, extract all tree names. | ||
3413 | 211 | """ | ||
3414 | 212 | |||
3415 | 213 | xml_root = _parse_xml(XML) | ||
3416 | 214 | find = etree.XPath("//source") | ||
3417 | 215 | sources = find(xml_root) | ||
3418 | 216 | names = [] | ||
3419 | 217 | for s in sources: | ||
3420 | 218 | for st in s.xpath("source_tree"): | ||
3421 | 219 | if 'name' in st.attrib and not st.attrib['name'] == "": | ||
3422 | 220 | names.append(st.attrib['name']) | ||
3423 | 221 | |||
3424 | 222 | return names | ||
3425 | 223 | |||
3426 | 224 | |||
3427 | 164 | def set_unique_names(XML): | 225 | def set_unique_names(XML): |
3428 | 165 | """ Ensures all sources have unique names. | 226 | """ Ensures all sources have unique names. |
3429 | 166 | """ | 227 | """ |
3430 | @@ -249,9 +310,17 @@ | |||
3431 | 249 | if (ele.tag == "source"): | 310 | if (ele.tag == "source"): |
3432 | 250 | sources.append(ele) | 311 | sources.append(ele) |
3433 | 251 | 312 | ||
3434 | 313 | if overwrite: | ||
3435 | 314 | # remove all the names first | ||
3436 | 315 | for s in sources: | ||
3437 | 316 | for st in s.xpath("source_tree"): | ||
3438 | 317 | if 'name' in st.attrib: | ||
3439 | 318 | del st.attrib['name'] | ||
3440 | 319 | |||
3441 | 320 | |||
3442 | 252 | for s in sources: | 321 | for s in sources: |
3443 | 253 | for st in s.xpath("source_tree"): | 322 | for st in s.xpath("source_tree"): |
3445 | 254 | if overwrite or not 'name' in st.attrib: | 323 | if not 'name' in st.attrib: |
3446 | 255 | tree_name = create_tree_name(XML,st) | 324 | tree_name = create_tree_name(XML,st) |
3447 | 256 | st.attrib['name'] = tree_name | 325 | st.attrib['name'] = tree_name |
3448 | 257 | 326 | ||
3449 | @@ -339,7 +408,7 @@ | |||
3450 | 339 | taxa = etree.SubElement(s_tree,"taxa_data") | 408 | taxa = etree.SubElement(s_tree,"taxa_data") |
3451 | 340 | taxa.tail="\n " | 409 | taxa.tail="\n " |
3452 | 341 | # Note: we do not add all elements as otherwise they get set to some option | 410 | # Note: we do not add all elements as otherwise they get set to some option |
3454 | 342 | # rather than remaining blank (and hence blue int he interface) | 411 | # rather than remaining blank (and hence blue in the interface) |
3455 | 343 | 412 | ||
3456 | 344 | # append our new source to the main tree | 413 | # append our new source to the main tree |
3457 | 345 | # if sources has no valid source, overwrite, | 414 | # if sources has no valid source, overwrite, |
3458 | @@ -877,7 +946,7 @@ | |||
3459 | 877 | # Need to add checks on the file. Problems include: | 946 | # Need to add checks on the file. Problems include: |
3460 | 878 | # TNT: outputs Phyllip format or something - basically a Newick | 947 | # TNT: outputs Phyllip format or something - basically a Newick |
3461 | 879 | # string without commas, so add 'em back in | 948 | # string without commas, so add 'em back in |
3463 | 880 | m = re.search(r'proc-;', content) | 949 | m = re.search(r'proc.;', content) |
3464 | 881 | if (m != None): | 950 | if (m != None): |
3465 | 882 | # TNT output tree | 951 | # TNT output tree |
3466 | 883 | # Done on a Mac? Replace ^M with a newline | 952 | # Done on a Mac? Replace ^M with a newline |
3467 | @@ -1402,6 +1471,36 @@ | |||
3468 | 1402 | 1471 | ||
3469 | 1403 | return _amalgamate_trees(trees,format,anonymous) | 1472 | return _amalgamate_trees(trees,format,anonymous) |
3470 | 1404 | 1473 | ||
3471 | 1474 | def get_taxa_from_tree_for_taxonomy(tree, pretty=False, ignoreErrors=False): | ||
3472 | 1475 | """Returns a list of all taxa available for the tree passed as argument. | ||
3473 | 1476 | :param tree: string with the data for the tree in Newick format. | ||
3474 | 1477 | :type tree: string | ||
3475 | 1478 | :param pretty: defines if '_' in taxa names should be replaced with spaces. | ||
3476 | 1479 | :type pretty: boolean | ||
3477 | 1480 | :param ignoreErrors: should execution continue on error? | ||
3478 | 1481 | :type ignoreErrors: boolean | ||
3479 | 1482 | :returns: list of strings with the taxa names, sorted alphabetically | ||
3480 | 1483 | :rtype: list | ||
3481 | 1484 | """ | ||
3482 | 1485 | taxa_list = [] | ||
3483 | 1486 | |||
3484 | 1487 | try: | ||
3485 | 1488 | taxa_list.extend(_getTaxaFromNewick(tree)) | ||
3486 | 1489 | except TreeParseError as detail: | ||
3487 | 1490 | if (ignoreErrors): | ||
3488 | 1491 | logging.warning(detail.msg) | ||
3489 | 1492 | pass | ||
3490 | 1493 | else: | ||
3491 | 1494 | raise TreeParseError( detail.msg ) | ||
3492 | 1495 | |||
3493 | 1496 | # now uniquify the list of taxa | ||
3494 | 1497 | taxa_list = _uniquify(taxa_list) | ||
3495 | 1498 | taxa_list.sort() | ||
3496 | 1499 | |||
3497 | 1500 | if (pretty): | ||
3498 | 1501 | taxa_list = [x.replace('_', ' ') for x in taxa_list] | ||
3499 | 1502 | |||
3500 | 1503 | return taxa_list | ||
3501 | 1405 | 1504 | ||
3502 | 1406 | def get_all_taxa(XML, pretty=False, ignoreErrors=False): | 1505 | def get_all_taxa(XML, pretty=False, ignoreErrors=False): |
3503 | 1407 | """ Produce a taxa list by scanning all trees within | 1506 | """ Produce a taxa list by scanning all trees within |
3504 | @@ -1422,21 +1521,17 @@ | |||
3505 | 1422 | taxa_list.extend(_getTaxaFromNewick(t)) | 1521 | taxa_list.extend(_getTaxaFromNewick(t)) |
3506 | 1423 | except TreeParseError as detail: | 1522 | except TreeParseError as detail: |
3507 | 1424 | if (ignoreErrors): | 1523 | if (ignoreErrors): |
3508 | 1524 | logging.warning(detail.msg) | ||
3509 | 1425 | pass | 1525 | pass |
3510 | 1426 | else: | 1526 | else: |
3511 | 1427 | raise TreeParseError( detail.msg ) | 1527 | raise TreeParseError( detail.msg ) |
3512 | 1428 | 1528 | ||
3513 | 1429 | |||
3514 | 1430 | |||
3515 | 1431 | # now uniquify the list of taxa | 1529 | # now uniquify the list of taxa |
3516 | 1432 | taxa_list = _uniquify(taxa_list) | 1530 | taxa_list = _uniquify(taxa_list) |
3517 | 1433 | taxa_list.sort() | 1531 | taxa_list.sort() |
3518 | 1434 | 1532 | ||
3524 | 1435 | if (pretty): | 1533 | if (pretty): #Remove underscores from names |
3525 | 1436 | unpretty_tl = taxa_list | 1534 | taxa_list = [x.replace('_', ' ') for x in taxa_list] |
3521 | 1437 | taxa_list = [] | ||
3522 | 1438 | for t in unpretty_tl: | ||
3523 | 1439 | taxa_list.append(t.replace('_',' ')) | ||
3526 | 1440 | 1535 | ||
3527 | 1441 | return taxa_list | 1536 | return taxa_list |
3528 | 1442 | 1537 | ||
3529 | @@ -1508,7 +1603,7 @@ | |||
3530 | 1508 | return outgroups | 1603 | return outgroups |
3531 | 1509 | 1604 | ||
3532 | 1510 | 1605 | ||
3534 | 1511 | def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False): | 1606 | def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False, verbose=False): |
3535 | 1512 | """ From all trees in the XML, create a matrix | 1607 | """ From all trees in the XML, create a matrix |
3536 | 1513 | """ | 1608 | """ |
3537 | 1514 | 1609 | ||
3538 | @@ -1553,7 +1648,7 @@ | |||
3539 | 1553 | taxa.sort() | 1648 | taxa.sort() |
3540 | 1554 | taxa.insert(0,"MRP_Outgroup") | 1649 | taxa.insert(0,"MRP_Outgroup") |
3541 | 1555 | 1650 | ||
3543 | 1556 | return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights) | 1651 | return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights,verbose=verbose) |
3544 | 1557 | 1652 | ||
3545 | 1558 | 1653 | ||
3546 | 1559 | def create_matrix_from_trees(trees,format="hennig"): | 1654 | def create_matrix_from_trees(trees,format="hennig"): |
3547 | @@ -1925,7 +2020,7 @@ | |||
3548 | 1925 | _check_data(XML) | 2020 | _check_data(XML) |
3549 | 1926 | 2021 | ||
3550 | 1927 | xml_root = _parse_xml(XML) | 2022 | xml_root = _parse_xml(XML) |
3552 | 1928 | proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text | 2023 | proj_name = get_project_name(XML) |
3553 | 1929 | 2024 | ||
3554 | 1930 | output_string = "======================\n" | 2025 | output_string = "======================\n" |
3555 | 1931 | output_string += " Data summary of: " + proj_name + "\n" | 2026 | output_string += " Data summary of: " + proj_name + "\n" |
3556 | @@ -1989,6 +2084,188 @@ | |||
3557 | 1989 | 2084 | ||
3558 | 1990 | return output_string | 2085 | return output_string |
3559 | 1991 | 2086 | ||
3560 | 2087 | def taxonomic_checker_list(name_list,existing_data=None,verbose=False): | ||
3561 | 2088 | """ For each name in the database generate a database of the original name, | ||
3562 | 2089 | possible synonyms and if the taxon is not known, signal that. We do this by | ||
3563 | 2090 | using the EoL API to grab synonyms of each taxon. """ | ||
3564 | 2091 | |||
3565 | 2092 | import urllib2 | ||
3566 | 2093 | from urllib import quote_plus | ||
3567 | 2094 | import simplejson as json | ||
3568 | 2095 | |||
3569 | 2096 | if existing_data == None: | ||
3570 | 2097 | equivalents = {} | ||
3571 | 2098 | else: | ||
3572 | 2099 | equivalents = existing_data | ||
3573 | 2100 | |||
3574 | 2101 | # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result? | ||
3575 | 2102 | # if not, is there another API function to do this? | ||
3576 | 2103 | # search for the taxon and grab the name - if you search for a recognised synonym on EoL then | ||
3577 | 2104 | # you get the original ('correct') name - shorten this to two words and you're done. | ||
3578 | 2105 | for t in name_list: | ||
3579 | 2106 | if t in equivalents: | ||
3580 | 2107 | continue | ||
3581 | 2108 | taxon = t.replace("_"," ") | ||
3582 | 2109 | if (verbose): | ||
3583 | 2110 | print "Looking up ", taxon | ||
3584 | 2111 | # get the data from EOL on taxon | ||
3585 | 2112 | taxonq = quote_plus(taxon) | ||
3586 | 2113 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
3587 | 2114 | req = urllib2.Request(URL) | ||
3588 | 2115 | opener = urllib2.build_opener() | ||
3589 | 2116 | f = opener.open(req) | ||
3590 | 2117 | data = json.load(f) | ||
3591 | 2118 | # check if there's some data | ||
3592 | 2119 | if len(data['results']) == 0: | ||
3593 | 2120 | equivalents[t] = [[t],'red'] | ||
3594 | 2121 | continue | ||
3595 | 2122 | amber = False | ||
3596 | 2123 | if len(data['results']) > 1: | ||
3597 | 2124 | # this is not great - we have multiple hits for this taxon - the user needs to go back and check this | ||
3598 | 2125 | # for automatic processing we'll just take the first one though | ||
3599 | 2126 | # colour is amber in this case | ||
3600 | 2127 | amber = True | ||
3601 | 2128 | ID = str(data['results'][0]['id']) # take first hit | ||
3602 | 2129 | URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=0&videos=0&sounds=0&maps=0&text=0&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0" | ||
3603 | 2130 | req = urllib2.Request(URL) | ||
3604 | 2131 | opener = urllib2.build_opener() | ||
3605 | 2132 | |||
3606 | 2133 | try: | ||
3607 | 2134 | f = opener.open(req) | ||
3608 | 2135 | except urllib2.HTTPError: | ||
3609 | 2136 | equivalents[t] = [[t],'red'] | ||
3610 | 2137 | continue | ||
3611 | 2138 | data = json.load(f) | ||
3612 | 2139 | if len(data['scientificName']) == 0: | ||
3613 | 2140 | # not found a scientific name, so set as red | ||
3614 | 2141 | equivalents[t] = [[t],'red'] | ||
3615 | 2142 | continue | ||
3616 | 2143 | correct_name = data['scientificName'].encode("ascii","ignore") | ||
3617 | 2144 | # we only want the first two bits of the name, not the original author and year if any | ||
3618 | 2145 | temp_name = correct_name.split(' ') | ||
3619 | 2146 | if (len(temp_name) > 2): | ||
3620 | 2147 | correct_name = ' '.join(temp_name[0:2]) | ||
3621 | 2148 | correct_name = correct_name.replace(' ','_') | ||
3622 | 2149 | |||
3623 | 2150 | # build up the output dictionary - original name is key, synonyms/missing is value | ||
3624 | 2151 | if (correct_name == t): | ||
3625 | 2152 | # if the original matches the 'correct', then it's green | ||
3626 | 2153 | equivalents[t] = [[t], 'green'] | ||
3627 | 2154 | else: | ||
3628 | 2155 | # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the | ||
3629 | 2156 | # 'correct' taxon at the top | ||
3630 | 2157 | eol_synonyms = data['synonyms'] | ||
3631 | 2158 | synonyms = [] | ||
3632 | 2159 | for s in eol_synonyms: | ||
3633 | 2160 | ts = s['synonym'].encode("ascii","ignore") | ||
3634 | 2161 | temp_syn = ts.split(' ') | ||
3635 | 2162 | if (len(temp_syn) > 2): | ||
3636 | 2163 | temp_syn = ' '.join(temp_syn[0:2]) | ||
3637 | 2164 | ts = temp_syn | ||
3638 | 2165 | if (s['relationship'] == "synonym"): | ||
3639 | 2166 | ts = ts.replace(" ","_") | ||
3640 | 2167 | synonyms.append(ts) | ||
3641 | 2168 | synonyms = _uniquify(synonyms) | ||
3642 | 2169 | # we need to put the correct name at the top of the list now | ||
3643 | 2170 | if (correct_name in synonyms): | ||
3644 | 2171 | synonyms.insert(0, synonyms.pop(synonyms.index(correct_name))) | ||
3645 | 2172 | elif len(synonyms) == 0: | ||
3646 | 2173 | synonyms.append(correct_name) | ||
3647 | 2174 | else: | ||
3648 | 2175 | synonyms.insert(0,correct_name) | ||
3649 | 2176 | |||
3650 | 2177 | if (amber): | ||
3651 | 2178 | equivalents[t] = [synonyms,'amber'] | ||
3652 | 2179 | else: | ||
3653 | 2180 | equivalents[t] = [synonyms,'yellow'] | ||
3654 | 2181 | # if our search was empty, then it's red - see above | ||
3655 | 2182 | |||
3656 | 2183 | # up to the calling function to do something sensible with this | ||
3657 | 2184 | # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red. | ||
3658 | 2185 | # Amber means we found synonyms and multiple hits. The user definitely needs to sort these! | ||
3659 | 2186 | |||
3660 | 2187 | return equivalents | ||
3661 | 2188 | |||
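A rough usage sketch for taxonomic_checker_list (the taxon names are only illustrative, and the call needs network access to the EoL API):

    equivalents = taxonomic_checker_list(["Gallus_gallus", "Ardea_goliath"], verbose=True)
    for taxon, (synonyms, status) in equivalents.items():
        # status is 'green', 'yellow', 'amber' or 'red'; synonyms[0] is the preferred name
        print taxon, status, synonyms[0]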
3662 | 2189 | def taxonomic_checker_tree(tree_file,existing_data=None,verbose=False): | ||
3663 | 2190 | """ For each name in the database generate a database of the original name, | ||
3664 | 2191 | possible synonyms and if the taxon is not known, signal that. We do this by | ||
3665 | 2192 | using the EoL API to grab synonyms of each taxon. """ | ||
3666 | 2193 | |||
3667 | 2194 | tree = import_tree(tree_file) | ||
3668 | 2195 | p4tree = _parse_tree(tree) | ||
3669 | 2196 | taxa = p4tree.getAllLeafNames(p4tree.root) | ||
3670 | 2197 | if existing_data == None: | ||
3671 | 2198 | equivalents = {} | ||
3672 | 2199 | else: | ||
3673 | 2200 | equivalents = existing_data | ||
3674 | 2201 | |||
3675 | 2202 | equivalents = taxonomic_checker_list(taxa,existing_data,verbose) | ||
3676 | 2203 | return equivalents | ||
3677 | 2204 | |||
3678 | 2205 | def taxonomic_checker(XML,existing_data=None,verbose=False): | ||
3679 | 2206 | """ For each name in the database generate a database of the original name, | ||
3680 | 2207 | possible synonyms and if the taxon is not known, signal that. We do this by | ||
3681 | 2208 | using the EoL API to grab synonyms of each taxon. """ | ||
3682 | 2209 | |||
3683 | 2210 | # grab all taxa | ||
3684 | 2211 | taxa = get_all_taxa(XML) | ||
3685 | 2212 | |||
3686 | 2213 | if existing_data == None: | ||
3687 | 2214 | equivalents = {} | ||
3688 | 2215 | else: | ||
3689 | 2216 | equivalents = existing_data | ||
3690 | 2217 | |||
3691 | 2218 | equivalents = taxonomic_checker_list(taxa,existing_data,verbose) | ||
3692 | 2219 | return equivalents | ||
3693 | 2220 | |||
3694 | 2221 | |||
3695 | 2222 | def load_equivalents(equiv_csv): | ||
3696 | 2223 | """Load equivalents data from a csv and convert to an equivalents dict. | ||
3697 | 2224 | Each key maps to a list containing the array of synonyms, followed by a status ('green', | ||
3698 | 2225 | 'yellow' or 'red'). | ||
3699 | 2226 | |||
3700 | 2227 | """ | ||
3701 | 2228 | |||
3702 | 2229 | import csv | ||
3703 | 2230 | |||
3704 | 2231 | equivalents = {} | ||
3705 | 2232 | |||
3706 | 2233 | with open(equiv_csv, 'rU') as csvfile: | ||
3707 | 2234 | equiv_reader = csv.reader(csvfile, delimiter=',') | ||
3708 | 2235 | equiv_reader.next() # skip header | ||
3709 | 2236 | for row in equiv_reader: | ||
3710 | 2237 | i = 1 | ||
3711 | 2238 | equivalents[row[0]] = [row[1].split(';'),row[2]] | ||
3712 | 2239 | |||
3713 | 2240 | return equivalents | ||
3714 | 2241 | |||
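Judging from the parsing above, the equivalents CSV is expected to have a header row, one taxon per line, synonyms separated by semicolons and the status in the last column; the column names and taxa below are only a guess at the layout:

    Original,Synonyms,Status
    Ardea_goliath,Ardea_goliath,green
    Gallus_lafayetii,Gallus_lafayettii;Gallus_lafayetii,yellow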
3715 | 2242 | def save_taxonomy(taxonomy, output_file): | ||
3716 | 2243 | |||
3717 | 2244 | import csv | ||
3718 | 2245 | |||
3719 | 2246 | with open(output_file, 'w') as f: | ||
3720 | 2247 | writer = csv.writer(f) | ||
3721 | 2248 | row = ['OTU'] | ||
3722 | 2249 | row.extend(taxonomy_levels) | ||
3723 | 2250 | row.append('Provider') | ||
3724 | 2251 | writer.writerow(row) | ||
3725 | 2252 | for t in taxonomy: | ||
3726 | 2253 | species = t | ||
3727 | 2254 | row = [] | ||
3728 | 2255 | row.append(t.encode('utf-8')) | ||
3729 | 2256 | for l in taxonomy_levels: | ||
3730 | 2257 | try: | ||
3731 | 2258 | g = taxonomy[t][l] | ||
3732 | 2259 | except KeyError: | ||
3733 | 2260 | g = '-' | ||
3734 | 2261 | row.append(g.encode('utf-8')) | ||
3735 | 2262 | try: | ||
3736 | 2263 | provider = taxonomy[t]['provider'] | ||
3737 | 2264 | except KeyError: | ||
3738 | 2265 | provider = "-" | ||
3739 | 2266 | row.append(provider) | ||
3740 | 2267 | |||
3741 | 2268 | writer.writerow(row) | ||
3742 | 1992 | 2269 | ||
3743 | 1993 | 2270 | ||
3744 | 1994 | def load_taxonomy(taxonomy_csv): | 2271 | def load_taxonomy(taxonomy_csv): |
3745 | @@ -2000,20 +2277,443 @@ | |||
3746 | 2000 | 2277 | ||
3747 | 2001 | with open(taxonomy_csv, 'rU') as csvfile: | 2278 | with open(taxonomy_csv, 'rU') as csvfile: |
3748 | 2002 | tax_reader = csv.reader(csvfile, delimiter=',') | 2279 | tax_reader = csv.reader(csvfile, delimiter=',') |
3763 | 2003 | tax_reader.next() | 2280 | try: |
3764 | 2004 | for row in tax_reader: | 2281 | j = 0 |
3765 | 2005 | current_taxonomy = {} | 2282 | for row in tax_reader: |
3766 | 2006 | i = 1 | 2283 | if j == 0: |
3767 | 2007 | for t in taxonomy_levels: | 2284 | tax_levels = row[1:-1] |
3768 | 2008 | if not row[i] == '-': | 2285 | j += 1 |
3769 | 2009 | current_taxonomy[t] = row[i] | 2286 | continue |
3770 | 2010 | i = i+ 1 | 2287 | i = 1 |
3771 | 2011 | 2288 | current_taxonomy = {} | |
3772 | 2012 | current_taxonomy['provider'] = row[17] # data source | 2289 | for t in tax_levels: |
3773 | 2013 | taxonomy[row[0]] = current_taxonomy | 2290 | if not row[i] == '-': |
3774 | 2014 | 2291 | current_taxonomy[t] = row[i] | |
3775 | 2015 | return taxonomy | 2292 | i = i+ 1 |
3776 | 2016 | 2293 | current_taxonomy['provider'] = row[-1] # data source | |
3777 | 2294 | taxonomy[row[0].replace(" ","_")] = current_taxonomy | ||
3778 | 2295 | j += 1 | ||
3779 | 2296 | except: | ||
3780 | 2297 | pass | ||
3781 | 2298 | |||
3782 | 2299 | return taxonomy | ||
3783 | 2300 | |||
3784 | 2301 | |||
3785 | 2302 | class TaxonomyFetcher(threading.Thread): | ||
3786 | 2303 | """ Class to provide the taxonomy fetching functionality as a threaded function to be used individually or working with a pool. | ||
3787 | 2304 | """ | ||
3788 | 2305 | |||
3789 | 2306 | def __init__(self, taxonomy, lock, queue, id=0, pref_db=None, verbose=False, ignoreWarnings=False): | ||
3790 | 2307 | """ Constructor for the threaded model. | ||
3791 | 2308 | :param taxonomy: previous taxonomy available (if available) or an empty dictionary to store the results . | ||
3792 | 2309 | :type taxonomy: dictionary | ||
3793 | 2310 | :param lock: lock to keep the taxonomy threadsafe. | ||
3794 | 2311 | :type lock: Lock | ||
3795 | 2312 | :param queue: queue where the taxa are kept to be processed. | ||
3796 | 2313 | :type queue: Queue of strings | ||
3797 | 2314 | :param id: id for the thread to use if messages need to be printed. | ||
3798 | 2315 | :type id: int | ||
3799 | 2316 | :param pref_db: Gives priority to database. Seems it is unused. | ||
3800 | 2317 | :type pref_db: string | ||
3801 | 2318 | :param verbose: Show verbose messages during execution, will also define level of logging. True will set logging level to INFO. | ||
3802 | 2319 | :type verbose: boolean | ||
3803 | 2320 | :param ignoreWarnings: Ignore warnings and errors during execution? Errors will be logged with ERROR level on the logging output. | ||
3804 | 2321 | :type ignoreWarnings: boolean | ||
3805 | 2322 | """ | ||
3806 | 2323 | |||
3807 | 2324 | threading.Thread.__init__(self) | ||
3808 | 2325 | self.taxonomy = taxonomy | ||
3809 | 2326 | self.lock = lock | ||
3810 | 2327 | self.queue = queue | ||
3811 | 2328 | self.id = id | ||
3812 | 2329 | self.verbose = verbose | ||
3813 | 2330 | self.pref_db = pref_db | ||
3814 | 2331 | self.ignoreWarnings = ignoreWarnings | ||
3815 | 2332 | |||
3816 | 2333 | def run(self): | ||
3817 | 2334 | """ Gets and processes a taxon from the queue to get its taxonomy.""" | ||
3818 | 2335 | while True : | ||
3819 | 2336 | if self.verbose : | ||
3820 | 2337 | logging.getLogger().setLevel(logging.INFO) | ||
3821 | 2338 | #get taxon from queue | ||
3822 | 2339 | taxon = self.queue.get() | ||
3823 | 2340 | |||
3824 | 2341 | logging.debug("Starting {} with thread #{} remaining ~{}".format(taxon,str(self.id),str(self.queue.qsize()))) | ||
3825 | 2342 | |||
3826 | 2343 | #Lock access to the taxonomy | ||
3827 | 2344 | self.lock.acquire() | ||
3828 | 2345 | if not taxon in self.taxonomy: # is a new taxon, not previously in the taxonomy | ||
3829 | 2346 | #Release access to the taxonomy | ||
3830 | 2347 | self.lock.release() | ||
3831 | 2348 | if (self.verbose): | ||
3832 | 2349 | print "Looking up ", taxon | ||
3833 | 2350 | logging.info("Loolking up taxon: {}".format(str(taxon))) | ||
3834 | 2351 | try: | ||
3835 | 2352 | # get the data from EOL on taxon | ||
3836 | 2353 | taxonq = quote_plus(taxon) | ||
3837 | 2354 | URL = "http://eol.org/api/search/1.0.json?q="+taxonq | ||
3838 | 2355 | req = urllib2.Request(URL) | ||
3839 | 2356 | opener = urllib2.build_opener() | ||
3840 | 2357 | f = opener.open(req) | ||
3841 | 2358 | data = json.load(f) | ||
3842 | 2359 | # check if there's some data | ||
3843 | 2360 | if len(data['results']) == 0: | ||
3844 | 2361 | # try PBDB as it might be a fossil | ||
3845 | 2362 | URL = "http://paleobiodb.org/data1.1/taxa/single.json?name="+taxonq+"&show=phylo&vocab=pbdb" | ||
3846 | 2363 | req = urllib2.Request(URL) | ||
3847 | 2364 | opener = urllib2.build_opener() | ||
3848 | 2365 | f = opener.open(req) | ||
3849 | 2366 | datapbdb = json.load(f) | ||
3850 | 2367 | if (len(datapbdb['records']) == 0): | ||
3851 | 2368 | # no idea! | ||
3852 | 2369 | with self.lock: | ||
3853 | 2370 | self.taxonomy[taxon] = {} | ||
3854 | 2371 | self.queue.task_done() | ||
3855 | 2372 | continue | ||
3856 | 2373 | # otherwise, let's fill in info here - only if extinct! | ||
3857 | 2374 | if datapbdb['records'][0]['is_extant'] == 0: | ||
3858 | 2375 | this_taxonomy = {} | ||
3859 | 2376 | this_taxonomy['provider'] = 'PBDB' | ||
3860 | 2377 | for level in taxonomy_levels: | ||
3861 | 2378 | try: | ||
3862 | 2379 | if datapbdb.has_key('records'): | ||
3863 | 2380 | pbdb_lev = datapbdb['records'][0][level] | ||
3864 | 2381 | temp_lev = pbdb_lev.split(" ") | ||
3865 | 2382 | # they might have the author on the end, so strip it off | ||
3866 | 2383 | if (level == 'species'): | ||
3867 | 2384 | this_taxonomy[level] = ' '.join(temp_lev[0:2]) | ||
3868 | 2385 | else: | ||
3869 | 2386 | this_taxonomy[level] = temp_lev[0] | ||
3870 | 2387 | except KeyError as e: | ||
3871 | 2388 | logging.exception("Key not found records") | ||
3872 | 2389 | continue | ||
3873 | 2390 | # add the taxon at right level too | ||
3874 | 2391 | try: | ||
3875 | 2392 | if datapbdb.has_key('records'): | ||
3876 | 2393 | current_level = datapbdb['records'][0]['rank'] | ||
3877 | 2394 | this_taxonomy[current_level] = datapbdb['records'][0]['taxon_name'] | ||
3878 | 2395 | except KeyError as e: | ||
3879 | 2396 | self.queue.task_done() | ||
3880 | 2397 | logging.exception("Key not found records") | ||
3881 | 2398 | continue | ||
3882 | 2399 | with self.lock: | ||
3883 | 2400 | self.taxonomy[taxon] = this_taxonomy | ||
3884 | 2401 | self.queue.task_done() | ||
3885 | 2402 | continue | ||
3886 | 2403 | else: | ||
3887 | 2404 | # extant, but not in EoL - leave the user to sort this one out | ||
3888 | 2405 | with self.lock: | ||
3889 | 2406 | self.taxonomy[taxon] = {} | ||
3890 | 2407 | self.queue.task_done() | ||
3891 | 2408 | continue | ||
3892 | 2409 | |||
3893 | 2410 | |||
3894 | 2411 | ID = str(data['results'][0]['id']) # take first hit | ||
3895 | 2412 | # Now look for taxonomies | ||
3896 | 2413 | URL = "http://eol.org/api/pages/1.0/"+ID+".json" | ||
3897 | 2414 | req = urllib2.Request(URL) | ||
3898 | 2415 | opener = urllib2.build_opener() | ||
3899 | 2416 | f = opener.open(req) | ||
3900 | 2417 | data = json.load(f) | ||
3901 | 2418 | if len(data['taxonConcepts']) == 0: | ||
3902 | 2419 | with self.lock: | ||
3903 | 2420 | self.taxonomy[taxon] = {} | ||
3904 | 2421 | self.queue.task_done() | ||
3905 | 2422 | continue | ||
3906 | 2423 | TID = str(data['taxonConcepts'][0]['identifier']) # take first hit | ||
3907 | 2424 | currentdb = str(data['taxonConcepts'][0]['nameAccordingTo']) | ||
3908 | 2425 | # loop through and get preferred one if specified | ||
3909 | 2426 | # now get taxonomy | ||
3910 | 2427 | if (not self.pref_db is None): | ||
3911 | 2428 | for db in data['taxonConcepts']: | ||
3912 | 2429 | currentdb = db['nameAccordingTo'].lower() | ||
3913 | 2430 | if (self.pref_db.lower() in currentdb): | ||
3914 | 2431 | TID = str(db['identifier']) | ||
3915 | 2432 | break | ||
3916 | 2433 | URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json" | ||
3917 | 2434 | req = urllib2.Request(URL) | ||
3918 | 2435 | opener = urllib2.build_opener() | ||
3919 | 2436 | f = opener.open(req) | ||
3920 | 2437 | data = json.load(f) | ||
3921 | 2438 | this_taxonomy = {} | ||
3922 | 2439 | this_taxonomy['provider'] = currentdb | ||
3923 | 2440 | for a in data['ancestors']: | ||
3924 | 2441 | try: | ||
3925 | 2442 | if a.has_key('taxonRank') : | ||
3926 | 2443 | temp_level = a['taxonRank'].encode("ascii","ignore") | ||
3927 | 2444 | if (temp_level in taxonomy_levels): | ||
3928 | 2445 | # note the dump into ASCII | ||
3929 | 2446 | temp_name = a['scientificName'].encode("ascii","ignore") | ||
3930 | 2447 | temp_name = temp_name.split(" ") | ||
3931 | 2448 | if (temp_level == 'species'): | ||
3932 | 2449 | this_taxonomy[temp_level] = ' '.join(temp_name[0:2]) | ||
3933 | 2450 | |||
3934 | 2451 | else: | ||
3935 | 2452 | this_taxonomy[temp_level] = temp_name[0] | ||
3936 | 2453 | except KeyError as e: | ||
3937 | 2454 | logging.exception("Key not found: taxonRank") | ||
3938 | 2455 | continue | ||
3939 | 2456 | try: | ||
3940 | 2457 | # add the queried taxon itself into the taxonomy at its own rank | ||
3941 | 2458 | # some issues here, so let's make sure it's OK | ||
3942 | 2459 | temp_name = taxon.split(" ") | ||
3943 | 2460 | if data.has_key('taxonRank') : | ||
3944 | 2461 | if not data['taxonRank'].lower() == 'species': | ||
3945 | 2462 | this_taxonomy[data['taxonRank'].lower()] = temp_name[0] | ||
3946 | 2463 | else: | ||
3947 | 2464 | this_taxonomy[data['taxonRank'].lower()] = ' '.join(temp_name[0:2]) | ||
3948 | 2465 | except KeyError as e: | ||
3949 | 2466 | self.queue.task_done() | ||
3950 | 2467 | logging.exception("Key not found: taxonRank") | ||
3951 | 2468 | continue | ||
3952 | 2469 | with self.lock: | ||
3953 | 2470 | #Send result to dictionary | ||
3954 | 2471 | self.taxonomy[taxon] = this_taxonomy | ||
3955 | 2472 | except urllib2.HTTPError: | ||
3956 | 2473 | print("Network error when processing {} ".format(taxon,)) | ||
3957 | 2474 | logging.info("Network error when processing {} ".format(taxon,)) | ||
3958 | 2475 | self.queue.task_done() | ||
3959 | 2476 | continue | ||
3960 | 2477 | except urllib2.URLError: | ||
3961 | 2478 | print("Network error when processing {} ".format(taxon,)) | ||
3962 | 2479 | logging.info("Network error when processing {} ".format(taxon,)) | ||
3963 | 2480 | self.queue.task_done() | ||
3964 | 2481 | continue | ||
3965 | 2482 | else : | ||
3966 | 2483 | #Nothing to do, release the lock on the taxonomy | ||
3967 | 2484 | self.lock.release() | ||
3968 | 2485 | #Mark task as done | ||
3969 | 2486 | self.queue.task_done() | ||
3970 | 2487 | |||
3971 | 2488 | def create_taxonomy_from_taxa(taxa, taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False, threadNumber=5): | ||
3972 | 2489 | """Uses the taxa provided to generate a taxonomy for all the taxa provided. | ||
3973 | 2490 | :param taxa: list of the taxa. | ||
3974 | 2491 | :type taxa : list | ||
3975 | 2492 | :param taxonomy: previous taxonomy available (if available) or an empty | ||
3976 | 2493 | dictionary to store the results. If None will be init to an empty dictionary | ||
3977 | 2494 | :type taxonomy: dictionary | ||
3978 | 2495 | :param pref_db: Gives priority to database. Seems it is unused. | ||
3979 | 2496 | :type pref_db: string | ||
3980 | 2497 | :param verbose: Show verbose messages during execution, will also define | ||
3981 | 2498 | level of logging. True will set logging level to INFO. | ||
3982 | 2499 | :type verbose: boolean | ||
3983 | 2500 | :param ignoreWarnings: Ignore warnings and errors during execution? Errors | ||
3984 | 2501 | will be logged with ERROR level on the logging output. | ||
3985 | 2502 | :type ignoreWarnings: boolean | ||
3986 | 2503 | :param threadNumber: Maximum number of threads to use for taxonomy processing. | ||
3987 | 2504 | :type threadNumber: int | ||
3988 | 2505 | :returns: nothing; the taxonomy dictionary passed in is populated in place, one entry per taxon | ||
3989 | 2506 | :rtype: None | ||
3990 | 2507 | """ | ||
3991 | 2508 | if verbose : | ||
3992 | 2509 | logging.getLogger().setLevel(logging.INFO) | ||
3993 | 2510 | if taxonomy is None: | ||
3994 | 2511 | taxonomy = {} | ||
3995 | 2512 | |||
3996 | 2513 | lock = threading.Lock() | ||
3997 | 2514 | queue = Queue.Queue() | ||
3998 | 2515 | |||
3999 | 2516 | #Starting a few threads as daemons checking the queue | ||
4000 | 2517 | for i in range(threadNumber) : | ||
4001 | 2518 | t = TaxonomyFetcher(taxonomy, lock, queue, i, pref_db, verbose, ignoreWarnings) | ||
4002 | 2519 | t.setDaemon(True) | ||
4003 | 2520 | t.start() | ||
4004 | 2521 | |||
4005 | 2522 | #Populate the queue with the taxa. | ||
4006 | 2523 | for taxon in taxa : | ||
4007 | 2524 | queue.put(taxon) | ||
4008 | 2525 | |||
4009 | 2526 | #Wait till everyone finishes | ||
4010 | 2527 | queue.join() | ||
4011 | 2528 | logging.getLogger().setLevel(logging.WARNING) | ||
4012 | 2529 | |||
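A minimal sketch of driving the threaded fetcher above (the taxa are illustrative and the lookups need network access to EoL/PBDB):

    taxonomy = {}
    create_taxonomy_from_taxa(["Ardea goliath", "Gallus gallus"], taxonomy, threadNumber=2, verbose=True)
    # the shared dictionary is filled in place by the worker threads
    for taxon, ranks in taxonomy.items():
        print taxon, ranks.get('family', '-'), ranks.get('provider', '-')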
4013 | 2530 | def create_taxonomy_from_tree(tree, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False): | ||
4014 | 2531 | """ Generates the taxonomy from a tree. Uses a similar method to the XML version but works directly on a string with the tree. | ||
4015 | 2532 | :param tree: the tree as a Newick string. | ||
4016 | 2533 | :type tree: string | ||
4017 | 2534 | :param existing_taxonomy: previously gathered taxonomy to extend (if any). | ||
4018 | 2535 | :type existing_taxonomy: dictionary | ||
4019 | 2536 | :param pref_db: Gives priority to database. Seems it is unused. | ||
4020 | 2537 | :type pref_db: string | ||
4021 | 2538 | :param verbose: Flag for verbosity. | ||
4022 | 2539 | :type verbose: boolean | ||
4023 | 2540 | :param ignoreWarnings: Flag for exception processing. | ||
4024 | 2541 | :type ignoreWarnings: boolean | ||
4025 | 2542 | :returns: the modified taxonomy | ||
4026 | 2543 | :rtype: dictionary | ||
4027 | 2544 | """ | ||
4028 | 2545 | starttime = time.time() | ||
4029 | 2546 | |||
4030 | 2547 | if(existing_taxonomy is None) : | ||
4031 | 2548 | taxonomy = {} | ||
4032 | 2549 | else : | ||
4033 | 2550 | taxonomy = existing_taxonomy | ||
4034 | 2551 | |||
4035 | 2552 | taxa = get_taxa_from_tree_for_taxonomy(tree, pretty=True) | ||
4036 | 2553 | |||
4037 | 2554 | create_taxonomy_from_taxa(taxa, taxonomy) | ||
4038 | 2555 | |||
4039 | 2556 | taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings) | ||
4040 | 2557 | |||
4041 | 2558 | return taxonomy | ||
4042 | 2559 | |||
4043 | 2560 | def create_taxonomy(XML, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False): | ||
4044 | 2561 | """Generates a taxonomy of the data from EoL data. This is stored as a | ||
4045 | 2562 | dictionary of taxonomy for each taxon in the dataset. Missing data are | ||
4046 | 2563 | encoded as '' (blank string). It's up to the calling function to store this | ||
4047 | 2564 | data to file or display it.""" | ||
4048 | 2565 | |||
4049 | 2566 | starttime = time.time() | ||
4050 | 2567 | |||
4051 | 2568 | if not ignoreWarnings: | ||
4052 | 2569 | _check_data(XML) | ||
4053 | 2570 | |||
4054 | 2571 | if (existing_taxonomy is None): | ||
4055 | 2572 | taxonomy = {} | ||
4056 | 2573 | else: | ||
4057 | 2574 | taxonomy = existing_taxonomy | ||
4058 | 2575 | taxa = get_all_taxa(XML, pretty=True) | ||
4059 | 2576 | create_taxonomy_from_taxa(taxa, taxonomy) | ||
4060 | 2577 | #taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings) | ||
4061 | 2578 | return taxonomy | ||
4062 | 2579 | |||
4063 | 2580 | def create_extended_taxonomy(taxonomy, starttime, verbose=False, ignoreWarnings=False): | ||
4064 | 2581 | """Bring extra taxonomy terms from other databases, shared method for completing the taxonomy | ||
4065 | 2582 | both for trees comming from XML or directly from trees. | ||
4066 | 2583 | :param taxonomy: Dictionary with the relationship for taxa and taxonomy terms. | ||
4067 | 2584 | :type taxonomy: dictionary | ||
4068 | 2585 | :param starttime: time to keep track of processing time. | ||
4069 | 2586 | :type starttime: long | ||
4070 | 2587 | :param verbose: Flag for verbosity. | ||
4071 | 2588 | :type verbose: boolean | ||
4072 | 2589 | :param ignoreWarnings: Flag for exception processing. | ||
4073 | 2590 | :type ignoreWarnings: boolean | ||
4074 | 2591 | :returns: the modified taxonomy | ||
4075 | 2592 | :rtype: dictionary | ||
4076 | 2593 | """ | ||
4077 | 2594 | |||
4078 | 2595 | if (verbose): | ||
4079 | 2596 | logging.info('Done basic taxonomy, getting more info from ITIS') | ||
4080 | 2597 | print("Time elapsed {}".format(str(time.time() - starttime))) | ||
4081 | 2598 | print "Done basic taxonomy, getting more info from ITIS" | ||
4082 | 2599 | # fill in the rest of the taxonomy | ||
4083 | 2600 | # get all genera | ||
4084 | 2601 | genera = [] | ||
4085 | 2602 | for t in taxonomy: | ||
4086 | 2603 | if t in taxonomy: | ||
4087 | 2604 | if GENUS in taxonomy[t]: | ||
4088 | 2605 | genera.append(taxonomy[t][GENUS]) | ||
4089 | 2606 | genera = _uniquify(genera) | ||
4090 | 2607 | # We then use ITIS to fill in missing info based on the genera only - that saves us a species level search | ||
4091 | 2608 | # and we can fill in most of the EoL missing data | ||
4092 | 2609 | for g in genera: | ||
4093 | 2610 | if (verbose): | ||
4094 | 2611 | print "Looking up ", g | ||
4095 | 2612 | logging.info("Looking up {}".format(str(g))) | ||
4096 | 2613 | try: | ||
4097 | 2614 | URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(g.strip()) | ||
4098 | 2615 | except: | ||
4099 | 2616 | continue | ||
4100 | 2617 | req = urllib2.Request(URL) | ||
4101 | 2618 | opener = urllib2.build_opener() | ||
4102 | 2619 | try: | ||
4103 | 2620 | f = opener.open(req) | ||
4104 | 2621 | except urllib2.HTTPError: | ||
4105 | 2622 | continue | ||
4106 | 2623 | string = unicode(f.read(),"ISO-8859-1") | ||
4107 | 2624 | data = json.loads(string) | ||
4108 | 2625 | if data['scientificNames'][0] == None: | ||
4109 | 2626 | continue | ||
4110 | 2627 | tsn = data["scientificNames"][0]["tsn"] | ||
4111 | 2628 | URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn) | ||
4112 | 2629 | req = urllib2.Request(URL) | ||
4113 | 2630 | opener = urllib2.build_opener() | ||
4114 | 2631 | f = opener.open(req) | ||
4115 | 2632 | try: | ||
4116 | 2633 | string = unicode(f.read(),"ISO-8859-1") | ||
4117 | 2634 | except: | ||
4118 | 2635 | continue | ||
4119 | 2636 | data = json.loads(string) | ||
4120 | 2637 | this_taxonomy = {} | ||
4121 | 2638 | for level in data['hierarchyList']: | ||
4122 | 2639 | if not level['rankName'].lower() in current_taxonomy_levels: | ||
4123 | 2640 | # note the dump into ASCII | ||
4124 | 2641 | if level['rankName'].lower() == 'species': | ||
4125 | 2642 | this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = ' '.join(level['taxonName'].split(" ")[0:2]).encode("ascii","ignore") | ||
4126 | 2643 | else: | ||
4127 | 2644 | this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore") | ||
4128 | 2645 | |||
4129 | 2646 | for t in taxonomy: | ||
4130 | 2647 | if t in taxonomy: | ||
4131 | 2648 | if GENUS in taxonomy[t]: | ||
4132 | 2649 | if taxonomy[t][GENUS] == g: | ||
4133 | 2650 | taxonomy[t].update(this_taxonomy) | ||
4134 | 2651 | |||
4135 | 2652 | return taxonomy | ||
4136 | 2653 | |||
4137 | 2654 | def generate_species_level_data(XML, taxonomy, ignoreWarnings=False, verbose=False): | ||
4138 | 2655 | """ Based on a taxonomy data set, amend the data to be at species level as | ||
4139 | 2656 | far as possible. This function creates an internal 'subs file' and calls | ||
4140 | 2657 | the standard substitution functions. The internal subs are generated by | ||
4141 | 2658 | looping over the taxa and if not at species-level, working out which level | ||
4142 | 2659 | they are at and then adding species already in the dataset to replace it | ||
4143 | 2660 | via a polytomy. This has to be done in one step to avoid adding spurious | ||
4144 | 2661 | structure to the phylogenies """ | ||
4145 | 2662 | |||
4146 | 2663 | if not ignoreWarnings: | ||
4147 | 2664 | _check_data(XML) | ||
4148 | 2665 | |||
4149 | 2666 | # if taxonomic checker not done, warn | ||
4150 | 2667 | if (not taxonomy): | ||
4151 | 2668 | raise NoneCompleteTaxonomy("Taxonomy is empty. Create a taxonomy first. You'll probably need to hand edit the file to complete it.") | ||
4152 | 2669 | return | ||
4153 | 2670 | |||
4154 | 2671 | # if missing data in taxonomy, warn | ||
4155 | 2672 | taxa = get_all_taxa(XML) | ||
4156 | 2673 | keys = taxonomy.keys() | ||
4157 | 2674 | if (not ignoreWarnings): | ||
4158 | 2675 | for t in taxa: | ||
4159 | 2676 | t = t.replace("_"," ") | ||
4160 | 2677 | if not t in keys: | ||
4161 | 2678 | # The idea here is that the caller will catch this, then re-run with ignoreWarnings set to True | ||
4162 | 2679 | raise NoneCompleteTaxonomy("Taxonomy is not complete. I will soldier on anyway, but this might not work as intended") | ||
4163 | 2680 | |||
4164 | 2681 | # get all taxa - see above! | ||
4165 | 2682 | # for each taxa, if not at species level | ||
4166 | 2683 | new_taxa = [] | ||
4167 | 2684 | old_taxa = [] | ||
4168 | 2685 | for t in taxa: | ||
4169 | 2686 | subs = [] | ||
4170 | 2687 | t = t.replace("_"," ") | ||
4171 | 2688 | if (not SPECIES in taxonomy[t]): # the current taxon is not a species, but higher level taxon | ||
4172 | 2689 | # work out which level - should we encode this in the data to start with? | ||
4173 | 2690 | for tl in taxonomy_levels: | ||
4174 | 2691 | try: | ||
4175 | 2692 | tax_data = taxonomy[t][tl] | ||
4176 | 2693 | except KeyError: | ||
4177 | 2694 | continue | ||
4178 | 2695 | if (t == taxonomy[t][tl]): | ||
4179 | 2696 | current_level = tl | ||
4180 | 2697 | # find all species in the taxonomy that match this level | ||
4181 | 2698 | for taxon in taxa: | ||
4182 | 2699 | taxon = taxon.replace("_"," ") | ||
4183 | 2700 | if (SPECIES in taxonomy[taxon]): | ||
4184 | 2701 | try: | ||
4185 | 2702 | if taxonomy[taxon][current_level] == t: # our current taxon | ||
4186 | 2703 | subs.append(taxon.replace(" ","_")) | ||
4187 | 2704 | except KeyError: | ||
4188 | 2705 | continue | ||
4189 | 2706 | |||
4190 | 2707 | # create the sub | ||
4191 | 2708 | if len(subs) > 0: | ||
4192 | 2709 | old_taxa.append(t.replace(" ","_")) | ||
4193 | 2710 | new_taxa.append(','.join(subs)) | ||
4194 | 2711 | |||
4195 | 2712 | # call the sub | ||
4196 | 2713 | new_XML = substitute_taxa(XML, old_taxa, new_taxa, verbose=verbose) | ||
4197 | 2714 | new_XML = clean_data(new_XML) | ||
4198 | 2715 | |||
4199 | 2716 | return new_XML | ||
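To illustrate the internal subs this builds, suppose the dataset contains a genus-level OTU Gallus plus two Gallus species (names used purely as an example); the paired lists would end up as:

    # conceptually the loop above produces paired lists like:
    old_taxa = ['Gallus']
    new_taxa = ['Gallus_gallus,Gallus_lafayetii']
    # substitute_taxa() then replaces Gallus with the polytomy
    # (Gallus_gallus,Gallus_lafayetii) in every source tree in one pass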
4200 | 2017 | 2717 | ||
4201 | 2018 | def data_overlap(XML, overlap_amount=2, filename=None, detailed=False, show=False, verbose=False, ignoreWarnings=False): | 2718 | def data_overlap(XML, overlap_amount=2, filename=None, detailed=False, show=False, verbose=False, ignoreWarnings=False): |
4202 | 2019 | """ Calculate the amount of taxonomic overlap between source trees. | 2719 | """ Calculate the amount of taxonomic overlap between source trees. |
4203 | @@ -2024,7 +2724,7 @@ | |||
4204 | 2024 | If filename is None, no graphic is generated. Otherwise a simple | 2724 | If filename is None, no graphic is generated. Otherwise a simple |
4205 | 2025 | graphic is generated showing the number of cluster. If detailed is set to | 2725 | graphic is generated showing the number of cluster. If detailed is set to |
4206 | 2026 | true, a graphic is generated showing *all* trees. For data containing >200 | 2726 | true, a graphic is generated showing *all* trees. For data containing >200 |
4208 | 2027 | source tres this could be very big and take along time. More likely, you'll run | 2727 | source trees this could be very big and take a long time. More likely, you'll run
4209 | 2028 | out of memory. | 2728 | out of memory. |
4210 | 2029 | """ | 2729 | """ |
4211 | 2030 | import matplotlib | 2730 | import matplotlib |
4212 | @@ -2103,6 +2803,7 @@ | |||
4213 | 2103 | sufficient_overlap = True | 2803 | sufficient_overlap = True |
4214 | 2104 | 2804 | ||
4215 | 2105 | # The above list actually contains which components are seperate from each other | 2805 | # The above list actually contains which components are seperate from each other |
4216 | 2806 | key_list = connected_components | ||
4217 | 2106 | 2807 | ||
4218 | 2107 | if (not filename == None or show): | 2808 | if (not filename == None or show): |
4219 | 2108 | if (verbose): | 2809 | if (verbose): |
4220 | @@ -2266,7 +2967,9 @@ | |||
4221 | 2266 | prev_char = None | 2967 | prev_char = None |
4222 | 2267 | prev_taxa = None | 2968 | prev_taxa = None |
4223 | 2268 | prev_name = None | 2969 | prev_name = None |
4225 | 2269 | non_ind = {} | 2970 | subsets = [] |
4226 | 2971 | identical = [] | ||
4227 | 2972 | is_identical = False | ||
4228 | 2270 | for data in data_ind: | 2973 | for data in data_ind: |
4229 | 2271 | name = data[0] | 2974 | name = data[0] |
4230 | 2272 | char = data[1] | 2975 | char = data[1] |
4231 | @@ -2275,22 +2978,71 @@ | |||
4232 | 2275 | # when sorted, the longer list comes first | 2978 | # when sorted, the longer list comes first |
4233 | 2276 | if set(taxa).issubset(set(prev_taxa)): | 2979 | if set(taxa).issubset(set(prev_taxa)): |
4234 | 2277 | if (taxa == prev_taxa): | 2980 | if (taxa == prev_taxa): |
4236 | 2278 | non_ind[name] = [prev_name,IDENTICAL] | 2981 | if (is_identical): |
4237 | 2982 | identical[-1].append(name) | ||
4238 | 2983 | else: | ||
4239 | 2984 | identical.append([name,prev_name]) | ||
4240 | 2985 | is_identical = True | ||
4241 | 2986 | |||
4242 | 2279 | else: | 2987 | else: |
4244 | 2280 | non_ind[name] = [prev_name,SUBSET] | 2988 | subsets.append([prev_name, name]) |
4245 | 2989 | prev_name = name | ||
4246 | 2990 | is_identical = False | ||
4247 | 2991 | else: | ||
4248 | 2992 | prev_name = name | ||
4249 | 2993 | is_identical = False | ||
4250 | 2994 | else: | ||
4251 | 2995 | prev_name = name | ||
4252 | 2996 | is_identical = False | ||
4253 | 2997 | |||
4254 | 2281 | prev_char = char | 2998 | prev_char = char |
4255 | 2282 | prev_taxa = taxa | 2999 | prev_taxa = taxa |
4258 | 2283 | prev_name = name | 3000 | |
4257 | 2284 | |||
4259 | 2285 | if (make_new_xml): | 3001 | if (make_new_xml): |
4260 | 2286 | new_xml = XML | 3002 | new_xml = XML |
4264 | 2287 | for name in non_ind: | 3003 | # deal with subsets |
4265 | 2288 | if (non_ind[name][1] == SUBSET): | 3004 | for s in subsets: |
4266 | 2289 | new_xml = _swap_tree_in_XML(new_xml,None,name) | 3005 | new_xml = _swap_tree_in_XML(new_xml,None,s[1]) |
4267 | 2290 | new_xml = clean_data(new_xml) | 3006 | new_xml = clean_data(new_xml) |
4269 | 2291 | return non_ind, new_xml | 3007 | # deal with identical - weight them, if there's 3, weights are 1/3, i.e. |
4270 | 3008 | # weights are 1/no of identical trees | ||
4271 | 3009 | for i in identical: | ||
4272 | 3010 | weight = 1.0 / float(len(i)) | ||
4273 | 3011 | new_xml = add_weights(new_xml, i, weight) | ||
4274 | 3012 | |||
4275 | 3013 | return identical, subsets, new_xml | ||
4276 | 2292 | else: | 3014 | else: |
4278 | 2293 | return non_ind | 3015 | return identical, subsets |
4279 | 3016 | |||
4280 | 3017 | |||
4281 | 3018 | def add_weights(XML, names, weight): | ||
4282 | 3019 | """ Add weights to trees: supply an array of tree names and a weight, and they get set. | ||
4283 | 3020 | Returns a new XML | ||
4284 | 3021 | """ | ||
4285 | 3022 | |||
4286 | 3023 | xml_root = _parse_xml(XML) | ||
4287 | 3024 | # By getting source, we can then loop over each source_tree | ||
4288 | 3025 | find = etree.XPath("//source_tree") | ||
4289 | 3026 | sources = find(xml_root) | ||
4290 | 3027 | for s in sources: | ||
4291 | 3028 | s_name = s.attrib['name'] | ||
4292 | 3029 | for n in names: | ||
4293 | 3030 | if s_name == n: | ||
4294 | 3031 | if s.xpath("tree/weight/real_value") == []: | ||
4295 | 3032 | # add weights | ||
4296 | 3033 | weights_element = etree.Element("weight") | ||
4297 | 3034 | weights_element.tail="\n" | ||
4298 | 3035 | real_value = etree.SubElement(weights_element,'real_value') | ||
4299 | 3036 | real_value.attrib['rank'] = '0' | ||
4300 | 3037 | real_value.tail = '\n' | ||
4301 | 3038 | real_value.text = str(weight) | ||
4302 | 3039 | t = s.xpath("tree")[0] | ||
4303 | 3040 | t.append(weights_element) | ||
4304 | 3041 | else: | ||
4305 | 3042 | s.xpath("tree/weight/real_value")[0].text = str(weight) | ||
4306 | 3043 | |||
4307 | 3044 | return etree.tostring(xml_root,pretty_print=True) | ||
4308 | 3045 | |||
4309 | 2294 | 3046 | ||
4310 | 2295 | def add_historical_event(XML, event_description): | 3047 | def add_historical_event(XML, event_description): |
4311 | 2296 | """ | 3048 | """ |
4312 | @@ -2380,8 +3132,15 @@ | |||
4313 | 2380 | # check trees are informative | 3132 | # check trees are informative |
4314 | 2381 | XML = _check_informative_trees(XML,delete=True) | 3133 | XML = _check_informative_trees(XML,delete=True) |
4315 | 2382 | 3134 | ||
4316 | 3135 | |||
4317 | 2383 | # check sources | 3136 | # check sources |
4318 | 2384 | XML = _check_sources(XML,delete=True) | 3137 | XML = _check_sources(XML,delete=True) |
4319 | 3138 | XML = all_sourcenames(XML) | ||
4320 | 3139 | |||
4321 | 3140 | # fix tree names | ||
4322 | 3141 | XML = set_unique_names(XML) | ||
4323 | 3142 | XML = set_all_tree_names(XML,overwrite=True) | ||
4324 | 3143 | |||
4325 | 2385 | 3144 | ||
4326 | 2386 | # unpermutable trees | 3145 | # unpermutable trees |
4327 | 2387 | permutable_trees = _find_trees_for_permuting(XML) | 3146 | permutable_trees = _find_trees_for_permuting(XML) |
4328 | @@ -2659,7 +3418,7 @@ | |||
4329 | 2659 | s.getparent().remove(s) | 3418 | s.getparent().remove(s) |
4330 | 2660 | 3419 | ||
4331 | 2661 | # edit name (append _subset) | 3420 | # edit name (append _subset) |
4333 | 2662 | proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text | 3421 | proj_name = get_project_name(XML) |
4334 | 2663 | proj_name += "_subset" | 3422 | proj_name += "_subset" |
4335 | 2664 | xml_root.xpath('/phylo_storage/project_name/string_value')[0].text = proj_name | 3423 | xml_root.xpath('/phylo_storage/project_name/string_value')[0].text = proj_name |
4336 | 2665 | 3424 | ||
4337 | @@ -2928,6 +3687,37 @@ | |||
4338 | 2928 | 3687 | ||
4339 | 2929 | return mrca | 3688 | return mrca |
4340 | 2930 | 3689 | ||
4341 | 3690 | |||
4342 | 3691 | def tree_from_taxonomy(taxonomy, end_level, end_rank): | ||
4343 | 3692 | """Create a tree from a taxonomy data structure. | ||
4344 | 3693 | This is not the most efficient way, but works OK | ||
4345 | 3694 | """ | ||
4346 | 3695 | |||
4347 | 3696 | # Grab data only for the end_level classification | ||
4348 | 3697 | required_taxonomy = {} | ||
4349 | 3698 | for t in taxonomy: | ||
4350 | 3699 | if (end_level in t): | ||
4351 | 3700 | required_taxonomy[t] = taxonomy[t] | ||
4352 | 3701 | |||
4353 | 3702 | rank_index = taxonomy_levels.index(end_rank) | ||
4354 | 3703 | |||
4355 | 3704 | # create basic string | ||
4356 | 3705 | |||
4357 | 3706 | # get unique otus | ||
4358 | 3707 | |||
4359 | 3708 | # sort by the subfamily | ||
4360 | 3709 | |||
4361 | 3710 | # for each genus create a newick string | ||
4362 | 3711 | |||
4363 | 3712 | # if it's the same grouping as previous, add as sister clade (i.e. ,) | ||
4364 | 3713 | # else, prepend a (, append a ) and add new clade (ie. ,) | ||
4365 | 3714 | |||
4366 | 3715 | |||
4367 | 3716 | # return tree | ||
4368 | 3717 | |||
4369 | 3718 | |||
4370 | 3719 | |||
4371 | 3720 | |||
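tree_from_taxonomy above is still only an outline; as a very rough sketch of the kind of grouping its comments describe (this is not the toolkit's implementation), one could emit one polytomy per group and join the groups into a single tree:

    def _tree_from_taxonomy_sketch(taxonomy, group_rank='family'):
        # bucket each OTU under its value for group_rank, then write one polytomy per group
        groups = {}
        for taxon, ranks in taxonomy.items():
            group = ranks.get(group_rank)
            if group is None:
                continue
            groups.setdefault(group, []).append(taxon.replace(' ', '_'))
        clades = []
        for group in sorted(groups):
            members = sorted(groups[group])
            clades.append(members[0] if len(members) == 1 else "(" + ",".join(members) + ")")
        return "(" + ",".join(clades) + ");"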
4372 | 2931 | ################ PRIVATE FUNCTIONS ######################## | 3721 | ################ PRIVATE FUNCTIONS ######################## |
4373 | 2932 | 3722 | ||
4374 | 2933 | def _uniquify(l): | 3723 | def _uniquify(l): |
4375 | @@ -2975,13 +3765,25 @@ | |||
4376 | 2975 | "The source names in the dataset are not unique. Please run the auto-name function on these data. Name: "+name+"\n" | 3765 | "The source names in the dataset are not unique. Please run the auto-name function on these data. Name: "+name+"\n" |
4377 | 2976 | last_name = name | 3766 | last_name = name |
4378 | 2977 | 3767 | ||
4379 | 3768 | # do same for tree names: | ||
4380 | 3769 | names = get_all_tree_names(XML) | ||
4381 | 3770 | names.sort() | ||
4382 | 3771 | last_name = "" # This will actually throw an non-unique error if a name is empty | ||
4383 | 3772 | # not great, but still an error! | ||
4384 | 3773 | for name in names: | ||
4385 | 3774 | if name == last_name: | ||
4386 | 3775 | # if non-unique throw exception | ||
4387 | 3776 | message = message + \ | ||
4388 | 3777 | "The tree names in the dataset are not unique. Please run the auto-name function on these data with replace or edit by hand. Name: "+name+"\n" | ||
4389 | 3778 | last_name = name | ||
4390 | 3779 | |||
4391 | 2978 | if (not message == ""): | 3780 | if (not message == ""): |
4392 | 2979 | raise NotUniqueError(message) | 3781 | raise NotUniqueError(message) |
4393 | 2980 | 3782 | ||
4394 | 2981 | return | 3783 | return |
4395 | 2982 | 3784 | ||
4396 | 2983 | 3785 | ||
4398 | 2984 | def _assemble_tree_matrix(tree_string): | 3786 | def _assemble_tree_matrix(tree_string, verbose=False): |
4399 | 2985 | """ Assembles the MRP matrix for an individual tree | 3787 | """ Assembles the MRP matrix for an individual tree |
4400 | 2986 | 3788 | ||
4401 | 2987 | returns: matrix (2D numpy array: taxa on i, nodes on j) | 3789 | returns: matrix (2D numpy array: taxa on i, nodes on j) |
4402 | @@ -3009,7 +3811,7 @@ | |||
4403 | 3009 | for i in range(0,len(names)): | 3811 | for i in range(0,len(names)): |
4404 | 3010 | adjmat.append([1]) | 3812 | adjmat.append([1]) |
4405 | 3011 | adjmat = numpy.array(adjmat) | 3813 | adjmat = numpy.array(adjmat) |
4407 | 3012 | 3814 | if verbose: | |
4408 | 3013 | print "Warning: Found uninformative tree in data. Including it in the matrix anyway" | 3815 | print "Warning: Found uninformative tree in data. Including it in the matrix anyway" |
4409 | 3014 | 3816 | ||
4410 | 3015 | return adjmat, names | 3817 | return adjmat, names |
4411 | @@ -3020,7 +3822,7 @@ | |||
4412 | 3020 | 3822 | ||
4413 | 3021 | If the new_taxa array is missing, simply delete the old_taxa | 3823 | If the new_taxa array is missing, simply delete the old_taxa |
4414 | 3022 | """ | 3824 | """ |
4416 | 3023 | 3825 | ||
4417 | 3024 | tree = _correctly_quote_taxa(tree) | 3826 | tree = _correctly_quote_taxa(tree) |
4418 | 3025 | # are the input values lists or simple strings? | 3827 | # are the input values lists or simple strings? |
4419 | 3026 | if (isinstance(old_taxa,str)): | 3828 | if (isinstance(old_taxa,str)): |
4420 | @@ -3564,7 +4366,7 @@ | |||
4421 | 3564 | 4366 | ||
4422 | 3565 | return permute_trees | 4367 | return permute_trees |
4423 | 3566 | 4368 | ||
4425 | 3567 | def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None): | 4369 | def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None, verbose=False): |
4426 | 3568 | """ | 4370 | """ |
4427 | 3569 | Does the hard work on creating a matrix | 4371 | Does the hard work on creating a matrix |
4428 | 3570 | """ | 4372 | """ |
4429 | @@ -3585,7 +4387,7 @@ | |||
4430 | 3585 | if (not weights == None): | 4387 | if (not weights == None): |
4431 | 3586 | weight = weights[key] | 4388 | weight = weights[key] |
4432 | 3587 | names.append(key) | 4389 | names.append(key) |
4434 | 3588 | submatrix, tree_taxa = _assemble_tree_matrix(trees[key]) | 4390 | submatrix, tree_taxa = _assemble_tree_matrix(trees[key], verbose=verbose) |
4435 | 3589 | nChars = len(submatrix[0,:]) | 4391 | nChars = len(submatrix[0,:]) |
4436 | 3590 | # loop over characters in the submatrix | 4392 | # loop over characters in the submatrix |
4437 | 3591 | for i in range(1,nChars): | 4393 | for i in range(1,nChars): |
4438 | @@ -3637,7 +4439,7 @@ | |||
4439 | 3637 | matrix_string += string + "\n" | 4439 | matrix_string += string + "\n" |
4440 | 3638 | i += 1 | 4440 | i += 1 |
4441 | 3639 | 4441 | ||
4443 | 3640 | matrix_string += "\t;\n" | 4442 | matrix_string += "\n" |
4444 | 3641 | if (not weights == None): | 4443 | if (not weights == None): |
4445 | 3642 | # get unique weights | 4444 | # get unique weights |
4446 | 3643 | unique_weights = _uniquify(weights) | 4445 | unique_weights = _uniquify(weights) |
4447 | @@ -3652,7 +4454,7 @@ | |||
4448 | 3652 | matrix_string += " " + str(i) | 4454 | matrix_string += " " + str(i) |
4449 | 3653 | i += 1 | 4455 | i += 1 |
4450 | 3654 | matrix_string += ";\n" | 4456 | matrix_string += ";\n" |
4452 | 3655 | matrix_string += "procedure /;" | 4457 | matrix_string += "proc /;" |
4453 | 3656 | elif (format == 'nexus'): | 4458 | elif (format == 'nexus'): |
4454 | 3657 | matrix_string = "#nexus\n\nbegin data;\n" | 4459 | matrix_string = "#nexus\n\nbegin data;\n" |
4455 | 3658 | matrix_string += "\tdimensions ntax = "+str(len(taxa)) +" nchar = "+str(last_char)+";\n" | 4460 | matrix_string += "\tdimensions ntax = "+str(len(taxa)) +" nchar = "+str(last_char)+";\n" |
4456 | 3659 | 4461 | ||
4457 | === modified file 'stk/test/_substitute_taxa.py' | |||
4458 | --- stk/test/_substitute_taxa.py 2016-07-14 10:12:17 +0000 | |||
4459 | +++ stk/test/_substitute_taxa.py 2017-01-12 09:27:31 +0000 | |||
4460 | @@ -10,6 +10,7 @@ | |||
4461 | 10 | from stk.supertree_toolkit import check_subs, _tree_contains, _correctly_quote_taxa, _remove_single_poly_taxa | 10 | from stk.supertree_toolkit import check_subs, _tree_contains, _correctly_quote_taxa, _remove_single_poly_taxa |
4462 | 11 | from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_all_taxa, _parse_tree, _delete_taxon | 11 | from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_all_taxa, _parse_tree, _delete_taxon |
4463 | 12 | from stk.supertree_toolkit import _collapse_nodes, import_tree, subs_from_csv, _getTaxaFromNewick, obtain_trees | 12 | from stk.supertree_toolkit import _collapse_nodes, import_tree, subs_from_csv, _getTaxaFromNewick, obtain_trees |
4464 | 13 | from stk.supertree_toolkit import generate_species_level_data | ||
4465 | 13 | from lxml import etree | 14 | from lxml import etree |
4466 | 14 | from util import * | 15 | from util import * |
4467 | 15 | from stk.stk_exceptions import * | 16 | from stk.stk_exceptions import * |
4468 | @@ -776,7 +777,24 @@ | |||
4469 | 776 | new_tree = _sub_taxa_in_tree(tree2,"Thereuopodina",sub_in,skip_existing=True); | 777 | new_tree = _sub_taxa_in_tree(tree2,"Thereuopodina",sub_in,skip_existing=True); |
4470 | 777 | self.assert_(answer2, new_tree) | 778 | self.assert_(answer2, new_tree) |
4471 | 778 | 779 | ||
4473 | 779 | 780 | ||
4474 | 781 | def test_auto_subs_taxonomy(self): | ||
4475 | 782 | """test the automatic subs function with a simple test""" | ||
4476 | 783 | XML = etree.tostring(etree.parse('data/input/auto_sub.phyml',parser),pretty_print=True) | ||
4477 | 784 | taxonomy = {'Ardea goliath': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea goliath'}, | ||
4478 | 785 | 'Pelecaniformes': {'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Pelecaniformes', 'class': 'Aves', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013'}, 'Gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes'}, | ||
4479 | 786 | 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'}, | ||
4480 | 787 | 'Platalea leucorodia': {'kingdom': 'Animalia', 'subfamily': 'Plataleinae', 'family': 'Threskiornithidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Platalea', 'order': 'Pelecaniformes', 'species': 'Platalea leucorodia'}, | ||
4481 | 788 | 'Gallus lafayetii': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus lafayetii'}, | ||
4482 | 789 | 'Ardea humbloti': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea humbloti'}, | ||
4483 | 790 | 'Gallus varius': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus varius'}} | ||
4484 | 791 | XML = generate_species_level_data(XML, taxonomy) | ||
4485 | 792 | expected_XML = etree.tostring(etree.parse('data/output/one_click_subs_output.phyml',parser),pretty_print=True) | ||
4486 | 793 | trees = obtain_trees(XML) | ||
4487 | 794 | expected_trees = obtain_trees(expected_XML) | ||
4488 | 795 | for t in trees: | ||
4489 | 796 | self.assert_(_trees_equal(trees[t], expected_trees[t])) | ||
4490 | 797 | |||
4491 | 780 | def test_parrot_edge_case(self): | 798 | def test_parrot_edge_case(self): |
4492 | 781 | """Random edge case where the tree disappeared...""" | 799 | """Random edge case where the tree disappeared...""" |
4493 | 782 | trees = ["(((((((Agapornis_lilianae, Agapornis_nigrigenis), Agapornis_personata, Agapornis_fischeri), Agapornis_roseicollis), (Agapornis_pullaria, Agapornis_taranta)), Agapornis_cana), Loriculus_galgulus), Geopsittacus_occidentalis);"] | 800 | trees = ["(((((((Agapornis_lilianae, Agapornis_nigrigenis), Agapornis_personata, Agapornis_fischeri), Agapornis_roseicollis), (Agapornis_pullaria, Agapornis_taranta)), Agapornis_cana), Loriculus_galgulus), Geopsittacus_occidentalis);"] |
4494 | 783 | 801 | ||
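
The new test_auto_subs_taxonomy test above passes generate_species_level_data an XML string plus a taxonomy dictionary keyed by taxon name, where each value maps taxonomic ranks (plus a 'provider' entry) to names. A minimal sketch of that call, reusing one entry from the test's dictionary; it assumes the stk package is importable and, as in the test, that the data path resolves from stk/test:

    from lxml import etree
    from stk.supertree_toolkit import generate_species_level_data

    # One entry in the same shape as the dictionary used by test_auto_subs_taxonomy:
    # ranks map to names, and 'provider' records where the classification came from.
    taxonomy = {
        'Gallus varius': {'kingdom': 'Animalia', 'class': 'Aves', 'order': 'Galliformes',
                          'family': 'Phasianidae', 'genus': 'Gallus', 'species': 'Gallus varius',
                          'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013'},
    }
    XML = etree.tostring(etree.parse('data/input/auto_sub.phyml'), pretty_print=True)
    XML = generate_species_level_data(XML, taxonomy)
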
4495 | === modified file 'stk/test/_supertree_toolkit.py' | |||
4496 | --- stk/test/_supertree_toolkit.py 2015-03-26 09:58:58 +0000 | |||
4497 | +++ stk/test/_supertree_toolkit.py 2017-01-12 09:27:31 +0000 | |||
4498 | @@ -7,12 +7,13 @@ | |||
4499 | 7 | import os | 7 | import os |
4500 | 8 | stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir, os.pardir ) | 8 | stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir, os.pardir ) |
4501 | 9 | sys.path.insert(0, stk_path) | 9 | sys.path.insert(0, stk_path) |
4503 | 10 | from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence | 10 | from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence, add_weights |
4504 | 11 | from stk.supertree_toolkit import get_fossil_taxa, get_publication_years, data_summary, get_character_numbers, get_analyses_used | 11 | from stk.supertree_toolkit import get_fossil_taxa, get_publication_years, data_summary, get_character_numbers, get_analyses_used |
4505 | 12 | from stk.supertree_toolkit import data_overlap, read_matrix, subs_file_from_str, clean_data, obtain_trees, get_all_source_names | 12 | from stk.supertree_toolkit import data_overlap, read_matrix, subs_file_from_str, clean_data, obtain_trees, get_all_source_names |
4506 | 13 | from stk.supertree_toolkit import add_historical_event, _sort_data, _parse_xml, _check_sources, _swap_tree_in_XML, replace_genera | 13 | from stk.supertree_toolkit import add_historical_event, _sort_data, _parse_xml, _check_sources, _swap_tree_in_XML, replace_genera |
4507 | 14 | from stk.supertree_toolkit import get_all_taxa, _get_all_siblings, _parse_tree, get_characters_used, _trees_equal, get_weights | 14 | from stk.supertree_toolkit import get_all_taxa, _get_all_siblings, _parse_tree, get_characters_used, _trees_equal, get_weights |
4509 | 15 | from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, load_taxonomy | 15 | from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, taxonomic_checker, load_taxonomy, load_equivalents |
4510 | 16 | from stk.supertree_toolkit import create_taxonomy, create_taxonomy_from_tree, get_all_tree_names | ||
4511 | 16 | from lxml import etree | 17 | from lxml import etree |
4512 | 17 | from util import * | 18 | from util import * |
4513 | 18 | from stk.stk_exceptions import * | 19 | from stk.stk_exceptions import * |
4514 | @@ -268,19 +269,52 @@ | |||
4515 | 268 | 269 | ||
4516 | 269 | def test_data_independence(self): | 270 | def test_data_independence(self): |
4517 | 270 | XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True) | 271 | XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True) |
4521 | 271 | expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]} | 272 | expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']] |
4522 | 272 | non_ind = data_independence(XML) | 273 | non_ind,subsets = data_independence(XML) |
4523 | 273 | self.assertDictEqual(expected_dict, non_ind) | 274 | expected_subsets = [['Hill_2011_1', 'Hill_2011_2']] |
4524 | 275 | self.assertListEqual(expected_subsets, subsets) | ||
4525 | 276 | self.assertListEqual(expected_idents, non_ind) | ||
4526 | 274 | 277 | ||
4528 | 275 | def test_data_independence(self): | 278 | def test_data_independence_2(self): |
4529 | 276 | XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True) | 279 | XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True) |
4533 | 277 | expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]} | 280 | expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']] |
4534 | 278 | non_ind, new_xml = data_independence(XML,make_new_xml=True) | 281 | expected_subsets = [['Hill_2011_1', 'Hill_2011_2']] |
4535 | 279 | self.assertDictEqual(expected_dict, non_ind) | 282 | non_ind, subset, new_xml = data_independence(XML,make_new_xml=True) |
4536 | 283 | self.assertListEqual(expected_idents, non_ind) | ||
4537 | 284 | self.assertListEqual(expected_subsets, subset) | ||
4538 | 280 | # check the second tree has not been removed | 285 | # check the second tree has not been removed |
4539 | 281 | self.assertRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;')) | 286 | self.assertRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;')) |
4540 | 282 | # check that the first tree is removed | 287 | # check that the first tree is removed |
4541 | 283 | self.assertNotRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,(F:1.00000,E:1.00000)0.00000:0.00000)0.00000:0.00000;')) | 288 | self.assertNotRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,(F:1.00000,E:1.00000)0.00000:0.00000)0.00000:0.00000;')) |
4542 | 289 | |||
4543 | 290 | def test_add_weights(self): | ||
4544 | 291 | """Add weights to a bunch of trees""" | ||
4545 | 292 | XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True) | ||
4546 | 293 | # see above | ||
4547 | 294 | expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']] | ||
4548 | 295 | # so the first should end up with a weight of 0.33333 and the second with 0.5 | ||
4549 | 296 | for ei in expected_idents: | ||
4550 | 297 | weight = 1.0/float(len(ei)) | ||
4551 | 298 | XML = add_weights(XML, ei, weight) | ||
4552 | 299 | |||
4553 | 300 | expected_weights = [str(1.0/3.0), str(1.0/3.0), str(1.0/3.0), str(0.5), str(0.5)] | ||
4554 | 301 | weights_in_xml = [] | ||
4555 | 302 | # now check weights have been added to the correct part of the tree | ||
4556 | 303 | xml_root = _parse_xml(XML) | ||
4557 | 304 | i = 0 | ||
4558 | 305 | for ei in expected_idents: | ||
4559 | 306 | for tree in ei: | ||
4560 | 307 | find = etree.XPath("//source_tree") | ||
4561 | 308 | trees = find(xml_root) | ||
4562 | 309 | for t in trees: | ||
4563 | 310 | if t.attrib['name'] == tree: | ||
4564 | 311 | # check len(trees) == 0 | ||
4565 | 312 | weights_in_xml.append(t.xpath("tree/weight/real_value")[0].text) | ||
4566 | 313 | |||
4567 | 314 | self.assertListEqual(expected_weights,weights_in_xml) | ||
4568 | 315 | |||
4569 | 316 | |||
4570 | 317 | |||
4571 | 284 | 318 | ||
4572 | 285 | def test_overlap(self): | 319 | def test_overlap(self): |
4573 | 286 | XML = etree.tostring(etree.parse('data/input/check_overlap_ok.phyml',parser),pretty_print=True) | 320 | XML = etree.tostring(etree.parse('data/input/check_overlap_ok.phyml',parser),pretty_print=True) |
4574 | @@ -438,7 +472,7 @@ | |||
4575 | 438 | XML = clean_data(XML) | 472 | XML = clean_data(XML) |
4576 | 439 | trees = obtain_trees(XML) | 473 | trees = obtain_trees(XML) |
4577 | 440 | self.assert_(len(trees) == 2) | 474 | self.assert_(len(trees) == 2) |
4579 | 441 | expected_trees = {'Hill_2011_4': '(A,B,(C,D,E));', 'Hill_2011_2': '(A, B, C, (D, E, F));'} | 475 | expected_trees = {'Hill_2011_2': '(A,B,(C,D,E));', 'Hill_2011_1': '(A, B, C, (D, E, F));'} |
4580 | 442 | for t in trees: | 476 | for t in trees: |
4581 | 443 | self.assert_(_trees_equal(trees[t],expected_trees[t])) | 477 | self.assert_(_trees_equal(trees[t],expected_trees[t])) |
4582 | 444 | 478 | ||
4583 | @@ -558,18 +592,78 @@ | |||
4584 | 558 | self.assert_(c in expected_characters) | 592 | self.assert_(c in expected_characters) |
4585 | 559 | self.assert_(len(characters) == len(expected_characters)) | 593 | self.assert_(len(characters) == len(expected_characters)) |
4586 | 560 | 594 | ||
4587 | 595 | def test_create_taxonomy(self): | ||
4588 | 596 | XML = etree.tostring(etree.parse('data/input/create_taxonomy.phyml',parser),pretty_print=True) | ||
4589 | 597 | # Tested on 11/01/17 and EOL have changed the output | ||
4590 | 598 | # old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}} | ||
4591 | 599 | expected = {'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}} | ||
4592 | 600 | if (internet_on()): | ||
4593 | 601 | taxonomy = create_taxonomy(XML) | ||
4594 | 602 | self.maxDiff = None | ||
4595 | 603 | self.assertDictEqual(taxonomy, expected) | ||
4596 | 604 | else: | ||
4597 | 605 | print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function" | ||
4598 | 606 | return | ||
4599 | 607 | |||
4600 | 608 | def test_create_taxonomy_from_tree(self): | ||
4601 | 609 | """Tests taxonomy creation from a tree. Uses the same data as the normal XML test but works directly on the tree instead of parsing the XML.""" | ||
4602 | 610 | # Tested on 11/01/17 and this no longer worked, but is correct! EOL returned something different. | ||
4603 | 611 | #old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}} | ||
4604 | 612 | expected = {'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}} | ||
4605 | 613 | tree = "(Archaeopteryx_lithographica, (Gallus_gallus, (Thalassarche_melanophris, Egretta_tricolor)));" | ||
4606 | 614 | if (internet_on()): | ||
4607 | 615 | taxonomy = create_taxonomy_from_tree(tree) | ||
4608 | 616 | self.maxDiff = None | ||
4609 | 617 | self.assertDictEqual(taxonomy, expected) | ||
4610 | 618 | else: | ||
4611 | 619 | print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy function" | ||
4612 | 620 | return | ||
4613 | 621 | |||
4614 | 622 | def test_taxonomy_checker(self): | ||
4615 | 623 | expected = {'Thalassarche_melanophrys': [['Thalassarche_melanophris', 'Thalassarche_melanophrys', 'Diomedea_melanophris', 'Thalassarche_[melanophrys', 'Diomedea_melanophrys'], 'amber'], 'Egretta_tricolor': [['Egretta_tricolor'], 'green'], 'Gallus_gallus': [['Gallus_gallus'], 'green']} | ||
4616 | 624 | XML = etree.tostring(etree.parse('data/input/check_taxonomy.phyml',parser),pretty_print=True) | ||
4617 | 625 | if (internet_on()): | ||
4618 | 626 | equivs = taxonomic_checker(XML) | ||
4619 | 627 | self.maxDiff = None | ||
4620 | 628 | self.assertDictEqual(equivs, expected) | ||
4621 | 629 | else: | ||
4622 | 630 | print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function" | ||
4623 | 631 | return | ||
4624 | 632 | |||
4625 | 633 | def test_taxonomy_checker2(self): | ||
4626 | 634 | XML = etree.tostring(etree.parse('data/input/check_taxonomy_fixes.phyml',parser),pretty_print=True) | ||
4627 | 635 | if (internet_on()): | ||
4628 | 636 | # This test is a bit dodgy as it depends on EOL's server speed. Run it a few times before deciding it's broken. | ||
4629 | 637 | equivs = taxonomic_checker(XML,verbose=False) | ||
4630 | 638 | self.maxDiff = None | ||
4631 | 639 | self.assert_(equivs['Agathamera_crassa'][0][0] == 'Agathemera_crassa') | ||
4632 | 640 | self.assert_(equivs['Celatoblatta_brunni'][0][0] == 'Maoriblatta_brunni') | ||
4633 | 641 | self.assert_(equivs['Blatta_lateralis'][1] == 'amber') | ||
4634 | 642 | else: | ||
4635 | 643 | print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function" | ||
4636 | 644 | return | ||
4637 | 645 | |||
4638 | 646 | |||
4639 | 561 | def test_load_taxonomy(self): | 647 | def test_load_taxonomy(self): |
4640 | 562 | csv_file = "data/input/create_taxonomy.csv" | 648 | csv_file = "data/input/create_taxonomy.csv" |
4646 | 563 | expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, | 649 | expected = {'Jeletzkytes_criptonodosus': {'kingdom': 'Metazoa', 'subclass': 'Cephalopoda', 'species': 'Jeletzkytes criptonodosus', 'suborder': 'Ammonoidea', 'provider': 'PBDB', 'subfamily': 'Scaphitidae', 'class': 'Mollusca'}, 'Archaeopteryx_lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta_tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'subclass': 'Neoloricata', 'species': 'Egretta tricolor', 'phylum': 'Chordata', 'suborder': 'Ischnochitonina', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus_gallus': {'kingdom': 'Animalia', 'superorder': 'Galliformes', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'class': 'Aves'}, 'Thalassarche_melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'species': 'Thalassarche melanophris', 'order': 'Procellariiformes', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'class': 'Aves'}} |
4642 | 564 | 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'subclass': 'Neoloricata', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'suborder': 'Ischnochitonina', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes', 'species': 'Egretta tricolor'}, | ||
4643 | 565 | 'Gallus gallus': {'kingdom': 'Animalia', 'infrakingdom': 'Protostomia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus gallus'}, | ||
4644 | 566 | 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'}, | ||
4645 | 567 | 'Jeletzkytes criptonodosus': {'kingdom': 'Metazoa', 'family': 'Scaphitidae', 'order': 'Ammonoidea', 'phylum': 'Mollusca', 'provider': 'PBDB', 'species': 'Jeletzkytes criptonodosus', 'class': 'Cephalopoda'}} | ||
4647 | 568 | taxonomy = load_taxonomy(csv_file) | 650 | taxonomy = load_taxonomy(csv_file) |
4648 | 569 | self.maxDiff = None | 651 | self.maxDiff = None |
4649 | 570 | 652 | ||
4650 | 571 | self.assertDictEqual(taxonomy, expected) | 653 | self.assertDictEqual(taxonomy, expected) |
4651 | 572 | 654 | ||
4652 | 655 | |||
4653 | 656 | def test_load_equivalents(self): | ||
4654 | 657 | csv_file = "data/input/equivalents.csv" | ||
4655 | 658 | expected = {'Turnix_sylvatica': [['Turnix_sylvaticus','Tetrao_sylvaticus','Tetrao_sylvatica','Turnix_sylvatica'],'yellow'], | ||
4656 | 659 | 'Xiphorhynchus_pardalotus':[['Xiphorhynchus_pardalotus'],'green'], | ||
4657 | 660 | 'Phaenicophaeus_curvirostris':[['Zanclostomus_curvirostris','Rhamphococcyx_curvirostris','Phaenicophaeus_curvirostris','Rhamphococcyx_curvirostr'],'yellow'], | ||
4658 | 661 | 'Megalapteryx_benhami':[['Megalapteryx_benhami'],'red'] | ||
4659 | 662 | } | ||
4660 | 663 | equivalents = load_equivalents(csv_file) | ||
4661 | 664 | self.assertDictEqual(equivalents, expected) | ||
4662 | 665 | |||
4663 | 666 | |||
4664 | 573 | def test_name_tree(self): | 667 | def test_name_tree(self): |
4665 | 574 | XML = etree.tostring(etree.parse('data/input/single_source_no_names.phyml',parser),pretty_print=True) | 668 | XML = etree.tostring(etree.parse('data/input/single_source_no_names.phyml',parser),pretty_print=True) |
4666 | 575 | xml_root = _parse_xml(XML) | 669 | xml_root = _parse_xml(XML) |
4667 | @@ -583,6 +677,35 @@ | |||
4668 | 583 | XML = etree.tostring(etree.parse('data/input/single_source.phyml',parser),pretty_print=True) | 677 | XML = etree.tostring(etree.parse('data/input/single_source.phyml',parser),pretty_print=True) |
4669 | 584 | self.assert_(isEqualXML(new_xml,XML)) | 678 | self.assert_(isEqualXML(new_xml,XML)) |
4670 | 585 | 679 | ||
4671 | 680 | def test_all_rename_tree(self): | ||
4672 | 681 | XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True) | ||
4673 | 682 | new_xml = set_all_tree_names(XML,overwrite=True) | ||
4674 | 683 | XML = etree.tostring(etree.parse('data/output/single_source_same_tree_name.phyml',parser),pretty_print=True) | ||
4675 | 684 | self.assert_(isEqualXML(new_xml,XML)) | ||
4676 | 685 | |||
4677 | 686 | def test_get_all_tree_names(self): | ||
4678 | 687 | XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True) | ||
4679 | 688 | names = get_all_tree_names(XML) | ||
4680 | 689 | self.assertListEqual(names,['Hill_2011_2','Hill_2011_2']) | ||
4681 | 690 | |||
4682 | 691 | |||
4683 | 692 | def internet_on(host="8.8.8.8", port=443, timeout=5): | ||
4684 | 693 | import socket | ||
4685 | 694 | |||
4686 | 695 | """ | ||
4687 | 696 | Checks for a working network connection by opening a TCP socket | ||
4688 | 697 | to host (default 8.8.8.8, Google's public DNS server) on the | ||
4689 | 698 | given port (443 by default). | ||
4690 | 699 | """ | ||
4691 | 700 | try: | ||
4692 | 701 | socket.setdefaulttimeout(timeout) | ||
4693 | 702 | socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect((host, port)) | ||
4694 | 703 | return True | ||
4695 | 704 | except Exception as ex: | ||
4696 | 705 | print ex.message | ||
4697 | 706 | return False | ||
4698 | 707 | |||
4699 | 708 | |||
4700 | 586 | 709 | ||
4701 | 587 | if __name__ == '__main__': | 710 | if __name__ == '__main__': |
4702 | 588 | unittest.main() | 711 | unittest.main() |
4703 | 589 | 712 | ||
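
The test_add_weights test above encodes the down-weighting convention for non-independent source trees: every tree in a group of identical (or subset) trees receives a weight of one over the group size, so a trio gets 1/3 each and a pair gets 1/2 each. A minimal standalone sketch of that bookkeeping, with the group contents copied from the test's expected_idents; the weights dict is illustrative and not part of the toolkit API:

    # Each group of non-independent trees shares out a total weight of 1.0
    # (as checked in test_add_weights above).
    identical_groups = [
        ['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'],
        ['Hill_Davis_2013_1', 'Hill_Davis_2013_2'],
    ]
    weights = {}
    for group in identical_groups:
        weight = 1.0 / float(len(group))
        for tree_name in group:
            weights[tree_name] = weight
    print weights
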
4704 | === modified file 'stk/test/_trees.py' | |||
4705 | --- stk/test/_trees.py 2015-03-26 09:58:58 +0000 | |||
4706 | +++ stk/test/_trees.py 2017-01-12 09:27:31 +0000 | |||
4707 | @@ -5,7 +5,7 @@ | |||
4708 | 5 | sys.path.insert(0,"../../") | 5 | sys.path.insert(0,"../../") |
4709 | 6 | from stk.supertree_toolkit import import_tree, obtain_trees, get_all_taxa, _assemble_tree_matrix, create_matrix, _delete_taxon, _sub_taxon,_tree_contains | 6 | from stk.supertree_toolkit import import_tree, obtain_trees, get_all_taxa, _assemble_tree_matrix, create_matrix, _delete_taxon, _sub_taxon,_tree_contains |
4710 | 7 | from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_taxa_from_tree, get_characters_from_tree, amalgamate_trees, _uniquify | 7 | from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_taxa_from_tree, get_characters_from_tree, amalgamate_trees, _uniquify |
4712 | 8 | from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick | 8 | from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick, _parse_tree |
4713 | 9 | from stk.supertree_toolkit import get_mrca | 9 | from stk.supertree_toolkit import get_mrca |
4714 | 10 | import os | 10 | import os |
4715 | 11 | from lxml import etree | 11 | from lxml import etree |
4716 | @@ -215,6 +215,18 @@ | |||
4717 | 215 | mrca = get_mrca(tree,["A","I", "L"]) | 215 | mrca = get_mrca(tree,["A","I", "L"]) |
4718 | 216 | self.assert_(mrca == 8) | 216 | self.assert_(mrca == 8) |
4719 | 217 | 217 | ||
4720 | 218 | def test_get_mrca(self): | ||
4721 | 219 | tree = "(B,(C,(D,(E,((A,F),((I,(G,H)),(J,(K,L))))))));" | ||
4722 | 220 | mrca = get_mrca(tree,["A","F"]) | ||
4723 | 221 | print mrca | ||
4724 | 222 | #self.assert_(mrca == 8) | ||
4725 | 223 | to = _parse_tree('(X,Y,Z,(Q,W));') | ||
4726 | 224 | treeobj = _parse_tree(tree) | ||
4727 | 225 | newnode = treeobj.addNodeBetweenNodes(10,9) | ||
4728 | 226 | treeobj.addSubTree(newnode, to, ignoreRootAssert=True) | ||
4729 | 227 | treeobj.draw() | ||
4730 | 228 | |||
4731 | 229 | |||
4732 | 218 | def test_get_all_trees(self): | 230 | def test_get_all_trees(self): |
4733 | 219 | XML = etree.tostring(etree.parse(single_source_input,parser),pretty_print=True) | 231 | XML = etree.tostring(etree.parse(single_source_input,parser),pretty_print=True) |
4734 | 220 | tree = obtain_trees(XML) | 232 | tree = obtain_trees(XML) |
4735 | 221 | 233 | ||
4736 | === added file 'stk/test/data/input/auto_sub.phyml' | |||
4737 | --- stk/test/data/input/auto_sub.phyml 1970-01-01 00:00:00 +0000 | |||
4738 | +++ stk/test/data/input/auto_sub.phyml 2017-01-12 09:27:31 +0000 | |||
4739 | @@ -0,0 +1,97 @@ | |||
4740 | 1 | <?xml version='1.0' encoding='utf-8'?> | ||
4741 | 2 | <phylo_storage> | ||
4742 | 3 | <project_name> | ||
4743 | 4 | <string_value lines="1">Test</string_value> | ||
4744 | 5 | </project_name> | ||
4745 | 6 | <sources> | ||
4746 | 7 | <source name="Hill_2011"> | ||
4747 | 8 | <bibliographic_information> | ||
4748 | 9 | <article> | ||
4749 | 10 | <authors> | ||
4750 | 11 | <author> | ||
4751 | 12 | <surname> | ||
4752 | 13 | <string_value lines="1">Hill</string_value> | ||
4753 | 14 | </surname> | ||
4754 | 15 | <other_names> | ||
4755 | 16 | <string_value lines="1">Jon</string_value> | ||
4756 | 17 | </other_names> | ||
4757 | 18 | </author> | ||
4758 | 19 | </authors> | ||
4759 | 20 | <title> | ||
4760 | 21 | <string_value lines="1">A great paper</string_value> | ||
4761 | 22 | </title> | ||
4762 | 23 | <year> | ||
4763 | 24 | <integer_value rank="0">2011</integer_value> | ||
4764 | 25 | </year> | ||
4765 | 26 | <journal> | ||
4766 | 27 | <string_value lines="1">Nature</string_value> | ||
4767 | 28 | </journal> | ||
4768 | 29 | <pages> | ||
4769 | 30 | <string_value lines="1">1-12</string_value> | ||
4770 | 31 | </pages> | ||
4771 | 32 | </article> | ||
4772 | 33 | </bibliographic_information> | ||
4773 | 34 | <source_tree name="Hill_2011_1"> | ||
4774 | 35 | <tree> | ||
4775 | 36 | <tree_string> | ||
4776 | 37 | <string_value lines="1">(Thalassarche_melanophris, Pelecaniformes, (Gallus, Gallus_varius));</string_value> | ||
4777 | 38 | </tree_string> | ||
4778 | 39 | <figure_legend> | ||
4779 | 40 | <string_value lines="1">NA</string_value> | ||
4780 | 41 | </figure_legend> | ||
4781 | 42 | <figure_number> | ||
4782 | 43 | <string_value lines="1">1</string_value> | ||
4783 | 44 | </figure_number> | ||
4784 | 45 | <page_number> | ||
4785 | 46 | <string_value lines="1">1</string_value> | ||
4786 | 47 | </page_number> | ||
4787 | 48 | <tree_inference> | ||
4788 | 49 | <optimality_criterion name="Maximum Parsimony"/> | ||
4789 | 50 | </tree_inference> | ||
4790 | 51 | <topology> | ||
4791 | 52 | <outgroup> | ||
4792 | 53 | <string_value lines="1">A</string_value> | ||
4793 | 54 | </outgroup> | ||
4794 | 55 | </topology> | ||
4795 | 56 | </tree> | ||
4796 | 57 | <taxa_data> | ||
4797 | 58 | <all_extant/> | ||
4798 | 59 | </taxa_data> | ||
4799 | 60 | <character_data> | ||
4800 | 61 | <character type="molecular" name="12S"/> | ||
4801 | 62 | </character_data> | ||
4802 | 63 | </source_tree> | ||
4803 | 64 | <source_tree name="Hill_2011_2"> | ||
4804 | 65 | <tree> | ||
4805 | 66 | <tree_string> | ||
4806 | 67 | <string_value lines="1">(Gallus_lafayetii, (Platalea_leucorodia, (Ardea_humbloti, Ardea_goliath)));</string_value> | ||
4807 | 68 | </tree_string> | ||
4808 | 69 | <figure_legend> | ||
4809 | 70 | <string_value lines="1">NA</string_value> | ||
4810 | 71 | </figure_legend> | ||
4811 | 72 | <figure_number> | ||
4812 | 73 | <string_value lines="1">1</string_value> | ||
4813 | 74 | </figure_number> | ||
4814 | 75 | <page_number> | ||
4815 | 76 | <string_value lines="1">1</string_value> | ||
4816 | 77 | </page_number> | ||
4817 | 78 | <tree_inference> | ||
4818 | 79 | <optimality_criterion name="Maximum Parsimony"/> | ||
4819 | 80 | </tree_inference> | ||
4820 | 81 | <topology> | ||
4821 | 82 | <outgroup> | ||
4822 | 83 | <string_value lines="1">A</string_value> | ||
4823 | 84 | </outgroup> | ||
4824 | 85 | </topology> | ||
4825 | 86 | </tree> | ||
4826 | 87 | <taxa_data> | ||
4827 | 88 | <all_extant/> | ||
4828 | 89 | </taxa_data> | ||
4829 | 90 | <character_data> | ||
4830 | 91 | <character type="molecular" name="12S"/> | ||
4831 | 92 | </character_data> | ||
4832 | 93 | </source_tree> | ||
4833 | 94 | </source> | ||
4834 | 95 | </sources> | ||
4835 | 96 | <history/> | ||
4836 | 97 | </phylo_storage> | ||
4837 | 0 | 98 | ||
4838 | === modified file 'stk/test/data/input/check_data_ind.phyml' | |||
4839 | --- stk/test/data/input/check_data_ind.phyml 2014-10-09 09:33:21 +0000 | |||
4840 | +++ stk/test/data/input/check_data_ind.phyml 2017-01-12 09:27:31 +0000 | |||
4841 | @@ -249,6 +249,147 @@ | |||
4842 | 249 | <character type="molecular" name="12S"/> | 249 | <character type="molecular" name="12S"/> |
4843 | 250 | </character_data> | 250 | </character_data> |
4844 | 251 | </source_tree> | 251 | </source_tree> |
4845 | 252 | <source_tree name="Hill_Davis_2011_3"> | ||
4846 | 253 | <tree> | ||
4847 | 254 | <tree_string> | ||
4848 | 255 | <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value> | ||
4849 | 256 | </tree_string> | ||
4850 | 257 | <figure_legend> | ||
4851 | 258 | <string_value lines="1">NA</string_value> | ||
4852 | 259 | </figure_legend> | ||
4853 | 260 | <figure_number> | ||
4854 | 261 | <string_value lines="1">0</string_value> | ||
4855 | 262 | </figure_number> | ||
4856 | 263 | <page_number> | ||
4857 | 264 | <string_value lines="1">0</string_value> | ||
4858 | 265 | </page_number> | ||
4859 | 266 | <tree_inference> | ||
4860 | 267 | <optimality_criterion name="Maximum Parsimony"/> | ||
4861 | 268 | </tree_inference> | ||
4862 | 269 | <topology> | ||
4863 | 270 | <outgroup> | ||
4864 | 271 | <string_value lines="1">A</string_value> | ||
4865 | 272 | </outgroup> | ||
4866 | 273 | </topology> | ||
4867 | 274 | </tree> | ||
4868 | 275 | <taxa_data> | ||
4869 | 276 | <mixed_fossil_and_extant> | ||
4870 | 277 | <taxon name="A"> | ||
4871 | 278 | <fossil/> | ||
4872 | 279 | </taxon> | ||
4873 | 280 | <taxon name="B"> | ||
4874 | 281 | <fossil/> | ||
4875 | 282 | </taxon> | ||
4876 | 283 | </mixed_fossil_and_extant> | ||
4877 | 284 | </taxa_data> | ||
4878 | 285 | <character_data> | ||
4879 | 286 | <character type="molecular" name="12S"/> | ||
4880 | 287 | </character_data> | ||
4881 | 288 | </source_tree> | ||
4882 | 289 | </source> | ||
4883 | 290 | <source name="Hill_Davis_2013"> | ||
4884 | 291 | <bibliographic_information> | ||
4885 | 292 | <article> | ||
4886 | 293 | <authors> | ||
4887 | 294 | <author> | ||
4888 | 295 | <surname> | ||
4889 | 296 | <string_value lines="1">Hill</string_value> | ||
4890 | 297 | </surname> | ||
4891 | 298 | <other_names> | ||
4892 | 299 | <string_value lines="1">Jon</string_value> | ||
4893 | 300 | </other_names> | ||
4894 | 301 | </author> | ||
4895 | 302 | <author> | ||
4896 | 303 | <surname> | ||
4897 | 304 | <string_value lines="1">Davis</string_value> | ||
4898 | 305 | </surname> | ||
4899 | 306 | <other_names> | ||
4900 | 307 | <string_value lines="1">Katie</string_value> | ||
4901 | 308 | </other_names> | ||
4902 | 309 | </author> | ||
4903 | 310 | </authors> | ||
4904 | 311 | <title> | ||
4905 | 312 | <string_value lines="1">Another superb paper</string_value> | ||
4906 | 313 | </title> | ||
4907 | 314 | <year> | ||
4908 | 315 | <integer_value rank="0">2013</integer_value> | ||
4909 | 316 | </year> | ||
4910 | 317 | </article> | ||
4911 | 318 | </bibliographic_information> | ||
4912 | 319 | <source_tree name="Hill_Davis_2013_1"> | ||
4913 | 320 | <tree> | ||
4914 | 321 | <tree_string> | ||
4915 | 322 | <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value> | ||
4916 | 323 | </tree_string> | ||
4917 | 324 | <figure_legend> | ||
4918 | 325 | <string_value lines="1">NA</string_value> | ||
4919 | 326 | </figure_legend> | ||
4920 | 327 | <figure_number> | ||
4921 | 328 | <string_value lines="1">0</string_value> | ||
4922 | 329 | </figure_number> | ||
4923 | 330 | <page_number> | ||
4924 | 331 | <string_value lines="1">0</string_value> | ||
4925 | 332 | </page_number> | ||
4926 | 333 | <tree_inference> | ||
4927 | 334 | <optimality_criterion name="Maximum Parsimony"/> | ||
4928 | 335 | </tree_inference> | ||
4929 | 336 | <topology> | ||
4930 | 337 | <outgroup> | ||
4931 | 338 | <string_value lines="1">A</string_value> | ||
4932 | 339 | </outgroup> | ||
4933 | 340 | </topology> | ||
4934 | 341 | </tree> | ||
4935 | 342 | <taxa_data> | ||
4936 | 343 | <mixed_fossil_and_extant> | ||
4937 | 344 | <taxon name="A"> | ||
4938 | 345 | <fossil/> | ||
4939 | 346 | </taxon> | ||
4940 | 347 | <taxon name="B"> | ||
4941 | 348 | <fossil/> | ||
4942 | 349 | </taxon> | ||
4943 | 350 | </mixed_fossil_and_extant> | ||
4944 | 351 | </taxa_data> | ||
4945 | 352 | <character_data> | ||
4946 | 353 | <character type="molecular" name="12S"/> | ||
4947 | 354 | </character_data> | ||
4948 | 355 | </source_tree> | ||
4949 | 356 | <source_tree name="Hill_Davis_2013_2"> | ||
4950 | 357 | <tree> | ||
4951 | 358 | <tree_string> | ||
4952 | 359 | <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value> | ||
4953 | 360 | </tree_string> | ||
4954 | 361 | <figure_legend> | ||
4955 | 362 | <string_value lines="1">NA</string_value> | ||
4956 | 363 | </figure_legend> | ||
4957 | 364 | <figure_number> | ||
4958 | 365 | <string_value lines="1">0</string_value> | ||
4959 | 366 | </figure_number> | ||
4960 | 367 | <page_number> | ||
4961 | 368 | <string_value lines="1">0</string_value> | ||
4962 | 369 | </page_number> | ||
4963 | 370 | <tree_inference> | ||
4964 | 371 | <optimality_criterion name="Maximum Parsimony"/> | ||
4965 | 372 | </tree_inference> | ||
4966 | 373 | <topology> | ||
4967 | 374 | <outgroup> | ||
4968 | 375 | <string_value lines="1">A</string_value> | ||
4969 | 376 | </outgroup> | ||
4970 | 377 | </topology> | ||
4971 | 378 | </tree> | ||
4972 | 379 | <taxa_data> | ||
4973 | 380 | <mixed_fossil_and_extant> | ||
4974 | 381 | <taxon name="A"> | ||
4975 | 382 | <fossil/> | ||
4976 | 383 | </taxon> | ||
4977 | 384 | <taxon name="B"> | ||
4978 | 385 | <fossil/> | ||
4979 | 386 | </taxon> | ||
4980 | 387 | </mixed_fossil_and_extant> | ||
4981 | 388 | </taxa_data> | ||
4982 | 389 | <character_data> | ||
4983 | 390 | <character type="molecular" name="12S"/> | ||
4984 | 391 | </character_data> | ||
4985 | 392 | </source_tree> | ||
4986 | 252 | </source> | 393 | </source> |
4987 | 253 | </sources> | 394 | </sources> |
4988 | 254 | <history/> | 395 | <history/> |
4989 | 255 | 396 | ||
4990 | === added file 'stk/test/data/input/check_taxonomy.phyml' | |||
4991 | --- stk/test/data/input/check_taxonomy.phyml 1970-01-01 00:00:00 +0000 | |||
4992 | +++ stk/test/data/input/check_taxonomy.phyml 2017-01-12 09:27:31 +0000 | |||
4993 | @@ -0,0 +1,67 @@ | |||
4994 | 1 | <?xml version='1.0' encoding='utf-8'?> | ||
4995 | 2 | <phylo_storage> | ||
4996 | 3 | <project_name> | ||
4997 | 4 | <string_value lines="1">Test</string_value> | ||
4998 | 5 | </project_name> | ||
4999 | 6 | <sources> | ||
5000 | 7 | <source name="Hill_2011"> |
The diff has been truncated for viewing.