Merge lp:~jon-hill/supertree-toolkit/sub_in_subfile into lp:supertree-toolkit

Proposed by Jon Hill
Status: Merged
Merged at revision: 281
Proposed branch: lp:~jon-hill/supertree-toolkit/sub_in_subfile
Merge into: lp:supertree-toolkit
Diff against target: 13342 lines (+11326/-781)
44 files modified
debian/control (+1/-1)
debian/rules (+1/-0)
notes.txt (+38/-0)
stk/bzr_version.py (+5/-5)
stk/p4/NexusToken.py (+1/-0)
stk/p4/NexusToken2.py (+1/-1)
stk/p4/Tree.py (+1/-9)
stk/p4/Tree_muck.py (+4/-2)
stk/scripts/check_nomenclature.py (+0/-224)
stk/scripts/check_nomenclature.py.moved (+224/-0)
stk/scripts/create_colours_itol.py (+2/-11)
stk/scripts/create_taxonomy.py (+4/-100)
stk/scripts/fill_in_with_taxonomy.py (+711/-174)
stk/scripts/plot_character_taxa_matrix.py (+83/-1)
stk/scripts/plot_tree_taxa_matrix.py (+56/-0)
stk/scripts/remove_poorly_constrained_taxa.py (+43/-20)
stk/scripts/tree_from_taxonomy.py (+142/-0)
stk/stk (+787/-34)
stk/stk_exceptions.py (+8/-0)
stk/supertree_toolkit.py (+849/-47)
stk/test/_substitute_taxa.py (+19/-1)
stk/test/_supertree_toolkit.py (+138/-15)
stk/test/_trees.py (+13/-1)
stk/test/data/input/auto_sub.phyml (+97/-0)
stk/test/data/input/check_data_ind.phyml (+141/-0)
stk/test/data/input/check_taxonomy.phyml (+67/-0)
stk/test/data/input/check_taxonomy_fixes.phyml (+378/-0)
stk/test/data/input/create_taxonomy.csv (+6/-6)
stk/test/data/input/create_taxonomy.phyml (+67/-0)
stk/test/data/input/equivalents.csv (+5/-0)
stk/test/data/input/mrca.tre (+1/-0)
stk/test/data/input/old_stk_test_data_ind.phyml (+1324/-0)
stk/test/data/input/old_stk_test_data_tax_overlap.phyml (+627/-0)
stk/test/data/input/old_stk_test_nonmonophyl_removed.phyml (+1324/-0)
stk/test/data/input/old_stk_test_species_level.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy.csv (+334/-0)
stk/test/data/input/old_stk_test_taxonomy_check_subs.dat (+26/-0)
stk/test/data/input/old_stk_test_taxonomy_checked.phyml (+1324/-0)
stk/test/data/input/old_stk_test_taxonomy_checker.csv (+336/-0)
stk/test/data/output/one_click_subs_output.phyml (+97/-0)
stk/test/util.py (+7/-0)
stk_gui/gui/gui.glade (+670/-124)
stk_gui/plugins/phyml/name_author.py (+4/-1)
stk_gui/stk_gui/interface.py (+36/-4)
To merge this branch: bzr merge lp:~jon-hill/supertree-toolkit/sub_in_subfile
Reviewer: Jon Hill
Review status: Approve
Review via email: mp+314598@code.launchpad.net

Description of the change

Adds taxonomic awareness and fixes a number of bugs.

322. By Jon Hill

removing file that shouldn't be there

323. By Jon Hill

removing file that shouldn't be there

Revision history for this message
Jon Hill (jon-hill) :
review: Approve

Preview Diff

1=== modified file 'debian/control'
2--- debian/control 2016-12-14 16:22:12 +0000
3+++ debian/control 2017-01-12 09:27:31 +0000
4@@ -9,7 +9,7 @@
5
6 Package: supertree-toolkit
7 Architecture: all
8-Depends: python-tk, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx
9+Depends: python-tk, python-simplejson, python-dxdiff, python-pygraphviz, python-lxml-dbg, python-lxml, python-gtk2, python-numpy, python-matplotlib, python-lxml, libxml2-utils, python, python-gtksourceview2, python-glade2, python-networkx, python-argcomplete
10 Recommends: python-psyco
11 Suggests:
12 Conflicts:
13
14=== modified file 'debian/rules'
15--- debian/rules 2013-10-14 12:58:59 +0000
16+++ debian/rules 2017-01-12 09:27:31 +0000
17@@ -6,5 +6,6 @@
18
19 override_dh_auto_install:
20 python setup.py install --root=debian/supertree-toolkit --install-layout=deb --install-scripts=/usr/bin
21+ argcomplete.autocomplete(parser)
22
23 override_dh_auto_build:
24
25=== added file 'notes.txt'
26--- notes.txt 1970-01-01 00:00:00 +0000
27+++ notes.txt 2017-01-12 09:27:31 +0000
28@@ -0,0 +1,38 @@
29+Ideas:
30+
31+Collect data, remove paraphyletic
32+
33+Take taxonomy (from dbs), phyml, user's knowledge (encoded as subs file) and information on synonyms (from dbs)
34+to create a master subs file that takes the data to species level
35+
36+User needs to be able to edit taxonomy - CSV file
37+
38+User needs to choose database source - preferred source.
39+
40+
41+Taxonomic name checker:
42+
43+ - use database to get synonyms and possible misspellings
44+ - GUI is a 2-column table with green, yellow, red. User fills in red (or removes it), green is fine. Yellow - drop-down list with alternatives.
45+ - Use this to generate a two column CSV file
46+ - On CLI, generate a three column CSV. Original name, new name (or blank for unknown) and a list of possibles. Warn user they *must* fill in the second column or remove the row or the taxa will be deleted.
47+
48+For colloquial names, user adds to column 1 of taxonomy csv and then adds the Latin name in the appropriate column of the database. The subs can then generate the species list.
49+
50+Use these two csv files to generate a subs file, including replacing higher taxa and genera to create a "to species" substitution (can also output this file for later)
51+
52+Generating data to any taxonomic level can happen later - need to check each species is accounted for in the taxonomy, with correct levels - may need another parse of the taxonomy csv
53+
54+
55+Add data -> paraphyletic taxa -> taxonomy checker -> sub synonyms -> taxonomy generator -> create species level dataset
56+
57+New functions:
58+ - taxonomic name checker (this might take a while when online for a large dataset) - note that this should be a one-for-one substitution - separate function so we can check this?
59+ - Pull in taxonomy generator
60+ - Add csv file to schema
61+ - amend manual with workflow
62+ - warning on multiple subs in data in manual
63+ - generate species level subsfile from taxonomy
64+ - generate specified taxonomic level data
65+
66+
67
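The checker workflow sketched in the notes ends in a CSV of original name, candidate names, and a traffic-light status. As a minimal Python 3 sketch (the taxon names and helper are illustrative; the toolkit itself is Python 2), the three-column output could be written like this:

```python
import csv
import io

# Hypothetical checker output: original name -> ([candidate names], status).
# Statuses follow the notes: green = OK, yellow = synonyms found,
# amber = multiple database hits, red = unknown (user must fix or remove).
equivalents = {
    "Gallus_gallus": (["Gallus_gallus"], "green"),
    "Corvus_coraxx": (["Corvus_corax"], "yellow"),
    "Unknownus_taxon": (["Unknownus_taxon"], "red"),
}

def write_equivalents_csv(equivalents, fileobj):
    """Write one row per taxon: original name, ';'-joined candidates, status."""
    writer = csv.writer(fileobj)
    for taxon in sorted(equivalents):
        names, status = equivalents[taxon]
        writer.writerow([taxon, ";".join(names), status])

buf = io.StringIO()
write_equivalents_csv(equivalents, buf)
print(buf.getvalue())
```

The two-column GUI file described in the notes is the same structure minus the status column.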
68=== modified file 'stk/bzr_version.py'
69--- stk/bzr_version.py 2017-01-11 17:42:56 +0000
70+++ stk/bzr_version.py 2017-01-12 09:27:31 +0000
71@@ -4,12 +4,12 @@
72 So don't edit it. :)
73 """
74
75-version_info = {'branch_nick': u'supertree-toolkit',
76- 'build_date': '2017-01-11 17:42:27 +0000',
77+version_info = {'branch_nick': u'sub_in_subfile',
78+ 'build_date': '2017-01-11 17:48:33 +0000',
79 'clean': None,
80- 'date': '2017-01-11 17:39:43 +0000',
81- 'revision_id': 'jon.hill@imperial.ac.uk-20170111173943-88so1icr33su3afo',
82- 'revno': '279'}
83+ 'date': '2017-01-11 17:48:18 +0000',
84+ 'revision_id': 'jon.hill@imperial.ac.uk-20170111174818-9q8a9octvnawruuw',
85+ 'revno': '317'}
86
87 revisions = {}
88
89
90=== modified file 'stk/p4/NexusToken.py'
91--- stk/p4/NexusToken.py 2012-01-11 08:57:43 +0000
92+++ stk/p4/NexusToken.py 2017-01-12 09:27:31 +0000
93@@ -44,6 +44,7 @@
94 gm = ["safeNextTok(), called from %s" % caller]
95 else:
96 gm = ["safeNextTok()"]
97+ print flob
98 gm.append("Premature Death.")
99 gm.append("Ran out of understandable things to read in nexus file.")
100 raise Glitch, gm
101
102=== modified file 'stk/p4/NexusToken2.py'
103--- stk/p4/NexusToken2.py 2012-01-11 08:57:43 +0000
104+++ stk/p4/NexusToken2.py 2017-01-12 09:27:31 +0000
105@@ -88,7 +88,7 @@
106 else:
107 gm = ["safeNextTok()"]
108 gm.append("Premature Death.")
109- gm.append("Ran out of understandable things to read in nexus file.")
110+ gm.append("Ran out of understandable things to read in nexus file." + str(flob))
111 raise Glitch, gm
112 else:
113 return t
114
115=== modified file 'stk/p4/Tree.py'
116--- stk/p4/Tree.py 2013-08-25 09:24:34 +0000
117+++ stk/p4/Tree.py 2017-01-12 09:27:31 +0000
118@@ -996,17 +996,9 @@
119 if not item.name:
120 if item == self.root:
121 if var.fixRootedTrees:
122- if self.name:
123- print "Tree.initFinish() tree '%s'" % self.name
124- else:
125- print 'Tree.initFinish()'
126- print "Fixing tree to work with SuperTree scores"
127+ #print "Fixing tree to work with SuperTree scores"
128 self.removeRoot()
129 elif var.warnAboutTerminalRootWithNoName:
130- if self.name:
131- print "Tree.initFinish() tree '%s'" % self.name
132- else:
133- print 'Tree.initFinish()'
134 print ' Non-fatal warning: the root is terminal, but has no name.'
135 print ' This may be what you wanted. Or not?'
136 print ' (To get rid of this warning, turn off var.warnAboutTerminalRootWithNoName)'
137
138=== modified file 'stk/p4/Tree_muck.py'
139--- stk/p4/Tree_muck.py 2015-02-19 14:47:06 +0000
140+++ stk/p4/Tree_muck.py 2017-01-12 09:27:31 +0000
141@@ -769,6 +769,7 @@
142 else:
143 gm.append("The 2 specified nodes should have a parent-child relationship")
144 raise Glitch, gm
145+
146 if var.usePfAndNumpy:
147 self.deleteCStuff()
148
149@@ -1629,7 +1630,7 @@
150
151
152
153-def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None):
154+def addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None, ignoreRootAssert=False):
155 """Add a subtree to a tree.
156
157 The nodes from theSubTree are added to self.nodes, and theSubTree
158@@ -1666,7 +1667,8 @@
159
160 assert selfNode in self.nodes
161 assert selfNode.parent
162- assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
163+ if not ignoreRootAssert:
164+ assert theSubTree.root.leftChild and not theSubTree.root.leftChild.sibling # its a root on a stick
165 if not subTreeTaxNames:
166 subTreeTaxNames = [n.name for n in theSubTree.iterLeavesNoRoot()]
167
168
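The addSubTree change above adds an ignoreRootAssert keyword so callers can skip the "root on a stick" check when grafting a subtree whose root has more than one child. A minimal sketch of that guarded-assert pattern (class and function names here are illustrative, not p4's):

```python
class Node:
    """Toy node mirroring the two attributes the assert inspects."""
    def __init__(self, left_child=None, sibling=None):
        self.leftChild = left_child
        self.sibling = sibling

def check_root_on_a_stick(root, ignore_root_assert=False):
    """The root must have exactly one child (a 'root on a stick')
    unless the caller explicitly opts out, as addSubTree now allows."""
    if not ignore_root_assert:
        assert root.leftChild is not None and root.leftChild.sibling is None
    return True
```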
169=== removed file 'stk/scripts/check_nomenclature.py'
170--- stk/scripts/check_nomenclature.py 2016-07-14 10:12:17 +0000
171+++ stk/scripts/check_nomenclature.py 1970-01-01 00:00:00 +0000
172@@ -1,224 +0,0 @@
173-#!/usr/bin/env python
174-#
175-# Derived from the Supertree Toolkit. Software for managing and manipulating sources
176-# trees ready for supretree construction.
177-# Copyright (C) 2015, Jon Hill, Katie Davis
178-#
179-# This program is free software: you can redistribute it and/or modify
180-# it under the terms of the GNU General Public License as published by
181-# the Free Software Foundation, either version 3 of the License, or
182-# (at your option) any later version.
183-#
184-# This program is distributed in the hope that it will be useful,
185-# but WITHOUT ANY WARRANTY; without even the implied warranty of
186-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
187-# GNU General Public License for more details.
188-#
189-# You should have received a copy of the GNU General Public License
190-# along with this program. If not, see <http://www.gnu.org/licenses/>.
191-#
192-# Jon Hill. jon.hill@york.ac.uk.
193-#
194-#
195-# This is an enitrely self-contained script that does not require the STK to be installed.
196-
197-import urllib2
198-from urllib import quote_plus
199-import simplejson as json
200-import argparse
201-import os
202-import sys
203-import csv
204-
205-def main():
206-
207- # do stuff
208- parser = argparse.ArgumentParser(
209- prog="Check nomenclature",
210- description="Check nomenclature from a tree file or list against valid names derived from EOL",
211- )
212- parser.add_argument(
213- '-v',
214- '--verbose',
215- action='store_true',
216- help="Verbose output: mainly progress reports.",
217- default=False
218- )
219- parser.add_argument(
220- '--existing',
221- help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
222- )
223- parser.add_argument(
224- 'input_file',
225- metavar='input_file',
226- nargs=1,
227- help="Your input taxa list"
228- )
229- parser.add_argument(
230- 'output_file',
231- metavar='output_file',
232- nargs=1,
233- help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
234- )
235-
236- args = parser.parse_args()
237- verbose = args.verbose
238- input_file = args.input_file[0]
239- output_file = args.output_file[0]
240- existing_data = args.existing
241-
242- if (not existing_data == None):
243- exiting_data = load_equivalents(existing_data)
244- else:
245- existing_data = None
246-
247- with open(input_file,'r') as f:
248- lines = f.read().splitlines()
249- equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
250-
251-
252- f = open(output_file,"w")
253- for taxon in sorted(equivs.keys()):
254- f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
255- f.close()
256-
257- return
258-
259-
260-def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
261- """ For each name in the database generate a database of the original name,
262- possible synonyms and if the taxon is not know, signal that. We do this by
263- using the EoL API to grab synonyms of each taxon. """
264-
265-
266- if existing_data == None:
267- equivalents = {}
268- else:
269- equivalents = existing_data
270-
271- # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
272- # if not, is there another API function to do this?
273- # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
274- # you get the original ('correct') name - shorten this to two words and you're done.
275- for t in name_list:
276- # make sure t has no spaces.
277- t = t.replace(" ","_")
278- if t in equivalents:
279- continue
280- taxon = t.replace("_"," ")
281- if (verbose):
282- print "Looking up ", taxon
283- # get the data from EOL on taxon
284- taxonq = quote_plus(taxon)
285- URL = "http://eol.org/api/search/1.0.json?q="+taxonq
286- req = urllib2.Request(URL)
287- opener = urllib2.build_opener()
288- f = opener.open(req)
289- data = json.load(f)
290- # check if there's some data
291- if len(data['results']) == 0:
292- equivalents[t] = [[t],'red']
293- continue
294- amber = False
295- if len(data['results']) > 1:
296- # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
297- # for automatic processing we'll just take the first one though
298- # colour is amber in this case
299- amber = True
300- ID = str(data['results'][0]['id']) # take first hit
301- URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
302- req = urllib2.Request(URL)
303- opener = urllib2.build_opener()
304-
305- try:
306- f = opener.open(req)
307- except urllib2.HTTPError:
308- equivalents[t] = [[t],'red']
309- continue
310- data = json.load(f)
311- if len(data['scientificName']) == 0:
312- # not found a scientific name, so set as red
313- equivalents[t] = [[t],'red']
314- continue
315- correct_name = data['scientificName'].encode("ascii","ignore")
316- # we only want the first two bits of the name, not the original author and year if any
317- temp_name = correct_name.split(' ')
318- if (len(temp_name) > 2):
319- correct_name = ' '.join(temp_name[0:2])
320- correct_name = correct_name.replace(' ','_')
321- print correct_name, t
322-
323- # build up the output dictionary - original name is key, synonyms/missing is value
324- if (correct_name == t or correct_name == taxon):
325- # if the original matches the 'correct', then it's green
326- equivalents[t] = [[t], 'green']
327- else:
328- # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
329- # 'correct' taxon at the top
330- eol_synonyms = data['synonyms']
331- synonyms = []
332- for s in eol_synonyms:
333- ts = s['synonym'].encode("ascii","ignore")
334- temp_syn = ts.split(' ')
335- if (len(temp_syn) > 2):
336- temp_syn = ' '.join(temp_syn[0:2])
337- ts = temp_syn
338- if (s['relationship'] == "synonym"):
339- ts = ts.replace(" ","_")
340- synonyms.append(ts)
341- synonyms = _uniquify(synonyms)
342- # we need to put the correct name at the top of the list now
343- if (correct_name in synonyms):
344- synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
345- elif len(synonyms) == 0:
346- synonyms.append(correct_name)
347- else:
348- synonyms.insert(0,correct_name)
349-
350- if (amber):
351- equivalents[t] = [synonyms,'amber']
352- else:
353- equivalents[t] = [synonyms,'yellow']
354- # if our search was empty, then it's red - see above
355-
356- # up to the calling funciton to do something sensible with this
357- # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
358- # Amber means we found synonyms and multilpe hits. User def needs to sort these!
359-
360- return equivalents
361-
362-def load_equivalents(equiv_csv):
363- """Load equivalents data from a csv and convert to a equivalents Dict.
364- Structure is key, with a list that is array of synonyms, followed by status ('green',
365- 'yellow', 'amber', or 'red').
366-
367- """
368-
369- import csv
370-
371- equivalents = {}
372-
373- with open(equiv_csv, 'rU') as csvfile:
374- equiv_reader = csv.reader(csvfile, delimiter=',')
375- equiv_reader.next() # skip header
376- for row in equiv_reader:
377- i = 1
378- equivalents[row[0]] = [row[1].split(';'),row[2]]
379-
380- return equivalents
381-
382-def _uniquify(l):
383- """
384- Make a list, l, contain only unique data
385- """
386- keys = {}
387- for e in l:
388- keys[e] = 1
389-
390- return keys.keys()
391-
392-if __name__ == "__main__":
393- main()
394-
395-
396-
397
398=== added file 'stk/scripts/check_nomenclature.py.moved'
399--- stk/scripts/check_nomenclature.py.moved 1970-01-01 00:00:00 +0000
400+++ stk/scripts/check_nomenclature.py.moved 2017-01-12 09:27:31 +0000
401@@ -0,0 +1,224 @@
402+#!/usr/bin/env python
403+#
404+# Derived from the Supertree Toolkit. Software for managing and manipulating source
405+# trees ready for supertree construction.
406+# Copyright (C) 2015, Jon Hill, Katie Davis
407+#
408+# This program is free software: you can redistribute it and/or modify
409+# it under the terms of the GNU General Public License as published by
410+# the Free Software Foundation, either version 3 of the License, or
411+# (at your option) any later version.
412+#
413+# This program is distributed in the hope that it will be useful,
414+# but WITHOUT ANY WARRANTY; without even the implied warranty of
415+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
416+# GNU General Public License for more details.
417+#
418+# You should have received a copy of the GNU General Public License
419+# along with this program. If not, see <http://www.gnu.org/licenses/>.
420+#
421+# Jon Hill. jon.hill@york.ac.uk.
422+#
423+#
424+# This is an entirely self-contained script that does not require the STK to be installed.
425+
426+import urllib2
427+from urllib import quote_plus
428+import simplejson as json
429+import argparse
430+import os
431+import sys
432+import csv
433+
434+def main():
435+
436+ # do stuff
437+ parser = argparse.ArgumentParser(
438+ prog="Check nomenclature",
439+ description="Check nomenclature from a tree file or list against valid names derived from EOL",
440+ )
441+ parser.add_argument(
442+ '-v',
443+ '--verbose',
444+ action='store_true',
445+ help="Verbose output: mainly progress reports.",
446+ default=False
447+ )
448+ parser.add_argument(
449+ '--existing',
450+ help="An existing output file to update further, e.g. with a new set of taxa. Supply the file name."
451+ )
452+ parser.add_argument(
453+ 'input_file',
454+ metavar='input_file',
455+ nargs=1,
456+ help="Your input taxa list"
457+ )
458+ parser.add_argument(
459+ 'output_file',
460+ metavar='output_file',
461+ nargs=1,
462+ help="The output file. A CSV-based output, listing name checked, valid name, synonyms and status (red, amber, yellow, green)."
463+ )
464+
465+ args = parser.parse_args()
466+ verbose = args.verbose
467+ input_file = args.input_file[0]
468+ output_file = args.output_file[0]
469+ existing_data = args.existing
470+
471+ if (not existing_data == None):
472+ existing_data = load_equivalents(existing_data)
473+ else:
474+ existing_data = None
475+
476+ with open(input_file,'r') as f:
477+ lines = f.read().splitlines()
478+ equivs = taxonomic_checker_list(lines, existing_data, verbose=verbose)
479+
480+
481+ f = open(output_file,"w")
482+ for taxon in sorted(equivs.keys()):
483+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
484+ f.close()
485+
486+ return
487+
488+
489+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
490+ """ For each name in the database generate a database of the original name,
491+ possible synonyms and if the taxon is not known, signal that. We do this by
492+ using the EoL API to grab synonyms of each taxon. """
493+
494+
495+ if existing_data == None:
496+ equivalents = {}
497+ else:
498+ equivalents = existing_data
499+
500+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
501+ # if not, is there another API function to do this?
502+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
503+ # you get the original ('correct') name - shorten this to two words and you're done.
504+ for t in name_list:
505+ # make sure t has no spaces.
506+ t = t.replace(" ","_")
507+ if t in equivalents:
508+ continue
509+ taxon = t.replace("_"," ")
510+ if (verbose):
511+ print "Looking up ", taxon
512+ # get the data from EOL on taxon
513+ taxonq = quote_plus(taxon)
514+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
515+ req = urllib2.Request(URL)
516+ opener = urllib2.build_opener()
517+ f = opener.open(req)
518+ data = json.load(f)
519+ # check if there's some data
520+ if len(data['results']) == 0:
521+ equivalents[t] = [[t],'red']
522+ continue
523+ amber = False
524+ if len(data['results']) > 1:
525+ # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
526+ # for automatic processing we'll just take the first one though
527+ # colour is amber in this case
528+ amber = True
529+ ID = str(data['results'][0]['id']) # take first hit
530+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=2&videos=0&sounds=0&maps=0&text=2&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
531+ req = urllib2.Request(URL)
532+ opener = urllib2.build_opener()
533+
534+ try:
535+ f = opener.open(req)
536+ except urllib2.HTTPError:
537+ equivalents[t] = [[t],'red']
538+ continue
539+ data = json.load(f)
540+ if len(data['scientificName']) == 0:
541+ # not found a scientific name, so set as red
542+ equivalents[t] = [[t],'red']
543+ continue
544+ correct_name = data['scientificName'].encode("ascii","ignore")
545+ # we only want the first two bits of the name, not the original author and year if any
546+ temp_name = correct_name.split(' ')
547+ if (len(temp_name) > 2):
548+ correct_name = ' '.join(temp_name[0:2])
549+ correct_name = correct_name.replace(' ','_')
550+ print correct_name, t
551+
552+ # build up the output dictionary - original name is key, synonyms/missing is value
553+ if (correct_name == t or correct_name == taxon):
554+ # if the original matches the 'correct', then it's green
555+ equivalents[t] = [[t], 'green']
556+ else:
557+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
558+ # 'correct' taxon at the top
559+ eol_synonyms = data['synonyms']
560+ synonyms = []
561+ for s in eol_synonyms:
562+ ts = s['synonym'].encode("ascii","ignore")
563+ temp_syn = ts.split(' ')
564+ if (len(temp_syn) > 2):
565+ temp_syn = ' '.join(temp_syn[0:2])
566+ ts = temp_syn
567+ if (s['relationship'] == "synonym"):
568+ ts = ts.replace(" ","_")
569+ synonyms.append(ts)
570+ synonyms = _uniquify(synonyms)
571+ # we need to put the correct name at the top of the list now
572+ if (correct_name in synonyms):
573+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
574+ elif len(synonyms) == 0:
575+ synonyms.append(correct_name)
576+ else:
577+ synonyms.insert(0,correct_name)
578+
579+ if (amber):
580+ equivalents[t] = [synonyms,'amber']
581+ else:
582+ equivalents[t] = [synonyms,'yellow']
583+ # if our search was empty, then it's red - see above
584+
585+ # up to the calling function to do something sensible with this
586+ # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow, red.
587+ # Amber means we found synonyms and multiple hits. User definitely needs to sort these!
588+
589+ return equivalents
590+
591+def load_equivalents(equiv_csv):
592+ """Load equivalents data from a csv and convert to a equivalents Dict.
593+ Structure is key, with a list that is array of synonyms, followed by status ('green',
594+ 'yellow', 'amber', or 'red').
595+
596+ """
597+
598+ import csv
599+
600+ equivalents = {}
601+
602+ with open(equiv_csv, 'rU') as csvfile:
603+ equiv_reader = csv.reader(csvfile, delimiter=',')
604+ equiv_reader.next() # skip header
605+ for row in equiv_reader:
606+ i = 1
607+ equivalents[row[0]] = [row[1].split(';'),row[2]]
608+
609+ return equivalents
610+
611+def _uniquify(l):
612+ """
613+ Make a list, l, contain only unique data
614+ """
615+ keys = {}
616+ for e in l:
617+ keys[e] = 1
618+
619+ return keys.keys()
620+
621+if __name__ == "__main__":
622+ main()
623+
624+
625+
626
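The colour logic in taxonomic_checker_list above can be distilled into a small pure function: red when the search returns nothing usable, green when the resolved name matches the original, yellow when a single hit resolves to a different name (correct name first in the list), and amber when there were multiple hits. A Python 3 sketch (the function name and signature are hypothetical; the script itself queries EOL inline):

```python
def classify(original, correct_name, synonyms, n_hits):
    """Return (candidate_names, status) for one taxon, mirroring the
    traffic-light scheme used by taxonomic_checker_list."""
    if n_hits == 0 or correct_name is None:
        return [original], "red"            # nothing found: user must fix or remove
    if correct_name == original:
        return [original], "green"          # already the accepted name
    # a different accepted name: put it first, keep any other synonyms after it
    names = [correct_name] + [s for s in synonyms if s != correct_name]
    status = "amber" if n_hits > 1 else "yellow"
    return names, status
```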
627=== modified file 'stk/scripts/create_colours_itol.py'
628--- stk/scripts/create_colours_itol.py 2014-12-09 10:58:48 +0000
629+++ stk/scripts/create_colours_itol.py 2017-01-12 09:27:31 +0000
630@@ -88,17 +88,8 @@
631 saturation=0.25
632 value=0.8
633
634- index = 3 # family
635- if (level == "Superfamily"):
636- index = 4
637- elif (level == "Infraorder"):
638- index = 5
639- elif (level == "Suborder"):
640- index = 6
641- elif (level == "Order"):
642- index = 7
643- elif (level == "Genus"):
644- index = 2
645+ index = stk.taxonomy_levels.index(level.lower())+1
646+ print index
647
648 if (tree):
649 tree_data = stk.import_tree(input_file)
650
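The create_colours_itol change above replaces a hard-coded if/elif ladder with a lookup into the shared stk.taxonomy_levels list. Assuming taxonomy_levels keeps the 16-level ordering the scripts previously hard-coded, index(level) + 1 reproduces the old constants (family 3, superfamily 4, infraorder 5, suborder 6, order 7, genus 2); a sketch:

```python
# Ordering as previously hard-coded in the scripts (species first).
taxonomy_levels = ['species', 'genus', 'family', 'superfamily', 'infraorder',
                   'suborder', 'order', 'superorder', 'subclass', 'class',
                   'subphylum', 'phylum', 'superphylum', 'infrakingdom',
                   'subkingdom', 'kingdom']

def column_index(level):
    """Column for a given taxonomic level, counting species as column 1."""
    return taxonomy_levels.index(level.lower()) + 1
```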
651=== modified file 'stk/scripts/create_taxonomy.py'
652--- stk/scripts/create_taxonomy.py 2014-03-13 18:45:05 +0000
653+++ stk/scripts/create_taxonomy.py 2017-01-12 09:27:31 +0000
654@@ -16,6 +16,8 @@
655 import supertree_toolkit as stk
656 import csv
657
658+taxonomy_levels = stk.taxonomy_levels
659+
660 def main():
661
662 # do stuff
663@@ -66,13 +68,6 @@
664 f.close()
665
666 taxonomy = {}
667- # What we get from EOL
668- current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
669- # And the extra ones from ITIS
670- extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
671- # all of them in order
672- taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
673-
674
675 for taxon in taxa:
676 taxon = taxon.replace("_"," ")
677@@ -180,99 +175,8 @@
678 continue
679
680
681- # Now create the CSV output
682- with open(output_file, 'w') as f:
683- writer = csv.writer(f)
684- writer.writerow(taxonomy_levels)
685- for t in taxonomy:
686- species = t
687- try:
688- genus = taxonomy[t]['genus']
689- except KeyError:
690- genus = "-"
691- try:
692- family = taxonomy[t]['family']
693- except KeyError:
694- family = "-"
695- try:
696- superfamily = taxonomy[t]['superfamily']
697- except KeyError:
698- superfamily = "-"
699- try:
700- infraorder = taxonomy[t]['infraorder']
701- except KeyError:
702- infraorder = "-"
703- try:
704- suborder = taxonomy[t]['suborder']
705- except KeyError:
706- suborder = "-"
707- try:
708- order = taxonomy[t]['order']
709- except KeyError:
710- order = "-"
711- try:
712- superorder = taxonomy[t]['superorder']
713- except KeyError:
714- superorder = "-"
715- try:
716- subclass = taxonomy[t]['subclass']
717- except KeyError:
718- subclass = "-"
719- try:
720- tclass = taxonomy[t]['class']
721- except KeyError:
722- tclass = "-"
723- try:
724- subphylum = taxonomy[t]['subphylum']
725- except KeyError:
726- subphylum = "-"
727- try:
728- phylum = taxonomy[t]['phylum']
729- except KeyError:
730- phylum = "-"
731- try:
732- superphylum = taxonomy[t]['superphylum']
733- except KeyError:
734- superphylum = "-"
735- try:
736- infrakingdom = taxonomy[t]['infrakingdom']
737- except:
738- infrakingdom = "-"
739- try:
740- subkingdom = taxonomy[t]['subkingdom']
741- except:
742- subkingdom = "-"
743- try:
744- kingdom = taxonomy[t]['kingdom']
745- except KeyError:
746- kingdom = "-"
747- try:
748- provider = taxonomy[t]['provider']
749- except KeyError:
750- provider = "-"
751-
752-
753- this_classification = [
754- species.encode('utf-8'),
755- genus.encode('utf-8'),
756- family.encode('utf-8'),
757- superfamily.encode('utf-8'),
758- infraorder.encode('utf-8'),
759- suborder.encode('utf-8'),
760- order.encode('utf-8'),
761- superorder.encode('utf-8'),
762- subclass.encode('utf-8'),
763- tclass.encode('utf-8'),
764- subphylum.encode('utf-8'),
765- phylum.encode('utf-8'),
766- superphylum.encode('utf-8'),
767- infrakingdom.encode('utf-8'),
768- subkingdom.encode('utf-8'),
769- kingdom.encode('utf-8'),
770- provider.encode('utf-8')]
771- writer.writerow(this_classification)
772-
773-
774+ stk.save_taxonomy(taxonomy, output_file)
775+
776 def _uniquify(l):
777 """
778 Make a list, l, contain only unique data
779
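The long per-level try/except block removed above is replaced by a single stk.save_taxonomy call. A helper like that can be written far more compactly with dict.get defaults instead of one try/except per rank. A Python 3 sketch with a trimmed level list (the real stk.save_taxonomy lives in supertree_toolkit.py and uses the full 16-level ordering):

```python
import csv
import io

# Trimmed level list for the example; the toolkit uses 16 levels.
taxonomy_levels = ['species', 'genus', 'family', 'order', 'class',
                   'phylum', 'kingdom']

def save_taxonomy(taxonomy, fileobj, levels=taxonomy_levels):
    """Write one CSV row per taxon, with '-' wherever a level is missing.
    Sketch only: mirrors what the removed try/except block produced."""
    writer = csv.writer(fileobj)
    writer.writerow(levels + ['provider'])
    for species, ranks in taxonomy.items():
        row = [species] + [ranks.get(level, '-') for level in levels[1:]]
        row.append(ranks.get('provider', '-'))
        writer.writerow(row)

buf = io.StringIO()
save_taxonomy({'Corvus corax': {'genus': 'Corvus', 'family': 'Corvidae'}}, buf)
```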
780=== modified file 'stk/scripts/fill_in_with_taxonomy.py'
781--- stk/scripts/fill_in_with_taxonomy.py 2016-12-14 16:22:12 +0000
782+++ stk/scripts/fill_in_with_taxonomy.py 2017-01-12 09:27:31 +0000
783@@ -23,21 +23,90 @@
784 from urllib import quote_plus
785 import simplejson as json
786 import argparse
787+import copy
788 import os
789 import sys
790 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
791 sys.path.insert(0, stk_path)
792 import supertree_toolkit as stk
793 import csv
794-
795-# What we get from EOL
796-current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
797-# And the extra ones from ITIS
798-extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
799-# all of them in order
800-taxonomy_levels = ['species','genus','subfamily','family','tribe','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
801-
802-def get_tree_taxa_taxonomy(taxon,wsdlObjectWoRMS):
803+from ete2 import Tree
804+import tempfile
805+import re
806+
807+taxonomy_levels = stk.taxonomy_levels
808+#tlevels = ['species','genus','family','superfamily','suborder','order','class','phylum','kingdom']
809+tlevels = ['species','genus', 'subfamily', 'family','infraorder','order','class','phylum','kingdom']
810+
811+def get_tree_taxa_taxonomy_eol(taxon):
812+
813+ taxonq = quote_plus(taxon)
814+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
815+ req = urllib2.Request(URL)
816+ opener = urllib2.build_opener()
817+ f = opener.open(req)
818+ data = json.load(f)
819+
820+ if data['results'] == []:
821+ return {}
822+ ID = str(data['results'][0]['id']) # take first hit
823+ # Now look for taxonomies
824+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
825+ req = urllib2.Request(URL)
826+ opener = urllib2.build_opener()
827+ f = opener.open(req)
828+ data = json.load(f)
829+ if len(data['taxonConcepts']) == 0:
830+ return {}
831+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
832+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
833+ # loop through and get preferred one if specified
834+ # now get taxonomy
835+ for db in data['taxonConcepts']:
836+ currentdb = db['nameAccordingTo'].lower()
837+ TID = str(db['identifier'])
838+ break
839+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
840+ req = urllib2.Request(URL)
841+ opener = urllib2.build_opener()
842+ f = opener.open(req)
843+ data = json.load(f)
844+ tax_array = {}
845+ tax_array['provider'] = currentdb
846+ for a in data['ancestors']:
847+ try:
848+ if a.has_key('taxonRank') :
849+ temp_level = a['taxonRank'].encode("ascii","ignore")
850+ if (temp_level in taxonomy_levels):
851+ # note the dump into ASCII
852+ temp_name = a['scientificName'].encode("ascii","ignore")
853+ temp_name = temp_name.split(" ")
854+ if (temp_level == 'species'):
855+ tax_array[temp_level] = "_".join(temp_name[0:2])
856+
857+ else:
858+ tax_array[temp_level] = temp_name[0]
859+ except KeyError as e:
860+ logging.exception("Key not found: taxonRank")
861+ continue
862+ try:
863+ # add taxonomy in to the taxonomy!
864+ # some issues here, so let's make sure it's OK
865+ temp_name = taxon.split(" ")
866+ if data.has_key('taxonRank') :
867+ if not data['taxonRank'].lower() == 'species':
868+ tax_array[data['taxonRank'].lower()] = temp_name[0]
869+ else:
870+ tax_array[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
871+ except KeyError as e:
872+ return tax_array
873+
874+ return tax_array
875+
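The EOL lookup above reduces to filtering a list of ancestor records by recognised rank. A minimal sketch of that filtering step, assuming an already-parsed JSON `ancestors` list; `ancestors_to_taxonomy` and the shortened `RANKS` list are illustrative, not part of the toolkit:

```python
# Illustrative subset of the ranks used above; 'tribe' etc. are left out.
RANKS = ['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']

def ancestors_to_taxonomy(ancestors, levels=RANKS):
    """Turn EOL-style ancestor records into a {rank: name} dictionary,
    joining species binomials with '_' as the script does."""
    tax = {}
    for a in ancestors:
        rank = a.get('taxonRank', '').lower()
        if rank not in levels:
            continue
        parts = a.get('scientificName', '').split()
        if not parts:
            continue
        # species keep the first two name parts; other ranks keep one
        tax[rank] = '_'.join(parts[:2]) if rank == 'species' else parts[0]
    return tax
```

Keeping this logic in a pure function makes it testable without hitting the EOL API.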
876+def get_tree_taxa_taxonomy_worms(taxon):
877+
878+ from SOAPpy import WSDL
879+ wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
880
881 taxon_data = wsdlObjectWoRMS.getAphiaRecords(taxon.replace('_',' '))
882 if taxon_data == None:
883@@ -51,6 +120,8 @@
884 classification = wsdlObjectWoRMS.getAphiaClassificationByID(taxon_id)
885 # construct array
886 tax_array = {}
887+ if (classification == ""):
888+ return {}
889 # classification is a nested dictionary, so we need to iterate down it
890 current_child = classification.child
891 while True:
892@@ -60,27 +131,252 @@
893 break
894 return tax_array
895
896-
897-
898-def get_taxonomy_worms(taxonomy, start_otu):
899+def get_tree_taxa_taxonomy_itis(taxon):
900+
901+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(taxon.replace('_',' ').strip())
902+ req = urllib2.Request(URL)
903+ opener = urllib2.build_opener()
904+ f = opener.open(req)
905+ string = unicode(f.read(),"ISO-8859-1")
906+ this_item = json.loads(string)
907+ if this_item['scientificNames'] == [None]: # not found
908+ return {}
909+ tsn = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
910+ # so call another function to get any valid names
911+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+tsn
912+ req = urllib2.Request(URL)
913+ opener = urllib2.build_opener()
914+ f = opener.open(req)
915+ string = unicode(f.read(),"ISO-8859-1")
916+ this_item = json.loads(string)
917+ if not this_item['acceptedNames'] == [None]:
918+ tsn = this_item['acceptedNames'][0]['acceptedTsn']
919+
920+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
921+ req = urllib2.Request(URL)
922+ opener = urllib2.build_opener()
923+ f = opener.open(req)
924+ string = unicode(f.read(),"ISO-8859-1")
925+ data = json.loads(string)
926+ # construct array
927+ this_taxonomy = {}
928+ for level in data['hierarchyList']:
929+ if level['rankName'].lower() in taxonomy_levels:
930+ # note the dump into ASCII
931+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
932+
933+ return this_taxonomy
934+
935+
936+
937+def get_taxonomy_eol(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
938+
939+ # this is the recursive function
940+ def get_children(taxonomy, ID, aphiaIDsDone):
941+
942+ # get data
943+ URL="http://eol.org/api/hierarchy_entries/1.0/"+str(ID)+".json?common_names=false&synonyms=false&cache_ttl="
944+ req = urllib2.Request(URL)
945+ opener = urllib2.build_opener()
946+ f = opener.open(req)
947+ string = unicode(f.read(),"ISO-8859-1")
948+ this_item = json.loads(string)
949+ if this_item == None:
950+ return taxonomy
951+ if this_item['taxonRank'].lower().strip() == 'species':
952+ # add data to taxonomy dictionary
953+ taxon = this_item['scientificName'].split()[0:2] # just the first two words
954+ taxon = " ".join(taxon[0:2])
955+ # NOTE following line means existing items are *not* updated
956+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
957+ this_taxonomy = {}
958+ for level in this_item['ancestors']:
959+ if level['taxonRank'].lower() in taxonomy_levels:
960+ # note the dump into ASCII
961+ this_taxonomy[level['taxonRank'].lower().encode("ascii","ignore")] = level['scientificName'].encode("ascii","ignore")
962+ # add species:
963+ this_taxonomy['species'] = taxon.replace(" ","_")
964+ if verbose:
965+ print "\tAdding "+taxon
966+ taxonomy[taxon] = this_taxonomy
967+ if not tmpfile == None:
968+ stk.save_taxonomy(taxonomy,tmpfile)
969+ return taxonomy
970+ else:
971+ return taxonomy
972+ all_children = []
973+ for level in this_item['children']:
974+ if not level == None:
975+ all_children.append(level['taxonID'])
976+
977+ if (len(all_children) == 0):
978+ return taxonomy
979+
980+ for child in all_children:
981+            if child in aphiaIDsDone: # we get stuck sometimes
982+ continue
983+ aphiaIDsDone.append(child)
984+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
985+ return taxonomy
986+
987+
988+ # main bit of the get_taxonomy_eol function
989+ taxonq = quote_plus(start_otu)
990+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
991+ req = urllib2.Request(URL)
992+ opener = urllib2.build_opener()
993+ f = opener.open(req)
994+ data = json.load(f)
995+ start_id = str(data['results'][0]['id']) # this is the page ID. We get the species ID next
996+ URL = "http://eol.org/api/pages/1.0/"+start_id+".json"
997+ req = urllib2.Request(URL)
998+ opener = urllib2.build_opener()
999+ f = opener.open(req)
1000+ data = json.load(f)
1001+ if len(data['taxonConcepts']) == 0:
1002+        print "Error finding your start taxon. Spelling?"
1003+ return None
1004+ start_id = data['taxonConcepts'][0]['identifier']
1005+ start_taxonomy_level = data['taxonConcepts'][0]['taxonRank'].lower()
1006+
1007+ aphiaIDsDone = []
1008+ if not skip:
1009+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1010+
1011+ return taxonomy, start_taxonomy_level
1012+
1013+
1014+
1015+def get_taxonomy_itis(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1016+ import simplejson as json
1017+
1018+ # this is the recursive function
1019+ def get_children(taxonomy, ID, aphiaIDsDone):
1020+
1021+ # get data
1022+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+ID
1023+ req = urllib2.Request(URL)
1024+ opener = urllib2.build_opener()
1025+ f = opener.open(req)
1026+ string = unicode(f.read(),"ISO-8859-1")
1027+ this_item = json.loads(string)
1028+ if this_item == None:
1029+ return taxonomy
1030+ if not this_item['usage']['taxonUsageRating'].lower() == 'valid':
1031+ print "rejecting " , this_item['scientificName']['combinedName']
1032+ return taxonomy
1033+ if this_item['taxRank']['rankName'].lower().strip() == 'species':
1034+ # add data to taxonomy dictionary
1035+ taxon = this_item['scientificName']['combinedName']
1036+ # NOTE following line means existing items are *not* updated
1037+ if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1038+ # get the taxonomy of this species
1039+ tsn = this_item["scientificName"]["tsn"]
1040+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+tsn
1041+ req = urllib2.Request(URL)
1042+ opener = urllib2.build_opener()
1043+ f = opener.open(req)
1044+ string = unicode(f.read(),"ISO-8859-1")
1045+ data = json.loads(string)
1046+ this_taxonomy = {}
1047+ for level in data['hierarchyList']:
1048+ if level['rankName'].lower() in taxonomy_levels:
1049+ # note the dump into ASCII
1050+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
1051+ if verbose:
1052+ print "\tAdding "+taxon
1053+ taxonomy[taxon] = this_taxonomy
1054+ if not tmpfile == None:
1055+ stk.save_taxonomy(taxonomy,tmpfile)
1056+ return taxonomy
1057+ else:
1058+ return taxonomy
1059+
1060+ all_children = []
1061+ URL="http://www.itis.gov/ITISWebService/jsonservice/getHierarchyDownFromTSN?tsn="+ID
1062+ req = urllib2.Request(URL)
1063+ opener = urllib2.build_opener()
1064+ f = opener.open(req)
1065+ string = unicode(f.read(),"ISO-8859-1")
1066+ this_item = json.loads(string)
1067+ if this_item == None:
1068+ return taxonomy
1069+
1070+ for level in this_item['hierarchyList']:
1071+ if not level == None:
1072+ all_children.append(level['tsn'])
1073+
1074+ if (len(all_children) == 0):
1075+ return taxonomy
1076+
1077+ for child in all_children:
1078+            if child in aphiaIDsDone: # we get stuck sometimes
1079+ continue
1080+ aphiaIDsDone.append(child)
1081+ taxonomy = get_children(taxonomy, child, aphiaIDsDone)
1082+
1083+ return taxonomy
1084+
1085+
1086+    # main bit of the get_taxonomy_itis function
1087+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(start_otu.strip())
1088+ req = urllib2.Request(URL)
1089+ opener = urllib2.build_opener()
1090+ f = opener.open(req)
1091+ string = unicode(f.read(),"ISO-8859-1")
1092+ this_item = json.loads(string)
1093+ start_id = this_item['scientificNames'][0]['tsn'] # there might be records that aren't valid - they point to the valid one though
1094+ # call it again via the ID this time to make sure we've got the right one.
1095+ # so call another function to get any valid names
1096+ URL="http://www.itis.gov/ITISWebService/jsonservice/getAcceptedNamesFromTSN?tsn="+start_id
1097+ req = urllib2.Request(URL)
1098+ opener = urllib2.build_opener()
1099+ f = opener.open(req)
1100+ string = unicode(f.read(),"ISO-8859-1")
1101+ this_item = json.loads(string)
1102+ if not this_item['acceptedNames'] == [None]:
1103+ start_id = this_item['acceptedNames'][0]['acceptedTsn']
1104+
1105+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullRecordFromTSN?tsn="+start_id
1106+ req = urllib2.Request(URL)
1107+ opener = urllib2.build_opener()
1108+ f = opener.open(req)
1109+ string = unicode(f.read(),"ISO-8859-1")
1110+ this_item = json.loads(string)
1111+ start_taxonomy_level = this_item['taxRank']['rankName'].lower()
1112+
1113+ aphiaIDsDone = []
1114+ if not skip:
1115+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1116+
1117+ return taxonomy, start_taxonomy_level
1118+
1119+
1120+
1121+
1122+def get_taxonomy_worms(taxonomy, start_otu, verbose,tmpfile=None,skip=False):
1123 """ Gets and processes a taxon from the queue to get its taxonomy."""
1124 from SOAPpy import WSDL
1125
1126 wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1127
1128 # this is the recursive function
1129- def get_children(taxonomy, ID):
1130+ def get_children(taxonomy, ID, aphiaIDsDone):
1131
1132 # get data
1133 this_item = wsdlObjectWoRMS.getAphiaRecordByID(ID)
1134 if this_item == None:
1135 return taxonomy
1136+ if not this_item['status'].lower() == 'accepted':
1137+ print "rejecting " , this_item.valid_name
1138+ return taxonomy
1139 if this_item['rank'].lower() == 'species':
1140 # add data to taxonomy dictionary
1141- # get the taxonomy of this species
1142- classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1143- taxon = this_item.scientificname
1144+ taxon = this_item.valid_name
1145+ # NOTE following line means existing items are *not* updated
1146 if not taxon in taxonomy: # is a new taxon, not previously in the taxonomy
1147+ # get the taxonomy of this species
1148+ classification = wsdlObjectWoRMS.getAphiaClassificationByID(ID)
1149 # construct array
1150 tax_array = {}
1151 # classification is a nested dictionary, so we need to iterate down it
1152@@ -92,16 +388,36 @@
1153 current_child = current_child.child
1154 if current_child == '': # empty one is a string for some reason
1155 break
1156- taxonomy[this_item.scientificname] = tax_array
1157+ if verbose:
1158+                    print "\tAdding "+this_item.valid_name
1159+ taxonomy[this_item.valid_name] = tax_array
1160+ if not tmpfile == None:
1161+ stk.save_taxonomy(taxonomy,tmpfile)
1162 return taxonomy
1163 else:
1164 return taxonomy
1165
1166- children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, 1, False)
1167-
1168- for child in children:
1169- taxonomy = get_children(taxonomy, child['valid_AphiaID'])
1170-
1171+ all_children = []
1172+ start = 1
1173+ while True:
1174+ children = wsdlObjectWoRMS.getAphiaChildrenByID(ID, start, False)
1175+ if (children is None or children == None):
1176+ break
1177+ if (len(children) < 50):
1178+ all_children.extend(children)
1179+ break
1180+ all_children.extend(children)
1181+ start += 50
1182+
1183+ if (len(all_children) == 0):
1184+ return taxonomy
1185+
1186+ for child in all_children:
1187+            if child['valid_AphiaID'] in aphiaIDsDone: # we get stuck sometimes
1188+ continue
1189+ aphiaIDsDone.append(child['valid_AphiaID'])
1190+ taxonomy = get_children(taxonomy, child['valid_AphiaID'], aphiaIDsDone)
1191+
1192 return taxonomy
1193
1194
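The paging loop added to the WoRMS `get_children` (fetch batches of up to 50 children until a short batch signals the end) is a general pattern. A sketch with the API call stubbed out as an injected `get_page` callable; the names are hypothetical, not the WoRMS SOAP API:

```python
def fetch_all_children(get_page, record_id, page_size=50):
    """Collect every child record from a paged API.

    `get_page(record_id, start)` stands in for a call such as
    getAphiaChildrenByID: it returns a list of at most `page_size` records
    starting at 1-based offset `start`, or an empty list/None when exhausted.
    """
    all_children = []
    start = 1
    while True:
        page = get_page(record_id, start)
        if not page:
            break
        all_children.extend(page)
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        start += page_size
    return all_children
```

Injecting the fetcher keeps the loop testable offline and reusable across the ITIS/EOL branches.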
1195@@ -111,12 +427,17 @@
1196 start_id = start_taxa[0]['valid_AphiaID'] # there might be records that aren't valid - they point to the valid one though
1197 # call it again via the ID this time to make sure we've got the right one.
1198 start_taxa = wsdlObjectWoRMS.getAphiaRecordByID(start_id)
1199- start_taxonomy_level = start_taxa['rank'].lower()
1200- except HTTPError:
1201- print "Error"
1202+ if start_taxa == None:
1203+ start_taxonomy_level = 'infraorder'
1204+ else:
1205+ start_taxonomy_level = start_taxa['rank'].lower()
1206+ except urllib2.HTTPError:
1207+ print "Error finding start_otu taxonomic level. Do you have an internet connection?"
1208 sys.exit(-1)
1209
1210- taxonomy = get_children(taxonomy,start_id)
1211+ aphiaIDsDone = []
1212+ if not skip:
1213+ taxonomy = get_children(taxonomy,start_id,aphiaIDsDone)
1214
1215 return taxonomy, start_taxonomy_level
1216
1217@@ -136,9 +457,16 @@
1218 default=False
1219 )
1220 parser.add_argument(
1221+ '-s',
1222+ '--skip',
1223+ action='store_true',
1224+ help="Skip online checking, just use taxonomy files",
1225+ default=False
1226+ )
1227+ parser.add_argument(
1228 '--pref_db',
1229 help="Taxonomy database to use. Default is Species 2000/ITIS",
1230- choices=['itis', 'worms', 'ncbi'],
1231+ choices=['itis', 'worms', 'ncbi', 'eol'],
1232 default = 'worms'
1233 )
1234 parser.add_argument(
1235@@ -178,58 +506,250 @@
1236 top_level = args.top_level[0]
1237 save_taxonomy_file = args.save_taxonomy
1238 tree_taxonomy = args.tree_taxonomy
1239+ taxonomy = args.taxonomy_from_file
1240 pref_db = args.pref_db
1241+ skip = args.skip
1242 if (save_taxonomy_file == None):
1243 save_taxonomy = False
1244 else:
1245 save_taxonomy = True
1246+ load_tree_taxonomy = False
1247+ if (not tree_taxonomy == None):
1248+ tree_taxonomy_file = tree_taxonomy
1249+ load_tree_taxonomy = True
1250+ if skip:
1251+ if taxonomy == None:
1252+ print "Error: If you're skipping checking online, then you need to supply taxonomy files"
1253+ return
1254
1255 # grab taxa in tree
1256 tree = stk.import_tree(input_file)
1257 taxa_list = stk._getTaxaFromNewick(tree)
1258-
1259- taxonomy = {}
1260-
1261- # we're going to add the taxa in the tree to the taxonomy, to stop them
1262+
1263+ if verbose:
1264+ print "Taxa count for input tree: ", len(taxa_list)
1265+
1266+ # load in any taxonomy files - we still call the APIs as a) they may have updated data and
1267+ # b) the user may have missed some first time round (i.e. expanded the tree and not redone
1268+    # the taxonomy)
1269+ if (taxonomy == None):
1270+ taxonomy = {}
1271+ else:
1272+ taxonomy = stk.load_taxonomy(taxonomy)
1273+ tree_taxonomy = {}
1274+ # this might also have tree_taxonomy in too - let's check this
1275+ for t in taxa_list:
1276+ if t in taxonomy:
1277+ tree_taxonomy[t] = taxonomy[t]
1278+ elif t.replace("_"," ") in taxonomy:
1279+ tree_taxonomy[t] = taxonomy[t.replace("_"," ")]
1280+
1281+ if (load_tree_taxonomy): # overwrite the good work above...
1282+ tree_taxonomy = stk.load_taxonomy(tree_taxonomy_file)
1283+ if (tree_taxonomy == None):
1284+ tree_taxonomy = {}
1285+
1286+    # we're going to add the taxa in the tree to the main taxonomy, to stop them
1287 # being fetched in first place. We delete them later
1288+ # If you've loaded a taxonomy created by this script, this overwrites the tree taxa in the main taxonomy dict
1289+ # Don't worry, we put them back in before saving again!
1290 for taxon in taxa_list:
1291 taxon = taxon.replace('_',' ')
1292- taxonomy[taxon] = []
1293-
1294+ taxonomy[taxon] = {}
1295
1296 if (pref_db == 'itis'):
1297 # get taxonomy info from itis
1298- print "Sorry, ITIS is not implemented yet"
1299- pass
1300+ if (verbose):
1301+ print "Getting data from ITIS"
1302+ if (verbose):
1303+ print "Dealing with taxa in tree"
1304+ for t in taxa_list:
1305+ if verbose:
1306+ print "\t"+t
1307+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1308+ # we don't have data - NOTE we assume things are *not* updated here if we do
1309+ tree_taxonomy[t] = get_tree_taxa_taxonomy_itis(t)
1310+
1311+ if save_taxonomy:
1312+ if (verbose):
1313+ print "Saving tree taxonomy"
1314+ # note -temporary save as we overwrite this file later.
1315+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1316+
1317+        # get taxonomy from ITIS
1318+ if verbose:
1319+ print "Now dealing with all other taxa - this might take a while..."
1320+ # create a temp file so we can checkpoint and continue
1321+ tmpf, tmpfile = tempfile.mkstemp()
1322+
1323+ if os.path.isfile('.fit_lock'):
1324+ f = open('.fit_lock','r')
1325+ tf = f.read()
1326+ f.close()
1327+ if os.path.isfile(tf.strip()):
1328+ taxonomy = stk.load_taxonomy(tf.strip())
1329+ os.remove('.fit_lock')
1330+
1331+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1332+ # where we left off.
1333+ with open(".fit_lock", 'w') as f:
1334+ f.write(tmpfile)
1335+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1336+ taxonomy, start_level = get_taxonomy_itis(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1337+
1338+ # clean up
1339+ os.close(tmpf)
1340+ os.remove('.fit_lock')
1341+ try:
1342+            os.remove(tmpfile)
1343+ except OSError:
1344+ pass
1345 elif (pref_db == 'worms'):
1346+ if (verbose):
1347+ print "Getting data from WoRMS"
1348 # get tree taxonomy from worms
1349- if (tree_taxonomy == None):
1350- tree_taxonomy = {}
1351- for t in taxa_list:
1352- from SOAPpy import WSDL
1353- wsdlObjectWoRMS = WSDL.Proxy('http://www.marinespecies.org/aphia.php?p=soap&wsdl=1')
1354- tree_taxonomy[t] = get_tree_taxa_taxonomy(t,wsdlObjectWoRMS)
1355- else:
1356- tree_taxonomy = stk.load_taxonomy(tree_taxonomy)
1357+ if (verbose):
1358+ print "Dealing with taxa in tree"
1359+
1360+ for t in taxa_list:
1361+ if verbose:
1362+ print "\t"+t
1363+ if not(t in tree_taxonomy or t.replace("_"," ") in tree_taxonomy):
1364+ # we don't have data - NOTE we assume things are *not* updated here if we do
1365+ tree_taxonomy[t] = get_tree_taxa_taxonomy_worms(t)
1366+
1367+ if save_taxonomy:
1368+ if (verbose):
1369+ print "Saving tree taxonomy"
1370+ # note -temporary save as we overwrite this file later.
1371+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1372+
1373 # get taxonomy from worms
1374- taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level)
1375+ if verbose:
1376+ print "Now dealing with all other taxa - this might take a while..."
1377+ # create a temp file so we can checkpoint and continue
1378+ tmpf, tmpfile = tempfile.mkstemp()
1379+
1380+ if os.path.isfile('.fit_lock'):
1381+ f = open('.fit_lock','r')
1382+ tf = f.read()
1383+ f.close()
1384+ if os.path.isfile(tf.strip()):
1385+ taxonomy = stk.load_taxonomy(tf.strip())
1386+ os.remove('.fit_lock')
1387+
1388+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1389+ # where we left off.
1390+ with open(".fit_lock", 'w') as f:
1391+ f.write(tmpfile)
1392+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1393+ taxonomy, start_level = get_taxonomy_worms(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1394+
1395+ # clean up
1396+ os.close(tmpf)
1397+ os.remove('.fit_lock')
1398+ try:
1399+            os.remove(tmpfile)
1400+ except OSError:
1401+ pass
1402
1403 elif (pref_db == 'ncbi'):
1404 # get taxonomy from ncbi
1405 print "Sorry, NCBI is not implemented yet"
1406 pass
1407+ elif (pref_db == 'eol'):
1408+ if (verbose):
1409+ print "Getting data from EOL"
1410+        # get tree taxonomy from EOL
1411+ if (verbose):
1412+ print "Dealing with taxa in tree"
1413+ for t in taxa_list:
1414+ if verbose:
1415+ print "\t"+t
1416+ try:
1417+ tree_taxonomy[t]
1418+ pass # we have data - NOTE we assume things are *not* updated here...
1419+ except KeyError:
1420+ try:
1421+ tree_taxonomy[t.replace('_',' ')]
1422+ except KeyError:
1423+ tree_taxonomy[t] = get_tree_taxa_taxonomy_eol(t)
1424+
1425+ if save_taxonomy:
1426+ if (verbose):
1427+ print "Saving tree taxonomy"
1428+ # note -temporary save as we overwrite this file later.
1429+ stk.save_taxonomy(tree_taxonomy,save_taxonomy_file+'_tree.csv')
1430+
1431+        # get taxonomy from EOL
1432+ if verbose:
1433+ print "Now dealing with all other taxa - this might take a while..."
1434+ # create a temp file so we can checkpoint and continue
1435+ tmpf, tmpfile = tempfile.mkstemp()
1436+
1437+ if os.path.isfile('.fit_lock'):
1438+ f = open('.fit_lock','r')
1439+ tf = f.read()
1440+ f.close()
1441+ if os.path.isfile(tf.strip()):
1442+ taxonomy = stk.load_taxonomy(tf.strip())
1443+ os.remove('.fit_lock')
1444+
1445+ # create lock file - if this is here, then we load from the file in the lock file (or try to) and continue
1446+ # where we left off.
1447+ with open(".fit_lock", 'w') as f:
1448+ f.write(tmpfile)
1449+ # bit naughty with tmpfile - we're using the filename rather than handle to write to it. Have to for write_taxonomy function
1450+ taxonomy, start_level = get_taxonomy_eol(taxonomy,top_level,verbose,tmpfile=tmpfile,skip=skip) # this skips ones already there
1451+
1452+ # clean up
1453+ os.close(tmpf)
1454+ os.remove('.fit_lock')
1455+ try:
1456+            os.remove(tmpfile)
1457+ except OSError:
1458+ pass
1459 else:
1460- print "ERROR: Didn't understand you database choice"
1461+ print "ERROR: Didn't understand your database choice"
1462 sys.exit(-1)
1463
1464 # clean up taxonomy, deleting the ones already in the tree
1465 for taxon in taxa_list:
1466- taxon = taxon.replace('_',' ')
1467- del taxonomy[taxon]
1468+ taxon = taxon.replace('_',' ')
1469+ try:
1470+ del taxonomy[taxon]
1471+ except KeyError:
1472+            pass # it's not there, so nothing to delete
1473+
1474+ # We now have 2 taxonomies:
1475+ # - for taxa in the tree
1476+ # - for all other taxa in the clade of interest
1477+
1478+ if save_taxonomy:
1479+ tot_taxonomy = taxonomy.copy()
1480+ tot_taxonomy.update(tree_taxonomy)
1481+ stk.save_taxonomy(tot_taxonomy,save_taxonomy_file)
1482+
1483+
1484+ orig_taxa_list = taxa_list
1485+
1486+ remove_higher_level = [] # for storing the higher level taxa in the original tree that need deleting
1487+ generic = []
1488+    # find all the genus-level (generic) taxa and build an internal subs file
1489+ for t in taxa_list:
1490+ t = t.replace(" ","_")
1491+ if t.find("_") == -1:
1492+ # no underscore, so just generic
1493+ generic.append(t)
1494
1495 # step up the taxonomy levels from genus, adding taxa to the correct node
1496 # as a polytomy
1497- for level in taxonomy_levels[1::]: # skip species....
1498+ start_level = start_level.encode('utf-8').strip()
1499+ if verbose:
1500+ print "I think your start OTU is at: ", start_level
1501+ for level in tlevels[1::]: # skip species....
1502+ if verbose:
1503+ print "Dealing with ",level
1504 new_taxa = []
1505 for t in taxonomy:
1506 # skip odd ones that should be in there
1507@@ -239,135 +759,61 @@
1508 except KeyError:
1509 continue # don't have this info
1510 new_taxa = _uniquify(new_taxa)
1511+
1512 for nt in new_taxa:
1513- taxa_to_add = []
1514+ taxa_to_add = {}
1515 taxa_in_clade = []
1516 for t in taxonomy:
1517 if start_level in taxonomy[t] and taxonomy[t][start_level] == top_level:
1518 try:
1519- if taxonomy[t][level] == nt:
1520- taxa_to_add.append(t.replace(' ','_'))
1521+ if taxonomy[t][level] == nt and not t in taxa_list:
1522+ taxa_to_add[t] = taxonomy[t]
1523 except KeyError:
1524 continue
1525+
1526 # add to tree
1527 for t in taxa_list:
1528 if level in tree_taxonomy[t] and tree_taxonomy[t][level] == nt:
1529 taxa_in_clade.append(t)
1530- if len(taxa_in_clade) > 0:
1531- tree = add_taxa(tree, taxa_to_add, taxa_in_clade)
1532- for t in taxa_to_add: # clean up taxonomy
1533- del taxonomy[t.replace('_',' ')]
1534-
1535-
1536+ if t in generic:
1537+ # we are appending taxa to this higher taxon, so we need to remove it
1538+ remove_higher_level.append(t)
1539+
1540+
1541+ if len(taxa_in_clade) > 0 and len(taxa_to_add) > 0:
1542+ tree = add_taxa(tree, taxa_to_add, taxa_in_clade,level)
1543+ try:
1544+ taxa_list = stk._getTaxaFromNewick(tree)
1545+ except stk.TreeParseError as e:
1546+ print taxa_to_add, taxa_in_clade, level, tree
1547+ print e.msg
1548+ return
1549+
1550+ for t in taxa_to_add:
1551+ tree_taxonomy[t.replace(' ','_')] = taxa_to_add[t]
1552+ try:
1553+ del taxonomy[t.replace('_',' ')]
1554+ except KeyError:
1555+ # It might have _ or it might not...
1556+ del taxonomy[t]
1557+
1558+
1559+    # remove singleton nodes
1560+ tree = stk._collapse_nodes(tree)
1561+ tree = stk._collapse_nodes(tree)
1562+ tree = stk._collapse_nodes(tree)
1563+
1564+ tree = stk._sub_taxa_in_tree(tree, remove_higher_level)
1565 trees = {}
1566 trees['tree_1'] = tree
1567 output = stk._amalgamate_trees(trees,format='nexus')
1568 f = open(output_file, "w")
1569 f.write(output)
1570 f.close()
1571-
1572- if not save_taxonomy_file == None:
1573- with open(save_taxonomy_file, 'w') as f:
1574- writer = csv.writer(f)
1575- headers = []
1576- headers.append("OTU")
1577- headers.extend(taxonomy_levels)
1578- headers.append("Data source")
1579- writer.writerow(headers)
1580- for t in taxonomy:
1581- otu = t
1582- try:
1583- species = taxonomy[t]['species']
1584- except KeyError:
1585- species = "-"
1586- try:
1587- genus = taxonomy[t]['genus']
1588- except KeyError:
1589- genus = "-"
1590- try:
1591- family = taxonomy[t]['family']
1592- except KeyError:
1593- family = "-"
1594- try:
1595- superfamily = taxonomy[t]['superfamily']
1596- except KeyError:
1597- superfamily = "-"
1598- try:
1599- infraorder = taxonomy[t]['infraorder']
1600- except KeyError:
1601- infraorder = "-"
1602- try:
1603- suborder = taxonomy[t]['suborder']
1604- except KeyError:
1605- suborder = "-"
1606- try:
1607- order = taxonomy[t]['order']
1608- except KeyError:
1609- order = "-"
1610- try:
1611- superorder = taxonomy[t]['superorder']
1612- except KeyError:
1613- superorder = "-"
1614- try:
1615- subclass = taxonomy[t]['subclass']
1616- except KeyError:
1617- subclass = "-"
1618- try:
1619- tclass = taxonomy[t]['class']
1620- except KeyError:
1621- tclass = "-"
1622- try:
1623- subphylum = taxonomy[t]['subphylum']
1624- except KeyError:
1625- subphylum = "-"
1626- try:
1627- phylum = taxonomy[t]['phylum']
1628- except KeyError:
1629- phylum = "-"
1630- try:
1631- superphylum = taxonomy[t]['superphylum']
1632- except KeyError:
1633- superphylum = "-"
1634- try:
1635- infrakingdom = taxonomy[t]['infrakingdom']
1636- except:
1637- infrakingdom = "-"
1638- try:
1639- subkingdom = taxonomy[t]['subkingdom']
1640- except:
1641- subkingdom = "-"
1642- try:
1643- kingdom = taxonomy[t]['kingdom']
1644- except KeyError:
1645- kingdom = "-"
1646- try:
1647- provider = taxonomy[t]['provider']
1648- except KeyError:
1649- provider = "-"
1650-
1651- if (isinstance(species, list)):
1652- species = " ".join(species)
1653- this_classification = [
1654- otu.encode('utf-8'),
1655- species.encode('utf-8'),
1656- genus.encode('utf-8'),
1657- family.encode('utf-8'),
1658- superfamily.encode('utf-8'),
1659- infraorder.encode('utf-8'),
1660- suborder.encode('utf-8'),
1661- order.encode('utf-8'),
1662- superorder.encode('utf-8'),
1663- subclass.encode('utf-8'),
1664- tclass.encode('utf-8'),
1665- subphylum.encode('utf-8'),
1666- phylum.encode('utf-8'),
1667- superphylum.encode('utf-8'),
1668- infrakingdom.encode('utf-8'),
1669- subkingdom.encode('utf-8'),
1670- kingdom.encode('utf-8'),
1671- provider.encode('utf-8')]
1672- writer.writerow(this_classification)
1673-
1674+ taxa_list = stk._getTaxaFromNewick(tree)
1675+
1676+ print "Final taxa count:", len(taxa_list)
1677+
1678
1679 def _uniquify(l):
1680 """
1681@@ -379,28 +825,119 @@
1682
1683 return keys.keys()
1684
1685-def add_taxa(tree, new_taxa, taxa_in_clade):
1686+def add_taxa(tree, new_taxa, taxa_in_clade, level):
1687
1688 # create new tree of the new taxa
1689- #tree_string = "(" + ",".join(new_taxa) + ");"
1690- #additionalTaxa = stk._parse_tree(tree_string)
1691+ additionalTaxa = tree_from_taxonomy(level,new_taxa)
1692
1693 # find mrca parent
1694 treeobj = stk._parse_tree(tree)
1695 mrca = stk.get_mrca(tree,taxa_in_clade)
1696- mrca_parent = treeobj.node(mrca).parent
1697-
1698- # insert a node into the tree between the MRCA and it's parent (p4.addNodeBetweenNodes)
1699- newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1700-
1701- # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1702- #treeobj.addSubTree(newNode, additionalTaxa)
1703- for t in new_taxa:
1704- treeobj.addSibLeaf(newNode,t)
1705-
1706- # return new tree
1707+ if (mrca == 0):
1708+ # we need to make a new tree! The additional taxa are being placed at the root of the tree
1709+ t = Tree()
1710+ A = t.add_child()
1711+ B = t.add_child()
1712+ t1 = Tree(additionalTaxa)
1713+ t2 = Tree(tree)
1714+ A.add_child(t1)
1715+ B.add_child(t2)
1716+ return t.write(format=9)
1717+ else:
1718+ mrca = treeobj.nodes[mrca]
1719+ additionalTaxa = stk._parse_tree(additionalTaxa)
1720+
1721+ if len(taxa_in_clade) == 1:
1722+ taxon = treeobj.node(taxa_in_clade[0])
1723+ mrca = treeobj.addNodeBetweenNodes(taxon,mrca)
1724+
1725+
1726+        # insert a node into the tree between the MRCA and its parent (p4.addNodeBetweenNodes)
1727+ # newNode = treeobj.addNodeBetweenNodes(mrca, mrca_parent)
1728+
1729+ # add the new tree at the new node using p4.addSubTree(self, selfNode, theSubTree, subTreeTaxNames=None)
1730+ treeobj.addSubTree(mrca, additionalTaxa, ignoreRootAssert=True)
1731+
1732 return treeobj.writeNewick(fName=None,toString=True).strip()
1733
1734+
1735+
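When `get_mrca` returns the root (`mrca == 0`), `add_taxa` above builds a new root with ete2 and hangs both trees beneath it. At the Newick-string level the same join can be sketched without ete2; `join_at_root` is a hypothetical helper that ignores branch lengths, support values and root labels:

```python
def join_at_root(newick_a, newick_b):
    """Join two rooted Newick trees as the two children of a new root.

    String-level simplification: assumes plain topologies with no branch
    lengths or internal labels.
    """
    a = newick_a.strip().rstrip(';')
    b = newick_b.strip().rstrip(';')
    return '(%s,%s);' % (a, b)
```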
1736+def tree_from_taxonomy(top_level, tree_taxonomy):
1737+
1738+ start_level = taxonomy_levels.index(top_level)
1739+ new_taxa = tree_taxonomy.keys()
1740+
1741+ tl_types = []
1742+ for tt in tree_taxonomy:
1743+ tl_types.append(tree_taxonomy[tt][top_level])
1744+
1745+ tl_types = _uniquify(tl_types)
1746+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
1747+
1748+ t = Tree()
1749+ nodes = {}
1750+ nodes[top_level] = []
1751+ for tl in tl_types:
1752+ n = t.add_child(name=tl)
1753+ nodes[top_level].append({tl:n})
1754+
1755+ for l in levels_to_worry_about[-2::-1]:
1756+ names = []
1757+ nodes[l] = []
1758+ ci = levels_to_worry_about.index(l)
1759+ for tt in tree_taxonomy:
1760+ try:
1761+ names.append(tree_taxonomy[tt][l])
1762+ except KeyError:
1763+ pass
1764+ names = _uniquify(names)
1765+ for n in names:
1766+ # find my parent
1767+ parent = None
1768+ for tt in tree_taxonomy:
1769+ try:
1770+ if tree_taxonomy[tt][l] == n:
1771+ try:
1772+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
1773+ level = ci+1
1774+ except KeyError:
1775+ try:
1776+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+2]]
1777+ level = ci+2
1778+ except KeyError:
1779+ try:
1780+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+3]]
1781+ level = ci+3
1782+ except KeyError:
1783+ print "ERROR: tried to find some taxonomic info for "+tt+" from the tree_taxonomy file/downloaded data and I went three levels up, but failed to find any. Looked at:\n"
1784+ print "\t"+levels_to_worry_about[ci+1]
1785+ print "\t"+levels_to_worry_about[ci+2]
1786+ print "\t"+levels_to_worry_about[ci+3]
1787+ print "This is the taxonomy info I have for "+tt
1788+ print tree_taxonomy[tt]
1789+ sys.exit(1)
1790+
1791+ k = []
1792+ for nd in nodes[levels_to_worry_about[level]]:
1793+ k.extend(nd.keys())
1794+ i = 0
1795+ for kk in k:
1796+ if kk == parent:
1797+ break
1798+ i += 1
1799+ parent_id = i
1800+ break
1801+ except KeyError:
1802+ pass # no data at this level for this beastie
1803+ # find out where to attach it
1804+ node_id = nodes[levels_to_worry_about[level]][parent_id][parent]
1805+ nd = node_id.add_child(name=n.replace(" ","_"))
1806+ nodes[l].append({n:nd})
1807+
1808+ tree = t.write(format=9)
1809+
1810+ return tree
1811+
1812 if __name__ == "__main__":
1813 main()
1814
1815
1816=== modified file 'stk/scripts/plot_character_taxa_matrix.py'
1817--- stk/scripts/plot_character_taxa_matrix.py 2014-12-10 08:55:43 +0000
1818+++ stk/scripts/plot_character_taxa_matrix.py 2017-01-12 09:27:31 +0000
1819@@ -42,6 +42,18 @@
1820 default=False
1821 )
1822 parser.add_argument(
1823+ '-t',
1824+ '--taxonomy',
1825+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1826+ )
1827+ parser.add_argument(
1828+ '--level',
1829+ choices=['family','superfamily','infraorder','suborder','order'],
1830+ default='family',
1831+ help="""What level to group the taxonomy at. Default is family.
1832+ Note that data for a particular level may be missing from the taxonomy."""
1833+ )
1834+ parser.add_argument(
1835 'input_file',
1836 metavar='input_file',
1837 nargs=1,
1838@@ -59,14 +71,58 @@
1839 verbose = args.verbose
1840 input_file = args.input_file[0]
1841 output_file = args.output_file[0]
1842+ taxonomy = args.taxonomy
1843+ level = args.level
1844
1845 XML = stk.load_phyml(input_file)
1846+ if not taxonomy == None:
1847+ taxonomy = stk.load_taxonomy(taxonomy)
1848+
1849 all_taxa = stk.get_all_taxa(XML)
1850 all_chars_d = stk.get_all_characters(XML)
1851 all_chars = []
1852 for c in all_chars_d:
1853 all_chars.extend(all_chars_d[c])
1854
1855+ if not taxonomy == None:
1856+ tax_data = {}
1857+ new_all_taxa = []
1858+ for t in all_taxa:
1859+ taxon = t.replace("_"," ")
1860+ try:
1861+ if taxonomy[taxon][level] == "":
1862+ # skip this
1863+ continue
1864+ tax_data[t] = taxonomy[taxon][level]
1865+ except KeyError:
1866+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1867+ tax_data[t] = 'zzzzz' # it's at the end...
1868+
1869+ from sets import Set
1870+ unique = set(tax_data.values())
1871+ unique = list(unique)
1872+ unique.sort()
1873+ print "Groups are:"
1874+ print unique
1875+ counts = []
1876+ for u in unique:
1877+ count = 0
1878+ for t in tax_data:
1879+ if tax_data[t] == u:
1880+ count += 1
1881+ new_all_taxa.append(t)
1882+ counts.append(count)
1883+
1884+ all_taxa = new_all_taxa
1885+ # cumulate counts
1886+ count_cumulate = []
1887+ count_cumulate.append(counts[0])
1888+ for c in counts[1::]:
1889+ count_cumulate.append(c+count_cumulate[-1])
1890+
1891+ print count_cumulate
1892+
1893+
1894 taxa_character_matrix = {}
1895 for t in all_taxa:
1896 taxa_character_matrix[t] = []
1897@@ -77,7 +133,8 @@
1898 taxa = stk.get_taxa_from_tree(XML,t, sort=True)
1899 for taxon in taxa:
1900 taxon = taxon.replace(" ","_")
1901- taxa_character_matrix[taxon].extend(chars)
1902+ if taxon in all_taxa:
1903+ taxa_character_matrix[taxon].extend(chars)
1904
1905 for t in taxa_character_matrix:
1906 array = taxa_character_matrix[t]
1907@@ -92,6 +149,31 @@
1908 x.append(i)
1909 y.append(j)
1910
1911+
1912+ i = 0
1913+ for j in all_chars:
1914+ # do a substitution of character names to tidy things up
1915+ if j.lower().startswith('mitochondrial carrier; adenine nucleotide translocator'):
1916+ j = "ANT"
1917+ if j.lower().startswith('mitochondrially encoded 12s'):
1918+ j = '12S'
1919+ if j.lower().startswith('complete mitochondrial genome'):
1920+ j = 'Mitogenome'
1921+ if j.lower().startswith('mtdna'):
1922+ j = "mtDNA restriction sites"
1923+ if j.lower().startswith('h3 histone'):
1924+ j = 'H3'
1925+ if j.lower().startswith('mitochondrially encoded cytochrome'):
1926+ j = 'COI'
1927+ if j.lower().startswith('rna, 28s'):
1928+ j = '28S'
1929+ if j.lower().startswith('rna, 18s'):
1930+ j = '18S'
1931+ if j.lower().startswith('mitochondrially encoded 16s'):
1932+ j = '16S'
1933+ all_chars[i] = j
1934+ i += 1
1935+
1936 fig=figure(figsize=(22,17),dpi=90)
1937 fig.subplots_adjust(left=0.3)
1938 ax = fig.add_subplot(1,1,1)
1939
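The taxonomy-grouping block added to this plotting script (and mirrored in plot_tree_taxa_matrix.py below) follows one recipe: bucket each taxon by its value at the chosen level, push taxa missing from the taxonomy to the end with a 'zzzzz' sentinel, then turn per-group counts into cumulative boundaries for the axis. A minimal Python 3 sketch of that logic; the taxon and family names here are made up for illustration:

```python
# Hypothetical taxon -> group mapping at the chosen level; 'zzzzz' marks
# a taxon that could not be found in the taxonomy, so it sorts last.
tax_data = {
    "Aus_bus": "Fam_A",
    "Cus_dus": "Fam_B",
    "Eus_fus": "Fam_A",
    "Gus_hus": "zzzzz",
}

unique = sorted(set(tax_data.values()))  # one entry per group, sorted
new_all_taxa = []
counts = []
for u in unique:
    members = [t for t in tax_data if tax_data[t] == u]
    new_all_taxa.extend(members)   # taxa reordered group by group
    counts.append(len(members))

# cumulative boundaries: where each group ends along the axis
count_cumulate = []
for c in counts:
    count_cumulate.append(c + (count_cumulate[-1] if count_cumulate else 0))

print(count_cumulate)  # [2, 3, 4]
```

The cumulative totals are what the scripts print so the user can place group separators on the matrix axis; the last entry always equals the number of taxa kept.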
1940=== modified file 'stk/scripts/plot_tree_taxa_matrix.py'
1941--- stk/scripts/plot_tree_taxa_matrix.py 2014-12-10 08:55:43 +0000
1942+++ stk/scripts/plot_tree_taxa_matrix.py 2017-01-12 09:27:31 +0000
1943@@ -43,6 +43,18 @@
1944 default=False
1945 )
1946 parser.add_argument(
1947+ '-t',
1948+ '--taxonomy',
1949+ help="Use taxonomy to sort the taxa on the axis. Supply a STK taxonomy file",
1950+ )
1951+ parser.add_argument(
1952+ '--level',
1953+ choices=['family','superfamily','infraorder','suborder','order'],
1954+ default='family',
1955+ help="""What level to group the taxonomy at. Default is family.
1956+ Note that data for a particular level may be missing from the taxonomy."""
1957+ )
1958+ parser.add_argument(
1959 'input_file',
1960 metavar='input_file',
1961 nargs=1,
1962@@ -60,13 +72,57 @@
1963 verbose = args.verbose
1964 input_file = args.input_file[0]
1965 output_file = args.output_file[0]
1966+ taxonomy = args.taxonomy
1967+ level = args.level
1968
1969 XML = stk.load_phyml(input_file)
1970+ if not taxonomy == None:
1971+ taxonomy = stk.load_taxonomy(taxonomy)
1972+
1973 all_taxa = stk.get_all_taxa(XML)
1974
1975 taxa_tree_matrix = {}
1976 for t in all_taxa:
1977 taxa_tree_matrix[t] = []
1978+
1979+ if not taxonomy == None:
1980+ tax_data = {}
1981+ new_all_taxa = []
1982+ for t in all_taxa:
1983+ taxon = t.replace("_"," ")
1984+ try:
1985+ if taxonomy[taxon][level] == "":
1986+ # skip this
1987+ continue
1988+ tax_data[t] = taxonomy[taxon][level]
1989+ except KeyError:
1990+ print "Couldn't find "+t+" in taxonomy. Adding as null data"
1991+ tax_data[t] = 'zzzzz' # it's at the end...
1992+
1993+ from sets import Set
1994+ unique = set(tax_data.values())
1995+ unique = list(unique)
1996+ unique.sort()
1997+ print "Groups are:"
1998+ print unique
1999+ counts = []
2000+ for u in unique:
2001+ count = 0
2002+ for t in tax_data:
2003+ if tax_data[t] == u:
2004+ count += 1
2005+ new_all_taxa.append(t)
2006+ counts.append(count)
2007+
2008+ all_taxa = new_all_taxa
2009+ # cumulate counts
2010+ count_cumulate = []
2011+ count_cumulate.append(counts[0])
2012+ for c in counts[1::]:
2013+ count_cumulate.append(c+count_cumulate[-1])
2014+
2015+ print count_cumulate
2016+
2017
2018 trees = stk.obtain_trees(XML)
2019 i = 0
2020
2021=== modified file 'stk/scripts/remove_poorly_constrained_taxa.py'
2022--- stk/scripts/remove_poorly_constrained_taxa.py 2014-04-18 11:57:14 +0000
2023+++ stk/scripts/remove_poorly_constrained_taxa.py 2017-01-12 09:27:31 +0000
2024@@ -12,8 +12,8 @@
2025
2026 # do stuff
2027 parser = argparse.ArgumentParser(
2028- prog="convert tree from specific to generic",
2029- description="""Converts a tree at specific level to generic level""",
2030+ prog="remove poorly constrained taxa",
2031+ description="""Remove taxa that appear in one source tree only.""",
2032 )
2033 parser.add_argument(
2034 '-v',
2035@@ -34,6 +34,13 @@
2036 " to removal those in polytomies *and* only in one other tree."
2037 )
2038 parser.add_argument(
2039+ '--tree_only',
2040+ default=False,
2041+ action='store_true',
2042+ help="Restrict removal to taxa that occur in only one source tree. Default"+
2043+ " is to remove those in polytomies *and* in only one other tree."
2044+ )
2045+ parser.add_argument(
2046 'input_phyml',
2047 metavar='input_phyml',
2048 nargs=1,
2049@@ -43,13 +50,13 @@
2050 'input_tree',
2051 metavar='input_tree',
2052 nargs=1,
2053- help="Your tree"
2054+ help="Your tree - can be NULL or None"
2055 )
2056 parser.add_argument(
2057 'output_tree',
2058 metavar='output_tree',
2059 nargs=1,
2060- help="Your output tree"
2061+ help="Your output tree or Phyml - if input_tree is None, this is the output Phyml"
2062 )
2063
2064
2065@@ -62,14 +69,20 @@
2066 dl = True
2067 poly_only = args.poly_only
2068 input_tree = args.input_tree[0]
2069- output_tree = args.output_tree[0]
2070+ if input_tree == 'NULL' or input_tree == 'None':
2071+ input_tree = None
2072+ output_file = args.output_tree[0]
2073 input_phyml = args.input_phyml[0]
2074
2075 XML = stk.load_phyml(input_phyml)
2076 # load tree
2077- supertree = stk.import_tree(input_tree)
2078+ if (not input_tree == None):
2079+ supertree = stk.import_tree(input_tree)
2080+ taxa = stk._getTaxaFromNewick(supertree)
2081+ else:
2082+ supertree = None
2083+ taxa = stk.get_all_taxa(XML)
2084 # grab taxa
2085- taxa = stk._getTaxaFromNewick(supertree)
2086 delete_list = []
2087
2088 # loop over taxa in supertree and get some stats
2089@@ -115,19 +128,29 @@
2090
2091 print "Taxa: "+str(len(taxa))
2092 print "Deleting: "+str(len(delete_list))
2093- # done, so delete the problem taxa from the supertree
2094- for t in delete_list:
2095- # remove taxa from supertree
2096- supertree = stk._sub_taxa_in_tree(supertree,t)
2097-
2098- # save supertree
2099- tree = {}
2100- tree['Tree_1'] = supertree
2101- output = stk._amalgamate_trees(tree,format='nexus')
2102- # write file
2103- f = open(output_tree,"w")
2104- f.write(output)
2105- f.close()
2106+
2107+ if not supertree == None:
2108+ # done, so delete the problem taxa from the supertree
2109+ for t in delete_list:
2110+ # remove taxa from supertree
2111+ supertree = stk._sub_taxa_in_tree(supertree,t)
2112+
2113+ # save supertree
2114+ tree = {}
2115+ tree['Tree_1'] = supertree
2116+ output = stk._amalgamate_trees(tree,format='nexus')
2117+ # write file
2118+ f = open(output_file,"w")
2119+ f.write(output)
2120+ f.close()
2121+ else:
2122+ new_phyml = stk.substitute_taxa(XML,delete_list)
2123+ # write file
2124+ f = open(output_file,"w")
2125+ f.write(new_phyml)
2126+ f.close()
2127+
2128+
2129
2130 if (dl):
2131 # write file
2132
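In its simplest (tree-only) mode, the selection rule above amounts to: count how many source trees each taxon appears in, and flag those that appear in exactly one. A Python 3 sketch under that assumption; the `source_trees` mapping is invented here, whereas the real script collects taxa via `stk.obtain_trees`/`stk.get_taxa_from_tree` and also inspects polytomies:

```python
from collections import Counter

# Hypothetical source trees and their taxa
source_trees = {
    "tree_1": ["A", "B", "C"],
    "tree_2": ["A", "C", "D"],
    "tree_3": ["A", "B"],
}

# count occurrences of each taxon across all source trees
occurrences = Counter(t for taxa in source_trees.values() for t in taxa)

# flag taxa seen in only one tree for removal
delete_list = [t for t, n in occurrences.items() if n == 1]

print(sorted(delete_list))  # ['D']
```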
2133=== added file 'stk/scripts/tree_from_taxonomy.py'
2134--- stk/scripts/tree_from_taxonomy.py 1970-01-01 00:00:00 +0000
2135+++ stk/scripts/tree_from_taxonomy.py 2017-01-12 09:27:31 +0000
2136@@ -0,0 +1,142 @@
2137+# trees ready for supertree construction.
2138+# Copyright (C) 2015, Jon Hill, Katie Davis
2139+#
2140+# This program is free software: you can redistribute it and/or modify
2141+# it under the terms of the GNU General Public License as published by
2142+# the Free Software Foundation, either version 3 of the License, or
2143+# (at your option) any later version.
2144+#
2145+# This program is distributed in the hope that it will be useful,
2146+# but WITHOUT ANY WARRANTY; without even the implied warranty of
2147+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
2148+# GNU General Public License for more details.
2149+#
2150+# You should have received a copy of the GNU General Public License
2151+# along with this program. If not, see <http://www.gnu.org/licenses/>.
2152+#
2153+# Jon Hill. jon.hill@york.ac.uk
2154+
2155+import argparse
2156+import copy
2157+import os
2158+import sys
2159+stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir )
2160+sys.path.insert(0, stk_path)
2161+import supertree_toolkit as stk
2162+import csv
2163+from ete2 import Tree
2164+
2165+taxonomy_levels = ['species','subgenus','genus','subfamily','family','superfamily','subsection','section','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
2166+tlevels = ['species','genus','family','order','class','phylum','kingdom']
2167+
2168+
2169+def main():
2170+
2171+ # do stuff
2172+ parser = argparse.ArgumentParser(
2173+ prog="create a tree from a taxonomy file",
2174+ description="Create a taxonomic tree",
2175+ )
2176+ parser.add_argument(
2177+ '-v',
2178+ '--verbose',
2179+ action='store_true',
2180+ help="Verbose output: mainly progress reports.",
2181+ default=False
2182+ )
2183+ parser.add_argument(
2184+ 'top_level',
2185+ nargs=1,
2186+ help="The top level group to start with, e.g. family"
2187+ )
2188+ parser.add_argument(
2189+ 'input_file',
2190+ metavar='input_file',
2191+ nargs=1,
2192+ help="Your taxonomy file"
2193+ )
2194+ parser.add_argument(
2195+ 'output_file',
2196+ metavar='output_file',
2197+ nargs=1,
2198+ help="Your new tree file"
2199+ )
2200+
2201+ args = parser.parse_args()
2202+ verbose = args.verbose
2203+ input_file = args.input_file[0]
2204+ output_file = args.output_file[0]
2205+ top_level = args.top_level[0]
2206+
2207+ start_level = taxonomy_levels.index(top_level)
2208+ tree_taxonomy = stk.load_taxonomy(input_file)
2209+ new_taxa = tree_taxonomy.keys()
2210+
2211+ tl_types = []
2212+ for tt in tree_taxonomy:
2213+ tl_types.append(tree_taxonomy[tt][top_level])
2214+
2215+ tl_types = _uniquify(tl_types)
2216+ levels_to_worry_about = tlevels[0:tlevels.index(top_level)+1]
2217+
2218+ #print levels_to_worry_about[-2::-1]
2219+
2220+ t = Tree()
2221+ nodes = {}
2222+ nodes[top_level] = []
2223+ for tl in tl_types:
2224+ n = t.add_child(name=tl)
2225+ nodes[top_level].append({tl:n})
2226+
2227+ for l in levels_to_worry_about[-2::-1]:
2228+ #print t
2229+ names = []
2230+ nodes[l] = []
2231+ ci = levels_to_worry_about.index(l)
2232+ for tt in tree_taxonomy:
2233+ names.append(tree_taxonomy[tt][l])
2234+ names = _uniquify(names)
2235+ for n in names:
2236+ #print n
2237+ # find my parent
2238+ parent = None
2239+ for tt in tree_taxonomy:
2240+ if tree_taxonomy[tt][l] == n:
2241+ parent = tree_taxonomy[tt][levels_to_worry_about[ci+1]]
2242+ k = []
2243+ for nd in nodes[levels_to_worry_about[ci+1]]:
2244+ k.extend(nd.keys())
2245+ i = 0
2246+ for kk in k:
2248+ if kk == parent:
2249+ break
2250+ i += 1
2251+ parent_id = i
2252+ break
2253+ # find out where to attach it
2254+ node_id = nodes[levels_to_worry_about[ci+1]][parent_id][parent]
2255+ nd = node_id.add_child(name=n.replace(" ","_"))
2256+ nodes[l].append({n:nd})
2257+
2258+ tree = t.write(format=9)
2259+ tree = stk._collapse_nodes(tree)
2260+ tree = stk._collapse_nodes(tree)
2261+ open(output_file, "w").write(tree)
2262+
2263+
2264+def _uniquify(l):
2265+ """
2266+ Make a list, l, contain only unique data
2267+ """
2268+ keys = {}
2269+ for e in l:
2270+ keys[e] = 1
2271+
2272+ return keys.keys()
2273+
2274+if __name__ == "__main__":
2275+ main()
2276+
2277+
2278+
2279
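The core idea of tree_from_taxonomy.py is to nest taxa by successive ranks and emit the nesting as Newick. A pared-down Python 3 sketch of that idea with just two levels (genus within family) and no ete2 dependency; `newick_from_taxonomy` and its example data are hypothetical, and the real script walks an arbitrary level list and handles missing ranks:

```python
def newick_from_taxonomy(taxonomy):
    """Build a family -> genus -> species Newick string.

    taxonomy maps a species name to {'genus': ..., 'family': ...}.
    """
    # group species under genera, genera under families
    families = {}
    for sp, ranks in taxonomy.items():
        fam = families.setdefault(ranks["family"], {})
        fam.setdefault(ranks["genus"], []).append(sp.replace(" ", "_"))

    # emit the nesting, sorted for a deterministic result
    clades = []
    for fam in sorted(families):
        genera = [
            "(" + ",".join(sorted(families[fam][g])) + ")" + g
            for g in sorted(families[fam])
        ]
        clades.append("(" + ",".join(genera) + ")" + fam)
    return "(" + ",".join(clades) + ");"

tax = {
    "Aus bus": {"genus": "Aus", "family": "Fam1"},
    "Aus cus": {"genus": "Aus", "family": "Fam1"},
    "Dus eus": {"genus": "Dus", "family": "Fam2"},
}
print(newick_from_taxonomy(tax))
# (((Aus_bus,Aus_cus)Aus)Fam1,((Dus_eus)Dus)Fam2);
```

The single-child clades this produces (e.g. a genus containing one species) are why the script runs the result through `stk._collapse_nodes` afterwards.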
2280=== modified file 'stk/stk'
2281--- stk/stk 2014-12-09 10:58:48 +0000
2282+++ stk/stk 2017-01-12 09:27:31 +0000
2283@@ -23,6 +23,7 @@
2284 import sys
2285 import argparse
2286 import traceback
2287+import time
2288 try:
2289 __file__
2290 except NameError:
2291@@ -41,6 +42,10 @@
2292 import string
2293 import stk.p4 as p4
2294 import lxml
2295+import csv
2296+import tempfile
2297+from subprocess import check_call, CalledProcessError, call
2298+
2299 import stk.bzr_version as bzr_version
2300 d = bzr_version.version_info
2301 build = d.get('revno','<unknown revno>')
2302@@ -366,7 +371,7 @@
2303
2304 # Clean data
2305 parser_cm = subparsers.add_parser('clean_data',
2306- help='Remove errant taxa, uninformative trees and empty sources.'
2307+ help='Renames all sources and trees sensibly. Removes errant taxa, uninformative trees and empty sources.'
2308 )
2309 parser_cm.add_argument('input',
2310 help='The input phyml file')
2311@@ -488,7 +493,81 @@
2312 parser_cm.add_argument('subs',
2313 help='The subs file')
2314 parser_cm.set_defaults(func=check_subs)
2315-
2316+
2317+ # taxonomic name checker
2318+ parser_cm = subparsers.add_parser('check_otus',
2319+ help='Check your OTUs against EoL.'
2320+ )
2321+ parser_cm.add_argument('input',
2322+ help='The input Phyml. Also accepts tree files or a simple list')
2323+ parser_cm.add_argument('output',
2324+ help='The output CSV file. Taxon, synonyms, status')
2325+ parser_cm.add_argument('--overwrite',
2326+ action='store_true',
2327+ default=False,
2328+ help="Overwrite the existing file without asking for confirmation")
2329+ parser_cm.set_defaults(func=check_otus)
2330+
2331+ # create taxonomy csv file
2332+ parser_cm = subparsers.add_parser('create_taxonomy',
2333+ help='Create a taxonomy file in CSV for you to then augment.'
2334+ )
2335+ parser_cm.add_argument('input',
2336+ help='The input Phyml. Also accepts tree files or a simple list')
2337+ parser_cm.add_argument('output',
2338+ help='The output CSV file. Name, followed by classification and source')
2339+ parser_cm.add_argument('--overwrite',
2340+ action='store_true',
2341+ default=False,
2342+ help="Overwrite the existing file without asking for confirmation")
2343+ parser_cm.add_argument('--taxonomy',
2344+ help="Give a starting taxonomy file, e.g. one you ran earlier",)
2345+ parser_cm.set_defaults(func=create_taxonomy)
2346+
2347+
2348+ # do the subs in a one go using taxonomy
2349+ parser_cm = subparsers.add_parser('auto_subs',
2350+ help='Using a taxonomy, generate a species level version of your data in one go.'
2351+ )
2352+ parser_cm.add_argument('input',
2353+ help='The input Phyml')
2354+ parser_cm.add_argument('taxonomy',
2355+ help='Your taxonomy file',
2356+ )
2357+ parser_cm.add_argument('output',
2358+ help='The output phyml')
2359+ parser_cm.add_argument('--overwrite',
2360+ action='store_true',
2361+ default=False,
2362+ help="Overwrite the existing file without asking for confirmation")
2363+ #parser_cm.add_argument('--level',
2364+ # choices=supertree_toolkit.taxonomy_levels,
2365+ # help="Taxonomic level to output at",)
2366+ parser_cm.set_defaults(func=auto_subs)
2367+
2368+
2369+ # attempt to process the data into a matrix all automatically
2370+ parser_cm = subparsers.add_parser('process',
2372+ help='Generate a species-level matrix, and do all the checks and processing automatically. Note this creates a taxonomy and does all the processing, but the result will not be perfect (as taxonomies are not perfect)'
2372+ )
2373+ parser_cm.add_argument('input',
2374+ help='The input Phyml')
2375+ parser_cm.add_argument('output',
2376+ help='The output matrix')
2377+ parser_cm.add_argument('--taxonomy_file',
2378+ help='Existing taxonomy file to prevent redownloading data. Any taxa not in the file will be checked online, so partially complete files are OK.')
2379+ parser_cm.add_argument('--equivalents_file',
2380+ help='Existing equivalents file from a taxonomic name check. Any taxa not in the file will be checked online, so partially complete files are OK.')
2381+ parser_cm.add_argument('--overwrite',
2382+ action='store_true',
2383+ default=False,
2384+ help="Overwrite the existing file without asking for confirmation")
2385+ parser_cm.add_argument('--no_store',
2386+ action="store_true",
2387+ default=False,
2388+ help="Do not store intermediate files -- not recommended")
2389+ parser_cm.set_defaults(func=process)
2390+
2391
2392 # before we let argparse work its magic, check for --version
2393 if "--version" in sys.argv:
2394@@ -602,7 +681,7 @@
2395 # check if output files are there
2396 if (output_file and os.path.exists(output_file) and not overwrite):
2397 print "Output file exists. Either remove the file or use the --overwrite flag."
2398- print "Do you wish to continue? [Y/n]"
2399+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2400 while True:
2401 k=inkey()
2402 if k.lower() == 'n':
2403@@ -612,7 +691,7 @@
2404 break
2405 if (not newphyml == None and os.path.exists(newphyml) and not overwrite):
2406 print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2407- print "Do you wish to continue? [Y/n]"
2408+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2409 while True:
2410 k=inkey()
2411 if k.lower() == 'n':
2412@@ -624,9 +703,9 @@
2413 XML = supertree_toolkit.load_phyml(input_file)
2414 try:
2415 if (newphyml == None):
2416- data_independence = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2417+ data_independence, subsets = supertree_toolkit.data_independence(XML,ignoreWarnings=ignoreWarnings)
2418 else:
2419- data_independence, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2420+ data_independence, subsets, new_phyml = supertree_toolkit.data_independence(XML,make_new_xml=True,ignoreWarnings=ignoreWarnings)
2421 except NotUniqueError as detail:
2422 msg = "***Error: Failed to check independence.\n"+detail.msg
2423 print msg
2424@@ -644,7 +723,7 @@
2425 print msg
2426 return
2427 except:
2428- msg = "***Error: failed to check independence due to unknown error."
2429+ msg = "***Error: failed to check independence due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2430 print msg
2431 traceback.print_exc()
2432 return
2433@@ -653,16 +732,14 @@
2434 data_ind = ""
2435 #column headers
2436 data_ind = "Source trees that are subsets of others\n"
2437- data_ind = data_ind + "Flagged tree, is a subset of:\n"
2438- for name in data_independence:
2439- if ( data_independence[name][1] == supertree_toolkit.SUBSET ):
2440- data_ind += name + "," + data_independence[name][0] + "\n"
2441+ data_ind = data_ind + "Flagged tree(s), is/are subset(s) of:\n"
2442+ for names in subsets:
2443+ data_ind += ";".join(names[1:]) + "," + names[0] + "\n"
2444
2445 data_ind += "\n\nSource trees that are identical to others\n"
2446- data_ind = data_ind + "Flagged tree, is identical to:\n"
2447- for name in data_independence:
2448- if ( data_independence[name][1] == supertree_toolkit.IDENTICAL ):
2449- data_ind += name + "," + data_independence[name][0] + "\n"
2450+ data_ind = data_ind + "Flagged tree(s), is/are identical to:\n"
2451+ for names in data_independence:
2452+ data_ind += ";".join(names[1:]) + "," + names[0] + "\n"
2453
2454
2455 if (output_file == False or
2456@@ -762,7 +839,7 @@
2457 # Does the output file already exist?
2458 if (os.path.exists(output_file) and not overwrite):
2459 print "Output file exists. Either remove the file or use the --overwrite flag."
2460- print "Do you wish to continue? [Y/n]"
2461+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2462 while True:
2463 k=inkey()
2464 if k.lower() == 'n':
2465@@ -771,6 +848,7 @@
2466 if k.lower() == 'y':
2467 break
2468 try:
2469+
2470 XML = supertree_toolkit.load_phyml(input_file)
2471 input_is_xml = True
2472 except:
2473@@ -896,7 +974,7 @@
2474 # Does the output file already exist?
2475 if (os.path.exists(output_file) and not overwrite):
2476 print "Output file exists. Either remove the file or use the --overwrite flag."
2477- print "Do you wish to continue? [Y/n]"
2478+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2479 while True:
2480 k=inkey()
2481 if k.lower() == 'n':
2482@@ -942,7 +1020,7 @@
2483 print msg
2484 return
2485 except:
2486- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2487+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2488 print msg
2489 traceback.print_exc()
2490 return
2491@@ -983,7 +1061,7 @@
2492
2493 if (os.path.exists(output_file) and not overwrite):
2494 print "Output file exists. Either remove the file or use the --overwrite flag."
2495- print "Do you wish to continue? [Y/n]"
2496+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2497 while True:
2498 k=inkey()
2499 if k.lower() == 'n':
2500@@ -1013,7 +1091,7 @@
2501 print msg
2502 return
2503 except:
2504- msg = "***Error: Failed sbstituting taxa due to unknown error.\n"
2505+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2506 print msg
2507 traceback.print_exc()
2508 return
2509@@ -1060,7 +1138,7 @@
2510 print msg
2511 return
2512 except:
2513- msg = "***Error: Failed to export data due to unknown error.\n"
2514+ msg = "***Error: Failed to export data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2515 print msg
2516 traceback.print_exc()
2517 return
2518@@ -1115,7 +1193,7 @@
2519 print msg
2520 return
2521 except:
2522- msg = "***Error: Failed to check overlap due to unknown error.\n"
2523+ msg = "***Error: Failed to check overlap due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2524 print msg
2525 traceback.print_exc()
2526 return
2527@@ -1161,7 +1239,7 @@
2528 # check if output files are there
2529 if (output_file and os.path.exists(output_file) and not overwrite):
2530 print "Output file exists. Either remove the file or use the --overwrite flag."
2531- print "Do you wish to continue? [Y/n]"
2532+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2533 while True:
2534 k=inkey()
2535 if k.lower() == 'n':
2536@@ -1191,7 +1269,7 @@
2537 print msg
2538 return
2539 except:
2540- msg = "***Error: Failed to export trees due to unknown error.\n"
2541+ msg = "***Error: Failed to export trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2542 print msg
2543 traceback.print_exc()
2544 return
2545@@ -1220,7 +1298,7 @@
2546 # check if output files are there
2547 if (output_file and os.path.exists(output_file) and not overwrite):
2548 print "Output file exists. Either remove the file or use the --overwrite flag."
2549- print "Do you wish to continue? [Y/n]"
2550+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2551 while True:
2552 k=inkey()
2553 if k.lower() == 'n':
2554@@ -1309,7 +1387,7 @@
2555 print msg
2556 return
2557 except:
2558- msg = "***Error: Failed to permute trees due to unknown error.\n"
2559+ msg = "***Error: Failed to permute trees due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2560 print msg
2561 traceback.print_exc()
2562 return
2563@@ -1347,7 +1425,7 @@
2564 # check if output files are there
2565 if (os.path.exists(output_file) and not overwrite):
2566 print "Output file exists. Either remove the file or use the --overwrite flag."
2567- print "Do you wish to continue? [Y/n]"
2568+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2569 while True:
2570 k=inkey()
2571 if k.lower() == 'n':
2572@@ -1376,7 +1454,7 @@
2573 print msg
2574 return
2575 except:
2576- msg = "***Error: Failed to clean data due to unknown error.\n"
2577+ msg = "***Error: Failed to clean data due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2578 print msg
2579 traceback.print_exc()
2580 return
2581@@ -1404,7 +1482,7 @@
2582 # check if output files are there
2583 if (os.path.exists(output_file) and not overwrite):
2584 print "Output file exists. Either remove the file or use the --overwrite flag."
2585- print "Do you wish to continue? [Y/n]"
2586+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2587 while True:
2588 k=inkey()
2589 if k.lower() == 'n':
2590@@ -1433,7 +1511,7 @@
2591 print msg
2592 return
2593 except:
2594- msg = "***Error: Failed to replace genera due to unknown error.\n"
2595+ msg = "***Error: Failed to replace genera due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2596 print msg
2597 traceback.print_exc()
2598 return
2599@@ -1488,7 +1566,7 @@
2600 new_trees = {}
2601 i = 1
2602 for t in trees:
2603- new_trees['tree_'+str(i)] = t
2604+ new_trees['tree_'+str(i)] = supertree_toolkit._collapse_nodes(t)
2605 i += 1
2606 output = supertree_toolkit._amalgamate_trees(new_trees,format=output_format)
2607 except TreeParseError as detail:
2608@@ -1503,7 +1581,7 @@
2609 # check if output files are there
2610 if (os.path.exists(output_file) and not overwrite):
2611 print "Output file exists. Either remove the file or use the --overwrite flag."
2612- print "Do you wish to continue? [Y/n]"
2613+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2614 while True:
2615 k=inkey()
2616 if k.lower() == 'n':
2617@@ -1540,7 +1618,7 @@
2618 # check if output files are there
2619 if (os.path.exists(output_file) and not overwrite):
2620 print "Output file exists. Either remove the file or use the --overwrite flag."
2621- print "Do you wish to continue? [Y/n]"
2622+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2623 while True:
2624 k=inkey()
2625 if k.lower() == 'n':
2626@@ -1589,7 +1667,7 @@
2627 print msg
2628 return
2629 except:
2630- msg = "***Error: Failed to create subset due to unknown error.\n"
2631+ msg = "***Error: Failed to create subset due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2632 print msg
2633 traceback.print_exc()
2634 return
2635@@ -1637,6 +1715,681 @@
2636 print "**************************************************************\n"
2637
2638
2639+def check_otus(args):
2640+ """check out the OTUs in the Phyml - are they considered valid?"""
2641+
2642+ verbose = args.verbose
2643+ input_file = args.input
2644+ output_file = args.output
2645+
2646+ print input_file
2647+ if (input_file.endswith(".phyml")):
2648+ XML = supertree_toolkit.load_phyml(input_file)
2649+ try:
2650+ equivs = supertree_toolkit.taxonomic_checker(XML, verbose=verbose)
2651+ except NotUniqueError as detail:
2652+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2653+ print msg
2654+ return
2655+ except InvalidSTKData as detail:
2656+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2657+ print msg
2658+ return
2659+ except UninformativeTreeError as detail:
2660+ msg = "***Error: Failed to check OTUs.\n"+detail.msg
2661+ print msg
2662+ return
2663+ except TreeParseError as detail:
2664+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2665+ print msg
2666+ return
2667+ except:
2668+ # what about no internet connection? What error does that throw?
2669+ msg = "***Error: failed to check OTUs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2670+ print msg
2671+ traceback.print_exc()
2672+ return
2673+ elif (input_file.endswith(".txt") or input_file.endswith('.dat')):
2674+ # read file - assume one taxa per line
2675+ with open(input_file,'r') as f:
2676+ lines = f.read().splitlines()
2677+ equivs = supertree_toolkit.taxonomic_checker_list(lines, verbose=verbose)
2678+ else:
2679+ # assume a tree!
2680+ equivs = supertree_toolkit.taxonomic_checker_tree(input_file, verbose=verbose)
2681+
2682+
2683+
2684+ f = open(output_file,"w")
2685+ for taxon in sorted(equivs.keys()):
2686+ f.write(taxon+","+";".join(equivs[taxon][0])+","+equivs[taxon][1]+"\n")
2687+ f.close()
2688+
2689+
2690+
2691+def create_taxonomy(args):
2692+ """create a taxonomic hierarchy for each OTU in the Phyml"""
2693+
2694+ verbose = args.verbose
2695+ input_file = args.input
2696+ output_file = args.output
2697+ existing_taxonomy = args.taxonomy
2698+ ignoreWarnings = args.ignoreWarnings
2699+
2700+ XML = supertree_toolkit.load_phyml(input_file)
2701+ if (not existing_taxonomy == None):
2702+ existing_taxonomy = supertree_toolkit.load_taxonomy(existing_taxonomy) # load it in and create the dictionary
2703+ pass
2704+
2705+ try:
2706+ taxonomy = supertree_toolkit.create_taxonomy(XML,existing_taxonomy=existing_taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2707+ except NotUniqueError as detail:
2708+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2709+ print msg
2710+ return
2711+ except InvalidSTKData as detail:
2712+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2713+ print msg
2714+ return
2715+ except UninformativeTreeError as detail:
2716+ msg = "***Error: Failed to create taxonomy.\n"+detail.msg
2717+ print msg
2718+ return
2719+ except TreeParseError as detail:
2720+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2721+ print msg
2722+ return
2723+ except:
2724+ # what about no internet connection? What error does that throw?
2725+ msg = "***Error: failed to create taxonomy due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2726+ print msg
2727+ traceback.print_exc()
2728+ return
2729+
2730+ # Now create the CSV output
2731+ with open(output_file, 'w') as f:
2732+ writer = csv.writer(f)
2733+ headers = []
2734+ headers.append("OTU")
2735+ headers.extend(supertree_toolkit.taxonomy_levels)
2736+ headers.append("Data source")
2737+ writer.writerow(headers)
2738+ for t in taxonomy:
2739+ otu = t
2740+ try:
2741+ species = taxonomy[t]['species']
2742+ except KeyError:
2743+ species = "-"
2744+ try:
2745+ genus = taxonomy[t]['genus']
2746+ except KeyError:
2747+ genus = "-"
2748+ try:
2749+ family = taxonomy[t]['family']
2750+ except KeyError:
2751+ family = "-"
2752+ try:
2753+ superfamily = taxonomy[t]['superfamily']
2754+ except KeyError:
2755+ superfamily = "-"
2756+ try:
2757+ infraorder = taxonomy[t]['infraorder']
2758+ except KeyError:
2759+ infraorder = "-"
2760+ try:
2761+ suborder = taxonomy[t]['suborder']
2762+ except KeyError:
2763+ suborder = "-"
2764+ try:
2765+ order = taxonomy[t]['order']
2766+ except KeyError:
2767+ order = "-"
2768+ try:
2769+ superorder = taxonomy[t]['superorder']
2770+ except KeyError:
2771+ superorder = "-"
2772+ try:
2773+ subclass = taxonomy[t]['subclass']
2774+ except KeyError:
2775+ subclass = "-"
2776+ try:
2777+ tclass = taxonomy[t]['class']
2778+ except KeyError:
2779+ tclass = "-"
2780+ try:
2781+ subphylum = taxonomy[t]['subphylum']
2782+ except KeyError:
2783+ subphylum = "-"
2784+ try:
2785+ phylum = taxonomy[t]['phylum']
2786+ except KeyError:
2787+ phylum = "-"
2788+ try:
2789+ superphylum = taxonomy[t]['superphylum']
2790+ except KeyError:
2791+ superphylum = "-"
2792+ try:
2793+ infrakingdom = taxonomy[t]['infrakingdom']
2794+ except KeyError:
2795+ infrakingdom = "-"
2796+ try:
2797+ subkingdom = taxonomy[t]['subkingdom']
2798+ except KeyError:
2799+ subkingdom = "-"
2800+ try:
2801+ kingdom = taxonomy[t]['kingdom']
2802+ except KeyError:
2803+ kingdom = "-"
2804+ try:
2805+ provider = taxonomy[t]['provider']
2806+ except KeyError:
2807+ provider = "-"
2808+
2809+ if (isinstance(species, list)):
2810+ species = " ".join(species)
2811+ this_classification = [
2812+ otu.encode('utf-8'),
2813+ species.encode('utf-8'),
2814+ genus.encode('utf-8'),
2815+ family.encode('utf-8'),
2816+ superfamily.encode('utf-8'),
2817+ infraorder.encode('utf-8'),
2818+ suborder.encode('utf-8'),
2819+ order.encode('utf-8'),
2820+ superorder.encode('utf-8'),
2821+ subclass.encode('utf-8'),
2822+ tclass.encode('utf-8'),
2823+ subphylum.encode('utf-8'),
2824+ phylum.encode('utf-8'),
2825+ superphylum.encode('utf-8'),
2826+ infrakingdom.encode('utf-8'),
2827+ subkingdom.encode('utf-8'),
2828+ kingdom.encode('utf-8'),
2829+ provider.encode('utf-8')]
2830+ writer.writerow(this_classification)
2831+
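The long run of near-identical try/except KeyError blocks above could be collapsed by iterating over the rank names and using `dict.get`. A minimal sketch of that pattern, under the assumption that each taxonomy entry is a plain dict keyed by rank name (the `taxonomy_levels` list here is a shortened stand-in for the module's own, and `classification_row` is a hypothetical helper):

```python
# Shortened stand-in for supertree_toolkit.taxonomy_levels
taxonomy_levels = ['species', 'genus', 'family', 'order', 'kingdom']

def classification_row(otu, ranks):
    """Return [otu, one value per rank, provider], with '-' for missing ranks."""
    row = [otu]
    for level in taxonomy_levels:
        value = ranks.get(level, '-')   # replaces one try/except per rank
        if isinstance(value, list):     # e.g. species occasionally stored as a list
            value = ' '.join(value)
        row.append(value)
    row.append(ranks.get('provider', '-'))
    return row

row = classification_row('Gallus_gallus',
                         {'species': 'Gallus gallus',
                          'genus': 'Gallus',
                          'provider': 'EOL'})
```

Each such row can then go straight to `csv.writer.writerow`, exactly as the code above does with its hand-built list.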
2832+def auto_subs(args):
2833+ """Get all OTUs to the same taxonomic level"""
2834+
2835+
2836+ verbose = args.verbose
2837+ input_file = args.input
2838+ output = args.output
2839+ taxonomy = args.taxonomy
2840+ ignoreWarnings = args.ignoreWarnings
2841+ overwrite = args.overwrite
2841+
2842+ if (os.path.exists(output) and not overwrite):
2843+ print "Output Phyml file exists. Either remove the file or use the --overwrite flag."
2844+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2845+ while True:
2846+ k=inkey()
2847+ if k.lower() == 'n':
2848+ print "Exiting..."
2849+ sys.exit(0)
2850+ if k.lower() == 'y':
2851+ break
2852+
2853+ XML = supertree_toolkit.load_phyml(input_file)
2854+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy) # load it in and create the dictionary
2855+
2856+ try:
2857+ newXML = supertree_toolkit.generate_species_level_data(XML,taxonomy,verbose=verbose,ignoreWarnings=ignoreWarnings)
2858+ except NotUniqueError as detail:
2859+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2860+ print msg
2861+ return
2862+ except InvalidSTKData as detail:
2863+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2864+ print msg
2865+ return
2866+ except UninformativeTreeError as detail:
2867+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2868+ print msg
2869+ return
2870+ except TreeParseError as detail:
2871+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2872+ print msg
2873+ return
2874+ except NoneCompleteTaxonomy as detail:
2875+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
2876+ print msg
2877+ return
2878+ except:
2879+ # what about no internet connection? What error does that throw?
2880+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
2881+ print msg
2882+ traceback.print_exc()
2883+ return
2884+
2885+ f = open(output,"w")
2886+ f.write(newXML)
2887+ f.close()
2888+
2889+def process(args):
2890+
2891+ verbose = args.verbose
2892+ input_file = args.input
2893+ output = args.output
2894+ no_store = args.no_store
2895+ ignoreWarnings = args.ignoreWarnings
2896+ taxonomy_file = args.taxonomy_file
2897+ equivalents_file = args.equivalents_file
2898+ overwrite = args.overwrite
2899+
2900+ if (os.path.exists(output) and not overwrite):
2901+ print "Output matrix file exists. Either remove the file or use the --overwrite flag."
2902+ print "Do you wish to continue and overwrite the file anyway? [Y/n]"
2903+ while True:
2904+ k=inkey()
2905+ if k.lower() == 'n':
2906+ print "Exiting..."
2907+ sys.exit(0)
2908+ if k.lower() == 'y':
2909+ break
2910+
2911+ filename = os.path.basename(input_file)
2912+ dirname = os.path.dirname(input_file)
2913+
2914+ if verbose:
2915+ print "Loading and checking your data"
2916+ # 0) load and check data
2917+ try:
2918+ phyml = supertree_toolkit.load_phyml(input_file)
2919+ project_name = supertree_toolkit.get_project_name(phyml)
2920+ supertree_toolkit._check_data(phyml)
2921+ except NotUniqueError as detail:
2922+ msg = "***Error: Failed to load data.\n"+detail.msg
2923+ print msg
2924+ return
2925+ except InvalidSTKData as detail:
2926+ msg = "***Error: Failed to load data.\n"+detail.msg
2927+ print msg
2928+ return
2929+ except UninformativeTreeError as detail:
2930+ msg = "***Error: Failed to load data.\n"+detail.msg
2931+ print msg
2932+ return
2933+ except TreeParseError as detail:
2934+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2935+ print msg
2936+ return
2937+ except:
2938+ msg = "***Error: Failed to load input due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2939+ print msg
2940+ traceback.print_exc()
2941+ return
2942+
2943+ if verbose:
2944+ print "Checking taxa against online databases"
2945+ # 1) taxonomy checker with autoreplace
2946+ # Load existing data if any:
2947+ if (not equivalents_file == None):
2948+ equivalents = supertree_toolkit.load_equivalents(equivalents_file)
2949+ else:
2950+ equivalents = None
2951+ equivalents = supertree_toolkit.taxonomic_checker(phyml,existing_data=equivalents,verbose=verbose)
2952+ # save the equivalents for later (as CSV and as sub file)
2953+ data_string_csv = _equivalents_to_csv(equivalents)
2954+ data_string_subs = _equivalents_to_subs(equivalents)
2955+ f = open(os.path.join(dirname,project_name+"_taxonomy_checker.csv"), "w")
2956+ f.write(data_string_csv)
2957+ f.close()
2958+ f = open(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"), "w")
2959+ f.write(data_string_subs)
2960+ f.close()
2961+
2962+ # now do the replacements - we use the subs file :)
2963+ if verbose:
2964+ print "Swapping in the corrected taxa names"
2965+ try:
2966+ old_taxa, new_taxa = supertree_toolkit.parse_subs_file(os.path.join(dirname,project_name+"_taxonomy_check_subs.dat"))
2967+ except UnableToParseSubsFile as e:
2968+ print e.msg
2969+ sys.exit(-1)
2970+ try:
2971+ phyml = supertree_toolkit.substitute_taxa(phyml,old_taxa,new_taxa,only_existing=False,verbose=verbose)
2972+ except NotUniqueError as detail:
2973+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2974+ print msg
2975+ return
2976+ except InvalidSTKData as detail:
2977+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2978+ print msg
2979+ return
2980+ except UninformativeTreeError as detail:
2981+ msg = "***Error: Failed substituting taxa.\n"+detail.msg
2982+ print msg
2983+ return
2984+ except TreeParseError as detail:
2985+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
2986+ print msg
2987+ return
2988+ except:
2989+ msg = "***Error: Failed substituting taxa due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
2990+ print msg
2991+ traceback.print_exc()
2992+ return
2993+ # save phyml as intermediate step
2994+ f = open(os.path.join(dirname,project_name+"_taxonomy_checked.phyml"), "w")
2995+ f.write(phyml)
2996+ f.close()
2997+
2998+
2999+ if verbose:
3000+ print "Creating taxonomic information"
3001+ # 2) create taxonomy
3002+ if (not taxonomy_file == None):
3003+ taxonomy = supertree_toolkit.load_taxonomy(taxonomy_file)
3004+ else:
3005+ taxonomy = None
3006+ taxonomy = supertree_toolkit.create_taxonomy(phyml,existing_taxonomy=taxonomy,verbose=verbose)
3007+ # save the taxonomy for later
3008+ # Now create the CSV output - separate this out into a function in STK (used several times)
3009+ with open(os.path.join(dirname,project_name+"_taxonomy.csv"), 'w') as f:
3010+ writer = csv.writer(f)
3011+ headers = []
3012+ headers.append("OTU")
3013+ headers.extend(supertree_toolkit.taxonomy_levels)
3014+ headers.append("Data source")
3015+ writer.writerow(headers)
3016+ for t in taxonomy:
3017+ otu = t
3018+ try:
3019+ species = taxonomy[t]['species']
3020+ except KeyError:
3021+ species = "-"
3022+ try:
3023+ subgenus = taxonomy[t]['subgenus']
3024+ except KeyError:
3025+ subgenus = "-"
3026+ try:
3027+ genus = taxonomy[t]['genus']
3028+ except KeyError:
3029+ genus = "-"
3030+ try:
3031+ subfamily = taxonomy[t]['subfamily']
3032+ except KeyError:
3033+ subfamily = "-"
3034+ try:
3035+ family = taxonomy[t]['family']
3036+ except KeyError:
3037+ family = "-"
3038+ try:
3039+ superfamily = taxonomy[t]['superfamily']
3040+ except KeyError:
3041+ superfamily = "-"
3042+ try:
3043+ subsection = taxonomy[t]['subsection']
3044+ except KeyError:
3045+ subsection = "-"
3046+ try:
3047+ section = taxonomy[t]['section']
3048+ except KeyError:
3049+ section = "-"
3050+ try:
3051+ infraorder = taxonomy[t]['infraorder']
3052+ except KeyError:
3053+ infraorder = "-"
3054+ try:
3055+ suborder = taxonomy[t]['suborder']
3056+ except KeyError:
3057+ suborder = "-"
3058+ try:
3059+ order = taxonomy[t]['order']
3060+ except KeyError:
3061+ order = "-"
3062+ try:
3063+ superorder = taxonomy[t]['superorder']
3064+ except KeyError:
3065+ superorder = "-"
3066+ try:
3067+ subclass = taxonomy[t]['subclass']
3068+ except KeyError:
3069+ subclass = "-"
3070+ try:
3071+ tclass = taxonomy[t]['class']
3072+ except KeyError:
3073+ tclass = "-"
3074+ try:
3075+ superclass = taxonomy[t]['superclass']
3076+ except KeyError:
3077+ superclass = "-"
3078+ try:
3079+ subphylum = taxonomy[t]['subphylum']
3080+ except KeyError:
3081+ subphylum = "-"
3082+ try:
3083+ phylum = taxonomy[t]['phylum']
3084+ except KeyError:
3085+ phylum = "-"
3086+ try:
3087+ superphylum = taxonomy[t]['superphylum']
3088+ except KeyError:
3089+ superphylum = "-"
3090+ try:
3091+ infrakingdom = taxonomy[t]['infrakingdom']
3092+ except KeyError:
3093+ infrakingdom = "-"
3094+ try:
3095+ subkingdom = taxonomy[t]['subkingdom']
3096+ except KeyError:
3097+ subkingdom = "-"
3098+ try:
3099+ kingdom = taxonomy[t]['kingdom']
3100+ except KeyError:
3101+ kingdom = "-"
3102+ try:
3103+ provider = taxonomy[t]['provider']
3104+ except KeyError:
3105+ provider = "-"
3106+ this_classification = [
3107+ otu.encode('utf-8'),
3108+ species.encode('utf-8'),
3109+ subgenus.encode('utf-8'),
3110+ genus.encode('utf-8'),
3111+ subfamily.encode('utf-8'),
3112+ family.encode('utf-8'),
3113+ superfamily.encode('utf-8'),
3114+ subsection.encode('utf-8'),
3115+ section.encode('utf-8'),
3116+ infraorder.encode('utf-8'),
3117+ suborder.encode('utf-8'),
3118+ order.encode('utf-8'),
3119+ superorder.encode('utf-8'),
3120+ subclass.encode('utf-8'),
3121+ tclass.encode('utf-8'),
3122+ superclass.encode('utf-8'),
3123+ subphylum.encode('utf-8'),
3124+ phylum.encode('utf-8'),
3125+ superphylum.encode('utf-8'),
3126+ infrakingdom.encode('utf-8'),
3127+ subkingdom.encode('utf-8'),
3128+ kingdom.encode('utf-8'),
3129+ provider.encode('utf-8')]
3130+ writer.writerow(this_classification)
3131+
3132+ # 3) create species level dataset
3133+ if verbose:
3134+ print "Converting data to species level"
3135+ try:
3136+ phyml = supertree_toolkit.generate_species_level_data(phyml,taxonomy,verbose=verbose)
3137+ except NotUniqueError as detail:
3138+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3139+ print msg
3140+ return
3141+ except InvalidSTKData as detail:
3142+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3143+ print msg
3144+ return
3145+ except UninformativeTreeError as detail:
3146+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3147+ print msg
3148+ return
3149+ except TreeParseError as detail:
3150+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3151+ print msg
3152+ return
3153+ except NoneCompleteTaxonomy as detail:
3154+ msg = "***Error: Failed to carry out auto subs.\n"+detail.msg
3155+ print msg
3156+ return
3157+ except:
3158+ # what about no internet connection? What error does that throw?
3159+ msg = "***Error: failed to carry out auto subs due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit"
3160+ print msg
3161+ traceback.print_exc()
3162+ return
3163+ # save the phyml as intermediate step
3164+ f = open(os.path.join(dirname,project_name+"_species_level.phyml"), "w")
3165+ f.write(phyml)
3166+ f.close()
3167+
3168+ # 4) Remove non-monophyletic taxa (requires TNT to be installed)
3169+ if verbose:
3170+ print "Removing non-monophyletic taxa via mini-supertree method"
3171+ tree_list = supertree_toolkit._find_trees_for_permuting(phyml)
3172+ try:
3173+ for t in tree_list:
3174+ # permute
3175+ output_string = supertree_toolkit.permute_tree(tree_list[t],matrix='hennig',treefile=None,verbose=verbose)
3176+ #save
3177+ if (not output_string == ""):
3178+ file_name = os.path.basename(filename)
3179+ dirname = os.path.dirname(filename)
3180+ new_output = os.path.join(dirname,t,t+"_matrix.tnt")
3181+ try:
3182+ os.makedirs(os.path.join(dirname,t))
3183+ except OSError:
3184+ if not os.path.isdir(os.path.join(dirname,t)):
3185+ raise
3186+ f = open(new_output,'w',0)
3187+ f.write(output_string)
3188+ f.close()
3189+ time.sleep(1)
3190+
3191+ # now create the tnt command to deal with this
3192+ # create a tmp file for the output tree
3193+ temp_file_handle, temp_file = tempfile.mkstemp(suffix=".tnt")
3194+ tnt_command = "tnt mxram 512,run "+new_output+",echo= ,timeout 00:10:00,rseed0,rseed*,hold 1000,xmult= level 0,taxname=,nelsen *,tsave *"+temp_file+",save /,quit"
3195+ #tnt_command = "tnt run "+new_output+",ienum,taxname=,nelsen*,tsave *"+temp_file+",save /,quit"
3196+ # run tnt, grab the output and store back in the data
3197+ #try:
3198+ call(tnt_command, shell=True)
3199+ #except CalledProcessError as e:
3200+ # msg = "***Error: Failed to run TNT. Is it installed correctly?\n"+e.msg
3201+ # print msg
3202+ # return
3203+ #ret = os.system(tnt_command)
3204+ #if (not ret == 0):
3205+ # print "error running tnt"
3206+ # return
3207+
3208+ new_tree = supertree_toolkit.import_tree(temp_file)
3209+ phyml = supertree_toolkit._swap_tree_in_XML(phyml,new_tree,t)
3210+
3211+ except TreeParseError as e:
3212+ msg = "***Error permuting trees.\n"+e.msg
3213+ print msg
3214+ return
3215+
3216+ #4.5) remove MRP_Outgroups
3217+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_Outgroup')
3218+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOutgroup')
3219+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRP_outgroup')
3220+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPoutgroup')
3221+ phyml = supertree_toolkit.substitute_taxa(phyml,'MRPOUTGROUP')
3222+
3223+ # save intermediate phyml
3224+ f = open(os.path.join(dirname,project_name+"_nonmonophyl_removed.phyml"), "w")
3225+ f.write(phyml)
3226+ f.close()
3227+
3228+
3229+ # 5) Remove common names
3230+ # no function to do this yet...
3231+
3232+ # 6) Data independance
3233+ if verbose:
3234+ print "Checking data independence"
3235+ data_ind,subsets,phyml = supertree_toolkit.data_independence(phyml,make_new_xml=True)
3236+ # save phyml
3237+ f = open(os.path.join(dirname,project_name+"_data_ind.phyml"), "w")
3238+ f.write(phyml)
3239+ f.close()
3240+
3241+ # 7) Data overlap
3242+ if verbose:
3243+ print "Checking data overlap"
3244+ sufficient_overlap, key_list = supertree_toolkit.data_overlap(phyml,verbose=verbose)
3245+ # process the key_list to remove the unconnected trees
3246+ if not sufficient_overlap:
3247+ # we don't have enough overlap, so remove all but the largest group.
3248+ # the key contains a list, with the largest group first (thanks networkX!)
3249+ # we can therefore just remove trees from everything but the first in the list
3250+ delete_me = []
3251+ for t in key_list[1::]: # skip 0
3252+ delete_me.extend(t)
3253+ for tree in delete_me:
3254+ phyml = supertree_toolkit._swap_tree_in_XML(phyml, None, tree, delete=True) # delete the tree and clean the data as we go
3255+ # save phyml
3256+ f = open(os.path.join(dirname,project_name+"_data_tax_overlap.phyml"), "w")
3257+ f.write(phyml)
3258+ f.close()
3259+
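Step 7 above keeps only the largest connected group of trees: `key_list` arrives sorted largest-group-first, so everything after index 0 is scheduled for deletion. A minimal sketch of that selection, with hypothetical tree names (the real code then calls `_swap_tree_in_XML` with `delete=True` for each name):

```python
def trees_to_delete(key_list):
    """key_list: groups of tree names, largest connected group first.
    Returns the names in every group except the first (the keepers)."""
    delete_me = []
    for group in key_list[1:]:  # keep group 0, drop the rest
        delete_me.extend(group)
    return delete_me

doomed = trees_to_delete([['t1', 't2', 't3'], ['t4'], ['t5', 't6']])
```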
3260+
3261+ # 8) Create matrix
3262+ if verbose:
3263+ print "Creating matrix"
3264+ try:
3265+ matrix = supertree_toolkit.create_matrix(phyml)
3266+ except NotUniqueError as detail:
3267+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3268+ print msg
3269+ return
3270+ except InvalidSTKData as detail:
3271+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3272+ print msg
3273+ return
3274+ except UninformativeTreeError as detail:
3275+ msg = "***Error: Failed to create matrix.\n"+detail.msg
3276+ print msg
3277+ return
3278+ except TreeParseError as detail:
3279+ msg = "***Error: failed to parse a tree in your data set.\n"+detail.msg
3280+ print msg
3281+ return
3282+ except:
3283+ msg = "***Error: Failed to create matrix due to unknown error. File a bug report, please!\nhttps://bugs.launchpad.net/supertree-toolkit\n"
3284+ print msg
3285+ traceback.print_exc()
3286+ return
3287+
3288+ f = open(output, "w")
3289+ f.write(matrix)
3290+ f.close()
3291+
3292+ return
3293+
3294+
3295+def _equivalents_to_csv(equivalents):
3296+
3297+ output_string = 'Taxa,Equivalents,Status\n'
3298+
3299+ for taxon in sorted(equivalents):
3300+ output_string += taxon + "," + ';'.join(equivalents[taxon][0]) + "," + equivalents[taxon][1] + "\n"
3301+
3302+ return output_string
3303+
3304+
3305+def _equivalents_to_subs(equivalents):
3306+ """Only corrects the yellow ones. Red and green are left alone"""
3307+
3308+ output_string = ""
3309+ for taxon in sorted(equivalents):
3310+ if (equivalents[taxon][1] == 'yellow'):
3311+ # the first name is always the correct one
3312+ output_string += taxon + " = "+equivalents[taxon][0][0]+"\n"
3313+ return output_string
3314
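`_equivalents_to_subs` emits one `old = new` line per yellow taxon, taking the first synonym as the accepted name. A short sketch of the round trip, using a made-up equivalents dict in the same shape (`taxon -> [[synonyms], status]`):

```python
def equivalents_to_subs(equivalents):
    """Only yellow entries are auto-corrected; red and green are left alone."""
    out = ""
    for taxon in sorted(equivalents):
        if equivalents[taxon][1] == 'yellow':
            # the first name in the synonym list is always the accepted one
            out += taxon + " = " + equivalents[taxon][0][0] + "\n"
    return out

equivalents = {
    'Felis_concolor': [['Puma_concolor', 'Felis_concolor'], 'yellow'],
    'Puma_concolor': [['Puma_concolor'], 'green'],
    'Madeupus_taxon': [['Madeupus_taxon'], 'red'],
}
subs = equivalents_to_subs(equivalents)
```

The resulting string is exactly the subs-file format that `parse_subs_file` reads back in during `process()`.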
3315 if __name__ == "__main__":
3316 main()
3317
3318=== modified file 'stk/stk_exceptions.py'
3319--- stk/stk_exceptions.py 2013-10-22 08:26:54 +0000
3320+++ stk/stk_exceptions.py 2017-01-12 09:27:31 +0000
3321@@ -134,4 +134,12 @@
3322 def __init__(self, msg):
3323 self.msg = msg
3324
3325+class NoneCompleteTaxonomy(Error):
3326+ """Exception raised when a taxonomy is not complete for these data
3327+ Attributes:
3328+ msg -- explanation of error
3329+ """
3330+
3331+ def __init__(self, msg):
3332+ self.msg = msg
3333
3334
3335=== modified file 'stk/supertree_toolkit.py'
3336--- stk/supertree_toolkit.py 2017-01-11 15:16:21 +0000
3337+++ stk/supertree_toolkit.py 2017-01-12 09:27:31 +0000
3338@@ -44,15 +44,49 @@
3339 import unicodedata
3340 from stk_internals import *
3341 from copy import deepcopy
3342+import Queue
3343+import threading
3344+import urllib2
3345+from urllib import quote_plus
3346+import simplejson as json
3347+import time
3348 import types
3349
3350 #plt.ion()
3351
3352+sys.setrecursionlimit(50000)
3353 # GLOBAL VARIABLES
3354 IDENTICAL = 0
3355 SUBSET = 1
3356 PLATFORM = sys.platform
3357-taxonomy_levels = ['species','genus','family','superfamily','infraorder','suborder','order','superorder','subclass','class','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3358+#Logging
3359+import logging
3360+logging.basicConfig(filename='supertreetoolkit.log', level=logging.DEBUG, format='%(asctime)s %(levelname)s:%(message)s', datefmt='%m/%d/%Y %I:%M:%S %p')
3361+
3362+# taxonomy levels
3363+# What we get from EOL
3364+current_taxonomy_levels = ['species','genus','family','order','class','phylum','kingdom']
3365+# And the extra ones from ITIS
3366+extra_taxonomy_levels = ['superfamily','infraorder','suborder','superorder','subclass','subphylum','superphylum','infrakingdom','subkingdom']
3367+# all of them in order
3368+taxonomy_levels = ['species','subgenus','genus','tribe','subfamily','family','superfamily','subsection','section','parvorder','infraorder','suborder','order','superorder','subclass','class','superclass','subphylum','phylum','superphylum','infrakingdom','subkingdom','kingdom']
3369+
3370+SPECIES = taxonomy_levels[0]
3371+GENUS = taxonomy_levels[2]
3372+FAMILY = taxonomy_levels[5]
3373+SUPERFAMILY = taxonomy_levels[6]
3374+INFRAORDER = taxonomy_levels[10]
3375+SUBORDER = taxonomy_levels[11]
3376+ORDER = taxonomy_levels[12]
3377+SUPERORDER = taxonomy_levels[13]
3378+SUBCLASS = taxonomy_levels[14]
3379+CLASS = taxonomy_levels[15]
3380+SUBPHYLUM = taxonomy_levels[17]
3381+PHYLUM = taxonomy_levels[18]
3382+SUPERPHYLUM = taxonomy_levels[19]
3383+INFRAKINGDOM = taxonomy_levels[20]
3384+SUBKINGDOM = taxonomy_levels[21]
3385+KINGDOM = taxonomy_levels[22]
3386
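The block of positional constants above is fragile: every hard-coded index has to track the growing `taxonomy_levels` list by hand, and inserting a new rank shifts every constant below it. A minimal sketch of a name-based lookup instead (shortened list, illustrative names only):

```python
# Shortened stand-in for the module's taxonomy_levels
taxonomy_levels = ['species', 'subgenus', 'genus', 'tribe', 'subfamily', 'family']

# Map each rank name to its position once; use this for any ordering comparison.
RANK = {name: i for i, name in enumerate(taxonomy_levels)}

GENUS = taxonomy_levels[RANK['genus']]  # always 'genus', wherever it sits
```

`RANK` also makes "is rank A above rank B?" a simple integer comparison, which is what the auto-subs logic ultimately needs.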
3387 # supertree_toolkit is the backend for the STK. Loaded by both the GUI and
3388 # CLI, this contains all the functions to actually *do* something
3389@@ -60,6 +94,17 @@
3390 # All functions take XML and a list of other arguments, process the data and return
3391 # it back to the user interface handler to save it somewhere
3392
3393+
3394+def get_project_name(XML):
3395+ """
3396+ Get the name of the dataset currently being worked on
3397+ """
3398+
3399+ xml_root = _parse_xml(XML)
3400+
3401+ return xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3402+
3403+
3404 def create_name(authors, year, append=''):
3405 """
3406 Construct a sensible from a list of authors and a year for a
3407@@ -161,6 +206,22 @@
3408
3409 return names
3410
3411+def get_all_tree_names(XML):
3412+ """ From a full XML-PHYML string, extract all tree names.
3413+ """
3414+
3415+ xml_root = _parse_xml(XML)
3416+ find = etree.XPath("//source")
3417+ sources = find(xml_root)
3418+ names = []
3419+ for s in sources:
3420+ for st in s.xpath("source_tree"):
3421+ if 'name' in st.attrib and not st.attrib['name'] == "":
3422+ names.append(st.attrib['name'])
3423+
3424+ return names
3425+
3426+
3427 def set_unique_names(XML):
3428 """ Ensures all sources have unique names.
3429 """
3430@@ -249,9 +310,17 @@
3431 if (ele.tag == "source"):
3432 sources.append(ele)
3433
3434+ if overwrite:
3435+ # remove all the names first
3436+ for s in sources:
3437+ for st in s.xpath("source_tree"):
3438+ if 'name' in st.attrib:
3439+ del st.attrib['name']
3440+
3441+
3442 for s in sources:
3443 for st in s.xpath("source_tree"):
3444- if overwrite or not 'name' in st.attrib:
3445+ if not 'name' in st.attrib:
3446 tree_name = create_tree_name(XML,st)
3447 st.attrib['name'] = tree_name
3448
3449@@ -339,7 +408,7 @@
3450 taxa = etree.SubElement(s_tree,"taxa_data")
3451 taxa.tail="\n "
3452 # Note: we do not add all elements as otherwise they get set to some option
3453- # rather than remaining blank (and hence blue int he interface)
3454+ # rather than remaining blank (and hence blue in the interface)
3455
3456 # append our new source to the main tree
3457 # if sources has no valid source, overwrite,
3458@@ -877,7 +946,7 @@
3459 # Need to add checks on the file. Problems include:
3460 # TNT: outputs Phyllip format or something - basically a Newick
3461 # string without commas, so add 'em back in
3462- m = re.search(r'proc-;', content)
3463+ m = re.search(r'proc.;', content)
3464 if (m != None):
3465 # TNT output tree
3466 # Done on a Mac? Replace ^M with a newline
3467@@ -1402,6 +1471,36 @@
3468
3469 return _amalgamate_trees(trees,format,anonymous)
3470
3471+def get_taxa_from_tree_for_taxonomy(tree, pretty=False, ignoreErrors=False):
3472+ """Returns a list of all taxa available for the tree passed as argument.
3473+ :param tree: string with the data for the tree in Newick format.
3474+ :type tree: string
3475+ :param pretty: defines if '_' in taxa names should be replaced with spaces.
3476+ :type pretty: boolean
3477+ :param ignoreErrors: should execution continue on error?
3478+ :type ignoreErrors: boolean
3479+ :returns: list of strings with the taxa names, sorted alphabetically
3480+ :rtype: list
3481+ """
3482+ taxa_list = []
3483+
3484+ try:
3485+ taxa_list.extend(_getTaxaFromNewick(tree))
3486+ except TreeParseError as detail:
3487+ if (ignoreErrors):
3488+ logging.warning(detail.msg)
3489+ pass
3490+ else:
3491+ raise TreeParseError( detail.msg )
3492+
3493+ # now uniquify the list of taxa
3494+ taxa_list = _uniquify(taxa_list)
3495+ taxa_list.sort()
3496+
3497+ if (pretty):
3498+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3499+
3500+ return taxa_list
3501
3502 def get_all_taxa(XML, pretty=False, ignoreErrors=False):
3503 """ Produce a taxa list by scanning all trees within
3504@@ -1422,21 +1521,17 @@
3505 taxa_list.extend(_getTaxaFromNewick(t))
3506 except TreeParseError as detail:
3507 if (ignoreErrors):
3508+ logging.warning(detail.msg)
3509 pass
3510 else:
3511 raise TreeParseError( detail.msg )
3512
3513-
3514-
3515 # now uniquify the list of taxa
3516 taxa_list = _uniquify(taxa_list)
3517 taxa_list.sort()
3518
3519- if (pretty):
3520- unpretty_tl = taxa_list
3521- taxa_list = []
3522- for t in unpretty_tl:
3523- taxa_list.append(t.replace('_',' '))
3524+ if (pretty): #Remove underscores from names
3525+ taxa_list = [x.replace('_', ' ') for x in taxa_list]
3526
3527 return taxa_list
3528
3529@@ -1508,7 +1603,7 @@
3530 return outgroups
3531
3532
3533-def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False):
3534+def create_matrix(XML,format="hennig",quote=False,taxonomy=None,outgroups=False,ignoreWarnings=False, verbose=False):
3535 """ From all trees in the XML, create a matrix
3536 """
3537
3538@@ -1553,7 +1648,7 @@
3539 taxa.sort()
3540 taxa.insert(0,"MRP_Outgroup")
3541
3542- return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights)
3543+ return _create_matrix(trees, taxa, format=format, quote=quote, weights=weights,verbose=verbose)
3544
3545
3546 def create_matrix_from_trees(trees,format="hennig"):
3547@@ -1925,7 +2020,7 @@
3548 _check_data(XML)
3549
3550 xml_root = _parse_xml(XML)
3551- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
3552+ proj_name = get_project_name(XML)
3553
3554 output_string = "======================\n"
3555 output_string += " Data summary of: " + proj_name + "\n"
3556@@ -1989,6 +2084,188 @@
3557
3558 return output_string
3559
3560+def taxonomic_checker_list(name_list,existing_data=None,verbose=False):
3561+ """ For each name in the database generate a database of the original name,
3562+ possible synonyms and if the taxon is not known, signal that. We do this by
3563+ using the EoL API to grab synonyms of each taxon. """
3564+
3565+ import urllib2
3566+ from urllib import quote_plus
3567+ import simplejson as json
3568+
3569+ if existing_data == None:
3570+ equivalents = {}
3571+ else:
3572+ equivalents = existing_data
3573+
3574+ # for each taxon, check the name on EoL - what if it's a synonym? Does EoL still return a result?
3575+ # if not, is there another API function to do this?
3576+ # search for the taxon and grab the name - if you search for a recognised synonym on EoL then
3577+ # you get the original ('correct') name - shorten this to two words and you're done.
3578+ for t in name_list:
3579+ if t in equivalents:
3580+ continue
3581+ taxon = t.replace("_"," ")
3582+ if (verbose):
3583+ print "Looking up ", taxon
3584+ # get the data from EOL on taxon
3585+ taxonq = quote_plus(taxon)
3586+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3587+ req = urllib2.Request(URL)
3588+ opener = urllib2.build_opener()
3589+ f = opener.open(req)
3590+ data = json.load(f)
3591+ # check if there's some data
3592+ if len(data['results']) == 0:
3593+ equivalents[t] = [[t],'red']
3594+ continue
3595+ amber = False
3596+ if len(data['results']) > 1:
3597+ # this is not great - we have multiple hits for this taxon - needs the user to go back and warn about this
3598+ # for automatic processing we'll just take the first one though
3599+ # colour is amber in this case
3600+ amber = True
3601+ ID = str(data['results'][0]['id']) # take first hit
3602+ URL = "http://eol.org/api/pages/1.0/"+ID+".json?images=0&videos=0&sounds=0&maps=0&text=0&iucn=false&subjects=overview&licenses=all&details=true&common_names=true&synonyms=true&references=true&vetted=0"
3603+ req = urllib2.Request(URL)
3604+ opener = urllib2.build_opener()
3605+
3606+ try:
3607+ f = opener.open(req)
3608+ except urllib2.HTTPError:
3609+ equivalents[t] = [[t],'red']
3610+ continue
3611+ data = json.load(f)
3612+ if len(data['scientificName']) == 0:
3613+ # not found a scientific name, so set as red
3614+ equivalents[t] = [[t],'red']
3615+ continue
3616+ correct_name = data['scientificName'].encode("ascii","ignore")
3617+ # we only want the first two bits of the name, not the original author and year if any
3618+ temp_name = correct_name.split(' ')
3619+ if (len(temp_name) > 2):
3620+ correct_name = ' '.join(temp_name[0:2])
3621+ correct_name = correct_name.replace(' ','_')
3622+
3623+ # build up the output dictionary - original name is key, synonyms/missing is value
3624+ if (correct_name == t):
3625+ # if the original matches the 'correct', then it's green
3626+ equivalents[t] = [[t], 'green']
3627+ else:
3628+ # if we managed to get something anyway, then it's yellow and create a list of possible synonyms with the
3629+ # 'correct' taxon at the top
3630+ eol_synonyms = data['synonyms']
3631+ synonyms = []
3632+ for s in eol_synonyms:
3633+ ts = s['synonym'].encode("ascii","ignore")
3634+ temp_syn = ts.split(' ')
3635+ if (len(temp_syn) > 2):
3636+ temp_syn = ' '.join(temp_syn[0:2])
3637+ ts = temp_syn
3638+ if (s['relationship'] == "synonym"):
3639+ ts = ts.replace(" ","_")
3640+ synonyms.append(ts)
3641+ synonyms = _uniquify(synonyms)
3642+ # we need to put the correct name at the top of the list now
3643+ if (correct_name in synonyms):
3644+ synonyms.insert(0, synonyms.pop(synonyms.index(correct_name)))
3645+ elif len(synonyms) == 0:
3646+ synonyms.append(correct_name)
3647+ else:
3648+ synonyms.insert(0,correct_name)
3649+
3650+ if (amber):
3651+ equivalents[t] = [synonyms,'amber']
3652+ else:
3653+ equivalents[t] = [synonyms,'yellow']
3654+ # if our search was empty, then it's red - see above
3655+
3656+ # up to the calling function to do something sensible with this
3657+ # we build a dictionary of names and then a list of synonyms or the original name, then a tag if it's green, yellow or red.
3658+ # Amber means we found synonyms and multiple hits. The user needs to sort these out!
3659+
3660+ return equivalents
3661+
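The `equivalents` structure handed back to the caller can be illustrated with a short sketch (hypothetical taxon names; written for Python 3 for brevity, whereas the toolkit itself targets Python 2):

```python
# A minimal illustration of the 'equivalents' structure returned above:
# key = original name, value = [list of synonyms (preferred name first), status].
equivalents = {
    "Ardea_goliath": [["Ardea_goliath"], "green"],                        # exact match
    "Gallus_gallus": [["Gallus_gallus", "Gallus_domesticus"], "yellow"],  # synonyms found
    "Madeupus_taxon": [["Madeupus_taxon"], "red"],                        # not found at all
}

# Typical post-processing: pull out the names that need user attention.
needs_attention = sorted(t for t, (syns, status) in equivalents.items()
                         if status in ("yellow", "amber", "red"))
print(needs_attention)
```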
3662+def taxonomic_checker_tree(tree_file,existing_data=None,verbose=False):
3663+ """ For each name in the database generate a database of the original name,
3664+ possible synonyms and if the taxon is not known, signal that. We do this by
3665+ using the EoL API to grab synonyms of each taxon. """
3666+
3667+ tree = import_tree(tree_file)
3668+ p4tree = _parse_tree(tree)
3669+ taxa = p4tree.getAllLeafNames(p4tree.root)
3670+ if existing_data is None:
3671+ equivalents = {}
3672+ else:
3673+ equivalents = existing_data
3674+
3675+ equivalents = taxonomic_checker_list(taxa,equivalents,verbose)
3676+ return equivalents
3677+
3678+def taxonomic_checker(XML,existing_data=None,verbose=False):
3679+ """ For each name in the database generate a database of the original name,
3680+ possible synonyms and if the taxon is not known, signal that. We do this by
3681+ using the EoL API to grab synonyms of each taxon. """
3682+
3683+ # grab all taxa
3684+ taxa = get_all_taxa(XML)
3685+
3686+ if existing_data is None:
3687+ equivalents = {}
3688+ else:
3689+ equivalents = existing_data
3690+
3691+ equivalents = taxonomic_checker_list(taxa,equivalents,verbose)
3692+ return equivalents
3693+
3694+
3695+def load_equivalents(equiv_csv):
3696+ """Load equivalents data from a csv and convert to a equivalents Dict.
3697+ Structure is key, with a list that is array of synonyms, followed by status ('green',
3698+ 'yellow' or 'red').
3699+
3700+ """
3701+
3702+ import csv
3703+
3704+ equivalents = {}
3705+
3706+ with open(equiv_csv, 'rU') as csvfile:
3707+ equiv_reader = csv.reader(csvfile, delimiter=',')
3708+ equiv_reader.next() # skip header
3709+ for row in equiv_reader:
3710+ equivalents[row[0]] = [row[1].split(';'),row[2]]
3712+
3713+ return equivalents
3714+
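The CSV layout `load_equivalents` assumes (name, semicolon-separated synonyms, status, after a header row) can be sketched with the standard-library `csv` module; Python 3 shown here (`next(reader)` rather than the Python 2 `reader.next()` used in the diff):

```python
import csv
import io

# The column layout assumed by load_equivalents: header row, then
# name, semicolon-separated synonyms, status.
csv_text = "name,synonyms,status\nGallus_gallus,Gallus_gallus;Gallus_domesticus,yellow\n"

equivalents = {}
reader = csv.reader(io.StringIO(csv_text), delimiter=',')
next(reader)  # skip the header row
for row in reader:
    equivalents[row[0]] = [row[1].split(';'), row[2]]

print(equivalents["Gallus_gallus"])
```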
3715+def save_taxonomy(taxonomy, output_file):
3716+ """Save a taxonomy dictionary to CSV: one row per OTU, one column per taxonomy level, plus the provider."""
3717+
3717+ import csv
3718+
3719+ with open(output_file, 'w') as f:
3720+ writer = csv.writer(f)
3721+ row = ['OTU']
3722+ row.extend(taxonomy_levels)
3723+ row.append('Provider')
3724+ writer.writerow(row)
3725+ for t in taxonomy:
3726+ species = t
3727+ row = []
3728+ row.append(t.encode('utf-8'))
3729+ for l in taxonomy_levels:
3730+ try:
3731+ g = taxonomy[t][l]
3732+ except KeyError:
3733+ g = '-'
3734+ row.append(g.encode('utf-8'))
3735+ try:
3736+ provider = taxonomy[t]['provider']
3737+ except KeyError:
3738+ provider = "-"
3739+ row.append(provider)
3740+
3741+ writer.writerow(row)
3742
3743
3744 def load_taxonomy(taxonomy_csv):
3745@@ -2000,20 +2277,443 @@
3746
3747 with open(taxonomy_csv, 'rU') as csvfile:
3748 tax_reader = csv.reader(csvfile, delimiter=',')
3749- tax_reader.next()
3750- for row in tax_reader:
3751- current_taxonomy = {}
3752- i = 1
3753- for t in taxonomy_levels:
3754- if not row[i] == '-':
3755- current_taxonomy[t] = row[i]
3756- i = i+ 1
3757-
3758- current_taxonomy['provider'] = row[17] # data source
3759- taxonomy[row[0]] = current_taxonomy
3760-
3761- return taxonomy
3762-
3763+ try:
3764+ j = 0
3765+ for row in tax_reader:
3766+ if j == 0:
3767+ tax_levels = row[1:-1]
3768+ j += 1
3769+ continue
3770+ i = 1
3771+ current_taxonomy = {}
3772+ for t in tax_levels:
3773+ if not row[i] == '-':
3774+ current_taxonomy[t] = row[i]
3775+ i = i + 1
3776+ current_taxonomy['provider'] = row[-1] # data source
3777+ taxonomy[row[0].replace(" ","_")] = current_taxonomy
3778+ j += 1
3779+ except (IndexError, KeyError):
3780+ # skip malformed rows rather than aborting the whole load
3781+ pass
3781+
3782+ return taxonomy
3783+
3784+
3785+class TaxonomyFetcher(threading.Thread):
3786+ """ Class to provide the taxonomy fetching functionality as a threaded function to be used individually or working with a pool.
3787+ """
3788+
3789+ def __init__(self, taxonomy, lock, queue, id=0, pref_db=None, verbose=False, ignoreWarnings=False):
3790+ """ Constructor for the threaded model.
3791+ :param taxonomy: previous taxonomy (if available) or an empty dictionary to store the results.
3792+ :type taxonomy: dictionary
3793+ :param lock: lock to keep the taxonomy threadsafe.
3794+ :type lock: Lock
3795+ :param queue: queue where the taxa are kept to be processed.
3796+ :type queue: Queue of strings
3797+ :param id: id for the thread to use if messages need to be printed.
3798+ :type id: int
3799+ :param pref_db: Preferred database; when set, its taxonomy is used if it appears in the results.
3800+ :type pref_db: string
3801+ :param verbose: Show verbose messages during execution, will also define level of logging. True will set logging level to INFO.
3802+ :type verbose: boolean
3803+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors will be logged with ERROR level on the logging output.
3804+ :type ignoreWarnings: boolean
3805+ """
3806+
3807+ threading.Thread.__init__(self)
3808+ self.taxonomy = taxonomy
3809+ self.lock = lock
3810+ self.queue = queue
3811+ self.id = id
3812+ self.verbose = verbose
3813+ self.pref_db = pref_db
3814+ self.ignoreWarnings = ignoreWarnings
3815+
3816+ def run(self):
3817+ """ Gets and processes a taxon from the queue to get its taxonomy."""
3818+ while True :
3819+ if self.verbose :
3820+ logging.getLogger().setLevel(logging.INFO)
3821+ #get taxon from queue
3822+ taxon = self.queue.get()
3823+
3824+ logging.debug("Starting {} with thread #{} remaining ~{}".format(taxon,str(self.id),str(self.queue.qsize())))
3825+
3826+ #Lock access to the taxonomy
3827+ self.lock.acquire()
3828+ if not taxon in self.taxonomy: # is a new taxon, not previously in the taxonomy
3829+ #Release access to the taxonomy
3830+ self.lock.release()
3831+ if (self.verbose):
3832+ print "Looking up ", taxon
3833+ logging.info("Looking up taxon: {}".format(str(taxon)))
3834+ try:
3835+ # get the data from EOL on taxon
3836+ taxonq = quote_plus(taxon)
3837+ URL = "http://eol.org/api/search/1.0.json?q="+taxonq
3838+ req = urllib2.Request(URL)
3839+ opener = urllib2.build_opener()
3840+ f = opener.open(req)
3841+ data = json.load(f)
3842+ # check if there's some data
3843+ if len(data['results']) == 0:
3844+ # try PBDB as it might be a fossil
3845+ URL = "http://paleobiodb.org/data1.1/taxa/single.json?name="+taxonq+"&show=phylo&vocab=pbdb"
3846+ req = urllib2.Request(URL)
3847+ opener = urllib2.build_opener()
3848+ f = opener.open(req)
3849+ datapbdb = json.load(f)
3850+ if (len(datapbdb['records']) == 0):
3851+ # no idea!
3852+ with self.lock:
3853+ self.taxonomy[taxon] = {}
3854+ self.queue.task_done()
3855+ continue
3856+ # otherwise, let's fill in info here - only if extinct!
3857+ if datapbdb['records'][0]['is_extant'] == 0:
3858+ this_taxonomy = {}
3859+ this_taxonomy['provider'] = 'PBDB'
3860+ for level in taxonomy_levels:
3861+ try:
3862+ if datapbdb.has_key('records'):
3863+ pbdb_lev = datapbdb['records'][0][level]
3864+ temp_lev = pbdb_lev.split(" ")
3865+ # they might have the author on the end, so strip it off
3866+ if (level == 'species'):
3867+ this_taxonomy[level] = ' '.join(temp_lev[0:2])
3868+ else:
3869+ this_taxonomy[level] = temp_lev[0]
3870+ except KeyError as e:
3871+ logging.exception("Key not found records")
3872+ continue
3873+ # add the taxon at right level too
3874+ try:
3875+ if datapbdb.has_key('records'):
3876+ current_level = datapbdb['records'][0]['rank']
3877+ this_taxonomy[current_level] = datapbdb['records'][0]['taxon_name']
3878+ except KeyError as e:
3879+ self.queue.task_done()
3880+ logging.exception("Key not found records")
3881+ continue
3882+ with self.lock:
3883+ self.taxonomy[taxon] = this_taxonomy
3884+ self.queue.task_done()
3885+ continue
3886+ else:
3887+ # extant, but not in EoL - leave the user to sort this one out
3888+ with self.lock:
3889+ self.taxonomy[taxon] = {}
3890+ self.queue.task_done()
3891+ continue
3892+
3893+
3894+ ID = str(data['results'][0]['id']) # take first hit
3895+ # Now look for taxonomies
3896+ URL = "http://eol.org/api/pages/1.0/"+ID+".json"
3897+ req = urllib2.Request(URL)
3898+ opener = urllib2.build_opener()
3899+ f = opener.open(req)
3900+ data = json.load(f)
3901+ if len(data['taxonConcepts']) == 0:
3902+ with self.lock:
3903+ self.taxonomy[taxon] = {}
3904+ self.queue.task_done()
3905+ continue
3906+ TID = str(data['taxonConcepts'][0]['identifier']) # take first hit
3907+ currentdb = str(data['taxonConcepts'][0]['nameAccordingTo'])
3908+ # loop through and get preferred one if specified
3909+ # now get taxonomy
3910+ if (not self.pref_db is None):
3911+ for db in data['taxonConcepts']:
3912+ currentdb = db['nameAccordingTo'].lower()
3913+ if (self.pref_db.lower() in currentdb):
3914+ TID = str(db['identifier'])
3915+ break
3916+ URL="http://eol.org/api/hierarchy_entries/1.0/"+TID+".json"
3917+ req = urllib2.Request(URL)
3918+ opener = urllib2.build_opener()
3919+ f = opener.open(req)
3920+ data = json.load(f)
3921+ this_taxonomy = {}
3922+ this_taxonomy['provider'] = currentdb
3923+ for a in data['ancestors']:
3924+ try:
3925+ if a.has_key('taxonRank') :
3926+ temp_level = a['taxonRank'].encode("ascii","ignore")
3927+ if (temp_level in taxonomy_levels):
3928+ # note the dump into ASCII
3929+ temp_name = a['scientificName'].encode("ascii","ignore")
3930+ temp_name = temp_name.split(" ")
3931+ if (temp_level == 'species'):
3932+ this_taxonomy[temp_level] = ' '.join(temp_name[0:2])
3933+
3934+ else:
3935+ this_taxonomy[temp_level] = temp_name[0]
3936+ except KeyError as e:
3937+ logging.exception("Key not found: taxonRank")
3938+ continue
3939+ try:
3940+ # add taxonomy in to the taxonomy!
3941+ # some issues here, so let's make sure it's OK
3942+ temp_name = taxon.split(" ")
3943+ if data.has_key('taxonRank') :
3944+ if not data['taxonRank'].lower() == 'species':
3945+ this_taxonomy[data['taxonRank'].lower()] = temp_name[0]
3946+ else:
3947+ this_taxonomy[data['taxonRank'].lower()] = ' '.join(temp_name[0:2])
3948+ except KeyError as e:
3949+ self.queue.task_done()
3950+ logging.exception("Key not found: taxonRank")
3951+ continue
3952+ with self.lock:
3953+ #Send result to dictionary
3954+ self.taxonomy[taxon] = this_taxonomy
3955+ except (urllib2.HTTPError, urllib2.URLError):
3956+ print("Network error when processing {}".format(taxon))
3957+ logging.info("Network error when processing {}".format(taxon))
3958+ self.queue.task_done()
3959+ continue
3965+ else :
3966+ #Nothing to do release the lock on taxonomy
3967+ self.lock.release()
3968+ #Mark task as done
3969+ self.queue.task_done()
3970+
3971+def create_taxonomy_from_taxa(taxa, taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False, threadNumber=5):
3972+ """Uses the taxa provided to generate a taxonomy for all of them.
3973+ :param taxa: list of the taxa.
3974+ :type taxa : list
3975+ :param taxonomy: previous taxonomy available (if available) or an empty
3976+ dictionary to store the results. If None will be init to an empty dictionary
3977+ :type taxonomy: dictionary
3978+ :param pref_db: Preferred database; passed through to the taxonomy fetchers.
3979+ :type pref_db: string
3980+ :param verbose: Show verbose messages during execution, will also define
3981+ level of logging. True will set logging level to INFO.
3982+ :type verbose: boolean
3983+ :param ignoreWarnings: Ignore warnings and errors during execution? Errors
3984+ will be logged with ERROR level on the logging output.
3985+ :type ignoreWarnings: boolean
3986+ :param threadNumber: Maximum number of threads to use for taxonomy processing.
3987+ :type threadNumber: int
3988+ :returns: dictionary with resulting taxonomy for each taxon (keys)
3989+ :rtype: dictionary
3990+ """
3991+ if verbose :
3992+ logging.getLogger().setLevel(logging.INFO)
3993+ if taxonomy is None:
3994+ taxonomy = {}
3995+
3996+ lock = threading.Lock()
3997+ queue = Queue.Queue()
3998+
3999+ #Starting a few threads as daemons checking the queue
4000+ for i in range(threadNumber) :
4001+ t = TaxonomyFetcher(taxonomy, lock, queue, i, pref_db, verbose, ignoreWarnings)
4002+ t.setDaemon(True)
4003+ t.start()
4004+
4005+ # Populate the queue with the taxa.
4006+ for taxon in taxa :
4007+ queue.put(taxon)
4008+
4009+ #Wait till everyone finishes
4010+ queue.join()
4011+ logging.getLogger().setLevel(logging.WARNING)
4012+
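The worker-pool pattern used by `create_taxonomy_from_taxa` (daemon threads draining a shared queue, writing into a lock-guarded dict, `queue.join()` to wait) can be stripped down to a sketch. This uses the Python 3 module names (`queue`, `Thread(daemon=True)`); the toolkit itself uses the Python 2 `Queue` module and `setDaemon`. The lookup itself is replaced by a stand-in:

```python
import queue
import threading

# Each worker pulls taxa from a shared queue and writes results into a dict
# guarded by a lock, mirroring the TaxonomyFetcher structure above.
def worker(results, lock, q):
    while True:
        taxon = q.get()
        looked_up = {"genus": taxon.split("_")[0]}  # stand-in for the EoL/PBDB lookup
        with lock:
            results[taxon] = looked_up
        q.task_done()

results, lock, q = {}, threading.Lock(), queue.Queue()
for _ in range(3):
    threading.Thread(target=worker, args=(results, lock, q), daemon=True).start()

for taxon in ["Gallus_gallus", "Ardea_goliath"]:
    q.put(taxon)
q.join()  # blocks until every task_done() has been called
print(sorted(results))
```

Note the design choice this mirrors: daemon threads never exit their loop; the program relies on `q.join()` for completion and thread teardown at interpreter exit.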
4013+def create_taxonomy_from_tree(tree, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4014+ """ Generates the taxonomy from a tree. Uses a similar method to the XML version but works directly on a string with the tree.
4015+ :param tree: the tree, as a newick string.
4016+ :type tree: string
4017+ :param existing_taxonomy: previously built taxonomy (if any).
4018+ :type existing_taxonomy: dictionary
4019+ :param pref_db: Preferred database; passed through to the taxonomy fetchers.
4020+ :type pref_db: string
4021+ :param verbose: Flag for verbosity.
4022+ :type verbose: boolean
4023+ :param ignoreWarnings: Flag for exception processing.
4024+ :type ignoreWarnings: boolean
4025+ :returns: the modified taxonomy
4026+ :rtype: dictionary
4027+ """
4028+ starttime = time.time()
4029+
4030+ if(existing_taxonomy is None) :
4031+ taxonomy = {}
4032+ else :
4033+ taxonomy = existing_taxonomy
4034+
4035+ taxa = get_taxa_from_tree_for_taxonomy(tree, pretty=True)
4036+
4037+ create_taxonomy_from_taxa(taxa, taxonomy, pref_db, verbose, ignoreWarnings)
4038+
4039+ taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4040+
4041+ return taxonomy
4042+
4043+def create_taxonomy(XML, existing_taxonomy=None, pref_db=None, verbose=False, ignoreWarnings=False):
4044+ """Generates a taxonomy of the data from EoL data. This is stored as a
4045+ dictionary of taxonomy for each taxon in the dataset. Missing data are
4046+ encoded as '' (blank string). It's up to the calling function to store this
4047+ data to file or display it."""
4048+
4049+ starttime = time.time()
4050+
4051+ if not ignoreWarnings:
4052+ _check_data(XML)
4053+
4054+ if (existing_taxonomy is None):
4055+ taxonomy = {}
4056+ else:
4057+ taxonomy = existing_taxonomy
4058+ taxa = get_all_taxa(XML, pretty=True)
4059+ create_taxonomy_from_taxa(taxa, taxonomy, pref_db, verbose, ignoreWarnings)
4060+ #taxonomy = create_extended_taxonomy(taxonomy, starttime, verbose, ignoreWarnings)
4061+ return taxonomy
4062+
4063+def create_extended_taxonomy(taxonomy, starttime, verbose=False, ignoreWarnings=False):
4064+ """Bring in extra taxonomy terms from other databases; shared method for completing the taxonomy
4065+ both for trees coming from XML or directly from tree strings.
4066+ :param taxonomy: Dictionary with the relationship for taxa and taxonomy terms.
4067+ :type taxonomy: dictionary
4068+ :param starttime: time to keep track of processing time.
4069+ :type starttime: long
4070+ :param verbose: Flag for verbosity.
4071+ :type verbose: boolean
4072+ :param ignoreWarnings: Flag for exception processing.
4073+ :type ignoreWarnings: boolean
4074+ :returns: the modified taxonomy
4075+ :rtype: dictionary
4076+ """
4077+
4078+ if (verbose):
4079+ logging.info('Done basic taxonomy, getting more info from ITIS')
4080+ print("Time elapsed {}".format(str(time.time() - starttime)))
4081+ print "Done basic taxonomy, getting more info from ITIS"
4082+ # fill in the rest of the taxonomy
4083+ # get all genera
4084+ genera = []
4085+ for t in taxonomy:
4086+ if GENUS in taxonomy[t]:
4087+ genera.append(taxonomy[t][GENUS])
4089+ genera = _uniquify(genera)
4090+ # We then use ITIS to fill in missing info based on the genera only - that saves us a species level search
4091+ # and we can fill in most of the EoL missing data
4092+ for g in genera:
4093+ if (verbose):
4094+ print "Looking up ", g
4095+ logging.info("Looking up {}".format(str(g)))
4096+ try:
4097+ URL="http://www.itis.gov/ITISWebService/jsonservice/searchByScientificName?srchKey="+quote_plus(g.strip())
4098+ except:
4099+ continue
4100+ req = urllib2.Request(URL)
4101+ opener = urllib2.build_opener()
4102+ try:
4103+ f = opener.open(req)
4104+ except urllib2.HTTPError:
4105+ continue
4106+ string = unicode(f.read(),"ISO-8859-1")
4107+ data = json.loads(string)
4108+ if data['scientificNames'][0] is None:
4109+ continue
4110+ tsn = data["scientificNames"][0]["tsn"]
4111+ URL="http://www.itis.gov/ITISWebService/jsonservice/getFullHierarchyFromTSN?tsn="+str(tsn)
4112+ req = urllib2.Request(URL)
4113+ opener = urllib2.build_opener()
4114+ f = opener.open(req)
4115+ try:
4116+ string = unicode(f.read(),"ISO-8859-1")
4117+ except:
4118+ continue
4119+ data = json.loads(string)
4120+ this_taxonomy = {}
4121+ for level in data['hierarchyList']:
4122+ if not level['rankName'].lower() in current_taxonomy_levels:
4123+ # note the dump into ASCII
4124+ if level['rankName'].lower() == 'species':
4125+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = ' '.join(level['taxonName'].split(' ')[0:2]).encode("ascii","ignore")
4126+ else:
4127+ this_taxonomy[level['rankName'].lower().encode("ascii","ignore")] = level['taxonName'].encode("ascii","ignore")
4128+
4129+ for t in taxonomy:
4130+ if GENUS in taxonomy[t]:
4131+ if taxonomy[t][GENUS] == g:
4132+ taxonomy[t].update(this_taxonomy)
4134+
4135+ return taxonomy
4136+
4137+def generate_species_level_data(XML, taxonomy, ignoreWarnings=False, verbose=False):
4138+ """ Based on a taxonomy data set, amend the data to be at species level as
4139+ far as possible. This function creates an internal 'subs file' and calls
4140+ the standard substitution functions. The internal subs are generated by
4141+ looping over the taxa and if not at species-level, working out which level
4142+ they are at and then adding species already in the dataset to replace it
4143+ via a polytomy. This has to be done in one step to avoid adding spurious
4144+ structure to the phylogenies """
4145+
4146+ if not ignoreWarnings:
4147+ _check_data(XML)
4148+
4149+ # if taxonomic checker not done, warn
4150+ if (not taxonomy):
4151+ raise NoneCompleteTaxonomy("Taxonomy is empty. Create a taxonomy first. You'll probably need to hand-edit the file to complete it.")
4153+
4154+ # if missing data in taxonomy, warn
4155+ taxa = get_all_taxa(XML)
4156+ keys = taxonomy.keys()
4157+ if (not ignoreWarnings):
4158+ for t in taxa:
4159+ t = t.replace("_"," ")
4160+ if not t in keys:
4161+ # This idea here is that the caller will catch this, then re-run with ignoreWarnings set to True
4162+ raise NoneCompleteTaxonomy("Taxonomy is not complete. I will soldier on anyway, but this might not work as intended")
4163+
4164+ # get all taxa - see above!
4165+ # for each taxa, if not at species level
4166+ new_taxa = []
4167+ old_taxa = []
4168+ for t in taxa:
4169+ subs = []
4170+ t = t.replace("_"," ")
4171+ if (not SPECIES in taxonomy[t]): # the current taxon is not a species, but a higher-level taxon
4172+ # work out which level - should we encode this in the data to start with?
4173+ for tl in taxonomy_levels:
4174+ try:
4175+ tax_data = taxonomy[t][tl]
4176+ except KeyError:
4177+ continue
4178+ if (t == taxonomy[t][tl]):
4179+ current_level = tl
4180+ # find all species in the taxonomy that match this level
4181+ for taxon in taxa:
4182+ taxon = taxon.replace("_"," ")
4183+ if (SPECIES in taxonomy[taxon]):
4184+ try:
4185+ if taxonomy[taxon][current_level] == t: # our current taxon
4186+ subs.append(taxon.replace(" ","_"))
4187+ except KeyError:
4188+ continue
4189+
4190+ # create the sub
4191+ if len(subs) > 0:
4192+ old_taxa.append(t.replace(" ","_"))
4193+ new_taxa.append(','.join(subs))
4194+
4195+ # call the sub
4196+ new_XML = substitute_taxa(XML, old_taxa, new_taxa, verbose=verbose)
4197+ new_XML = clean_data(new_XML)
4198+
4199+ return new_XML
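The internal "subs file" built above maps each higher-level taxon onto a polytomy of species already in the dataset that fall under it. A tiny self-contained illustration of that loop (hypothetical taxa; genus-level only, whereas the real function walks all taxonomy levels):

```python
# Build old_taxa -> new_taxa substitution pairs, as generate_species_level_data does.
taxa = ["Gallus", "Gallus_gallus", "Gallus_varius"]
taxonomy = {
    "Gallus":        {"genus": "Gallus"},
    "Gallus gallus": {"genus": "Gallus", "species": "Gallus gallus"},
    "Gallus varius": {"genus": "Gallus", "species": "Gallus varius"},
}

old_taxa, new_taxa = [], []
for t in (x.replace("_", " ") for x in taxa):
    if "species" not in taxonomy[t]:  # a higher-level taxon: replace with its species
        subs = [other.replace(" ", "_") for other in taxonomy
                if "species" in taxonomy[other]
                and taxonomy[other]["genus"] == t]
        if subs:
            old_taxa.append(t.replace(" ", "_"))
            new_taxa.append(",".join(sorted(subs)))

print(old_taxa, new_taxa)
```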
4200
4201 def data_overlap(XML, overlap_amount=2, filename=None, detailed=False, show=False, verbose=False, ignoreWarnings=False):
4202 """ Calculate the amount of taxonomic overlap between source trees.
4203@@ -2024,7 +2724,7 @@
4204 If filename is None, no graphic is generated. Otherwise a simple
4205 graphic is generated showing the number of cluster. If detailed is set to
4206 true, a graphic is generated showing *all* trees. For data containing >200
4207- source tres this could be very big and take along time. More likely, you'll run
4208+ source trees this could be very big and take along time. More likely, you'll run
4209 out of memory.
4210 """
4211 import matplotlib
4212@@ -2103,6 +2803,7 @@
4213 sufficient_overlap = True
4214
4215 # The above list actually contains which components are seperate from each other
4216+ key_list = connected_components
4217
4218 if (not filename == None or show):
4219 if (verbose):
4220@@ -2266,7 +2967,9 @@
4221 prev_char = None
4222 prev_taxa = None
4223 prev_name = None
4224- non_ind = {}
4225+ subsets = []
4226+ identical = []
4227+ is_identical = False
4228 for data in data_ind:
4229 name = data[0]
4230 char = data[1]
4231@@ -2275,22 +2978,71 @@
4232 # when sorted, the longer list comes first
4233 if set(taxa).issubset(set(prev_taxa)):
4234 if (taxa == prev_taxa):
4235- non_ind[name] = [prev_name,IDENTICAL]
4236+ if (is_identical):
4237+ identical[-1].append(name)
4238+ else:
4239+ identical.append([name,prev_name])
4240+ is_identical = True
4241+
4242 else:
4243- non_ind[name] = [prev_name,SUBSET]
4244+ subsets.append([prev_name, name])
4245+ prev_name = name
4246+ is_identical = False
4247+ else:
4248+ prev_name = name
4249+ is_identical = False
4250+ else:
4251+ prev_name = name
4252+ is_identical = False
4253+
4254 prev_char = char
4255 prev_taxa = taxa
4256- prev_name = name
4257-
4258+
4259 if (make_new_xml):
4260 new_xml = XML
4261- for name in non_ind:
4262- if (non_ind[name][1] == SUBSET):
4263- new_xml = _swap_tree_in_XML(new_xml,None,name)
4264+ # deal with subsets
4265+ for s in subsets:
4266+ new_xml = _swap_tree_in_XML(new_xml,None,s[1])
4267 new_xml = clean_data(new_xml)
4268- return non_ind, new_xml
4269+ # deal with identical trees - weight them; if there are three identical trees each
4270+ # gets a weight of 1/3, i.e. weights are 1/(number of identical trees)
4271+ for i in identical:
4272+ weight = 1.0 / float(len(i))
4273+ new_xml = add_weights(new_xml, i, weight)
4274+
4275+ return identical, subsets, new_xml
4276 else:
4277- return non_ind
4278+ return identical, subsets
4279+
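The identical-tree weighting can be stated in a few lines: each group of identical trees shares a total weight of 1, so each member gets 1/(group size). Hypothetical tree names:

```python
# identical: groups of source trees found to be identical, as returned above.
identical = [["tree_a", "tree_b"], ["tree_c", "tree_d", "tree_e"]]

weights = {}
for group in identical:
    w = 1.0 / len(group)
    for name in group:
        weights[name] = w

print(weights)
```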
4280+
4281+def add_weights(XML, names, weight):
4282+ """ Add a weight to each named tree: supply a list of source tree names and a weight,
4283+ and the weight element is set on each. Returns a new XML.
4284+ """
4285+
4286+ xml_root = _parse_xml(XML)
4287+ # By getting source, we can then loop over each source_tree
4288+ find = etree.XPath("//source_tree")
4289+ sources = find(xml_root)
4290+ for s in sources:
4291+ s_name = s.attrib['name']
4292+ for n in names:
4293+ if s_name == n:
4294+ if s.xpath("tree/weight/real_value") == []:
4295+ # add weights
4296+ weights_element = etree.Element("weight")
4297+ weights_element.tail="\n"
4298+ real_value = etree.SubElement(weights_element,'real_value')
4299+ real_value.attrib['rank'] = '0'
4300+ real_value.tail = '\n'
4301+ real_value.text = str(weight)
4302+ t = s.xpath("tree")[0]
4303+ t.append(weights_element)
4304+ else:
4305+ s.xpath("tree/weight/real_value")[0].text = str(weight)
4306+
4307+ return etree.tostring(xml_root,pretty_print=True)
4308+
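The create-or-update logic in `add_weights` (add a `weight/real_value` element under `tree` if absent, otherwise overwrite its text) can be sketched with the standard-library `ElementTree` rather than lxml; the XML snippet is a minimal stand-in for the phyml structure:

```python
import xml.etree.ElementTree as etree

xml = '<sources><source_tree name="t1"><tree/></source_tree></sources>'
root = etree.fromstring(xml)

def add_weight(root, name, weight):
    # Find the named source_tree and set <tree><weight><real_value>.
    for s in root.iter('source_tree'):
        if s.attrib.get('name') != name:
            continue
        existing = s.findall('tree/weight/real_value')
        if not existing:
            w = etree.SubElement(s.find('tree'), 'weight')
            rv = etree.SubElement(w, 'real_value')
            rv.set('rank', '0')
            rv.text = str(weight)
        else:
            existing[0].text = str(weight)  # overwrite an existing weight

add_weight(root, 't1', 0.5)
print(root.find('source_tree/tree/weight/real_value').text)
```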
4309
4310 def add_historical_event(XML, event_description):
4311 """
4312@@ -2380,8 +3132,15 @@
4313 # check trees are informative
4314 XML = _check_informative_trees(XML,delete=True)
4315
4316+
4317 # check sources
4318 XML = _check_sources(XML,delete=True)
4319+ XML = all_sourcenames(XML)
4320+
4321+ # fix tree names
4322+ XML = set_unique_names(XML)
4323+ XML = set_all_tree_names(XML,overwrite=True)
4324+
4325
4326 # unpermutable trees
4327 permutable_trees = _find_trees_for_permuting(XML)
4328@@ -2659,7 +3418,7 @@
4329 s.getparent().remove(s)
4330
4331 # edit name (append _subset)
4332- proj_name = xml_root.xpath('/phylo_storage/project_name/string_value')[0].text
4333+ proj_name = get_project_name(XML)
4334 proj_name += "_subset"
4335 xml_root.xpath('/phylo_storage/project_name/string_value')[0].text = proj_name
4336
4337@@ -2928,6 +3687,37 @@
4338
4339 return mrca
4340
4341+
4342+def tree_from_taxonomy(taxonomy, end_level, end_rank):
4343+ """Create a tree (newick string) from a taxonomy data structure.
4344+ Not the most efficient approach, but works well enough. Currently an outline only.
4345+ """
4346+
4347+ # Grab data only for the end_level classification
4348+ required_taxonomy = {}
4349+ for t in taxonomy:
4350+ if (end_level in t):
4351+ required_taxonomy[t] = taxonomy[t]
4352+
4353+ rank_index = taxonomy_levels.index(end_rank)
4354+
4355+ # create basic string
4356+
4357+ # get unique otus
4358+
4359+ # sort by the subfamily
4360+
4361+ # for each genus create a newick string
4362+
4363+ # if it's the same grouping as previous, add as sister clade (i.e. ,)
4364+ # else, prepend a (, append a ) and add new clade (ie. ,)
4365+
4366+
4367+ # return tree
4368+
4369+
4370+
4371+
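One way the comment outline above could be fleshed out: group the species of interest by a chosen rank and emit nested polytomies as a newick string. This is a sketch under assumed inputs (hypothetical data, genus rank only), not the function's final design:

```python
from collections import defaultdict

taxonomy = {
    "Gallus gallus": {"genus": "Gallus", "family": "Phasianidae"},
    "Gallus varius": {"genus": "Gallus", "family": "Phasianidae"},
    "Ardea goliath": {"genus": "Ardea",  "family": "Ardeidae"},
}

# Group species under their genus; sorting keeps the output deterministic.
clades = defaultdict(list)
for species, ranks in sorted(taxonomy.items()):
    clades[ranks["genus"]].append(species.replace(" ", "_"))

# Each genus becomes a polytomy; the genera sit as sister clades at the root.
newick = "(" + ",".join(
    "(" + ",".join(members) + ")" for _, members in sorted(clades.items())
) + ");"
print(newick)
```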
4372 ################ PRIVATE FUNCTIONS ########################
4373
4374 def _uniquify(l):
4375@@ -2975,13 +3765,25 @@
4376 "The source names in the dataset are not unique. Please run the auto-name function on these data. Name: "+name+"\n"
4377 last_name = name
4378
4379+ # do same for tree names:
4380+ names = get_all_tree_names(XML)
4381+ names.sort()
4382+ last_name = "" # This will actually throw a non-unique error if a name is empty -
4383+ # not great, but still an error!
4384+ for name in names:
4385+ if name == last_name:
4386+ # if non-unique throw exception
4387+ message = message + \
4388+ "The tree names in the dataset are not unique. Please run the auto-name function on these data with replace or edit by hand. Name: "+name+"\n"
4389+ last_name = name
4390+
4391 if (not message == ""):
4392 raise NotUniqueError(message)
4393
4394 return
4395
4396
4397-def _assemble_tree_matrix(tree_string):
4398+def _assemble_tree_matrix(tree_string, verbose=False):
4399 """ Assembles the MRP matrix for an individual tree
4400
4401 returns: matrix (2D numpy array: taxa on i, nodes on j)
4402@@ -3009,7 +3811,7 @@
4403 for i in range(0,len(names)):
4404 adjmat.append([1])
4405 adjmat = numpy.array(adjmat)
4406-
4407+ if verbose:
4408 print "Warning: Found uninformative tree in data. Including it in the matrix anyway"
4409
4410 return adjmat, names
4411@@ -3020,7 +3822,7 @@
4412
4413 If the new_taxa array is missing, simply delete the old_taxa
4414 """
4415-
4416+
4417 tree = _correctly_quote_taxa(tree)
4418 # are the input values lists or simple strings?
4419 if (isinstance(old_taxa,str)):
4420@@ -3564,7 +4366,7 @@
4421
4422 return permute_trees
4423
4424-def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None):
4425+def _create_matrix(trees, taxa, format="hennig", quote=False, weights=None, verbose=False):
4426 """
4427 Does the hard work on creating a matrix
4428 """
4429@@ -3585,7 +4387,7 @@
4430 if (not weights == None):
4431 weight = weights[key]
4432 names.append(key)
4433- submatrix, tree_taxa = _assemble_tree_matrix(trees[key])
4434+ submatrix, tree_taxa = _assemble_tree_matrix(trees[key], verbose=verbose)
4435 nChars = len(submatrix[0,:])
4436 # loop over characters in the submatrix
4437 for i in range(1,nChars):
4438@@ -3637,7 +4439,7 @@
4439 matrix_string += string + "\n"
4440 i += 1
4441
4442- matrix_string += "\t;\n"
4443+ matrix_string += "\n"
4444 if (not weights == None):
4445 # get unique weights
4446 unique_weights = _uniquify(weights)
4447@@ -3652,7 +4454,7 @@
4448 matrix_string += " " + str(i)
4449 i += 1
4450 matrix_string += ";\n"
4451- matrix_string += "procedure /;"
4452+ matrix_string += "proc /;"
4453 elif (format == 'nexus'):
4454 matrix_string = "#nexus\n\nbegin data;\n"
4455 matrix_string += "\tdimensions ntax = "+str(len(taxa)) +" nchar = "+str(last_char)+";\n"
4456
4457=== modified file 'stk/test/_substitute_taxa.py'
4458--- stk/test/_substitute_taxa.py 2016-07-14 10:12:17 +0000
4459+++ stk/test/_substitute_taxa.py 2017-01-12 09:27:31 +0000
4460@@ -10,6 +10,7 @@
4461 from stk.supertree_toolkit import check_subs, _tree_contains, _correctly_quote_taxa, _remove_single_poly_taxa
4462 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_all_taxa, _parse_tree, _delete_taxon
4463 from stk.supertree_toolkit import _collapse_nodes, import_tree, subs_from_csv, _getTaxaFromNewick, obtain_trees
4464+from stk.supertree_toolkit import generate_species_level_data
4465 from lxml import etree
4466 from util import *
4467 from stk.stk_exceptions import *
4468@@ -776,7 +777,24 @@
4469 new_tree = _sub_taxa_in_tree(tree2,"Thereuopodina",sub_in,skip_existing=True);
4470 self.assert_(answer2, new_tree)
4471
4472-
4473+
4474+ def test_auto_subs_taxonomy(self):
4475+ """test the automatic subs function with a simple test"""
4476+ XML = etree.tostring(etree.parse('data/input/auto_sub.phyml',parser),pretty_print=True)
4477+ taxonomy = {'Ardea goliath': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea goliath'},
4478+ 'Pelecaniformes': {'kingdom': 'Animalia', 'phylum': 'Chordata', 'order': 'Pelecaniformes', 'class': 'Aves', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013'}, 'Gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes'},
4479+ 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4480+ 'Platalea leucorodia': {'kingdom': 'Animalia', 'subfamily': 'Plataleinae', 'family': 'Threskiornithidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'infraphylum': 'Gnathostomata', 'superclass': 'Tetrapoda', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Platalea', 'order': 'Pelecaniformes', 'species': 'Platalea leucorodia'},
4481+ 'Gallus lafayetii': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus lafayetii'},
4482+ 'Ardea humbloti': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Ecdysozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Ardea', 'order': 'Pelecaniformes', 'species': 'Ardea humbloti'},
4483+ 'Gallus varius': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus varius'}}
4484+ XML = generate_species_level_data(XML, taxonomy)
4485+ expected_XML = etree.tostring(etree.parse('data/output/one_click_subs_output.phyml',parser),pretty_print=True)
4486+ trees = obtain_trees(XML)
4487+ expected_trees = obtain_trees(expected_XML)
4488+ for t in trees:
4489+ self.assert_(_trees_equal(trees[t], expected_trees[t]))
4490+
4491 def test_parrot_edge_case(self):
4492 """Random edge case where the tree disappeared..."""
4493 trees = ["(((((((Agapornis_lilianae, Agapornis_nigrigenis), Agapornis_personata, Agapornis_fischeri), Agapornis_roseicollis), (Agapornis_pullaria, Agapornis_taranta)), Agapornis_cana), Loriculus_galgulus), Geopsittacus_occidentalis);"]
4494
4495=== modified file 'stk/test/_supertree_toolkit.py'
4496--- stk/test/_supertree_toolkit.py 2015-03-26 09:58:58 +0000
4497+++ stk/test/_supertree_toolkit.py 2017-01-12 09:27:31 +0000
4498@@ -7,12 +7,13 @@
4499 import os
4500 stk_path = os.path.join( os.path.realpath(os.path.dirname(__file__)), os.pardir, os.pardir )
4501 sys.path.insert(0, stk_path)
4502-from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence
4503+from stk.supertree_toolkit import _check_uniqueness, _check_taxa, _check_data, get_all_characters, data_independence, add_weights
4504 from stk.supertree_toolkit import get_fossil_taxa, get_publication_years, data_summary, get_character_numbers, get_analyses_used
4505 from stk.supertree_toolkit import data_overlap, read_matrix, subs_file_from_str, clean_data, obtain_trees, get_all_source_names
4506 from stk.supertree_toolkit import add_historical_event, _sort_data, _parse_xml, _check_sources, _swap_tree_in_XML, replace_genera
4507 from stk.supertree_toolkit import get_all_taxa, _get_all_siblings, _parse_tree, get_characters_used, _trees_equal, get_weights
4508-from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, load_taxonomy
4509+from stk.supertree_toolkit import get_outgroup, set_all_tree_names, create_tree_name, taxonomic_checker, load_taxonomy, load_equivalents
4510+from stk.supertree_toolkit import create_taxonomy, create_taxonomy_from_tree, get_all_tree_names
4511 from lxml import etree
4512 from util import *
4513 from stk.stk_exceptions import *
4514@@ -268,19 +269,52 @@
4515
4516 def test_data_independence(self):
4517 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4518- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4519- non_ind = data_independence(XML)
4520- self.assertDictEqual(expected_dict, non_ind)
4521+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4522+ non_ind,subsets = data_independence(XML)
4523+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4524+ self.assertListEqual(expected_subsets, subsets)
4525+ self.assertListEqual(expected_idents, non_ind)
4526
4527- def test_data_independence(self):
4528+ def test_data_independence_2(self):
4529 XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4530- expected_dict = {'Hill_2011_2': ['Hill_2011_1', 1], 'Hill_Davis_2011_1': ['Hill_Davis_2011_2', 0]}
4531- non_ind, new_xml = data_independence(XML,make_new_xml=True)
4532- self.assertDictEqual(expected_dict, non_ind)
4533+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4534+ expected_subsets = [['Hill_2011_1', 'Hill_2011_2']]
4535+ non_ind, subset, new_xml = data_independence(XML,make_new_xml=True)
4536+ self.assertListEqual(expected_idents, non_ind)
4537+ self.assertListEqual(expected_subsets, subset)
4538 # check the second tree has not been removed
4539 self.assertRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4540 # check that the first tree is removed
4541 self.assertNotRegexpMatches(new_xml,re.escape('((A:1.00000,B:1.00000)0.00000:0.00000,(F:1.00000,E:1.00000)0.00000:0.00000)0.00000:0.00000;'))
4542+
4543+ def test_add_weights(self):
4544+ """Add weights to a bunch of trees"""
4545+ XML = etree.tostring(etree.parse('data/input/check_data_ind.phyml',parser),pretty_print=True)
4546+ # see above
4547+ expected_idents = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'], ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
4548+ # so the first should end up with a weight of 0.33333 and the second with 0.5
4549+ for ei in expected_idents:
4550+ weight = 1.0/float(len(ei))
4551+ XML = add_weights(XML, ei, weight)
4552+
4553+ expected_weights = [str(1.0/3.0), str(1.0/3.0), str(1.0/3.0), str(0.5), str(0.5)]
4554+ weights_in_xml = []
4555+ # now check weights have been added to the correct part of the tree
4556+ xml_root = _parse_xml(XML)
4557+ i = 0
4558+ for ei in expected_idents:
4559+ for tree in ei:
4560+ find = etree.XPath("//source_tree")
4561+ trees = find(xml_root)
4562+ for t in trees:
4563+ if t.attrib['name'] == tree:
4564+ # found the matching source_tree; record its weight
4565+ weights_in_xml.append(t.xpath("tree/weight/real_value")[0].text)
4566+
4567+ self.assertListEqual(expected_weights,weights_in_xml)
4568+
4569+
4570+
4571
4572 def test_overlap(self):
4573 XML = etree.tostring(etree.parse('data/input/check_overlap_ok.phyml',parser),pretty_print=True)
4574@@ -438,7 +472,7 @@
4575 XML = clean_data(XML)
4576 trees = obtain_trees(XML)
4577 self.assert_(len(trees) == 2)
4578- expected_trees = {'Hill_2011_4': '(A,B,(C,D,E));', 'Hill_2011_2': '(A, B, C, (D, E, F));'}
4579+ expected_trees = {'Hill_2011_2': '(A,B,(C,D,E));', 'Hill_2011_1': '(A, B, C, (D, E, F));'}
4580 for t in trees:
4581 self.assert_(_trees_equal(trees[t],expected_trees[t]))
4582
4583@@ -558,18 +592,78 @@
4584 self.assert_(c in expected_characters)
4585 self.assert_(len(characters) == len(expected_characters))
4586
4587+ def test_create_taxonomy(self):
4588+ XML = etree.tostring(etree.parse('data/input/create_taxonomy.phyml',parser),pretty_print=True)
4589+ # Tested on 11/01/17 and EOL have changed the output
4590+ # old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}}
4591+ expected = {'Jeletzkytes criptonodosus': {'superfamily': 'Scaphitoidea', 'family': 'Scaphitidae', 'subkingdom': 'Metazoa', 'subclass': 'Ammonoidea', 'species': 'Jeletzkytes criptonodosus', 'phylum': 'Mollusca', 'suborder': 'Ancyloceratina', 'provider': 'Paleobiology Database', 'genus': 'Jeletzkytes', 'class': 'Cephalopoda'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}}
4592+ if (internet_on()):
4593+ taxonomy = create_taxonomy(XML)
4594+ self.maxDiff = None
4595+ self.assertDictEqual(taxonomy, expected)
4596+ else:
4597+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy function"
4598+ return
4599+
4600+ def test_create_taxonomy_from_tree(self):
4601+ """Tests if taxonomy from tree works. Uses same data for normal XML test but goes directly for the tree instead of parsing the XML """
4602+ # Tested on 11/01/17 and this no longer worked, but is correct! EOL returned something different.
4603+ #old_expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'superclass': 'Sarcopterygii', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Egretta tricolor', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4604+ expected = {'Archaeopteryx lithographica': {'genus': 'Archaeopteryx', 'provider': 'Paleobiology Database'}, 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'infraspecies': 'Egretta', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': ['Egretta', 'tricolor'], 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus gallus': {'kingdom': 'Animalia', 'family': 'Phasianidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'genus': 'Gallus', 'order': 'Galliformes'}, 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Thalassarche melanophris', 'genus': 'Thalassarche', 'order': 'Procellariiformes'}}
4605+ tree = "(Archaeopteryx_lithographica, (Gallus_gallus, (Thalassarche_melanophris, Egretta_tricolor)));"
4606+ if (internet_on()):
4607+ taxonomy = create_taxonomy_from_tree(tree)
4608+ self.maxDiff = None
4609+ self.assertDictEqual(taxonomy, expected)
4610+ else:
4611+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the create_taxonomy_from_tree function"
4612+ return
4613+
4614+ def test_taxonomy_checker(self):
4615+ expected = {'Thalassarche_melanophrys': [['Thalassarche_melanophris', 'Thalassarche_melanophrys', 'Diomedea_melanophris', 'Thalassarche_[melanophrys', 'Diomedea_melanophrys'], 'amber'], 'Egretta_tricolor': [['Egretta_tricolor'], 'green'], 'Gallus_gallus': [['Gallus_gallus'], 'green']}
4616+ XML = etree.tostring(etree.parse('data/input/check_taxonomy.phyml',parser),pretty_print=True)
4617+ if (internet_on()):
4618+ equivs = taxonomic_checker(XML)
4619+ self.maxDiff = None
4620+ self.assertDictEqual(equivs, expected)
4621+ else:
4622+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4623+ return
4624+
4625+ def test_taxonomy_checker2(self):
4626+ XML = etree.tostring(etree.parse('data/input/check_taxonomy_fixes.phyml',parser),pretty_print=True)
4627+ if (internet_on()):
4628+ # This test is a bit dodgy as it depends on EOL's server speed. Run it a few times before deciding it's broken.
4629+ equivs = taxonomic_checker(XML,verbose=False)
4630+ self.maxDiff = None
4631+ self.assert_(equivs['Agathamera_crassa'][0][0] == 'Agathemera_crassa')
4632+ self.assert_(equivs['Celatoblatta_brunni'][0][0] == 'Maoriblatta_brunni')
4633+ self.assert_(equivs['Blatta_lateralis'][1] == 'amber')
4634+ else:
4635+ print bcolors.WARNING + "WARNING: "+ bcolors.ENDC+ "No internet connection found. Not checking the taxonomy_checker function"
4636+ return
4637+
4638+
4639 def test_load_taxonomy(self):
4640 csv_file = "data/input/create_taxonomy.csv"
4641- expected = {'Archaeopteryx lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'},
4642- 'Egretta tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'subkingdom': 'Bilateria', 'subclass': 'Neoloricata', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'suborder': 'Ischnochitonina', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes', 'species': 'Egretta tricolor'},
4643- 'Gallus gallus': {'kingdom': 'Animalia', 'infrakingdom': 'Protostomia', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'genus': 'Gallus', 'order': 'Galliformes', 'species': 'Gallus gallus'},
4644- 'Thalassarche melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'class': 'Aves', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'order': 'Procellariiformes', 'species': 'Thalassarche melanophris'},
4645- 'Jeletzkytes criptonodosus': {'kingdom': 'Metazoa', 'family': 'Scaphitidae', 'order': 'Ammonoidea', 'phylum': 'Mollusca', 'provider': 'PBDB', 'species': 'Jeletzkytes criptonodosus', 'class': 'Cephalopoda'}}
4646+ expected = {'Jeletzkytes_criptonodosus': {'kingdom': 'Metazoa', 'subclass': 'Cephalopoda', 'species': 'Jeletzkytes criptonodosus', 'suborder': 'Ammonoidea', 'provider': 'PBDB', 'subfamily': 'Scaphitidae', 'class': 'Mollusca'}, 'Archaeopteryx_lithographica': {'subkingdom': 'Metazoa', 'subclass': 'Tetrapodomorpha', 'suborder': 'Coelurosauria', 'provider': 'Paleobiology Database', 'genus': 'Archaeopteryx', 'class': 'Aves'}, 'Egretta_tricolor': {'kingdom': 'Animalia', 'family': 'Ardeidae', 'class': 'Aves', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'subclass': 'Neoloricata', 'species': 'Egretta tricolor', 'phylum': 'Chordata', 'suborder': 'Ischnochitonina', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Egretta', 'order': 'Pelecaniformes'}, 'Gallus_gallus': {'kingdom': 'Animalia', 'superorder': 'Galliformes', 'family': 'Phasianidae', 'subkingdom': 'Bilateria', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'species': 'Gallus gallus', 'phylum': 'Chordata', 'superphylum': 'Lophozoa', 'infrakingdom': 'Protostomia', 'genus': 'Gallus', 'class': 'Aves'}, 'Thalassarche_melanophris': {'kingdom': 'Animalia', 'family': 'Diomedeidae', 'subkingdom': 'Bilateria', 'species': 'Thalassarche melanophris', 'order': 'Procellariiformes', 'phylum': 'Chordata', 'provider': 'Species 2000 & ITIS Catalogue of Life: April 2013', 'infrakingdom': 'Deuterostomia', 'subphylum': 'Vertebrata', 'genus': 'Thalassarche', 'class': 'Aves'}}
4647 taxonomy = load_taxonomy(csv_file)
4648 self.maxDiff = None
4649
4650 self.assertDictEqual(taxonomy, expected)
4651
4652+
4653+ def test_load_equivalents(self):
4654+ csv_file = "data/input/equivalents.csv"
4655+ expected = {'Turnix_sylvatica': [['Turnix_sylvaticus','Tetrao_sylvaticus','Tetrao_sylvatica','Turnix_sylvatica'],'yellow'],
4656+ 'Xiphorhynchus_pardalotus':[['Xiphorhynchus_pardalotus'],'green'],
4657+ 'Phaenicophaeus_curvirostris':[['Zanclostomus_curvirostris','Rhamphococcyx_curvirostris','Phaenicophaeus_curvirostris','Rhamphococcyx_curvirostr'],'yellow'],
4658+ 'Megalapteryx_benhami':[['Megalapteryx_benhami'],'red']
4659+ }
4660+ equivalents = load_equivalents(csv_file)
4661+ self.assertDictEqual(equivalents, expected)
4662+
4663+
4664 def test_name_tree(self):
4665 XML = etree.tostring(etree.parse('data/input/single_source_no_names.phyml',parser),pretty_print=True)
4666 xml_root = _parse_xml(XML)
4667@@ -583,6 +677,35 @@
4668 XML = etree.tostring(etree.parse('data/input/single_source.phyml',parser),pretty_print=True)
4669 self.assert_(isEqualXML(new_xml,XML))
4670
4671+ def test_all_rename_tree(self):
4672+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4673+ new_xml = set_all_tree_names(XML,overwrite=True)
4674+ XML = etree.tostring(etree.parse('data/output/single_source_same_tree_name.phyml',parser),pretty_print=True)
4675+ self.assert_(isEqualXML(new_xml,XML))
4676+
4677+ def test_get_all_tree_names(self):
4678+ XML = etree.tostring(etree.parse('data/input/single_source_same_tree_name.phyml',parser),pretty_print=True)
4679+ names = get_all_tree_names(XML)
4680+ self.assertListEqual(names,['Hill_2011_2','Hill_2011_2'])
4681+
4682+
4683+def internet_on(host="8.8.8.8", port=443, timeout=5):
4684+ import socket
4685+
4686+ """
4687+ Host: 8.8.8.8 (google-public-dns-a.google.com)
4766+ OpenPort: 443/tcp
4767+ Service: https (TCP)
4690+ """
4691+ try:
4692+ socket.setdefaulttimeout(timeout)
4693+ socket.socket(socket.AF_INET, socket.SOCK_STREAM).connect((host, port))
4694+ return True
4695+ except Exception as ex:
4696+ print ex.message
4697+ return False
4698+
4699+
4700
4701 if __name__ == '__main__':
4702 unittest.main()
4703
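The new `test_add_weights` test above expects each group of non-independent trees to share a total weight of 1.0, so every member gets `1/len(group)` (hence the asserted 1/3 and 0.5 values). A standalone sketch of that arithmetic, using a hypothetical helper name that is not part of the toolkit:

```python
# Hypothetical helper: each group of non-independent trees shares a total
# weight of 1.0, so every member is down-weighted to 1.0 / len(group).
def weights_for_groups(groups):
    weights = {}
    for group in groups:
        w = 1.0 / len(group)
        for name in group:
            weights[name] = w
    return weights

# The two groups the test derives from check_data_ind.phyml.
groups = [['Hill_Davis_2011_2', 'Hill_Davis_2011_1', 'Hill_Davis_2011_3'],
          ['Hill_Davis_2013_1', 'Hill_Davis_2013_2']]
print(weights_for_groups(groups))
```

The real `add_weights` writes these values into the `tree/weight/real_value` elements that the test then reads back with XPath.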
4704=== modified file 'stk/test/_trees.py'
4705--- stk/test/_trees.py 2015-03-26 09:58:58 +0000
4706+++ stk/test/_trees.py 2017-01-12 09:27:31 +0000
4707@@ -5,7 +5,7 @@
4708 sys.path.insert(0,"../../")
4709 from stk.supertree_toolkit import import_tree, obtain_trees, get_all_taxa, _assemble_tree_matrix, create_matrix, _delete_taxon, _sub_taxon,_tree_contains
4710 from stk.supertree_toolkit import _swap_tree_in_XML, substitute_taxa, get_taxa_from_tree, get_characters_from_tree, amalgamate_trees, _uniquify
4711-from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick
4712+from stk.supertree_toolkit import import_trees, import_tree, _trees_equal, _find_trees_for_permuting, permute_tree, get_all_source_names, _getTaxaFromNewick, _parse_tree
4713 from stk.supertree_toolkit import get_mrca
4714 import os
4715 from lxml import etree
4716@@ -215,6 +215,18 @@
4717 mrca = get_mrca(tree,["A","I", "L"])
4718 self.assert_(mrca == 8)
4719
4720+ def test_get_mrca_2(self):
4721+ tree = "(B,(C,(D,(E,((A,F),((I,(G,H)),(J,(K,L))))))));"
4722+ mrca = get_mrca(tree,["A","F"])
4723+ print mrca
4724+ #self.assert_(mrca == 8)
4725+ to = _parse_tree('(X,Y,Z,(Q,W));')
4726+ treeobj = _parse_tree(tree)
4727+ newnode = treeobj.addNodeBetweenNodes(10,9)
4728+ treeobj.addSubTree(newnode, to, ignoreRootAssert=True)
4729+ treeobj.draw()
4730+
4731+
4732 def test_get_all_trees(self):
4733 XML = etree.tostring(etree.parse(single_source_input,parser),pretty_print=True)
4734 tree = obtain_trees(XML)
4735
4736=== added file 'stk/test/data/input/auto_sub.phyml'
4737--- stk/test/data/input/auto_sub.phyml 1970-01-01 00:00:00 +0000
4738+++ stk/test/data/input/auto_sub.phyml 2017-01-12 09:27:31 +0000
4739@@ -0,0 +1,97 @@
4740+<?xml version='1.0' encoding='utf-8'?>
4741+<phylo_storage>
4742+ <project_name>
4743+ <string_value lines="1">Test</string_value>
4744+ </project_name>
4745+ <sources>
4746+ <source name="Hill_2011">
4747+ <bibliographic_information>
4748+ <article>
4749+ <authors>
4750+ <author>
4751+ <surname>
4752+ <string_value lines="1">Hill</string_value>
4753+ </surname>
4754+ <other_names>
4755+ <string_value lines="1">Jon</string_value>
4756+ </other_names>
4757+ </author>
4758+ </authors>
4759+ <title>
4760+ <string_value lines="1">A great paper</string_value>
4761+ </title>
4762+ <year>
4763+ <integer_value rank="0">2011</integer_value>
4764+ </year>
4765+ <journal>
4766+ <string_value lines="1">Nature</string_value>
4767+ </journal>
4768+ <pages>
4769+ <string_value lines="1">1-12</string_value>
4770+ </pages>
4771+ </article>
4772+ </bibliographic_information>
4773+ <source_tree name="Hill_2011_1">
4774+ <tree>
4775+ <tree_string>
4776+ <string_value lines="1">(Thalassarche_melanophris, Pelecaniformes, (Gallus, Gallus_varius));</string_value>
4777+ </tree_string>
4778+ <figure_legend>
4779+ <string_value lines="1">NA</string_value>
4780+ </figure_legend>
4781+ <figure_number>
4782+ <string_value lines="1">1</string_value>
4783+ </figure_number>
4784+ <page_number>
4785+ <string_value lines="1">1</string_value>
4786+ </page_number>
4787+ <tree_inference>
4788+ <optimality_criterion name="Maximum Parsimony"/>
4789+ </tree_inference>
4790+ <topology>
4791+ <outgroup>
4792+ <string_value lines="1">A</string_value>
4793+ </outgroup>
4794+ </topology>
4795+ </tree>
4796+ <taxa_data>
4797+ <all_extant/>
4798+ </taxa_data>
4799+ <character_data>
4800+ <character type="molecular" name="12S"/>
4801+ </character_data>
4802+ </source_tree>
4803+ <source_tree name="Hill_2011_2">
4804+ <tree>
4805+ <tree_string>
4806+ <string_value lines="1">(Gallus_lafayetii, (Platalea_leucorodia, (Ardea_humbloti, Ardea_goliath)));</string_value>
4807+ </tree_string>
4808+ <figure_legend>
4809+ <string_value lines="1">NA</string_value>
4810+ </figure_legend>
4811+ <figure_number>
4812+ <string_value lines="1">1</string_value>
4813+ </figure_number>
4814+ <page_number>
4815+ <string_value lines="1">1</string_value>
4816+ </page_number>
4817+ <tree_inference>
4818+ <optimality_criterion name="Maximum Parsimony"/>
4819+ </tree_inference>
4820+ <topology>
4821+ <outgroup>
4822+ <string_value lines="1">A</string_value>
4823+ </outgroup>
4824+ </topology>
4825+ </tree>
4826+ <taxa_data>
4827+ <all_extant/>
4828+ </taxa_data>
4829+ <character_data>
4830+ <character type="molecular" name="12S"/>
4831+ </character_data>
4832+ </source_tree>
4833+ </source>
4834+ </sources>
4835+ <history/>
4836+</phylo_storage>
4837
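The `auto_sub.phyml` fixture added above follows the `phylo_storage` layout the tests query: each `source_tree` carries its Newick string under `tree/tree_string/string_value`. A minimal sketch of extracting those pairs, assuming only the element paths visible in the fixture; it uses the stdlib `xml.etree.ElementTree`, whereas the test suite itself uses lxml's richer `xpath` calls:

```python
import xml.etree.ElementTree as ET

# Toy fragment mirroring the structure of auto_sub.phyml.
PHYML = """<phylo_storage>
  <sources>
    <source name="Hill_2011">
      <source_tree name="Hill_2011_1">
        <tree>
          <tree_string>
            <string_value lines="1">(A, (B, C));</string_value>
          </tree_string>
        </tree>
      </source_tree>
    </source>
  </sources>
</phylo_storage>"""

root = ET.fromstring(PHYML)
# Walk every source_tree and pull its name and Newick string.
for st in root.iter("source_tree"):
    name = st.get("name")
    newick = st.find("tree/tree_string/string_value").text
    print(name, newick)
```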
4838=== modified file 'stk/test/data/input/check_data_ind.phyml'
4839--- stk/test/data/input/check_data_ind.phyml 2014-10-09 09:33:21 +0000
4840+++ stk/test/data/input/check_data_ind.phyml 2017-01-12 09:27:31 +0000
4841@@ -249,6 +249,147 @@
4842 <character type="molecular" name="12S"/>
4843 </character_data>
4844 </source_tree>
4845+ <source_tree name="Hill_Davis_2011_3">
4846+ <tree>
4847+ <tree_string>
4848+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,H:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4849+ </tree_string>
4850+ <figure_legend>
4851+ <string_value lines="1">NA</string_value>
4852+ </figure_legend>
4853+ <figure_number>
4854+ <string_value lines="1">0</string_value>
4855+ </figure_number>
4856+ <page_number>
4857+ <string_value lines="1">0</string_value>
4858+ </page_number>
4859+ <tree_inference>
4860+ <optimality_criterion name="Maximum Parsimony"/>
4861+ </tree_inference>
4862+ <topology>
4863+ <outgroup>
4864+ <string_value lines="1">A</string_value>
4865+ </outgroup>
4866+ </topology>
4867+ </tree>
4868+ <taxa_data>
4869+ <mixed_fossil_and_extant>
4870+ <taxon name="A">
4871+ <fossil/>
4872+ </taxon>
4873+ <taxon name="B">
4874+ <fossil/>
4875+ </taxon>
4876+ </mixed_fossil_and_extant>
4877+ </taxa_data>
4878+ <character_data>
4879+ <character type="molecular" name="12S"/>
4880+ </character_data>
4881+ </source_tree>
4882+ </source>
4883+ <source name="Hill_Davis_2013">
4884+ <bibliographic_information>
4885+ <article>
4886+ <authors>
4887+ <author>
4888+ <surname>
4889+ <string_value lines="1">Hill</string_value>
4890+ </surname>
4891+ <other_names>
4892+ <string_value lines="1">Jon</string_value>
4893+ </other_names>
4894+ </author>
4895+ <author>
4896+ <surname>
4897+ <string_value lines="1">Davis</string_value>
4898+ </surname>
4899+ <other_names>
4900+ <string_value lines="1">Katie</string_value>
4901+ </other_names>
4902+ </author>
4903+ </authors>
4904+ <title>
4905+ <string_value lines="1">Another superb paper</string_value>
4906+ </title>
4907+ <year>
4908+ <integer_value rank="0">2013</integer_value>
4909+ </year>
4910+ </article>
4911+ </bibliographic_information>
4912+ <source_tree name="Hill_Davis_2013_1">
4913+ <tree>
4914+ <tree_string>
4915+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4916+ </tree_string>
4917+ <figure_legend>
4918+ <string_value lines="1">NA</string_value>
4919+ </figure_legend>
4920+ <figure_number>
4921+ <string_value lines="1">0</string_value>
4922+ </figure_number>
4923+ <page_number>
4924+ <string_value lines="1">0</string_value>
4925+ </page_number>
4926+ <tree_inference>
4927+ <optimality_criterion name="Maximum Parsimony"/>
4928+ </tree_inference>
4929+ <topology>
4930+ <outgroup>
4931+ <string_value lines="1">A</string_value>
4932+ </outgroup>
4933+ </topology>
4934+ </tree>
4935+ <taxa_data>
4936+ <mixed_fossil_and_extant>
4937+ <taxon name="A">
4938+ <fossil/>
4939+ </taxon>
4940+ <taxon name="B">
4941+ <fossil/>
4942+ </taxon>
4943+ </mixed_fossil_and_extant>
4944+ </taxa_data>
4945+ <character_data>
4946+ <character type="molecular" name="12S"/>
4947+ </character_data>
4948+ </source_tree>
4949+ <source_tree name="Hill_Davis_2013_2">
4950+ <tree>
4951+ <tree_string>
4952+ <string_value lines="1">((A:1.00000,B:1.00000)0.00000:0.00000,F:1.00000,E:1.00000,(G:1.00000,Z:1.00000)0.00000:0.00000)0.00000:0.00000;</string_value>
4953+ </tree_string>
4954+ <figure_legend>
4955+ <string_value lines="1">NA</string_value>
4956+ </figure_legend>
4957+ <figure_number>
4958+ <string_value lines="1">0</string_value>
4959+ </figure_number>
4960+ <page_number>
4961+ <string_value lines="1">0</string_value>
4962+ </page_number>
4963+ <tree_inference>
4964+ <optimality_criterion name="Maximum Parsimony"/>
4965+ </tree_inference>
4966+ <topology>
4967+ <outgroup>
4968+ <string_value lines="1">A</string_value>
4969+ </outgroup>
4970+ </topology>
4971+ </tree>
4972+ <taxa_data>
4973+ <mixed_fossil_and_extant>
4974+ <taxon name="A">
4975+ <fossil/>
4976+ </taxon>
4977+ <taxon name="B">
4978+ <fossil/>
4979+ </taxon>
4980+ </mixed_fossil_and_extant>
4981+ </taxa_data>
4982+ <character_data>
4983+ <character type="molecular" name="12S"/>
4984+ </character_data>
4985+ </source_tree>
4986 </source>
4987 </sources>
4988 <history/>
4989
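The two `Hill_Davis_2013` trees added to `check_data_ind.phyml` above are byte-identical and share the same character, which is exactly the situation the updated `data_independence` test flags as a non-independent group. An illustrative grouping by (tree string, character set), not the toolkit's actual implementation, which also detects taxon subsets like the `Hill_2011` pair:

```python
from collections import defaultdict

# Illustrative only: bucket source trees whose Newick string and character
# set are identical; any bucket with more than one member is non-independent.
def group_identical(sources):
    groups = defaultdict(list)
    for name, tree, chars in sources:
        groups[(tree, frozenset(chars))].append(name)
    return [g for g in groups.values() if len(g) > 1]

# Simplified versions of the fixture's trees (branch lengths dropped).
sources = [
    ("Hill_Davis_2013_1", "((A,B),F,E,(G,Z));", ["12S"]),
    ("Hill_Davis_2013_2", "((A,B),F,E,(G,Z));", ["12S"]),
    ("Hill_Davis_2011_3", "((A,B),F,E,(G,H));", ["12S"]),
]
print(group_identical(sources))
```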
4990=== added file 'stk/test/data/input/check_taxonomy.phyml'
4991--- stk/test/data/input/check_taxonomy.phyml 1970-01-01 00:00:00 +0000
4992+++ stk/test/data/input/check_taxonomy.phyml 2017-01-12 09:27:31 +0000
4993@@ -0,0 +1,67 @@
4994+<?xml version='1.0' encoding='utf-8'?>
4995+<phylo_storage>
4996+ <project_name>
4997+ <string_value lines="1">Test</string_value>
4998+ </project_name>
4999+ <sources>
5000+ <source name="Hill_2011">
The diff has been truncated for viewing.
