Merge lp:~xnox/ubuntu-geonames/fix-duplicates into lp:ubuntu-geonames

Proposed by Dimitri John Ledkov
Status: Merged
Merged at revision: 17
Proposed branch: lp:~xnox/ubuntu-geonames/fix-duplicates
Merge into: lp:ubuntu-geonames
Diff against target: 166 lines (+37/-33)
4 files modified
README (+7/-7)
geoname-modpython.py (+15/-8)
import-geonames.sh (+14/-17)
sphinx.conf (+1/-1)
To merge this branch: bzr merge lp:~xnox/ubuntu-geonames/fix-duplicates
Reviewer Review Type Date Requested Status
Evan (community) Approve
Review via email: mp+120911@code.launchpad.net

Description of the change

* This fixes database import due to data format changes
* Also fixes documentation and sample config
* This updates geoname-modpython.py query
* This does not _fix_ cherrypy version geoname.py (as it appears to not be used / broken ?!)

Instead of doing sql query for each potential location, we build the two location lists (for normal and alternative names) and execute a single query across both tables with a UNION. As a side-effect, UNION removes duplicate entries. Thus instead of 20 New Yorks, we get only 4.

If you are deploying this, please note:
- Due to database changes it's probably best to drop the whole database, recreate it & re-import the dataset.
- And don't forget the re-index with sphinx.

To post a comment you must log in.
Revision history for this message
Evan (ev) wrote :

Yes, the cherrypy version was the initially developed one, which was dropped for the mod_python variant.

Revision history for this message
Evan (ev) wrote :

Looks good! Please file an RT for deployment.

review: Approve

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
=== modified file 'README'
--- README 2010-09-03 07:35:55 +0000
+++ README 2012-08-23 04:40:25 +0000
@@ -1,12 +1,12 @@
1- Install postgres and create a 'geonames' database.1- Install postgres and create a 'geonames' database.
2- Download Geonames from http://download.geonames.org/export/dump2- Create a geouser account with password geopw
3- Run import-geonames.sh3- Run import-geonames.sh
4- Create a geouser account with password geopw, grant permissions on the table:4 + it downloads & populates the database
5 GRANT ALL PRIVILEGES ON geoname, admin1codes, countryInfo TO geouser5 + grants slightly excessive permissions to geouser for queries
6 (slightly excessive, it just needs to query)6- Install sphinxsearch and configure (/etc/sphinxsearch/sphinx.conf ) from universe.
7- Install and configure Sphinx (/etc/sphinxsearch/sphinx.conf ) from universe.7 + run $ sudo indexer geonames
8- Create an index on geonameid which should reduce query times:8 + Enable /etc/default/sphinxsearch
9 CREATE INDEX geoname_id_idx ON geoname(geonameid);9 + start $ sudo service sphinxsearch start
10- Install sphinxapi-python from ppa:~ev and use or modify the sphinx.conf10- Install sphinxapi-python from ppa:~ev and use or modify the sphinx.conf
11 provided.11 provided.
12- Install CherryPy from universe.12- Install CherryPy from universe.
1313
=== modified file 'geoname-modpython.py'
--- geoname-modpython.py 2011-04-06 20:33:16 +0000
+++ geoname-modpython.py 2012-08-23 04:40:25 +0000
@@ -5,11 +5,12 @@
5statement = "SELECT geoname.name, admin1codes.name, countryInfo.name, \5statement = "SELECT geoname.name, admin1codes.name, countryInfo.name, \
6geoname.longitude, geoname.latitude FROM admin1codes, geoname, countryInfo \6geoname.longitude, geoname.latitude FROM admin1codes, geoname, countryInfo \
7WHERE code = geoname.country||'.'||geoname.admin1 AND \7WHERE code = geoname.country||'.'||geoname.admin1 AND \
8countryInfo.iso_alpha2=geoname.country AND geoname.geonameid=%s;"8countryInfo.iso_alpha2=geoname.country AND geoname.geonameid in %s \
9altstatement = "SELECT alternatename.alternateName, admin1codes.name, countryInfo.name, \9UNION \
10SELECT alternatename.alternateName, admin1codes.name, countryInfo.name, \
10geoname.longitude, geoname.latitude FROM alternatename, admin1codes, geoname, countryInfo \11geoname.longitude, geoname.latitude FROM alternatename, admin1codes, geoname, countryInfo \
11WHERE code = geoname.country||'.'||geoname.admin1 AND \12WHERE code = geoname.country||'.'||geoname.admin1 AND \
12countryInfo.iso_alpha2=geoname.country AND alternatename.geonameid=geoname.geonameid AND alternatename.alternatenameId=%s;"13countryInfo.iso_alpha2=geoname.country AND alternatename.geonameid=geoname.geonameid AND alternatename.alternatenameId in %s;"
13authstring = 'dbname=geonames user=geouser password=geopw host=localhost'14authstring = 'dbname=geonames user=geouser password=geopw host=localhost'
14jsonheader = '['15jsonheader = '['
15jsonfooter = ']'16jsonfooter = ']'
@@ -31,17 +32,23 @@
31 connection = psycopg2.connect(authstring)32 connection = psycopg2.connect(authstring)
32 cursor = connection.cursor()33 cursor = connection.cursor()
33 try:34 try:
35 # We need at least one value for the sql in operator
36 # and there are no locations with id 0
37 statement_ids = [0]
38 altstatement_ids = [0]
34 for x in result:39 for x in result:
35 rawid = x['id']40 rawid = x['id']
36 idval = rawid / 1041 idval = rawid / 10
37 idtype = rawid % 1042 idtype = rawid % 10
38 if idtype == 1:43 if idtype == 1:
39 cursor.execute(statement % idval)44 statement_ids.append(idval)
40 else:45 else:
41 cursor.execute(altstatement % idval)46 altstatement_ids.append(idval)
42 record = cursor.fetchone()47
43 if record:48 cursor.execute(statement, (tuple(statement_ids), tuple(altstatement_ids)))
44 ret.append(jsonentry % record)49 records = cursor.fetchall()
50 for record in records:
51 ret.append(jsonentry % record)
45 finally:52 finally:
46 cursor.close()53 cursor.close()
47 connection.close()54 connection.close()
4855
=== modified file 'import-geonames.sh'
--- import-geonames.sh 2011-04-07 15:10:15 +0000
+++ import-geonames.sh 2012-08-23 04:40:25 +0000
@@ -9,7 +9,7 @@
9# allCountries.zip contains allCountries.txt9# allCountries.zip contains allCountries.txt
10# alternateNames.zip contains iso-languagecodes.txt alternateNames.txt10# alternateNames.zip contains iso-languagecodes.txt alternateNames.txt
11ZIPFILES="allCountries.zip alternateNames.zip"11ZIPFILES="allCountries.zip alternateNames.zip"
12TXTFILES="admin1Codes.txt admin1CodesASCII.txt countryInfo.txt timeZones.txt"12TXTFILES="admin1CodesASCII.txt countryInfo.txt timeZones.txt"
13for i in $ZIPFILES $TXTFILES13for i in $ZIPFILES $TXTFILES
14do14do
15 wget "http://download.geonames.org/export/dump/$i"15 wget "http://download.geonames.org/export/dump/$i"
@@ -24,14 +24,14 @@
24grep -v '^#' countryInfo.txt > countryInfo.txt.tmp24grep -v '^#' countryInfo.txt > countryInfo.txt.tmp
25tail -n +2 timeZones.txt > timeZones.txt.tmp25tail -n +2 timeZones.txt > timeZones.txt.tmp
26 26
27sudo -u postgres psql geonames <<EOT27psql geonames <<EOT
28BEGIN;28BEGIN;
29DROP TABLE IF EXISTS geoname;29DROP TABLE IF EXISTS geoname;
30CREATE TABLE geoname (30CREATE TABLE geoname (
31 geonameid int,31 geonameid int,
32 name varchar(200),32 name varchar(200),
33 asciiname varchar(200),33 asciiname varchar(200),
34 alternatenames varchar(6000),34 alternatenames varchar(10000),
35 latitude float,35 latitude float,
36 longitude float,36 longitude float,
37 fclass char(1),37 fclass char(1),
@@ -57,9 +57,11 @@
57 isoLanguage varchar(7),57 isoLanguage varchar(7),
58 alternateName varchar(200),58 alternateName varchar(200),
59 isPreferredName boolean,59 isPreferredName boolean,
60 isShortName boolean60 isShortName boolean,
61 isColloquial boolean,
62 isHistoric boolean
61);63);
62copy alternatename (alternatenameid,geonameid,isoLanguage,alternateName,isPreferredName,isShortName) from '$WORKPATH/alternateNames.txt' null as '';64copy alternatename (alternatenameid,geonameid,isoLanguage,alternateName,isPreferredName,isShortName,isColloquial,isHistoric) from '$WORKPATH/alternateNames.txt' null as '';
6365
64DROP TABLE IF EXISTS countryinfo;66DROP TABLE IF EXISTS countryinfo;
65CREATE TABLE countryinfo (67CREATE TABLE countryinfo (
@@ -94,29 +96,24 @@
94);96);
95copy iso_languagecodes (iso_639_3, iso_639_2, iso_639_1, language_name) from '$WORKPATH/iso-languagecodes.txt.tmp' null as '';97copy iso_languagecodes (iso_639_3, iso_639_2, iso_639_1, language_name) from '$WORKPATH/iso-languagecodes.txt.tmp' null as '';
9698
97DROP TABLE IF EXISTS admin1Codes;99DROP TABLE IF EXISTS admin1codes;
98CREATE TABLE admin1Codes (100CREATE TABLE admin1codes (
99 code varchar(10),
100 name TEXT
101);
102copy admin1Codes (code, name) from '$WORKPATH/admin1Codes.txt' null as '';
103
104DROP TABLE IF EXISTS admin1CodesAscii;
105CREATE TABLE admin1CodesAscii (
106 code varchar(10),101 code varchar(10),
107 name TEXT,102 name TEXT,
108 nameAscii TEXT,103 nameAscii TEXT,
109 geonameid int104 geonameid int
110);105);
111copy admin1CodesAscii (code,name,nameAscii,geonameid) from '$WORKPATH/admin1CodesASCII.txt' null as '';106copy admin1codes (code,name,nameAscii,geonameid) from '$WORKPATH/admin1CodesASCII.txt' null as '';
112107
113DROP TABLE IF EXISTS timeZones;108DROP TABLE IF EXISTS timeZones;
114CREATE TABLE timeZones (109CREATE TABLE timeZones (
110 code CHAR(2),
115 timeZoneId VARCHAR(200),111 timeZoneId VARCHAR(200),
116 GMT_offset numeric(3,1),112 GMT_offset numeric(3,1),
117 DST_offset numeric(3,1)113 DST_offset numeric(3,1),
114 RAW_offset numeric(3,1)
118);115);
119copy timeZones (timeZoneId,GMT_offset,DST_offset) from '$WORKPATH/timeZones.txt.tmp' null as '';116copy timeZones (code,timeZoneId,GMT_offset,DST_offset,RAW_offset) from '$WORKPATH/timeZones.txt.tmp' null as '';
120117
121DROP TABLE IF EXISTS continentCodes;118DROP TABLE IF EXISTS continentCodes;
122CREATE TABLE continentCodes (119CREATE TABLE continentCodes (
123120
=== modified file 'sphinx.conf'
--- sphinx.conf 2011-04-07 13:26:08 +0000
+++ sphinx.conf 2012-08-23 04:40:25 +0000
@@ -41,7 +41,7 @@
41query_log = /var/log/sphinxsearch/query.log41query_log = /var/log/sphinxsearch/query.log
42read_timeout = 542read_timeout = 5
43max_children = 3043max_children = 30
44pid_file = /var/run/searchd.pid44pid_file = /var/run/sphinxsearch/searchd.pid
45max_matches = 100045max_matches = 1000
46seamless_rotate = 146seamless_rotate = 1
47preopen_indexes = 047preopen_indexes = 0

Subscribers

People subscribed via source and target branches

to all changes: