Merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module

Proposed by Diogo Simões
Status: Rejected
Rejected by: Chris Hillery
Proposed branch: lp:~diogo-simoes89/zorba/data-cleaning-thesaurus
Merge into: lp:zorba/data-cleaning-module
Diff against target: 203 lines (+148/-0)
9 files modified
src/com/zorba-xquery/www/modules/data-cleaning/CMakeLists.txt (+3/-0)
src/com/zorba-xquery/www/modules/data-cleaning/normalization.xq (+22/-0)
src/com/zorba-xquery/www/modules/data-cleaning/thesaurus-based.xq (+74/-0)
test/ExpQueryResults/data-cleaning/normalization/capitalize.xml.res (+1/-0)
test/ExpQueryResults/data-cleaning/thesaurus-based/check-related.xml.res (+1/-0)
test/ExpQueryResults/data-cleaning/thesaurus-based/related-terms.xml.res (+1/-0)
test/Queries/data-cleaning/normalization/capitalize.xq (+3/-0)
test/Queries/data-cleaning/thesaurus-based/check-related.xq (+25/-0)
test/Queries/data-cleaning/thesaurus-based/related-terms.xq (+18/-0)
To merge this branch: bzr merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus
Reviewer: Zorba Coders
Review status: Pending
Review via email: mp+100683@code.launchpad.net

Commit message

Added the capitalize function and the thesaurus-based module.

Description of the change

This revision includes a new normalization function: capitalize($string as xs:string) as xs:string.

It also includes the thesaurus-based module, with the functions check-related($s1 as xs:string, $s2 as xs:string, $uri as xs:string, $type as xs:string) and related-terms($s1 as xs:string, $uri as xs:string, $type as xs:string).

Zorba Build Bot (zorba-buildbot) wrote :

The attempt to merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:274 (message):
  Validation queue job data-cleaning-thesaurus-2012-04-04T15-40-20.194Z is
  finished. The final status was:

  8 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake

Zorba Build Bot (zorba-buildbot) wrote :

The attempt to merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:274 (message):
  Validation queue job data-cleaning-thesaurus-2012-04-11T16-16-43.833Z is
  finished. The final status was:

  3 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake

Zorba Build Bot (zorba-buildbot) wrote :

The attempt to merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:274 (message):
  Validation queue job data-cleaning-thesaurus-2012-04-16T13-01-49.132Z is
  finished. The final status was:

  2 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake

Zorba Build Bot (zorba-buildbot) wrote :

The attempt to merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:274 (message):
  Validation queue job data-cleaning-thesaurus-2012-04-17T11-34-47.556Z is
  finished. The final status was:

  2 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake

Zorba Build Bot (zorba-buildbot) wrote :

The attempt to merge lp:~diogo-simoes89/zorba/data-cleaning-thesaurus into lp:zorba/data-cleaning-module failed. Below is the output from the failed tests.

CMake Error at /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake:274 (message):
  Validation queue job data-cleaning-thesaurus-2012-04-18T11-37-19.07Z is
  finished. The final status was:

  5 tests did not succeed - changes not commited.

Error in read script: /home/ceej/zo/testing/zorbatest/tester/TarmacLander.cmake

Paul J. Lucas (paul-lucas) wrote :

The capitalize() function apparently only works for English. IMHO, this is a severe limitation. That aside, it probably ought to be named something like title-capitalize().
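
A minimal sketch of how the English-only limitation might be loosened, assuming the list of words left in lower case is supplied by the caller instead of being fixed inside the function. The name local:title-capitalize and the $minor-words parameter are hypothetical and are not part of the proposed branch. The sketch also lower-cases each token before comparing it against the list, which the proposed function does not do.

```xquery
(: Hypothetical helper, not part of the proposed branch. :)
declare function local:title-capitalize(
  $string as xs:string,
  $minor-words as xs:string*
) as xs:string {
  string-join(
    (: normalize-space() collapses runs of whitespace so no empty tokens appear :)
    for $token at $pos in tokenize(normalize-space($string), " ")
    let $lower := lower-case($token)
    return
      (: the first word is always capitalized; later minor words stay lower-case :)
      if ($pos gt 1 and $lower = $minor-words)
      then $lower
      else concat(upper-case(substring($lower, 1, 1)), substring($lower, 2)),
    " ")
};

(: Same input as the branch's capitalize.xq test query. :)
local:title-capitalize(
  "the lord of the rings",
  ("a", "an", "the", "but", "as", "if", "and", "or", "nor", "of"))
(: yields "The Lord of the Rings" :)
```

A caller working in another language would pass that language's own word list; the capitalization rule itself (first letter upper-case, remainder lower-case) is still an assumption that does not hold for every language.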

The thesaurus:check-related() function doesn't seem right. First, ft:thesaurus-lookup() does *not* return the relationship between two phrases. It returns the phrases related to a given phrase that *have* the relationship you give it (if you use the signature that takes a $relationship).

Second, you're (again) hard-coding English (and yet not hard-coding the URI). It seems like that's all related-terms() does. Why bother? It just adds an extra layer of function and documentation with very little utility, IMHO.

Third, the documentation for check-related() isn't clear. Describing $s1 as the first string and $s2 as the second string is insufficient. The order matters.
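
The ordering point can be made concrete with a sketch, assuming the same four-argument ft:thesaurus-lookup() call that the proposed module uses. The names local:check-related, $candidate, $phrase and $lang are hypothetical; the extra $lang parameter is only there to avoid hard-coding English. The query needs a thesaurus mapped to the given URI (for example through the --thesaurus argument shown in the branch's test queries), so it will not run without that configuration.

```xquery
import module namespace ft = "http://www.zorba-xquery.com/modules/full-text";

(:~
 : Hypothetical variant of check-related() with the argument roles spelled out.
 :
 : @param $candidate The term expected to appear in the lookup result.
 : @param $phrase The phrase that is looked up in the thesaurus.
 : @param $uri The URI of the thesaurus.
 : @param $lang The language of $phrase (assumed parameter, replaces the hard-coded "en").
 : @param $relationship The relationship the returned phrases must have, e.g. "BT".
 : @return true if $candidate is among the phrases returned for $phrase.
 :)
declare function local:check-related(
  $candidate as xs:string,
  $phrase as xs:string,
  $uri as xs:string,
  $lang as xs:language,
  $relationship as xs:string
) as xs:boolean {
  (: The lookup returns the phrases related to $phrase; the general
     comparison is true if any of them equals $candidate. :)
  $candidate = ft:thesaurus-lookup($uri, $phrase, $lang, $relationship)
};

(: Same terms as the branch's check-related.xq test query. :)
local:check-related("animal", "dog", "http://wordnet.princeton.edu",
                    xs:language("en"), "BT")
```

The branch's expected-results file records true for the corresponding call. Swapping the first two arguments asks a different question (whether "dog" is listed among the "BT" terms of "animal"), which is why the parameter documentation needs to say which string is looked up.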

FYI: the way you currently specify a thesaurus (http://wordnet.princeton.edu:=$RBKT_BINARY_DIR/thesauri/wordnet-en.zth) is going to change.

Chris Hillery (ceejatec) wrote :

Per Helena, this is no longer needed (stale proposal and the existing module is sufficient).

Unmerged revisions

46. By Diogo Simões

Adding thesaurus-based module (CMakeLists)

45. By Diogo Simões

Added the capitalize function and the thesaurus-based module.

44. By Diogo Simões

thesaurus-based bugs fixing

43. By Diogo Simões

Capitalize funtion and thesaurus-based module

42. By Diogo Simões

Capitalize funtion and thesaurus-based module

41. By Diogo Simões

Capitalize and thesaurus

40. By Diogo Simões

Changes in documentation

39. By Diogo Simões

Capitalize function and implementation of the thesaurus-based module

38. By Diogo Simões

New normalization functions and implementation of the thesaurus-based module.

Preview Diff

=== modified file 'src/com/zorba-xquery/www/modules/data-cleaning/CMakeLists.txt'
--- src/com/zorba-xquery/www/modules/data-cleaning/CMakeLists.txt 2011-08-07 20:36:50 +0000
+++ src/com/zorba-xquery/www/modules/data-cleaning/CMakeLists.txt 2012-04-18 10:50:24 +0000
@@ -38,3 +38,6 @@

 DECLARE_ZORBA_MODULE (URI "http://www.zorba-xquery.com/modules/data-cleaning/token-based-string-similarity"
 VERSION 2.0 FILE "token-based-string-similarity.xq")
+
+DECLARE_ZORBA_MODULE (URI "http://www.zorba-xquery.com/modules/data-cleaning/thesaurus-based"
+ VERSION 2.0 FILE "thesaurus-based.xq")

=== modified file 'src/com/zorba-xquery/www/modules/data-cleaning/normalization.xq'
--- src/com/zorba-xquery/www/modules/data-cleaning/normalization.xq 2012-04-11 09:50:34 +0000
+++ src/com/zorba-xquery/www/modules/data-cleaning/normalization.xq 2012-04-18 10:50:24 +0000
@@ -37,6 +37,28 @@
 declare option ver:module-version "2.0";

 (:~
+: Converts a given string into a capitalized representation.
+:
+: @param $string The string to be capitalized.
+:
+: @return The string resulting from the conversion.
+: @example test/Queries/data-cleaning/normalization/capitalize.xq
+:)
+declare function normalization:capitalize ($string as xs:string) as xs:string{
+ let $ttokens := tokenize ($string, " ")
+ let $non-capitalized-words := ("a", "an", "the", "but", "as", "if", "and", "or", "nor", "of")
+ let $cap-tokens :=
+ for $toks in $ttokens[position()>1]
+ let $capitalized-tokens :=
+ if (not($non-capitalized-words = $toks))
+ then concat(upper-case(substring($toks, 1,1)), substring(lower-case($toks), 2), " ")
+ else concat(lower-case($toks), " ")
+ return $capitalized-tokens
+ let $cap-string := concat(concat(upper-case(substring($ttokens[position()=1], 1,1)), substring(lower-case($ttokens[position()=1]), 2), " "), string-join($cap-tokens))
+ return substring($cap-string, 1, string-length($cap-string)-1)
+};
+
+(:~
 : Converts a given string representation of a date value into a date representation valid according
 : to the corresponding XML Schema type.
 :

=== added file 'src/com/zorba-xquery/www/modules/data-cleaning/thesaurus-based.xq'
--- src/com/zorba-xquery/www/modules/data-cleaning/thesaurus-based.xq 1970-01-01 00:00:00 +0000
+++ src/com/zorba-xquery/www/modules/data-cleaning/thesaurus-based.xq 2012-04-18 10:50:24 +0000
@@ -0,0 +1,74 @@
+(:
+ : Copyright 2006-2009 The FLWOR Foundation.
+ :
+ : Licensed under the Apache License, Version 2.0 (the "License");
+ : you may not use this file except in compliance with the License.
+ : You may obtain a copy of the License at
+ :
+ : http://www.apache.org/licenses/LICENSE-2.0
+ :
+ : Unless required by applicable law or agreed to in writing, software
+ : distributed under the License is distributed on an "AS IS" BASIS,
+ : WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ : See the License for the specific language governing permissions and
+ : limitations under the License.
+ :)
+
+(:~
+ : This library module provides thesaurus functions for checking semantic relations between strings
+ : and for checking abbreviations.
+
+ : These functions are particularly useful in tasks related to the creation of semantic mappings.
+ :
+ :
+ : @author Bruno Martins and Diogo Simões
+ :)
+
+module namespace thesaurus = "http://www.zorba-xquery.com/modules/data-cleaning/thesaurus-based";
+
+import module namespace ft = "http://www.zorba-xquery.com/modules/full-text";
+
+(:~
+ : Checks if two strings have a relationship defined in a given thesaurus.
+ : The implementation of this function depends on the full-text module.
+ :
+ :
+ : @param $s1 The first string.
+ : @param $s2 The second string.
+ : @param $uri The uri of the thesaurus to be considered.
+ : @param $type An identifyer for the type of relationship.
+ :
+ : @return true if the first string has the provided relationship with the second string defined in the thesaurus and false otherwise.
+ : @example test/Queries/data-cleaning/thesaurus-based/check-related.xq
+ :
+ :)
+declare function thesaurus:check-related ( $s1 as xs:string, $s2 as xs:string, $uri as xs:string, $type as xs:string ) as xs:boolean {
+ let $relation := ft:thesaurus-lookup( $uri,
+ $s2,
+ xs:language("en"),
+ $type )
+ return $relation = $s1
+};
+
+(:~
+ : Returns a sequence with the strings that have a relationship,
+ : defined in a given thesaurus, with the string provided as input.
+ : The implementation of this function depends on the full-text module.
+ :
+ :
+ : @param $s1 The string with the query term.
+ : @param $uri The uri of the thesaurus to be considered.
+ : @param $type An identifyer for the type of relationship.
+ :
+ : @return A sequence with the strings that have the provided relationship, defined in the thesaurus, with the query term.
+ : @example test/Queries/data-cleaning/thesaurus-based/related-terms.xq
+ :)
+declare function thesaurus:related-terms ( $s1 as xs:string, $uri as xs:string, $type as xs:string ) as xs:string* {
+ let $synonyms := ft:thesaurus-lookup( $uri,
+ $s1,
+ xs:language("en"),
+ $type )
+ return $synonyms
+};
+
+

=== added file 'test/ExpQueryResults/data-cleaning/normalization/capitalize.xml.res'
--- test/ExpQueryResults/data-cleaning/normalization/capitalize.xml.res 1970-01-01 00:00:00 +0000
+++ test/ExpQueryResults/data-cleaning/normalization/capitalize.xml.res 2012-04-18 10:50:24 +0000
@@ -0,0 +1,1 @@
+The Lord of the Rings

=== added directory 'test/ExpQueryResults/data-cleaning/thesaurus-based'
=== added file 'test/ExpQueryResults/data-cleaning/thesaurus-based/check-related.xml.res'
--- test/ExpQueryResults/data-cleaning/thesaurus-based/check-related.xml.res 1970-01-01 00:00:00 +0000
+++ test/ExpQueryResults/data-cleaning/thesaurus-based/check-related.xml.res 2012-04-18 10:50:24 +0000
@@ -0,0 +1,1 @@
+true

=== added file 'test/ExpQueryResults/data-cleaning/thesaurus-based/related-terms.xml.res'
--- test/ExpQueryResults/data-cleaning/thesaurus-based/related-terms.xml.res 1970-01-01 00:00:00 +0000
+++ test/ExpQueryResults/data-cleaning/thesaurus-based/related-terms.xml.res 2012-04-18 10:50:24 +0000
@@ -0,0 +1,1 @@
+chromatic color chromatic colour spectral color spectral colour clothing article of clothing vesture wear wearable habiliment organization organisation sky dye dyestuff amobarbital lycaenid lycaenid butterfly discolor discolour colour color coloring colouring covering consumer goods social group atmosphere coloring material colouring material barbiturate truth serum truth drug butterfly change

=== added file 'test/Queries/data-cleaning/normalization/capitalize.xq'
--- test/Queries/data-cleaning/normalization/capitalize.xq 1970-01-01 00:00:00 +0000
+++ test/Queries/data-cleaning/normalization/capitalize.xq 2012-04-18 10:50:24 +0000
@@ -0,0 +1,3 @@
+import module namespace normalization = "http://www.zorba-xquery.com/modules/data-cleaning/normalization";
+
+normalization:capitalize ("the lord of the rings")

=== added directory 'test/Queries/data-cleaning/thesaurus-based'
=== added file 'test/Queries/data-cleaning/thesaurus-based/check-related.xq'
--- test/Queries/data-cleaning/thesaurus-based/check-related.xq 1970-01-01 00:00:00 +0000
+++ test/Queries/data-cleaning/thesaurus-based/check-related.xq 2012-04-18 10:50:24 +0000
@@ -0,0 +1,25 @@
+import module namespace thesaurus = "http://www.zorba-xquery.com/modules/data-cleaning/thesaurus-based";
+
+thesaurus:check-related ( "animal", "dog", "http://wordnet.princeton.edu", "BT" )
+
+(: Example configuration (taken from zorba testsuite):
+
+Args:
+--thesaurus
+http://wordnet.princeton.edu:=$RBKT_BINARY_DIR/thesauri/wordnet-en.zth
+
+
+---------------------------------------------------------------------------------------
+Args: --thesaurus http://wordnet.princeton.edu:=$RBKT_BINARY_DIR/thesauri/wordnet-en.zth
+
+---------------------------------------------------------------------------------------
+
+
+
+Expected output:
+
+true
+
+
+:)
+

=== added file 'test/Queries/data-cleaning/thesaurus-based/related-terms.xq'
--- test/Queries/data-cleaning/thesaurus-based/related-terms.xq 1970-01-01 00:00:00 +0000
+++ test/Queries/data-cleaning/thesaurus-based/related-terms.xq 2012-04-18 10:50:24 +0000
@@ -0,0 +1,18 @@
+import module namespace thesaurus = "http://www.zorba-xquery.com/modules/data-cleaning/thesaurus-based";
+
+thesaurus:related-terms( "blue", "http://wordnet.princeton.edu", "BT" )
+
+(: Example configuration (taken from zorba testsuite):
+
+Args:
+--thesaurus
+http://wordnet.princeton.edu:=$RBKT_BINARY_DIR/thesauri/wordnet-en.zth
+
+
+---------------------------------------------------------------------------------------
+Args: --thesaurus http://wordnet.princeton.edu:=$RBKT_BINARY_DIR/thesauri/wordnet-en.zth
+
+---------------------------------------------------------------------------------------
+
+
+:)
