Merge lp:~jeremy-munsch/synapse-project/ascii-smart into lp:synapse-project

Proposed by Jeremy Munsch on 2015-11-13
Status: Needs review
Proposed branch: lp:~jeremy-munsch/synapse-project/ascii-smart
Merge into: lp:synapse-project
Diff against target: 190 lines (+145/-3)
2 files modified
src/core/query.vala (+3/-3)
src/core/utils.vala (+142/-0)
To merge this branch: bzr merge lp:~jeremy-munsch/synapse-project/ascii-smart
Reviewer            Review Type   Date Requested   Status
Rico Tzschichholz                 2015-11-13       Needs Information on 2015-11-30
Review via email: mp+277477@code.launchpad.net

Description of the change

Ignore non-ASCII characters such as é and è when matching.
This lets a query typed in plain ASCII match words that contain non-ASCII characters.
The reverse is also true: a query typed with non-ASCII characters matches the same word written with ASCII characters.

E.g. to find Éteindre (the Shutdown action) I can just type eteindre; typing éteindre still works.

This provides more matches by effectively ignoring special characters. It makes Synapse smarter and faster to use while typing. It can be extended by editing the hard-coded strings that hold the mapping.

It is based on this link http://stackoverflow.com/a/16427125
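
As a rough illustration of the idea described above, and not the code from this branch, each ASCII letter of the query can be expanded into a regex character class that also lists its accented variants. The function name `expand_query` and the two-entry table are invented for this sketch. Unlike the branch, which maps an already escaped string, the sketch escapes each character on its own.

```vala
// Hypothetical sketch: expand ASCII letters into regex character classes.
// The table filled in main () is a tiny illustrative subset.
string expand_query (string query, GLib.HashTable<string, string> variants)
{
  var output = new StringBuilder ();
  int index = 0;
  unichar c;

  // get_next_char () walks the string by Unicode character, not by byte
  while (query.get_next_char (ref index, out c))
  {
    string key = c.to_string ();
    unowned string? extra = variants.lookup (key);
    if (extra != null)
      output.append ("[" + key + extra + "]");
    else
      output.append (Regex.escape_string (key));
  }
  return output.str;
}

void main ()
{
  var variants = new GLib.HashTable<string, string> (str_hash, str_equal);
  variants.insert ("e", "èéêë");
  variants.insert ("E", "ÈÉÊË");

  string pattern = expand_query ("eteindre", variants);
  stdout.printf ("%s\n", pattern);

  try
  {
    var re = new Regex (pattern, RegexCompileFlags.CASELESS);
    // print whether each spelling of the action name matches
    stdout.printf ("%s\n", re.match ("Éteindre").to_string ());
    stdout.printf ("%s\n", re.match ("eteindre").to_string ());
  }
  catch (RegexError err)
  {
    stderr.printf ("%s\n", err.message);
  }
}
```

Walking the query with `get_next_char ()` keeps multi-byte characters intact, which matters once the query itself contains accented letters.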

Jeremy Munsch (jeremy-munsch) wrote :

Updated the branch and commit message following recommendations

Rico Tzschichholz (ricotz) wrote :

Looks pretty bold and changes the internal behavior quite a lot.

There are certainly some avoidable performance issues, like needless string copying.
(Make better use of const, unowned strings and GLib.HashTable.)

What I am most curious about is the "reverse" stuff.
If the user explicitly searches for it should there be not-closely-matching results?
Should "additional" results be weighted lower?

review: Needs Information
Jeremy Munsch (jeremy-munsch) wrote :

> What I am most curious about is the "reverse" stuff.
> If the user explicitly searches for it should there be not-closely-matching
> results?
> Should "additional" results be weighted lower?

You make a really good point here. I agree that typing a non-ASCII character adds information to the query. In that case, yes, the additional results could be weighted lower.

I'll make some changes.
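
One way the weighting discussed above could look, purely as a sketch: try the query exactly as typed first, and give a lower score to candidates that match only through the accent mapping. The constants `SCORE_EXACT` and `SCORE_FOLDED` and the function `score_candidate` are invented here and are not Synapse's real relevancy values or API.

```vala
// Hypothetical scoring sketch; names and values are illustrative only.
const int SCORE_EXACT = 100;
const int SCORE_FOLDED = 80;

int score_candidate (string candidate, Regex exact, Regex folded)
{
  if (exact.match (candidate))
    return SCORE_EXACT;   // matched the query as the user typed it
  if (folded.match (candidate))
    return SCORE_FOLDED;  // matched only through the accent mapping
  return 0;               // no match at all
}

void main ()
{
  try
  {
    var exact = new Regex ("eteindre", RegexCompileFlags.CASELESS);
    var folded = new Regex ("[eèéêë]t[eèéêë]indr[eèéêë]",
                            RegexCompileFlags.CASELESS);

    stdout.printf ("%d\n", score_candidate ("eteindre", exact, folded));
    stdout.printf ("%d\n", score_candidate ("Éteindre", exact, folded));
    stdout.printf ("%d\n", score_candidate ("Redémarrer", exact, folded));
  }
  catch (RegexError err)
  {
    stderr.printf ("%s\n", err.message);
  }
}
```

Keeping the two patterns separate costs a second regex match per candidate, but it lets an explicitly typed accent rank above results found only through the mapping.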

Jeremy Munsch (jeremy-munsch) wrote :

Updated this. I have dropped the reverse search, as it would be too complicated to handle; I think this PR gains more by staying as simple as possible.

I tried to write things with performance in mind (within the limits of my skills ><).

It also handles multi-letter replacements (eg: ss => ß :p )

Jeremy Munsch (jeremy-munsch) wrote :

* fixed non-special characters not being output
* fixed string copying by using a string builder
* fixed double and triple matching to ignore letter case: Ss, sS, SS, ss => ß

* added some special uppercase characters and removed them from the lowercase
entries, to honor the caseless regex rule. Double and triple matching remains
caseless, though, as anything else would be too complicated to handle for both
the program and the user.
http://stackoverflow.com/questions/3371697/replacing-accented-characters-php/33856250#33856250
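
The multi-letter case mentioned above (eg: ss => ß) can be pictured as an alternation between the expanded letters and the single special character. The sketch below is an independent illustration, not the algorithm in this branch. The names `expand_single` and `expand_with_multis` and both small tables are assumptions made for the example.

```vala
// Hypothetical sketch of double/triple letter handling.
string expand_single (string ch, GLib.HashTable<string, string> singles)
{
  unowned string? extra = singles.lookup (ch);
  if (extra == null)
    return Regex.escape_string (ch);
  return "[" + ch + extra + "]";
}

string expand_with_multis (string query,
                           GLib.HashTable<string, string> singles,
                           GLib.HashTable<string, string> multis)
{
  // split the query into one string per Unicode character
  string[] chars = {};
  int index = 0;
  unichar c;
  while (query.get_next_char (ref index, out c))
    chars += c.to_string ();

  var output = new StringBuilder ();
  int i = 0;
  while (i < chars.length)
  {
    int matched = 0;
    // try the longest sequence first (eg: "sch" before "ch")
    for (int len = 3; len >= 2 && matched == 0; len--)
    {
      if (i + len > chars.length)
        continue;
      var key = new StringBuilder ();
      for (int j = i; j < i + len; j++)
        key.append (chars[j]);
      // lower-case the key so that Ss, sS, SS and ss all hit "ss"
      unowned string? extra = multis.lookup (key.str.down ());
      if (extra == null)
        continue;
      output.append ("(?:");
      for (int j = i; j < i + len; j++)
        output.append (expand_single (chars[j], singles));
      output.append ("|[" + extra + "])");
      matched = len;
    }
    if (matched == 0)
    {
      output.append (expand_single (chars[i], singles));
      matched = 1;
    }
    i += matched;
  }
  return output.str;
}

void main ()
{
  var singles = new GLib.HashTable<string, string> (str_hash, str_equal);
  singles.insert ("s", "şś");
  var multis = new GLib.HashTable<string, string> (str_hash, str_equal);
  multis.insert ("ss", "ß");

  stdout.printf ("%s\n", expand_with_multis ("strasse", singles, multis));
}
```

Consuming a matched sequence in one step means a run such as "ssss" is grouped pair by pair rather than at every overlapping position.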

Jeremy Munsch (jeremy-munsch) wrote :

I also note that if this is merged, the desktop files plugin would have to be updated to remove its use of Utils.remove_accents, as it will become unnecessary.

Rico Tzschichholz (ricotz) wrote :

Define the strings in a const-array, use a GLib.HashTable and fill it in a loop.
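
A minimal sketch of what that suggestion might look like, assuming a flat const array of key/value pairs; the names `CHAR_MAP`, `char_table` and `lookup_variants` are hypothetical and the entries are a small subset. The lazy initialization shown here is not thread-safe, unlike a `Once`-style guard.

```vala
// Hypothetical layout: string pairs in a const array, loaded into a
// GLib.HashTable in a loop. Unowned type arguments avoid copying the literals.
const string[] CHAR_MAP = {
  "a", "àáâãå",
  "e", "èéêë",
  "ss", "ß"
};

GLib.HashTable<unowned string, unowned string>? char_table;

unowned string? lookup_variants (string ascii)
{
  if (char_table == null)
  {
    char_table = new GLib.HashTable<unowned string, unowned string> (str_hash, str_equal);
    // step through the pairs; the bound comes from the array itself
    for (int i = 0; i + 1 < CHAR_MAP.length; i += 2)
      char_table.insert (CHAR_MAP[i], CHAR_MAP[i + 1]);
  }
  return char_table.lookup (ascii);
}

void main ()
{
  unowned string? found = lookup_variants ("e");
  stdout.printf ("%s\n", found != null ? found : "(none)");
}
```

Taking the loop bound from `CHAR_MAP.length` means adding an entry does not require updating a separate count.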

Rebase on trunk.

review: Needs Fixing
Jeremy Munsch (jeremy-munsch) wrote :

Just did that.

Rico Tzschichholz (ricotz) wrote :

Why are you constantly introducing other changes?

review: Needs Information
Jeremy Munsch (jeremy-munsch) wrote :

Corrected:

* use of a multi-dimensional array with unowned strings
* renamed variables starting with '_'
* added spaces around arithmetic operators for better formatting of expressions

Unmerged revisions

637. By Jeremy Munsch on 2015-12-02

core/utils: add map_special_chars
core/ query: make usage of Utils.map_special_chars

Preview Diff

=== modified file 'src/core/query.vala'
--- src/core/query.vala 2015-11-30 14:46:12 +0000
+++ src/core/query.vala 2015-12-02 20:11:43 +0000
@@ -137,7 +137,7 @@
 
       var results = new Gee.HashMap<Regex, int> ();
       var stripped_query = query.strip ();
-      var escaped_query = Regex.escape_string (stripped_query);
+      var escaped_query = Utils.map_special_chars (Regex.escape_string (stripped_query));
       Regex re;
 
       try
@@ -174,7 +174,7 @@
         string[] escaped_words = {};
         foreach (unowned string word in individual_words)
         {
-          escaped_words += Regex.escape_string (word);
+          escaped_words += Utils.map_special_chars (Regex.escape_string (word));
         }
         string pattern = "\\b(%s)".printf (string.joinv (").+\\b(",
                                                          escaped_words));
@@ -246,7 +246,7 @@
         string[] escaped_chars = {};
         foreach (unowned string word in individual_chars)
         {
-          escaped_chars += Regex.escape_string (word);
+          escaped_chars += Utils.map_special_chars (Regex.escape_string (word));
         }
 
         // make "aj" match "Activity Journal"

=== modified file 'src/core/utils.vala'
--- src/core/utils.vala 2015-11-19 09:11:34 +0000
+++ src/core/utils.vala 2015-12-02 20:11:43 +0000
@@ -23,6 +23,9 @@
   [CCode (gir_namespace = "SynapseUtils", gir_version = "1.0")]
   namespace Utils
   {
+    static GLib.HashTable<string, string> spec_char_map;
+    private static size_t spec_char_map_initializer;
+
     /* Make sure setlocale was called before calling this function
      * (Gtk.init calls it automatically)
      */
@@ -46,6 +49,145 @@
       return result;
     }
 
+    /**
+     * This method rewrite a string (regex),
+     * to transform special non ascii chars
+     * and include option matching their equivalent (non)ascii char
+     * eg: (french) éteindre 'shutdown' becomes [ée]teindre
+     * This allows synapse to be globally smarter and having
+     * a faster usage for foreign languages mostly.
+     *
+     * Source map is from thread above StackOverflow thread
+     * http://stackoverflow.com/a/16427125
+     */
+    public static string map_special_chars (string query)
+    {
+      if (Once.init_enter (&spec_char_map_initializer))
+      {
+        (unowned string)[,] charmap = {{"-", "ъьЪЬ"},
+                                       {"A", "АĂǍĄÀÃÁÆÂÅǺĀא"},
+                                       {"B", "БבÞ"},
+                                       {"C", "ĈĆÇЦצĊČ©ץ"},
+                                       {"CH", "Ч"},
+                                       {"D", "ДĎĐדÐ"},
+                                       {"E", "ÈĘÉËÊЕĒĖĚĔЄƏע"},
+                                       {"F", "ФƑ"},
+                                       {"G", "ĞĠĢĜГגҐ"},
+                                       {"H", "חĦХĤה"},
+                                       {"I", "IÏÎÍÌĮĬIИĨǏיЇĪІ"},
+                                       {"J", "ЙĴ"},
+                                       {"K", "ĸכĶКך"},
+                                       {"L", "ŁĿЛĻĹĽל"},
+                                       {"M", "מМם"},
+                                       {"N", "ÑŃНŅןŊנʼnŇ"},
+                                       {"O", "ØÓÒÔÕОŐŎŌǾǑƠ"},
+                                       {"P", "פףП"},
+                                       {"Q", "ק"},
+                                       {"R", "ŔŘŖרР®"},
+                                       {"S", "ŞŚȘŠСŜס"},
+                                       {"T", "ТȚטŦתŤŢ"},
+                                       {"U", "ÙÛÚŪУŨƯǓŲŬŮŰǕǛǙǗ"},
+                                       {"V", "Вו"},
+                                       {"Y", "ÝЫŶŸ"},
+                                       {"Z", "ŹŽŻЗזS"},
+                                       {"a", "аăǎąàãáæâåǻāא"},
+                                       {"b", "бבþ"},
+                                       {"c", "ĉćçцצċč©ץ"},
+                                       {"d", "дďđדð"},
+                                       {"e", "èęéëêеēėěĕєəע"},
+                                       {"f", "фƒ"},
+                                       {"g", "ğġģĝгגґ"},
+                                       {"h", "חħхĥה"},
+                                       {"i", "iïîíìįĭıиĩǐיїīі"},
+                                       {"j", "йĵ"},
+                                       {"k", "ĸכķкך"},
+                                       {"l", "łŀлļĺľל"},
+                                       {"m", "מмם"},
+                                       {"n", "ñńнņןŋנʼnň"},
+                                       {"o", "øóòôõоőŏōǿǒơ"},
+                                       {"p", "פףп"},
+                                       {"q", "ק"},
+                                       {"r", "ŕřŗרр®"},
+                                       {"s", "şśșšсŝס"},
+                                       {"t", "тțטŧתťţ"},
+                                       {"u", "ùûúūуũưǔųŭůűǖǜǚǘ"},
+                                       {"v", "вו"},
+                                       {"y", "ýыŷÿ"},
+                                       {"z", "źžżзזſ"},
+                                       {"tm", "™"},
+                                       {"at", "@"},
+                                       {"ae", "ÄǼäæǽ"},
+                                       {"ch", "Чч"},
+                                       {"ij", "ijIJ"},
+                                       {"j", "йЙĴĵ"},
+                                       {"ja", "яЯ"},
+                                       {"je", "Ээ"},
+                                       {"jo", "ёЁ"},
+                                       {"ju", "юЮ"},
+                                       {"oe", "œŒöÖ"},
+                                       {"sch", "щЩ"},
+                                       {"sh", "шШ"},
+                                       {"ss", "ß"},
+                                       {"tm", "™"},
+                                       {"ue", "Ü"},
+                                       {"zh", "Жж"}};
+
+        spec_char_map = new GLib.HashTable<string, string> (str_hash, str_equal);
+
+        for (int i = 0; i < 67; i++)
+        {
+          spec_char_map.set (charmap[i, 0], charmap[i, 1]);
+        }
+
+        size_t spec_char_map_initializer_value = 42;
+        Once.init_leave (&spec_char_map_initializer, spec_char_map_initializer_value);
+      }
+
+      var query_list = new Gee.ArrayList<string>();
+      string query_char = "";
+      string retro_query_char = "";
+      string replace = "";
+      StringBuilder output = new StringBuilder ();
+      int q_pos = 0, qlist_post = 0;
+
+      while (q_pos < query.length)
+      {
+        query_char = query.get (q_pos).to_string ();
+        if (spec_char_map.contains (query_char))
+        {
+          // simple letter replace
+          query_list.add ("[" + query_char + spec_char_map.get (query_char) + "]");
+
+          // triple letter replace (eg: sch)
+          if (q_pos > 1 && spec_char_map.contains (retro_query_char = query.slice (q_pos - 2, q_pos + 1).down ()))
+          {
+            query_list.insert (qlist_post - 2, "(?:");
+            query_list.insert (qlist_post + 2, "|[" + spec_char_map.get (retro_query_char) + "])");
+            qlist_post += 2;
+          }
+          // double letter replace (eg: ss, zh), handles repeats (eg: ssssss)
+          else if ( q_pos > 0
+            && spec_char_map.contains (retro_query_char = query.slice (q_pos - 1, q_pos + 1).down ())
+            && (replace = "|[" + spec_char_map.get (retro_query_char) + "])") != query_list.get (qlist_post - 1))
+          {
+            query_list.insert (qlist_post - 1, "(?:");
+            query_list.insert (qlist_post + 2, replace);
+            qlist_post += 2;
+          }
+        }
+        else
+          query_list.add (query_char);
+        qlist_post++;
+        q_pos++;
+      }
+
+      foreach (string regex_keystroke in query_list)
+      {
+        output.append (regex_keystroke);
+      }
+      return output.str;
+    }
+
     public static string? remove_last_unichar (string input)
     {
       long char_count = input.char_count ();