Merge lp:~pmarchwiak/synapse-project/recoll-plugin into lp:synapse-project

Proposed by Patrick Marchwiak
Status: Needs review
Proposed branch: lp:~pmarchwiak/synapse-project/recoll-plugin
Merge into: lp:synapse-project
Diff against target: 217 lines (+191/-0)
3 files modified
src/plugins/Makefile.am (+1/-0)
src/plugins/recoll-plugin.vala (+189/-0)
src/ui/synapse-main.vala (+1/-0)
To merge this branch: bzr merge lp:~pmarchwiak/synapse-project/recoll-plugin
Reviewer Review Type Date Requested Status
Michal Hruby Pending
Review via email: mp+133784@code.launchpad.net

Description of the change

This plugin runs queries against a Recoll index providing search results based on file contents (document types supported by Recoll include msword and PDF).

Feature summary for the current implementation:
* returns up to 20 UriMatches, using the filename as the title and filename + abstract (sample of doc contents) as the description
* results are scored lower than results from the locate plugin to ensure file name matches show up first in the results
* plugin is only enabled when the recoll command is available

To post a comment you must log in.
Revision history for this message
Michal Hruby (mhr3) wrote :

Thanks for the contribution, it's looking pretty good, although I do have one gripe:

I'm not a fan of spawning processes in the search() method of an ItemProvider plugin - this is called on every keystroke and is pretty heavy on the system. That's why "Locate" is hidden behind an action, and before calculating with `bc` the query is first checked against a regex. Could you please do something like that?

Otherwise, the code is clean, nice job ;)

Revision history for this message
Patrick Marchwiak (pmarchwiak) wrote :

I understand your concern.

I feel there's more motivation for the action approach with "Locate" since it is quite common for users to have a large enough number of files that it would take too long to run. I don't see how checking a regex would help in the search index situation, since anything a user types is a valid search. In my (limited) testing on a more recent i5 laptop as well as an older core duo laptop, results have been quick to return with this plugin. However, I do appreciate that spawning a process could be taxing on other systems. The other drawback of hiding it behind an action is that it would require additional keystrokes to get to the results, which makes it less useful to me.

I'm not sure when I'll find the free time, but I will consider making hiding it behind an action.

Revision history for this message
Ari (ari-lp) wrote :

Hi Patrick,

first of all thank you very much for this plugin. I have been searching for a way to integrate Recoll into Synapse for a while now and I am very glad I found your code. I am not a coder myself but I just wanted to say that I agree with Michal's remarks. If you're working with a large index - say thousands of documents - searching can be quite resource-intensive, both in terms of CPU cycles and hard disk activity. Engaging a search while the user is still typing is counter-productive in these use cases. That's why I would love to see this plugin implemented as an action. It would be fantastic if you could modify it in that regard.

Thank you for taking the time to read this.

Cheers
--Ari

Unmerged revisions

511. By Patrick Marchwiak

Set the description to file name + abstract in recoll plugin

510. By Patrick Marchwiak

Turn down match score and use Utils.FileInfo to build better results in recoll plugin

509. By Patrick Marchwiak

Remove python script for accessing recoll and invoke cmd line tool directly

508. By Patrick Marchwiak

Add initial version of recoll plugin

Preview Diff

[H/L] Next/Prev Comment, [J/K] Next/Prev File, [N/P] Next/Prev Hunk
1=== modified file 'src/plugins/Makefile.am'
2--- src/plugins/Makefile.am 2012-03-18 19:11:23 +0000
3+++ src/plugins/Makefile.am 2012-11-10 07:41:20 +0000
4@@ -48,6 +48,7 @@
5 opensearch.vala \
6 pastebin-plugin.vala \
7 pidgin-plugin.vala \
8+ recoll-plugin.vala \
9 rhythmbox-plugin.vala \
10 selection-plugin.vala \
11 test-slow-plugin.vala \
12
13=== added file 'src/plugins/recoll-plugin.vala'
14--- src/plugins/recoll-plugin.vala 1970-01-01 00:00:00 +0000
15+++ src/plugins/recoll-plugin.vala 2012-11-10 07:41:20 +0000
16@@ -0,0 +1,189 @@
17+ /*
18+ *
19+ * Authored by Patrick Marchwiak <pd@marchwiak.com>
20+ *
21+ */
22+
23+ namespace Synapse
24+ {
25+
26+ // Sends query to Recoll command line tool.
27+ public class RecollPlugin : Object, Activatable, ItemProvider
28+ {
29+ // a mandatory property
30+ public bool enabled { get; set; default = true; }
31+
32+ // this method is called when a plugin is enabled
33+ // use it to initialize your plugin
34+ public void activate ()
35+ {
36+ }
37+
38+ // this method is called when a plugin is disabled
39+ // use it to free the resources you're using
40+ public void deactivate ()
41+ {
42+ }
43+
44+ // register your plugin in the UI
45+ static void register_plugin ()
46+ {
47+ DataSink.PluginRegistry.get_default ().register_plugin (
48+ typeof (RecollPlugin),
49+ _ ("Recoll"), // plugin title
50+ _ ("Returns results of full text search against an existing Recoll index."), // description
51+ "recoll", // icon name
52+ register_plugin, // reference to this function
53+ Environment.find_program_in_path ("recoll") != null, // true if user's system has all required components which the plugin needs
54+ _ ("recoll is not installed") // error message
55+ );
56+ }
57+
58+ static construct
59+ {
60+ // register the plugin when the class is constructed
61+ register_plugin ();
62+ }
63+
64+ // an optional method to improve the speed of searches,
65+ // if you return false here, the search method won't be called
66+ // for this query
67+ public bool handles_query (Query query)
68+ {
69+ return (QueryFlags.FILES in query.query_type);
70+ }
71+
72+ enum LineType {
73+ FIELDS,
74+ ABSTRACT_START,
75+ ABSTRACT,
76+ ABSTRACT_END;
77+ }
78+
79+ public async ResultSet? search (Query query) throws SearchError
80+ {
81+ Pid pid;
82+ int read_fd, write_fd;
83+ string[] argv = {"recoll",
84+ "-t", // command line mode
85+ "-n", // indices of results
86+ "0-20", // return first 20 results
87+ "-a", // ALL TERMS mode
88+ "-A", // output abstracts
89+ query.query_string};
90+
91+ try
92+ {
93+ Process.spawn_async_with_pipes (null, argv, null,
94+ SpawnFlags.SEARCH_PATH,
95+ null, out pid, out write_fd, out read_fd);
96+ UnixInputStream read_stream = new UnixInputStream (read_fd, true);
97+ DataInputStream recoll_output = new DataInputStream (read_stream);
98+
99+ // Sample output from `recoll -t` :
100+ // ===============================
101+ // Recoll query: ((kernel:(wqf=11) OR kernels OR kernelize OR kernelized))
102+ // 4725 results (printing 1 max):
103+ // text/plain [file:///home/patrick/code/sample-results.txt] [sample-results.txt] 8806 bytes
104+ // ABSTRACT
105+ // some text summarizing the document usually has the keyword (kernel)
106+ // /ABSTRACT
107+
108+ string line = null;
109+ var next_line_type = LineType.FIELDS;
110+ ResultSet results = new ResultSet ();
111+ Utils.FileInfo result = null;
112+ int line_idx = 0;
113+ string description = null;
114+ while ((line = yield recoll_output.read_line_async (Priority.DEFAULT)) != null)
115+ {
116+ if (line_idx >= 2) // skip first two lines
117+ {
118+ if (next_line_type == LineType.FIELDS)
119+ {
120+ string[] fields = line.split("\t");
121+
122+ //string mimetype = fields[0];
123+
124+ string uri = fields[1];
125+ uri = uri.substring(1, uri.length - 2);
126+
127+ description = uri.split("://")[1];
128+
129+ // FIXME: recoll already gives us the mimetype so FileInfo
130+ // is doing extra work obtaining it from the file
131+ result = new Utils.FileInfo (uri, typeof (MatchObject));
132+
133+ yield result.initialize ();
134+
135+ next_line_type = LineType.ABSTRACT_START;
136+ }
137+ else if (next_line_type == LineType.ABSTRACT_START)
138+ {
139+ // TODO check for the start of the abstract
140+ next_line_type = LineType.ABSTRACT;
141+ }
142+ else if (next_line_type == LineType.ABSTRACT)
143+ {
144+ line = line.chug().chomp();
145+ if (line != null && line != "")
146+ {
147+ description = description + ": " + line;
148+ }
149+ next_line_type = LineType.ABSTRACT_END;
150+ }
151+ else if (next_line_type == LineType.ABSTRACT_END)
152+ {
153+ // TODO check for the end of the abstract
154+
155+ // TODO use relevancy rating to set match score
156+
157+ // score defaults to just under that of results from the locate plugin
158+ result.match_obj.description = description;
159+ results.add(result.match_obj, Match.Score.INCREMENT_MINOR * 2);
160+ next_line_type = LineType.FIELDS;
161+ }
162+ else
163+ {
164+ // TODO handle unexpected output ?
165+ }
166+ }
167+ line_idx++;
168+ }
169+
170+ return results;
171+ }
172+ catch (Error err)
173+ {
174+ if (!query.is_cancelled ()) warning ("%s", err.message);
175+ }
176+
177+ // make sure this method is called before returning any results
178+ query.check_cancellable ();
179+ return null;
180+ }
181+
182+ private class MatchObject: Object, Match, UriMatch
183+ {
184+ // from Match interface
185+ public string title { get; construct set; }
186+ public string description { get; set; }
187+ public string icon_name { get; construct set; }
188+ public bool has_thumbnail { get; construct set; }
189+ public string thumbnail_path { get; construct set; }
190+ public MatchType match_type { get; construct set; }
191+
192+ // from UriMatch interface
193+ public string uri { get; set; }
194+ public QueryFlags file_type { get; set; }
195+ public string mime_type { get; set; }
196+
197+ public int default_relevancy { get; set; default = 0; }
198+
199+ public MatchObject ()
200+ {
201+ Object (match_type: MatchType.GENERIC_URI);
202+ }
203+ }
204+ }
205+}
206
207=== modified file 'src/ui/synapse-main.vala'
208--- src/ui/synapse-main.vala 2012-03-18 16:03:54 +0000
209+++ src/ui/synapse-main.vala 2012-11-10 07:41:20 +0000
210@@ -186,6 +186,7 @@
211 #if HAVE_LIBREST
212 typeof (ImgUrPlugin),
213 #endif
214+ typeof (RecollPlugin),
215 // action-only plugins
216 typeof (DevhelpPlugin),
217 typeof (OpenSearchPlugin),