Merge lp:~tristan-rivoallan/vanilla-miner/vm-594599 into lp:vanilla-miner

Proposed by Tristan Rivoallan
Status: Merged
Merged at revision: 49
Proposed branch: lp:~tristan-rivoallan/vanilla-miner/vm-594599
Merge into: lp:vanilla-miner
Diff against target: 521 lines (+279/-110)
12 files modified
apps/frontend/modules/resource/templates/_documentation/link/schema.php (+33/-25)
config/doctrine/schema.yml (+14/-2)
config/search.yml (+3/-1)
config/solr/IndexA_fr/conf/schema.xml (+2/-0)
data/utils/proxy.php (+13/-0)
lib/filter/doctrine/ResourceTypeFormFilter.class.php (+0/-16)
lib/form/doctrine/ResourceTypeForm.class.php (+0/-16)
lib/model/doctrine/ResourceType.class.php (+0/-15)
lib/model/doctrine/ResourceTypeTable.class.php (+0/-11)
lib/task/minerExpandLinksTask.class.php (+178/-0)
lib/task/minerExtractlinksTask.class.php (+30/-23)
lib/vendor/CI/Search/Link/Segment.php (+6/-1)
To merge this branch: bzr merge lp:~tristan-rivoallan/vanilla-miner/vm-594599
Reviewer            Review Type   Date Requested   Status
Tristan Rivoallan                                  Approve
Review via email: mp+28352@code.launchpad.net
Tristan Rivoallan (tristan-rivoallan) :
review: Approve

Preview Diff

1=== modified file 'apps/frontend/modules/resource/templates/_documentation/link/schema.php'
2--- apps/frontend/modules/resource/templates/_documentation/link/schema.php 2010-06-03 16:39:07 +0000
3+++ apps/frontend/modules/resource/templates/_documentation/link/schema.php 2010-06-23 20:53:23 +0000
4@@ -1,40 +1,32 @@
5 <p>Cette collection expose les attributs suivants :</p>
6
7-<h4 id="schema-url">url</h4>
8-<p>C'est l'URL vers la ressource. Par exemple : </p>
9-<pre>http://www.glafouk.com/dlz/radioclash_astrotease.mp3</pre>
10-
11-<h4 id="schema-domain_fqdn">domain_fqdn</h4>
12-<p>C'est le nom de domaine complet de l'URL vers la ressource. Par exemple :</p>
13-<pre>data.musiques-incongrues.net</pre>
14-
15-<h4 id="schema-domain_parent">domain_parent</h4>
16-<p>C'est le domaine parent de l'URL vers la ressource. Par exemple :</p>
17-<pre>musiques-incongrues.net</pre>
18-<p>Les deux URLs http://www.musiques-incongrues.net et http://data.musiques-incongrues.net ont un domaine parent identique.</p>
19-
20-<h4 id="schema-mime_type">mime_type</h4>
21-<p>C'est le type MIME de la ressource. Cet attribut n'est pas toujours définit. Il l'est toujours pour les fichiers binaires (mp3, image, etc). Par exemple :</p>
22-<pre>audio/mpeg</pre>
23+<h4 id="schema-availability">availability</h4>
24+<p>Ce paramètre correspond à la disponibilité du lien.</p>
25+<dl>
26+ <dt>Valeurs possibles</dt>
27+ <dd><code>unknown</code> : On ne sait pas si l'URL est accessible ou non</dd>
28+ <dd><code>available</code> : L'URL est accessible</dd>
29+ <dd><code>unavailable</code> : L'URL n'est pas accessible</dd>
30+</dl>
31+<p>Par défaut, les liens avec une URL non accessible ne sont pas retournés.</p>
32+
33+<h4 id="schema-comment_id">comment_id</h4>
34+<p>C'est l'identifiant du commentaire sur le forum dans lequel a été contribué le lien. Par exemple :</p>
35+<pre>15336</pre>
36
37 <h4 id="schema-contributed_at">contributed_at</h4>
38 <p>C'est la date à laquelle a été contribué le lien. Par exemple :</p>
39 <pre>2007-05-09T21:22:05Z</pre>
40
41+<h4 id="schema-contributor_name">contributor_name</h4>
42+<p>C'est le nom sur le forum de l'utilisateur ayant contribué le lien. Par exemple :</p>
43+<pre>mbertier</pre>
44
45 <h4 id="schema-contributor_id">contributor_id</h4>
46 <p>C'est l'identifiant sur le forum de l'utilisateur ayant contribué le lien. Par exemple :</p>
47 <pre>34</pre>
48 <p>Les URL pour accéder au profil d'un utilisateur sur Musiques Incongrues ont la forme http://www.musiques-incongrues.net/forum/account/<strong>contributor_id</strong>/</p>
49
50-<h4 id="schema-contributor_name">contributor_name</h4>
51-<p>C'est le nom sur le forum de l'utilisateur ayant contribué le lien. Par exemple :</p>
52-<pre>mbertier</pre>
53-
54-<h4 id="schema-comment_id">comment_id</h4>
55-<p>C'est l'identifiant du commentaire sur le forum dans lequel à été contribué le lien. Par exemple :</p>
56-<pre>15336</pre>
57-
58 <h4 id="schema-discussion_id">discussion_id</h4>
59 <p>C'est l'identifiant de la discussion dans laquelle a été contribué le lien. Par exemple :</p>
60 <pre>5455</pre>
61@@ -42,4 +34,20 @@
62
63 <h4 id="schema-discussion_name">discussion_name</h4>
64 <p>C'est le titre de la discussion dans laquelle a été contribué le lien. Par exemple :</p>
65-<pre>Des clips, des clips, rien que des clips</pre>
66\ No newline at end of file
67+<pre>Des clips, des clips, rien que des clips</pre>
68+<h4 id="schema-domain_fqdn">domain_fqdn</h4>
69+<p>C'est le nom de domaine complet de l'URL vers la ressource. Par exemple :</p>
70+<pre>data.musiques-incongrues.net</pre>
71+
72+<h4 id="schema-domain_parent">domain_parent</h4>
73+<p>C'est le domaine parent de l'URL vers la ressource. Par exemple :</p>
74+<pre>musiques-incongrues.net</pre>
75+<p>Les deux URLs http://www.musiques-incongrues.net et http://data.musiques-incongrues.net ont un domaine parent identique.</p>
76+
77+<h4 id="schema-mime_type">mime_type</h4>
78+<p>C'est le type MIME de la ressource. Cet attribut n'est pas toujours défini. Il l'est toujours pour les fichiers binaires (mp3, image, etc.). Par exemple :</p>
79+<pre>audio/mpeg</pre>
80+
81+<h4 id="schema-url">url</h4>
82+<p>C'est l'URL vers la ressource. Par exemple : </p>
83+<pre>http://www.glafouk.com/dlz/radioclash_astrotease.mp3</pre>
84\ No newline at end of file
85
86=== modified file 'config/doctrine/schema.yml'
87--- config/doctrine/schema.yml 2010-06-15 13:22:18 +0000
88+++ config/doctrine/schema.yml 2010-06-23 20:53:23 +0000
89@@ -31,9 +31,21 @@
90 type: integer
91 discussion_name:
92 type: string
93+ # available, unavailable, unknown
94+ availability:
95+ type: string
96+ default: 'unknown'
97+ expanded_at:
98+ type: timestamp
99 indexes:
100- url_index:
101+ idx_url:
102 fields:
103 url:
104 length: 512
105- type: unique
106\ No newline at end of file
107+ type: unique
108+ idx_expanded_at:
109+ fields: [expanded_at]
110+ idx_availability:
111+ fields:
112+ availability:
113+ length: 11
114\ No newline at end of file
115
116=== modified file 'config/search.yml'
117--- config/search.yml 2010-06-02 16:53:29 +0000
118+++ config/search.yml 2010-06-23 20:53:23 +0000
119@@ -26,7 +26,9 @@
120 type: int
121 discussion_name:
122 stored: true
123-
124+ availability:
125+ stored: true
126+
127 index:
128 encoding: UTF-8
129 cultures: [fr]
130
131=== modified file 'config/solr/IndexA_fr/conf/schema.xml'
132--- config/solr/IndexA_fr/conf/schema.xml 2010-06-03 13:14:51 +0000
133+++ config/solr/IndexA_fr/conf/schema.xml 2010-06-23 20:53:23 +0000
134@@ -258,6 +258,7 @@
135 <field name='comment_id' type='int' stored='true' multiValued='false' required='false' />
136 <field name='discussion_id' type='int' stored='true' multiValued='false' required='false' />
137 <field name='discussion_name' type='text' stored='true' multiValued='false' required='false' />
138+ <field name='availability' type='text' stored='true' multiValued='false' required='false' />
139 </fields>
140
141 <!-- field to use to determine and enforce document uniqueness. -->
142@@ -283,5 +284,6 @@
143 <copyField source='comment_id' dest='sfl_all' />
144 <copyField source='discussion_id' dest='sfl_all' />
145 <copyField source='discussion_name' dest='sfl_all' />
146+ <copyField source='availability' dest='sfl_all' />
147
148 </schema>
149
150=== added directory 'data/utils'
151=== added file 'data/utils/proxy.php'
152--- data/utils/proxy.php 1970-01-01 00:00:00 +0000
153+++ data/utils/proxy.php 2010-06-23 20:53:23 +0000
154@@ -0,0 +1,13 @@
155+<?php
156+// Sanity checks
157+if (!isset($_SERVER['PATH_INFO']))
158+{
159+ throw new InvalidArgumentException('Please specify path info.');
160+}
161+
162+// Call service
163+$curl = curl_init(sprintf('http://data.musiques-incongrues.net/%s?%s', $_SERVER['PATH_INFO'], $_SERVER['QUERY_STRING']));
164+curl_exec($curl);
165+
166+// Clean up
167+curl_close($curl);
168\ No newline at end of file
169
170=== removed file 'lib/filter/doctrine/ResourceTypeFormFilter.class.php'
171--- lib/filter/doctrine/ResourceTypeFormFilter.class.php 2010-06-02 10:18:41 +0000
172+++ lib/filter/doctrine/ResourceTypeFormFilter.class.php 1970-01-01 00:00:00 +0000
173@@ -1,16 +0,0 @@
174-<?php
175-
176-/**
177- * ResourceType filter form.
178- *
179- * @package vanilla-miner
180- * @subpackage filter
181- * @author Your name here
182- * @version SVN: $Id: sfDoctrineFormFilterTemplate.php 23810 2009-11-12 11:07:44Z Kris.Wallsmith $
183- */
184-class ResourceTypeFormFilter extends BaseResourceTypeFormFilter
185-{
186- public function configure()
187- {
188- }
189-}
190
191=== removed file 'lib/form/doctrine/ResourceTypeForm.class.php'
192--- lib/form/doctrine/ResourceTypeForm.class.php 2010-06-02 10:18:41 +0000
193+++ lib/form/doctrine/ResourceTypeForm.class.php 1970-01-01 00:00:00 +0000
194@@ -1,16 +0,0 @@
195-<?php
196-
197-/**
198- * ResourceType form.
199- *
200- * @package vanilla-miner
201- * @subpackage form
202- * @author Your name here
203- * @version SVN: $Id: sfDoctrineFormTemplate.php 23810 2009-11-12 11:07:44Z Kris.Wallsmith $
204- */
205-class ResourceTypeForm extends BaseResourceTypeForm
206-{
207- public function configure()
208- {
209- }
210-}
211
212=== removed file 'lib/model/doctrine/ResourceType.class.php'
213--- lib/model/doctrine/ResourceType.class.php 2010-06-02 10:18:41 +0000
214+++ lib/model/doctrine/ResourceType.class.php 1970-01-01 00:00:00 +0000
215@@ -1,15 +0,0 @@
216-<?php
217-
218-/**
219- * ResourceType
220- *
221- * This class has been auto-generated by the Doctrine ORM Framework
222- *
223- * @package vanilla-miner
224- * @subpackage model
225- * @author Your name here
226- * @version SVN: $Id: Builder.php 7490 2010-03-29 19:53:27Z jwage $
227- */
228-class ResourceType extends BaseResourceType
229-{
230-}
231
232=== removed file 'lib/model/doctrine/ResourceTypeTable.class.php'
233--- lib/model/doctrine/ResourceTypeTable.class.php 2010-06-02 10:18:41 +0000
234+++ lib/model/doctrine/ResourceTypeTable.class.php 1970-01-01 00:00:00 +0000
235@@ -1,11 +0,0 @@
236-<?php
237-
238-
239-class ResourceTypeTable extends Doctrine_Table
240-{
241-
242- public static function getInstance()
243- {
244- return Doctrine_Core::getTable('ResourceType');
245- }
246-}
247\ No newline at end of file
248
249=== added file 'lib/task/minerExpandLinksTask.class.php'
250--- lib/task/minerExpandLinksTask.class.php 1970-01-01 00:00:00 +0000
251+++ lib/task/minerExpandLinksTask.class.php 2010-06-23 20:53:23 +0000
252@@ -0,0 +1,178 @@
253+<?php
254+
255+class minerExpandLinksTask extends sfBaseTask
256+{
257+ protected function configure()
258+ {
259+ $this->addOptions(array(
260+ new sfCommandOption('env', null, sfCommandOption::PARAMETER_REQUIRED, 'The environment', 'dev'),
261+ new sfCommandOption('connection', null, sfCommandOption::PARAMETER_REQUIRED, 'The connection name', 'doctrine'),
262+ new sfCommandOption('progress', null, sfCommandOption::PARAMETER_NONE, 'Display a progress bar'),
263+ new sfCommandOption('verbose', null, sfCommandOption::PARAMETER_NONE, 'Display more information about the expansion process'),
264+ new sfCommandOption('all', null, sfCommandOption::PARAMETER_NONE, 'Expand all links in database. By default, only new links are expanded'),
265+ new sfCommandOption('with-unavailable', null, sfCommandOption::PARAMETER_NONE, 'When expanding all links (--all), also include links previously marked as unavailable'),
266+ // TODO : add --older-than option
267+ ));
268+
269+ $this->namespace = 'miner';
270+ $this->name = 'expand-links';
271+ // TODO : write descriptions
272+ $this->briefDescription = 'Expands information about links by crawling their URLs';
273+ $this->detailedDescription = <<<EOF
274+
275+Use cases :
276+ * Expand new urls : [php symfony miner:expand-links|INFO]
277+ * Expand all urls, except those previously marked as unavailable : [php symfony miner:expand-links --all|INFO]
278+ * Expand all urls, including those previously marked as unavailable : [php symfony miner:expand-links --all --with-unavailable|INFO]
279+EOF;
280+ }
281+
282+ protected function execute($arguments = array(), $options = array())
283+ {
284+ // Open database connection
285+ $databaseManager = new sfDatabaseManager($this->configuration);
286+ $connection = $databaseManager->getDatabase($options['connection'])->getConnection();
287+
288+ // Build query for fetching links from database
289+ $q = Doctrine_Query::create()
290+ ->select('l.url')
291+ ->from('Link l');
292+ if (!$options['all'])
293+ {
294+ $q->where('l.expanded_at is null');
295+ }
296+ if (!$options['with-unavailable'])
297+ {
298+ $q->andWhere('l.availability != "unavailable"');
299+ }
300+
301+ // Fetch links from database
302+ $links_count = $q->count();
303+ $links = $q->execute(null, Doctrine_Core::HYDRATE_ON_DEMAND);
304+ $q->free();
305+ $this->logSection('info', sprintf('Expanding %s links', $links_count));
306+
307+ // Instantiate progress bar, if user requested so
308+ $links_expanded = 0;
309+ if ($options['progress'])
310+ {
311+ include 'Console/ProgressBar.php';
312+ $progress_bar = new Console_ProgressBar(
313+ '** Links %fraction% comments [%bar%] %percent% | ',
314+ '=>', '-', 80, $links_count, array('ansi_terminal' => true)
315+ );
316+ $progress_bar->update($links_expanded);
317+ }
318+
319+ // Launch a HEAD request on each link, and use data in response headers to update information about link in database
320+ // TODO : move crawling code to dedicated class. and then create miner:crawl-url task
321+ require 'HTTP/Request2.php';
322+ $request = new HTTP_Request2(null, HTTP_Request2::METHOD_HEAD, array('follow_redirects' => true));
323+ $request->setHeader('user-agent', 'vanilla-miner/1.1 (https://launchpad.net/vanilla-miner)');
324+
325+ foreach ($links as $link)
326+ {
327+ $link->expanded_at = date('Y-m-d H:i:s');
328+ try
329+ {
330+ $request->setUrl($link->url);
331+ $response = $request->send();
332+ if (200 == $response->getStatus())
333+ {
334+ if ($options['progress'])
335+ {
336+ $this->log(sprintf('[%d] %s', $response->getStatus(), $link->url));
337+ }
338+ else
339+ {
340+ $this->logSection('info', sprintf('[%d] %s - Updating metadata, marking as available', $response->getStatus(), $link->url));
341+ }
342+
343+ // Extract meaningful information from server response
344+ $header = $response->getHeader();
345+ $header = $this->normalizeHeader($header);
346+ $link->mime_type = $this->getMimeType($header);
347+
348+ // Mark link as available
349+ $link->availability = 'available';
350+
351+ // Save link to database
352+ $link->replace();
353+ }
354+ else
355+ {
356+ if ($options['progress'])
357+ {
358+ $this->log(sprintf('[%d] %s', $response->getStatus(), $link->url));
359+ }
360+ else
361+ {
362+ $this->logSection('notice', sprintf(
363+ '[%d] %s (%d %s) - Marking as unavailable',
364+ $response->getStatus(),
365+ $link->url,
366+ $response->getStatus(),
367+ $response->getReasonPhrase()
368+ )
369+ );
370+ }
371+ $link->availability = 'unavailable';
372+ $link->replace();
373+ }
374+ }
375+ catch (HTTP_Request2_Exception $e)
376+ {
377+ if ($options['progress'])
378+ {
379+ $this->log(sprintf('[ERR] %s', $link->url));
380+ }
381+ else
382+ {
383+ $this->logSection('error', sprintf('[ERR] Received exception with message "%s" for link "%s" - Marking as unavailable.', $e->getMessage(), $link->url));
384+ }
385+ $link->availability = 'unavailable';
386+ $link->replace();
387+ }
388+
389+ // Update progress bar
390+ if ($options['progress'])
391+ {
392+ $progress_bar->update(++$links_expanded);
393+ }
394+
395+ }
396+ }
397+
398+ private function normalizeHeader(array $header)
399+ {
400+ // Make all header names lower case.
401+ // array_change_key_case() only touches the keys, so header values
402+ // are preserved as-is.
403+ $header = array_change_key_case($header, CASE_LOWER);
404+
405+ return $header;
406+ }
407+
408+ private function getMimeType(array $header)
409+ {
410+ $mime_type = null;
411+
412+ if (isset($header['content-type']))
413+ {
414+ $mime_type = $header['content-type'];
415+
416+ // Extract mime type from content-type header
417+ // TODO : use a regular expression instead of this crappy flow
418+ $matches = array();
419+ if (strpos($header['content-type'], 'charset') !== false)
420+ {
421+ if (preg_match('/(.+); ?charset=.+/i', $header['content-type'], $matches))
422+ {
423+ $mime_type = $matches[1];
424+ }
425+ }
426+ }
427+
428+ return $mime_type;
429+ }
430+}
431\ No newline at end of file
432
433=== modified file 'lib/task/minerExtractlinksTask.class.php'
434--- lib/task/minerExtractlinksTask.class.php 2010-06-23 13:13:30 +0000
435+++ lib/task/minerExtractlinksTask.class.php 2010-06-23 20:53:23 +0000
436@@ -63,31 +63,38 @@
437 $resources_parsed = 0;
438 $resources_total = $extractor->countResources($arguments['dsn']);
439
440- // Instanciate an configure progress bar
441- if ($options['progress'])
442- {
443- include 'Console/ProgressBar.php';
444- $progress_bar = new Console_ProgressBar(
445- '** '.$arguments['dsn'].' %fraction% resources [%bar%] %percent% | ',
446- '=>', '-', 80, $resources_total, array('ansi_terminal' => true)
447- );
448- }
449-
450- // Extract resources from source and insert them in Links database
451- while ($resource_extraction_info = $extractor->extract($arguments['dsn'], $options['connection']))
452- {
453- // Update extraction statistics
454- $urls_found_count += $resource_extraction_info['urls_found_count'];
455-
456- // Update progress bar
457+ if ($resources_total > 0)
458+ {
459+ // Instantiate and configure progress bar
460 if ($options['progress'])
461 {
462- $progress_bar->update($resource_extraction_info['resources_parsed_count']);
463- }
464- }
465-
466- // Log
467- $this->logSection('extract', sprintf('%d URLs where extracted from %d resources', $urls_found_count, $resources_total));
468+ include 'Console/ProgressBar.php';
469+ $progress_bar = new Console_ProgressBar(
470+ '** '.$arguments['dsn'].' %fraction% resources [%bar%] %percent% | ',
471+ '=>', '-', 80, $resources_total, array('ansi_terminal' => true)
472+ );
473+ }
474+
475+ // Extract resources from source and insert them in Links database
476+ while ($resource_extraction_info = $extractor->extract($arguments['dsn'], $options['connection']))
477+ {
478+ // Update extraction statistics
479+ $urls_found_count += $resource_extraction_info['urls_found_count'];
480+
481+ // Update progress bar
482+ if ($options['progress'])
483+ {
484+ $progress_bar->update($resource_extraction_info['resources_parsed_count']);
485+ }
486+ }
487+
488+ // Log
489+ $this->logSection('extract', sprintf('%d URLs were extracted from %d resources', $urls_found_count, $resources_total));
490+ }
491+ else
492+ {
493+ $this->logSection('extract', 'No resources to extract. Exiting.');
494+ }
495 }
496
497 /**
498
499=== modified file 'lib/vendor/CI/Search/Link/Segment.php'
500--- lib/vendor/CI/Search/Link/Segment.php 2010-06-15 13:21:38 +0000
501+++ lib/vendor/CI/Search/Link/Segment.php 2010-06-23 20:53:23 +0000
502@@ -59,6 +59,12 @@
503 }
504 $c->setLimit($limit);
505
506+ // default : return links with availability being marked as "available" or "unknown"
507+ if ($parameters->get('availability', null) === null)
508+ {
509+ $c->addField('-availability', 'unavailable');
510+ }
511+
512 // Define sorting
513 $sorting_direction = $parameters->get('sort_direction', 'asc');
514 if ($sorting_direction == 'desc')
515@@ -108,5 +114,4 @@
516
517 return array_keys($schema_fields);
518 }
519-
520 }
521\ No newline at end of file
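
Note on the TODO in getMimeType() ("use a regular expression instead of this crappy flow"): below is a minimal sketch of how the header handling could look, assuming the response headers arrive as a plain name => value array. The function names normalizeHeaderNames() and extractMimeType() are hypothetical and are not part of this branch. Lower-casing the returned media type is also an assumption, not something the task does in this diff.

```php
<?php
// Sketch only: normalizeHeaderNames() and extractMimeType() are hypothetical
// helper names used for illustration, not functions from vanilla-miner.

/**
 * Lower-case every header name, leaving the values untouched.
 */
function normalizeHeaderNames(array $header)
{
    return array_change_key_case($header, CASE_LOWER);
}

/**
 * Return the media type of a Content-Type value without its parameters
 * (charset, boundary, ...), or null when the header is missing or empty.
 */
function extractMimeType(array $header)
{
    if (!isset($header['content-type'])) {
        return null;
    }

    // Capture the leading run of characters up to the first ";" or whitespace.
    if (preg_match('/^\s*([^;\s]+)/', $header['content-type'], $matches)) {
        // Assumption: media types are compared case-insensitively, so store them lower-cased.
        return strtolower($matches[1]);
    }

    return null;
}

$header = normalizeHeaderNames(array('Content-Type' => 'text/html; charset=UTF-8'));
var_dump(extractMimeType($header));                                    // string(9) "text/html"
var_dump(extractMimeType(array('content-type' => 'audio/mpeg')));      // string(10) "audio/mpeg"
var_dump(extractMimeType(array()));                                    // NULL
```

Matching up to the first ";" also strips parameters other than charset (for example boundary), which the strpos()/charset check in the diff would leave attached to the stored value.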
