Merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba

Proposed by Daniel Turcanu
Status: Merged
Approved by: Chris Hillery
Approved revision: 10470
Merged at revision: 10489
Proposed branch: lp:~danielturcanu/zorba/web_crawler_tutorial
Merge into: lp:zorba
Diff against target: 197 lines (+181/-0)
2 files modified
doc/zorba/indexpage.dox.in (+8/-0)
doc/zorba/web_crawler.dox (+173/-0)
To merge this branch: bzr merge lp:~danielturcanu/zorba/web_crawler_tutorial
Reviewer Review Type Date Requested Status
Chris Hillery Approve
Sorin Marian Nasoi Abstain
Review via email: mp+77179@code.launchpad.net

Commit message

Added tutorial for web crawler script from html module (or script directory in zorba).

Description of the change

Added tutorial for web crawler script from html module (or script directory in zorba).

Revision history for this message
Sorin Marian Nasoi (sorin.marian.nasoi) wrote :

The tutorial is nice, but I am not sure the index page in our Doxygen documentation is the best place to put it.


review: Abstain
Revision history for this message
Chris Hillery (ceejatec) wrote :

I like it. I'd leave the link from the index page there - having a specific section marked "tutorials" will maybe encourage folks to write some more over time. If not, we can easily move that later.

review: Approve
Revision history for this message
Matthias Brantner (matthias-brantner) wrote :

I think that the code in the tutorial should be literally included and be tested as such to make sure that we don't regress.

The tutorial should be linked from a blog entry. Also, the tutorial should provide a link to download the source code.

Daniel, could you please provide Dana with the HTML version of the tutorial? I'm sure she is also interested in reading it before it gets published.

Revision history for this message
Zorba Build Bot (zorba-buildbot) wrote :

Validation queue job web_crawler_tutorial-2011-10-04T23-35-02.03Z is finished. The final status was:

All tests succeeded!

Revision history for this message
Daniel Turcanu (danielturcanu) wrote :

The link crawler is added in html module as a test for compilation.

Preview Diff

1=== modified file 'doc/zorba/indexpage.dox.in'
2--- doc/zorba/indexpage.dox.in 2011-09-06 16:39:46 +0000
3+++ doc/zorba/indexpage.dox.in 2011-09-27 15:05:56 +0000
4@@ -127,6 +127,14 @@
5 <!--li>\ref extensions_update</li-->
6
7
8+</td></tr>
9+<tr><td class="tdDocIndexTable">
10+
11+
12+ <h2>Tutorials</h2>
13+
14+ \ref web_crawler_tutorial
15+
16 </td><tr>
17 </table>
18
19
20=== added file 'doc/zorba/web_crawler.dox'
21--- doc/zorba/web_crawler.dox 1970-01-01 00:00:00 +0000
22+++ doc/zorba/web_crawler.dox 2011-09-27 15:05:56 +0000
23@@ -0,0 +1,173 @@
24+/**
25+\page web_crawler_tutorial Web Crawler example in XQuery
26+
27+Description of a web crawler example in XQuery.
28+
29+The idea is to crawl through the pages of a website, building lists of internal and external pages and checking whether each one works.
30+This example uses Zorba's http module to fetch the web pages, and the html module to convert the HTML to XML.
31+The complete code can be found in the test directory of the html converter module.
32+
33+\code
34+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
35+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
36+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
37+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
38+\endcode
39+
40+The internal pages are checked recursively, while the external ones are only checked for existence.
41+A link is classified as internal or external by checking whether its URI starts with the global string variable $uri-host.
42+Change this variable to point to your website, or to a subdirectory of it.
43+
44+\code
45+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
46+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com/site2/";
47+
48+declare function local:is-internal($x as xs:string) as xs:boolean
49+{
50+ starts-with($x, $uri-host)
51+};
52+
53+\endcode
54+
55+Crawling starts from the URI given by $top-uri.
56+
57+Visited links are stored in two maps, one for internal pages and one for external pages.
58+The keys are the URIs, and the values are the strings "broken" or "clean".
59+The maps are used to avoid parsing the same page twice.
60+
61+\code
62+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
63+declare variable $local:processed-external-links := xs:QName("processed-external-links");
64+
65+declare %ann:sequential function local:create-containers()
66+{
67+ map:create($local:processed-internal-links, xs:QName("xs:string"));
68+ map:create($local:processed-external-links, xs:QName("xs:string"));
69+};
70+
71+declare %ann:sequential function local:delete-containers(){
72+ for $x in map:available-maps()
73+ return map:delete($x);
74+};
75+
76+\endcode
77+
78+After an internal page is parsed with the html module, all of its links are extracted and processed recursively, unless they have already been processed.
79+The html module is based on the tidy library, so we pass tidy options to set up the conversion from HTML to XML.
80+Some nonstandard tags are declared in the new-inline-tags param so that tidy does not reject them; the list here is specific to this website.
81+You can add or remove tags to suit your own website's needs.
82+
83+\code
84+declare function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
85+{ distinct-values( for $y in ($content//*:a/string(@href),
86+ $content//*:link/string(@href),
87+ $content//*:script/string(@src),
88+ $content//*:img/string(@src),
89+ $content//*:area/string(@href)
90+ )
91+return local:get-real-link($y, $uri))
92+};
93+
94+declare function local:tidy-options()
95+{<options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
96+ <tidyParam name="output-xml" value="yes" />
97+ <tidyParam name="doctype" value="omit" />
98+ <tidyParam name="quote-nbsp" value="no" />
99+ <tidyParam name="char-encoding" value="utf8" />
100+ <tidyParam name="newline" value="LF" />
101+ <tidyParam name="tidy-mark" value="no" />
102+ <tidyParam name="new-inline-tags" value="nav header section article footer xqdoc:custom d c options json-param" />
103+ </options>
104+};
105+
106+declare %ann:sequential function local:process-internal-link($x as xs:string, $n as xs:integer){
107+ if($n=3) then exit returning (); else {}
108+ if(not(empty(map:get($local:processed-internal-links, $x))))
109+ then exit returning false();
110+ else {}
111+ variable $http-call:=();
112+ try{
113+ $http-call:=http:send-request(<httpsch:request method="GET" href="{$x}"/>, (), ());
114+ }
115+ catch * {}
116+ if( not(local:alive($http-call)))
117+ then { map:insert($local:processed-internal-links, "broken", $x); exit returning ();}
118+ else {}
119+ if(not (local:get-media-type($http-call[1]) = $supported-media-types))
120+ then {map:insert($local:processed-internal-links, "clean", $x); exit returning ();}
121+ else {}
122+ variable $string-content := xs:string($http-call[2]);
123+ variable $content:=();
124+
125+ try{
126+ $content:=html:parse($string-content,local:tidy-options() );
127+ }
128+ catch *
129+ {
130+ map:insert($local:processed-internal-links, concat("cannot tidy", $err:description), $x);
131+ try{
132+ $content:=parse-xml:parse-xml-fragment ($string-content, "");
133+ }
134+ catch *
135+ { map:insert($local:processed-internal-links, concat("cannot parse", $err:description), $x);}
136+ }
137+ variable $links :=();
138+ if(empty($content))
139+ then $links:=local:get-out-links-unparsed($string-content, $x);
140+ else $links:=local:get-out-links-parsed($content, $x);
141+ for $l in $links
142+ return local:process-link($l, $n+1);
143+};
144+
145+\endcode
146+
147+Some HTML pages contain errors, and the tidy library is very strict when checking for them.
148+When parsing fails, we fall back to a regular expression to extract the links.
149+
150+\code
151+declare function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*{
152+
153+ distinct-values(
154+ let $search := fn:analyze-string($content, "(&lt;|&amp;lt;|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
155+ for $other-uri2 in $search//group[@nr=8]/string()
156+ let $y:= fn:normalize-space($other-uri2)
157+ return local:get-real-link($y, $uri)
158+ )
159+};
160+
161+\endcode
162+
163+For external links, we only check that they exist, so the HTTP request uses only the HEAD method.
164+
165+\code
166+declare %ann:sequential function local:process-external-link($x as xs:string){
167+ if(not(empty(map:get($local:processed-external-links, $x))))
168+ then exit returning false();
169+ else {}
170+ variable $http-call:=();
171+ try{
172+ $http-call:=http:send-request(<httpsch:request method="HEAD" href="{$x}"/>, (), ());
173+ }
174+ catch * {}
175+ if( local:alive($http-call))
176+ then map:insert($local:processed-external-links, "clean", $x);
177+ else map:insert($local:processed-external-links, "broken", $x);
178+};
179+
180+\endcode
181+
182+After crawling finishes, the results are returned as XML.
183+
184+\code
185+declare function local:print-results() as element()*
186+{
187+ for $x in map:keys($local:processed-internal-links)/map:attribute/@value/string()
188+ return <INTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-internal-links,$x)}</RESULT></INTERNAL>,
189+
190+ for $x in map:keys($local:processed-external-links)/map:attribute/@value/string()
191+ return <EXTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-external-links,$x)}</RESULT></EXTERNAL>
192+};
193+
194+\endcode
195+
196+*/
197\ No newline at end of file
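
The diff shows the building blocks but not the top-level code that drives them. As a sketch only (the real entry point, along with the local:process-link, local:alive and local:get-media-type helpers, lives in the complete script in the html module's test directory), the driver could look roughly like this:

```
(: Hypothetical driver sketch; the actual code is in the html module's test directory. :)
(: Dispatch a link to the internal or external handler based on its URI prefix. :)
declare %ann:sequential function local:process-link($x as xs:string, $n as xs:integer)
{
  if(local:is-internal($x))
  then local:process-internal-link($x, $n)
  else local:process-external-link($x)
};

(: Main body: set up the maps, crawl from the top URI, report, clean up. :)
local:create-containers();
local:process-link($top-uri, 1);
variable $results := local:print-results();
local:delete-containers();
$results
```

Note that local:process-internal-link stops recursing once $n reaches 3 (the `exit returning ()` guard), so the crawl depth is bounded even on large sites.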
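
The helper local:get-real-link, used by both link extractors above but not shown in the diff, has to turn relative hrefs into absolute URIs before they can be compared with $uri-host. A minimal sketch using the standard fn:resolve-uri function (the actual helper in the complete script may normalize or filter links differently):

```
(: Hypothetical sketch of local:get-real-link; the real version is in the complete script. :)
declare function local:get-real-link($href as xs:string, $base as xs:string) as xs:string?
{
  (: ignore in-page fragments and mailto: links; resolve everything else against the page URI :)
  if(starts-with($href, "#") or starts-with($href, "mailto:"))
  then ()
  else xs:string(fn:resolve-uri($href, $base))
};
```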
