Merge lp:~danielturcanu/zorba/web_crawler_tutorial into lp:zorba

Proposed by Daniel Turcanu
Status: Merged
Approved by: Chris Hillery
Approved revision: 10470
Merged at revision: 10489
Proposed branch: lp:~danielturcanu/zorba/web_crawler_tutorial
Merge into: lp:zorba
Diff against target: 197 lines (+181/-0)
2 files modified
doc/zorba/indexpage.dox.in (+8/-0)
doc/zorba/web_crawler.dox (+173/-0)
To merge this branch: bzr merge lp:~danielturcanu/zorba/web_crawler_tutorial
Reviewer            Review Type    Date Requested    Status
Chris Hillery                                        Approve
Sorin Marian Nasoi                                   Abstain
Review via email: mp+77179@code.launchpad.net

Commit message

Added a tutorial for the web crawler script from the html module (or the script directory in Zorba).

Description of the change

Added a tutorial for the web crawler script from the html module (or the script directory in Zorba).

Revision history for this message
Sorin Marian Nasoi (sorin.marian.nasoi) wrote :

The tutorial is nice, but I am not sure the index page in our Doxygen documentation is the best place to put it.

review: Abstain
Revision history for this message
Chris Hillery (ceejatec) wrote :

I like it. I'd leave the link from the index page there; having a specific section marked "Tutorials" may encourage folks to write more over time. If not, we can easily move it later.

review: Approve
Revision history for this message
Matthias Brantner (matthias-brantner) wrote :

I think the code in the tutorial should be literally included and tested as such, to make sure that we don't regress.

The tutorial should be linked from a blog entry. Also, the tutorial should provide a link to download the source code.

Daniel, could you please provide Dana with the HTML version of the tutorial? I'm sure she is also interested in reading it before it gets published.

Revision history for this message
Zorba Build Bot (zorba-buildbot) wrote :

Validation queue job web_crawler_tutorial-2011-10-04T23-35-02.03Z is finished. The final status was:

All tests succeeded!

Revision history for this message
Daniel Turcanu (danielturcanu) wrote :

The link crawler is added to the html module as a compilation test.

Preview Diff

=== modified file 'doc/zorba/indexpage.dox.in'
--- doc/zorba/indexpage.dox.in 2011-09-06 16:39:46 +0000
+++ doc/zorba/indexpage.dox.in 2011-09-27 15:05:56 +0000
@@ -127,6 +127,14 @@
 <!--li>\ref extensions_update</li-->


+</td></tr>
+<tr><td class="tdDocIndexTable">
+
+
+ <h2>Tutorials</h2>
+
+ \ref web_crawler_tutorial
+
 </td><tr>
 </table>


=== added file 'doc/zorba/web_crawler.dox'
--- doc/zorba/web_crawler.dox 1970-01-01 00:00:00 +0000
+++ doc/zorba/web_crawler.dox 2011-09-27 15:05:56 +0000
@@ -0,0 +1,173 @@
+/**
+\page web_crawler_tutorial Web Crawler example in XQuery
+
+Description of a web crawler example in XQuery.
+
+The idea is to crawl through the pages of a website, store a list of external and internal pages, and check whether they work.
+This example uses Zorba's http module to access the web pages, and the html module to convert the HTML to XML.
+The complete code can be found in the test directory of the html converter module.
+
+\code
+import module namespace http = "http://www.zorba-xquery.com/modules/http-client";
+import module namespace map = "http://www.zorba-xquery.com/modules/store/data-structures/unordered-map";
+import module namespace html = "http://www.zorba-xquery.com/modules/converters/html";
+import module namespace parse-xml = "http://www.zorba-xquery.com/modules/xml";
+\endcode
+
+The internal pages are checked recursively, while the external ones are only checked for existence.
+The distinction between internal and external links is made by comparing the URI with the global string variable $uri-host.
+Change this variable to point to your website, or to a subdirectory of your website.
+
+\code
+declare variable $top-uri as xs:string := "http://www.zorba-xquery.com/site2/html/index.html";
+declare variable $uri-host as xs:string := "http://www.zorba-xquery.com/site2/";
+
+declare function local:is-internal($x as xs:string) as xs:boolean
+{
+  starts-with($x, $uri-host)
+};
+
+\endcode
+
+The crawling starts from the URI pointed to by $top-uri.
+
+Visited links are stored as nodes in two maps, one for internal pages and one for external pages.
+The keys are the URIs, and the values are the strings "broken" or "clean".
+The maps are used to avoid parsing the same page twice.
+
+\code
+declare variable $local:processed-internal-links := xs:QName("processed-internal-links");
+declare variable $local:processed-external-links := xs:QName("processed-external-links");
+
+declare %ann:sequential function local:create-containers()
+{
+  map:create($local:processed-internal-links, xs:QName("xs:string"));
+  map:create($local:processed-external-links, xs:QName("xs:string"));
+};
+
+declare %ann:sequential function local:delete-containers()
+{
+  for $x in map:available-maps()
+  return map:delete($x);
+};
+
+\endcode
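The two-map bookkeeping above can be summarized outside XQuery. The following Python sketch is a hypothetical illustration (not part of the patch or the Zorba API): one store per link class, keyed by URI, with an insert-once guard that plays the same role as the map:get() check in the tutorial.

```python
# Hypothetical sketch of the tutorial's visited-link bookkeeping:
# one store per link class, keyed by URI, holding "clean" or "broken".
processed_internal_links = {}
processed_external_links = {}

def mark_once(store, uri, status):
    """Record a URI's status only on first sight, like the map:get() guard."""
    if uri in store:
        return False  # already processed; the caller should skip re-parsing
    store[uri] = status
    return True
```

A page is parsed only when the guard shows the URI is new, which is exactly how the XQuery version avoids crawling the same page twice.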
+
+After parsing an internal page with the html module, all the links are extracted and, if they haven't been processed yet, parsed recursively.
+The html module uses the tidy library, so we set tidy options for converting from HTML to XML.
+Some HTML tags are marked to be ignored in the new-inline-tags param; this is particular to this website.
+You can add or remove tags to suit your website's needs.
+
+\code
+declare function local:get-out-links-parsed($content as node()*, $uri as xs:string) as xs:string*
+{
+  distinct-values( for $y in ($content//*:a/string(@href),
+                              $content//*:link/string(@href),
+                              $content//*:script/string(@src),
+                              $content//*:img/string(@src),
+                              $content//*:area/string(@href)
+                             )
+  return local:get-real-link($y, $uri))
+};
+
+declare function local:tidy-options()
+{
+  <options xmlns="http://www.zorba-xquery.com/modules/converters/html-options" >
+    <tidyParam name="output-xml" value="yes" />
+    <tidyParam name="doctype" value="omit" />
+    <tidyParam name="quote-nbsp" value="no" />
+    <tidyParam name="char-encoding" value="utf8" />
+    <tidyParam name="newline" value="LF" />
+    <tidyParam name="tidy-mark" value="no" />
+    <tidyParam name="new-inline-tags" value="nav header section article footer xqdoc:custom d c options json-param" />
+  </options>
+};
+
+declare %ann:sequential function local:process-internal-link($x as xs:string, $n as xs:integer)
+{
+  if($n = 3) then exit returning (); else {}
+  if(not(empty(map:get($local:processed-internal-links, $x))))
+  then exit returning false();
+  else {}
+  variable $http-call := ();
+  try {
+    $http-call := http:send-request(<httpsch:request method="GET" href="{$x}"/>, (), ());
+  }
+  catch * {}
+  if(not(local:alive($http-call)))
+  then { map:insert($local:processed-internal-links, "broken", $x); exit returning (); }
+  else {}
+  if(not(local:get-media-type($http-call[1]) = $supported-media-types))
+  then { map:insert($local:processed-internal-links, "clean", $x); exit returning (); }
+  else {}
+  variable $string-content := xs:string($http-call[2]);
+  variable $content := ();
+
+  try {
+    $content := html:parse($string-content, local:tidy-options());
+  }
+  catch * {
+    map:insert($local:processed-internal-links, concat("cannot tidy: ", $err:description), $x);
+    try {
+      $content := parse-xml:parse-xml-fragment($string-content, "");
+    }
+    catch * {
+      map:insert($local:processed-internal-links, concat("cannot parse: ", $err:description), $x);
+    }
+  }
+  variable $links := ();
+  if(empty($content))
+  then $links := local:get-out-links-unparsed($string-content, $x);
+  else $links := local:get-out-links-parsed($content, $x);
+  for $l in $links
+  return local:process-link($l, $n+1);
+};
+
+\endcode
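The control flow of local:process-internal-link can be seen more plainly stripped of the HTTP and parsing details. This Python sketch is a simplified, hypothetical rendering (fetching and link extraction are injected as functions, and the internal/external dispatch done by local:process-link is omitted), showing only the depth cutoff, the visited guard, and the broken/clean bookkeeping.

```python
def process_internal(uri, depth, visited, fetch, extract, max_depth=3):
    """Depth-limited recursive crawl, mirroring the 'exit returning' guards."""
    if depth == max_depth or uri in visited:
        return                      # same role as the two early-exit checks
    page = fetch(uri)               # stand-in for http:send-request with GET
    if page is None:
        visited[uri] = "broken"     # request failed: record and stop here
        return
    visited[uri] = "clean"          # mark before recursing, to break cycles
    for link in extract(page):      # stand-in for get-out-links-parsed/unparsed
        process_internal(link, depth + 1, visited, fetch, extract, max_depth)
```

For example, with a fake two-page site where "c" is a dead link, crawling from "a" marks a and b clean and c broken; the visited check also stops the a/b cycle.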
+
+Some HTML pages have errors, and the tidy library is very strict in checking them.
+When the parsing fails, we fall back to using a regex for extracting the links.
+
+\code
+declare function local:get-out-links-unparsed($content as xs:string, $uri as xs:string) as xs:string*
+{
+  distinct-values(
+    let $search := fn:analyze-string($content, "(&lt;|&amp;lt;|<)(((a|link|area).+?href)|((script|img).+?src))=([""'])(.*?)\7")
+    for $other-uri2 in $search//group[@nr=8]/string()
+    let $y := fn:normalize-space($other-uri2)
+    return local:get-real-link($y, $uri)
+  )
+};
+
+\endcode
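The same fallback idea works with any ordinary regex engine. This Python sketch is a hypothetical, slightly simplified analogue of the analyze-string pattern above (the escaped-entity alternatives are dropped, so the group numbers shift): match href on a/link/area and src on script/img, capture the quoted value, and backreference the quote character.

```python
import re

# Simplified analogue of the tutorial's analyze-string pattern:
# group 6 is the quote character, group 7 the captured URL.
LINK_RE = re.compile(
    r"""<(((a|link|area).+?href)|((script|img).+?src))=(["'])(.*?)\6""",
    re.IGNORECASE | re.DOTALL,
)

def extract_links_unparsed(content):
    """Return the distinct link targets found in raw, possibly broken HTML."""
    return sorted({m.group(7) for m in LINK_RE.finditer(content)})
```

Because the pattern never requires well-formed markup, it still finds links in pages that tidy refuses to parse.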
+
+For external links, we just check whether they exist, so the http request asks only for the HEAD.
+
+\code
+declare %ann:sequential function local:process-external-link($x as xs:string)
+{
+  if(not(empty(map:get($local:processed-external-links, $x))))
+  then exit returning false();
+  else {}
+  variable $http-call := ();
+  try {
+    $http-call := http:send-request(<httpsch:request method="HEAD" href="{$x}"/>, (), ());
+  }
+  catch * {}
+  if(local:alive($http-call))
+  then map:insert($local:processed-external-links, "clean", $x);
+  else map:insert($local:processed-external-links, "broken", $x);
+};
+
+\endcode
+
+After parsing, the results are returned in XML format.
+
+\code
+declare function local:print-results() as element()*
+{
+  for $x in map:keys($local:processed-internal-links)/map:attribute/@value/string()
+  return <INTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-internal-links, $x)}</RESULT></INTERNAL>,
+
+  for $x in map:keys($local:processed-external-links)/map:attribute/@value/string()
+  return <EXTERNAL><LINK>{$x}</LINK><RESULT>{map:get($local:processed-external-links, $x)}</RESULT></EXTERNAL>
+};
+
+\endcode
+
+*/
\ No newline at end of file
