Merge lp:~noskcaj/ubuntu/vivid/urlgrabber/3.10.1 into lp:ubuntu/vivid/urlgrabber

Proposed by Jackson Doak on 2014-12-13
Status: Needs review
Proposed branch: lp:~noskcaj/ubuntu/vivid/urlgrabber/3.10.1
Merge into: lp:ubuntu/vivid/urlgrabber
Diff against target: 7325 lines (+1389/-4846)
26 files modified
.pc/applied-patches (+0/-3)
.pc/grabber_fix.diff/urlgrabber/grabber.py (+0/-1730)
.pc/progress_fix.diff/urlgrabber/progress.py (+0/-755)
.pc/progress_object_callback_fix.diff/urlgrabber/grabber.py (+0/-1802)
ChangeLog (+8/-0)
MANIFEST (+2/-0)
PKG-INFO (+22/-22)
README (+1/-1)
debian/changelog (+7/-0)
debian/patches/grabber_fix.diff (+0/-236)
debian/patches/progress_fix.diff (+0/-11)
debian/patches/progress_object_callback_fix.diff (+0/-21)
debian/patches/series (+0/-3)
scripts/urlgrabber (+14/-6)
scripts/urlgrabber-ext-down (+75/-0)
setup.py (+4/-2)
test/base_test_code.py (+1/-1)
test/munittest.py (+3/-3)
test/test_byterange.py (+1/-13)
test/test_grabber.py (+2/-1)
test/test_mirror.py (+72/-1)
urlgrabber/__init__.py (+5/-4)
urlgrabber/byterange.py (+8/-8)
urlgrabber/grabber.py (+901/-152)
urlgrabber/mirror.py (+54/-11)
urlgrabber/progress.py (+209/-60)
To merge this branch: bzr merge lp:~noskcaj/ubuntu/vivid/urlgrabber/3.10.1
Reviewer: Daniel Holbach
Review type: (none specified)
Date requested: 2014-12-13
Status: Needs Fixing on 2014-12-16
Review via email: mp+244676@code.launchpad.net

Description of the change

New upstream release. Some patches have been merged upstream.

Daniel Holbach (dholbach) wrote :

daniel@daydream:~/urlgrabber$ bzr merge lp:~noskcaj/ubuntu/vivid/urlgrabber/3.10.1
Unapplying quilt patches to prevent spurious conflicts
+N scripts/urlgrabber-ext-down
 M ChangeLog
 M MANIFEST
 M PKG-INFO
 M README
 M debian/changelog
-D debian/patches/grabber_fix.diff
-D debian/patches/progress_fix.diff
-D debian/patches/progress_object_callback_fix.diff
 M debian/patches/series
 M scripts/urlgrabber
 M setup.py
 M test/base_test_code.py
 M test/munittest.py
 M test/test_byterange.py
 M test/test_grabber.py
 M test/test_mirror.py
 M urlgrabber/__init__.py
 M urlgrabber/byterange.py
 M urlgrabber/grabber.py
 M urlgrabber/mirror.py
 M urlgrabber/progress.py
Text conflict in urlgrabber/grabber.py
1 conflicts encountered.
daniel@daydream:~/urlgrabber$

review: Needs Fixing
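
The text conflict reported above has to be located in urlgrabber/grabber.py before it can be fixed. A minimal sketch of one way to list the affected lines, assuming the merge left diff3-style conflict markers (`<<<<<<<`, `=======`, `>>>>>>>`) at the start of lines; the helper name `find_conflict_markers` is hypothetical and is not part of bzr or urlgrabber:

```python
import re

# Assumption: conflict regions are delimited by diff3-style markers that
# start at column 0 and are followed by whitespace or end of line.
MARKER = re.compile(r"^(<{7}|={7}|>{7})(\s|$)")


def find_conflict_markers(lines):
    """Return (line_number, marker) pairs for every conflict marker line.

    `lines` is any iterable of text lines, for example an open file.
    Line numbers are 1-based.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = MARKER.match(line)
        if match:
            hits.append((number, match.group(1)))
    return hits
```

Passing an open file handle for urlgrabber/grabber.py to `find_conflict_markers` would list the marker positions. A line consisting of exactly seven `=` characters that is not a conflict separator would also be reported, so the result should be read as a hint rather than a verdict.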

Unmerged revisions

12. By Jackson Doak on 2014-12-13

* New upstream release.
* Drop all patches, fixed upstream

Preview Diff

1=== removed file '.pc/applied-patches'
2--- .pc/applied-patches 2011-08-09 17:45:08 +0000
3+++ .pc/applied-patches 1970-01-01 00:00:00 +0000
4@@ -1,3 +0,0 @@
5-grabber_fix.diff
6-progress_fix.diff
7-progress_object_callback_fix.diff
8
9=== removed directory '.pc/grabber_fix.diff'
10=== removed directory '.pc/grabber_fix.diff/urlgrabber'
11=== removed file '.pc/grabber_fix.diff/urlgrabber/grabber.py'
12--- .pc/grabber_fix.diff/urlgrabber/grabber.py 2010-07-08 17:40:08 +0000
13+++ .pc/grabber_fix.diff/urlgrabber/grabber.py 1970-01-01 00:00:00 +0000
14@@ -1,1730 +0,0 @@
15-# This library is free software; you can redistribute it and/or
16-# modify it under the terms of the GNU Lesser General Public
17-# License as published by the Free Software Foundation; either
18-# version 2.1 of the License, or (at your option) any later version.
19-#
20-# This library is distributed in the hope that it will be useful,
21-# but WITHOUT ANY WARRANTY; without even the implied warranty of
22-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
23-# Lesser General Public License for more details.
24-#
25-# You should have received a copy of the GNU Lesser General Public
26-# License along with this library; if not, write to the
27-# Free Software Foundation, Inc.,
28-# 59 Temple Place, Suite 330,
29-# Boston, MA 02111-1307 USA
30-
31-# This file is part of urlgrabber, a high-level cross-protocol url-grabber
32-# Copyright 2002-2004 Michael D. Stenner, Ryan Tomayko
33-# Copyright 2009 Red Hat inc, pycurl code written by Seth Vidal
34-
35-"""A high-level cross-protocol url-grabber.
36-
37-GENERAL ARGUMENTS (kwargs)
38-
39- Where possible, the module-level default is indicated, and legal
40- values are provided.
41-
42- copy_local = 0 [0|1]
43-
44- ignored except for file:// urls, in which case it specifies
45- whether urlgrab should still make a copy of the file, or simply
46- point to the existing copy. The module level default for this
47- option is 0.
48-
49- close_connection = 0 [0|1]
50-
51- tells URLGrabber to close the connection after a file has been
52- transfered. This is ignored unless the download happens with the
53- http keepalive handler (keepalive=1). Otherwise, the connection
54- is left open for further use. The module level default for this
55- option is 0 (keepalive connections will not be closed).
56-
57- keepalive = 1 [0|1]
58-
59- specifies whether keepalive should be used for HTTP/1.1 servers
60- that support it. The module level default for this option is 1
61- (keepalive is enabled).
62-
63- progress_obj = None
64-
65- a class instance that supports the following methods:
66- po.start(filename, url, basename, length, text)
67- # length will be None if unknown
68- po.update(read) # read == bytes read so far
69- po.end()
70-
71- text = None
72-
73- specifies alternative text to be passed to the progress meter
74- object. If not given, the default progress meter will use the
75- basename of the file.
76-
77- throttle = 1.0
78-
79- a number - if it's an int, it's the bytes/second throttle limit.
80- If it's a float, it is first multiplied by bandwidth. If throttle
81- == 0, throttling is disabled. If None, the module-level default
82- (which can be set on default_grabber.throttle) is used. See
83- BANDWIDTH THROTTLING for more information.
84-
85- timeout = None
86-
87- a positive float expressing the number of seconds to wait for socket
88- operations. If the value is None or 0.0, socket operations will block
89- forever. Setting this option causes urlgrabber to call the settimeout
90- method on the Socket object used for the request. See the Python
91- documentation on settimeout for more information.
92- http://www.python.org/doc/current/lib/socket-objects.html
93-
94- bandwidth = 0
95-
96- the nominal max bandwidth in bytes/second. If throttle is a float
97- and bandwidth == 0, throttling is disabled. If None, the
98- module-level default (which can be set on
99- default_grabber.bandwidth) is used. See BANDWIDTH THROTTLING for
100- more information.
101-
102- range = None
103-
104- a tuple of the form (first_byte, last_byte) describing a byte
105- range to retrieve. Either or both of the values may set to
106- None. If first_byte is None, byte offset 0 is assumed. If
107- last_byte is None, the last byte available is assumed. Note that
108- the range specification is python-like in that (0,10) will yeild
109- the first 10 bytes of the file.
110-
111- If set to None, no range will be used.
112-
113- reget = None [None|'simple'|'check_timestamp']
114-
115- whether to attempt to reget a partially-downloaded file. Reget
116- only applies to .urlgrab and (obviously) only if there is a
117- partially downloaded file. Reget has two modes:
118-
119- 'simple' -- the local file will always be trusted. If there
120- are 100 bytes in the local file, then the download will always
121- begin 100 bytes into the requested file.
122-
123- 'check_timestamp' -- the timestamp of the server file will be
124- compared to the timestamp of the local file. ONLY if the
125- local file is newer than or the same age as the server file
126- will reget be used. If the server file is newer, or the
127- timestamp is not returned, the entire file will be fetched.
128-
129- NOTE: urlgrabber can do very little to verify that the partial
130- file on disk is identical to the beginning of the remote file.
131- You may want to either employ a custom "checkfunc" or simply avoid
132- using reget in situations where corruption is a concern.
133-
134- user_agent = 'urlgrabber/VERSION'
135-
136- a string, usually of the form 'AGENT/VERSION' that is provided to
137- HTTP servers in the User-agent header. The module level default
138- for this option is "urlgrabber/VERSION".
139-
140- http_headers = None
141-
142- a tuple of 2-tuples, each containing a header and value. These
143- will be used for http and https requests only. For example, you
144- can do
145- http_headers = (('Pragma', 'no-cache'),)
146-
147- ftp_headers = None
148-
149- this is just like http_headers, but will be used for ftp requests.
150-
151- proxies = None
152-
153- a dictionary that maps protocol schemes to proxy hosts. For
154- example, to use a proxy server on host "foo" port 3128 for http
155- and https URLs:
156- proxies={ 'http' : 'http://foo:3128', 'https' : 'http://foo:3128' }
157- note that proxy authentication information may be provided using
158- normal URL constructs:
159- proxies={ 'http' : 'http://user:host@foo:3128' }
160- Lastly, if proxies is None, the default environment settings will
161- be used.
162-
163- prefix = None
164-
165- a url prefix that will be prepended to all requested urls. For
166- example:
167- g = URLGrabber(prefix='http://foo.com/mirror/')
168- g.urlgrab('some/file.txt')
169- ## this will fetch 'http://foo.com/mirror/some/file.txt'
170- This option exists primarily to allow identical behavior to
171- MirrorGroup (and derived) instances. Note: a '/' will be inserted
172- if necessary, so you cannot specify a prefix that ends with a
173- partial file or directory name.
174-
175- opener = None
176- No-op when using the curl backend (default)
177-
178- cache_openers = True
179- No-op when using the curl backend (default)
180-
181- data = None
182-
183- Only relevant for the HTTP family (and ignored for other
184- protocols), this allows HTTP POSTs. When the data kwarg is
185- present (and not None), an HTTP request will automatically become
186- a POST rather than GET. This is done by direct passthrough to
187- urllib2. If you use this, you may also want to set the
188- 'Content-length' and 'Content-type' headers with the http_headers
189- option. Note that python 2.2 handles the case of these
190- badly and if you do not use the proper case (shown here), your
191- values will be overridden with the defaults.
192-
193- urlparser = URLParser()
194-
195- The URLParser class handles pre-processing of URLs, including
196- auth-handling for user/pass encoded in http urls, file handing
197- (that is, filenames not sent as a URL), and URL quoting. If you
198- want to override any of this behavior, you can pass in a
199- replacement instance. See also the 'quote' option.
200-
201- quote = None
202-
203- Whether or not to quote the path portion of a url.
204- quote = 1 -> quote the URLs (they're not quoted yet)
205- quote = 0 -> do not quote them (they're already quoted)
206- quote = None -> guess what to do
207-
208- This option only affects proper urls like 'file:///etc/passwd'; it
209- does not affect 'raw' filenames like '/etc/passwd'. The latter
210- will always be quoted as they are converted to URLs. Also, only
211- the path part of a url is quoted. If you need more fine-grained
212- control, you should probably subclass URLParser and pass it in via
213- the 'urlparser' option.
214-
215- ssl_ca_cert = None
216-
217- this option can be used if M2Crypto is available and will be
218- ignored otherwise. If provided, it will be used to create an SSL
219- context. If both ssl_ca_cert and ssl_context are provided, then
220- ssl_context will be ignored and a new context will be created from
221- ssl_ca_cert.
222-
223- ssl_context = None
224-
225- No-op when using the curl backend (default)
226-
227-
228- self.ssl_verify_peer = True
229-
230- Check the server's certificate to make sure it is valid with what our CA validates
231-
232- self.ssl_verify_host = True
233-
234- Check the server's hostname to make sure it matches the certificate DN
235-
236- self.ssl_key = None
237-
238- Path to the key the client should use to connect/authenticate with
239-
240- self.ssl_key_type = 'PEM'
241-
242- PEM or DER - format of key
243-
244- self.ssl_cert = None
245-
246- Path to the ssl certificate the client should use to to authenticate with
247-
248- self.ssl_cert_type = 'PEM'
249-
250- PEM or DER - format of certificate
251-
252- self.ssl_key_pass = None
253-
254- password to access the ssl_key
255-
256- self.size = None
257-
258- size (in bytes) or Maximum size of the thing being downloaded.
259- This is mostly to keep us from exploding with an endless datastream
260-
261- self.max_header_size = 2097152
262-
263- Maximum size (in bytes) of the headers.
264-
265-
266-RETRY RELATED ARGUMENTS
267-
268- retry = None
269-
270- the number of times to retry the grab before bailing. If this is
271- zero, it will retry forever. This was intentional... really, it
272- was :). If this value is not supplied or is supplied but is None
273- retrying does not occur.
274-
275- retrycodes = [-1,2,4,5,6,7]
276-
277- a sequence of errorcodes (values of e.errno) for which it should
278- retry. See the doc on URLGrabError for more details on this. You
279- might consider modifying a copy of the default codes rather than
280- building yours from scratch so that if the list is extended in the
281- future (or one code is split into two) you can still enjoy the
282- benefits of the default list. You can do that with something like
283- this:
284-
285- retrycodes = urlgrabber.grabber.URLGrabberOptions().retrycodes
286- if 12 not in retrycodes:
287- retrycodes.append(12)
288-
289- checkfunc = None
290-
291- a function to do additional checks. This defaults to None, which
292- means no additional checking. The function should simply return
293- on a successful check. It should raise URLGrabError on an
294- unsuccessful check. Raising of any other exception will be
295- considered immediate failure and no retries will occur.
296-
297- If it raises URLGrabError, the error code will determine the retry
298- behavior. Negative error numbers are reserved for use by these
299- passed in functions, so you can use many negative numbers for
300- different types of failure. By default, -1 results in a retry,
301- but this can be customized with retrycodes.
302-
303- If you simply pass in a function, it will be given exactly one
304- argument: a CallbackObject instance with the .url attribute
305- defined and either .filename (for urlgrab) or .data (for urlread).
306- For urlgrab, .filename is the name of the local file. For
307- urlread, .data is the actual string data. If you need other
308- arguments passed to the callback (program state of some sort), you
309- can do so like this:
310-
311- checkfunc=(function, ('arg1', 2), {'kwarg': 3})
312-
313- if the downloaded file has filename /tmp/stuff, then this will
314- result in this call (for urlgrab):
315-
316- function(obj, 'arg1', 2, kwarg=3)
317- # obj.filename = '/tmp/stuff'
318- # obj.url = 'http://foo.com/stuff'
319-
320- NOTE: both the "args" tuple and "kwargs" dict must be present if
321- you use this syntax, but either (or both) can be empty.
322-
323- failure_callback = None
324-
325- The callback that gets called during retries when an attempt to
326- fetch a file fails. The syntax for specifying the callback is
327- identical to checkfunc, except for the attributes defined in the
328- CallbackObject instance. The attributes for failure_callback are:
329-
330- exception = the raised exception
331- url = the url we're trying to fetch
332- tries = the number of tries so far (including this one)
333- retry = the value of the retry option
334-
335- The callback is present primarily to inform the calling program of
336- the failure, but if it raises an exception (including the one it's
337- passed) that exception will NOT be caught and will therefore cause
338- future retries to be aborted.
339-
340- The callback is called for EVERY failure, including the last one.
341- On the last try, the callback can raise an alternate exception,
342- but it cannot (without severe trickiness) prevent the exception
343- from being raised.
344-
345- interrupt_callback = None
346-
347- This callback is called if KeyboardInterrupt is received at any
348- point in the transfer. Basically, this callback can have three
349- impacts on the fetch process based on the way it exits:
350-
351- 1) raise no exception: the current fetch will be aborted, but
352- any further retries will still take place
353-
354- 2) raise a URLGrabError: if you're using a MirrorGroup, then
355- this will prompt a failover to the next mirror according to
356- the behavior of the MirrorGroup subclass. It is recommended
357- that you raise URLGrabError with code 15, 'user abort'. If
358- you are NOT using a MirrorGroup subclass, then this is the
359- same as (3).
360-
361- 3) raise some other exception (such as KeyboardInterrupt), which
362- will not be caught at either the grabber or mirror levels.
363- That is, it will be raised up all the way to the caller.
364-
365- This callback is very similar to failure_callback. They are
366- passed the same arguments, so you could use the same function for
367- both.
368-
369-BANDWIDTH THROTTLING
370-
371- urlgrabber supports throttling via two values: throttle and
372- bandwidth Between the two, you can either specify and absolute
373- throttle threshold or specify a theshold as a fraction of maximum
374- available bandwidth.
375-
376- throttle is a number - if it's an int, it's the bytes/second
377- throttle limit. If it's a float, it is first multiplied by
378- bandwidth. If throttle == 0, throttling is disabled. If None, the
379- module-level default (which can be set with set_throttle) is used.
380-
381- bandwidth is the nominal max bandwidth in bytes/second. If throttle
382- is a float and bandwidth == 0, throttling is disabled. If None, the
383- module-level default (which can be set with set_bandwidth) is used.
384-
385- THROTTLING EXAMPLES:
386-
387- Lets say you have a 100 Mbps connection. This is (about) 10^8 bits
388- per second, or 12,500,000 Bytes per second. You have a number of
389- throttling options:
390-
391- *) set_bandwidth(12500000); set_throttle(0.5) # throttle is a float
392-
393- This will limit urlgrab to use half of your available bandwidth.
394-
395- *) set_throttle(6250000) # throttle is an int
396-
397- This will also limit urlgrab to use half of your available
398- bandwidth, regardless of what bandwidth is set to.
399-
400- *) set_throttle(6250000); set_throttle(1.0) # float
401-
402- Use half your bandwidth
403-
404- *) set_throttle(6250000); set_throttle(2.0) # float
405-
406- Use up to 12,500,000 Bytes per second (your nominal max bandwidth)
407-
408- *) set_throttle(6250000); set_throttle(0) # throttle = 0
409-
410- Disable throttling - this is more efficient than a very large
411- throttle setting.
412-
413- *) set_throttle(0); set_throttle(1.0) # throttle is float, bandwidth = 0
414-
415- Disable throttling - this is the default when the module is loaded.
416-
417- SUGGESTED AUTHOR IMPLEMENTATION (THROTTLING)
418-
419- While this is flexible, it's not extremely obvious to the user. I
420- suggest you implement a float throttle as a percent to make the
421- distinction between absolute and relative throttling very explicit.
422-
423- Also, you may want to convert the units to something more convenient
424- than bytes/second, such as kbps or kB/s, etc.
425-
426-"""
427-
428-
429-
430-import os
431-import sys
432-import urlparse
433-import time
434-import string
435-import urllib
436-import urllib2
437-import mimetools
438-import thread
439-import types
440-import stat
441-import pycurl
442-from ftplib import parse150
443-from StringIO import StringIO
444-from httplib import HTTPException
445-import socket
446-from byterange import range_tuple_normalize, range_tuple_to_header, RangeError
447-
448-########################################################################
449-# MODULE INITIALIZATION
450-########################################################################
451-try:
452- exec('from ' + (__name__.split('.'))[0] + ' import __version__')
453-except:
454- __version__ = '???'
455-
456-########################################################################
457-# functions for debugging output. These functions are here because they
458-# are also part of the module initialization.
459-DEBUG = None
460-def set_logger(DBOBJ):
461- """Set the DEBUG object. This is called by _init_default_logger when
462- the environment variable URLGRABBER_DEBUG is set, but can also be
463- called by a calling program. Basically, if the calling program uses
464- the logging module and would like to incorporate urlgrabber logging,
465- then it can do so this way. It's probably not necessary as most
466- internal logging is only for debugging purposes.
467-
468- The passed-in object should be a logging.Logger instance. It will
469- be pushed into the keepalive and byterange modules if they're
470- being used. The mirror module pulls this object in on import, so
471- you will need to manually push into it. In fact, you may find it
472- tidier to simply push your logging object (or objects) into each
473- of these modules independently.
474- """
475-
476- global DEBUG
477- DEBUG = DBOBJ
478-
479-def _init_default_logger(logspec=None):
480- '''Examines the environment variable URLGRABBER_DEBUG and creates
481- a logging object (logging.logger) based on the contents. It takes
482- the form
483-
484- URLGRABBER_DEBUG=level,filename
485-
486- where "level" can be either an integer or a log level from the
487- logging module (DEBUG, INFO, etc). If the integer is zero or
488- less, logging will be disabled. Filename is the filename where
489- logs will be sent. If it is "-", then stdout will be used. If
490- the filename is empty or missing, stderr will be used. If the
491- variable cannot be processed or the logging module cannot be
492- imported (python < 2.3) then logging will be disabled. Here are
493- some examples:
494-
495- URLGRABBER_DEBUG=1,debug.txt # log everything to debug.txt
496- URLGRABBER_DEBUG=WARNING,- # log warning and higher to stdout
497- URLGRABBER_DEBUG=INFO # log info and higher to stderr
498-
499- This funtion is called during module initialization. It is not
500- intended to be called from outside. The only reason it is a
501- function at all is to keep the module-level namespace tidy and to
502- collect the code into a nice block.'''
503-
504- try:
505- if logspec is None:
506- logspec = os.environ['URLGRABBER_DEBUG']
507- dbinfo = logspec.split(',')
508- import logging
509- level = logging._levelNames.get(dbinfo[0], None)
510- if level is None: level = int(dbinfo[0])
511- if level < 1: raise ValueError()
512-
513- formatter = logging.Formatter('%(asctime)s %(message)s')
514- if len(dbinfo) > 1: filename = dbinfo[1]
515- else: filename = ''
516- if filename == '': handler = logging.StreamHandler(sys.stderr)
517- elif filename == '-': handler = logging.StreamHandler(sys.stdout)
518- else: handler = logging.FileHandler(filename)
519- handler.setFormatter(formatter)
520- DBOBJ = logging.getLogger('urlgrabber')
521- DBOBJ.addHandler(handler)
522- DBOBJ.setLevel(level)
523- except (KeyError, ImportError, ValueError):
524- DBOBJ = None
525- set_logger(DBOBJ)
526-
527-def _log_package_state():
528- if not DEBUG: return
529- DEBUG.info('urlgrabber version = %s' % __version__)
530- DEBUG.info('trans function "_" = %s' % _)
531-
532-_init_default_logger()
533-_log_package_state()
534-
535-
536-# normally this would be from i18n or something like it ...
537-def _(st):
538- return st
539-
540-########################################################################
541-# END MODULE INITIALIZATION
542-########################################################################
543-
544-
545-
546-class URLGrabError(IOError):
547- """
548- URLGrabError error codes:
549-
550- URLGrabber error codes (0 -- 255)
551- 0 - everything looks good (you should never see this)
552- 1 - malformed url
553- 2 - local file doesn't exist
554- 3 - request for non-file local file (dir, etc)
555- 4 - IOError on fetch
556- 5 - OSError on fetch
557- 6 - no content length header when we expected one
558- 7 - HTTPException
559- 8 - Exceeded read limit (for urlread)
560- 9 - Requested byte range not satisfiable.
561- 10 - Byte range requested, but range support unavailable
562- 11 - Illegal reget mode
563- 12 - Socket timeout
564- 13 - malformed proxy url
565- 14 - HTTPError (includes .code and .exception attributes)
566- 15 - user abort
567- 16 - error writing to local file
568-
569- MirrorGroup error codes (256 -- 511)
570- 256 - No more mirrors left to try
571-
572- Custom (non-builtin) classes derived from MirrorGroup (512 -- 767)
573- [ this range reserved for application-specific error codes ]
574-
575- Retry codes (< 0)
576- -1 - retry the download, unknown reason
577-
578- Note: to test which group a code is in, you can simply do integer
579- division by 256: e.errno / 256
580-
581- Negative codes are reserved for use by functions passed in to
582- retrygrab with checkfunc. The value -1 is built in as a generic
583- retry code and is already included in the retrycodes list.
584- Therefore, you can create a custom check function that simply
585- returns -1 and the fetch will be re-tried. For more customized
586- retries, you can use other negative number and include them in
587- retry-codes. This is nice for outputting useful messages about
588- what failed.
589-
590- You can use these error codes like so:
591- try: urlgrab(url)
592- except URLGrabError, e:
593- if e.errno == 3: ...
594- # or
595- print e.strerror
596- # or simply
597- print e #### print '[Errno %i] %s' % (e.errno, e.strerror)
598- """
599- def __init__(self, *args):
600- IOError.__init__(self, *args)
601- self.url = "No url specified"
602-
603-class CallbackObject:
604- """Container for returned callback data.
605-
606- This is currently a dummy class into which urlgrabber can stuff
607- information for passing to callbacks. This way, the prototype for
608- all callbacks is the same, regardless of the data that will be
609- passed back. Any function that accepts a callback function as an
610- argument SHOULD document what it will define in this object.
611-
612- It is possible that this class will have some greater
613- functionality in the future.
614- """
615- def __init__(self, **kwargs):
616- self.__dict__.update(kwargs)
617-
618-def urlgrab(url, filename=None, **kwargs):
619- """grab the file at <url> and make a local copy at <filename>
620- If filename is none, the basename of the url is used.
621- urlgrab returns the filename of the local file, which may be different
622- from the passed-in filename if the copy_local kwarg == 0.
623-
624- See module documentation for a description of possible kwargs.
625- """
626- return default_grabber.urlgrab(url, filename, **kwargs)
627-
628-def urlopen(url, **kwargs):
629- """open the url and return a file object
630- If a progress object or throttle specifications exist, then
631- a special file object will be returned that supports them.
632- The file object can be treated like any other file object.
633-
634- See module documentation for a description of possible kwargs.
635- """
636- return default_grabber.urlopen(url, **kwargs)
637-
638-def urlread(url, limit=None, **kwargs):
639- """read the url into a string, up to 'limit' bytes
640- If the limit is exceeded, an exception will be thrown. Note that urlread
641- is NOT intended to be used as a way of saying "I want the first N bytes"
642- but rather 'read the whole file into memory, but don't use too much'
643-
644- See module documentation for a description of possible kwargs.
645- """
646- return default_grabber.urlread(url, limit, **kwargs)
647-
648-
649-class URLParser:
650- """Process the URLs before passing them to urllib2.
651-
652- This class does several things:
653-
654- * add any prefix
655- * translate a "raw" file to a proper file: url
656- * handle any http or https auth that's encoded within the url
657- * quote the url
658-
659- Only the "parse" method is called directly, and it calls sub-methods.
660-
661- An instance of this class is held in the options object, which
662- means that it's easy to change the behavior by sub-classing and
663- passing the replacement in. It need only have a method like:
664-
665- url, parts = urlparser.parse(url, opts)
666- """
667-
668- def parse(self, url, opts):
669- """parse the url and return the (modified) url and its parts
670-
671- Note: a raw file WILL be quoted when it's converted to a URL.
672- However, other urls (ones which come with a proper scheme) may
673- or may not be quoted according to opts.quote
674-
675- opts.quote = 1 --> quote it
676- opts.quote = 0 --> do not quote it
677- opts.quote = None --> guess
678- """
679- quote = opts.quote
680-
681- if opts.prefix:
682- url = self.add_prefix(url, opts.prefix)
683-
684- parts = urlparse.urlparse(url)
685- (scheme, host, path, parm, query, frag) = parts
686-
687- if not scheme or (len(scheme) == 1 and scheme in string.letters):
688- # if a scheme isn't specified, we guess that it's "file:"
689- if url[0] not in '/\\': url = os.path.abspath(url)
690- url = 'file:' + urllib.pathname2url(url)
691- parts = urlparse.urlparse(url)
692- quote = 0 # pathname2url quotes, so we won't do it again
693-
694- if scheme in ['http', 'https']:
695- parts = self.process_http(parts, url)
696-
697- if quote is None:
698- quote = self.guess_should_quote(parts)
699- if quote:
700- parts = self.quote(parts)
701-
702- url = urlparse.urlunparse(parts)
703- return url, parts
704-
705- def add_prefix(self, url, prefix):
706- if prefix[-1] == '/' or url[0] == '/':
707- url = prefix + url
708- else:
709- url = prefix + '/' + url
710- return url
711-
712- def process_http(self, parts, url):
713- (scheme, host, path, parm, query, frag) = parts
714- # TODO: auth-parsing here, maybe? pycurl doesn't really need it
715- return (scheme, host, path, parm, query, frag)
716-
717- def quote(self, parts):
718- """quote the URL
719-
720- This method quotes ONLY the path part. If you need to quote
721- other parts, you should override this and pass in your derived
722- class. The other alternative is to quote other parts before
723- passing into urlgrabber.
724- """
725- (scheme, host, path, parm, query, frag) = parts
726- path = urllib.quote(path)
727- return (scheme, host, path, parm, query, frag)
728-
729- hexvals = '0123456789ABCDEF'
730- def guess_should_quote(self, parts):
731- """
732- Guess whether we should quote a path. This amounts to
733- guessing whether it's already quoted.
734-
735- find ' ' -> 1
736- find '%' -> 1
737- find '%XX' -> 0
738- else -> 1
739- """
740- (scheme, host, path, parm, query, frag) = parts
741- if ' ' in path:
742- return 1
743- ind = string.find(path, '%')
744- if ind > -1:
745- while ind > -1:
746- if len(path) < ind+3:
747- return 1
748- code = path[ind+1:ind+3].upper()
749- if code[0] not in self.hexvals or \
750- code[1] not in self.hexvals:
751- return 1
752- ind = string.find(path, '%', ind+1)
753- return 0
754- return 1
755-
756-class URLGrabberOptions:
757- """Class to ease kwargs handling."""
758-
759- def __init__(self, delegate=None, **kwargs):
760- """Initialize URLGrabberOptions object.
761- Set default values for all options and then update options specified
762- in kwargs.
763- """
764- self.delegate = delegate
765- if delegate is None:
766- self._set_defaults()
767- self._set_attributes(**kwargs)
768-
769- def __getattr__(self, name):
770- if self.delegate and hasattr(self.delegate, name):
771- return getattr(self.delegate, name)
772- raise AttributeError, name
773-
774- def raw_throttle(self):
775- """Calculate raw throttle value from throttle and bandwidth
776- values.
777- """
778- if self.throttle <= 0:
779- return 0
780- elif type(self.throttle) == type(0):
781- return float(self.throttle)
782- else: # throttle is a float
783- return self.bandwidth * self.throttle
784-
785- def derive(self, **kwargs):
786- """Create a derived URLGrabberOptions instance.
787- This method creates a new instance and overrides the
788- options specified in kwargs.
789- """
790- return URLGrabberOptions(delegate=self, **kwargs)
791-
792- def _set_attributes(self, **kwargs):
793- """Update object attributes with those provided in kwargs."""
794- self.__dict__.update(kwargs)
795- if kwargs.has_key('range'):
796- # normalize the supplied range value
797- self.range = range_tuple_normalize(self.range)
798- if not self.reget in [None, 'simple', 'check_timestamp']:
799- raise URLGrabError(11, _('Illegal reget mode: %s') \
800- % (self.reget, ))
801-
802- def _set_defaults(self):
803- """Set all options to their default values.
804- When adding new options, make sure a default is
805- provided here.
806- """
807- self.progress_obj = None
808- self.throttle = 1.0
809- self.bandwidth = 0
810- self.retry = None
811- self.retrycodes = [-1,2,4,5,6,7]
812- self.checkfunc = None
813- self.copy_local = 0
814- self.close_connection = 0
815- self.range = None
816- self.user_agent = 'urlgrabber/%s' % __version__
817- self.keepalive = 1
818- self.proxies = None
819- self.reget = None
820- self.failure_callback = None
821- self.interrupt_callback = None
822- self.prefix = None
823- self.opener = None
824- self.cache_openers = True
825- self.timeout = None
826- self.text = None
827- self.http_headers = None
828- self.ftp_headers = None
829- self.data = None
830- self.urlparser = URLParser()
831- self.quote = None
832- self.ssl_ca_cert = None # sets SSL_CAINFO - path to certdb
833- self.ssl_context = None # no-op in pycurl
834- self.ssl_verify_peer = True # check peer's cert for authenticityb
835- self.ssl_verify_host = True # make sure who they are and who the cert is for matches
836- self.ssl_key = None # client key
837- self.ssl_key_type = 'PEM' #(or DER)
838- self.ssl_cert = None # client cert
839- self.ssl_cert_type = 'PEM' # (or DER)
840- self.ssl_key_pass = None # password to access the key
841- self.size = None # if we know how big the thing we're getting is going
842- # to be. this is ultimately a MAXIMUM size for the file
843- self.max_header_size = 2097152 #2mb seems reasonable for maximum header size
844-
845- def __repr__(self):
846- return self.format()
847-
848- def format(self, indent=' '):
849- keys = self.__dict__.keys()
850- if self.delegate is not None:
851- keys.remove('delegate')
852- keys.sort()
853- s = '{\n'
854- for k in keys:
855- s = s + indent + '%-15s: %s,\n' % \
856- (repr(k), repr(self.__dict__[k]))
857- if self.delegate:
858- df = self.delegate.format(indent + ' ')
859- s = s + indent + '%-15s: %s\n' % ("'delegate'", df)
860- s = s + indent + '}'
861- return s
862-
863-class URLGrabber:
864- """Provides easy opening of URLs with a variety of options.
865-
866- All options are specified as kwargs. Options may be specified when
867- the class is created and may be overridden on a per request basis.
868-
869- New objects inherit default values from default_grabber.
870- """
871-
872- def __init__(self, **kwargs):
873- self.opts = URLGrabberOptions(**kwargs)
874-
875- def _retry(self, opts, func, *args):
876- tries = 0
877- while 1:
878- # there are only two ways out of this loop. The second has
879- # several "sub-ways"
880- # 1) via the return in the "try" block
881- # 2) by some exception being raised
882- # a) an excepton is raised that we don't "except"
883- # b) a callback raises ANY exception
884- # c) we're not retry-ing or have run out of retries
885- # d) the URLGrabError code is not in retrycodes
886- # beware of infinite loops :)
887- tries = tries + 1
888- exception = None
889- retrycode = None
890- callback = None
891- if DEBUG: DEBUG.info('attempt %i/%s: %s',
892- tries, opts.retry, args[0])
893- try:
894- r = apply(func, (opts,) + args, {})
895- if DEBUG: DEBUG.info('success')
896- return r
897- except URLGrabError, e:
898- exception = e
899- callback = opts.failure_callback
900- retrycode = e.errno
901- except KeyboardInterrupt, e:
902- exception = e
903- callback = opts.interrupt_callback
904-
905- if DEBUG: DEBUG.info('exception: %s', exception)
906- if callback:
907- if DEBUG: DEBUG.info('calling callback: %s', callback)
908- cb_func, cb_args, cb_kwargs = self._make_callback(callback)
909- obj = CallbackObject(exception=exception, url=args[0],
910- tries=tries, retry=opts.retry)
911- cb_func(obj, *cb_args, **cb_kwargs)
912-
913- if (opts.retry is None) or (tries == opts.retry):
914- if DEBUG: DEBUG.info('retries exceeded, re-raising')
915- raise
916-
917- if (retrycode is not None) and (retrycode not in opts.retrycodes):
918- if DEBUG: DEBUG.info('retrycode (%i) not in list %s, re-raising',
919- retrycode, opts.retrycodes)
920- raise
921-
922- def urlopen(self, url, **kwargs):
923- """open the url and return a file object
924- If a progress object or throttle value specified when this
925- object was created, then a special file object will be
926- returned that supports them. The file object can be treated
927- like any other file object.
928- """
929- opts = self.opts.derive(**kwargs)
930- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
931- (url,parts) = opts.urlparser.parse(url, opts)
932- def retryfunc(opts, url):
933- return PyCurlFileObject(url, filename=None, opts=opts)
934- return self._retry(opts, retryfunc, url)
935-
936- def urlgrab(self, url, filename=None, **kwargs):
937- """grab the file at <url> and make a local copy at <filename>
938- If filename is none, the basename of the url is used.
939- urlgrab returns the filename of the local file, which may be
940- different from the passed-in filename if copy_local == 0.
941- """
942- opts = self.opts.derive(**kwargs)
943- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
944- (url,parts) = opts.urlparser.parse(url, opts)
945- (scheme, host, path, parm, query, frag) = parts
946- if filename is None:
947- filename = os.path.basename( urllib.unquote(path) )
948- if scheme == 'file' and not opts.copy_local:
949- # just return the name of the local file - don't make a
950- # copy currently
951- path = urllib.url2pathname(path)
952- if host:
953- path = os.path.normpath('//' + host + path)
954- if not os.path.exists(path):
955- err = URLGrabError(2,
956- _('Local file does not exist: %s') % (path, ))
957- err.url = url
958- raise err
959- elif not os.path.isfile(path):
960- err = URLGrabError(3,
961- _('Not a normal file: %s') % (path, ))
962- err.url = url
963- raise err
964-
965- elif not opts.range:
966- if not opts.checkfunc is None:
967- cb_func, cb_args, cb_kwargs = \
968- self._make_callback(opts.checkfunc)
969- obj = CallbackObject()
970- obj.filename = path
971- obj.url = url
972- apply(cb_func, (obj, )+cb_args, cb_kwargs)
973- return path
974-
975- def retryfunc(opts, url, filename):
976- fo = PyCurlFileObject(url, filename, opts)
977- try:
978- fo._do_grab()
979- if not opts.checkfunc is None:
980- cb_func, cb_args, cb_kwargs = \
981- self._make_callback(opts.checkfunc)
982- obj = CallbackObject()
983- obj.filename = filename
984- obj.url = url
985- apply(cb_func, (obj, )+cb_args, cb_kwargs)
986- finally:
987- fo.close()
988- return filename
989-
990- return self._retry(opts, retryfunc, url, filename)
991-
992- def urlread(self, url, limit=None, **kwargs):
993- """read the url into a string, up to 'limit' bytes
994- If the limit is exceeded, an exception will be thrown. Note
995- that urlread is NOT intended to be used as a way of saying
996- "I want the first N bytes" but rather 'read the whole file
997- into memory, but don't use too much'
998- """
999- opts = self.opts.derive(**kwargs)
1000- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
1001- (url,parts) = opts.urlparser.parse(url, opts)
1002- if limit is not None:
1003- limit = limit + 1
1004-
1005- def retryfunc(opts, url, limit):
1006- fo = PyCurlFileObject(url, filename=None, opts=opts)
1007- s = ''
1008- try:
1009- # this is an unfortunate thing. Some file-like objects
1010- # have a default "limit" of None, while the built-in (real)
1011- # file objects have -1. They each break the other, so for
1012- # now, we just force the default if necessary.
1013- if limit is None: s = fo.read()
1014- else: s = fo.read(limit)
1015-
1016- if not opts.checkfunc is None:
1017- cb_func, cb_args, cb_kwargs = \
1018- self._make_callback(opts.checkfunc)
1019- obj = CallbackObject()
1020- obj.data = s
1021- obj.url = url
1022- apply(cb_func, (obj, )+cb_args, cb_kwargs)
1023- finally:
1024- fo.close()
1025- return s
1026-
1027- s = self._retry(opts, retryfunc, url, limit)
1028- if limit and len(s) > limit:
1029- err = URLGrabError(8,
1030- _('Exceeded limit (%i): %s') % (limit, url))
1031- err.url = url
1032- raise err
1033-
1034- return s
1035-
1036- def _make_callback(self, callback_obj):
1037- if callable(callback_obj):
1038- return callback_obj, (), {}
1039- else:
1040- return callback_obj
1041-
1042-# create the default URLGrabber used by urlXXX functions.
1043-# NOTE: actual defaults are set in URLGrabberOptions
1044-default_grabber = URLGrabber()
1045-
1046-
1047-class PyCurlFileObject():
1048- def __init__(self, url, filename, opts):
1049- self.fo = None
1050- self._hdr_dump = ''
1051- self._parsed_hdr = None
1052- self.url = url
1053- self.scheme = urlparse.urlsplit(self.url)[0]
1054- self.filename = filename
1055- self.append = False
1056- self.reget_time = None
1057- self.opts = opts
1058- if self.opts.reget == 'check_timestamp':
1059- raise NotImplementedError, "check_timestamp regets are not implemented in this ver of urlgrabber. Please report this."
1060- self._complete = False
1061- self._rbuf = ''
1062- self._rbufsize = 1024*8
1063- self._ttime = time.time()
1064- self._tsize = 0
1065- self._amount_read = 0
1066- self._reget_length = 0
1067- self._prog_running = False
1068- self._error = (None, None)
1069- self.size = None
1070- self._do_open()
1071-
1072-
1073- def __getattr__(self, name):
1074- """This effectively allows us to wrap at the instance level.
1075- Any attribute not found in _this_ object will be searched for
1076- in self.fo. This includes methods."""
1077-
1078- if hasattr(self.fo, name):
1079- return getattr(self.fo, name)
1080- raise AttributeError, name
1081-
1082- def _retrieve(self, buf):
1083- try:
1084- if not self._prog_running:
1085- if self.opts.progress_obj:
1086- size = self.size + self._reget_length
1087- self.opts.progress_obj.start(self._prog_reportname,
1088- urllib.unquote(self.url),
1089- self._prog_basename,
1090- size=size,
1091- text=self.opts.text)
1092- self._prog_running = True
1093- self.opts.progress_obj.update(self._amount_read)
1094-
1095- self._amount_read += len(buf)
1096- self.fo.write(buf)
1097- return len(buf)
1098- except KeyboardInterrupt:
1099- return -1
1100-
1101- def _hdr_retrieve(self, buf):
1102- if self._over_max_size(cur=len(self._hdr_dump),
1103- max_size=self.opts.max_header_size):
1104- return -1
1105- try:
1106- self._hdr_dump += buf
1107- # we have to get the size before we do the progress obj start
1108- # but we can't do that w/o making it do 2 connects, which sucks
1109- # so we cheat and stuff it in here in the hdr_retrieve
1110- if self.scheme in ['http','https'] and buf.lower().find('content-length') != -1:
1111- length = buf.split(':')[1]
1112- self.size = int(length)
1113- elif self.scheme in ['ftp']:
1114- s = None
1115- if buf.startswith('213 '):
1116- s = buf[3:].strip()
1117- elif buf.startswith('150 '):
1118- s = parse150(buf)
1119- if s:
1120- self.size = int(s)
1121-
1122- return len(buf)
1123- except KeyboardInterrupt:
1124- return pycurl.READFUNC_ABORT
1125-
1126- def _return_hdr_obj(self):
1127- if self._parsed_hdr:
1128- return self._parsed_hdr
1129- statusend = self._hdr_dump.find('\n')
1130- hdrfp = StringIO()
1131- hdrfp.write(self._hdr_dump[statusend:])
1132- self._parsed_hdr = mimetools.Message(hdrfp)
1133- return self._parsed_hdr
1134-
1135- hdr = property(_return_hdr_obj)
1136- http_code = property(fget=
1137- lambda self: self.curl_obj.getinfo(pycurl.RESPONSE_CODE))
1138-
1139- def _set_opts(self, opts={}):
1140- # XXX
1141- if not opts:
1142- opts = self.opts
1143-
1144-
1145- # defaults we're always going to set
1146- self.curl_obj.setopt(pycurl.NOPROGRESS, False)
1147- self.curl_obj.setopt(pycurl.NOSIGNAL, True)
1148- self.curl_obj.setopt(pycurl.WRITEFUNCTION, self._retrieve)
1149- self.curl_obj.setopt(pycurl.HEADERFUNCTION, self._hdr_retrieve)
1150- self.curl_obj.setopt(pycurl.PROGRESSFUNCTION, self._progress_update)
1151- self.curl_obj.setopt(pycurl.FAILONERROR, True)
1152- self.curl_obj.setopt(pycurl.OPT_FILETIME, True)
1153-
1154- if DEBUG:
1155- self.curl_obj.setopt(pycurl.VERBOSE, True)
1156- if opts.user_agent:
1157- self.curl_obj.setopt(pycurl.USERAGENT, opts.user_agent)
1158-
1159- # maybe to be options later
1160- self.curl_obj.setopt(pycurl.FOLLOWLOCATION, True)
1161- self.curl_obj.setopt(pycurl.MAXREDIRS, 5)
1162-
1163- # timeouts
1164- timeout = 300
1165- if opts.timeout:
1166- timeout = int(opts.timeout)
1167- self.curl_obj.setopt(pycurl.CONNECTTIMEOUT, timeout)
1168-
1169- # ssl options
1170- if self.scheme == 'https':
1171- if opts.ssl_ca_cert: # this may do ZERO with nss according to curl docs
1172- self.curl_obj.setopt(pycurl.CAPATH, opts.ssl_ca_cert)
1173- self.curl_obj.setopt(pycurl.CAINFO, opts.ssl_ca_cert)
1174- self.curl_obj.setopt(pycurl.SSL_VERIFYPEER, opts.ssl_verify_peer)
1175- self.curl_obj.setopt(pycurl.SSL_VERIFYHOST, opts.ssl_verify_host)
1176- if opts.ssl_key:
1177- self.curl_obj.setopt(pycurl.SSLKEY, opts.ssl_key)
1178- if opts.ssl_key_type:
1179- self.curl_obj.setopt(pycurl.SSLKEYTYPE, opts.ssl_key_type)
1180- if opts.ssl_cert:
1181- self.curl_obj.setopt(pycurl.SSLCERT, opts.ssl_cert)
1182- if opts.ssl_cert_type:
1183- self.curl_obj.setopt(pycurl.SSLCERTTYPE, opts.ssl_cert_type)
1184- if opts.ssl_key_pass:
1185- self.curl_obj.setopt(pycurl.SSLKEYPASSWD, opts.ssl_key_pass)
1186-
1187- #headers:
1188- if opts.http_headers and self.scheme in ('http', 'https'):
1189- headers = []
1190- for (tag, content) in opts.http_headers:
1191- headers.append('%s:%s' % (tag, content))
1192- self.curl_obj.setopt(pycurl.HTTPHEADER, headers)
1193-
1194- # ranges:
1195- if opts.range or opts.reget:
1196- range_str = self._build_range()
1197- if range_str:
1198- self.curl_obj.setopt(pycurl.RANGE, range_str)
1199-
1200- # throttle/bandwidth
1201- if hasattr(opts, 'raw_throttle') and opts.raw_throttle():
1202- self.curl_obj.setopt(pycurl.MAX_RECV_SPEED_LARGE, int(opts.raw_throttle()))
1203-
1204- # proxy settings
1205- if opts.proxies:
1206- for (scheme, proxy) in opts.proxies.items():
1207- if self.scheme in ('ftp'): # only set the ftp proxy for ftp items
1208- if scheme not in ('ftp'):
1209- continue
1210- else:
1211- if proxy == '_none_': proxy = ""
1212- self.curl_obj.setopt(pycurl.PROXY, proxy)
1213- elif self.scheme in ('http', 'https'):
1214- if scheme not in ('http', 'https'):
1215- continue
1216- else:
1217- if proxy == '_none_': proxy = ""
1218- self.curl_obj.setopt(pycurl.PROXY, proxy)
1219-
1220- # FIXME username/password/auth settings
1221-
1222- #posts - simple - expects the fields as they are
1223- if opts.data:
1224- self.curl_obj.setopt(pycurl.POST, True)
1225- self.curl_obj.setopt(pycurl.POSTFIELDS, self._to_utf8(opts.data))
1226-
1227- # our url
1228- self.curl_obj.setopt(pycurl.URL, self.url)
1229-
1230-
1231- def _do_perform(self):
1232- if self._complete:
1233- return
1234-
1235- try:
1236- self.curl_obj.perform()
1237- except pycurl.error, e:
1238- # XXX - break some of these out a bit more clearly
1239- # to other URLGrabErrors from
1240- # http://curl.haxx.se/libcurl/c/libcurl-errors.html
1241- # this covers e.args[0] == 22 pretty well - which will be common
1242-
1243- code = self.http_code
1244- errcode = e.args[0]
1245- if self._error[0]:
1246- errcode = self._error[0]
1247-
1248- if errcode == 23 and code >= 200 and code < 299:
1249- err = URLGrabError(15, _('User (or something) called abort %s: %s') % (self.url, e))
1250- err.url = self.url
1251-
1252- # this is probably wrong but ultimately this is what happens
1253- # we have a legit http code and a pycurl 'writer failed' code
1254- # which almost always means something aborted it from outside
1255- # since we cannot know what it is -I'm banking on it being
1256- # a ctrl-c. XXXX - if there's a way of going back two raises to
1257- # figure out what aborted the pycurl process FIXME
1258- raise KeyboardInterrupt
1259-
1260- elif errcode == 28:
1261- err = URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
1262- err.url = self.url
1263- raise err
1264- elif errcode == 35:
1265- msg = _("problem making ssl connection")
1266- err = URLGrabError(14, msg)
1267- err.url = self.url
1268- raise err
1269- elif errcode == 37:
1270- msg = _("Could not open/read %s") % (self.url)
1271- err = URLGrabError(14, msg)
1272- err.url = self.url
1273- raise err
1274-
1275- elif errcode == 42:
1276- err = URLGrabError(15, _('User (or something) called abort %s: %s') % (self.url, e))
1277- err.url = self.url
1278- # this is probably wrong but ultimately this is what happens
1279- # we have a legit http code and a pycurl 'writer failed' code
1280- # which almost always means something aborted it from outside
1281- # since we cannot know what it is -I'm banking on it being
1282- # a ctrl-c. XXXX - if there's a way of going back two raises to
1283- # figure out what aborted the pycurl process FIXME
1284- raise KeyboardInterrupt
1285-
1286- elif errcode == 58:
1287- msg = _("problem with the local client certificate")
1288- err = URLGrabError(14, msg)
1289- err.url = self.url
1290- raise err
1291-
1292- elif errcode == 60:
1293- msg = _("client cert cannot be verified or client cert incorrect")
1294- err = URLGrabError(14, msg)
1295- err.url = self.url
1296- raise err
1297-
1298- elif errcode == 63:
1299- if self._error[1]:
1300- msg = self._error[1]
1301- else:
1302- msg = _("Max download size exceeded on %s") % (self.url)
1303- err = URLGrabError(14, msg)
1304- err.url = self.url
1305- raise err
1306-
1307- elif str(e.args[1]) == '' and self.http_code != 0: # fake it until you make it
1308- msg = 'HTTP Error %s : %s ' % (self.http_code, self.url)
1309- else:
1310- msg = 'PYCURL ERROR %s - "%s"' % (errcode, str(e.args[1]))
1311- code = errcode
1312- err = URLGrabError(14, msg)
1313- err.code = code
1314- err.exception = e
1315- raise err
1316-
1317- def _do_open(self):
1318- self.curl_obj = _curl_cache
1319- self.curl_obj.reset() # reset all old settings away, just in case
1320- # setup any ranges
1321- self._set_opts()
1322- self._do_grab()
1323- return self.fo
1324-
1325- def _add_headers(self):
1326- pass
1327-
1328- def _build_range(self):
1329- reget_length = 0
1330- rt = None
1331- if self.opts.reget and type(self.filename) in types.StringTypes:
1332- # we have reget turned on and we're dumping to a file
1333- try:
1334- s = os.stat(self.filename)
1335- except OSError:
1336- pass
1337- else:
1338- self.reget_time = s[stat.ST_MTIME]
1339- reget_length = s[stat.ST_SIZE]
1340-
1341- # Set initial length when regetting
1342- self._amount_read = reget_length
1343- self._reget_length = reget_length # set where we started from, too
1344-
1345- rt = reget_length, ''
1346- self.append = 1
1347-
1348- if self.opts.range:
1349- rt = self.opts.range
1350- if rt[0]: rt = (rt[0] + reget_length, rt[1])
1351-
1352- if rt:
1353- header = range_tuple_to_header(rt)
1354- if header:
1355- return header.split('=')[1]
1356-
1357-
1358-
1359- def _make_request(self, req, opener):
1360- #XXXX
1361- # This doesn't do anything really, but we could use this
1362- # instead of do_open() to catch a lot of crap errors as
1363- # mstenner did before here
1364- return (self.fo, self.hdr)
1365-
1366- try:
1367- if self.opts.timeout:
1368- old_to = socket.getdefaulttimeout()
1369- socket.setdefaulttimeout(self.opts.timeout)
1370- try:
1371- fo = opener.open(req)
1372- finally:
1373- socket.setdefaulttimeout(old_to)
1374- else:
1375- fo = opener.open(req)
1376- hdr = fo.info()
1377- except ValueError, e:
1378- err = URLGrabError(1, _('Bad URL: %s : %s') % (self.url, e, ))
1379- err.url = self.url
1380- raise err
1381-
1382- except RangeError, e:
1383- err = URLGrabError(9, _('%s on %s') % (e, self.url))
1384- err.url = self.url
1385- raise err
1386- except urllib2.HTTPError, e:
1387- new_e = URLGrabError(14, _('%s on %s') % (e, self.url))
1388- new_e.code = e.code
1389- new_e.exception = e
1390- new_e.url = self.url
1391- raise new_e
1392- except IOError, e:
1393- if hasattr(e, 'reason') and isinstance(e.reason, socket.timeout):
1394- err = URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
1395- err.url = self.url
1396- raise err
1397- else:
1398- err = URLGrabError(4, _('IOError on %s: %s') % (self.url, e))
1399- err.url = self.url
1400- raise err
1401-
1402- except OSError, e:
1403- err = URLGrabError(5, _('%s on %s') % (e, self.url))
1404- err.url = self.url
1405- raise err
1406-
1407- except HTTPException, e:
1408- err = URLGrabError(7, _('HTTP Exception (%s) on %s: %s') % \
1409- (e.__class__.__name__, self.url, e))
1410- err.url = self.url
1411- raise err
1412-
1413- else:
1414- return (fo, hdr)
1415-
1416- def _do_grab(self):
1417- """dump the file to a filename or StringIO buffer"""
1418-
1419- if self._complete:
1420- return
1421- _was_filename = False
1422- if type(self.filename) in types.StringTypes and self.filename:
1423- _was_filename = True
1424- self._prog_reportname = str(self.filename)
1425- self._prog_basename = os.path.basename(self.filename)
1426-
1427- if self.append: mode = 'ab'
1428- else: mode = 'wb'
1429-
1430- if DEBUG: DEBUG.info('opening local file "%s" with mode %s' % \
1431- (self.filename, mode))
1432- try:
1433- self.fo = open(self.filename, mode)
1434- except IOError, e:
1435- err = URLGrabError(16, _(\
1436- 'error opening local file from %s, IOError: %s') % (self.url, e))
1437- err.url = self.url
1438- raise err
1439-
1440- else:
1441- self._prog_reportname = 'MEMORY'
1442- self._prog_basename = 'MEMORY'
1443-
1444-
1445- self.fo = StringIO()
1446- # if this is to be a tempfile instead....
1447- # it just makes crap in the tempdir
1448- #fh, self._temp_name = mkstemp()
1449- #self.fo = open(self._temp_name, 'wb')
1450-
1451-
1452- self._do_perform()
1453-
1454-
1455-
1456- if _was_filename:
1457- # close it up
1458- self.fo.flush()
1459- self.fo.close()
1460- # set the time
1461- mod_time = self.curl_obj.getinfo(pycurl.INFO_FILETIME)
1462- if mod_time != -1:
1463- os.utime(self.filename, (mod_time, mod_time))
1464- # re open it
1465- self.fo = open(self.filename, 'r')
1466- else:
1467- #self.fo = open(self._temp_name, 'r')
1468- self.fo.seek(0)
1469-
1470- self._complete = True
1471-
1472- def _fill_buffer(self, amt=None):
1473- """fill the buffer to contain at least 'amt' bytes by reading
1474- from the underlying file object. If amt is None, then it will
1475- read until it gets nothing more. It updates the progress meter
1476- and throttles after every self._rbufsize bytes."""
1477- # the _rbuf test is only in this first 'if' for speed. It's not
1478- # logically necessary
1479- if self._rbuf and not amt is None:
1480- L = len(self._rbuf)
1481- if amt > L:
1482- amt = amt - L
1483- else:
1484- return
1485-
1486- # if we've made it here, then we don't have enough in the buffer
1487- # and we need to read more.
1488-
1489- if not self._complete: self._do_grab() #XXX cheater - change on ranges
1490-
1491- buf = [self._rbuf]
1492- bufsize = len(self._rbuf)
1493- while amt is None or amt:
1494- # first, delay if necessary for throttling reasons
1495- if self.opts.raw_throttle():
1496- diff = self._tsize/self.opts.raw_throttle() - \
1497- (time.time() - self._ttime)
1498- if diff > 0: time.sleep(diff)
1499- self._ttime = time.time()
1500-
1501- # now read some data, up to self._rbufsize
1502- if amt is None: readamount = self._rbufsize
1503- else: readamount = min(amt, self._rbufsize)
1504- try:
1505- new = self.fo.read(readamount)
1506- except socket.error, e:
1507- err = URLGrabError(4, _('Socket Error on %s: %s') % (self.url, e))
1508- err.url = self.url
1509- raise err
1510-
1511- except socket.timeout, e:
1512- raise URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
1513- err.url = self.url
1514- raise err
1515-
1516- except IOError, e:
1517- raise URLGrabError(4, _('IOError on %s: %s') %(self.url, e))
1518- err.url = self.url
1519- raise err
1520-
1521- newsize = len(new)
1522- if not newsize: break # no more to read
1523-
1524- if amt: amt = amt - newsize
1525- buf.append(new)
1526- bufsize = bufsize + newsize
1527- self._tsize = newsize
1528- self._amount_read = self._amount_read + newsize
1529- #if self.opts.progress_obj:
1530- # self.opts.progress_obj.update(self._amount_read)
1531-
1532- self._rbuf = string.join(buf, '')
1533- return
1534-
1535- def _progress_update(self, download_total, downloaded, upload_total, uploaded):
1536- if self._over_max_size(cur=self._amount_read-self._reget_length):
1537- return -1
1538-
1539- try:
1540- if self._prog_running:
1541- downloaded += self._reget_length
1542- self.opts.progress_obj.update(downloaded)
1543- except KeyboardInterrupt:
1544- return -1
1545-
1546- def _over_max_size(self, cur, max_size=None):
1547-
1548- if not max_size:
1549- max_size = self.size
1550- if self.opts.size: # if we set an opts size use that, no matter what
1551- max_size = self.opts.size
1552- if not max_size: return False # if we have None for all of the Max then this is dumb
1553- if cur > max_size + max_size*.10:
1554-
1555- msg = _("Downloaded more than max size for %s: %s > %s") \
1556- % (self.url, cur, max_size)
1557- self._error = (pycurl.E_FILESIZE_EXCEEDED, msg)
1558- return True
1559- return False
1560-
1561- def _to_utf8(self, obj, errors='replace'):
1562- '''convert 'unicode' to an encoded utf-8 byte string '''
1563- # stolen from yum.i18n
1564- if isinstance(obj, unicode):
1565- obj = obj.encode('utf-8', errors)
1566- return obj
1567-
1568- def read(self, amt=None):
1569- self._fill_buffer(amt)
1570- if amt is None:
1571- s, self._rbuf = self._rbuf, ''
1572- else:
1573- s, self._rbuf = self._rbuf[:amt], self._rbuf[amt:]
1574- return s
1575-
1576- def readline(self, limit=-1):
1577- if not self._complete: self._do_grab()
1578- return self.fo.readline()
1579-
1580- i = string.find(self._rbuf, '\n')
1581- while i < 0 and not (0 < limit <= len(self._rbuf)):
1582- L = len(self._rbuf)
1583- self._fill_buffer(L + self._rbufsize)
1584- if not len(self._rbuf) > L: break
1585- i = string.find(self._rbuf, '\n', L)
1586-
1587- if i < 0: i = len(self._rbuf)
1588- else: i = i+1
1589- if 0 <= limit < len(self._rbuf): i = limit
1590-
1591- s, self._rbuf = self._rbuf[:i], self._rbuf[i:]
1592- return s
1593-
1594- def close(self):
1595- if self._prog_running:
1596- self.opts.progress_obj.end(self._amount_read)
1597- self.fo.close()
1598-
1599-
1600-_curl_cache = pycurl.Curl() # make one and reuse it over and over and over
1601-
1602-
1603-#####################################################################
1604-# DEPRECATED FUNCTIONS
1605-def set_throttle(new_throttle):
1606- """Deprecated. Use: default_grabber.throttle = new_throttle"""
1607- default_grabber.throttle = new_throttle
1608-
1609-def set_bandwidth(new_bandwidth):
1610- """Deprecated. Use: default_grabber.bandwidth = new_bandwidth"""
1611- default_grabber.bandwidth = new_bandwidth
1612-
1613-def set_progress_obj(new_progress_obj):
1614- """Deprecated. Use: default_grabber.progress_obj = new_progress_obj"""
1615- default_grabber.progress_obj = new_progress_obj
1616-
1617-def set_user_agent(new_user_agent):
1618- """Deprecated. Use: default_grabber.user_agent = new_user_agent"""
1619- default_grabber.user_agent = new_user_agent
1620-
1621-def retrygrab(url, filename=None, copy_local=0, close_connection=0,
1622- progress_obj=None, throttle=None, bandwidth=None,
1623- numtries=3, retrycodes=[-1,2,4,5,6,7], checkfunc=None):
1624- """Deprecated. Use: urlgrab() with the retry arg instead"""
1625- kwargs = {'copy_local' : copy_local,
1626- 'close_connection' : close_connection,
1627- 'progress_obj' : progress_obj,
1628- 'throttle' : throttle,
1629- 'bandwidth' : bandwidth,
1630- 'retry' : numtries,
1631- 'retrycodes' : retrycodes,
1632- 'checkfunc' : checkfunc
1633- }
1634- return urlgrab(url, filename, **kwargs)
1635-
1636-
1637-#####################################################################
1638-# TESTING
1639-def _main_test():
1640- try: url, filename = sys.argv[1:3]
1641- except ValueError:
1642- print 'usage:', sys.argv[0], \
1643- '<url> <filename> [copy_local=0|1] [close_connection=0|1]'
1644- sys.exit()
1645-
1646- kwargs = {}
1647- for a in sys.argv[3:]:
1648- k, v = string.split(a, '=', 1)
1649- kwargs[k] = int(v)
1650-
1651- set_throttle(1.0)
1652- set_bandwidth(32 * 1024)
1653- print "throttle: %s, throttle bandwidth: %s B/s" % (default_grabber.throttle,
1654- default_grabber.bandwidth)
1655-
1656- try: from progress import text_progress_meter
1657- except ImportError, e: pass
1658- else: kwargs['progress_obj'] = text_progress_meter()
1659-
1660- try: name = apply(urlgrab, (url, filename), kwargs)
1661- except URLGrabError, e: print e
1662- else: print 'LOCAL FILE:', name
1663-
1664-
1665-def _retry_test():
1666- try: url, filename = sys.argv[1:3]
1667- except ValueError:
1668- print 'usage:', sys.argv[0], \
1669- '<url> <filename> [copy_local=0|1] [close_connection=0|1]'
1670- sys.exit()
1671-
1672- kwargs = {}
1673- for a in sys.argv[3:]:
1674- k, v = string.split(a, '=', 1)
1675- kwargs[k] = int(v)
1676-
1677- try: from progress import text_progress_meter
1678- except ImportError, e: pass
1679- else: kwargs['progress_obj'] = text_progress_meter()
1680-
1681- def cfunc(filename, hello, there='foo'):
1682- print hello, there
1683- import random
1684- rnum = random.random()
1685- if rnum < .5:
1686- print 'forcing retry'
1687- raise URLGrabError(-1, 'forcing retry')
1688- if rnum < .75:
1689- print 'forcing failure'
1690- raise URLGrabError(-2, 'forcing immediate failure')
1691- print 'success'
1692- return
1693-
1694- kwargs['checkfunc'] = (cfunc, ('hello',), {'there':'there'})
1695- try: name = apply(retrygrab, (url, filename), kwargs)
1696- except URLGrabError, e: print e
1697- else: print 'LOCAL FILE:', name
1698-
1699-def _file_object_test(filename=None):
1700- import cStringIO
1701- if filename is None:
1702- filename = __file__
1703- print 'using file "%s" for comparisons' % filename
1704- fo = open(filename)
1705- s_input = fo.read()
1706- fo.close()
1707-
1708- for testfunc in [_test_file_object_smallread,
1709- _test_file_object_readall,
1710- _test_file_object_readline,
1711- _test_file_object_readlines]:
1712- fo_input = cStringIO.StringIO(s_input)
1713- fo_output = cStringIO.StringIO()
1714- wrapper = PyCurlFileObject(fo_input, None, 0)
1715- print 'testing %-30s ' % testfunc.__name__,
1716- testfunc(wrapper, fo_output)
1717- s_output = fo_output.getvalue()
1718- if s_output == s_input: print 'passed'
1719- else: print 'FAILED'
1720-
1721-def _test_file_object_smallread(wrapper, fo_output):
1722- while 1:
1723- s = wrapper.read(23)
1724- fo_output.write(s)
1725- if not s: return
1726-
1727-def _test_file_object_readall(wrapper, fo_output):
1728- s = wrapper.read()
1729- fo_output.write(s)
1730-
1731-def _test_file_object_readline(wrapper, fo_output):
1732- while 1:
1733- s = wrapper.readline()
1734- fo_output.write(s)
1735- if not s: return
1736-
1737-def _test_file_object_readlines(wrapper, fo_output):
1738- li = wrapper.readlines()
1739- fo_output.write(string.join(li, ''))
1740-
1741-if __name__ == '__main__':
1742- _main_test()
1743- _retry_test()
1744- _file_object_test('test')
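
The `guess_should_quote` heuristic in the removed `grabber.py` copy above (space found: quote; malformed `%` escape: quote; only well-formed `%XX` escapes: do not quote; no `%` at all: quote) can be sketched in Python 3 as follows. This is a minimal standalone sketch, not the upstream code: the function takes the path directly instead of the six-element `parts` tuple, returns booleans instead of `1`/`0`, and the module-level name `HEXVALS` is an assumption made for illustration.

```python
HEXVALS = "0123456789ABCDEF"


def guess_should_quote(path):
    """Return True if `path` looks unquoted, following the removed heuristic."""
    if " " in path:
        return True  # a literal space means the path was never quoted
    ind = path.find("%")
    if ind == -1:
        return True  # no escapes at all: the "else -> 1" case in the docstring
    while ind > -1:
        code = path[ind + 1:ind + 3].upper()
        # a '%' must be followed by exactly two hex digits to count as an escape
        if len(code) < 2 or code[0] not in HEXVALS or code[1] not in HEXVALS:
            return True
        ind = path.find("%", ind + 1)
    return False  # every '%' starts a well-formed escape: assume already quoted
```

Note that a path with no `%` and no space, such as `/plain`, is still reported as needing quoting; this matches the original, where quoting such a path would be expected to leave it unchanged.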
1745
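
The exit conditions listed in the comments of the removed `URLGrabber._retry` (return on success, re-raise when retries are exhausted or disabled, re-raise when the error code is not in `retrycodes`) can be sketched in Python 3 as below. This is a simplified illustration under stated assumptions: `GrabError`, `retry`, and the `retry_limit` parameter are hypothetical stand-ins for `URLGrabError`, `_retry`, and `opts.retry`; the `KeyboardInterrupt` / `interrupt_callback` branch and the `CallbackObject` wrapper are left out; and the bare `raise` sits inside the `except` block, because Python 3 does not allow re-raising after the handler has ended the way the Python 2 original does.

```python
class GrabError(Exception):
    """Hypothetical stand-in for URLGrabError: carries a numeric errno."""

    def __init__(self, errno, msg):
        super().__init__(errno, msg)
        self.errno = errno


def retry(func, retry_limit=None, retrycodes=(-1, 2, 4, 5, 6, 7),
          failure_callback=None):
    """Call func() until it succeeds or one of the exit conditions is hit."""
    tries = 0
    while True:
        tries += 1
        try:
            return func()  # exit 1: success
        except GrabError as exc:
            if failure_callback is not None:
                # a callback that raises also ends the loop
                failure_callback(exc, tries)
            if retry_limit is None or tries == retry_limit:
                raise  # not retrying, or retries exhausted
            if exc.errno not in retrycodes:
                raise  # this error code is not considered retryable
```

As in the original, the callback runs before the retry checks, so it fires on every failed attempt, including the final one.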
1746=== removed directory '.pc/progress_fix.diff'
1747=== removed directory '.pc/progress_fix.diff/urlgrabber'
1748=== removed file '.pc/progress_fix.diff/urlgrabber/progress.py'
1749--- .pc/progress_fix.diff/urlgrabber/progress.py 2010-07-08 17:40:08 +0000
1750+++ .pc/progress_fix.diff/urlgrabber/progress.py 1970-01-01 00:00:00 +0000
1751@@ -1,755 +0,0 @@
1752-# This library is free software; you can redistribute it and/or
1753-# modify it under the terms of the GNU Lesser General Public
1754-# License as published by the Free Software Foundation; either
1755-# version 2.1 of the License, or (at your option) any later version.
1756-#
1757-# This library is distributed in the hope that it will be useful,
1758-# but WITHOUT ANY WARRANTY; without even the implied warranty of
1759-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
1760-# Lesser General Public License for more details.
1761-#
1762-# You should have received a copy of the GNU Lesser General Public
1763-# License along with this library; if not, write to the
1764-# Free Software Foundation, Inc.,
1765-# 59 Temple Place, Suite 330,
1766-# Boston, MA 02111-1307 USA
1767-
1768-# This file is part of urlgrabber, a high-level cross-protocol url-grabber
1769-# Copyright 2002-2004 Michael D. Stenner, Ryan Tomayko
1770-
1771-
1772-import sys
1773-import time
1774-import math
1775-import thread
1776-import fcntl
1777-import struct
1778-import termios
1779-
1780-# Code from http://mail.python.org/pipermail/python-list/2000-May/033365.html
1781-def terminal_width(fd=1):
1782- """ Get the real terminal width """
1783- try:
1784- buf = 'abcdefgh'
1785- buf = fcntl.ioctl(fd, termios.TIOCGWINSZ, buf)
1786- ret = struct.unpack('hhhh', buf)[1]
1787- if ret == 0:
1788- return 80
1789- # Add minimum too?
1790- return ret
1791- except: # IOError
1792- return 80
1793-
1794-_term_width_val = None
1795-_term_width_last = None
1796-def terminal_width_cached(fd=1, cache_timeout=1.000):
1797- """ Get the real terminal width, but cache it for a bit. """
1798- global _term_width_val
1799- global _term_width_last
1800-
1801- now = time.time()
1802- if _term_width_val is None or (now - _term_width_last) > cache_timeout:
1803- _term_width_val = terminal_width(fd)
1804- _term_width_last = now
1805- return _term_width_val
1806-
1807-class TerminalLine:
1808- """ Help create dynamic progress bars, uses terminal_width_cached(). """
1809-
1810- def __init__(self, min_rest=0, beg_len=None, fd=1, cache_timeout=1.000):
1811- if beg_len is None:
1812- beg_len = min_rest
1813- self._min_len = min_rest
1814- self._llen = terminal_width_cached(fd, cache_timeout)
1815- if self._llen < beg_len:
1816- self._llen = beg_len
1817- self._fin = False
1818-
1819- def __len__(self):
1820- """ Usable length for elements. """
1821- return self._llen - self._min_len
1822-
1823- def rest_split(self, fixed, elements=2):
1824- """ After a fixed length, split the rest of the line length among
1825- a number of different elements (default=2). """
1826- if self._llen < fixed:
1827- return 0
1828- return (self._llen - fixed) / elements
1829-
1830- def add(self, element, full_len=None):
1831- """ If there is room left in the line, above min_len, add element.
1832- Note that as soon as one add fails all the rest will fail too. """
1833-
1834- if full_len is None:
1835- full_len = len(element)
1836- if len(self) < full_len:
1837- self._fin = True
1838- if self._fin:
1839- return ''
1840-
1841- self._llen -= len(element)
1842- return element
1843-
1844- def rest(self):
1845- """ Current rest of line, same as .rest_split(fixed=0, elements=1). """
1846- return self._llen
1847-
1848-class BaseMeter:
1849- def __init__(self):
1850- self.update_period = 0.3 # seconds
1851-
1852- self.filename = None
1853- self.url = None
1854- self.basename = None
1855- self.text = None
1856- self.size = None
1857- self.start_time = None
1858- self.last_amount_read = 0
1859- self.last_update_time = None
1860- self.re = RateEstimator()
1861-
1862- def start(self, filename=None, url=None, basename=None,
1863- size=None, now=None, text=None):
1864- self.filename = filename
1865- self.url = url
1866- self.basename = basename
1867- self.text = text
1868-
1869- #size = None ######### TESTING
1870- self.size = size
1871- if not size is None: self.fsize = format_number(size) + 'B'
1872-
1873- if now is None: now = time.time()
1874- self.start_time = now
1875- self.re.start(size, now)
1876- self.last_amount_read = 0
1877- self.last_update_time = now
1878- self._do_start(now)
1879-
1880- def _do_start(self, now=None):
1881- pass
1882-
1883- def update(self, amount_read, now=None):
1884- # for a real gui, you probably want to override and put a call
1885- # to your mainloop iteration function here
1886- if now is None: now = time.time()
1887- if (now >= self.last_update_time + self.update_period) or \
1888- not self.last_update_time:
1889- self.re.update(amount_read, now)
1890- self.last_amount_read = amount_read
1891- self.last_update_time = now
1892- self._do_update(amount_read, now)
1893-
1894- def _do_update(self, amount_read, now=None):
1895- pass
1896-
1897- def end(self, amount_read, now=None):
1898- if now is None: now = time.time()
1899- self.re.update(amount_read, now)
1900- self.last_amount_read = amount_read
1901- self.last_update_time = now
1902- self._do_end(amount_read, now)
1903-
1904- def _do_end(self, amount_read, now=None):
1905- pass
1906-
1907-# This is kind of a hack, but progress is gotten from grabber which doesn't
1908-# know about the total size to download. So we do this so we can get the data
1909-# out of band here. This will be "fixed" one way or another soon.
1910-_text_meter_total_size = 0
1911-_text_meter_sofar_size = 0
1912-def text_meter_total_size(size, downloaded=0):
1913- global _text_meter_total_size
1914- global _text_meter_sofar_size
1915- _text_meter_total_size = size
1916- _text_meter_sofar_size = downloaded
1917-
1918-#
1919-# update: No size (minimal: 17 chars)
1920-# -----------------------------------
1921-# <text> <rate> | <current size> <elapsed time>
1922-# 8-48 1 8 3 6 1 9 5
1923-#
1924-# Order: 1. <text>+<current size> (17)
1925-# 2. +<elapsed time> (10, total: 27)
1926-# 3. + ( 5, total: 32)
1927-# 4. +<rate> ( 9, total: 41)
1928-#
1929-# update: Size, Single file
1930-# -------------------------
1931-# <text> <pc> <bar> <rate> | <current size> <eta time> ETA
1932-# 8-25 1 3-4 1 6-16 1 8 3 6 1 9 1 3 1
1933-#
1934-# Order: 1. <text>+<current size> (17)
1935-# 2. +<eta time> (10, total: 27)
1936-# 3. +ETA ( 5, total: 32)
1937-# 4. +<pc> ( 4, total: 36)
1938-# 5. +<rate> ( 9, total: 45)
1939-# 6. +<bar> ( 7, total: 52)
1940-#
1941-# update: Size, All files
1942-# -----------------------
1943-# <text> <total pc> <pc> <bar> <rate> | <current size> <eta time> ETA
1944-# 8-22 1 5-7 1 3-4 1 6-12 1 8 3 6 1 9 1 3 1
1945-#
1946-# Order: 1. <text>+<current size> (17)
1947-# 2. +<eta time> (10, total: 27)
1948-# 3. +ETA ( 5, total: 32)
1949-# 4. +<total pc> ( 5, total: 37)
1950-# 4. +<pc> ( 4, total: 41)
1951-# 5. +<rate> ( 9, total: 50)
1952-# 6. +<bar> ( 7, total: 57)
1953-#
1954-# end
1955-# ---
1956-# <text> | <current size> <elapsed time>
1957-# 8-56 3 6 1 9 5
1958-#
1959-# Order: 1. <text> ( 8)
1960-# 2. +<current size> ( 9, total: 17)
1961-# 3. +<elapsed time> (10, total: 27)
1962-# 4. + ( 5, total: 32)
1963-#
1964-
1965-class TextMeter(BaseMeter):
1966- def __init__(self, fo=sys.stderr):
1967- BaseMeter.__init__(self)
1968- self.fo = fo
1969-
1970- def _do_update(self, amount_read, now=None):
1971- etime = self.re.elapsed_time()
1972- fetime = format_time(etime)
1973- fread = format_number(amount_read)
1974- #self.size = None
1975- if self.text is not None:
1976- text = self.text
1977- else:
1978- text = self.basename
1979-
1980- ave_dl = format_number(self.re.average_rate())
1981- sofar_size = None
1982- if _text_meter_total_size:
1983- sofar_size = _text_meter_sofar_size + amount_read
1984- sofar_pc = (sofar_size * 100) / _text_meter_total_size
1985-
1986- # Include text + ui_rate in minimal
1987- tl = TerminalLine(8, 8+1+8)
1988- ui_size = tl.add(' | %5sB' % fread)
1989- if self.size is None:
1990- ui_time = tl.add(' %9s' % fetime)
1991- ui_end = tl.add(' ' * 5)
1992- ui_rate = tl.add(' %5sB/s' % ave_dl)
1993- out = '%-*.*s%s%s%s%s\r' % (tl.rest(), tl.rest(), text,
1994- ui_rate, ui_size, ui_time, ui_end)
1995- else:
1996- rtime = self.re.remaining_time()
1997- frtime = format_time(rtime)
1998- frac = self.re.fraction_read()
1999-
2000- ui_time = tl.add(' %9s' % frtime)
2001- ui_end = tl.add(' ETA ')
2002-
2003- if sofar_size is None:
2004- ui_sofar_pc = ''
2005- else:
2006- ui_sofar_pc = tl.add(' (%i%%)' % sofar_pc,
2007- full_len=len(" (100%)"))
2008-
2009- ui_pc = tl.add(' %2i%%' % (frac*100))
2010- ui_rate = tl.add(' %5sB/s' % ave_dl)
2011- # Make text grow a bit before we start growing the bar too
2012- blen = 4 + tl.rest_split(8 + 8 + 4)
2013- bar = '='*int(blen * frac)
2014- if (blen * frac) - int(blen * frac) >= 0.5:
2015- bar += '-'
2016- ui_bar = tl.add(' [%-*.*s]' % (blen, blen, bar))
2017- out = '%-*.*s%s%s%s%s%s%s%s\r' % (tl.rest(), tl.rest(), text,
2018- ui_sofar_pc, ui_pc, ui_bar,
2019- ui_rate, ui_size, ui_time, ui_end)
2020-
2021- self.fo.write(out)
2022- self.fo.flush()
2023-
2024- def _do_end(self, amount_read, now=None):
2025- global _text_meter_total_size
2026- global _text_meter_sofar_size
2027-
2028- total_time = format_time(self.re.elapsed_time())
2029- total_size = format_number(amount_read)
2030- if self.text is not None:
2031- text = self.text
2032- else:
2033- text = self.basename
2034-
2035- tl = TerminalLine(8)
2036- ui_size = tl.add(' | %5sB' % total_size)
2037- ui_time = tl.add(' %9s' % total_time)
2038- not_done = self.size is not None and amount_read != self.size
2039- if not_done:
2040- ui_end = tl.add(' ... ')
2041- else:
2042- ui_end = tl.add(' ' * 5)
2043-
2044- out = '\r%-*.*s%s%s%s\n' % (tl.rest(), tl.rest(), text,
2045- ui_size, ui_time, ui_end)
2046- self.fo.write(out)
2047- self.fo.flush()
2048-
2049- # Don't add size to the sofar size until we have all of it.
2050- # If we don't have a size, then just pretend/hope we got all of it.
2051- if not_done:
2052- return
2053-
2054- if _text_meter_total_size:
2055- _text_meter_sofar_size += amount_read
2056- if _text_meter_total_size <= _text_meter_sofar_size:
2057- _text_meter_total_size = 0
2058- _text_meter_sofar_size = 0
2059-
2060-text_progress_meter = TextMeter
2061-
2062-class MultiFileHelper(BaseMeter):
2063- def __init__(self, master):
2064- BaseMeter.__init__(self)
2065- self.master = master
2066-
2067- def _do_start(self, now):
2068- self.master.start_meter(self, now)
2069-
2070- def _do_update(self, amount_read, now):
2071- # elapsed time since last update
2072- self.master.update_meter(self, now)
2073-
2074- def _do_end(self, amount_read, now):
2075- self.ftotal_time = format_time(now - self.start_time)
2076- self.ftotal_size = format_number(self.last_amount_read)
2077- self.master.end_meter(self, now)
2078-
2079- def failure(self, message, now=None):
2080- self.master.failure_meter(self, message, now)
2081-
2082- def message(self, message):
2083- self.master.message_meter(self, message)
2084-
2085-class MultiFileMeter:
2086- helperclass = MultiFileHelper
2087- def __init__(self):
2088- self.meters = []
2089- self.in_progress_meters = []
2090- self._lock = thread.allocate_lock()
2091- self.update_period = 0.3 # seconds
2092-
2093- self.numfiles = None
2094- self.finished_files = 0
2095- self.failed_files = 0
2096- self.open_files = 0
2097- self.total_size = None
2098- self.failed_size = 0
2099- self.start_time = None
2100- self.finished_file_size = 0
2101- self.last_update_time = None
2102- self.re = RateEstimator()
2103-
2104- def start(self, numfiles=None, total_size=None, now=None):
2105- if now is None: now = time.time()
2106- self.numfiles = numfiles
2107- self.finished_files = 0
2108- self.failed_files = 0
2109- self.open_files = 0
2110- self.total_size = total_size
2111- self.failed_size = 0
2112- self.start_time = now
2113- self.finished_file_size = 0
2114- self.last_update_time = now
2115- self.re.start(total_size, now)
2116- self._do_start(now)
2117-
2118- def _do_start(self, now):
2119- pass
2120-
2121- def end(self, now=None):
2122- if now is None: now = time.time()
2123- self._do_end(now)
2124-
2125- def _do_end(self, now):
2126- pass
2127-
2128- def lock(self): self._lock.acquire()
2129- def unlock(self): self._lock.release()
2130-
2131- ###########################################################
2132- # child meter creation and destruction
2133- def newMeter(self):
2134- newmeter = self.helperclass(self)
2135- self.meters.append(newmeter)
2136- return newmeter
2137-
2138- def removeMeter(self, meter):
2139- self.meters.remove(meter)
2140-
2141- ###########################################################
2142- # child functions - these should only be called by helpers
2143- def start_meter(self, meter, now):
2144- if not meter in self.meters:
2145- raise ValueError('attempt to use orphaned meter')
2146- self._lock.acquire()
2147- try:
2148- if not meter in self.in_progress_meters:
2149- self.in_progress_meters.append(meter)
2150- self.open_files += 1
2151- finally:
2152- self._lock.release()
2153- self._do_start_meter(meter, now)
2154-
2155- def _do_start_meter(self, meter, now):
2156- pass
2157-
2158- def update_meter(self, meter, now):
2159- if not meter in self.meters:
2160- raise ValueError('attempt to use orphaned meter')
2161- if (now >= self.last_update_time + self.update_period) or \
2162- not self.last_update_time:
2163- self.re.update(self._amount_read(), now)
2164- self.last_update_time = now
2165- self._do_update_meter(meter, now)
2166-
2167- def _do_update_meter(self, meter, now):
2168- pass
2169-
2170- def end_meter(self, meter, now):
2171- if not meter in self.meters:
2172- raise ValueError('attempt to use orphaned meter')
2173- self._lock.acquire()
2174- try:
2175- try: self.in_progress_meters.remove(meter)
2176- except ValueError: pass
2177- self.open_files -= 1
2178- self.finished_files += 1
2179- self.finished_file_size += meter.last_amount_read
2180- finally:
2181- self._lock.release()
2182- self._do_end_meter(meter, now)
2183-
2184- def _do_end_meter(self, meter, now):
2185- pass
2186-
2187- def failure_meter(self, meter, message, now):
2188- if not meter in self.meters:
2189- raise ValueError('attempt to use orphaned meter')
2190- self._lock.acquire()
2191- try:
2192- try: self.in_progress_meters.remove(meter)
2193- except ValueError: pass
2194- self.open_files -= 1
2195- self.failed_files += 1
2196- if meter.size and self.failed_size is not None:
2197- self.failed_size += meter.size
2198- else:
2199- self.failed_size = None
2200- finally:
2201- self._lock.release()
2202- self._do_failure_meter(meter, message, now)
2203-
2204- def _do_failure_meter(self, meter, message, now):
2205- pass
2206-
2207- def message_meter(self, meter, message):
2208- pass
2209-
2210- ########################################################
2211- # internal functions
2212- def _amount_read(self):
2213- tot = self.finished_file_size
2214- for m in self.in_progress_meters:
2215- tot += m.last_amount_read
2216- return tot
2217-
2218-
2219-class TextMultiFileMeter(MultiFileMeter):
2220- def __init__(self, fo=sys.stderr):
2221- self.fo = fo
2222- MultiFileMeter.__init__(self)
2223-
2224- # files: ###/### ###% data: ######/###### ###% time: ##:##:##/##:##:##
2225- def _do_update_meter(self, meter, now):
2226- self._lock.acquire()
2227- try:
2228- format = "files: %3i/%-3i %3i%% data: %6.6s/%-6.6s %3i%% " \
2229- "time: %8.8s/%8.8s"
2230- df = self.finished_files
2231- tf = self.numfiles or 1
2232- pf = 100 * float(df)/tf + 0.49
2233- dd = self.re.last_amount_read
2234- td = self.total_size
2235- pd = 100 * (self.re.fraction_read() or 0) + 0.49
2236- dt = self.re.elapsed_time()
2237- rt = self.re.remaining_time()
2238- if rt is None: tt = None
2239- else: tt = dt + rt
2240-
2241- fdd = format_number(dd) + 'B'
2242- ftd = format_number(td) + 'B'
2243- fdt = format_time(dt, 1)
2244- ftt = format_time(tt, 1)
2245-
2246- out = '%-79.79s' % (format % (df, tf, pf, fdd, ftd, pd, fdt, ftt))
2247- self.fo.write('\r' + out)
2248- self.fo.flush()
2249- finally:
2250- self._lock.release()
2251-
2252- def _do_end_meter(self, meter, now):
2253- self._lock.acquire()
2254- try:
2255- format = "%-30.30s %6.6s %8.8s %9.9s"
2256- fn = meter.basename
2257- size = meter.last_amount_read
2258- fsize = format_number(size) + 'B'
2259- et = meter.re.elapsed_time()
2260- fet = format_time(et, 1)
2261- frate = format_number(size / et) + 'B/s'
2262-
2263- out = '%-79.79s' % (format % (fn, fsize, fet, frate))
2264- self.fo.write('\r' + out + '\n')
2265- finally:
2266- self._lock.release()
2267- self._do_update_meter(meter, now)
2268-
2269- def _do_failure_meter(self, meter, message, now):
2270- self._lock.acquire()
2271- try:
2272- format = "%-30.30s %6.6s %s"
2273- fn = meter.basename
2274- if type(message) in (type(''), type(u'')):
2275- message = message.splitlines()
2276- if not message: message = ['']
2277- out = '%-79s' % (format % (fn, 'FAILED', message[0] or ''))
2278- self.fo.write('\r' + out + '\n')
2279- for m in message[1:]: self.fo.write(' ' + m + '\n')
2280- self._lock.release()
2281- finally:
2282- self._do_update_meter(meter, now)
2283-
2284- def message_meter(self, meter, message):
2285- self._lock.acquire()
2286- try:
2287- pass
2288- finally:
2289- self._lock.release()
2290-
2291- def _do_end(self, now):
2292- self._do_update_meter(None, now)
2293- self._lock.acquire()
2294- try:
2295- self.fo.write('\n')
2296- self.fo.flush()
2297- finally:
2298- self._lock.release()
2299-
2300-######################################################################
2301-# support classes and functions
2302-
2303-class RateEstimator:
2304- def __init__(self, timescale=5.0):
2305- self.timescale = timescale
2306-
2307- def start(self, total=None, now=None):
2308- if now is None: now = time.time()
2309- self.total = total
2310- self.start_time = now
2311- self.last_update_time = now
2312- self.last_amount_read = 0
2313- self.ave_rate = None
2314-
2315- def update(self, amount_read, now=None):
2316- if now is None: now = time.time()
2317- if amount_read == 0:
2318- # if we just started this file, all bets are off
2319- self.last_update_time = now
2320- self.last_amount_read = 0
2321- self.ave_rate = None
2322- return
2323-
2324- #print 'times', now, self.last_update_time
2325- time_diff = now - self.last_update_time
2326- read_diff = amount_read - self.last_amount_read
2327- # First update, on reget is the file size
2328- if self.last_amount_read:
2329- self.last_update_time = now
2330- self.ave_rate = self._temporal_rolling_ave(\
2331- time_diff, read_diff, self.ave_rate, self.timescale)
2332- self.last_amount_read = amount_read
2333- #print 'results', time_diff, read_diff, self.ave_rate
2334-
2335- #####################################################################
2336- # result methods
2337- def average_rate(self):
2338- "get the average transfer rate (in bytes/second)"
2339- return self.ave_rate
2340-
2341- def elapsed_time(self):
2342- "the time between the start of the transfer and the most recent update"
2343- return self.last_update_time - self.start_time
2344-
2345- def remaining_time(self):
2346- "estimated time remaining"
2347- if not self.ave_rate or not self.total: return None
2348- return (self.total - self.last_amount_read) / self.ave_rate
2349-
2350- def fraction_read(self):
2351- """the fraction of the data that has been read
2352- (can be None for unknown transfer size)"""
2353- if self.total is None: return None
2354- elif self.total == 0: return 1.0
2355- else: return float(self.last_amount_read)/self.total
2356-
2357- #########################################################################
2358- # support methods
2359- def _temporal_rolling_ave(self, time_diff, read_diff, last_ave, timescale):
2360- """a temporal rolling average performs smooth averaging even when
2361- updates come at irregular intervals. This is performed by scaling
2362- the "epsilon" according to the time since the last update.
2363- Specifically, epsilon = time_diff / timescale
2364-
2365- As a general rule, the average will take on a completely new value
2366- after 'timescale' seconds."""
2367- epsilon = time_diff / timescale
2368- if epsilon > 1: epsilon = 1.0
2369- return self._rolling_ave(time_diff, read_diff, last_ave, epsilon)
2370-
2371- def _rolling_ave(self, time_diff, read_diff, last_ave, epsilon):
2372- """perform a "rolling average" iteration
2373- a rolling average "folds" new data into an existing average with
2374- some weight, epsilon. epsilon must be between 0.0 and 1.0 (inclusive)
2375- a value of 0.0 means only the old value (initial value) counts,
2376- and a value of 1.0 means only the newest value is considered."""
2377-
2378- try:
2379- recent_rate = read_diff / time_diff
2380- except ZeroDivisionError:
2381- recent_rate = None
2382- if last_ave is None: return recent_rate
2383- elif recent_rate is None: return last_ave
2384-
2385- # at this point, both last_ave and recent_rate are numbers
2386- return epsilon * recent_rate + (1 - epsilon) * last_ave
2387-
2388- def _round_remaining_time(self, rt, start_time=15.0):
2389- """round the remaining time, depending on its size
2390- If rt is between n*start_time and (n+1)*start_time round downward
2391- to the nearest multiple of n (for any counting number n).
2392- If rt < start_time, round down to the nearest 1.
2393- For example (for start_time = 15.0):
2394- 2.7 -> 2.0
2395- 25.2 -> 25.0
2396- 26.4 -> 26.0
2397- 35.3 -> 34.0
2398- 63.6 -> 60.0
2399- """
2400-
2401- if rt < 0: return 0.0
2402- shift = int(math.log(rt/start_time)/math.log(2))
2403- rt = int(rt)
2404- if shift <= 0: return rt
2405- return float(int(rt) >> shift << shift)
2406-
2407-
2408-def format_time(seconds, use_hours=0):
2409- if seconds is None or seconds < 0:
2410- if use_hours: return '--:--:--'
2411- else: return '--:--'
2412- else:
2413- seconds = int(seconds)
2414- minutes = seconds / 60
2415- seconds = seconds % 60
2416- if use_hours:
2417- hours = minutes / 60
2418- minutes = minutes % 60
2419- return '%02i:%02i:%02i' % (hours, minutes, seconds)
2420- else:
2421- return '%02i:%02i' % (minutes, seconds)
2422-
2423-def format_number(number, SI=0, space=' '):
2424- """Turn numbers into human-readable metric-like numbers"""
2425- symbols = ['', # (none)
2426- 'k', # kilo
2427- 'M', # mega
2428- 'G', # giga
2429- 'T', # tera
2430- 'P', # peta
2431- 'E', # exa
2432- 'Z', # zetta
2433- 'Y'] # yotta
2434-
2435- if SI: step = 1000.0
2436- else: step = 1024.0
2437-
2438- thresh = 999
2439- depth = 0
2440- max_depth = len(symbols) - 1
2441-
2442- # we want numbers between 0 and thresh, but don't exceed the length
2443- # of our list. In that event, the formatting will be screwed up,
2444- # but it'll still show the right number.
2445- while number > thresh and depth < max_depth:
2446- depth = depth + 1
2447- number = number / step
2448-
2449- if type(number) == type(1) or type(number) == type(1L):
2450- # it's an int or a long, which means it didn't get divided,
2451- # which means it's already short enough
2452- format = '%i%s%s'
2453- elif number < 9.95:
2454- # must use 9.95 for proper sizing. For example, 9.99 will be
2455- # rounded to 10.0 with the .1f format string (which is too long)
2456- format = '%.1f%s%s'
2457- else:
2458- format = '%.0f%s%s'
2459-
2460- return(format % (float(number or 0), space, symbols[depth]))
2461-
2462-def _tst(fn, cur, tot, beg, size, *args):
2463- tm = TextMeter()
2464- text = "(%d/%d): %s" % (cur, tot, fn)
2465- tm.start(fn, "http://www.example.com/path/to/fn/" + fn, fn, size, text=text)
2466- num = beg
2467- off = 0
2468- for (inc, delay) in args:
2469- off += 1
2470- while num < ((size * off) / len(args)):
2471- num += inc
2472- tm.update(num)
2473- time.sleep(delay)
2474- tm.end(size)
2475-
2476-if __name__ == "__main__":
2477- # (1/2): subversion-1.4.4-7.x86_64.rpm 2.4 MB / 85 kB/s 00:28
2478- # (2/2): mercurial-0.9.5-6.fc8.x86_64.rpm 924 kB / 106 kB/s 00:08
2479- if len(sys.argv) >= 2 and sys.argv[1] == 'total':
2480- text_meter_total_size(1000 + 10000 + 10000 + 1000000 + 1000000 +
2481- 1000000 + 10000 + 10000 + 10000 + 1000000)
2482- _tst("sm-1.0.0-1.fc8.i386.rpm", 1, 10, 0, 1000,
2483- (10, 0.2), (10, 0.1), (100, 0.25))
2484- _tst("s-1.0.1-1.fc8.i386.rpm", 2, 10, 0, 10000,
2485- (10, 0.2), (100, 0.1), (100, 0.1), (100, 0.25))
2486- _tst("m-1.0.1-2.fc8.i386.rpm", 3, 10, 5000, 10000,
2487- (10, 0.2), (100, 0.1), (100, 0.1), (100, 0.25))
2488- _tst("large-file-name-Foo-11.8.7-4.5.6.1.fc8.x86_64.rpm", 4, 10, 0, 1000000,
2489- (1000, 0.2), (1000, 0.1), (10000, 0.1))
2490- _tst("large-file-name-Foo2-11.8.7-4.5.6.2.fc8.x86_64.rpm", 5, 10,
2491- 500001, 1000000, (1000, 0.2), (1000, 0.1), (10000, 0.1))
2492- _tst("large-file-name-Foo3-11.8.7-4.5.6.3.fc8.x86_64.rpm", 6, 10,
2493- 750002, 1000000, (1000, 0.2), (1000, 0.1), (10000, 0.1))
2494- _tst("large-file-name-Foo4-10.8.7-4.5.6.1.fc8.x86_64.rpm", 7, 10, 0, 10000,
2495- (100, 0.1))
2496- _tst("large-file-name-Foo5-10.8.7-4.5.6.2.fc8.x86_64.rpm", 8, 10,
2497- 5001, 10000, (100, 0.1))
2498- _tst("large-file-name-Foo6-10.8.7-4.5.6.3.fc8.x86_64.rpm", 9, 10,
2499- 7502, 10000, (1, 0.1))
2500- _tst("large-file-name-Foox-9.8.7-4.5.6.1.fc8.x86_64.rpm", 10, 10,
2501- 0, 1000000, (10, 0.5),
2502- (100000, 0.1), (10000, 0.1), (10000, 0.1), (10000, 0.1),
2503- (100000, 0.1), (10000, 0.1), (10000, 0.1), (10000, 0.1),
2504- (100000, 0.1), (10000, 0.1), (10000, 0.1), (10000, 0.1),
2505- (100000, 0.1), (10000, 0.1), (10000, 0.1), (10000, 0.1),
2506- (100000, 0.1), (1, 0.1))
2507
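Note on the removed progress.py above: the RateEstimator docstrings describe a "temporal rolling average", where the weight given to a new rate sample (epsilon) grows with the time since the last update and is capped at 1.0. The standalone sketch below restates that idea in Python 3 for readers of this diff. It is not the upstream code: the function name temporal_rolling_ave and the sample values are illustrative assumptions, and only the zero-interval case is guarded.

```python
def temporal_rolling_ave(time_diff, read_diff, last_ave, timescale=5.0):
    """Fold one new rate sample into a running average (bytes/second).

    Hypothetical standalone restatement of the idea described in the
    RateEstimator docstrings; names and defaults here are assumptions.
    """
    if time_diff == 0:
        # No time has elapsed, so there is no new rate sample to fold in.
        return last_ave
    # Weight of the new sample: proportional to elapsed time, capped at 1.0.
    epsilon = min(time_diff / timescale, 1.0)
    recent_rate = read_diff / time_diff
    if last_ave is None:
        # First usable sample: nothing to average against yet.
        return recent_rate
    return epsilon * recent_rate + (1 - epsilon) * last_ave


if __name__ == "__main__":
    ave = None
    # (seconds since last update, bytes read since last update)
    for dt, nbytes in [(1.0, 1000), (1.0, 2000), (10.0, 5000)]:
        ave = temporal_rolling_ave(dt, nbytes, ave)
        print(dt, nbytes, ave)
```

In the sample run, the last update arrives after more than timescale seconds, so epsilon is capped at 1.0 and the new rate replaces the old average entirely. This matches the docstring's remark that the average takes on a completely new value after 'timescale' seconds.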
2508=== removed directory '.pc/progress_object_callback_fix.diff'
2509=== removed directory '.pc/progress_object_callback_fix.diff/urlgrabber'
2510=== removed file '.pc/progress_object_callback_fix.diff/urlgrabber/grabber.py'
2511--- .pc/progress_object_callback_fix.diff/urlgrabber/grabber.py 2011-08-09 17:45:08 +0000
2512+++ .pc/progress_object_callback_fix.diff/urlgrabber/grabber.py 1970-01-01 00:00:00 +0000
2513@@ -1,1802 +0,0 @@
2514-# This library is free software; you can redistribute it and/or
2515-# modify it under the terms of the GNU Lesser General Public
2516-# License as published by the Free Software Foundation; either
2517-# version 2.1 of the License, or (at your option) any later version.
2518-#
2519-# This library is distributed in the hope that it will be useful,
2520-# but WITHOUT ANY WARRANTY; without even the implied warranty of
2521-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
2522-# Lesser General Public License for more details.
2523-#
2524-# You should have received a copy of the GNU Lesser General Public
2525-# License along with this library; if not, write to the
2526-# Free Software Foundation, Inc.,
2527-# 59 Temple Place, Suite 330,
2528-# Boston, MA 02111-1307 USA
2529-
2530-# This file is part of urlgrabber, a high-level cross-protocol url-grabber
2531-# Copyright 2002-2004 Michael D. Stenner, Ryan Tomayko
2532-# Copyright 2009 Red Hat inc, pycurl code written by Seth Vidal
2533-
2534-"""A high-level cross-protocol url-grabber.
2535-
2536-GENERAL ARGUMENTS (kwargs)
2537-
2538- Where possible, the module-level default is indicated, and legal
2539- values are provided.
2540-
2541- copy_local = 0 [0|1]
2542-
2543- ignored except for file:// urls, in which case it specifies
2544- whether urlgrab should still make a copy of the file, or simply
2545- point to the existing copy. The module level default for this
2546- option is 0.
2547-
2548- close_connection = 0 [0|1]
2549-
2550- tells URLGrabber to close the connection after a file has been
2551- transferred. This is ignored unless the download happens with the
2552- http keepalive handler (keepalive=1). Otherwise, the connection
2553- is left open for further use. The module level default for this
2554- option is 0 (keepalive connections will not be closed).
2555-
2556- keepalive = 1 [0|1]
2557-
2558- specifies whether keepalive should be used for HTTP/1.1 servers
2559- that support it. The module level default for this option is 1
2560- (keepalive is enabled).
2561-
2562- progress_obj = None
2563-
2564- a class instance that supports the following methods:
2565- po.start(filename, url, basename, length, text)
2566- # length will be None if unknown
2567- po.update(read) # read == bytes read so far
2568- po.end()
2569-
2570- text = None
2571-
2572- specifies alternative text to be passed to the progress meter
2573- object. If not given, the default progress meter will use the
2574- basename of the file.
2575-
2576- throttle = 1.0
2577-
2578- a number - if it's an int, it's the bytes/second throttle limit.
2579- If it's a float, it is first multiplied by bandwidth. If throttle
2580- == 0, throttling is disabled. If None, the module-level default
2581- (which can be set on default_grabber.throttle) is used. See
2582- BANDWIDTH THROTTLING for more information.
2583-
2584- timeout = 300
2585-
2586- a positive integer expressing the number of seconds to wait before
2587- timing out attempts to connect to a server. If the value is None
2588- or 0, connection attempts will not time out. The timeout is passed
2589- to the underlying pycurl object as its CONNECTTIMEOUT option, see
2590- the curl documentation on CURLOPT_CONNECTTIMEOUT for more information.
2591- http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTCONNECTTIMEOUT
2592-
2593- bandwidth = 0
2594-
2595- the nominal max bandwidth in bytes/second. If throttle is a float
2596- and bandwidth == 0, throttling is disabled. If None, the
2597- module-level default (which can be set on
2598- default_grabber.bandwidth) is used. See BANDWIDTH THROTTLING for
2599- more information.
2600-
2601- range = None
2602-
2603- a tuple of the form (first_byte, last_byte) describing a byte
2604- range to retrieve. Either or both of the values may set to
2605- None. If first_byte is None, byte offset 0 is assumed. If
2606- last_byte is None, the last byte available is assumed. Note that
2607- the range specification is python-like in that (0,10) will yield
2608- the first 10 bytes of the file.
2609-
2610- If set to None, no range will be used.
2611-
2612- reget = None [None|'simple'|'check_timestamp']
2613-
2614- whether to attempt to reget a partially-downloaded file. Reget
2615- only applies to .urlgrab and (obviously) only if there is a
2616- partially downloaded file. Reget has two modes:
2617-
2618- 'simple' -- the local file will always be trusted. If there
2619- are 100 bytes in the local file, then the download will always
2620- begin 100 bytes into the requested file.
2621-
2622- 'check_timestamp' -- the timestamp of the server file will be
2623- compared to the timestamp of the local file. ONLY if the
2624- local file is newer than or the same age as the server file
2625- will reget be used. If the server file is newer, or the
2626- timestamp is not returned, the entire file will be fetched.
2627-
2628- NOTE: urlgrabber can do very little to verify that the partial
2629- file on disk is identical to the beginning of the remote file.
2630- You may want to either employ a custom "checkfunc" or simply avoid
2631- using reget in situations where corruption is a concern.
2632-
2633- user_agent = 'urlgrabber/VERSION'
2634-
2635- a string, usually of the form 'AGENT/VERSION' that is provided to
2636- HTTP servers in the User-agent header. The module level default
2637- for this option is "urlgrabber/VERSION".
2638-
2639- http_headers = None
2640-
2641- a tuple of 2-tuples, each containing a header and value. These
2642- will be used for http and https requests only. For example, you
2643- can do
2644- http_headers = (('Pragma', 'no-cache'),)
2645-
2646- ftp_headers = None
2647-
2648- this is just like http_headers, but will be used for ftp requests.
2649-
2650- proxies = None
2651-
2652- a dictionary that maps protocol schemes to proxy hosts. For
2653- example, to use a proxy server on host "foo" port 3128 for http
2654- and https URLs:
2655- proxies={ 'http' : 'http://foo:3128', 'https' : 'http://foo:3128' }
2656- note that proxy authentication information may be provided using
2657- normal URL constructs:
2658- proxies={ 'http' : 'http://user:password@foo:3128' }
2659- Lastly, if proxies is None, the default environment settings will
2660- be used.
2661-
2662- prefix = None
2663-
2664- a url prefix that will be prepended to all requested urls. For
2665- example:
2666- g = URLGrabber(prefix='http://foo.com/mirror/')
2667- g.urlgrab('some/file.txt')
2668- ## this will fetch 'http://foo.com/mirror/some/file.txt'
2669- This option exists primarily to allow identical behavior to
2670- MirrorGroup (and derived) instances. Note: a '/' will be inserted
2671- if necessary, so you cannot specify a prefix that ends with a
2672- partial file or directory name.
2673-
2674- opener = None
2675- No-op when using the curl backend (default)
2676-
2677- cache_openers = True
2678- No-op when using the curl backend (default)
2679-
2680- data = None
2681-
2682- Only relevant for the HTTP family (and ignored for other
2683- protocols), this allows HTTP POSTs. When the data kwarg is
2684- present (and not None), an HTTP request will automatically become
2685- a POST rather than GET. This is done by direct passthrough to
2686- urllib2. If you use this, you may also want to set the
2687- 'Content-length' and 'Content-type' headers with the http_headers
2688- option. Note that python 2.2 handles the case of these
2689- badly and if you do not use the proper case (shown here), your
2690- values will be overridden with the defaults.
2691-
2692- urlparser = URLParser()
2693-
2694- The URLParser class handles pre-processing of URLs, including
2695- auth-handling for user/pass encoded in http urls, file handling
2696- (that is, filenames not sent as a URL), and URL quoting. If you
2697- want to override any of this behavior, you can pass in a
2698- replacement instance. See also the 'quote' option.
2699-
2700- quote = None
2701-
2702- Whether or not to quote the path portion of a url.
2703- quote = 1 -> quote the URLs (they're not quoted yet)
2704- quote = 0 -> do not quote them (they're already quoted)
2705- quote = None -> guess what to do
2706-
2707- This option only affects proper urls like 'file:///etc/passwd'; it
2708- does not affect 'raw' filenames like '/etc/passwd'. The latter
2709- will always be quoted as they are converted to URLs. Also, only
2710- the path part of a url is quoted. If you need more fine-grained
2711- control, you should probably subclass URLParser and pass it in via
2712- the 'urlparser' option.
2713-
2714- ssl_ca_cert = None
2715-
2716- this option can be used if M2Crypto is available and will be
2717- ignored otherwise. If provided, it will be used to create an SSL
2718- context. If both ssl_ca_cert and ssl_context are provided, then
2719- ssl_context will be ignored and a new context will be created from
2720- ssl_ca_cert.
2721-
2722- ssl_context = None
2723-
2724- No-op when using the curl backend (default)
2725-
2726-
2727- self.ssl_verify_peer = True
2728-
2729- Check the server's certificate to make sure it is valid with what our CA validates
2730-
2731- self.ssl_verify_host = True
2732-
2733- Check the server's hostname to make sure it matches the certificate DN
2734-
2735- self.ssl_key = None
2736-
2737- Path to the key the client should use to connect/authenticate with
2738-
2739- self.ssl_key_type = 'PEM'
2740-
2741- PEM or DER - format of key
2742-
2743- self.ssl_cert = None
2744-
2745- Path to the ssl certificate the client should use to authenticate with
2746-
2747- self.ssl_cert_type = 'PEM'
2748-
2749- PEM or DER - format of certificate
2750-
2751- self.ssl_key_pass = None
2752-
2753- password to access the ssl_key
2754-
2755- self.size = None
2756-
2757- size (in bytes) or Maximum size of the thing being downloaded.
2758- This is mostly to keep us from exploding with an endless datastream
2759-
2760- self.max_header_size = 2097152
2761-
2762- Maximum size (in bytes) of the headers.
2763-
2764-
2765-RETRY RELATED ARGUMENTS
2766-
2767- retry = None
2768-
2769- the number of times to retry the grab before bailing. If this is
2770- zero, it will retry forever. This was intentional... really, it
2771- was :). If this value is not supplied or is supplied but is None
2772- retrying does not occur.
2773-
2774- retrycodes = [-1,2,4,5,6,7]
2775-
2776- a sequence of errorcodes (values of e.errno) for which it should
2777- retry. See the doc on URLGrabError for more details on this. You
2778- might consider modifying a copy of the default codes rather than
2779- building yours from scratch so that if the list is extended in the
2780- future (or one code is split into two) you can still enjoy the
2781- benefits of the default list. You can do that with something like
2782- this:
2783-
2784- retrycodes = urlgrabber.grabber.URLGrabberOptions().retrycodes
2785- if 12 not in retrycodes:
2786- retrycodes.append(12)
2787-
2788- checkfunc = None
2789-
2790- a function to do additional checks. This defaults to None, which
2791- means no additional checking. The function should simply return
2792- on a successful check. It should raise URLGrabError on an
2793- unsuccessful check. Raising of any other exception will be
2794- considered immediate failure and no retries will occur.
2795-
2796- If it raises URLGrabError, the error code will determine the retry
2797- behavior. Negative error numbers are reserved for use by these
2798- passed in functions, so you can use many negative numbers for
2799- different types of failure. By default, -1 results in a retry,
2800- but this can be customized with retrycodes.
2801-
2802- If you simply pass in a function, it will be given exactly one
2803- argument: a CallbackObject instance with the .url attribute
2804- defined and either .filename (for urlgrab) or .data (for urlread).
2805- For urlgrab, .filename is the name of the local file. For
2806- urlread, .data is the actual string data. If you need other
2807- arguments passed to the callback (program state of some sort), you
2808- can do so like this:
2809-
2810- checkfunc=(function, ('arg1', 2), {'kwarg': 3})
2811-
2812- if the downloaded file has filename /tmp/stuff, then this will
2813- result in this call (for urlgrab):
2814-
2815- function(obj, 'arg1', 2, kwarg=3)
2816- # obj.filename = '/tmp/stuff'
2817- # obj.url = 'http://foo.com/stuff'
2818-
2819- NOTE: both the "args" tuple and "kwargs" dict must be present if
2820- you use this syntax, but either (or both) can be empty.
2821-
2822- failure_callback = None
2823-
2824- The callback that gets called during retries when an attempt to
2825- fetch a file fails. The syntax for specifying the callback is
2826- identical to checkfunc, except for the attributes defined in the
2827- CallbackObject instance. The attributes for failure_callback are:
2828-
2829- exception = the raised exception
2830- url = the url we're trying to fetch
2831- tries = the number of tries so far (including this one)
2832- retry = the value of the retry option
2833-
2834- The callback is present primarily to inform the calling program of
2835- the failure, but if it raises an exception (including the one it's
2836- passed) that exception will NOT be caught and will therefore cause
2837- future retries to be aborted.
2838-
2839- The callback is called for EVERY failure, including the last one.
2840- On the last try, the callback can raise an alternate exception,
2841- but it cannot (without severe trickiness) prevent the exception
2842- from being raised.
2843-
2844- interrupt_callback = None
2845-
2846- This callback is called if KeyboardInterrupt is received at any
2847- point in the transfer. Basically, this callback can have three
2848- impacts on the fetch process based on the way it exits:
2849-
2850- 1) raise no exception: the current fetch will be aborted, but
2851- any further retries will still take place
2852-
2853- 2) raise a URLGrabError: if you're using a MirrorGroup, then
2854- this will prompt a failover to the next mirror according to
2855- the behavior of the MirrorGroup subclass. It is recommended
2856- that you raise URLGrabError with code 15, 'user abort'. If
2857- you are NOT using a MirrorGroup subclass, then this is the
2858- same as (3).
2859-
2860- 3) raise some other exception (such as KeyboardInterrupt), which
2861- will not be caught at either the grabber or mirror levels.
2862- That is, it will be raised up all the way to the caller.
2863-
2864- This callback is very similar to failure_callback. They are
2865- passed the same arguments, so you could use the same function for
2866- both.
2867-
2868-BANDWIDTH THROTTLING
2869-
2870- urlgrabber supports throttling via two values: throttle and
2871- bandwidth. Between the two, you can either specify an absolute
2872- throttle threshold or specify a threshold as a fraction of maximum
2873- available bandwidth.
2874-
2875- throttle is a number - if it's an int, it's the bytes/second
2876- throttle limit. If it's a float, it is first multiplied by
2877- bandwidth. If throttle == 0, throttling is disabled. If None, the
2878- module-level default (which can be set with set_throttle) is used.
2879-
2880- bandwidth is the nominal max bandwidth in bytes/second. If throttle
2881- is a float and bandwidth == 0, throttling is disabled. If None, the
2882- module-level default (which can be set with set_bandwidth) is used.
2883-
2884- THROTTLING EXAMPLES:
2885-
2886- Let's say you have a 100 Mbps connection. This is (about) 10^8 bits
2887- per second, or 12,500,000 Bytes per second. You have a number of
2888- throttling options:
2889-
2890- *) set_bandwidth(12500000); set_throttle(0.5) # throttle is a float
2891-
2892- This will limit urlgrab to use half of your available bandwidth.
2893-
2894- *) set_throttle(6250000) # throttle is an int
2895-
2896- This will also limit urlgrab to use half of your available
2897- bandwidth, regardless of what bandwidth is set to.
2898-
2899- *) set_bandwidth(6250000); set_throttle(1.0) # float
2900-
2901- Use half your bandwidth
2902-
2903- *) set_bandwidth(6250000); set_throttle(2.0) # float
2904-
2905- Use up to 12,500,000 Bytes per second (your nominal max bandwidth)
2906-
2907- *) set_bandwidth(6250000); set_throttle(0) # throttle = 0
2908-
2909- Disable throttling - this is more efficient than a very large
2910- throttle setting.
2911-
2912- *) set_bandwidth(0); set_throttle(1.0) # throttle is float, bandwidth = 0
2913-
2914- Disable throttling - this is the default when the module is loaded.
2915-
2916- SUGGESTED AUTHOR IMPLEMENTATION (THROTTLING)
2917-
2918- While this is flexible, it's not extremely obvious to the user. I
2919- suggest you implement a float throttle as a percent to make the
2920- distinction between absolute and relative throttling very explicit.
2921-
2922- Also, you may want to convert the units to something more convenient
2923- than bytes/second, such as kbps or kB/s, etc.
2924-
2925-"""
2926-
2927-
2928-
2929-import os
2930-import sys
2931-import urlparse
2932-import time
2933-import string
2934-import urllib
2935-import urllib2
2936-import mimetools
2937-import thread
2938-import types
2939-import stat
2940-import pycurl
2941-from ftplib import parse150
2942-from StringIO import StringIO
2943-from httplib import HTTPException
2944-import socket
2945-from byterange import range_tuple_normalize, range_tuple_to_header, RangeError
2946-
2947-########################################################################
2948-# MODULE INITIALIZATION
2949-########################################################################
2950-try:
2951- exec('from ' + (__name__.split('.'))[0] + ' import __version__')
2952-except:
2953- __version__ = '???'
2954-
2955-try:
2956- # this part isn't going to do much - need to talk to gettext
2957- from i18n import _
2958-except ImportError, msg:
2959- def _(st): return st
2960-
2961-########################################################################
2962-# functions for debugging output. These functions are here because they
2963-# are also part of the module initialization.
2964-DEBUG = None
2965-def set_logger(DBOBJ):
2966- """Set the DEBUG object. This is called by _init_default_logger when
2967- the environment variable URLGRABBER_DEBUG is set, but can also be
2968- called by a calling program. Basically, if the calling program uses
2969- the logging module and would like to incorporate urlgrabber logging,
2970- then it can do so this way. It's probably not necessary as most
2971- internal logging is only for debugging purposes.
2972-
2973- The passed-in object should be a logging.Logger instance. It will
2974- be pushed into the keepalive and byterange modules if they're
2975- being used. The mirror module pulls this object in on import, so
2976- you will need to manually push into it. In fact, you may find it
2977- tidier to simply push your logging object (or objects) into each
2978- of these modules independently.
2979- """
2980-
2981- global DEBUG
2982- DEBUG = DBOBJ
2983-
2984-def _init_default_logger(logspec=None):
2985- '''Examines the environment variable URLGRABBER_DEBUG and creates
2986- a logging object (logging.Logger) based on the contents. It takes
2987- the form
2988-
2989- URLGRABBER_DEBUG=level,filename
2990-
2991- where "level" can be either an integer or a log level from the
2992- logging module (DEBUG, INFO, etc). If the integer is zero or
2993- less, logging will be disabled. Filename is the filename where
2994- logs will be sent. If it is "-", then stdout will be used. If
2995- the filename is empty or missing, stderr will be used. If the
2996- variable cannot be processed or the logging module cannot be
2997- imported (python < 2.3) then logging will be disabled. Here are
2998- some examples:
2999-
3000- URLGRABBER_DEBUG=1,debug.txt # log everything to debug.txt
3001- URLGRABBER_DEBUG=WARNING,- # log warning and higher to stdout
3002- URLGRABBER_DEBUG=INFO # log info and higher to stderr
3003-
3004- This function is called during module initialization. It is not
3005- intended to be called from outside. The only reason it is a
3006- function at all is to keep the module-level namespace tidy and to
3007- collect the code into a nice block.'''
3008-
3009- try:
3010- if logspec is None:
3011- logspec = os.environ['URLGRABBER_DEBUG']
3012- dbinfo = logspec.split(',')
3013- import logging
3014- level = logging._levelNames.get(dbinfo[0], None)
3015- if level is None: level = int(dbinfo[0])
3016- if level < 1: raise ValueError()
3017-
3018- formatter = logging.Formatter('%(asctime)s %(message)s')
3019- if len(dbinfo) > 1: filename = dbinfo[1]
3020- else: filename = ''
3021- if filename == '': handler = logging.StreamHandler(sys.stderr)
3022- elif filename == '-': handler = logging.StreamHandler(sys.stdout)
3023- else: handler = logging.FileHandler(filename)
3024- handler.setFormatter(formatter)
3025- DBOBJ = logging.getLogger('urlgrabber')
3026- DBOBJ.addHandler(handler)
3027- DBOBJ.setLevel(level)
3028- except (KeyError, ImportError, ValueError):
3029- DBOBJ = None
3030- set_logger(DBOBJ)
3031-
3032-def _log_package_state():
3033- if not DEBUG: return
3034- DEBUG.info('urlgrabber version = %s' % __version__)
3035- DEBUG.info('trans function "_" = %s' % _)
3036-
3037-_init_default_logger()
3038-_log_package_state()
3039-
3040-
3041-# normally this would be from i18n or something like it ...
3042-def _(st):
3043- return st
3044-
3045-########################################################################
3046-# END MODULE INITIALIZATION
3047-########################################################################
3048-
3049-
3050-
3051-class URLGrabError(IOError):
3052- """
3053- URLGrabError error codes:
3054-
3055- URLGrabber error codes (0 -- 255)
3056- 0 - everything looks good (you should never see this)
3057- 1 - malformed url
3058- 2 - local file doesn't exist
3059- 3 - request for non-file local file (dir, etc)
3060- 4 - IOError on fetch
3061- 5 - OSError on fetch
3062- 6 - no content length header when we expected one
3063- 7 - HTTPException
3064- 8 - Exceeded read limit (for urlread)
3065- 9 - Requested byte range not satisfiable.
3066- 10 - Byte range requested, but range support unavailable
3067- 11 - Illegal reget mode
3068- 12 - Socket timeout
3069- 13 - malformed proxy url
3070- 14 - HTTPError (includes .code and .exception attributes)
3071- 15 - user abort
3072- 16 - error writing to local file
3073-
3074- MirrorGroup error codes (256 -- 511)
3075- 256 - No more mirrors left to try
3076-
3077- Custom (non-builtin) classes derived from MirrorGroup (512 -- 767)
3078- [ this range reserved for application-specific error codes ]
3079-
3080- Retry codes (< 0)
3081- -1 - retry the download, unknown reason
3082-
3083- Note: to test which group a code is in, you can simply do integer
3084- division by 256: e.errno / 256
3085-
3086- Negative codes are reserved for use by functions passed in to
3087- retrygrab with checkfunc. The value -1 is built in as a generic
3088- retry code and is already included in the retrycodes list.
3089- Therefore, you can create a custom check function that simply
3090- returns -1 and the fetch will be re-tried. For more customized
3091- retries, you can use other negative numbers and include them in
3092- retrycodes. This is nice for outputting useful messages about
3093- what failed.
3094-
3095- You can use these error codes like so:
3096- try: urlgrab(url)
3097- except URLGrabError, e:
3098- if e.errno == 3: ...
3099- # or
3100- print e.strerror
3101- # or simply
3102- print e #### print '[Errno %i] %s' % (e.errno, e.strerror)
3103- """
3104- def __init__(self, *args):
3105- IOError.__init__(self, *args)
3106- self.url = "No url specified"
3107-
3108-class CallbackObject:
3109- """Container for returned callback data.
3110-
3111- This is currently a dummy class into which urlgrabber can stuff
3112- information for passing to callbacks. This way, the prototype for
3113- all callbacks is the same, regardless of the data that will be
3114- passed back. Any function that accepts a callback function as an
3115- argument SHOULD document what it will define in this object.
3116-
3117- It is possible that this class will have some greater
3118- functionality in the future.
3119- """
3120- def __init__(self, **kwargs):
3121- self.__dict__.update(kwargs)
3122-
3123-def urlgrab(url, filename=None, **kwargs):
3124- """grab the file at <url> and make a local copy at <filename>
3125- If filename is none, the basename of the url is used.
3126- urlgrab returns the filename of the local file, which may be different
3127- from the passed-in filename if the copy_local kwarg == 0.
3128-
3129- See module documentation for a description of possible kwargs.
3130- """
3131- return default_grabber.urlgrab(url, filename, **kwargs)
3132-
3133-def urlopen(url, **kwargs):
3134- """open the url and return a file object
3135- If a progress object or throttle specifications exist, then
3136- a special file object will be returned that supports them.
3137- The file object can be treated like any other file object.
3138-
3139- See module documentation for a description of possible kwargs.
3140- """
3141- return default_grabber.urlopen(url, **kwargs)
3142-
3143-def urlread(url, limit=None, **kwargs):
3144- """read the url into a string, up to 'limit' bytes
3145- If the limit is exceeded, an exception will be thrown. Note that urlread
3146- is NOT intended to be used as a way of saying "I want the first N bytes"
3147- but rather 'read the whole file into memory, but don't use too much'
3148-
3149- See module documentation for a description of possible kwargs.
3150- """
3151- return default_grabber.urlread(url, limit, **kwargs)
3152-
3153-
3154-class URLParser:
3155- """Process the URLs before passing them to urllib2.
3156-
3157- This class does several things:
3158-
3159- * add any prefix
3160- * translate a "raw" file to a proper file: url
3161- * handle any http or https auth that's encoded within the url
3162- * quote the url
3163-
3164- Only the "parse" method is called directly, and it calls sub-methods.
3165-
3166- An instance of this class is held in the options object, which
3167- means that it's easy to change the behavior by sub-classing and
3168- passing the replacement in. It need only have a method like:
3169-
3170- url, parts = urlparser.parse(url, opts)
3171- """
3172-
3173- def parse(self, url, opts):
3174- """parse the url and return the (modified) url and its parts
3175-
3176- Note: a raw file WILL be quoted when it's converted to a URL.
3177- However, other urls (ones which come with a proper scheme) may
3178- or may not be quoted according to opts.quote
3179-
3180- opts.quote = 1 --> quote it
3181- opts.quote = 0 --> do not quote it
3182- opts.quote = None --> guess
3183- """
3184- quote = opts.quote
3185-
3186- if opts.prefix:
3187- url = self.add_prefix(url, opts.prefix)
3188-
3189- parts = urlparse.urlparse(url)
3190- (scheme, host, path, parm, query, frag) = parts
3191-
3192- if not scheme or (len(scheme) == 1 and scheme in string.letters):
3193- # if a scheme isn't specified, we guess that it's "file:"
3194- if url[0] not in '/\\': url = os.path.abspath(url)
3195- url = 'file:' + urllib.pathname2url(url)
3196- parts = urlparse.urlparse(url)
3197- quote = 0 # pathname2url quotes, so we won't do it again
3198-
3199- if scheme in ['http', 'https']:
3200- parts = self.process_http(parts, url)
3201-
3202- if quote is None:
3203- quote = self.guess_should_quote(parts)
3204- if quote:
3205- parts = self.quote(parts)
3206-
3207- url = urlparse.urlunparse(parts)
3208- return url, parts
3209-
3210- def add_prefix(self, url, prefix):
3211- if prefix[-1] == '/' or url[0] == '/':
3212- url = prefix + url
3213- else:
3214- url = prefix + '/' + url
3215- return url
3216-
3217- def process_http(self, parts, url):
3218- (scheme, host, path, parm, query, frag) = parts
3219- # TODO: auth-parsing here, maybe? pycurl doesn't really need it
3220- return (scheme, host, path, parm, query, frag)
3221-
3222- def quote(self, parts):
3223- """quote the URL
3224-
3225- This method quotes ONLY the path part. If you need to quote
3226- other parts, you should override this and pass in your derived
3227- class. The other alternative is to quote other parts before
3228- passing into urlgrabber.
3229- """
3230- (scheme, host, path, parm, query, frag) = parts
3231- path = urllib.quote(path)
3232- return (scheme, host, path, parm, query, frag)
3233-
3234- hexvals = '0123456789ABCDEF'
3235- def guess_should_quote(self, parts):
3236- """
3237- Guess whether we should quote a path. This amounts to
3238- guessing whether it's already quoted.
3239-
3240- find ' ' -> 1
3241- find '%' -> 1
3242- find '%XX' -> 0
3243- else -> 1
3244- """
3245- (scheme, host, path, parm, query, frag) = parts
3246- if ' ' in path:
3247- return 1
3248- ind = string.find(path, '%')
3249- if ind > -1:
3250- while ind > -1:
3251- if len(path) < ind+3:
3252- return 1
3253- code = path[ind+1:ind+3].upper()
3254- if code[0] not in self.hexvals or \
3255- code[1] not in self.hexvals:
3256- return 1
3257- ind = string.find(path, '%', ind+1)
3258- return 0
3259- return 1
3260-
3261-class URLGrabberOptions:
3262- """Class to ease kwargs handling."""
3263-
3264- def __init__(self, delegate=None, **kwargs):
3265- """Initialize URLGrabberOptions object.
3266- Set default values for all options and then update options specified
3267- in kwargs.
3268- """
3269- self.delegate = delegate
3270- if delegate is None:
3271- self._set_defaults()
3272- self._set_attributes(**kwargs)
3273-
3274- def __getattr__(self, name):
3275- if self.delegate and hasattr(self.delegate, name):
3276- return getattr(self.delegate, name)
3277- raise AttributeError, name
3278-
3279- def raw_throttle(self):
3280- """Calculate raw throttle value from throttle and bandwidth
3281- values.
3282- """
3283- if self.throttle <= 0:
3284- return 0
3285- elif type(self.throttle) == type(0):
3286- return float(self.throttle)
3287- else: # throttle is a float
3288- return self.bandwidth * self.throttle
3289-
3290- def derive(self, **kwargs):
3291- """Create a derived URLGrabberOptions instance.
3292- This method creates a new instance and overrides the
3293- options specified in kwargs.
3294- """
3295- return URLGrabberOptions(delegate=self, **kwargs)
3296-
3297- def _set_attributes(self, **kwargs):
3298- """Update object attributes with those provided in kwargs."""
3299- self.__dict__.update(kwargs)
3300- if kwargs.has_key('range'):
3301- # normalize the supplied range value
3302- self.range = range_tuple_normalize(self.range)
3303- if not self.reget in [None, 'simple', 'check_timestamp']:
3304- raise URLGrabError(11, _('Illegal reget mode: %s') \
3305- % (self.reget, ))
3306-
3307- def _set_defaults(self):
3308- """Set all options to their default values.
3309- When adding new options, make sure a default is
3310- provided here.
3311- """
3312- self.progress_obj = None
3313- self.throttle = 1.0
3314- self.bandwidth = 0
3315- self.retry = None
3316- self.retrycodes = [-1,2,4,5,6,7]
3317- self.checkfunc = None
3318- self.copy_local = 0
3319- self.close_connection = 0
3320- self.range = None
3321- self.user_agent = 'urlgrabber/%s' % __version__
3322- self.keepalive = 1
3323- self.proxies = None
3324- self.reget = None
3325- self.failure_callback = None
3326- self.interrupt_callback = None
3327- self.prefix = None
3328- self.opener = None
3329- self.cache_openers = True
3330- self.timeout = 300
3331- self.text = None
3332- self.http_headers = None
3333- self.ftp_headers = None
3334- self.data = None
3335- self.urlparser = URLParser()
3336- self.quote = None
3337- self.ssl_ca_cert = None # sets SSL_CAINFO - path to certdb
3338- self.ssl_context = None # no-op in pycurl
3339- self.ssl_verify_peer = True # check peer's cert for authenticity
3340- self.ssl_verify_host = True # make sure who they are and who the cert is for matches
3341- self.ssl_key = None # client key
3342- self.ssl_key_type = 'PEM' #(or DER)
3343- self.ssl_cert = None # client cert
3344- self.ssl_cert_type = 'PEM' # (or DER)
3345- self.ssl_key_pass = None # password to access the key
3346- self.size = None # if we know how big the thing we're getting is going
3347- # to be. this is ultimately a MAXIMUM size for the file
3348- self.max_header_size = 2097152 #2mb seems reasonable for maximum header size
3349-
3350- def __repr__(self):
3351- return self.format()
3352-
3353- def format(self, indent=' '):
3354- keys = self.__dict__.keys()
3355- if self.delegate is not None:
3356- keys.remove('delegate')
3357- keys.sort()
3358- s = '{\n'
3359- for k in keys:
3360- s = s + indent + '%-15s: %s,\n' % \
3361- (repr(k), repr(self.__dict__[k]))
3362- if self.delegate:
3363- df = self.delegate.format(indent + ' ')
3364- s = s + indent + '%-15s: %s\n' % ("'delegate'", df)
3365- s = s + indent + '}'
3366- return s
3367-
3368-class URLGrabber:
3369- """Provides easy opening of URLs with a variety of options.
3370-
3371- All options are specified as kwargs. Options may be specified when
3372- the class is created and may be overridden on a per request basis.
3373-
3374- New objects inherit default values from default_grabber.
3375- """
3376-
3377- def __init__(self, **kwargs):
3378- self.opts = URLGrabberOptions(**kwargs)
3379-
3380- def _retry(self, opts, func, *args):
3381- tries = 0
3382- while 1:
3383- # there are only two ways out of this loop. The second has
3384- # several "sub-ways"
3385- # 1) via the return in the "try" block
3386- # 2) by some exception being raised
3387- # a) an exception is raised that we don't "except"
3388- # b) a callback raises ANY exception
3389- # c) we're not retry-ing or have run out of retries
3390- # d) the URLGrabError code is not in retrycodes
3391- # beware of infinite loops :)
3392- tries = tries + 1
3393- exception = None
3394- retrycode = None
3395- callback = None
3396- if DEBUG: DEBUG.info('attempt %i/%s: %s',
3397- tries, opts.retry, args[0])
3398- try:
3399- r = apply(func, (opts,) + args, {})
3400- if DEBUG: DEBUG.info('success')
3401- return r
3402- except URLGrabError, e:
3403- exception = e
3404- callback = opts.failure_callback
3405- retrycode = e.errno
3406- except KeyboardInterrupt, e:
3407- exception = e
3408- callback = opts.interrupt_callback
3409-
3410- if DEBUG: DEBUG.info('exception: %s', exception)
3411- if callback:
3412- if DEBUG: DEBUG.info('calling callback: %s', callback)
3413- cb_func, cb_args, cb_kwargs = self._make_callback(callback)
3414- obj = CallbackObject(exception=exception, url=args[0],
3415- tries=tries, retry=opts.retry)
3416- cb_func(obj, *cb_args, **cb_kwargs)
3417-
3418- if (opts.retry is None) or (tries == opts.retry):
3419- if DEBUG: DEBUG.info('retries exceeded, re-raising')
3420- raise
3421-
3422- if (retrycode is not None) and (retrycode not in opts.retrycodes):
3423- if DEBUG: DEBUG.info('retrycode (%i) not in list %s, re-raising',
3424- retrycode, opts.retrycodes)
3425- raise
3426-
3427- def urlopen(self, url, **kwargs):
3428- """open the url and return a file object
3429- If a progress object or throttle value was specified when this
3430- object was created, then a special file object will be
3431- returned that supports them. The file object can be treated
3432- like any other file object.
3433- """
3434- opts = self.opts.derive(**kwargs)
3435- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
3436- (url,parts) = opts.urlparser.parse(url, opts)
3437- def retryfunc(opts, url):
3438- return PyCurlFileObject(url, filename=None, opts=opts)
3439- return self._retry(opts, retryfunc, url)
3440-
3441- def urlgrab(self, url, filename=None, **kwargs):
3442- """grab the file at <url> and make a local copy at <filename>
3443- If filename is none, the basename of the url is used.
3444- urlgrab returns the filename of the local file, which may be
3445- different from the passed-in filename if copy_local == 0.
3446- """
3447- opts = self.opts.derive(**kwargs)
3448- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
3449- (url,parts) = opts.urlparser.parse(url, opts)
3450- (scheme, host, path, parm, query, frag) = parts
3451- if filename is None:
3452- filename = os.path.basename( urllib.unquote(path) )
3453- if scheme == 'file' and not opts.copy_local:
3454- # just return the name of the local file - don't make a
3455- # copy currently
3456- path = urllib.url2pathname(path)
3457- if host:
3458- path = os.path.normpath('//' + host + path)
3459- if not os.path.exists(path):
3460- err = URLGrabError(2,
3461- _('Local file does not exist: %s') % (path, ))
3462- err.url = url
3463- raise err
3464- elif not os.path.isfile(path):
3465- err = URLGrabError(3,
3466- _('Not a normal file: %s') % (path, ))
3467- err.url = url
3468- raise err
3469-
3470- elif not opts.range:
3471- if not opts.checkfunc is None:
3472- cb_func, cb_args, cb_kwargs = \
3473- self._make_callback(opts.checkfunc)
3474- obj = CallbackObject()
3475- obj.filename = path
3476- obj.url = url
3477- apply(cb_func, (obj, )+cb_args, cb_kwargs)
3478- return path
3479-
3480- def retryfunc(opts, url, filename):
3481- fo = PyCurlFileObject(url, filename, opts)
3482- try:
3483- fo._do_grab()
3484- if not opts.checkfunc is None:
3485- cb_func, cb_args, cb_kwargs = \
3486- self._make_callback(opts.checkfunc)
3487- obj = CallbackObject()
3488- obj.filename = filename
3489- obj.url = url
3490- apply(cb_func, (obj, )+cb_args, cb_kwargs)
3491- finally:
3492- fo.close()
3493- return filename
3494-
3495- return self._retry(opts, retryfunc, url, filename)
3496-
3497- def urlread(self, url, limit=None, **kwargs):
3498- """read the url into a string, up to 'limit' bytes
3499- If the limit is exceeded, an exception will be thrown. Note
3500- that urlread is NOT intended to be used as a way of saying
3501- "I want the first N bytes" but rather 'read the whole file
3502- into memory, but don't use too much'
3503- """
3504- opts = self.opts.derive(**kwargs)
3505- if DEBUG: DEBUG.debug('combined options: %s' % repr(opts))
3506- (url,parts) = opts.urlparser.parse(url, opts)
3507- if limit is not None:
3508- limit = limit + 1
3509-
3510- def retryfunc(opts, url, limit):
3511- fo = PyCurlFileObject(url, filename=None, opts=opts)
3512- s = ''
3513- try:
3514- # this is an unfortunate thing. Some file-like objects
3515- # have a default "limit" of None, while the built-in (real)
3516- # file objects have -1. They each break the other, so for
3517- # now, we just force the default if necessary.
3518- if limit is None: s = fo.read()
3519- else: s = fo.read(limit)
3520-
3521- if not opts.checkfunc is None:
3522- cb_func, cb_args, cb_kwargs = \
3523- self._make_callback(opts.checkfunc)
3524- obj = CallbackObject()
3525- obj.data = s
3526- obj.url = url
3527- apply(cb_func, (obj, )+cb_args, cb_kwargs)
3528- finally:
3529- fo.close()
3530- return s
3531-
3532- s = self._retry(opts, retryfunc, url, limit)
3533- if limit and len(s) > limit:
3534- err = URLGrabError(8,
3535- _('Exceeded limit (%i): %s') % (limit, url))
3536- err.url = url
3537- raise err
3538-
3539- return s
3540-
3541- def _make_callback(self, callback_obj):
3542- if callable(callback_obj):
3543- return callback_obj, (), {}
3544- else:
3545- return callback_obj
3546-
3547-# create the default URLGrabber used by urlXXX functions.
3548-# NOTE: actual defaults are set in URLGrabberOptions
3549-default_grabber = URLGrabber()
3550-
3551-
3552-class PyCurlFileObject():
3553- def __init__(self, url, filename, opts):
3554- self.fo = None
3555- self._hdr_dump = ''
3556- self._parsed_hdr = None
3557- self.url = url
3558- self.scheme = urlparse.urlsplit(self.url)[0]
3559- self.filename = filename
3560- self.append = False
3561- self.reget_time = None
3562- self.opts = opts
3563- if self.opts.reget == 'check_timestamp':
3564- raise NotImplementedError, "check_timestamp regets are not implemented in this ver of urlgrabber. Please report this."
3565- self._complete = False
3566- self._rbuf = ''
3567- self._rbufsize = 1024*8
3568- self._ttime = time.time()
3569- self._tsize = 0
3570- self._amount_read = 0
3571- self._reget_length = 0
3572- self._prog_running = False
3573- self._error = (None, None)
3574- self.size = 0
3575- self._hdr_ended = False
3576- self._do_open()
3577-
3578-
3579- def geturl(self):
3580- """ Provide the geturl() method, used to be got from
3581- urllib.addinfourl, via. urllib.URLopener.* """
3582- return self.url
3583-
3584- def __getattr__(self, name):
3585- """This effectively allows us to wrap at the instance level.
3586- Any attribute not found in _this_ object will be searched for
3587- in self.fo. This includes methods."""
3588-
3589- if hasattr(self.fo, name):
3590- return getattr(self.fo, name)
3591- raise AttributeError, name
3592-
3593- def _retrieve(self, buf):
3594- try:
3595- if not self._prog_running:
3596- if self.opts.progress_obj:
3597- size = self.size + self._reget_length
3598- self.opts.progress_obj.start(self._prog_reportname,
3599- urllib.unquote(self.url),
3600- self._prog_basename,
3601- size=size,
3602- text=self.opts.text)
3603- self._prog_running = True
3604- self.opts.progress_obj.update(self._amount_read)
3605-
3606- self._amount_read += len(buf)
3607- self.fo.write(buf)
3608- return len(buf)
3609- except KeyboardInterrupt:
3610- return -1
3611-
3612- def _hdr_retrieve(self, buf):
3613- if self._hdr_ended:
3614- self._hdr_dump = ''
3615- self.size = 0
3616- self._hdr_ended = False
3617-
3618- if self._over_max_size(cur=len(self._hdr_dump),
3619- max_size=self.opts.max_header_size):
3620- return -1
3621- try:
3622- self._hdr_dump += buf
3623- # we have to get the size before we do the progress obj start
3624- # but we can't do that w/o making it do 2 connects, which sucks
3625- # so we cheat and stuff it in here in the hdr_retrieve
3626- if self.scheme in ['http','https'] and buf.lower().find('content-length') != -1:
3627- length = buf.split(':')[1]
3628- self.size = int(length)
3629- elif self.scheme in ['ftp']:
3630- s = None
3631- if buf.startswith('213 '):
3632- s = buf[3:].strip()
3633- elif buf.startswith('150 '):
3634- s = parse150(buf)
3635- if s:
3636- self.size = int(s)
3637-
3638- if buf.lower().find('location') != -1:
3639- location = ':'.join(buf.split(':')[1:])
3640- location = location.strip()
3641- self.scheme = urlparse.urlsplit(location)[0]
3642- self.url = location
3643-
3644- if len(self._hdr_dump) != 0 and buf == '\r\n':
3645- self._hdr_ended = True
3646- if DEBUG: DEBUG.info('header ended:')
3647-
3648- return len(buf)
3649- except KeyboardInterrupt:
3650- return pycurl.READFUNC_ABORT
3651-
3652- def _return_hdr_obj(self):
3653- if self._parsed_hdr:
3654- return self._parsed_hdr
3655- statusend = self._hdr_dump.find('\n')
3656- statusend += 1 # ridiculous as it may seem.
3657- hdrfp = StringIO()
3658- hdrfp.write(self._hdr_dump[statusend:])
3659- hdrfp.seek(0)
3660- self._parsed_hdr = mimetools.Message(hdrfp)
3661- return self._parsed_hdr
3662-
3663- hdr = property(_return_hdr_obj)
3664- http_code = property(fget=
3665- lambda self: self.curl_obj.getinfo(pycurl.RESPONSE_CODE))
3666-
3667- def _set_opts(self, opts={}):
3668- # XXX
3669- if not opts:
3670- opts = self.opts
3671-
3672-
3673- # defaults we're always going to set
3674- self.curl_obj.setopt(pycurl.NOPROGRESS, False)
3675- self.curl_obj.setopt(pycurl.NOSIGNAL, True)
3676- self.curl_obj.setopt(pycurl.WRITEFUNCTION, self._retrieve)
3677- self.curl_obj.setopt(pycurl.HEADERFUNCTION, self._hdr_retrieve)
3678- self.curl_obj.setopt(pycurl.PROGRESSFUNCTION, self._progress_update)
3679- self.curl_obj.setopt(pycurl.FAILONERROR, True)
3680- self.curl_obj.setopt(pycurl.OPT_FILETIME, True)
3681- self.curl_obj.setopt(pycurl.FOLLOWLOCATION, True)
3682-
3683- if DEBUG:
3684- self.curl_obj.setopt(pycurl.VERBOSE, True)
3685- if opts.user_agent:
3686- self.curl_obj.setopt(pycurl.USERAGENT, opts.user_agent)
3687-
3688- # maybe to be options later
3689- self.curl_obj.setopt(pycurl.FOLLOWLOCATION, True)
3690- self.curl_obj.setopt(pycurl.MAXREDIRS, 5)
3691-
3692- # timeouts
3693- timeout = 300
3694- if hasattr(opts, 'timeout'):
3695- timeout = int(opts.timeout or 0)
3696- self.curl_obj.setopt(pycurl.CONNECTTIMEOUT, timeout)
3697- self.curl_obj.setopt(pycurl.LOW_SPEED_LIMIT, 1)
3698- self.curl_obj.setopt(pycurl.LOW_SPEED_TIME, timeout)
3699-
3700- # ssl options
3701- if self.scheme == 'https':
3702- if opts.ssl_ca_cert: # this may do ZERO with nss according to curl docs
3703- self.curl_obj.setopt(pycurl.CAPATH, opts.ssl_ca_cert)
3704- self.curl_obj.setopt(pycurl.CAINFO, opts.ssl_ca_cert)
3705- self.curl_obj.setopt(pycurl.SSL_VERIFYPEER, opts.ssl_verify_peer)
3706- self.curl_obj.setopt(pycurl.SSL_VERIFYHOST, opts.ssl_verify_host)
3707- if opts.ssl_key:
3708- self.curl_obj.setopt(pycurl.SSLKEY, opts.ssl_key)
3709- if opts.ssl_key_type:
3710- self.curl_obj.setopt(pycurl.SSLKEYTYPE, opts.ssl_key_type)
3711- if opts.ssl_cert:
3712- self.curl_obj.setopt(pycurl.SSLCERT, opts.ssl_cert)
3713- if opts.ssl_cert_type:
3714- self.curl_obj.setopt(pycurl.SSLCERTTYPE, opts.ssl_cert_type)
3715- if opts.ssl_key_pass:
3716- self.curl_obj.setopt(pycurl.SSLKEYPASSWD, opts.ssl_key_pass)
3717-
3718- #headers:
3719- if opts.http_headers and self.scheme in ('http', 'https'):
3720- headers = []
3721- for (tag, content) in opts.http_headers:
3722- headers.append('%s:%s' % (tag, content))
3723- self.curl_obj.setopt(pycurl.HTTPHEADER, headers)
3724-
3725- # ranges:
3726- if opts.range or opts.reget:
3727- range_str = self._build_range()
3728- if range_str:
3729- self.curl_obj.setopt(pycurl.RANGE, range_str)
3730-
3731- # throttle/bandwidth
3732- if hasattr(opts, 'raw_throttle') and opts.raw_throttle():
3733- self.curl_obj.setopt(pycurl.MAX_RECV_SPEED_LARGE, int(opts.raw_throttle()))
3734-
3735- # proxy settings
3736- if opts.proxies:
3737- for (scheme, proxy) in opts.proxies.items():
3738- if self.scheme in ('ftp'): # only set the ftp proxy for ftp items
3739- if scheme not in ('ftp'):
3740- continue
3741- else:
3742- if proxy == '_none_': proxy = ""
3743- self.curl_obj.setopt(pycurl.PROXY, proxy)
3744- elif self.scheme in ('http', 'https'):
3745- if scheme not in ('http', 'https'):
3746- continue
3747- else:
3748- if proxy == '_none_': proxy = ""
3749- self.curl_obj.setopt(pycurl.PROXY, proxy)
3750-
3751- # FIXME username/password/auth settings
3752-
3753- #posts - simple - expects the fields as they are
3754- if opts.data:
3755- self.curl_obj.setopt(pycurl.POST, True)
3756- self.curl_obj.setopt(pycurl.POSTFIELDS, self._to_utf8(opts.data))
3757-
3758- # our url
3759- self.curl_obj.setopt(pycurl.URL, self.url)
3760-
3761-
3762- def _do_perform(self):
3763- if self._complete:
3764- return
3765-
3766- try:
3767- self.curl_obj.perform()
3768- except pycurl.error, e:
3769- # XXX - break some of these out a bit more clearly
3770- # to other URLGrabErrors from
3771- # http://curl.haxx.se/libcurl/c/libcurl-errors.html
3772- # this covers e.args[0] == 22 pretty well - which will be common
3773-
3774- code = self.http_code
3775- errcode = e.args[0]
3776- if self._error[0]:
3777- errcode = self._error[0]
3778-
3779- if errcode == 23 and code >= 200 and code < 299:
3780- err = URLGrabError(15, _('User (or something) called abort %s: %s') % (self.url, e))
3781- err.url = self.url
3782-
3783- # this is probably wrong but ultimately this is what happens
3784- # we have a legit http code and a pycurl 'writer failed' code
3785- # which almost always means something aborted it from outside
3786- # since we cannot know what it is -I'm banking on it being
3787- # a ctrl-c. XXXX - if there's a way of going back two raises to
3788- # figure out what aborted the pycurl process FIXME
3789- raise KeyboardInterrupt
3790-
3791- elif errcode == 28:
3792- err = URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
3793- err.url = self.url
3794- raise err
3795- elif errcode == 35:
3796- msg = _("problem making ssl connection")
3797- err = URLGrabError(14, msg)
3798- err.url = self.url
3799- raise err
3800- elif errcode == 37:
3801- msg = _("Could not open/read %s") % (self.url)
3802- err = URLGrabError(14, msg)
3803- err.url = self.url
3804- raise err
3805-
3806- elif errcode == 42:
3807- err = URLGrabError(15, _('User (or something) called abort %s: %s') % (self.url, e))
3808- err.url = self.url
3809- # this is probably wrong but ultimately this is what happens
3810- # we have a legit http code and a pycurl 'writer failed' code
3811- # which almost always means something aborted it from outside
3812- # since we cannot know what it is -I'm banking on it being
3813- # a ctrl-c. XXXX - if there's a way of going back two raises to
3814- # figure out what aborted the pycurl process FIXME
3815- raise KeyboardInterrupt
3816-
3817- elif errcode == 58:
3818- msg = _("problem with the local client certificate")
3819- err = URLGrabError(14, msg)
3820- err.url = self.url
3821- raise err
3822-
3823- elif errcode == 60:
3824- msg = _("Peer cert cannot be verified or peer cert invalid")
3825- err = URLGrabError(14, msg)
3826- err.url = self.url
3827- raise err
3828-
3829- elif errcode == 63:
3830- if self._error[1]:
3831- msg = self._error[1]
3832- else:
3833- msg = _("Max download size exceeded on %s") % (self.url)
3834- err = URLGrabError(14, msg)
3835- err.url = self.url
3836- raise err
3837-
3838- elif str(e.args[1]) == '' and self.http_code != 0: # fake it until you make it
3839- if self.scheme in ['http', 'https']:
3840- msg = 'HTTP Error %s : %s ' % (self.http_code, self.url)
3841- elif self.scheme in ['ftp']:
3842- msg = 'FTP Error %s : %s ' % (self.http_code, self.url)
3843- else:
3844- msg = "Unknown Error: URL=%s , scheme=%s" % (self.url, self.scheme)
3845- else:
3846- msg = 'PYCURL ERROR %s - "%s"' % (errcode, str(e.args[1]))
3847- code = errcode
3848- err = URLGrabError(14, msg)
3849- err.code = code
3850- err.exception = e
3851- raise err
3852- else:
3853- if self._error[1]:
3854- msg = self._error[1]
3855- err = URLGRabError(14, msg)
3856- err.url = self.url
3857- raise err
3858-
3859- def _do_open(self):
3860- self.curl_obj = _curl_cache
3861- self.curl_obj.reset() # reset all old settings away, just in case
3862- # setup any ranges
3863- self._set_opts()
3864- self._do_grab()
3865- return self.fo
3866-
3867- def _add_headers(self):
3868- pass
3869-
3870- def _build_range(self):
3871- reget_length = 0
3872- rt = None
3873- if self.opts.reget and type(self.filename) in types.StringTypes:
3874- # we have reget turned on and we're dumping to a file
3875- try:
3876- s = os.stat(self.filename)
3877- except OSError:
3878- pass
3879- else:
3880- self.reget_time = s[stat.ST_MTIME]
3881- reget_length = s[stat.ST_SIZE]
3882-
3883- # Set initial length when regetting
3884- self._amount_read = reget_length
3885- self._reget_length = reget_length # set where we started from, too
3886-
3887- rt = reget_length, ''
3888- self.append = 1
3889-
3890- if self.opts.range:
3891- rt = self.opts.range
3892- if rt[0]: rt = (rt[0] + reget_length, rt[1])
3893-
3894- if rt:
3895- header = range_tuple_to_header(rt)
3896- if header:
3897- return header.split('=')[1]
3898-
3899-
3900-
3901- def _make_request(self, req, opener):
3902- #XXXX
3903- # This doesn't do anything really, but we could use this
3904- # instead of do_open() to catch a lot of crap errors as
3905- # mstenner did before here
3906- return (self.fo, self.hdr)
3907-
3908- try:
3909- if self.opts.timeout:
3910- old_to = socket.getdefaulttimeout()
3911- socket.setdefaulttimeout(self.opts.timeout)
3912- try:
3913- fo = opener.open(req)
3914- finally:
3915- socket.setdefaulttimeout(old_to)
3916- else:
3917- fo = opener.open(req)
3918- hdr = fo.info()
3919- except ValueError, e:
3920- err = URLGrabError(1, _('Bad URL: %s : %s') % (self.url, e, ))
3921- err.url = self.url
3922- raise err
3923-
3924- except RangeError, e:
3925- err = URLGrabError(9, _('%s on %s') % (e, self.url))
3926- err.url = self.url
3927- raise err
3928- except urllib2.HTTPError, e:
3929- new_e = URLGrabError(14, _('%s on %s') % (e, self.url))
3930- new_e.code = e.code
3931- new_e.exception = e
3932- new_e.url = self.url
3933- raise new_e
3934- except IOError, e:
3935- if hasattr(e, 'reason') and isinstance(e.reason, socket.timeout):
3936- err = URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
3937- err.url = self.url
3938- raise err
3939- else:
3940- err = URLGrabError(4, _('IOError on %s: %s') % (self.url, e))
3941- err.url = self.url
3942- raise err
3943-
3944- except OSError, e:
3945- err = URLGrabError(5, _('%s on %s') % (e, self.url))
3946- err.url = self.url
3947- raise err
3948-
3949- except HTTPException, e:
3950- err = URLGrabError(7, _('HTTP Exception (%s) on %s: %s') % \
3951- (e.__class__.__name__, self.url, e))
3952- err.url = self.url
3953- raise err
3954-
3955- else:
3956- return (fo, hdr)
3957-
3958- def _do_grab(self):
3959- """dump the file to a filename or StringIO buffer"""
3960-
3961- if self._complete:
3962- return
3963- _was_filename = False
3964- if type(self.filename) in types.StringTypes and self.filename:
3965- _was_filename = True
3966- self._prog_reportname = str(self.filename)
3967- self._prog_basename = os.path.basename(self.filename)
3968-
3969- if self.append: mode = 'ab'
3970- else: mode = 'wb'
3971-
3972- if DEBUG: DEBUG.info('opening local file "%s" with mode %s' % \
3973- (self.filename, mode))
3974- try:
3975- self.fo = open(self.filename, mode)
3976- except IOError, e:
3977- err = URLGrabError(16, _(\
3978- 'error opening local file from %s, IOError: %s') % (self.url, e))
3979- err.url = self.url
3980- raise err
3981-
3982- else:
3983- self._prog_reportname = 'MEMORY'
3984- self._prog_basename = 'MEMORY'
3985-
3986-
3987- self.fo = StringIO()
3988- # if this is to be a tempfile instead....
3989- # it just makes crap in the tempdir
3990- #fh, self._temp_name = mkstemp()
3991- #self.fo = open(self._temp_name, 'wb')
3992-
3993-
3994- self._do_perform()
3995-
3996-
3997-
3998- if _was_filename:
3999- # close it up
4000- self.fo.flush()
4001- self.fo.close()
4002- # set the time
4003- mod_time = self.curl_obj.getinfo(pycurl.INFO_FILETIME)
4004- if mod_time != -1:
4005- try:
4006- os.utime(self.filename, (mod_time, mod_time))
4007- except OSError, e:
4008- err = URLGrabError(16, _(\
4009- 'error setting timestamp on file %s from %s, OSError: %s')
4010- % (self.filenameself.url, e))
4011- err.url = self.url
4012- raise err
4013- # re open it
4014- try:
4015- self.fo = open(self.filename, 'r')
4016- except IOError, e:
4017- err = URLGrabError(16, _(\
4018- 'error opening file from %s, IOError: %s') % (self.url, e))
4019- err.url = self.url
4020- raise err
4021-
4022- else:
4023- #self.fo = open(self._temp_name, 'r')
4024- self.fo.seek(0)
4025-
4026- self._complete = True
4027-
4028- def _fill_buffer(self, amt=None):
4029- """fill the buffer to contain at least 'amt' bytes by reading
4030- from the underlying file object. If amt is None, then it will
4031- read until it gets nothing more. It updates the progress meter
4032- and throttles after every self._rbufsize bytes."""
4033- # the _rbuf test is only in this first 'if' for speed. It's not
4034- # logically necessary
4035- if self._rbuf and not amt is None:
4036- L = len(self._rbuf)
4037- if amt > L:
4038- amt = amt - L
4039- else:
4040- return
4041-
4042- # if we've made it here, then we don't have enough in the buffer
4043- # and we need to read more.
4044-
4045- if not self._complete: self._do_grab() #XXX cheater - change on ranges
4046-
4047- buf = [self._rbuf]
4048- bufsize = len(self._rbuf)
4049- while amt is None or amt:
4050- # first, delay if necessary for throttling reasons
4051- if self.opts.raw_throttle():
4052- diff = self._tsize/self.opts.raw_throttle() - \
4053- (time.time() - self._ttime)
4054- if diff > 0: time.sleep(diff)
4055- self._ttime = time.time()
4056-
4057- # now read some data, up to self._rbufsize
4058- if amt is None: readamount = self._rbufsize
4059- else: readamount = min(amt, self._rbufsize)
4060- try:
4061- new = self.fo.read(readamount)
4062- except socket.error, e:
4063- err = URLGrabError(4, _('Socket Error on %s: %s') % (self.url, e))
4064- err.url = self.url
4065- raise err
4066-
4067- except socket.timeout, e:
4068- raise URLGrabError(12, _('Timeout on %s: %s') % (self.url, e))
4069- err.url = self.url
4070- raise err
4071-
4072- except IOError, e:
4073- raise URLGrabError(4, _('IOError on %s: %s') %(self.url, e))
4074- err.url = self.url
4075- raise err
4076-
4077- newsize = len(new)
4078- if not newsize: break # no more to read
4079-
4080- if amt: amt = amt - newsize
4081- buf.append(new)
4082- bufsize = bufsize + newsize
4083- self._tsize = newsize
4084- self._amount_read = self._amount_read + newsize
4085- #if self.opts.progress_obj:
4086- # self.opts.progress_obj.update(self._amount_read)
4087-
4088- self._rbuf = string.join(buf, '')
4089- return
4090-
4091- def _progress_update(self, download_total, downloaded, upload_total, uploaded):
4092- if self._over_max_size(cur=self._amount_read-self._reget_length):
4093- return -1
4094-
4095- try:
4096- if self._prog_running:
4097- downloaded += self._reget_length
4098- self.opts.progress_obj.update(downloaded)
4099- except KeyboardInterrupt:
4100- return -1
4101-
4102- def _over_max_size(self, cur, max_size=None):
4103-
4104- if not max_size:
4105- if not self.opts.size:
4106- max_size = self.size
4107- else:
4108- max_size = self.opts.size
4109-
4110- if not max_size: return False # if we have None for all of the Max then this is dumb
4111-
4112- if cur > int(float(max_size) * 1.10):
4113-
4114- msg = _("Downloaded more than max size for %s: %s > %s") \
4115- % (self.url, cur, max_size)
4116- self._error = (pycurl.E_FILESIZE_EXCEEDED, msg)
4117- return True
4118- return False
4119-
4120- def _to_utf8(self, obj, errors='replace'):
4121- '''convert 'unicode' to an encoded utf-8 byte string '''
4122- # stolen from yum.i18n
4123- if isinstance(obj, unicode):
4124- obj = obj.encode('utf-8', errors)
4125- return obj
4126-
4127- def read(self, amt=None):
4128- self._fill_buffer(amt)
4129- if amt is None:
4130- s, self._rbuf = self._rbuf, ''
4131- else:
4132- s, self._rbuf = self._rbuf[:amt], self._rbuf[amt:]
4133- return s
4134-
4135- def readline(self, limit=-1):
4136- if not self._complete: self._do_grab()
4137- return self.fo.readline()
4138-
4139- i = string.find(self._rbuf, '\n')
4140- while i < 0 and not (0 < limit <= len(self._rbuf)):
4141- L = len(self._rbuf)
4142- self._fill_buffer(L + self._rbufsize)
4143- if not len(self._rbuf) > L: break
4144- i = string.find(self._rbuf, '\n', L)
4145-
4146- if i < 0: i = len(self._rbuf)
4147- else: i = i+1
4148- if 0 <= limit < len(self._rbuf): i = limit
4149-
4150- s, self._rbuf = self._rbuf[:i], self._rbuf[i:]
4151- return s
4152-
4153- def close(self):
4154- if self._prog_running:
4155- self.opts.progress_obj.end(self._amount_read)
4156- self.fo.close()
4157-
4158- def geturl(self):
4159- """ Provide the geturl() method, used to be got from
4160- urllib.addinfourl, via. urllib.URLopener.* """
4161- return self.url
4162-
4163-_curl_cache = pycurl.Curl() # make one and reuse it over and over and over
4164-
4165-def reset_curl_obj():
4166- """To make sure curl has reread the network/dns info we force a reload"""
4167- global _curl_cache
4168- _curl_cache.close()
4169- _curl_cache = pycurl.Curl()
4170-
4171-
4172-
4173-
4174-#####################################################################
4175-# DEPRECATED FUNCTIONS
4176-def set_throttle(new_throttle):
4177- """Deprecated. Use: default_grabber.throttle = new_throttle"""
4178- default_grabber.throttle = new_throttle
4179-
4180-def set_bandwidth(new_bandwidth):
4181- """Deprecated. Use: default_grabber.bandwidth = new_bandwidth"""
4182- default_grabber.bandwidth = new_bandwidth
4183-
4184-def set_progress_obj(new_progress_obj):
4185- """Deprecated. Use: default_grabber.progress_obj = new_progress_obj"""
4186- default_grabber.progress_obj = new_progress_obj
4187-
4188-def set_user_agent(new_user_agent):
4189- """Deprecated. Use: default_grabber.user_agent = new_user_agent"""
4190- default_grabber.user_agent = new_user_agent
4191-
4192-def retrygrab(url, filename=None, copy_local=0, close_connection=0,
4193- progress_obj=None, throttle=None, bandwidth=None,
4194- numtries=3, retrycodes=[-1,2,4,5,6,7], checkfunc=None):
4195- """Deprecated. Use: urlgrab() with the retry arg instead"""
4196- kwargs = {'copy_local' : copy_local,
4197- 'close_connection' : close_connection,
4198- 'progress_obj' : progress_obj,
4199- 'throttle' : throttle,
4200- 'bandwidth' : bandwidth,
4201- 'retry' : numtries,
4202- 'retrycodes' : retrycodes,
4203- 'checkfunc' : checkfunc
4204- }
4205- return urlgrab(url, filename, **kwargs)
4206-
4207-
4208-#####################################################################
4209-# TESTING
4210-def _main_test():
4211- try: url, filename = sys.argv[1:3]
4212- except ValueError:
4213- print 'usage:', sys.argv[0], \
4214- '<url> <filename> [copy_local=0|1] [close_connection=0|1]'
4215- sys.exit()
4216-
4217- kwargs = {}
4218- for a in sys.argv[3:]:
4219- k, v = string.split(a, '=', 1)
4220- kwargs[k] = int(v)
4221-
4222- set_throttle(1.0)
4223- set_bandwidth(32 * 1024)
4224- print "throttle: %s, throttle bandwidth: %s B/s" % (default_grabber.throttle,
4225- default_grabber.bandwidth)
4226-
4227- try: from progress import text_progress_meter
4228- except ImportError, e: pass
4229- else: kwargs['progress_obj'] = text_progress_meter()
4230-
4231- try: name = apply(urlgrab, (url, filename), kwargs)
4232- except URLGrabError, e: print e
4233- else: print 'LOCAL FILE:', name
4234-
4235-
4236-def _retry_test():
4237- try: url, filename = sys.argv[1:3]
4238- except ValueError:
4239- print 'usage:', sys.argv[0], \
4240- '<url> <filename> [copy_local=0|1] [close_connection=0|1]'
4241- sys.exit()
4242-
4243- kwargs = {}
4244- for a in sys.argv[3:]:
4245- k, v = string.split(a, '=', 1)
4246- kwargs[k] = int(v)
4247-
4248- try: from progress import text_progress_meter
4249- except ImportError, e: pass
4250- else: kwargs['progress_obj'] = text_progress_meter()
4251-
4252- def cfunc(filename, hello, there='foo'):
4253- print hello, there
4254- import random
4255- rnum = random.random()
4256- if rnum < .5:
4257- print 'forcing retry'
4258- raise URLGrabError(-1, 'forcing retry')
4259- if rnum < .75:
4260- print 'forcing failure'
4261- raise URLGrabError(-2, 'forcing immediate failure')
4262- print 'success'
4263- return
4264-
4265- kwargs['checkfunc'] = (cfunc, ('hello',), {'there':'there'})
4266- try: name = apply(retrygrab, (url, filename), kwargs)
4267- except URLGrabError, e: print e
4268- else: print 'LOCAL FILE:', name
4269-
4270-def _file_object_test(filename=None):
4271- import cStringIO
4272- if filename is None:
4273- filename = __file__
4274- print 'using file "%s" for comparisons' % filename
4275- fo = open(filename)
4276- s_input = fo.read()
4277- fo.close()
4278-
4279- for testfunc in [_test_file_object_smallread,
4280- _test_file_object_readall,
4281- _test_file_object_readline,
4282- _test_file_object_readlines]:
4283- fo_input = cStringIO.StringIO(s_input)
4284- fo_output = cStringIO.StringIO()
4285- wrapper = PyCurlFileObject(fo_input, None, 0)
4286- print 'testing %-30s ' % testfunc.__name__,
4287- testfunc(wrapper, fo_output)
4288- s_output = fo_output.getvalue()
4289- if s_output == s_input: print 'passed'
4290- else: print 'FAILED'
4291-
4292-def _test_file_object_smallread(wrapper, fo_output):
4293- while 1:
4294- s = wrapper.read(23)
4295- fo_output.write(s)
4296- if not s: return
4297-
4298-def _test_file_object_readall(wrapper, fo_output):
4299- s = wrapper.read()
4300- fo_output.write(s)
4301-
4302-def _test_file_object_readline(wrapper, fo_output):
4303- while 1:
4304- s = wrapper.readline()
4305- fo_output.write(s)
4306- if not s: return
4307-
4308-def _test_file_object_readlines(wrapper, fo_output):
4309- li = wrapper.readlines()
4310- fo_output.write(string.join(li, ''))
4311-
4312-if __name__ == '__main__':
4313- _main_test()
4314- _retry_test()
4315- _file_object_test('test')
4316
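Review note on the removed grabber.py copy above: `urlgrab`, `urlread` and `_retry_test` all pass `checkfunc` either as a bare callable or as a `(func, args, kwargs)` tuple, and `_make_callback` normalises the two forms before the Python 2 `apply(cb_func, (obj, )+cb_args, cb_kwargs)` call. The following is a minimal sketch of that convention in Python 3 syntax, not the upstream code: `run_checkfunc` and `check` are hypothetical names, and `CallbackObject` is a simplified stand-in for the class of the same name in grabber.py.

```python
class CallbackObject:
    """Simplified stand-in: an attribute bag handed to the checkfunc."""

    def __init__(self, **attrs):
        self.__dict__.update(attrs)


def make_callback(callback_obj):
    """Normalise a checkfunc into a (func, args, kwargs) triple."""
    if callable(callback_obj):
        # A bare callable gets no extra positional or keyword arguments.
        return callback_obj, (), {}
    # Otherwise the caller is assumed to have supplied the triple already.
    return callback_obj


def run_checkfunc(checkfunc, filename, url):
    """Hypothetical helper: build the callback object and invoke the check."""
    func, args, kwargs = make_callback(checkfunc)
    obj = CallbackObject(filename=filename, url=url)
    # Python 3 spelling of: apply(func, (obj, ) + args, kwargs)
    return func(obj, *args, **kwargs)


def check(obj, greeting, there="foo"):
    """Hypothetical checkfunc taking extra arguments, as in _retry_test."""
    return "%s %s %s" % (greeting, there, obj.filename)


URL = "http://example.com/pkg.rpm"

# Form 1: a bare callable.
plain = run_checkfunc(lambda obj: obj.url, "pkg.rpm", URL)

# Form 2: a (func, args, kwargs) tuple.
with_args = run_checkfunc((check, ("hello",), {"there": "world"}), "pkg.rpm", URL)
```

In both forms the callback object is always the first positional argument, so a checkfunc can inspect `obj.filename` and `obj.url` (or `obj.data` for `urlread`) regardless of how it was registered.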
4317=== modified file 'ChangeLog'
4318--- ChangeLog 2010-06-21 20:36:19 +0000
4319+++ ChangeLog 2014-12-13 22:24:13 +0000
4320@@ -1,3 +1,11 @@
4321+2013-10-09 Zdenek Pavlas <zpavlas@redhat.com>
4322+
4323+ * lots of enahncements and bugfixes
4324+ (parallel downloading, mirror profiling, new options)
4325+ * updated authors, url
4326+ * updated unit tests
4327+ * bump version to 3.10
4328+
4329 2009-09-25 Seth Vidal <skvidal@fedoraproject.org>
4330
4331 * urlgrabber/__init__.py: bump version to 3.9.1
4332
4333=== modified file 'MANIFEST'
4334--- MANIFEST 2010-06-21 20:36:19 +0000
4335+++ MANIFEST 2014-12-13 22:24:13 +0000
4336@@ -1,3 +1,4 @@
4337+# file GENERATED by distutils, do NOT edit
4338 ChangeLog
4339 LICENSE
4340 MANIFEST
4341@@ -6,6 +7,7 @@
4342 makefile
4343 setup.py
4344 scripts/urlgrabber
4345+scripts/urlgrabber-ext-down
4346 test/base_test_code.py
4347 test/grabberperf.py
4348 test/munittest.py
4349
4350=== modified file 'PKG-INFO'
4351--- PKG-INFO 2010-06-21 20:36:19 +0000
4352+++ PKG-INFO 2014-12-13 22:24:13 +0000
4353@@ -1,37 +1,37 @@
4354-Metadata-Version: 1.0
4355+Metadata-Version: 1.1
4356 Name: urlgrabber
4357-Version: 3.9.1
4358+Version: 3.10.1
4359 Summary: A high-level cross-protocol url-grabber
4360-Home-page: http://linux.duke.edu/projects/urlgrabber/
4361+Home-page: http://urlgrabber.baseurl.org/
4362 Author: Michael D. Stenner, Ryan Tomayko
4363-Author-email: mstenner@linux.duke.edu, skvidal@fedoraproject.org
4364+Author-email: mstenner@linux.duke.edu, zpavlas@redhat.com
4365 License: LGPL
4366 Description: A high-level cross-protocol url-grabber.
4367
4368 Using urlgrabber, data can be fetched in three basic ways:
4369
4370- urlgrab(url) copy the file to the local filesystem
4371- urlopen(url) open the remote file and return a file object
4372- (like urllib2.urlopen)
4373- urlread(url) return the contents of the file as a string
4374+ urlgrab(url) copy the file to the local filesystem
4375+ urlopen(url) open the remote file and return a file object
4376+ (like urllib2.urlopen)
4377+ urlread(url) return the contents of the file as a string
4378
4379 When using these functions (or methods), urlgrabber supports the
4380 following features:
4381
4382- * identical behavior for http://, ftp://, and file:// urls
4383- * http keepalive - faster downloads of many files by using
4384- only a single connection
4385- * byte ranges - fetch only a portion of the file
4386- * reget - for a urlgrab, resume a partial download
4387- * progress meters - the ability to report download progress
4388- automatically, even when using urlopen!
4389- * throttling - restrict bandwidth usage
4390- * retries - automatically retry a download if it fails. The
4391- number of retries and failure types are configurable.
4392- * authenticated server access for http and ftp
4393- * proxy support - support for authenticated http and ftp proxies
4394- * mirror groups - treat a list of mirrors as a single source,
4395- automatically switching mirrors if there is a failure.
4396+ * identical behavior for http://, ftp://, and file:// urls
4397+ * http keepalive - faster downloads of many files by using
4398+ only a single connection
4399+ * byte ranges - fetch only a portion of the file
4400+ * reget - for a urlgrab, resume a partial download
4401+ * progress meters - the ability to report download progress
4402+ automatically, even when using urlopen!
4403+ * throttling - restrict bandwidth usage
4404+ * retries - automatically retry a download if it fails. The
4405+ number of retries and failure types are configurable.
4406+ * authenticated server access for http and ftp
4407+ * proxy support - support for authenticated http and ftp proxies
4408+ * mirror groups - treat a list of mirrors as a single source,
4409+ automatically switching mirrors if there is a failure.
4410
4411 Platform: UNKNOWN
4412 Classifier: Development Status :: 4 - Beta
4413
4414=== modified file 'README'
4415--- README 2005-10-23 12:29:28 +0000
4416+++ README 2014-12-13 22:24:13 +0000
4417@@ -19,7 +19,7 @@
4418 python setup.py bdist_rpm
4419
4420 The rpms (both source and "binary") will be specific to the current
4421-distrubution/version and may not be portable to others. This is
4422+distribution/version and may not be portable to others. This is
4423 because they will be built for the currently installed python.
4424
4425 keepalive.py and byterange.py are generic urllib2 extension modules and
4426
4427=== modified file 'debian/changelog'
4428--- debian/changelog 2014-02-23 13:54:39 +0000
4429+++ debian/changelog 2014-12-13 22:24:13 +0000
4430@@ -1,3 +1,10 @@
4431+urlgrabber (3.10.1-0ubuntu1) vivid; urgency=medium
4432+
4433+ * New upstream release.
4434+ * Drop all patches, fixed upstream
4435+
4436+ -- Jackson Doak <noskcaj@ubuntu.com> Sun, 14 Dec 2014 09:12:57 +1100
4437+
4438 urlgrabber (3.9.1-4ubuntu3) trusty; urgency=medium
4439
4440 * Rebuild to drop files installed into /usr/share/pyshared.
4441
4442=== removed file 'debian/patches/grabber_fix.diff'
4443--- debian/patches/grabber_fix.diff 2010-07-08 17:40:08 +0000
4444+++ debian/patches/grabber_fix.diff 1970-01-01 00:00:00 +0000
4445@@ -1,236 +0,0 @@
4446---- urlgrabber-3.9.1/urlgrabber/grabber.py.orig 2010-07-02 21:24:12.000000000 -0400
4447-+++ urlgrabber-3.9.1/urlgrabber/grabber.py 2010-07-02 20:30:25.000000000 -0400
4448-@@ -68,14 +68,14 @@
4449- (which can be set on default_grabber.throttle) is used. See
4450- BANDWIDTH THROTTLING for more information.
4451-
4452-- timeout = None
4453-+ timeout = 300
4454-
4455-- a positive float expressing the number of seconds to wait for socket
4456-- operations. If the value is None or 0.0, socket operations will block
4457-- forever. Setting this option causes urlgrabber to call the settimeout
4458-- method on the Socket object used for the request. See the Python
4459-- documentation on settimeout for more information.
4460-- http://www.python.org/doc/current/lib/socket-objects.html
4461-+ a positive integer expressing the number of seconds to wait before
4462-+ timing out attempts to connect to a server. If the value is None
4463-+ or 0, connection attempts will not time out. The timeout is passed
4464-+ to the underlying pycurl object as its CONNECTTIMEOUT option, see
4465-+ the curl documentation on CURLOPT_CONNECTTIMEOUT for more information.
4466-+ http://curl.haxx.se/libcurl/c/curl_easy_setopt.html#CURLOPTCONNECTTIMEOUT
4467-
4468- bandwidth = 0
4469-
4470-@@ -439,6 +439,12 @@
4471- except:
4472- __version__ = '???'
4473-
4474-+try:
4475-+ # this part isn't going to do much - need to talk to gettext
4476-+ from i18n import _
4477-+except ImportError, msg:
4478-+ def _(st): return st
4479-+
4480- ########################################################################
4481- # functions for debugging output. These functions are here because they
4482- # are also part of the module initialization.
4483-@@ -808,7 +814,7 @@
4484- self.prefix = None
4485- self.opener = None
4486- self.cache_openers = True
4487-- self.timeout = None
4488-+ self.timeout = 300
4489- self.text = None
4490- self.http_headers = None
4491- self.ftp_headers = None
4492-@@ -1052,9 +1058,15 @@
4493- self._reget_length = 0
4494- self._prog_running = False
4495- self._error = (None, None)
4496-- self.size = None
4497-+ self.size = 0
4498-+ self._hdr_ended = False
4499- self._do_open()
4500-
4501-+
4502-+ def geturl(self):
4503-+ """ Provide the geturl() method, used to be got from
4504-+ urllib.addinfourl, via. urllib.URLopener.* """
4505-+ return self.url
4506-
4507- def __getattr__(self, name):
4508- """This effectively allows us to wrap at the instance level.
4509-@@ -1085,9 +1097,14 @@
4510- return -1
4511-
4512- def _hdr_retrieve(self, buf):
4513-+ if self._hdr_ended:
4514-+ self._hdr_dump = ''
4515-+ self.size = 0
4516-+ self._hdr_ended = False
4517-+
4518- if self._over_max_size(cur=len(self._hdr_dump),
4519- max_size=self.opts.max_header_size):
4520-- return -1
4521-+ return -1
4522- try:
4523- self._hdr_dump += buf
4524- # we have to get the size before we do the progress obj start
4525-@@ -1104,7 +1121,17 @@
4526- s = parse150(buf)
4527- if s:
4528- self.size = int(s)
4529--
4530-+
4531-+ if buf.lower().find('location') != -1:
4532-+ location = ':'.join(buf.split(':')[1:])
4533-+ location = location.strip()
4534-+ self.scheme = urlparse.urlsplit(location)[0]
4535-+ self.url = location
4536-+
4537-+ if len(self._hdr_dump) != 0 and buf == '\r\n':
4538-+ self._hdr_ended = True
4539-+ if DEBUG: DEBUG.info('header ended:')
4540-+
4541- return len(buf)
4542- except KeyboardInterrupt:
4543- return pycurl.READFUNC_ABORT
4544-@@ -1113,8 +1140,10 @@
4545- if self._parsed_hdr:
4546- return self._parsed_hdr
4547- statusend = self._hdr_dump.find('\n')
4548-+ statusend += 1 # ridiculous as it may seem.
4549- hdrfp = StringIO()
4550- hdrfp.write(self._hdr_dump[statusend:])
4551-+ hdrfp.seek(0)
4552- self._parsed_hdr = mimetools.Message(hdrfp)
4553- return self._parsed_hdr
4554-
4555-@@ -1136,6 +1165,7 @@
4556- self.curl_obj.setopt(pycurl.PROGRESSFUNCTION, self._progress_update)
4557- self.curl_obj.setopt(pycurl.FAILONERROR, True)
4558- self.curl_obj.setopt(pycurl.OPT_FILETIME, True)
4559-+ self.curl_obj.setopt(pycurl.FOLLOWLOCATION, True)
4560-
4561- if DEBUG:
4562- self.curl_obj.setopt(pycurl.VERBOSE, True)
4563-@@ -1148,9 +1178,11 @@
4564-
4565- # timeouts
4566- timeout = 300
4567-- if opts.timeout:
4568-- timeout = int(opts.timeout)
4569-- self.curl_obj.setopt(pycurl.CONNECTTIMEOUT, timeout)
4570-+ if hasattr(opts, 'timeout'):
4571-+ timeout = int(opts.timeout or 0)
4572-+ self.curl_obj.setopt(pycurl.CONNECTTIMEOUT, timeout)
4573-+ self.curl_obj.setopt(pycurl.LOW_SPEED_LIMIT, 1)
4574-+ self.curl_obj.setopt(pycurl.LOW_SPEED_TIME, timeout)
4575-
4576- # ssl options
4577- if self.scheme == 'https':
4578-@@ -1276,7 +1308,7 @@
4579- raise err
4580-
4581- elif errcode == 60:
4582-- msg = _("client cert cannot be verified or client cert incorrect")
4583-+ msg = _("Peer cert cannot be verified or peer cert invalid")
4584- err = URLGrabError(14, msg)
4585- err.url = self.url
4586- raise err
4587-@@ -1291,7 +1323,12 @@
4588- raise err
4589-
4590- elif str(e.args[1]) == '' and self.http_code != 0: # fake it until you make it
4591-- msg = 'HTTP Error %s : %s ' % (self.http_code, self.url)
4592-+ if self.scheme in ['http', 'https']:
4593-+ msg = 'HTTP Error %s : %s ' % (self.http_code, self.url)
4594-+ elif self.scheme in ['ftp']:
4595-+ msg = 'FTP Error %s : %s ' % (self.http_code, self.url)
4596-+ else:
4597-+ msg = "Unknown Error: URL=%s , scheme=%s" % (self.url, self.scheme)
4598- else:
4599- msg = 'PYCURL ERROR %s - "%s"' % (errcode, str(e.args[1]))
4600- code = errcode
4601-@@ -1299,6 +1336,12 @@
4602- err.code = code
4603- err.exception = e
4604- raise err
4605-+ else:
4606-+ if self._error[1]:
4607-+ msg = self._error[1]
4608-+ err = URLGRabError(14, msg)
4609-+ err.url = self.url
4610-+ raise err
4611-
4612- def _do_open(self):
4613- self.curl_obj = _curl_cache
4614-@@ -1446,9 +1489,23 @@
4615- # set the time
4616- mod_time = self.curl_obj.getinfo(pycurl.INFO_FILETIME)
4617- if mod_time != -1:
4618-- os.utime(self.filename, (mod_time, mod_time))
4619-+ try:
4620-+ os.utime(self.filename, (mod_time, mod_time))
4621-+ except OSError, e:
4622-+ err = URLGrabError(16, _(\
4623-+ 'error setting timestamp on file %s from %s, OSError: %s')
4624-+ % (self.filenameself.url, e))
4625-+ err.url = self.url
4626-+ raise err
4627- # re open it
4628-- self.fo = open(self.filename, 'r')
4629-+ try:
4630-+ self.fo = open(self.filename, 'r')
4631-+ except IOError, e:
4632-+ err = URLGrabError(16, _(\
4633-+ 'error opening file from %s, IOError: %s') % (self.url, e))
4634-+ err.url = self.url
4635-+ raise err
4636-+
4637- else:
4638- #self.fo = open(self._temp_name, 'r')
4639- self.fo.seek(0)
4640-@@ -1532,11 +1589,14 @@
4641- def _over_max_size(self, cur, max_size=None):
4642-
4643- if not max_size:
4644-- max_size = self.size
4645-- if self.opts.size: # if we set an opts size use that, no matter what
4646-- max_size = self.opts.size
4647-+ if not self.opts.size:
4648-+ max_size = self.size
4649-+ else:
4650-+ max_size = self.opts.size
4651-+
4652- if not max_size: return False # if we have None for all of the Max then this is dumb
4653-- if cur > max_size + max_size*.10:
4654-+
4655-+ if cur > int(float(max_size) * 1.10):
4656-
4657- msg = _("Downloaded more than max size for %s: %s > %s") \
4658- % (self.url, cur, max_size)
4659-@@ -1582,9 +1642,21 @@
4660- self.opts.progress_obj.end(self._amount_read)
4661- self.fo.close()
4662-
4663--
4664-+ def geturl(self):
4665-+ """ Provide the geturl() method, used to be got from
4666-+ urllib.addinfourl, via. urllib.URLopener.* """
4667-+ return self.url
4668-+
4669- _curl_cache = pycurl.Curl() # make one and reuse it over and over and over
4670-
4671-+def reset_curl_obj():
4672-+ """To make sure curl has reread the network/dns info we force a reload"""
4673-+ global _curl_cache
4674-+ _curl_cache.close()
4675-+ _curl_cache = pycurl.Curl()
4676-+
4677-+
4678-+
4679-
4680- #####################################################################
4681- # DEPRECATED FUNCTIONS
4682
4683=== removed file 'debian/patches/progress_fix.diff'
4684--- debian/patches/progress_fix.diff 2010-07-08 17:40:08 +0000
4685+++ debian/patches/progress_fix.diff 1970-01-01 00:00:00 +0000
4686@@ -1,11 +0,0 @@
4687---- urlgrabber-3.9.1/urlgrabber/progress.py.orig 2010-07-02 21:25:51.000000000 -0400
4688-+++ urlgrabber-3.9.1/urlgrabber/progress.py 2010-07-02 20:30:25.000000000 -0400
4689-@@ -658,6 +658,8 @@
4690- if seconds is None or seconds < 0:
4691- if use_hours: return '--:--:--'
4692- else: return '--:--'
4693-+ elif seconds == float('inf'):
4694-+ return 'Infinite'
4695- else:
4696- seconds = int(seconds)
4697- minutes = seconds / 60
4698
4699=== removed file 'debian/patches/progress_object_callback_fix.diff'
4700--- debian/patches/progress_object_callback_fix.diff 2011-08-09 17:45:08 +0000
4701+++ debian/patches/progress_object_callback_fix.diff 1970-01-01 00:00:00 +0000
4702@@ -1,21 +0,0 @@
4703-From: James Antill <james@and.org>
4704-Date: Thu, 19 May 2011 20:17:14 +0000 (-0400)
4705-Subject: Fix documentation for progress_object callback.
4706-X-Git-Url: http://yum.baseurl.org/gitweb?p=urlgrabber.git;a=commitdiff_plain;h=674d545ee303aa99701ffb982536851572d8db77
4707-
4708-Fix documentation for progress_object callback.
4709----
4710-
4711-diff --git a/urlgrabber/grabber.py b/urlgrabber/grabber.py
4712-index 36212cf..f6f57bd 100644
4713---- a/urlgrabber/grabber.py
4714-+++ b/urlgrabber/grabber.py
4715-@@ -49,7 +49,7 @@ GENERAL ARGUMENTS (kwargs)
4716- progress_obj = None
4717-
4718- a class instance that supports the following methods:
4719-- po.start(filename, url, basename, length, text)
4720-+ po.start(filename, url, basename, size, now, text)
4721- # length will be None if unknown
4722- po.update(read) # read == bytes read so far
4723- po.end()
4724
4725=== modified file 'debian/patches/series'
4726--- debian/patches/series 2011-08-09 17:45:08 +0000
4727+++ debian/patches/series 2014-12-13 22:24:13 +0000
4728@@ -1,3 +0,0 @@
4729-grabber_fix.diff
4730-progress_fix.diff
4731-progress_object_callback_fix.diff
4732
4733=== modified file 'scripts/urlgrabber'
4734--- scripts/urlgrabber 2010-06-21 20:36:19 +0000
4735+++ scripts/urlgrabber 2014-12-13 22:24:13 +0000
4736@@ -115,6 +115,7 @@
4737 including quotes in the case of strings.
4738 e.g. --user_agent='"foobar/2.0"'
4739
4740+ --output FILE
4741 -o FILE write output to FILE, otherwise the basename of the
4742 url will be used
4743 -O print the names of saved files to STDOUT
4744@@ -170,12 +171,17 @@
4745 return ug_options, ug_defaults
4746
4747 def process_command_line(self):
4748- short_options = 'vd:hoOpD'
4749+ short_options = 'vd:ho:OpD'
4750 long_options = ['profile', 'repeat=', 'verbose=',
4751- 'debug=', 'help', 'progress']
4752+ 'debug=', 'help', 'progress', 'output=']
4753 ug_long = [ o + '=' for o in self.ug_options ]
4754- optlist, args = getopt.getopt(sys.argv[1:], short_options,
4755- long_options + ug_long)
4756+ try:
4757+ optlist, args = getopt.getopt(sys.argv[1:], short_options,
4758+ long_options + ug_long)
4759+ except getopt.GetoptError, e:
4760+ print >>sys.stderr, "Error:", e
4761+ self.help([], ret=1)
4762+
4763 self.verbose = 0
4764 self.debug = None
4765 self.outputfile = None
4766@@ -193,6 +199,7 @@
4767 if o == '--verbose': self.verbose = v
4768 if o == '-v': self.verbose += 1
4769 if o == '-o': self.outputfile = v
4770+ if o == '--output': self.outputfile = v
4771 if o == '-p' or o == '--progress': self.progress = 1
4772 if o == '-d' or o == '--debug': self.debug = v
4773 if o == '--profile': self.profile = 1
4774@@ -222,7 +229,7 @@
4775 print "ERROR: cannot use -o when grabbing multiple files"
4776 sys.exit(1)
4777
4778- def help(self, args):
4779+ def help(self, args, ret=0):
4780 if not args:
4781 print MAINHELP
4782 else:
4783@@ -234,7 +241,7 @@
4784 self.help_ug_option(a)
4785 else:
4786 print 'ERROR: no help on command "%s"' % a
4787- sys.exit(0)
4788+ sys.exit(ret)
4789
4790 def help_doc(self):
4791 print __doc__
4792@@ -294,6 +301,7 @@
4793 if self.op.localfile: print f
4794 except URLGrabError, e:
4795 print e
4796+ sys.exit(1)
4797
4798 def set_debug_logger(self, dbspec):
4799 try:
4800
4801=== added file 'scripts/urlgrabber-ext-down'
4802--- scripts/urlgrabber-ext-down 1970-01-01 00:00:00 +0000
4803+++ scripts/urlgrabber-ext-down 2014-12-13 22:24:13 +0000
4804@@ -0,0 +1,75 @@
4805+#! /usr/bin/python
4806+# A very simple external downloader
4807+# Copyright 2011-2012 Zdenek Pavlas
4808+
4809+# This library is free software; you can redistribute it and/or
4810+# modify it under the terms of the GNU Lesser General Public
4811+# License as published by the Free Software Foundation; either
4812+# version 2.1 of the License, or (at your option) any later version.
4813+#
4814+# This library is distributed in the hope that it will be useful,
4815+# but WITHOUT ANY WARRANTY; without even the implied warranty of
4816+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
4817+# Lesser General Public License for more details.
4818+#
4819+# You should have received a copy of the GNU Lesser General Public
4820+# License along with this library; if not, write to the
4821+# Free Software Foundation, Inc.,
4822+# 59 Temple Place, Suite 330,
4823+# Boston, MA 02111-1307 USA
4824+
4825+import time, os, errno, sys
4826+from urlgrabber.grabber import \
4827+ _readlines, URLGrabberOptions, _loads, \
4828+ PyCurlFileObject, URLGrabError
4829+
4830+def write(fmt, *arg):
4831+ try: os.write(1, fmt % arg)
4832+ except OSError, e:
4833+ if e.args[0] != errno.EPIPE: raise
4834+ sys.exit(1)
4835+
4836+class ProxyProgress:
4837+ def start(self, *d1, **d2):
4838+ self.next_update = 0
4839+ def update(self, _amount_read):
4840+ t = time.time()
4841+ if t < self.next_update: return
4842+ self.next_update = t + 0.31
4843+ write('%d %d\n', self._id, _amount_read)
4844+
4845+def main():
4846+ import signal
4847+ signal.signal(signal.SIGINT, lambda n, f: sys.exit(1))
4848+ cnt = 0
4849+ while True:
4850+ lines = _readlines(0)
4851+ if not lines: break
4852+ for line in lines:
4853+ cnt += 1
4854+ opts = URLGrabberOptions()
4855+ opts._id = cnt
4856+ for k in line.split(' '):
4857+ k, v = k.split('=', 1)
4858+ setattr(opts, k, _loads(v))
4859+ if opts.progress_obj:
4860+ opts.progress_obj = ProxyProgress()
4861+ opts.progress_obj._id = cnt
4862+
4863+ dlsz = dltm = 0
4864+ try:
4865+ fo = PyCurlFileObject(opts.url, opts.filename, opts)
4866+ fo._do_grab()
4867+ fo.fo.close()
4868+ size = fo._amount_read
4869+ if fo._tm_last:
4870+ dlsz = fo._tm_last[0] - fo._tm_first[0]
4871+ dltm = fo._tm_last[1] - fo._tm_first[1]
4872+ ug_err = 'OK'
4873+ except URLGrabError, e:
4874+ size = 0
4875+ ug_err = '%d %d %s' % (e.errno, getattr(e, 'code', 0), e.strerror)
4876+ write('%d %d %d %.3f %s\n', opts._id, size, dlsz, dltm, ug_err)
4877+
4878+if __name__ == '__main__':
4879+ main()
4880
4881=== modified file 'setup.py'
4882--- setup.py 2005-10-23 12:29:28 +0000
4883+++ setup.py 2014-12-13 22:24:13 +0000
4884@@ -15,8 +15,10 @@
4885 packages = ['urlgrabber']
4886 package_dir = {'urlgrabber':'urlgrabber'}
4887 scripts = ['scripts/urlgrabber']
4888-data_files = [('share/doc/' + name + '-' + version,
4889- ['README','LICENSE', 'TODO', 'ChangeLog'])]
4890+data_files = [
4891+ ('share/doc/' + name + '-' + version, ['README','LICENSE', 'TODO', 'ChangeLog']),
4892+ ('libexec', ['scripts/urlgrabber-ext-down']),
4893+]
4894 options = { 'clean' : { 'all' : 1 } }
4895 classifiers = [
4896 'Development Status :: 4 - Beta',
4897
4898=== modified file 'test/base_test_code.py'
4899--- test/base_test_code.py 2005-10-23 12:29:28 +0000
4900+++ test/base_test_code.py 2014-12-13 22:24:13 +0000
4901@@ -1,6 +1,6 @@
4902 from munittest import *
4903
4904-base_http = 'http://www.linux.duke.edu/projects/urlgrabber/test/'
4905+base_http = 'http://urlgrabber.baseurl.org/test/'
4906 base_ftp = 'ftp://localhost/test/'
4907
4908 # set to a proftp server only. we're working around a couple of
4909
4910=== modified file 'test/munittest.py'
4911--- test/munittest.py 2005-10-23 12:29:28 +0000
4912+++ test/munittest.py 2014-12-13 22:24:13 +0000
4913@@ -113,7 +113,7 @@
4914 __all__ = ['TestResult', 'TestCase', 'TestSuite', 'TextTestRunner',
4915 'TestLoader', 'FunctionTestCase', 'main', 'defaultTestLoader']
4916
4917-# Expose obsolete functions for backwards compatability
4918+# Expose obsolete functions for backwards compatibility
4919 __all__.extend(['getTestCaseNames', 'makeSuite', 'findTestCases'])
4920
4921
4922@@ -410,7 +410,7 @@
4923 (default 7) and comparing to zero.
4924
4925 Note that decimal places (from zero) is usually not the same
4926- as significant digits (measured from the most signficant digit).
4927+ as significant digits (measured from the most significant digit).
4928 """
4929 if round(second-first, places) != 0:
4930 raise self.failureException, \
4931@@ -422,7 +422,7 @@
4932 (default 7) and comparing to zero.
4933
4934 Note that decimal places (from zero) is usually not the same
4935- as significant digits (measured from the most signficant digit).
4936+ as significant digits (measured from the most significant digit).
4937 """
4938 if round(second-first, places) == 0:
4939 raise self.failureException, \
4940
4941=== modified file 'test/test_byterange.py'
4942--- test/test_byterange.py 2005-10-23 12:29:28 +0000
4943+++ test/test_byterange.py 2014-12-13 22:24:13 +0000
4944@@ -25,7 +25,7 @@
4945
4946 import sys
4947
4948-from StringIO import StringIO
4949+from cStringIO import StringIO
4950 from urlgrabber.byterange import RangeableFileObject
4951
4952 from base_test_code import *
4953@@ -52,18 +52,6 @@
4954 self.rfo.seek(1,1)
4955 self.assertEquals('of', self.rfo.read(2))
4956
4957- def test_poor_mans_seek(self):
4958- """RangeableFileObject.seek() poor mans version..
4959-
4960- We just delete the seek method from StringIO so we can
4961- excercise RangeableFileObject when the file object supplied
4962- doesn't support seek.
4963- """
4964- seek = StringIO.seek
4965- del(StringIO.seek)
4966- self.test_seek()
4967- StringIO.seek = seek
4968-
4969 def test_read(self):
4970 """RangeableFileObject.read()"""
4971 self.assertEquals('the', self.rfo.read(3))
4972
4973=== modified file 'test/test_grabber.py'
4974--- test/test_grabber.py 2010-06-21 20:36:19 +0000
4975+++ test/test_grabber.py 2014-12-13 22:24:13 +0000
4976@@ -86,7 +86,7 @@
4977
4978 class HTTPTests(TestCase):
4979 def test_reference_file(self):
4980- "download refernce file via HTTP"
4981+ "download reference file via HTTP"
4982 filename = tempfile.mktemp()
4983 grabber.urlgrab(ref_http, filename)
4984
4985@@ -98,6 +98,7 @@
4986
4987 def test_post(self):
4988 "do an HTTP post"
4989+ self.skip() # disabled on server
4990 headers = (('Content-type', 'text/plain'),)
4991 ret = grabber.urlread(base_http + 'test_post.php',
4992 data=short_reference_data,
4993
4994=== modified file 'test/test_mirror.py'
4995--- test/test_mirror.py 2005-12-31 15:34:22 +0000
4996+++ test/test_mirror.py 2014-12-13 22:24:13 +0000
4997@@ -28,7 +28,7 @@
4998 import string, tempfile, random, cStringIO, os
4999
5000 import urlgrabber.grabber
The diff has been truncated for viewing.
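
The `short_options` change in scripts/urlgrabber (from 'vd:hoOpD' to 'vd:ho:OpD') is easier to judge with a small standalone run of `getopt`. This is only a sketch, not code from the branch: the argument list, the URL and the `parse_or_report` helper are made up for illustration, and it uses current Python syntax rather than the Python 2 syntax seen in the diff.

```python
import getopt
import sys

# Hypothetical command line: urlgrabber -o out.txt http://example.com/file
argv = ['-o', 'out.txt', 'http://example.com/file']
long_options = ['help', 'progress', 'output=']

# Old spec: 'o' has no trailing colon, so -o is parsed as a bare flag
# and the filename is left behind as a positional argument.
old_opts, old_args = getopt.getopt(argv, 'vd:hoOpD', long_options)

# New spec: 'o:' makes -o consume the next argument as its value.
new_opts, new_args = getopt.getopt(argv, 'vd:ho:OpD', long_options)

# The long form added in the same hunk carries its value the same way.
long_opts, long_args = getopt.getopt(
    ['--output=out.txt', 'http://example.com/file'],
    'vd:ho:OpD', long_options)


def parse_or_report(args):
    """Hypothetical helper: return (optlist, args), or None on a bad option."""
    try:
        return getopt.getopt(args, 'vd:ho:OpD', long_options)
    except getopt.GetoptError as e:
        # Mirrors the new try/except in the diff, minus the help text and exit.
        print("Error:", e, file=sys.stderr)
        return None
```

With the old option string the output filename silently becomes an extra URL argument, which would explain why upstream changed it; the new `GetoptError` handler then turns an unknown option into an error message instead of a traceback.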

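The dropped progress_fix.diff added an explicit branch for an infinite time estimate. A likely reason is that `int(float('inf'))` raises `OverflowError`, so an infinite value reaching the `int(seconds)` line would abort the formatter. The sketch below keeps the branch order shown in that hunk; the function name, the `use_hours` default and everything after `int(seconds)` are assumptions, since the diff context does not show the rest of the function.

```python
def format_time(seconds, use_hours=False):
    """Format a duration in seconds as MM:SS (or HH:MM:SS)."""
    if seconds is None or seconds < 0:
        # Unknown estimate: same placeholders as in the hunk.
        return '--:--:--' if use_hours else '--:--'
    elif seconds == float('inf'):
        # Branch added by progress_fix.diff; int(inf) would raise otherwise.
        return 'Infinite'
    # Assumed remainder of the function (not visible in the diff context).
    seconds = int(seconds)
    minutes, seconds = divmod(seconds, 60)
    if use_hours:
        hours, minutes = divmod(minutes, 60)
        return '%02i:%02i:%02i' % (hours, minutes, seconds)
    return '%02i:%02i' % (minutes, seconds)
```

Since the changelog entry says the patches are fixed upstream, a check like this against urlgrabber/progress.py in 3.10.1 would be one way to confirm that the infinite case is still handled after the patch is dropped.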