Comment 5 for bug 139855

Fabien Tassin (fta) wrote :

Here are some thoughts that I hope will help move this bug forward.

I assume that the raw data is available somewhere. No one has explained how the PPA
files are distributed to the world, but as each user's URL is unique, it seems
reasonable to assume that the data exists as httpd or web proxy logs somewhere.
Then it's a matter of post-processing those logs. I also assume we can ignore all
the direct downloads from the LP pages (librarian); focusing on what's available
through APT should be enough.
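
As a rough sketch of that post-processing step: the snippet below counts
successful .deb downloads per (package, version). It assumes the logs are in
Apache "combined" format and that debs live under a
pool/<component>/<prefix>/<source>/ path; neither is confirmed for the real PPA
servers, and the function names are made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: Apache "combined" log format; the real PPA logs may differ.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# ASSUMPTION: .../pool/<component>/<prefix>/<source>/<name>_<version>_<arch>.deb
DEB_RE = re.compile(
    r'/pool/[^/]+/[^/]+/[^/]+/'
    r'(?P<name>[^_/]+)_(?P<version>[^_/]+)_(?P<arch>[^_/]+)\.deb$'
)


def parse_deb_download(line):
    """Return (name, version, arch) for a successful .deb GET, else None."""
    m = LOG_RE.match(line)
    if not m or m["method"] != "GET" or m["status"] != "200":
        return None
    deb = DEB_RE.search(m["path"])
    if not deb:
        # Index files (Packages.gz, Release, ...) and anything else are ignored.
        return None
    return deb["name"], deb["version"], deb["arch"]


def count_downloads(lines):
    """Count downloads per (package name, version) over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        hit = parse_deb_download(line)
        if hit:
            counts[hit[:2]] += 1
    return counts
```

Feeding it an open log file (count_downloads(open(path))) would give raw
download counts only; as discussed below, that is not the same thing as a
number of installations.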

The next problem is to interpret those data.
The OP asked for some precise figures:

1/ Downloads stats for each package in the archive

what do we want to know?

Ideally, number of users for each version over time: if my assumption about the
logs is correct, they only show downloads, with no way to distinguish between
upgrades and new installs, so counting just the number of downloads will not
give an accurate representation of the number of installations. The information
has to come from the user's machine, identified by a unique ID (like with
popcon) - not the IP address - maybe transported in a (fake) HTTP referrer. It
will still not catch removals though.

Number of downloads over time: this seems possible, but tricky to represent as
there's an unknown (and increasing) number of versions.
http://popcon.debian.org/stat/release.png is a good example of why it is
tricky. For fast-moving PPAs, such as dailies, or trunk/tip builds, it's even
worse.
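
To make the "number of versions" problem concrete, here is a minimal sketch
that buckets downloads by day and version. It assumes (timestamp, version)
pairs have already been extracted from the logs, with timestamps in the Apache
log style; both function names are hypothetical. Limiting a graph to the
most-downloaded versions is just one possible way to keep it readable, not
something decided in this bug.

```python
from collections import Counter, defaultdict
from datetime import datetime


def downloads_per_day(records):
    """records: iterable of (timestamp, version) pairs.

    ASSUMPTION: timestamps look like '10/Mar/2008:13:55:36 +0000'.
    Returns {date: Counter({version: downloads})}.
    """
    table = defaultdict(Counter)
    for ts, version in records:
        day = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S %z").date()
        table[day][version] += 1
    return table


def top_versions(table, keep=3):
    """Return the `keep` most-downloaded versions across all days."""
    totals = Counter()
    for per_day in table.values():
        totals.update(per_day)
    return [version for version, _ in totals.most_common(keep)]
```

For a daily-build PPA nearly every day introduces a new version, so even this
table grows one column per day.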

2/ Distribution release used

this should be easy. I also find this info very valuable, as there's no point
spending time maintaining backports for a distro used by no one.
It should probably not be based on the index stats, as it's possible to have
multiple versions of the same repository: esp. when a new Ubuntu is released,
PPA maintainers often take time to start producing debs for the new version
(debs are not copied like in the real archives).
It should come from the download stats, aggregated by package numbers.

3/ Number of users subscribed to the archive over time

I don't think we'll ever get stats per user, it's always per machine (not to
mention proxies/caches).

4/ Number of download requests over time

hm, this is 1/, sort of..

5/ Amount of data transferred over time

this one should be trivial.
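
For instance, assuming Apache-style log lines where the response size follows
the status code (an assumption about the log format, as above), summing bytes
per day is only a few lines. The function name is made up for illustration.

```python
import re
from collections import Counter

# ASSUMPTION: '[dd/Mon/yyyy:hh:mm:ss zone] "request" status size' as in Apache logs.
LINE_RE = re.compile(
    r'\[(?P<day>[^:\]]+):[^\]]*\] "[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def bytes_per_day(lines):
    """Sum the logged response sizes per calendar day."""
    totals = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        # A size of "-" means no body was sent (e.g. 304 Not Modified).
        if m and m["size"] != "-":
            totals[m["day"]] += int(m["size"])
    return totals
```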

In the meantime, what about giving the PPA owners access to their raw logs,
properly anonymized, for example by md5-ing IP addresses? The privacy risk will
be the same as with popcon (i.e. if there's just 1 user for a given package, it's
safe to assume it's the PPA maintainer, making him a target), but given an md5,
finding the IP to exploit is, well, you know..
This could allow users to experiment, and maybe find good ideas, create mockups..
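
A minimal sketch of the md5 idea, assuming the client IP is the first
space-separated field of each log line. The optional salt is my own addition,
not part of the proposal above: the IPv4 space is small enough that unsalted
hashes could be enumerated, so a secret kept on the server side might be worth
considering. Function names are hypothetical.

```python
import hashlib


def anonymize_ip(ip, salt=b""):
    """Replace an IP address with the hex md5 of (salt + ip).

    The same IP always maps to the same token, so per-machine counting
    still works on the anonymized logs.
    """
    return hashlib.md5(salt + ip.encode("ascii")).hexdigest()


def anonymize_line(line, salt=b""):
    """Anonymize a log line, ASSUMING the client IP is its first field."""
    ip, sep, rest = line.partition(" ")
    return anonymize_ip(ip, salt) + sep + rest
```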