Improve crawl item metadata

Bug #644020 reported by alexisrossi
This bug affects 1 person
Affects: Archivecollections
Status: Fix Committed
Importance: High
Assigned to: siznax
Milestone: (none listed)

Bug Description

From Brewster:

hank, can you also look at the books metadata and see what we can gen up for web that is related (I know the conflict between keeping the same tag, but not quite right vs making a new "right" tag that does not integrate)... I will opt more towards overloading existing tags.

example current web metadata:
http://ia360703.us.archive.org/10/items/WIDE-20100916011406115-00119-00137-ia360917/WIDE-20100916011406115-00119-00137-ia360917_meta.xml

example book:
http://ia311527.us.archive.org/2/items/americanaborigin00east/americanaborigin00east_meta.xml

<collection>wide_201009</collection>
<collection>webwidecrawl</collection>

* can we have in the first collection's item the description of the crawl parameters and what version of the crawler was used?

* I think we should have a scanner field that would be the hostname
e.g. ia3706.us.archive.org (this will help us distinguish what organization "scanned" the web, so alexa would have something different in this field, maybe just "alexa.com")

* What would go in the creator field? is that the organization such as "Internet Archive" and "Alexa Internet"?

* scandate would be helpful as well. This will then feed into reports better.

* should we have date and year, like books do? <date>1853</date> <year>1853</year>
I don't know why we have 2, so maybe we should just have <date>...

* I think we should have sponsor, and that would be "Internet Archive" or "Alexa Internet" <sponsor>University of Pittsburgh Library System</sponsor>

* I think we should have scanning center. This would be "San Francisco"
<scanningcenter>indiana</scanningcenter>

* <operator><email address hidden></operator>
would be good. <email address hidden> or <email address hidden>

* <imagecount>206</imagecount> would be helpful. this might want to be set during derive. this is not quite right, but at least would leverage the metamgr. I would go with it and have the number of "captures"

* <identifier-access>
http://www.archive.org/details/americanaborigin00east
</identifier-access>
would be good as well.

then can we have a meeting with alexis, hank, me, and anyone else that wants to.

metadata ho!

-brewster
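
A minimal sketch of how the fields proposed above could be assembled into a crawl item's _meta.xml. This is not the fix that was committed for this bug, only an illustration. The helper name build_crawl_meta, the <metadata> root element, the 14-digit scandate format, and the capture count of 1234 are assumptions; the collection names, scanner hostname, and identifier come from the report itself.

```python
import xml.etree.ElementTree as ET


def build_crawl_meta(identifier, collections, fields):
    """Build a _meta.xml-style element tree for a web crawl item.

    Hypothetical helper: the <metadata> root and the field layout are
    assumptions modelled on the book example linked in the report.
    """
    root = ET.Element("metadata")
    ET.SubElement(root, "identifier").text = identifier
    # An item may belong to several collections; emit one element per name.
    for name in collections:
        ET.SubElement(root, "collection").text = name
    # Overload the existing book tags rather than inventing new ones.
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = str(value)
    access = ET.SubElement(root, "identifier-access")
    access.text = "http://www.archive.org/details/" + identifier
    return root


identifier = "WIDE-20100916011406115-00119-00137-ia360917"
meta = build_crawl_meta(
    identifier,
    ["wide_201009", "webwidecrawl"],
    {
        "scanner": "ia3706.us.archive.org",  # hostname that ran the crawl
        "creator": "Internet Archive",
        "sponsor": "Internet Archive",
        "scanningcenter": "San Francisco",
        "scandate": "20100916011406",        # assumed format: YYYYMMDDhhmmss
        "imagecount": 1234,                  # hypothetical number of captures
    },
)
xml_text = ET.tostring(meta, encoding="unicode")
print(xml_text)
```

Reusing the book tags (scanner, sponsor, scanningcenter, imagecount) follows the preference stated above for overloading existing tags, so that existing reports and the metamgr can pick the fields up without changes.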

Changed in archivecollections:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → siznax (steve-archive)
siznax (siznax) wrote :

i believe this has been handled. please re-open if not.

Changed in archivecollections:
status: Confirmed → Fix Committed