Archivecollections

Improve crawl item metadata

Bug #644020 reported by alexisrossi on 2010-09-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Archivecollections	Fix Committed	High	siznax

Bug Description

From Brewster:

hank, can you also look at the books metadata and see what we can gen up for web that is related (I know the conflict between keeping the same tag, but not quite right vs making a new "right" tag that does not integrate)... I will opt more towards overloading existing tags.

example current web metadata:
http://ia360703.us.archive.org/10/items/WIDE-20100916011406115-00119-00137-ia360917/WIDE-20100916011406115-00119-00137-ia360917_meta.xml

example book:
http://ia311527.us.archive.org/2/items/americanaborigin00east/americanaborigin00east_meta.xml

<collection>wide_201009</collection>
<collection>webwidecrawl</collection>

* can we have in the first collection's item the description of the crawl parameters and what version of the crawler was used?

* I think we should have a scanner field that would be the hostname
e.g. ia3706.us.archive.org (this will help us distinguish what organization "scanned" the web, so alexa would have something different in this field, maybe just "alexa.com")

* What would go in the creator field? is that the organization such as "Internet Archive" and "Alexa Internet"?

* scandate would be helpful as well. This will then feed into reports better.

* should we have date and year, like books do? <date>1853</date>
<year>1853</year> I don't know why we have 2, so maybe we should just have <date>...

* I think we should have sponsor, and that would be "Internet Archive" or "Alexa Internet" <sponsor>University of Pittsburgh Library System</sponsor>

* I think we should have scanning center. This would be "San Francisco"
<scanningcenter>indiana</scanningcenter>

* <operator><email address hidden></operator>
would be good. <email address hidden> or <email address hidden>

* <imagecount>206</imagecount> would be helpful. this might want to be set during derive. this is not quite right, but at least would leverage the metamgr. I would go with it and have it hold the number of "captures".

* <identifier-access>
http://www.archive.org/details/americanaborigin00east
</identifier-access>
would be good as well.

then can we have a meeting with alexis, hank, me, and anyone else that wants to.

metadata ho!

-brewster
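
For discussion, the fields proposed above could be assembled roughly as follows. This is only a sketch, not the code that writes real _meta.xml files: the helper name build_crawl_meta, the <metadata> root element, the scandate format (taken here from the timestamp in the item identifier), and the capture count are all assumptions made for illustration. The field names, hostname, organization and collection values come from the description.

```python
import xml.etree.ElementTree as ET


def build_crawl_meta(identifier, collections, fields):
    """Build a metadata element for a crawl item (hypothetical helper).

    The <metadata> root element name is an assumption for this sketch.
    """
    root = ET.Element("metadata")
    ET.SubElement(root, "identifier").text = identifier
    # An item can belong to several collections, so emit one element each.
    for name in collections:
        ET.SubElement(root, "collection").text = name
    # Remaining fields are simple tag/value pairs, written in insertion order.
    for tag, value in fields.items():
        ET.SubElement(root, tag).text = str(value)
    return root


identifier = "WIDE-20100916011406115-00119-00137-ia360917"

proposed_fields = {
    "scanner": "ia3706.us.archive.org",   # hostname that "scanned" the web
    "creator": "Internet Archive",        # open question in the description
    "sponsor": "Internet Archive",
    "scanningcenter": "San Francisco",
    "scandate": "20100916011406",         # assumed format, from the identifier
    "date": "2010-09-16",                 # single <date>, no separate <year>
    "imagecount": 1000,                   # placeholder number of "captures"
    "identifier-access": "http://www.archive.org/details/" + identifier,
}

meta = build_crawl_meta(
    identifier,
    ["wide_201009", "webwidecrawl"],
    proposed_fields,
)

xml_text = ET.tostring(meta, encoding="unicode")
print(xml_text)
```

Overloading the existing book tags this way keeps the crawl items readable by anything that already understands book metadata, at the cost of names like imagecount not quite matching what they hold.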

alexisrossi (alexis-archive) on 2010-09-21

Changed in archivecollections:
status:	New → Confirmed
importance:	Undecided → High
assignee:	nobody → siznax (steve-archive)

siznax (siznax) wrote on 2011-01-19:

i believe this has been handled. please re-open if not.

Changed in archivecollections:
status:	Confirmed → Fix Committed
