ALEX THURMAN, Columbia University Libraries

60 Pages·2004·6.13 MB·English

ALEX THURMAN
Columbia University Libraries
Cornell Metadata Working Group Forum
May 17, 2013

Internet Archive
http://archive.org

Wow!
• Crawling the public Web since 1996
• Over 280 billion URLs
• Archive searchable by URL via Wayback Machine, which contains 5 petabytes of crawl data from 1996-2012
• Wayback Machine gets 1,000 visits per second
• Web archives from 1996-2007 backed up at mirror site at Bibliotheca Alexandrina
• Also collects millions of digitized books, movies, audio and concert recordings
• Over 10 petabytes of data in total

But!
• IA web collections vast but not comprehensive
• General Internet crawls take about 3 months; many websites change faster or are short-lived or not well-linked/discoverable
• Depth of capture of individual websites varies
• Archive too huge to be indexed for full-text search
• Robots.txt restrictions are obeyed, so many sites fully or partially blocked from archiving

http://netpreserve.org
44 Members

Member Type                    Members
national/regional libraries    29
university libraries           7
other non-profits              4
commercial                     2
archives                       2

Member Region                  Members
Europe                         23
North America                  13   [Internet Archive, CDL, LC, US GPO, Columbia, UNT, GWU, LANL, ODU, 3 in Canada]
Asia                           4
Oceania                        2
North Africa                   2

Types of web archiving
• Domain crawls
    • National domains (.pt, .fr, .at)
    • National/regional/local government domains (.gov.uk, state.nc.us)
    • University domains (columbia.edu)
• Thematic/selective collections
• Event-based collections
• Commercial/legal/compliance
• Personal digital archiving

https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
https://en.wikipedia.org/wiki/File:Map_of_Web_archiving_initiatives_worldwide.png
http://www.webarchive.org.uk/ukwa/
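
The slides note that robots.txt restrictions are obeyed, which leaves many sites fully or partially blocked from archiving. As a rough illustration of how a crawler might apply such rules, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the user-agent names, and the helper `allowed_urls` are all hypothetical, invented for this example; nothing here describes the Internet Archive's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that user_agent may fetch under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


urls = [
    "http://example.org/index.html",
    "http://example.org/private/report.html",
]

# A generic crawler is only kept out of /private/ (partial block).
print(allowed_urls(EXAMPLE_ROBOTS_TXT, "somecrawler", urls))

# A crawler named in a "Disallow: /" rule is kept out entirely (full block).
print(allowed_urls(EXAMPLE_ROBOTS_TXT, "examplebot", urls))
```

The two calls correspond to the "partially blocked" and "fully blocked" cases from the slide: the same site yields a different set of archivable pages depending on which rules apply to the crawler.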
