Ancient History of the UK Web
DESCRIPTION
Slides for a presentation on recent work with Web Archives at the Oxford Internet Institute (http://www.oii.ox.ac.uk/) given at WIRE2014 (http://wp.comminfo.rutgers.edu/nsfia/schedule/)
TRANSCRIPT
![Page 1: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/1.jpg)
Ancient History of the UK Web
With support from and thanks to Ning Wang and Adham Tamer
Josh Cowls, Scott A. Hale, Helen Margetts, Eric T. Meyer, Ralph Schroeder, Taha Yasseri
![Page 2: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/2.jpg)
Past Web Archive Activities at OII
• 2008-2009. JISC/NEH Transatlantic Digitisation Collaboration: World Wide Web of Humanities (Jisc & NEH funded)
– OII, Internet Archive, Hanzo Archives
– Meyer, E.T., Carpenter, K., Middleton, M. (2009). World Wide Web of Humanities: Final Report to JISC. Online: http://www.jisc.ac.uk/media/documents/programmes/digitisation/humanitiesfinalreport.pdf
• 2010. Researcher Engagement with Web Archives (Jisc funded)
– OII, VKS
– Dougherty, M., Meyer, E.T., Madsen, C., van den Heuvel, C., Thomas, A., Wyatt, S. (2010). Researcher Engagement with Web Archives: State of the Art. London: JISC. Online: http://ssrn.com/abstract=1714997 and http://ie-repository.jisc.ac.uk/544/
– Thomas, A., Meyer, E.T., Dougherty, M., van den Heuvel, C., Madsen, C., Wyatt, S. (2010). Researcher Engagement with Web Archives: Challenges and Opportunities for Investment. London: JISC. Online: http://ssrn.com/abstract=1715000 and http://ie-repository.jisc.ac.uk/543/
– Dougherty, M., Meyer, E.T. (2014). Community, Tools, and Practices in Web Archiving: The state of the art in relation to social science and humanities research needs. Journal of the American Society of Information Science & Technology. http://onlinelibrary.wiley.com/doi/10.1002/asi.23099/abstract
• 2011. Using Web Archives: A Futures Perspective (IIPC funded)
– OII
– Meyer, E.T., Thomas, A.J., Schroeder, R. (2011). Web Archives: The Future(s). London: IIPC. Online: http://ssrn.com/abstract=1830025
![Page 3: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/3.jpg)
Recent Web Archive Activities at OII
• 2013-2015: Jisc Big Data project (Jisc funded)
– OII, British Library
– Prepare and release hyperlink corpus
• 2014-2015: Big UK Domain Data for the Arts and Humanities (AHRC funded)
– IHR, OII, British Library
– Supporting researchers in Arts & Humanities to use web archive data
– Producing edited book of empirical studies concerning the history of the UK web
• First paper from these combined projects
– Hale, S.A., Yasseri, T., Cowls, J., Meyer, E.T., Schroeder, R., Margetts, H. (2014, July). Mapping the UK webspace: Fifteen years of British universities on the web. ACM WebSci’14, Bloomington, Indiana. http://papers.ssrn.com/abstract=2435481 or http://arxiv.org/abs/1405.2856
![Page 4: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/4.jpg)
Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research
This project aims to enhance JISC's UK Web Domain archive, a 30 TB archive of the .uk country-code top-level domain collected from 1996 to 2010. It will extract link graphs from the data and disseminate social science research using the collection.
February 2012 - February 2014
![Page 5: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/5.jpg)
Taming a mammoth: Web Archive Dataset Preparation
30 TB compressed data
6.2 TB metadata and links
2.5 TB temporal links
![Page 6: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/6.jpg)
30 TB compressed data in (w)arc format
– Approx. 4.5 million files
– Mix of binary and plain text payloads along with header data
– Two formats: old arc and newer warc
Housed at the BL, access restrictions
![Page 7: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/7.jpg)
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://hits.guardian.co.uk/b/ss/guardiangu-blogs,guardiangu-news,guardiangu-network/1/H.22.2/56938?ns=guardian&pageName=Prisoner+of+war+camps+in+the+UK+mapped+and+listed.+Download+the+data%3AGraphic%3A1476560&ch=News&c3=GU.co.uk&c4=History+%28Books+genre%29%2CBooks%2CSecond+world+war+%28News%29%2CGermany%2CUK+news%2CTechnology&c5=Not+commercially+useful%2CCorporate+IT&c6=Simon+Rogers&c7=10-Nov-08&c8=1476560&c9=Graphic&c10=Blogpost&c11=News&c13=&c25=Datablog&c30=content&h2=GU%2FNews%2Fblog%2FDatablog&c2=GUID:(none)
WARC-Date: 2010-12-05T02:58:00Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-IP-Address: 66.235.138.18
WARC-Record-ID: <urn:uuid:7d5ce147-9b4b-46cb-8975-ee93b4d0dda8>
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Date: Sun, 05 Dec 2010 02:58:00 GMT
Server: Omniture DC/2.0.0
X-C: ms-4.3.1
Expires: Sat, 04 Dec 2010 02:58:00 GMT
Last-Modified: Mon, 06 Dec 2010 02:58:00 GMT
Cache-Control: no-cache, no-store, must-revalidate, max-age=0, proxy-revalidate, no-transform, private
Pragma: no-cache
ETag: "4CFAFFB8-0E4C-7443902F"
Vary: *
P3P: policyref="/w3c/p3p.xml", CP="NOI DSP COR NID PSA OUR IND COM NAV STA"
Location: http://b.scorecardresearch.com/r?c2=6035250&d.c=gif&d.o=guardiangu-network&d.x=243551159&d.t=page&d.u=http%3A%2F%2Fwww.guardian.co.uk%2Fnews%2Fdatablog%2F2010%2Fnov%2F08%2Fprisoner-of-war-camps-uk
xserver: www422
Content-Length: 0
Keep-Alive: timeout=15
Connection: close
Content-Type: text/plain
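The record above has a simple layout: a version line, a block of `Name: value` WARC headers, then the captured HTTP response. A minimal Python sketch of reading the WARC headers follows, assuming the record is already available as text. `parse_warc_headers` is a hypothetical helper written for illustration, not part of the project's tooling, and real archives are usually better read with a dedicated WARC library.

```python
def parse_warc_headers(block: str) -> tuple[str, dict[str, str]]:
    """Split a WARC header block into its version line and named fields."""
    lines = [line for line in block.splitlines() if line.strip()]
    version, fields = lines[0].strip(), {}
    for line in lines[1:]:
        if line.startswith("HTTP/"):  # start of the captured HTTP response
            break
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return version, fields


# Shortened copy of the record shown on the slide.
sample = """WARC/1.0
WARC-Type: response
WARC-Date: 2010-12-05T02:58:00Z
WARC-IP-Address: 66.235.138.18
Content-Type: application/http; msgtype=response
Content-Length: 740
HTTP/1.1 302 Found
Server: Omniture DC/2.0.0
"""

version, fields = parse_warc_headers(sample)
print(version, fields["WARC-Type"], fields["WARC-Date"])
```

Stopping at the `HTTP/` status line keeps the HTTP response headers, which reuse names such as `Content-Type`, from overwriting the WARC ones.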
![Page 8: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/8.jpg)
Extract meta-data and links (wat format)
– Approx. 4.5 million files
– 6.2 TB on disk compressed
– Housed at OII
– Structured JSON
– Different formats for arc/warcs
![Page 9: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/9.jpg)
{
  "Container": {
    "Filename": "DOTUK-HISTORICAL-1996-2010-GROUP-AA-XAAAAA-20110428000000-00000.arc.gz",
    "Offset": "88937",
    "Compressed": true,
    "Gzip-Metadata": {
      "Header-Length": "10",
      "Inflated-CRC": "-1223265901",
      "Inflated-Length": "26073",
      "Deflate-Length": "4463",
      "Footer-Length": "8"
    }
  },
  "Envelope": {
    "ARC-Header-Length": "102",
    "ARC-Header-Metadata": {
      "Date": "20080509081524",
      "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
      "Content-Length": "25970",
      "Content-Type": "text/html",
      "IP-Address": "83.223.106.10"
    },
    "Payload-Metadata": {
      "Actual-Content-Type": "application/http; msgtype=response",
      "Block-Digest": "sha1:MCCZNOKBJHTZ5MMMCUJGBPE25C2TVUWF",
      "HTTP-Response-Metadata": {
        "Headers-Length": "591",
        "HTML-Metadata": {
          "Head": {
            "Title": "Exterior Corbels",
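Because the metadata is structured JSON, the fields needed for a page list can be pulled out with ordinary dictionary lookups. The sketch below is illustrative only: it uses a shortened record that mirrors the key names visible in the excerpt above, `example.arc.gz` is a placeholder filename, and `page_summary` is a hypothetical helper. Records derived from warc files use a different layout, so they would need their own key path.

```python
import json

# Shortened, hypothetical record using only key names shown in the excerpt.
record_text = """
{"Container": {"Filename": "example.arc.gz", "Offset": "88937"},
 "Envelope": {"ARC-Header-Metadata": {
     "Date": "20080509081524",
     "Target-URI": "http://www.ukhomeinteriors.co.uk/content/ext_corbels.php",
     "Content-Type": "text/html"}}}
"""


def page_summary(record: dict) -> dict:
    """Pick out the URL, crawl date and content type of an arc-derived record."""
    header = record.get("Envelope", {}).get("ARC-Header-Metadata", {})
    return {
        "url": header.get("Target-URI"),
        "date": header.get("Date"),
        "type": header.get("Content-Type"),
    }


summary = page_summary(json.loads(record_text))
print(summary["url"], summary["date"])
```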
![Page 10: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/10.jpg)
Plain text lists
Build own ad hoc Hadoop cluster, fix incompatibilities, divide into smaller batches
– Build plain text lists of pages and hyperlinks
– Remove error pages (e.g., 404 Not Found)
– Remove pages not in .uk
– Standardize dates (many formats)
– Standardize hyperlinks (trailing /, etc.)
– Fix/remove tons of invalid hyperlinks (whitespace, invalid characters, etc.)
Load results into Apache Hive (2.5 TB)
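The cleaning steps listed above can be sketched in a few small Python functions. This is a rough illustration under stated assumptions, not the project's Hadoop code: the function names are hypothetical, the normalisation rules (empty path becomes `/`, fragments dropped, only http and https kept) are guesses at what "standardize" involved, and only two date formats are handled (the arc and warc styles that appear in the excerpts).

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str) -> Optional[str]:
    """Return a standardised http(s) URL, or None if the link is invalid."""
    url = url.strip()
    if not url or any(ch.isspace() for ch in url):
        return None  # embedded whitespace: treat as invalid
    try:
        parts = urlsplit(url)
        host, port = parts.hostname, parts.port
    except ValueError:  # e.g. a non-numeric port
        return None
    if parts.scheme not in ("http", "https") or not host:
        return None
    netloc = host if port is None else f"{host}:{port}"
    # Assumption: an empty path becomes "/" and fragments are dropped.
    return urlunsplit((parts.scheme, netloc, parts.path or "/", parts.query, ""))


def in_uk_domain(url: str) -> bool:
    """True if the URL's host falls under the .uk top-level domain."""
    try:
        host = urlsplit(url).hostname or ""
    except ValueError:
        return False
    return host.endswith(".uk")


def to_epoch(stamp: str) -> int:
    """Convert an arc-style or warc-style timestamp to Unix seconds (UTC)."""
    for fmt in ("%Y%m%d%H%M%S", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            parsed = datetime.strptime(stamp, fmt)
        except ValueError:
            continue
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError(f"unrecognised date: {stamp!r}")


print(normalise_url("http://octopus.well.ox.ac.uk:80"))
print(in_uk_domain("http://debian.org/"))
print(to_epoch("20080509081524"))
```

Returning `None` for a bad link, rather than raising, lets a batch job drop invalid rows without stopping.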
![Page 11: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/11.jpg)
| Source | Destination | Time | LinkText |
| --- | --- | --- | --- |
| http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/links.html | 1032758438 | Links |
| http://octopus.well.ox.ac.uk:80/ | http://octopus.well.ox.ac.uk:80/projects.html | 1001793436 | Projects |
| http://octopus.well.ox.ac.uk:80/computing.shtml | http://debian.org/ | 1075794060 | Debian/GNU |
![Page 12: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/12.jpg)
Overall Statistics
Third-level domains, e.g. ox.ac.uk
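Statistics per second-level or third-level domain require mapping each URL to its domain levels. A minimal sketch follows, assuming the levels are simply the last two and last three labels of the host name (so `ox.ac.uk` is a third-level domain under the second-level domain `ac.uk`). `domain_levels` is a hypothetical helper, and the sample URLs are the source pages from the link table earlier in the slides.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_levels(url: str):
    """Return (second-level, third-level) domains for a .uk URL, else (None, None)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if len(labels) < 2 or labels[-1] != "uk":
        return None, None
    second = ".".join(labels[-2:])
    # Hosts registered directly under a second-level label have no third level.
    third = ".".join(labels[-3:]) if len(labels) >= 3 else None
    return second, third


sources = [
    "http://octopus.well.ox.ac.uk:80/",
    "http://octopus.well.ox.ac.uk:80/",
    "http://octopus.well.ox.ac.uk:80/computing.shtml",
]

# Count pages per third-level domain.
counts = Counter(domain_levels(url)[1] for url in sources)
print(counts.most_common())
```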
![Page 13: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/13.jpg)
Relative size of second-level domains
![Page 14: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/14.jpg)
Number of links within SLD per node
![Page 15: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/15.jpg)
Cross-domain links (2010)
Panels: absolute; normalized to target size
![Page 16: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/16.jpg)
Case of ac.uk
Mapping the UK Webspace: Fifteen Years of British Universities on the Web
Hale et al., WebSci'14, available: http://arxiv.org/abs/1405.2856
121 UK university websites and links
1) League table ranking
2) Group affiliation
3) Geographical location
![Page 17: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/17.jpg)
Group Affiliations
![Page 18: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/18.jpg)
League table ranking
![Page 19: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/19.jpg)
Geography
Colour ~ intensity
![Page 20: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/20.jpg)
Gravity Law
$$\sigma_{ij} = \frac{s_{ij}}{s_i^{out}\, s_j^{in}}$$
$$s_{ij} = \frac{s_i^{out}\, s_j^{in}}{r^{0.28}}$$
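The two gravity-law expressions can be checked numerically. The sketch below is a hedged illustration: the slide does not define its symbols, so it assumes s_ij is the link strength from site i to site j, s_i^out and s_j^in are the total outgoing and incoming strengths, and r is the distance between the two institutions. Any constant of proportionality is ignored, the function names are hypothetical, and the input numbers are made up.

```python
import math


def normalised_strength(s_ij: float, s_i_out: float, s_j_in: float) -> float:
    """sigma_ij: link strength divided by the product of the node strengths."""
    return s_ij / (s_i_out * s_j_in)


def gravity_estimate(s_i_out: float, s_j_in: float, r: float,
                     exponent: float = 0.28) -> float:
    """Link strength predicted by the gravity law with distance decay r**exponent."""
    return s_i_out * s_j_in / r ** exponent


# Illustrative values only, not taken from the dataset.
predicted = gravity_estimate(s_i_out=10.0, s_j_in=20.0, r=100.0)
sigma = normalised_strength(predicted, 10.0, 20.0)

# Under the gravity law, sigma depends on distance alone: sigma = r ** -0.28.
print(math.isclose(sigma, 100.0 ** -0.28))
```

Substituting the second expression into the first cancels the strength terms, which is why the normalised strength isolates the distance effect.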
![Page 21: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/21.jpg)
Big UK Domain Data for the Arts and Humanities
Primary aim: developing a methodological and theoretical framework within which to study over 15 years of UK domain data – with lessons for the future study of web archives more generally
![Page 22: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/22.jpg)
Big UK Domain Data for the Arts and Humanities
The dataset:
– Crawled from 1996 to 2013
– Approximately 65 TB, billions of words
– Building interface to allow search by retrieval date, target domain of links, sentiment
– Allow qualitative and quantitative analysis – and iteration between multiple research techniques
![Page 23: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/23.jpg)
Big UK Domain Data for the Arts and Humanities
Key outputs:
– Ten bursary projects using web archive data to investigate a broad range of topics, for example…
• Armed services recruitment online
• The accessibility of the web for disabled users
• Online discussions of ‘Beat’ poetry
– An edited book of empirical studies concerning the history of the UK web, featuring chapters on, for example…
• Constitutional and institutional change in UK government
• The BBC’s online presence
• The ‘web of faith’ online
![Page 24: Ancient History of the UK Web](https://reader035.vdocuments.mx/reader035/viewer/2022081403/554ba9c2b4c905ae618b520d/html5/thumbnails/24.jpg)
Next
● Studies underway at OII, BL, IHR
● Book and articles
– Study overall growth of .uk
– Case study of .gov.uk
– Study of media and select committee visibility
● Releasing data open source