mapping implicit processes: extracting social networks from digital corpora
Posted on 18-Jun-2015
DESCRIPTION
In the first half of the nineteenth century, the number of newspapers published in Britain grew exponentially, spreading far beyond the large urban centres into market towns throughout the nation. Although these provincial newspapers remained weekly or bi-weekly publications throughout the period, they still required a significant amount of reportage to fill their four to eight pages. This material was shamelessly, and often haphazardly, gleaned from international periodicals in the form of scissors-and-paste reprints. Through these half-hearted shortcuts, we can develop a significant understanding of newspaper networks before the rise of international telegraphy and the slow decline of the scissors-and-paste system. Utilising highly detailed transcriptions of newspaper content from the wider Anglophone world, and considering omissions, additions and typesetting errors, we can trace key dissemination pathways of news content from its origin in various British towns and colonies, through its many reprints, abridgments, summaries and commentaries, to the pages of the provincial press. By mapping the shape and directionality of these network connections, a greater understanding of news dissemination and editorial links can be achieved. These networks can then form the statistical basis of further qualitative studies into the spread of ideas or interpersonal connections. However, close reading such as this requires painstaking transcription and careful examination of individual articles, processes that severely limit the speed and scale at which these networks can be constructed. The paper will demonstrate how, through careful reflections on implicit processes undertaken during traditional close reading, and the adaptation of digital tools developed for distant reading and edition tracking, the mapping of these news networks can be significantly automated and the quantitative influence of key hubs can be preliminarily determined.
1. MAPPING IMPLICIT PROCESSES: EXTRACTING SOCIAL NETWORKS FROM DIGITAL CORPORA
M. H. Beals
Sheffield Hallam University
@mhbeals
2. Overview
- Understanding Scissors-and-Paste Journalism in Georgian Britain
- Computer-Aided Identification of Reprints and Memes
- Understanding Dissemination Pathways
- Manual Construction of Social Networks
- Computer-Aided Ordering of Dissemination Pathways
- Future Plans

3. Scissors-and-Paste Journalism in Georgian Britain
- Proliferation of Colonial and Provincial Presses
  - Spread of Journeyman Printers
  - Reduction of Stamp Duty
- New Profit Models
  - Entertaining and Literary Content
  - Adverts to Attract Readers to Sell to Advertisers
- Manual Dissemination of News
  - Limited Number of Specials
  - Postal Exchange, Subscriptions, Correspondence
  - No Telegraph until 1840s and Not Used for Miscellany

4. Computer-Aided Identification of Reprints & Memes
Promise
- Large-Scale Digitisation Efforts
- Keyword Searching
- nGram Matching (WCopyFind)
- Edition Tracking (Juxta)
- Viral Texts Project (Cordell, Dillon, and Smith)
  - Large-Scale Corpus of Nineteenth-Century Newspapers
  - Extensive, Automatic Repair of OCR Errors
  - Identification of Highly Reprinted Materials (Memes)
  - Discussion and Exploration of Meme Traits and Patterns
Perils
- Discrete Digital Corpora (Paywalls)
- Offline Penumbra (Curation)
- Lost Nodes (Incomplete Data)
- OCR Variability (50-80%)

5. Computer-Aided Identification of Reprints & Memes

# concordanceset.py
import re

def replace_words(text, word_dic):
    rc = re.compile('|'.join(map(re.escape, word_dic)))
    def translate(match):
        return word_dic[match.group(0)]
    return rc.sub(translate, text)

def getNGrams(wordlist, n):
    return [wordlist[i:i+n] for i in range(len(wordlist)-(n-1))]

basenumber = raw_input('What is the first id number? ')
number = str(basenumber)
numberint = int(basenumber)
basenumberend = raw_input('What is the last id number? ')
endnumber = int(basenumberend)
ngram = raw_input('How many words should be in a phrase? ')
ngrams = int(ngram)
combifile = 'combine.txt'
listopen = open(combifile, "r")
wordlist = listopen.read()
splitlist = wordlist.split()
listopen.close()
ngramslist = getNGrams(splitlist, ngrams)
if ngramslist:
    ngramslist.sort()
    last = ngramslist[-1]
    for i in range(len(ngramslist)-2, -1, -1):
        if last == ngramslist[i]:
            del ngramslist[i]
        else:
            last = ngramslist[i]
tidystring = ''
for item in ngramslist:
    number = str(basenumber)
    numberint = int(basenumber)
    lineitem = " ".join(item)
    print lineitem
    tidystring += str('\n' + lineitem + ',')
while (numberint
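The core technique in the slide's script is n-gram matching: shared multi-word phrases between two articles are evidence of a scissors-and-paste reprint. A minimal Python 3 sketch of that idea follows; the two sample snippets and the five-word phrase length are invented illustrations, not material from the talk. Using sets also replaces the slide's sort-and-delete deduplication loop.

```python
def get_ngrams(words, n):
    """Return the set of distinct n-word phrases in a token list."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(text_a, text_b, n=5):
    """Phrases of n words appearing in both texts: candidate reprint evidence."""
    return get_ngrams(text_a.split(), n) & get_ngrams(text_b.split(), n)

# Invented snippets standing in for an original article and a reprint:
original = "the ship arrived at leith on tuesday with a full cargo of timber"
reprint = "we learn that the ship arrived at leith on tuesday with a full cargo"

matches = shared_ngrams(original, reprint, n=5)
print(len(matches))  # -> 7 shared five-word phrases
```

In practice a threshold on the number of shared n-grams (as in WCopyFind) would decide whether a pair of articles counts as a reprint link in the network.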
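The slide's `replace_words` helper performs dictionary-based substitution, which supports the OCR-repair step mentioned on the previous slide. A Python 3 version of the helper is sketched below with a hedged usage example; the error dictionary is an invented sample of long-s OCR confusions common in Georgian newsprint, not the speaker's actual data.

```python
import re

def replace_words(text, word_dic):
    """Replace every key of word_dic found in text with its value
    (a Python 3 rendering of the helper shown on the slide)."""
    rc = re.compile('|'.join(map(re.escape, word_dic)))
    return rc.sub(lambda m: word_dic[m.group(0)], text)

# Invented long-s / OCR confusions for illustration:
ocr_fixes = {"fhip": "ship", "fays": "says", "Glafgow": "Glasgow"}
print(replace_words("the fhip from Glafgow", ocr_fixes))
# -> the ship from Glasgow
```

Normalising OCR variants in this way before n-gram comparison reduces the chance that a genuine reprint is missed because of transcription noise.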