2010 digital humanities london - dutch republic of letters
scholarly communication @ 1650
scholarly communication @ 2050
Letters, Ideas and Information Technology
Erik-Jan Bos, Univ. Utrecht, erik-jan.bos@phil.uu.nl
Charles van den Heuvel, VKS, charles.vandenheuvel@vks.knaw.nl
Dirk Roorda (that’s me), DANS, dirk.roorda@dans.knaw.nl
Using digital corpora of letters to disclose the circulation of
knowledge in the 17th century
http://ckcc.huygens.knaw.nl/
[Diagram. Labels: Nota; Beeckman; Cats, STEVIN; Huygens, STEVIN; Langeren. Relation between disciplines: direct (water), indirect (literature).]
Corpora of 17th-century scholars
Constantijn Huygens, Christiaan Huygens, Grotius, Descartes, Swammerdam, Leeuwenhoek, Barlaeus, Spinoza, and more?
Corpus | Number of letters | In possession? | Format | Metadata | Normalized?
Grotius | 7946 | Yes | TEI | In Interp element | Yes, DBNL codes
Van Leeuwenhoek | 337 | Yes | TEI | In Interp element | Yes, DBNL codes
Descartes | 750 | Yes | XML (no TEI) | other markup | No, plain text
Barlaeus | 1200 | 300 ready | Word | unknown | unknown
Swammerdam | 80 | Yes | Word | unknown | unknown
Constantijn Huygens | 7295 | Yes | XML | Probably Interp element | DBNL codes
Christiaan Huygens | 2900? | Mid-2010 | probably TEI | Probably Interp element | DBNL codes
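Several corpora in the table keep their letter metadata in TEI Interp elements. As a rough illustration of what reading such metadata could look like, here is a minimal sketch. The sample document, the `type` names (`sender`, `recipient`, `date`) and the use of a `value` attribute are assumptions made for this example only; the actual encoding of the project corpora is not shown on the slides and may differ (for instance, by using the TEI namespace).

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample; NOT taken from any of the corpora.
SAMPLE_LETTER = """
<TEI>
  <text>
    <body>
      <div type="letter">
        <interpGrp>
          <interp type="sender" value="Person A"/>
          <interp type="recipient" value="Person B"/>
          <interp type="date" value="1650-01-01"/>
        </interpGrp>
        <p>Letter text goes here.</p>
      </div>
    </body>
  </text>
</TEI>
"""

def letter_metadata(xml_text):
    """Collect every <interp> element into a {type: value} dictionary."""
    root = ET.fromstring(xml_text)
    return {el.get("type"): el.get("value") for el in root.iter("interp")}

metadata = letter_metadata(SAMPLE_LETTER)
print(metadata)
```

A uniform dictionary like this is the kind of intermediate form that would let letters from differently encoded corpora be compared, once each corpus has its own small reader.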
CEN - Metadata
Catalogus Epistularum Neerlandicarum
265,000 descriptions of approximately 1,000,000 letters
from 1600 to now, of which
100,000 letters are from the 17th century
Research Questions
• History of science:
  • How did knowledge circulate in the 17th-century Dutch Republic?
• Patterns in knowledge growth:
  • How can we visualise sets of letters that exhibit features of knowledge circulation?
• Re-use:
  • How can we expose the sources, annotations, and resulting patterns to further research?
Challenge
East: Traditional scholarship
• interpretation
• close reading
• solving puzzles
West: Computational methods
• dealing with patterns
• gleaned from large quantities of texts
• by automatic tools
East is east and West is west and ...
Issues to deal with
• making the sources uniformly available
  • well coded in TEI, access rights
• overcoming the language barrier
  • (17th-century varieties of French, Latin, Dutch)
• named entity recognition & concepts
  • people, places, dates, concepts, instruments
  • mixture of interpretation and algorithms
• creating useful visualisations
  • aiding exploration by historians of science
ICT in Humanities Research
• collaboratory
  • e-Laborate as starting point
• algorithmic pipelines
  • from source material to visualisation
• infrastructure
  • archiving results
  • re-using data
  • developing new algorithms
  • disseminating the methodology
collaboratory
pipelines
pipelines (current)
• language detection, using "Language Identification from Text Using N-gram Based Cumulative Frequency Addition" (Bashir Ahmed, Sung-Hyuk Cha, and Charles Tappert, 2004)
• results
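To give a feel for the language detection step, here is a minimal sketch of n-gram scoring by cumulative frequency addition: each language gets a table of relative n-gram frequencies, and a text is assigned to the language whose table yields the highest sum over the text's n-grams. This is a simplified reading of the cited method, not the project's pipeline code. The n-gram sizes, the word padding, and the tiny training sentences are all assumptions for illustration; a real model would be trained on much larger period texts.

```python
from collections import Counter

# Toy training texts, invented for this example (far too small for real use).
TOY_SAMPLES = {
    "fr": "je vous remercie de la lettre que vous avez ecrite et de la peine que vous avez prise",
    "nl": "ik heb uw brief ontvangen en dank u voor de moeite die gij genomen hebt",
    "la": "litteras tuas accepi et gratias tibi ago quod de rebus nostris scripsisti",
}

def char_ngrams(text, n_min=2, n_max=4):
    """Yield lowercase character n-grams of each word, padded with underscores."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def train(samples):
    """Build, per language, a table of relative n-gram frequencies."""
    models = {}
    for lang, text in samples.items():
        counts = Counter(char_ngrams(text))
        total = sum(counts.values())
        models[lang] = {gram: count / total for gram, count in counts.items()}
    return models

def identify(text, models):
    """Add up the frequencies of the text's n-grams per language; highest sum wins."""
    scores = {lang: 0.0 for lang in models}
    for gram in char_ngrams(text):
        for lang, freqs in models.items():
            scores[lang] += freqs.get(gram, 0.0)
    best = max(scores, key=scores.get)
    return best, scores

models = train(TOY_SAMPLES)
print(identify("vous avez la lettre", models)[0])
```

The appeal of this approach for mixed 17th-century correspondence is that it needs no dictionary, only sample text per language, so spelling variation affects it less than word-level lookup would.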
pipelines (current)
• spelling normalisation
  • VARD (http://www.comp.lancs.ac.uk/~barona/vard2/)
  • with help from (http://www.dicollecte.org/home.php?prj=fr)
• results
  • French: VARD works (after improvements), although designed for historical English
  • Dutch: still on the lookout for a combination of resources, tools, and dexterity
  • Latin: later
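The general idea behind spelling normalisation can be sketched as follows: look a historical form up in a list of known variants, and otherwise try a few letter-replacement rules, accepting a candidate only if it is a word in a modern lexicon. This is not VARD's API or algorithm, just a toy illustration of the principle; the lexicon, the variant list, and the rules below are hypothetical and far smaller than anything usable.

```python
# Hypothetical miniature resources, for illustration only.
MODERN_LEXICON = {"lui", "roi", "moi", "savoir", "avoir", "temps"}
KNOWN_VARIANTS = {"sçavoir": "savoir"}
REWRITE_RULES = [("oy", "oi"), ("uy", "ui")]

def normalise(token):
    """Return a modern spelling for token, or the token unchanged if none is found."""
    lower = token.lower()
    if lower in MODERN_LEXICON:
        return lower
    if lower in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[lower]
    for old, new in REWRITE_RULES:
        if old in lower:
            candidate = lower.replace(old, new)
            # Accept a rewritten form only if the modern lexicon confirms it.
            if candidate in MODERN_LEXICON:
                return candidate
    return token

print([normalise(t) for t in ["luy", "roy", "sçavoir", "temps"]])
```

Checking candidates against a lexicon is what keeps the rewrite rules from producing non-words, which is also why a good modern word list per language matters so much for this step.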
pipelines (current)
• named entity recognition
  • known tools get 70%
  • search for optimal tools in the next stage
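The slide does not say which measure the 70% refers to. One common way to score a named entity recogniser against hand-annotated letters is precision, recall, and F1 over exact entity matches, sketched below. The entity spans and labels are invented for the example, and exact-match scoring is an assumption, not necessarily what the project used.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold ones; both are sets of (start, end, label)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Invented annotations: character offsets plus an entity label.
gold_entities = {(0, 7, "PER"), (20, 26, "LOC"), (30, 34, "DATE")}
tool_entities = {(0, 7, "PER"), (20, 26, "PER"), (30, 34, "DATE"), (40, 45, "LOC")}

precision, recall, f1 = precision_recall_f1(gold_entities, tool_entities)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the tool finds the right span for the second entity but gives it the wrong label, so it counts as both a miss and a false alarm under exact matching.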
pipelines (insights)
• expect the most from statistical methods
• language technology may boost results
• it remains to be seen by how much
Topic-Author-Time
Source: Scott Weingart UIA
infrastructure
the project’s legacy
• more than publications
  • curated sources, annotations, visualisations
• more than algorithms
  • a framework for analysis of historical texts
• more than a piece of historical research
  • data and (intermediate) results worthwhile to linguists, computer scientists, sociologists
• more than a passive dataset
  • extensible, dynamic, interactive
preserving the results
• part of the CLARIN infrastructure
  • http://www.clarin.eu/
  • http://www.clarin.nl/
• materials in a Trusted Digital Repository (DANS)
  • http://easy.dans.knaw.nl/dms
working with CLARIN
• CLARIN-EU
  • Outreach to humanities: use cases
  • CKCC one of 10 selected projects
  • received expert input for choice of language tools
• CLARIN-NL
  • CKCC one of 10 initial projects in the Dutch national construction effort
  • support for applying language technology
Adapting to CLARIN
• Conforming to standards
• CLARIN standards are in evolution
  • (and will remain evolvable)
• Common MetaData Infrastructure
  • a registry of metadata components
  • defined by the community
  • with explicit semantics (http://www.isocat.org/)
• Data in TEI (as export/import format)
Trusted Digital Repository
• materials
  • reliable (provenance metadata)
  • findable (CMDI metadata)
  • referable (persistent identifiers)
  • accessible (viewable in web browser)
  • usable (downloadable)
• sooner or later:
  • high-performance computing
  • memento: a time-sensitive web interface to the dynamic contents of the collaboratory (http://arxiv.org/abs/0911.1112)
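As a sketch of what time-sensitive access in the Memento style involves: a client asks for a resource as it existed at a given moment by sending an `Accept-Datetime` header (the name standardised later in RFC 7089) to a so-called TimeGate. The example below only builds such a request and does not send it. The TimeGate address is a hypothetical placeholder, and nothing here implies the collaboratory actually offers such a service.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate prefix; no such service is claimed to exist.
TIMEGATE_PREFIX = "http://timegate.example.org/"

def memento_request(resource_url, when):
    """Build (but do not send) a request for resource_url as of the aware datetime `when`."""
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE_PREFIX + resource_url, headers={"Accept-Datetime": stamp})

request = memento_request(
    "http://ckcc.huygens.knaw.nl/",
    datetime(2010, 4, 1, tzinfo=timezone.utc),
)
# urllib stores header names in capitalised form, hence "Accept-datetime".
print(request.full_url, request.get_header("Accept-datetime"))
```

For a collaboratory whose annotations keep changing, this kind of request would let a citation point to the state of the material at the time it was consulted.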
http://www.clarin.eu/node/3073
http://ckcc.huygens.knaw.nl/