a spatio-temporal visual analysis tool for historical dictionaries
A visual historical exploration tool for lexical resources: the case of Austrian language
Alejandro Benito
Universidad de Salamanca (Spain)
Antonio
Universidad de Salamanca (Spain)
Roberto Therón
Universidad de Salamanca (Spain)
Eveline
Austrian Academy of Sciences (Austria)
DIGITAL HUMANITIES
Aims of our research
● Rethink the string-based, text-only workflows traditionally employed by academics
● Propose a new visual, interactive framework for language data exploration
● Provide a new perspective on the data
● Speed up knowledge extraction
Approach: (3 + 1) Computational Pillars
NLP: Text mining and information retrieval techniques
GIS: Natural approach to data
SNA: Applying graph theory to data to perform relationship and pattern detection
DATAVIS: Central pillar. Unleashes the computational power of the other three.
Where do dictionaries come from?
- Questionnaires are distributed among the population
- They are collected and analysed in search of particularities
- Complementary fieldwork:
* Personal interviews
* Drawings & Maps
* Notes
* Other artefacts
- Cards are generated for each word found
- Further research is performed afterwards using the cards and artefacts
WBÖ, dbo@ema & exploreAT!
* WBÖ is an initiative with almost 100 years of history
* The DBÖ project (1993) starts creating a digital databank out of WBÖ data
* dbo@ema: Attempt to expose DBÖ data to the general public by using the Web (1st Dataset)
* TUSTEP-XML format is employed (2nd Dataset)
* In the process, historical geocoding data is also generated
exploreAT! 4 Core Topics:
• Digital infrastructures
• E-Lexicography
• Visual analysis tools
• Citizen Science
Timeline
Timeline created by: A. Dorn; clip art: www.clker.com
Our approach
Document search engine
• Full support of string-based queries, typically used in lexicography (NLP)
• On top of that we add more dimensions:
* Spatial (GIS) and temporal searches
* Fuzzy + word distance queries
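A fuzzy query of this kind could be sketched as follows. This is only an illustration, assuming that "word distance" means Levenshtein edit distance; the slides do not say which measure the tool uses. The helper names `editDistance` and `fuzzySearch` and the sample headwords are invented for the example.

```javascript
// Hypothetical helper: dynamic-programming Levenshtein distance between two strings.
function editDistance(a, b) {
  // prev[j] = distance between the first (i - 1) characters of a and the first j of b.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical fuzzy search: keep headwords within maxDistance edits of the query.
function fuzzySearch(headwords, query, maxDistance) {
  const q = query.toLowerCase();
  return headwords.filter((w) => editDistance(w.toLowerCase(), q) <= maxDistance);
}

// Invented sample headwords, not taken from the dataset.
const sample = ["rot", "rod", "roth", "kraut", "brot"];
const hits = fuzzySearch(sample, "rot", 1);
console.log(hits);
```

A larger `maxDistance` widens the result set, which is one reason a visual representation of fuzzy results is useful.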
Visualisation
• Linked-view layout that puts the computational power of GIS, SNA and NLP in the hands of the novice user
• SNA visualisations employing different techniques to foster pattern identification
Software methodology
There is no predefined, by-the-book architectural model in DH
1. Microprototyping is necessary to gain initial insight into the data
2. Periodic exchange meetings with the team of lexicography experts
3. Progressive extraction, refinement and integration of requirements in the final prototype
Bernard et al. (2015)
Prototyping
Problems with the two datasets
DBO@EMA
Overly rigid relational approach
Cumbersome schema, difficult to work with
Slow string query response times
Models a traditional concept of dictionary
TUSTEP-XML
Too slow for creating a responsive visualisation for the web
Only contains textual information
Does not hold the required dimensions
Loose formatting.
/(1\d{3})(-\d{2})*|(1\d{1}.x):(\d{2})*-(\d{2})*/g
ort = ort.replace(/\/[A-Za-z]+/i, "");
SELECT * FROM gemeinde WHERE nameKurz LIKE ? OR nameKurz LIKE ? OR originaldaten LIKE ? OR originaldaten LIKE ?
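As a rough sketch of how heuristic rules like these might be applied to a loosely formatted record, the snippet below reuses the date pattern from the slide and a place-name clean-up step. The function names `extractDateTokens` and `cleanPlace` and the sample strings are assumptions made for illustration. The source does not explain the second date form, so the code only collects the matching tokens and does not interpret them.

```javascript
// Date pattern as shown on the slide: a year 1xxx with optional "-dd" parts,
// or a token such as "19xx:20-30" (its exact semantics are not given in the source).
const DATE_PATTERN = /(1\d{3})(-\d{2})*|(1\d{1}.x):(\d{2})*-(\d{2})*/g;

// Hypothetical helper: return every date-like token found in a free-text field.
function extractDateTokens(text) {
  return text.match(DATE_PATTERN) || [];
}

// Hypothetical helper: drop the first "/letters" suffix from a place name.
function cleanPlace(ort) {
  return ort.replace(/\/[A-Za-z]+/i, "").trim();
}

// Invented sample values, not taken from the dataset.
console.log(extractDateTokens("Fragebogen 1913-14, Wien"));
console.log(extractDateTokens("19xx:20-30"));
console.log(cleanPlace("Wien/Stadt"));
```

The cleaned place name could then be passed as a parameter to a lookup such as the `gemeinde` query, with ambiguous matches left for the supervised step.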
TUSTEP-XML
HEURISTIC RULES & SUPERVISED PROCESS
dbo@ema dataset + Historical geocoding data
● Time
● Space
● Other features
Some numbers
A total of 2,206,227 records (95.3%) imported out of the 2,314,031 initially processed
The remaining 4.7% were discarded because of formatting errors in the data source
In the imported set of 2.2M records:
* 26.6% have temporal but no spatial dimension
* 32.4% have only spatial dimension
* 9.8% contain both spatial and temporal dimensions
* 31.1% of the records do not contain spatial or temporal dimensions
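The figures above can be cross-checked with a few lines of arithmetic. The record counts and percentages are the ones reported on the slide; the variable names are made up for the example, and the derived counts are approximations because the published shares are rounded to one decimal.

```javascript
// Counts reported on the slide.
const processed = 2314031;
const imported = 2206227;

const importedShare = (imported / processed) * 100; // rounds to 95.3
const discarded = processed - imported;             // 107804 records
const discardedShare = (discarded / processed) * 100; // rounds to 4.7

// Reported breakdown of the imported set, in percent (rounded values).
const shares = { temporalOnly: 26.6, spatialOnly: 32.4, both: 9.8, neither: 31.1 };

// Approximate number of records that carry each dimension at all.
const withTime = (imported * (shares.temporalOnly + shares.both)) / 100;
const withSpace = (imported * (shares.spatialOnly + shares.both)) / 100;

console.log(importedShare.toFixed(1), discardedShare.toFixed(1));
console.log(Math.round(withTime), Math.round(withSpace));
```

The four shares add up to 99.9% rather than 100% because of rounding. Roughly 36% of the imported records can be placed in time and roughly 42% in space.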
Our architectural model in DH
1. There is no standard architecture in DH
2. We learnt from other researchers and the dbo@ema initiative
3. Our proposal is oriented towards:
a. Dealing with large amounts of data (>1M records)
b. Enhancing user experience
c. Reactive components
d. Supporting visualisations that respond at interactive speeds
Resolution change
AND
OR
Case studies
1. Visual exploration of the usage of the word “red” and possible referents
2. Popular plant names ending in “-kraut” (herb)
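The second case study boils down to a suffix query that can be narrowed by time and space. A minimal sketch under stated assumptions: the record shape (`lemma`, `year`, `lat`, `lon`), the function names and the sample records are all invented for illustration and do not reflect the actual data model of the tool.

```javascript
// Invented sample records; field names and values are assumptions.
const records = [
  { lemma: "Johanniskraut", year: 1920, lat: 48.2, lon: 16.4 },
  { lemma: "Schöllkraut", year: null, lat: 47.1, lon: 15.4 },
  { lemma: "Bohnenkraut", year: 1935, lat: null, lon: null },
  { lemma: "Rotkehlchen", year: 1925, lat: 47.8, lon: 13.0 },
];

// String dimension: case-insensitive suffix match.
function endsWithSuffix(lemma, suffix) {
  return lemma.toLowerCase().endsWith(suffix.toLowerCase());
}

// Temporal dimension: records without a year never match a year filter.
function inYearRange(record, from, to) {
  return record.year !== null && record.year >= from && record.year <= to;
}

// Spatial dimension: records without coordinates never match a bounding box.
function inBoundingBox(record, box) {
  return record.lat !== null && record.lon !== null &&
    record.lat >= box.south && record.lat <= box.north &&
    record.lon >= box.west && record.lon <= box.east;
}

// Combine the string query with optional temporal and spatial filters (logical AND).
function query(items, suffix, filters = {}) {
  return items.filter((r) =>
    endsWithSuffix(r.lemma, suffix) &&
    (!filters.years || inYearRange(r, filters.years[0], filters.years[1])) &&
    (!filters.box || inBoundingBox(r, filters.box)));
}

const allKraut = query(records, "kraut");
const datedKraut = query(records, "kraut", { years: [1900, 1930] });
console.log(allKraut.map((r) => r.lemma), datedKraut.map((r) => r.lemma));
```

Records lacking a dimension drop out as soon as a filter on that dimension is applied, which matters here because only part of the imported set carries temporal or spatial information.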
Discussion
Provide a new concept of historical dictionary
Experts’ validation was very positive
Future lines of work
Incorporate other data sources (OpenLink)
Visual representation of fuzzy results
Deal with areas of terrain instead of aggregations only
Ability to update/validate the dataset using citizen science
Results
That was about it for now.
Thanks for listening!
Questions?