visual analysis and historical discovery


Upload: ajquigley

Post on 14-Jul-2015



VISUAL ANALYSIS AND HISTORICAL DISCOVERY

Summer School on Big Data Information Visualisation

Chandan Kumar (University of Oldenburg) Julia Juergens (University of Hildesheim) Percy Perez (University of St. Andrews) Victoria Hore (University of Oxford)

BRIGHTSOLID: NEWSPAPER DATASET

Data description

• Newspapers
  • Fife Herald 1833-1878
  • The Dundee Courier & Argus 1890-1899
• Data set
  • 154 GB of XML files
  • 16 048 issues (1 METS file per issue)
  • 77 954 pages (1 ALTO file per page)
  • no images
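Since each page is one ALTO file, reading a page's text amounts to collecting the words stored in that file. The following is a minimal sketch, assuming the files follow the ALTO convention of storing each recognised word in the CONTENT attribute of a String element. The function name and the sample XML are hypothetical and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def alto_page_text(xml_source: str) -> str:
    """Return the words of one ALTO page as a single space-separated string."""
    root = ET.fromstring(xml_source)
    words = []
    for element in root.iter():
        # Drop any "{namespace}" prefix so the sketch works whether or not
        # the file declares an ALTO namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "String" and element.get("CONTENT"):
            words.append(element.get("CONTENT"))
    return " ".join(words)


# Hypothetical, heavily simplified page used only for illustration.
SAMPLE = """<alto><Layout><Page><PrintSpace><TextBlock>
<TextLine><String CONTENT="DUNDEE"/><SP/><String CONTENT="JUTE"/></TextLine>
<TextLine><String CONTENT="FLEET"/></TextLine>
</TextBlock></PrintSpace></Page></Layout></alto>"""

print(alto_page_text(SAMPLE))
```

Because the OCR output contains errors, text extracted this way would still need cleaning before any counting or entity recognition.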

Data files

• Title (METS)
  - OCR errors
  - No meaning
• ALTO

Methodology

Architectural overview

Data processing

• 20 years of data analysed
  • 12 years have complete titles
  • 8 years do not have complete titles
  • 6189 files analysed
  • 314 meta files per year (avg.)
• 12 years => 3754 issues
  • Word counting, formatting files to/from XML, D3 and Jigsaw
• Hadoop processing was impressive
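The word-counting step can be illustrated in the map/reduce style that Hadoop uses. This is a single-machine sketch in plain Python, not the team's actual Hadoop job. The tokenisation rule (lowercase runs of a-z) and all function names are assumptions made for illustration.

```python
import re
from collections import Counter
from functools import reduce

# Assumed tokenisation: lowercase alphabetic runs only.
TOKEN = re.compile(r"[a-z]+")


def map_page(text: str) -> Counter:
    """Map step: count the words of a single page."""
    return Counter(TOKEN.findall(text.lower()))


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts."""
    return left + right


def word_count(pages) -> Counter:
    """Count words over many pages by mapping each page, then reducing."""
    return reduce(reduce_counts, map(map_page, pages), Counter())


# Hypothetical page texts used only for illustration.
counts = word_count(["Jute fleet arrives", "Jute mills in Dundee"])
print(counts.most_common(1))
```

In a distributed setting the map step would run on many pages in parallel and the partial counts would be merged per word, which is what makes the approach practical for 154 GB of XML.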

Idea generation

• What happened in the 19th century?
  • Find interesting stories
• Where were events happening?
  • Overview of mentioned locations
• What were the most common topics?
  • Overview of frequent words
  • Categorization of words
• Who was mentioned?
  • Entity recognition of names

Visualization (overview)

Visual Exploration with Jigsaw

• Jigsaw already has good functions and visualizations!

Visualisations (Beyond Jigsaw)

• More numerical analysis
• User selected dimensions and exploration
• Dynamic visualization
  • topics, locations, entities
• Pattern analysis

Interactive visualisation

Dynamic exploration

Insights

• Industrial revolution in Dundee
  • Frequency analysis, cluster overview, positive sentiments
• LATEST MOVEMENTS OF DUNDEE JUTE FLEET
  • Entity relations, bigram analysis
• Calcutta, Indian subcontinent?
  • Location-commercial significance
• Baxter Brothers was the world's largest linen manufacturer (1840-1890)
  • Family names-organization
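The bigram analysis mentioned above can be sketched as counting adjacent word pairs, so that recurring phrases such as a repeated headline stand out. This is a minimal illustration under an assumed tokenisation (lowercase runs of a-z). The function name and the sample sentence are hypothetical.

```python
import re
from collections import Counter


def bigram_counts(text: str) -> Counter:
    """Count adjacent word pairs (bigrams) in a piece of text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # zip pairs each token with its successor; the final token has none.
    return Counter(zip(tokens, tokens[1:]))


# Hypothetical text used only for illustration.
sample = "Latest movements of Dundee jute fleet and Dundee jute mills"
for pair, count in bigram_counts(sample).most_common(1):
    print(pair, count)
```

Pairs with high counts across many issues are candidates for recurring column titles or for entity relations worth inspecting in a tool such as Jigsaw.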

Conclusions

• A really steep learning curve
• Big data is BIG
• Distributed computing is important
• Data wants to tell interesting stories (we just need to interact)
• Visualisation is powerful
• Jigsaw is awesome
• Lots of useful visualisation tools are ready to be used
• Generalizations and Interactions (future work)

THANK YOU FOR THE COOL (SCHOOL) EXPERIENCE

Big thanks to BRIGHTSOLID for providing the interesting dataset

Chandan Kumar Julia Juergens Percy Perez Victoria Hore