Text mining big data: potential and challenges

Beatrice Alex Edinburgh Language Technology Group School of Informatics [email protected] @bea_alex Big Data Approaches to Intellectual, Cultural and Linguistic History, Helsinki, 01/12/2014 Text mining big data: potential and challenges

Page 1

Beatrice Alex
Edinburgh Language Technology Group
School of Informatics
[email protected]
@bea_alex

Big Data Approaches to Intellectual, Cultural and Linguistic History, Helsinki, 01/12/2014

Text mining big data: potential and challenges

Page 2

LTG

The Edinburgh Language Technology Group

Research and development of natural language processing techniques and technology.

Collaboration in projects with partners in a range of different disciplines (biodiversity, biomedicine, history and literature).

Aggregating, text mining, geo-parsing and linking data.

Page 3

LTG

Ongoing projects:

Palimpsest (Mining Literary Edinburgh, AHRC)

UK Connectivity (Analysis of social media, British Council)

BotaniTours (Information aggregation and presentation of botanical points of interest in the Scottish Borders, Smart Tourism and dot.rural).

Trading Consequences (Text mining trends in commodity trading of large 19th century text collections, Jisc, ESRC, AHRC).

New: Text mining brain scan reports for clinical neurologists.

Page 4

TEXT MINING

Describes a set of linguistic, statistical and/or machine learning techniques that model and structure the information content of textual resources.

Turns unstructured text into structured data (e.g. relational database or linked data).

Is very useful for analysing large text collections automatically (overcoming data paralysis).

Goal: Analyse large amounts of textual data to enable scholars to discover novel patterns and explore hypotheses.
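As a minimal sketch of what "turning unstructured text into structured data" can look like, the Python snippet below finds place-name mentions in a sentence using a toy lookup list and stores them as rows in an in-memory SQLite table. The sentence, the `KNOWN_PLACES` list and the table schema are illustrative assumptions, not part of any LTG tool.

```python
import re
import sqlite3

# Hypothetical toy lexicon; a real system would use a trained recogniser.
KNOWN_PLACES = {"Edinburgh", "Glasgow", "Helsinki"}

def extract_mentions(doc_id, text):
    """Return (doc_id, mention, start_char) rows for known place names."""
    rows = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        if match.group() in KNOWN_PLACES:
            rows.append((doc_id, match.group(), match.start()))
    return rows

def store_mentions(rows):
    """Load the extracted rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mention (doc_id TEXT, name TEXT, start_char INTEGER)"
    )
    conn.executemany("INSERT INTO mention VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

rows = extract_mentions("doc1", "She took the train from Glasgow to Edinburgh.")
conn = store_mentions(rows)
counts = conn.execute(
    "SELECT name, COUNT(*) FROM mention GROUP BY name ORDER BY name"
).fetchall()
print(counts)
```

Once the mentions are rows in a table, they can be counted, joined with other data sets or fed to a visualisation, which is the point of the structuring step.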

Page 5

TYPES OF ANALYSES

Named entity recognition.

Grounding, e.g. geo-referencing.

Relation extraction.

Clustering, e.g. topic modelling.

Sentiment analysis.

Page 6

TYPES OF ANALYSES

Location mention recognition and geo-parsing output for Picturesque Notes by R. L. Stevenson (Palimpsest, http://palimpsest.blogs.edina.ac.uk/ @LitPalimpsest).

Page 7

TYPES OF ANALYSES

Location mention recognition and geo-parsing output for Picturesque Notes by R. L. Stevenson (Palimpsest, http://palimpsest.blogs.edina.ac.uk/ @LitPalimpsest).

Trading Consequences visualisation interface (@digtrade http://tradingconsequences.blogs.edina.ac.uk)

Page 8

TYPES OF ANALYSES

UK Connectivity: Sentiment analysis of tweets on the Commonwealth Games Opening Ceremony in Glasgow 2014.

[Figure labels: pipes, drums, Queen; dogs; home nations; baton; Rod]

Page 9

TYPES OF ANALYSES

UK Connectivity project: person names and sentiment in Twitter data about Ukraine, for one week each in March, June and July 2014.

Page 10

TYPES OF ANALYSES

Geo-referenced user location data of tweeters talking about Ukraine.

[Figure panels: early March 2014, end of June 2014, mid July 2014]

Page 11

TYPES OF ANALYSES

<pal-snippet end="s432" start="s432" score="0.43">"When we go to Edinburgh," she said, "remind me while we're there to go and visit Miss Brodie's grave."</pal-snippet>
...
<pal-snippet end="s547" start="s546" score="0.71">Now they were in a great square, the Grassmarket, with the Castle, which was in any case everywhere, rearing between a big gap in the houses where the aristocracy used to live. It was Sandy's first experience of a foreign country, which intimates itself by its new smells and shapes and its new poor.</pal-snippet>

Snippet analysis to rank by “interestingness”.

The Prime of Miss Jean Brodie (Muriel Spark)
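The snippet markup above carries a `score` attribute, so ranking comes down to parsing and sorting. The sketch below assumes the snippets are wrapped in a hypothetical `<snippets>` root element so that they parse as one XML document, and it shortens the snippet texts for brevity. How the scores themselves are computed is not shown on the slide and is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Snippets in the style shown on the slide, wrapped in a hypothetical
# <snippets> root element. The snippet texts are shortened here.
SAMPLE = """<snippets>
<pal-snippet end="s432" start="s432" score="0.43">"When we go to Edinburgh," she said ...</pal-snippet>
<pal-snippet end="s547" start="s546" score="0.71">Now they were in a great square, the Grassmarket ...</pal-snippet>
</snippets>"""

def rank_snippets(xml_text):
    """Return (score, start, end, text) tuples, highest score first."""
    root = ET.fromstring(xml_text)
    snippets = [
        (float(el.get("score")), el.get("start"), el.get("end"), el.text or "")
        for el in root.iter("pal-snippet")
    ]
    return sorted(snippets, key=lambda s: s[0], reverse=True)

ranked = rank_snippets(SAMPLE)
for score, start, end, text in ranked:
    print(f"{score:.2f} {start}-{end} {text[:40]}")
```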

Page 12

POTENTIAL

Making text collections more accessible.

Enabling distant reading.

Can be applied in an assisted curation setting.

Discovery.

Linking data sets.

Can be optimised with user input.

TM output is more accessible to users through visualisations.

Page 13

ACCESSIBILITY

More and bigger available data sets.

Manual analysis becomes difficult, if not impossible.

Text mining makes textual data more accessible.

It can improve search but also provide other entry points into data, e.g. via visualisations.

Page 14

ACCESSIBILITY

Text mining can be used to analyse trends in large collections and thereby enable distant reading.

At the same time it can be used to point readers to individual examples and direct them back to original sources (close reading).

Source: http://www.nassrgrads.com/online-academics-questions-for-grad-students/

Page 15

ASSISTED CURATION

Automatic processing of the bulk of information, followed by more careful manual annotation, correction and selection.

Important when going from big to small data where quality matters.

Can produce high precision and recall.

Page 16

ASSISTED CURATION

Text mining output visualised in the Palimpsest assisted curation interface developed at SACHI, University of St Andrews.

Page 17

DISCOVERY

Page 18

LINKING

Text mining often involves linking data sets.

Location mention -> gazetteer entry with lat/lon

Person name -> Wikipedia page

Gene mention -> unique identifier in gene ontology

Plant -> Wikispecies page

This can help discovery.
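The first kind of link listed above (location mention to gazetteer entry with lat/lon) can be sketched as a simple lookup. The mini-gazetteer and its approximate coordinates are illustrative assumptions. A real geo-referencing step also has to choose between several candidate entries for an ambiguous name, which this sketch does not attempt.

```python
# Hypothetical mini-gazetteer: name -> (latitude, longitude), approximate values.
GAZETTEER = {
    "edinburgh": (55.95, -3.19),
    "grassmarket": (55.947, -3.196),
    "glasgow": (55.86, -4.25),
}

def link_mentions(mentions, gazetteer=GAZETTEER):
    """Map each location mention to a gazetteer entry, or None if unknown."""
    linked = {}
    for mention in mentions:
        key = mention.strip().lower()
        linked[mention] = gazetteer.get(key)
    return linked

result = link_mentions(["Edinburgh", "Grassmarket", "Atlantis"])
for mention, coords in result.items():
    print(mention, coords)
```

Mentions that stay unlinked (`None`) are natural candidates for manual checking.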

Page 19

LINKING

Page 20

LINKING

BotaniTours: Combining various data sets as well as linking out to other existing sites containing plant species information.

http://groups.inf.ed.ac.uk/BotaniTours

Page 21

USER INPUT

Iterative development (meetings, prototyping, interviews, manual annotation, continuous feedback).

Page 22

USER INPUT

User workshop to improve the functionality of the Trading Consequences interface at CHESS 2013 organised by Prof. Colin Coates and colleagues (Hinrichs et al., DH 2014)

Page 23

USER INPUT

Annotations of brain scan reports to develop a text mining pipeline for this data. Collaboration with Dr. William Whiteley.

Page 24

DATA VISUALISATION

Photo by: Daniel Belasco Rogers

Page 25

DATA VISUALISATION

Trading Consequences location tree visualisation developed by Uta Hinrichs at the University of St Andrews.

Page 26

CHALLENGES

Availability of data sources

Gazillions of formats

Data quality

Data size

Limitations of text mining

Page 27

AVAILABILITY

Even though original printed sources might be out of copyright, their electronic copies often aren’t.

Even if they are, it can take a long time before a collection is actually available.

Make it freely available and you’ll be amazed what other people can do with it. Go on!

“open but not free” :-(

Page 28

FORMATS

Various gazetteers combined into one Edinburgh gazetteer for the Palimpsest project.

Page 29

NOISY DATA

Page 30

NOISY DATA

Page 31

NOISY DATA

Page 32

NOISY DATA

Page 33

QUALITY RATING

Alex and Burns, DATeCH 2014.

Quality Distribution

We need more clarity on the quality of existing digitised collections.
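One crude way to make text quality explicit is to measure what fraction of a document's tokens appear in a word list. The sketch below is an illustration under that assumption only, and is not necessarily the rating method used in the work cited on this slide. The tiny `LEXICON` and the sample strings are hypothetical.

```python
import re

# Hypothetical tiny lexicon; a real estimate would need a full word list.
LEXICON = {"the", "castle", "was", "in", "any", "case", "everywhere"}

def quality_score(text, lexicon=LEXICON):
    """Fraction of alphabetic tokens found in the lexicon (0.0 if no tokens)."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in lexicon for t in tokens) / len(tokens)

clean = quality_score("The Castle was in any case everywhere")
noisy = quality_score("Tbe Caftle was in anv cafe everywbere")
print(f"clean={clean:.2f} noisy={noisy:.2f}")
```

A score like this says nothing about historical spelling variation or proper names, so it is better read as a rough signal for ranking documents than as a measure of OCR accuracy.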

Page 34

NATURE OF DATA

BotaniTours: Geo-referenced flowering plant data from GBIF for the Scottish Borders.

Page 35

DATA SIZE

We can process hundreds of thousands of documents, millions of pages, billions of words, millions of tweets (but often less), usually gigabytes (or less, rarely more). Is that big data?

The entire British Library Nineteenth Century Books collection is 16 TB (1 TB of text, 15 TB of images).

Parallelisation is important.

Shallow text processing can be done relatively quickly but deeper semantic analysis still takes too long to be practical.
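To illustrate why parallelisation suits shallow processing, the sketch below counts tokens per document independently and then merges the partial counts. The documents and the whitespace tokenisation are illustrative assumptions. A thread pool is used so the example runs anywhere; for CPU-bound work in CPython, `ProcessPoolExecutor`, which offers the same `map` interface, would usually be the better choice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_tokens(document):
    """Shallow processing step: lowercase whitespace tokenisation and counting."""
    return Counter(document.lower().split())

def parallel_token_counts(documents, max_workers=4):
    """Count tokens per document in parallel, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_tokens, documents):
            total.update(partial)
    return total

docs = ["tea and sugar", "sugar and timber", "tea"]
counts = parallel_token_counts(docs)
print(counts.most_common())
```

The pattern works because each document can be processed without reference to any other, so only the cheap merge step is sequential.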

Page 36

LIMITATIONS OF TM

Text mining is not 100% accurate, especially if the data contains errors or isn’t running text.

Intrinsic evaluation (using a gold standard) and error analysis are important.

Openness about performance helps to manage the expectations of users and lets them understand the strengths and weaknesses of our technology.
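Intrinsic evaluation against a gold standard is commonly reported as precision, recall and F1. The sketch below computes these for exact-match annotations; the `(start, end, label)` triples are hypothetical and do not come from any of the evaluations mentioned in this talk.

```python
def evaluate(predicted, gold):
    """Precision, recall and F1 of predicted annotations against a gold standard."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical annotations as (start_offset, end_offset, label) triples.
gold = {(0, 9, "LOC"), (24, 31, "LOC"), (40, 45, "COMMODITY")}
predicted = {(0, 9, "LOC"), (24, 31, "LOC"), (50, 55, "LOC"), (60, 66, "COMMODITY")}

p, r, f = evaluate(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The false positives and missed gold annotations left over after such a comparison are the starting point for error analysis.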

Page 37

Trading Consequences: Intrinsic evaluation of the prototype and an improved system (Klein et al. 2014).

LIMITATIONS OF TM

Page 38

Trading Consequences: Intrinsic evaluation of the prototype and an improved system (Klein et al. 2014).

And geo-referencing evaluation when varying the number of GeoNames candidates considered. (Alex et al., to appear)

LIMITATIONS OF TM

Page 39

SUMMARY

We are consumers and creators of data: we use existing text collections, mine and enrich them.

Text mining can be useful for assisting and speeding up manual analysis.

We often work with visualisations of mined data in order to facilitate their analysis.

We like to tailor our technology to users and ask them for feedback to improve its performance.

Page 40

SUMMARY

Better processes need to be put in place to get access to data. Some providers (HathiTrust) already do this very well.

Data standards can be useful. Format conversion takes up huge amounts of my time and I haven’t seen this improve much yet.

It’s important to make the quality of data more explicit. We need to think about how to deal with low quality and fuzzy data.

Text mining is not going to replace human analysis!

Page 41

LTG STAFF AND STUDENTS

Prof. Ewan Klein

Prof. Jon Oberlander

Dr. Claire Grover

Dr. Colin Matheson

Dr. Beatrice Alex

Richard Tobin

Dr. Kate Byrne

Dr. Michael Roth

Xuri Tang (visiting)

Clare Llewellyn

Daniel Duma

Paolo Pareti

Amy Isard

Page 42

LTG TOOLS

The Edinburgh Geoparser: a tool for geo-referencing text.

LT-XML2 and LT-TTT2: XML-based software for shallow linguistic processing of text.

Page 43

THANK YOU

Questions?

Contact: [email protected]

Website: http://homepages.inf.ed.ac.uk/balex/

Twitter: @bea_alex

Next LaTeCH at ACL 2015 in Beijing!
