broad twitter corpus: a diverse named entity recognition resource

Broad Twitter Corpus: A Diverse Named Entity Recognition Resource

Leon Derczynski, Kalina Bontcheva, Ian Roberts

“I strongly recommend this paper”

“It is therefore a very useful resource”

“Impact of resources: 5
Overall recommendation: 5
Reviewer Confidence: 5”

wow

so review

very paper

much japan

Most of our language tech was trained on news

The bias is:

- middle class
- white
- working age
- educated
- male
- 1980s/1990s
- from the US
- journalist
- following AP guidelines

Your phone rewards you if you talk and write like

(and that's ok.. sort of)

Photo © Michael Jang 1983

.. and punishes you when you don't.

(not cool!)

The REAL problem:

Our studies have centred on a tiny, over-biased set of data

There is no variation! (analyse some WSJ if you are not convinced..)

It's time to up our game; social media is a cheap & unprecedented resource

e.g. Baldwin @ WNUT15; Hovy @ ACL15

Social media is incredibly powerful

- sample of all global discourse
- warns of earthquakes
- sends fire engines
- predicts virus outbreaks (e.g. WNV)

Traditional tools have awful performance

Stanford NER 40% F1

Single-topic recall 66%.. cross-topic 33%

What kind of entities do we find in social media?

High variety – ages quickly

| Type | News | Tweets |
|---|---|---|
| PER | Politicians, business leaders, journalists, celebrities | Sportsmen, actors, TV personalities, celebrities, names of friends |
| LOC | Countries, cities, rivers, and other places related to current affairs | Restaurants, bars, local landmarks/areas, cities, rarely countries |
| ORG | Public and private companies, government organisations | Bands, internet companies, sports clubs |

Why a new corpus?

Existing ones are tiny, and hyperfocused

| Name | Tokens | Schema | Annotation | Notes |
|---|---|---|---|---|
| UMBC | 7K | PLO | Crowd | Low IAA |
| Ritter | 46K | Freebase | Expert, single | No IAA |
| Microsoft | 12K | PLO + Product | ? | Private |
| MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames |

What kind of variance do we see?

Temporal:
- concept drift over time
- daily cycles (work, family, socialising)
- weekly cycles
- time of year (seasonal behaviours)

Spatial:
- many different anglophone regions
- different surface forms in each
- different signifiers (LLC – Ltd. - DAC)

Social:
- WSJ readers and writers
- net celebrities
- tv characters

Corpus design:

Temporal:
- drawn over six years, from the Twitter archive
- selected over multiple temporal cycles

Spatial:
- spread over six anglophone regions: UK, US, IE, CA, NZ, AU

Social:
- general segment
- selection for news
- selection for commentary

Annotation problems

Workflow:
Crowdsourcing platform interfaces = pita
Not in USA, so no mturk access

Solution:
- GATE Crowdsourcing plugin
- Load corpus, set up task, add API key, launch job, done!
- Automatic result collection & alignment
- Even Java/Swing is prettier than mturk’s back end

Annotation problems

Task design:
Lots of training required
Many entity types

Solution:
Brief instructions
Clean interface
Annotate just one entity type at a time
- pricy but way better, and overall, quicker

Annotation problems

Annotator recall:
Pretty serious problem
People have limited knowledge, limited world experience
Expert annotators actually not good – we’re desperately overfit

Don’t believe me? Who can explain this real document?
KKTNY in 45 min!!!!!

Solution:
Ignore traditional IAA
Pool the results - “max recall”
Rare knowledge ≠ Wrong knowledge

Post-solution:
Expert adjudication step
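The pooling idea can be sketched as a plain union over each worker's annotated spans, so that an entity recognised by even one worker survives to the adjudication step. This is only an illustration under assumptions: the `(start, end, type)` span layout and the name `pool_max_recall` are hypothetical, not taken from the talk or from GATE.

```python
from typing import Iterable, Set, Tuple

# Hypothetical span layout: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def pool_max_recall(worker_spans: Iterable[Set[Span]]) -> Set[Span]:
    """Union every worker's spans: anything found by any one worker is kept."""
    pooled: Set[Span] = set()
    for spans in worker_spans:
        pooled |= spans
    return pooled


# Three made-up workers annotating the same tweet.
workers = [
    {(0, 1, "PER")},
    {(0, 1, "PER"), (4, 5, "LOC")},  # only this worker knew the location
    set(),                            # this worker recognised nothing
]
pooled = pool_max_recall(workers)
print(sorted(pooled))  # candidate entities handed to the expert adjudicator
```

A union deliberately trades precision for recall, which is why an expert pass afterwards is needed to discard spurious spans.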

Annotation problems

Crowd can be pretty dumb
Not its fault – we gave no education
People need precise idea of task

Solution 1:
Ensure workers get good score on known data first
Lace the text with gold data, for monitoring & feedback

Solution 2:
Keep task focused (just one entity type)
Give instructions & examples
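Screening workers against known data might look roughly like the sketch below. The 0.7 threshold, the dictionary layout, and the names `gold_accuracy` and `trusted_workers` are illustrative assumptions; the talk does not give the actual settings used.

```python
from typing import Dict, List


def gold_accuracy(answers: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold items the worker answered correctly (0.0 if none seen)."""
    seen = [item for item in answers if item in gold]
    if not seen:
        return 0.0
    correct = sum(1 for item in seen if answers[item] == gold[item])
    return correct / len(seen)


def trusted_workers(
    all_answers: Dict[str, Dict[str, str]],
    gold: Dict[str, str],
    threshold: float = 0.7,  # assumed cut-off, purely for illustration
) -> List[str]:
    """Workers whose accuracy on the hidden gold items meets the threshold."""
    return sorted(
        worker
        for worker, answers in all_answers.items()
        if gold_accuracy(answers, gold) >= threshold
    )


# Made-up gold items laced into the task, and two made-up workers.
gold = {"t1": "yes", "t2": "no"}
all_answers = {
    "w1": {"t1": "yes", "t2": "no", "t3": "yes"},  # 2/2 on gold
    "w2": {"t1": "no", "t2": "no"},                # 1/2 on gold
}
trusted = trusted_workers(all_answers, gold)
print(trusted)
```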

Results – annotator quality

Experts are consistent, but don’t get far

Crowd is varied and inconsistent, but gets superior recall performance

Remember, recall is the problem with soc med!

| Group | Recall over final annotations | F1 IAA |
|---|---|---|
| Expert | 0.309 | 0.835 |
| Crowd | 0.837 | 0.350 |
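For readers unfamiliar with the two measures, here is one way they could be computed. The sketch assumes exact span matching and uses pairwise F1 between two annotators as the agreement score; the talk does not say which matching scheme was used, so treat both choices as assumptions.

```python
from typing import Set, Tuple

# Hypothetical span layout: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def recall(found: Set[Span], final: Set[Span]) -> float:
    """Share of the final (adjudicated) annotations that this group found."""
    return len(found & final) / len(final) if final else 0.0


def f1_agreement(a: Set[Span], b: Set[Span]) -> float:
    """Pairwise F1 between two annotators; symmetric in a and b."""
    if not a and not b:
        return 1.0  # convention: two empty annotations agree completely
    return 2 * len(a & b) / (len(a) + len(b))


# Made-up example: the expert finds one of four final entities.
final = {(0, 1, "PER"), (3, 4, "LOC"), (6, 8, "ORG"), (9, 10, "PER")}
expert = {(0, 1, "PER")}
expert_recall = recall(expert, final)

annotator_a = {(0, 1, "PER"), (3, 4, "LOC")}
annotator_b = {(0, 1, "PER")}
agreement = f1_agreement(annotator_a, annotator_b)
print(expert_recall, round(agreement, 3))
```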

Results: size

| Name | Tokens | Schema | Annotation | Notes |
|---|---|---|---|---|
| UMBC | 7K | PLO | Crowd | Low IAA |
| Ritter | 46K | Freebase | Expert, single | No IAA |
| Microsoft | 12K | PLO + Product | ? | Private |
| MSM | 29K | PLO + Misc | Expert, multiple | No hashtags / usernames |
| BTC (Broad Twitter Corpus) | 165K | PLO | Expert + Crowd | Source JSON available |

Documents 9 551

Tokens 165 739

Person 5 271

Location 3 114

Organisation 3 732

Total 12 117

Results: diversity

Sorry Botswana, Bahamas, South Africa, Malta.. looking forward to seeing you crowdsource!

Results: diversity

By year, and month

Results: diversity

By day of month, weekday, and time of day

Results: IAA

Adjudication is the agreement with max-recall

Naïve is micro-averaged lenient match

Note that max-recall performs very well (according to expert..)

| Level | Adjudication | Naïve |
|---|---|---|
| Whole doc | 0.839 | N/A |
| Person | 0.920 | 0.799 |
| Location | 0.963 | 0.861 |
| Organisation | 0.936 | 0.954 |
| All | 0.940 | 0.877 |
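One possible reading of "micro-averaged lenient match" is sketched here: two spans match when they share an entity type and overlap by at least one token, and match counts are summed over all documents before the score is computed. This interpretation, the span layout, and the function names are assumptions, since the talk gives no formula.

```python
from typing import List, Set, Tuple

# Hypothetical span layout: (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def lenient_match(a: Span, b: Span) -> bool:
    """Same type and at least one shared token."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def micro_lenient_f1(docs: List[Tuple[Set[Span], Set[Span]]]) -> float:
    """F1 over counts pooled across documents (micro-average)."""
    matched_a = matched_b = total_a = total_b = 0
    for spans_a, spans_b in docs:
        total_a += len(spans_a)
        total_b += len(spans_b)
        matched_a += sum(
            1 for a in spans_a if any(lenient_match(a, b) for b in spans_b)
        )
        matched_b += sum(
            1 for b in spans_b if any(lenient_match(a, b) for a in spans_a)
        )
    if total_a == 0 or total_b == 0:
        return 0.0
    precision = matched_a / total_a
    recall = matched_b / total_b
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two made-up documents, each a pair of annotation sets.
docs = [
    ({(0, 2, "PER")}, {(1, 2, "PER")}),                   # overlapping, same type
    ({(5, 6, "LOC")}, {(5, 6, "ORG"), (8, 9, "LOC")}),    # no lenient match
]
score = micro_lenient_f1(docs)
print(round(score, 3))
```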

Results: popular surface forms

CoNLL is:
- ancient
- US and int.rel. centric
- about cricket???

Results: long tail steepness

Tail vs. head tells us something about diversity
If a few forms make up many mentions, the corpus is more boring:
- less variety (qualitative)
- harder to generalise about (maths!)

We bisect at h-index point, and compare proportions
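A rough sketch of the bisection follows. It takes the h-index to be the largest h such that h distinct surface forms each occur at least h times, treats those top h forms as the head, and reports the share of mentions in head and tail. The exact procedure behind the slide may differ; the function names and the toy data are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def h_index(counts_desc: List[int]) -> int:
    """Largest h such that the h most frequent forms each occur >= h times."""
    h = 0
    for rank, count in enumerate(counts_desc, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def head_tail_share(mentions: Iterable[str]) -> Tuple[int, float, float]:
    """Return (h, share of mentions in the head, share in the tail)."""
    counts = sorted(Counter(mentions).values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0, 0.0, 0.0
    h = h_index(counts)
    head = sum(counts[:h])
    return h, head / total, (total - head) / total


# Toy entity mentions: frequencies 4, 3, 2, 1, 1.
mentions = ["London"] * 4 + ["Obama"] * 3 + ["BBC"] * 2 + ["Sheffield", "Dublin"]
h, head_share, tail_share = head_tail_share(mentions)
print(h, round(head_share, 3), round(tail_share, 3))
```

A larger head share means a few surface forms dominate the corpus, which is the "boring" case described above.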

Corpus distribution

Totally legal to give source; it’s under 50K tweets

- JSON
- GATE docs
- CoNLL

All intermediate crowdsourcing data included in the GATE docs

Available before Dec 16

To be extra sure, also available as “rehydratable standoff”
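As an illustration of consuming the CoNLL release, here is a minimal reader. It assumes one token per line with the tag in the last whitespace-separated column, and a blank line between tweets; the real files may use a different column layout, so check them before relying on this. The sample text and the name `read_conll` are made up.

```python
from typing import List, Tuple


def read_conll(text: str) -> List[List[Tuple[str, str]]]:
    """Parse CoNLL-style text into documents of (token, tag) pairs."""
    docs: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current document, if there is one.
            if current:
                docs.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # first column token, last column tag
    if current:
        docs.append(current)
    return docs


# Made-up sample in the assumed format.
sample = "I\tO\nlove\tO\nSheffield\tB-LOC\n\nBBC\tB-ORG\nNews\tI-ORG\n"
parsed = read_conll(sample)
print(len(parsed), parsed[0][2])
```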

Thanks! And thank you everyone!

Alonso & Lease 2011; Bontcheva et al. 2014a; Bontcheva et al. 2014b; Callison-Burch & Dredze 2010; Difallah et al. 2013; Finin et al. 2010; Hovy et al. 2013; Khanna et al. 2010; Morris et al. 2012; Sabou et al. 2014

Balog et al. 2012; Bollacker et al. 2008; Hovy 2010; Rowe et al. 2013; Ritter et al. 2011; Rose et al. 2002; Tjong Kim Sang et al. 2003

Coppersmith et al. 2014; De Choudhury et al. 2013; Kedzie et al. 2015; Neubig et al. 2011; Tumasjan et al. 2010

Eisenstein et al. 2010; Eisenstein 2013; Hu et al. 2013; Kergl et al. 2014; Mascaro & Goggins 2012; Tufekci 2014

Bontcheva et al. 2013; Liu et al. 2011; Lui & Baldwin 2012; Magdy & Elsayed 2016; Mostafa 2013; O’Connor et al. 2010

Fromreide et al. 2014; Masud et al. 2010
