smart qualitative data: methods and community tools for data mark-up (squad) louise corti uk data...

Smart Qualitative Data: Methods and Community Tools for Data Mark-Up (SQUAD)

Louise Corti
UK Data Archive, University of Essex

QUADS Demonstrator Workshop
28 September 2006


Access to qualitative data

access to qualitative research-based datasets

resource discovery points – catalogues

online data searching and browsing of multi-media data

new publishing forms: re-presentation of research outputs combined with data – a guided tour

text mining, natural language processing and e-science applications offer richer access to digital data banks

underpinning these applications is the need for agreed methods, standards and tools


Applications of formats and standards

standard for data producers to store and publish data in multiple formats, e.g. UK Data Archive and ESDS Qualidata Online

data exchange and data sharing across dispersed repositories and software packages (e.g. CAQDAS)

more precise searching/browsing of archived qualitative data beyond the catalogue record

shared toolsets for preparing qualitative data for sharing and archiving


Our own needs

ESDS Qualidata online system
  limited functionality - currently keyword search, KWIC retrieval, and browse of texts
  wish to extend functionality
    display of marked-up features (e.g. named entities)
    linking between sources (e.g. text, annotations, analysis, audio etc.)

for 5 years we have been developing a generic descriptive standard and format for data that is customised to social science research and which meets generic needs of varied data types

some important progress through TEI and Australian collaboration


How useful is textual data?

dob: 1921
Place: Oldham
finalocc: Oldham

[Welham]

U id='1' who='interviewer' Right, it starts with your grandparents. So give me the names and dates of birth of both. Do you remember those sets of grandparents?
U id='2' who='subject' Yes.
U id='3' who='interviewer' Well, we'll start with your mum's parents? Where did they live?
U id='4' who='subject' They lived in Widness, Lancashire.
U id='5' who='interviewer' How do you remember them?
U id='6' who='subject' When we Mum used to take me to see them and me Grandma came to live with us in the end, didn't she?
U id='7' who='Welham' Welham: Yes, when Granddad died - '48.
U id='8' who='interviewer' So he died when he was 48?
U id='9' who='Welham' Welham: No, he was 52. He died in 1948.
U id='10' who='interviewer' But I remember it. How old would I be then?
U id='11' who='Welham' Welham: Oh, you would have been little then.
U id='12' who='subject' I remember him, he used to have whiskers. He used to put me on his knee and give me a kiss.
...
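The speaker-turn lines in this transcript are regular enough to be converted into structured mark-up mechanically. The following is a minimal sketch in Python, assuming each utterance sits on a single line in exactly the "U id='…' who='…' text" form; the function names parse_utterances and to_xml, and the choice of a "u" element inside a "body" element (loosely modelled on TEI's utterance element), are illustrative assumptions rather than the output of the SQUAD tools.

```python
import re
import xml.etree.ElementTree as ET

# Matches lines such as: U id='2' who='subject' Yes.
UTTERANCE = re.compile(r"^U id='(?P<id>\d+)' who='(?P<who>[^']+)' (?P<text>.*)$")


def parse_utterances(lines):
    """Return (id, speaker, text) tuples for every line that matches."""
    records = []
    for line in lines:
        match = UTTERANCE.match(line.strip())
        if match:
            records.append((int(match["id"]), match["who"], match["text"]))
    return records


def to_xml(records):
    """Wrap each utterance in a <u> element that carries the speaker in 'who'."""
    body = ET.Element("body")  # assumed container element
    for number, who, text in records:
        u = ET.SubElement(body, "u", {"id": str(number), "who": who})
        u.text = text
    return body


sample = [
    "U id='2' who='subject' Yes.",
    "U id='4' who='subject' They lived in Widness, Lancashire.",
    "U id='5' who='interviewer' How do you remember them?",
]
records = parse_utterances(sample)
xml_text = ET.tostring(to_xml(records), encoding="unicode")
print(xml_text)
```

Once the turns are held as elements with attributes, questions such as "show only the subject's answers" become simple queries instead of free-text searches.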


What are we interested in finding in data?

short term:
  how can we exploit the contents of our data?
  how can data be shared?
  what is currently useful to mark-up?

long term:
  what might be useful in the future?
  who might want to use your data?
  how might the data be linked to other data sets?


SQUAD Project: Smart Qualitative Data

Primary aim:
  to explore methodological and technical solutions for exposing digital qualitative data to make them fully shareable and exploitable

collaboration between
  UK Data Archive, University of Essex (lead partner)
  Language Technology Group, Human Communication Research Centre, School of Informatics, University of Edinburgh

18 months duration, 1 March 2005 – 31 August 2006


SQUAD: main objectives

developing and testing universal standards and technologies
  long-term digital archiving
  publishing
  data exchange

defining context for research data (e.g. interview settings and dynamics and micro/macro factors)

user-friendly tools for semi-automating processes already used to prepare qualitative data and materials (Qualitative Data Mark-up Tools (QDMT))
  formatted text documents ready for output
  mark-up of structural features of textual data
  annotation and anonymisation tool
  automated coding/indexing linked to a domain ontology

providing demonstrators and guidance


What features can be marked-up?

spoken interview texts provide the clearest, and most common, example of kinds of typical encoding features:

3 basic groups of structural features
  utterance, specific turn taker, defining idiosyncrasies in transcription
  links to analytic annotation and other data types (e.g. thematic codes, concepts, audio or video links, researcher annotations)
  identifying information such as real names, company names, place names, occupations, temporal information


Identifying elements

Identify atomic elements of information in text
  Person names
  Company/Organisation names
  Locations
  Dates
  Times
  Percentages
  Occupations
  Monetary amounts

Example:
  Italy's business world was rocked by the announcement last Thursday that Mr. Verdi would leave his job as vice-president of Music Masters of Milan, Inc to become operations director of Arthur Anderson.


How do we annotate our data?

human effort?
  how long does one document take to mark up by hand?
  how much data do you want/need?
  how many annotators do you have?

human error – like traditional coding error
  accuracy
  expertise in subject area
  boredom
  subjective opinions

what if we decide to add more categories for mark-up at a later date?

can this be automated at all?


Automating content extraction using rules

rules can be written
  lists of common names, useful to a point
  lists of pronouns (I, he, she, me, my, they, them, etc.)
    “me mum”; “them cats”, but which entities do pronouns refer to?
  rules regarding typical surface cues:
    CapitalisedWord - probably a name of some sort, e.g. “John found it interesting…”
      first word of sentences is useless though
    title CapitalisedWord - probably a person name, e.g. “Mr. Smith” or “Mr. Average”?

Works ok but requires several months for a person to write these rules
  each new domain/entity type requires more time
  requires experienced experts (linguists, biologists, etc.)
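The two surface-cue rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the TITLES list, the function name find_name_candidates and the whitespace tokenisation are all hypothetical, and a production rule set would be far larger, which is the point the slide makes about the months of effort involved.

```python
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.")  # assumed list, extend as needed
PUNCTUATION = ".,;:!?\"'"


def find_name_candidates(sentence):
    """Apply two surface-cue rules to a single sentence.

    Rule 1: a title followed by a capitalised word is a likely person name.
    Rule 2: any other capitalised word is a likely name of some sort,
            except the first word, whose capital letter is uninformative.
    """
    tokens = sentence.split()
    candidates = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        word = token.strip(PUNCTUATION)
        if token in TITLES and i + 1 < len(tokens):
            next_word = tokens[i + 1].strip(PUNCTUATION)
            if next_word[:1].isupper():
                candidates.append(("person", f"{token} {next_word}"))
                i += 2  # skip the word consumed by the title rule
                continue
        if i > 0 and word[:1].isupper():
            candidates.append(("name", word))
        i += 1
    return candidates


print(find_name_candidates("Yesterday Mr. Smith met John in Oldham."))
print(find_name_candidates("John found it interesting"))
```

The second call returns nothing because "John" is the first word of the sentence, and the same rules would happily label "Mr. Average" a person; both weaknesses are the ones noted above.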


What about more intelligent content extraction mechanisms?

machine learning:
  manually annotate texts with entities
  100,000 words can be done in 1-3 days depending on experience
  the more annotated data you have, the higher the accuracy
  if the system hasn’t seen it or hasn’t seen anything that looks like it, then it can’t tell what it is
  So - garbage in, garbage out

Latest approach uses a mixture of rules and machine learning

Recent focus on relation and event extraction

  Mike Johnson is now head of the department of computing. Today he announced new funding opportunities.

  person(Mike-Johnson)
  head-of(the-department-of-computing, Mike-Johnson)
  announced(Mike-Johnson, new funding opportunities, today)
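To make the predicate notation concrete, here is a small Python sketch that derives the first two predicates from the example sentence with a single hand-written pattern. It is an assumption-laden toy: the HEAD_OF pattern and the function names are hypothetical, and real relation extraction relies on parsing and trained models, not one regular expression.

```python
import re

# One hand-written pattern: "<Capitalised Name> is now head of <phrase>."
HEAD_OF = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+) is now head of (?P<org>[^.]+)\."
)


def normalise(phrase):
    """Join words with hyphens, as in the predicate notation."""
    return "-".join(phrase.split())


def extract_relations(text):
    """Return (predicate, arguments) tuples for every pattern match."""
    relations = []
    for match in HEAD_OF.finditer(text):
        person = normalise(match["person"])
        org = normalise(match["org"])
        relations.append(("person", (person,)))
        relations.append(("head-of", (org, person)))
    return relations


text = (
    "Mike Johnson is now head of the department of computing. "
    "Today he announced new funding opportunities."
)
for predicate, args in extract_relations(text):
    print(f"{predicate}({', '.join(args)})")
```

The third predicate, announced(...), is deliberately out of reach here: producing it requires working out that "he" refers to Mike Johnson, which is a coreference problem that a surface pattern cannot solve.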


UK Data Archive - NLP collaboration

ESDS Qualidata making use of options for semi-automated mark-up of some components of its data collections using natural language processing and information extraction

new partnerships created – new methods, tools and jargon to learn!

new area of application for NLP to social science data

growing interest in UK in applying NLP and text mining to social science texts – data and research outputs such as publications’ abstracts


Project progress

defined areas of context for qualitative data

drafted a metadata schema with mandatory elements

built a Java GUI – with step-by-step components
  data clean up tool
  named entity mark-up tools
  annotation tool - NITE XML Toolkit

extended functionality of ESDS Qualidata Online system to include links to audio-visual material, other documents, research outputs and mapping systems

Defining context

rich context enables informed re-use of data. But defining how to provide context for raw data to make it more ‘usable’ is complex

both micro and macro level features should be considered

detailed information on sampling procedures, field work approaches and question guides, analysis. Personal fieldwork observations

timelines, e.g. events and political chronologies

SQUAD has identified a minimal generic set of elements that represent a baseline for contextualising data

QUADS workshop to address common problems.

Papers being prepared for a dedicated edited collection in the journal Methodological Innovations Online

sirius.soc.plymouth.ac.uk/~andyp/

Metadata standards in use

DDI for Study description, Data file description, Other study related materials, links to variable description for quantified parts (variables)

for data content and data annotation: the Text Encoding Initiative
  standard for text mark-up in humanities and social sciences
  used consultant to help test the TEI-conformant DTD

evaluating other schemas

TEI Schema

The XML schema will specify a ‘reduced’ set of Text Encoding Initiative (TEI) elements:

  core tag set for transcription
  names, numbers, dates <persName>
  links and cross references <ref>
  notes and annotations <note>
  text structure <body>
  unique to spoken texts <kinesic>
  linking, segmentation and alignment <link>
  advanced pointing - XPointer framework
  text and AV synchronisation
  contextual information (participants, setting, text)


Metadata for model transcript output

Study Name         <titlStmt><titl>Mothers and daughters</titl></titlStmt>
Depositor          <distStmt><depositr>Mildred Blaxter</depositr></distStmt>
Interview number   <intNum>4943int01</intNum>
Date of interview  <intDate>3 May 1979</intDate>
Interview ID       <persName>g24</persName>
Date of birth      <birth>1930</birth>
Gender             <gender>Female</gender>
Occupation         <occupation>pharmacy assistant</occupation>
Geo region         <geoRegion>Scotland</geoRegion>
Marital status     <marStat>Married</marStat>
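A header like the one on this slide can be generated from a simple record instead of being typed by hand. The Python sketch below uses the element names shown on the slide; the wrapping "interviewHeader" element, the dictionary keys and the function name build_header are hypothetical, and the actual SQUAD schema may nest these elements differently.

```python
import xml.etree.ElementTree as ET

# Flat fields: (element name from the slide, key in the input dict).
FLAT_FIELDS = [
    ("intNum", "interview_number"),
    ("intDate", "interview_date"),
    ("persName", "interview_id"),
    ("birth", "birth_year"),
    ("gender", "gender"),
    ("occupation", "occupation"),
    ("geoRegion", "geo_region"),
    ("marStat", "marital_status"),
]


def build_header(fields):
    """Build the metadata header as an XML element tree."""
    header = ET.Element("interviewHeader")  # hypothetical wrapper element
    titl_stmt = ET.SubElement(header, "titlStmt")
    ET.SubElement(titl_stmt, "titl").text = fields["study_name"]
    dist_stmt = ET.SubElement(header, "distStmt")
    ET.SubElement(dist_stmt, "depositr").text = fields["depositor"]
    for tag, key in FLAT_FIELDS:
        ET.SubElement(header, tag).text = fields[key]
    return header


example = {
    "study_name": "Mothers and daughters",
    "depositor": "Mildred Blaxter",
    "interview_number": "4943int01",
    "interview_date": "3 May 1979",
    "interview_id": "g24",
    "birth_year": "1930",
    "gender": "Female",
    "occupation": "pharmacy assistant",
    "geo_region": "Scotland",
    "marital_status": "Married",
}
header_xml = ET.tostring(build_header(example), encoding="unicode")
print(header_xml)
```

Keeping the field-to-element mapping in one table makes it easy to check the output against a list of mandatory elements before a transcript is published.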


Transcript with XML mark-up

XML: enabling a standardised format for interview transcripts

XML and XSL: enabling web-enabled display, search and browse

Automating XML mark-up

Input data file

Data processed through Edinburgh LT-XML and CME tools

The main Graphical User Interface (GUI)

Invokes the SQUADCoder in NXT

NXT tool

Locate the NXT metadata file

The NXT generic window – running the SQUAD Coder

The SQUADCoder Window

Transcription view

The Named Entity Hierarchy

All the references to a particular entity

Annotation tool - anonymise

The Coreference Action Panel

Annotation tool

Enter pseudonym

Anonymised data

The Anonymised Transcription View

Annotated data: what formats and how stored?

NXT uses ‘stand off’ annotation – annotation is linked to, or references, individual words

uses the NITE NXT XML model

creates new anonymised version

intend to:
  save original file
  save matrix of references - names to pseudonyms
  output annotations – who worked on the file etc.
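The idea behind stand-off annotation and the name-to-pseudonym matrix can be shown in a short Python sketch. It does not reproduce the NITE NXT XML model: the Annotation class, the anonymise function and the pseudonym "Town A" are hypothetical, and annotations are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: it points at token positions, not at text."""
    start: int   # index of the first token covered
    end: int     # index one past the last token covered
    label: str   # entity type, e.g. "location"


def anonymise(tokens, annotations, pseudonyms):
    """Return a new token list with annotated names replaced.

    The original tokens are left untouched; 'pseudonyms' maps an
    original surface string to its replacement.
    """
    output = list(tokens)
    # Work right to left so earlier indices stay valid after replacement.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        original = " ".join(tokens[ann.start:ann.end])
        if original in pseudonyms:
            output[ann.start:ann.end] = [pseudonyms[original]]
    return output


tokens = ["They", "lived", "in", "Widness", ",", "Lancashire", "."]
annotations = [Annotation(3, 4, "location"), Annotation(5, 6, "location")]
pseudonyms = {"Widness": "Town A"}  # hypothetical pseudonym matrix

anonymised = anonymise(tokens, annotations, pseudonyms)
print(" ".join(anonymised))
```

Because the annotations live apart from the text, the original file, the pseudonym matrix and the anonymised version can each be stored separately, which matches the three outputs listed above.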


Enhancing multimedia display

ESDS Qualidata Online

XML enables linking to, and simultaneous display of:
  memos and annotations
  other documents
  URLs
  photos
  audio and video
  maps

Future work

from Autumn:
  funding to formalise a data exchange standard
  testing Qualitative Data Interchange Format – Australia Unis
  non-proprietary exchangeable bundle - metadata, data and annotation – expressed in RDF
  testing import and export from CAQDAS packages, e.g. Atlas-ti

develop archiving tool for annotated data

key word extraction systems to help conceptually index qualitative data – text mining collaboration

exploring grid-enabling data – e-science collaboration

we welcome collaboration and testers


Information

ESDS Qualidata Online site: www.esds.ac.uk/qualidata/online/

SQUAD website: quads.esds.ac.uk/projects/squad.asp

Edinburgh NLP tools: www.ltg.ed.ac.uk/software/

NITE NXT toolkit: www.ltg.ed.ac.uk/NITE

ESDS Qualidata site: www.esds.ac.uk/qualidata/

SQUAD staff

Louise Corti - UK Data Archive, Essex (PI)

Claire Grover - LTG, Edinburgh (PI)

Libby Bishop - UK Data Archive, Essex

Maria Milosavljevic - LTG, Edinburgh

Mijail A. Kabadjov - LTG, Edinburgh

