event extraction - yyy•semi-supervised learning 1. a few high-precision seed patterns or seed...

Post on 27-Jun-2020


School of Data Science, Fudan University (复旦大学大数据学院)

Chinese Event Extraction

杨依莹

2017.11.22

Outline

1. ACE program

2. Assignment 3: Chinese event extraction

3. CRF++: Yet Another CRF toolkit

ACE program

Automatic Content Extraction (ACE) program:

• The objective of the Automatic Content Extraction (ACE) program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and as text derived from ASR and OCR).

• The program covers English, Arabic and Chinese texts.

• The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.

Given a text in natural language, the ACE challenge is to detect:

1. entities mentioned in the text, such as: persons, organizations, locations, facilities, weapons.

2. relations between entities, such as: person A is the manager of company B. Relation types include: role, part, located, near, and social.

3. events mentioned in the text, such as: interaction, movement, transfer, creation and destruction.

An example of text (figure not preserved in this transcript)

ACE program: entity

• Entity Detection and Tracking (EDT)

• ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.

• For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured.

ACE program: relation

• Relation Detection and Characterization (RDC):

• RDC involved the identification of relations between entities.

• For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.

• Create new structured knowledge bases, useful for any app

• Augment current knowledge bases

  • Adding words to the WordNet thesaurus, facts to Freebase or DBpedia

  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.

  • Freebase: a dataset derived from Wikipedia infoboxes.

  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement for the Freebase API.

• Support question answering

  • The granddaughter of which actor starred in the movie “E.T.”?

  • (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
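The query in the last bullet can be read as a conjunction of three triple patterns over a knowledge base. The following is a minimal sketch, assuming a toy in-memory set of (relation, subject, object) facts; the facts are illustrative data only, and `answer` is a hypothetical helper rather than part of any real knowledge-base API.

```python
# Toy knowledge base of (relation, subject, object) triples (illustrative data only).
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def answer(facts, movie):
    """Return (x, y) pairs satisfying
    (acted-in ?x movie) (is-a ?y actor) (granddaughter-of ?x ?y)."""
    results = []
    for rel, x, obj in facts:
        if rel != "acted-in" or obj != movie:
            continue
        # For each person who acted in the movie, look for an actor grandparent.
        for rel2, x2, y in facts:
            if rel2 == "granddaughter-of" and x2 == x and ("is-a", y, "actor") in facts:
                results.append((x, y))
    return sorted(results)


print(answer(FACTS, "E.T."))
```

Each extracted relation becomes one more triple in `FACTS`, which is why relation extraction directly widens the set of questions such a query can answer.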


ACE program: relation

• 7 types and 17 subtypes of relations from the “Relation Extraction Task”

• ARTIFACT: User-Owner-Inventor-Manufacturer

• GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin

• ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation

• PART-WHOLE: Geographical, Subsidiary

• PERSON-SOCIAL: Business, Family, Lasting Personal

• PHYSICAL: Located, Near

• Physical-Located (PER-GPE): He was in Tennessee

• Part-Whole-Subsidiary (ORG-ORG): XYZ, the parent company of ABC

• Person-Social-Family (PER-PER): John’s wife Yoko

• Org-AFF-Founder (PER-ORG): Steve Jobs, co-founder of Apple…

• Using patterns to extract relations

• Lexico-syntactic patterns
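The slide does not name a specific pattern, so the sketch below assumes the classic "Y such as X" pattern for is-a relations. The regular expression is a deliberate simplification (noun phrases are approximated by single words), and `extract_is_a` is a hypothetical helper name.

```python
import re

# "Y such as X1, X2 and X3": Y is the hypernym, each X is a hyponym.
# Assumption: every noun phrase is a single word, which real text often violates.
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+(?: (?:and|or) \w+)?)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs matched by the pattern."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        hypernym = match.group(1)
        hyponyms = re.split(r", | and | or ", match.group(2))
        pairs.extend((h, hypernym) for h in hyponyms if h)
    return pairs


print(extract_is_a("He visited cities such as Shanghai, Beijing and Hangzhou last year."))
```

Hand-written patterns like this tend to be precise but cover few phrasings, which is one motivation for the learned approaches on the following slides.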

• Supervised learning

1. Find all pairs of named entities

2. Decide if the 2 entities are related

3. If yes, classify the relation

• The most important step: classification

• e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
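For the example sentence, the candidate pairs and the features handed to a classifier might look like the sketch below. The tokenization, the entity spans and the feature names are assumptions made for illustration; the slides do not specify a feature set, and the classifiers themselves are left out.

```python
from itertools import combinations

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", ",", "immediately",
          "matched", "the", "move", ",", "spokesman", "Tim", "Wagner", "said", "."]

# Assumed output of a named entity recognizer: (start, end, type), end exclusive.
mentions = [(0, 2, "ORG"), (6, 7, "ORG"), (14, 16, "PER")]


def pair_features(tokens, m1, m2):
    """Build a feature dictionary for one ordered pair of entity mentions."""
    (s1, e1, t1), (s2, e2, t2) = m1, m2
    return {
        "types": f"{t1}-{t2}",       # entity types of both arguments
        "head1": tokens[e1 - 1],     # last token of each mention as a rough head word
        "head2": tokens[e2 - 1],
        "between": tokens[e1:s2],    # words between the two mentions
        "distance": s2 - e1,         # number of tokens between the mentions
    }


# Step 1: enumerate all pairs. Steps 2 and 3 would pass each feature
# dictionary to a binary "related?" classifier and then a relation-type classifier.
candidates = [pair_features(tokens, a, b) for a, b in combinations(mentions, 2)]
for features in candidates:
    print(features)
```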

• Semi-supervised learning

1. Start with a few high-precision seed patterns or seed tuples.

2. Find sentences that contain the entities in a seed pair.

3. Extract and generalize the context to learn new patterns.

This may cause semantic drift: as noisier patterns are learned, the extracted tuples move away from the target relation.

• To avoid semantic drift, we introduce a confidence value.

• Set conservative confidence thresholds for the acceptance of new patterns and tuples.
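A minimal sketch of bootstrapping with a confidence threshold is given below. It assumes the corpus has already been reduced to (entity1, context, entity2) triples, and it scores a pattern by the fraction of its extractions that are already known tuples. That scoring rule is one simple choice made for this sketch, not necessarily the one used in the lecture, and the corpus is illustrative data.

```python
def bootstrap(corpus, seeds, threshold=0.5, rounds=2):
    """Grow a set of patterns and tuples from seed tuples.

    corpus: list of (entity1, context, entity2) triples (assumed pre-extracted).
    seeds:  set of (entity1, entity2) tuples known to hold the target relation.
    """
    known = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # Candidate patterns are the contexts that occur with a known tuple.
        candidates = {ctx for e1, ctx, e2 in corpus if (e1, e2) in known}
        for ctx in candidates:
            extracted = {(e1, e2) for e1, c, e2 in corpus if c == ctx}
            confidence = len(extracted & known) / len(extracted)
            if confidence >= threshold:
                patterns.add(ctx)
        # Accept every tuple extracted by an accepted pattern.
        known |= {(e1, e2) for e1, ctx, e2 in corpus if ctx in patterns}
    return patterns, known


corpus = [
    ("Steve Jobs", "co-founder of", "Apple"),
    ("Bill Gates", "co-founder of", "Microsoft"),
    ("Steve Jobs", "visited", "Apple"),
    ("Tim Cook", "visited", "Paris"),
    ("Larry Page", "visited", "Tokyo"),
]
seeds = {("Steve Jobs", "Apple")}

print(bootstrap(corpus, seeds, threshold=0.5))
print(bootstrap(corpus, seeds, threshold=0.0))  # no filtering: "visited" is accepted
```

With the threshold at 0.5, only "co-founder of" is accepted. With no threshold, the unrelated "visited" pattern is learned too and pulls in tuples such as (Tim Cook, Paris), which is the drift the slide warns about.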

ACE program: event

• Event Detection and Characterization (EDC)

Assignment 3: Chinese event extraction

Description

• In this assignment, you will need to use sequence labeling models for Chinese event extraction.

• Event information is defined in two parts:

  • Trigger: the main word that most clearly expresses the occurrence of an event.

  • Argument: an entity, temporal expression or value that plays a certain role in the event.

• For example: “英特尔在中国成立了研究中心” (“Intel established a research center in China”)

  • “成立” (“established”) is the trigger, of type Business

  • “英特尔” (“Intel”), “中国” (“China”) and “研究中心” (“research center”) are the arguments, of type Agent, Place and Org respectively

Description

• This task is separated into two subtasks:

  • Trigger labeling: identify the trigger word in the sentence, and classify it into one of 8 types (listed on the original slide).

  • Argument labeling: identify all the arguments in the sentence, and classify them into 35 types (some were listed on the original slide; all types can be found in the training file).

• You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.

• Note that the performance of HMM can be very poor.
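To make the HMM requirement concrete, here is a minimal sketch of a first-order HMM tagger with add-one smoothing and Viterbi decoding. It is one possible baseline under stated assumptions, not the reference solution: the function names, the smoothing choice, the word segmentation and the label strings in the toy training instance are all chosen for this sketch.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sentences):
    """Count transitions and emissions from sentences of (word, label) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # start-of-sentence state
        for word, label in sent:
            trans[prev][label] += 1
            emit[label][word] += 1
            vocab.add(word)
            prev = label
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable label sequence for words."""
    labels = list(emit)
    n_words = len(vocab) + 1  # one extra outcome for unseen words
    # best[i][label] = (score of the best path ending in label at i, previous label)
    best = [{}]
    for lab in labels:
        score = log_prob(trans["<s>"], lab, len(labels)) + log_prob(emit[lab], words[0], n_words)
        best[0][lab] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for lab in labels:
            prev, score = max(
                ((p, best[i - 1][p][0] + log_prob(trans[p], lab, len(labels))) for p in labels),
                key=lambda item: item[1],
            )
            best[i][lab] = (score + log_prob(emit[lab], words[i], n_words), prev)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


# Toy training instance (assumed segmentation and label names).
train = [[("英特尔", "A_Agent"), ("在", "O"), ("中国", "A_Place"),
          ("成立", "T_Business"), ("了", "O"), ("研究中心", "A_Org")]]
trans, emit, vocab = train_hmm(train)
print(viterbi([w for w, _ in train[0]], trans, emit, vocab))
```

An HMM only conditions on the current word and the previous label, so it cannot use overlapping context features the way a CRF can. That is one plausible reason for the weak HMM performance noted above.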

Formal Definition

Input: A sequence of segmented Chinese words.

Output: Label each word with ‘T_type’ (trigger), ‘A_type’ (argument) or ‘O’ (neither trigger nor argument). Save your predicted label after the real label, separated by a tab.

Fig. 1: input. Fig. 2: training instance. Fig. 3: testing result.
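As a sketch of the output format, the snippet below writes one line per word with the real label followed by the predicted label, separated by tabs. The three-column layout is an interpretation of the definition above, and the segmentation, label names and `format_result` helper are assumptions for illustration.

```python
def format_result(words, gold, predicted):
    """Return 'word<TAB>gold<TAB>predicted' lines, ending with a blank separator line."""
    lines = [f"{w}\t{g}\t{p}" for w, g, p in zip(words, gold, predicted)]
    return "\n".join(lines) + "\n\n"


# Assumed segmentation and labels for the example sentence.
words = ["英特尔", "在", "中国", "成立", "了", "研究中心"]
gold = ["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"]

# A hypothetical model output that misses the Place argument.
predicted = list(gold)
predicted[2] = "O"

result = format_result(words, gold, predicted)
print(result, end="")
```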

Provided Files

• trigger_train.txt & trigger_test.txt:

  • These two files contain 1,918 and 669 instances for training and testing, respectively.

  • Each line contains one word and its label, separated by a tab.

  • Instances are separated by a blank line.

• argument_train.txt & argument_test.txt:

  • These two files contain 2,131 and 997 instances for training and testing, respectively.

• Your job is to predict the sequence labels for the instances in the test files, and write your predictions to the result files. The labels in the test files are only for evaluation.

• eval.py

  • This file can help you evaluate your model’s recall, accuracy, precision and F1-score.
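The file layout and the metrics can be sketched as follows. This is not the provided eval.py, whose exact definitions are not shown here. The sketch assumes a result file with word, real label and predicted label on each line, and it computes precision, recall and F1 over non-‘O’ labels only.

```python
def read_instances(text):
    """Split file content into instances; each row is a list of tab-separated fields."""
    instances, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.split("\t"))
        elif current:  # a blank line closes the current instance
            instances.append(current)
            current = []
    if current:
        instances.append(current)
    return instances


def score(instances):
    """Precision, recall and F1 over labels other than 'O' (assumed definition)."""
    correct = n_pred = n_gold = 0
    for instance in instances:
        for _word, gold, pred in instance:
            n_gold += gold != "O"
            n_pred += pred != "O"
            correct += gold == pred and gold != "O"
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical result file content: word, real label, predicted label.
sample = (
    "英特尔\tA_Agent\tA_Agent\n"
    "在\tO\tO\n"
    "中国\tA_Place\tO\n"
    "成立\tT_Business\tT_Business\n"
    "\n"
    "了\tO\tA_Org\n"
)
instances = read_instances(sample)
print(len(instances), score(instances))
```

Accuracy over all tokens is usually dominated by the ‘O’ label, so precision, recall and F1 on the non-‘O’ labels tend to be the more informative numbers for this kind of task.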

Submission

• Generate a zip file and name it “sid_homework-3.zip”.

• It should include a Python file named “extraction.py”, two output files named “trigger_result.txt” and “argument_result.txt”, and a written report named “chinese eventextraction.pdf”.

• Program: code should be written in Python.

• Report: the report needs to be written in English and be no more than 4 pages long.

Evaluation

• We will mark your homework based on four criteria:

  • Final accuracy (20%)

  • Program (30%)

  • Report (40%)

  • HMM implementation (10%)

Due

• Submit your homework via the E-learning system.

• Deadline: midnight on December 8th, 2017

• If you have any questions about this homework, send an email to the TA or our course mailbox.

• TA in charge: 杨依莹 (zoeyangyy@163.com)


CRF++: Yet Another CRF toolkit

• CRF++ (http://taku910.github.io/crfpp/) is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.

• CRF++ is designed for generic purposes and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.

• Template basics

• Each line in the template file denotes one template. In each template, the special macro %x[row,col] is used to specify a token in the input data: row is the position relative to the current token, and col is the absolute column position.

• Here are some examples of the replacements:

Input data:

He       PRP  B-NP
reckons  VBZ  B-VP
the      DT   B-NP  << CURRENT
current  JJ   I-NP
account  NN   I-NP

template           expanded feature
%x[0,0]            the
%x[0,1]            DT
%x[-1,0]           reckons
%x[-2,1]           PRP
%x[0,0]/%x[0,1]    the/DT
ABC%x[0,1]123      ABCDT123
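The macro expansion in the table can be reproduced with a few lines of Python. This is only a sketch of the substitution rule, not CRF++'s own code; `expand` is a hypothetical helper, and the placeholder returned for positions outside the sentence is chosen for this sketch.

```python
import re

# Matches macros of the form %x[row,col].
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, current):
    """Expand every %x[row,col] macro relative to the token at index current."""
    def replace(match):
        row, col = int(match.group(1)), int(match.group(2))
        index = current + row
        if 0 <= index < len(rows):
            return rows[index][col]
        return "<pad>"  # assumption: placeholder for positions outside the sentence
    return MACRO.sub(replace, template)


rows = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],  # current token
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, rows, current=2))
```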

• Training (encoding)

• Use the crf_learn command:

% crf_learn template_file train_file model_file

• There are 4 major parameters to control the training condition:

  • -a CRF-L2 or CRF-L1: Changes the regularization algorithm. The default setting is L2. Generally speaking, L2 performs slightly better than L1.

  • -c float: With this option, you can change the hyper-parameter for the CRFs. This parameter trades the balance between overfitting and underfitting. With a larger value, the CRF tends to overfit the training data.

  • -f NUM: This parameter sets the cut-off threshold for the features. CRF++ uses the features that occur no less than NUM times in the given training data. The default value is 1.

  • -p NUM: If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.
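Since the assignment code is written in Python, one way to drive training is to assemble the crf_learn command line from Python. The sketch below only builds and prints the command. The helper name, the file names "template" and "trigger_model", and the default values used for -c and -p are assumptions of this sketch.

```python
import shlex


def crf_learn_command(template_file, train_file, model_file,
                      algorithm="CRF-L2", c=1.0, min_freq=1, threads=1):
    """Build the argument list for crf_learn with the four options described above."""
    return [
        "crf_learn",
        "-a", algorithm,      # regularization: CRF-L2 or CRF-L1
        "-c", str(c),         # hyper-parameter balancing over- and underfitting
        "-f", str(min_freq),  # feature frequency cut-off
        "-p", str(threads),   # number of threads
        template_file, train_file, model_file,
    ]


cmd = crf_learn_command("template", "trigger_train.txt", "trigger_model", c=1.5, threads=4)
print(shlex.join(cmd))

# To actually train (requires CRF++ to be installed and on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the options in one function makes it easy to try several values of -c and -f and compare the resulting models on held-out data.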

• Testing (decoding)

• Use the crf_test command:

% crf_test -m model_file test_files

• where model_file is the file crf_learn creates, and test_files are the test data to which you want to assign sequential tags. These files have to be written in the same format as the training file.

• The -v option sets the verbose level. The default value is 0. With -v1, you can also have marginal probabilities for each tag and a conditional probability for the output.

% crf_test -v1 -m model test.data | head

Rockwell       NNP  B  B/0.992465
International  NNP  I  I/0.979089
Corp.          NNP  I  I/0.954883
's             POS  B  B/0.986396
Tulsa          NNP  I  I/0.991966
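The verbose output can be post-processed in Python, for example to find low-confidence tags. The sketch below assumes each token line ends with a tag/probability field as in the sample above, and it skips blank lines and lines starting with "#" on the assumption that those do not describe tokens. `parse_verbose_line` is a hypothetical helper.

```python
def parse_verbose_line(line):
    """Split a '-v1' output line into (other columns, predicted tag, marginal probability)."""
    *columns, tagged = line.split()
    tag, prob = tagged.rsplit("/", 1)
    return columns, tag, float(prob)


output = """\
Rockwell NNP B B/0.992465
International NNP I I/0.979089
Corp. NNP I I/0.954883
's POS B B/0.986396
Tulsa NNP I I/0.991966
"""

parsed = [
    parse_verbose_line(line)
    for line in output.splitlines()
    if line.strip() and not line.startswith("#")
]

# Report the token whose predicted tag has the lowest marginal probability.
columns, tag, prob = min(parsed, key=lambda item: item[2])
print(columns[0], tag, prob)
```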

Thanks for your attention!
