School of Data Science, Fudan University
Chinese Event Extraction
杨依莹
2017.11.22
Outline
1. ACE program
2. Assignment 3: Chinese event extraction
3. CRF++: Yet Another CRF toolkit
ACE program
Automatic Content Extraction (ACE) program:
• The objective of the Automatic Content Extraction (ACE) Program was to develop extraction technology to support automatic processing of source language data (in the form of natural text and as text derived from ASR and OCR).
• The program covers English, Arabic and Chinese texts.
• The ACE corpus is one of the standard benchmarks for testing new information extraction algorithms.
ACE program
Automatic Content Extraction (ACE) program:
Given a text in natural language, the ACE challenge is to detect:
1. entities mentioned in the text, such as: persons, organizations, locations, facilities, weapons.
2. relations between entities, such as: person A is the manager of company B. Relation types include: role, part, located, near, and social.
3. events mentioned in the text, such as: interaction, movement, transfer, creation and destruction.
ACE program
Automatic Content Extraction (ACE) program:
An example of text [figure]
ACE program : entity
• Entity Detection and Tracking (EDT)
  • ACE tasks identified seven types of entities: Person, Organization, Location, Facility, Weapon, Vehicle and Geo-Political Entity (GPE). Each type was further divided into subtypes.
  • For every mention, the annotator identified the maximal extent of the string that represents the entity and labeled the head of each mention. Nested mentions were also captured.
ACE program : relation
• Relation Detection and Characterization (RDC):
  • involved the identification of relations between entities.
  • For every relation, annotators identified two primary arguments (namely, the two ACE entities that are linked) as well as the relation's temporal attributes.
ACE program : relation
• Create new structured knowledge bases, useful for any app
• Augment current knowledge bases
  • Adding words to the WordNet thesaurus, facts to Freebase or DBpedia
  • DBpedia: an ontology derived from Wikipedia containing over 2 billion RDF triples.
  • Freebase: a dataset from Wikipedia infoboxes.
  • On 16 December 2015, Google officially announced the Knowledge Graph API, which is meant to be a replacement for the Freebase API.
• Support question answering
  • The granddaughter of which actor starred in the movie “E.T.”?
  • (acted-in ?x “E.T.”) (is-a ?y actor) (granddaughter-of ?x ?y)
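The question-answering query above can be read as a conjunction of triple patterns with shared variables. The following is a minimal sketch of evaluating such a query over a toy set of triples; the fact set and the helper names (`match`, `query`) are hypothetical and chosen only for illustration, not taken from any knowledge-base API.

```python
# Toy knowledge base of (relation, subject, object) triples.
# The facts and helper names are illustrative assumptions.
FACTS = {
    ("acted-in", "Drew Barrymore", "E.T."),
    ("acted-in", "Henry Thomas", "E.T."),
    ("is-a", "John Barrymore", "actor"),
    ("granddaughter-of", "Drew Barrymore", "John Barrymore"),
}


def is_var(term):
    """Variables are written with a leading '?', as in the slide."""
    return term.startswith("?")


def match(pattern, fact, bindings):
    """Unify one pattern with one fact; return extended bindings or None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if new.setdefault(p, f) != f:  # variable already bound elsewhere
                return None
        elif p != f:
            return None
    return new


def query(patterns, facts=FACTS):
    """Return every variable binding that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for fact in facts
            if (extended := match(pattern, fact, bindings)) is not None
        ]
    return solutions


answers = query([
    ("acted-in", "?x", "E.T."),
    ("is-a", "?y", "actor"),
    ("granddaughter-of", "?x", "?y"),
])
print(answers)
```

Each pattern narrows the set of candidate bindings, so only the pair that satisfies all three conditions survives.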
ACE program : relation
Automatic Content Extraction (ACE) program:
• 7 types and 17 subtypes of relations from the “Relation Extraction Task”. The types and subtypes shown in the figure:
  • PHYSICAL: Located, Near
  • PART-WHOLE: Geographical, Subsidiary
  • PERSON-SOCIAL: Business, Family, Lasting Personal
  • ORG AFFILIATION: Founder, Employment, Membership, Ownership, Student-Alum, Investor, Sports-Affiliation
  • GENERAL AFFILIATION: Citizen-Resident-Ethnicity-Religion, Org-Location-Origin
  • ARTIFACT: User-Owner-Inventor-Manufacturer
ACE program : relation
• Physical-Located (PER-GPE)
  • He was in Tennessee
• Part-Whole-Subsidiary (ORG-ORG)
  • XYZ, the parent company of ABC
• Person-Social-Family (PER-PER)
  • John’s wife Yoko
• Org-AFF-Founder (PER-ORG)
  • Steve Jobs, co-founder of Apple…
ACE program : relation
• Using Patterns to Extract Relations
  • lexico-syntactic patterns
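As an illustration of a lexico-syntactic pattern, the sketch below applies the classic "Y such as X" pattern with a regular expression. It assumes that the extracted X is a capitalized name and uses a single preceding word as Y; a real system would use noun-phrase chunking instead. The function name `extract_is_a` is hypothetical.

```python
import re

# "Y such as X": X is taken to be a kind of Y.
# Assumption of this sketch: X is a run of capitalized words and
# Y is the single word before "such as".
SUCH_AS = re.compile(r"(\w+) such as ([A-Z]\w*(?: [A-Z]\w*)*)")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    return [(x, y) for y, x in SUCH_AS.findall(sentence)]


sentence = ("The program covers languages such as Chinese "
            "and organizations such as Fudan University.")
pairs = extract_is_a(sentence)
print(pairs)
```

Hand-written patterns like this tend to be precise but cover few phrasings, which is one motivation for the learning-based approaches on the following slides.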
ACE program : relation
• Supervised Learning
  1. Find all pairs of named entities
  2. Decide if 2 entities are related
  3. If yes, classify the relation
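The three steps above can be sketched as a small pipeline. The two classifiers are passed in as functions; `toy_is_related` and `toy_classify` are hypothetical stand-ins for trained models, and the entity list is invented for illustration.

```python
from itertools import combinations


def extract_relations(entities, is_related, classify):
    """entities: list of (mention, entity_type) pairs from one sentence."""
    relations = []
    for e1, e2 in combinations(entities, 2):   # step 1: all entity pairs
        if is_related(e1, e2):                 # step 2: related or not?
            label = classify(e1, e2)           # step 3: which relation?
            relations.append((e1[0], label, e2[0]))
    return relations


# Hypothetical stand-ins for the two trained classifiers.
def toy_is_related(e1, e2):
    return {e1[1], e2[1]} == {"PER", "ORG"}


def toy_classify(e1, e2):
    return "ORG-AFF"


entities = [("Steve Jobs", "PER"), ("Apple", "ORG"), ("Tennessee", "GPE")]
relations = extract_relations(entities, toy_is_related, toy_classify)
print(relations)
```

Splitting the decision into a cheap binary filter followed by a multi-class classifier keeps the second model from having to handle the many unrelated pairs.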
ACE program : relation
• Supervised Learning
  • The most important step: classification
  • e.g. American Airlines, a unit of AMR, immediately matched the move, spokesman Tim Wagner said.
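For the classification step, each entity pair is turned into a feature vector. The slide does not list the features, so the following is only a sketch of commonly used ones (head words, entity types, words between the mentions) applied to the example sentence; the token spans, the entity types ORG and PER, and the function name `pair_features` are assumptions.

```python
def pair_features(tokens, m1, m2):
    """m1, m2: (start, end, entity_type) token spans, end exclusive."""
    (s1, e1, t1), (s2, e2, t2) = m1, m2
    between = tokens[e1:s2]
    return {
        "head_m1": tokens[e1 - 1],   # last token as a crude head word
        "head_m2": tokens[e2 - 1],
        "types": f"{t1}-{t2}",
        "word_before_m1": tokens[s1 - 1] if s1 > 0 else "<s>",
        "word_after_m2": tokens[e2] if e2 < len(tokens) else "</s>",
        "bag_between": sorted({w.lower() for w in between if w.isalpha()}),
        "n_between": len(between),
    }


tokens = ("American Airlines , a unit of AMR , immediately matched "
          "the move , spokesman Tim Wagner said .").split()
# Assumed mention spans: "American Airlines" (ORG) and "Tim Wagner" (PER).
feats = pair_features(tokens, (0, 2, "ORG"), (14, 16, "PER"))
print(feats)
```

A dictionary like this can be fed to any standard classifier after vectorization.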
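ACE program : relation
• Semi-supervised Learning
  1. A few high-precision seed patterns or seed tuples.
  2. Finding sentences that contain entities in the seed pair.
  3. Extract and generalize the context to learn new patterns.
• May cause semantic drift: an erroneous pattern introduces erroneous tuples, which in turn lead to more problematic patterns.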
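The bootstrapping loop can be sketched as follows. This is a deliberately simplified illustration: the "pattern" is just the literal text between the two entities, entities are assumed to be runs of capitalized words, and the function name `bootstrap` and the example sentences are hypothetical.

```python
import re

# Assumption of this sketch: an entity is a run of capitalized words.
NAME = r"([A-Z]\w+(?: [A-Z]\w+)*)"


def bootstrap(seed_pairs, sentences, rounds=2):
    """Alternate between learning patterns from pairs and pairs from patterns."""
    pairs, patterns = set(seed_pairs), set()
    for _ in range(rounds):
        # Find sentences containing a known pair and keep the text between
        # the two entities as a (very literal) pattern.
        for x, y in list(pairs):
            for s in sentences:
                if x in s and y in s and s.index(x) < s.index(y):
                    patterns.add(s[s.index(x) + len(x):s.index(y)])
        # Apply every learned pattern to harvest new pairs.
        for p in patterns:
            regex = re.compile(NAME + re.escape(p) + NAME)
            for s in sentences:
                pairs.update(regex.findall(s))
    return pairs, patterns


sentences = [
    "Steve Jobs, co-founder of Apple, spoke at the event.",
    "Bill Gates, co-founder of Microsoft, gave a talk.",
]
pairs, patterns = bootstrap({("Steve Jobs", "Apple")}, sentences)
print(pairs, patterns)
```

Because every harvested pair is trusted in the next round, one overly general pattern can pollute the pair set, which is the semantic drift mentioned above.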
ACE program : relation
• Semi-supervised Learning
  • To avoid semantic drift, we introduce a confidence value.
  • Set conservative confidence thresholds for the acceptance of new patterns and tuples.
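The slides do not say which confidence measure is used. One commonly cited choice is the RlogF score, which multiplies a pattern's precision against the known tuples by the log of how many tuples it finds. The sketch below assumes that measure; the counts and the threshold are invented for illustration.

```python
import math


def rlogf_confidence(hits, finds):
    """hits: known tuples the pattern matched; finds: all tuples it matched."""
    if finds == 0:
        return 0.0
    return hits / finds * math.log(finds)


def accept_patterns(stats, threshold):
    """Keep only the patterns whose confidence reaches the threshold."""
    return {p for p, (hits, finds) in stats.items()
            if rlogf_confidence(hits, finds) >= threshold}


# Hypothetical counts: a precise pattern and a very general one.
stats = {", co-founder of ": (9, 10), " and ": (3, 200)}
accepted = accept_patterns(stats, threshold=1.0)
print(accepted)
```

A conservative threshold rejects the general pattern even though it matches many tuples, because few of its matches agree with what is already known.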
ACE program : event
Automatic Content Extraction (ACE) program:
• Event Detection and Characterization (EDC)
2. Assignment 3: Chinese event extraction
Description
• In this assignment, you will need to use sequence labeling models for Chinese event extraction.
• Event information is defined in two parts:
  • Trigger: the main word that most clearly expresses the occurrence of an event.
  • Argument: an entity, temporal expression or value that plays a certain role in the event.
• For example: “英特尔在中国成立了研究中心” (“Intel established a research center in China”)
  • “成立” (“established”) is the trigger, of type Business
  • “英特尔” (“Intel”), “中国” (“China”) and “研究中心” (“research center”) are the arguments, of types Agent, Place and Org
Description
• This task is separated into two subtasks:
  • Trigger labeling: identify the trigger word in the sentence, and classify it into one of the following 8 types: [figure]
  • Argument labeling: identify all the arguments in the sentence, and classify them into 35 types (some are listed below; all types can be found in the training file): [figure]
• You are required to use both HMM and CRF models for this task. You can use any toolkit for their implementation.
• Note that the performance of the HMM can be very poor.
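For the HMM part, a minimal sketch of a supervised HMM tagger is shown below: counts are collected from labeled sentences, and Viterbi decoding is done in log space with add-one smoothing. The label strings (`T_Business`, `A_Agent`, and so on) follow the 'T_type' / 'A_type' / 'O' scheme but their exact spelling is an assumption, as are the function names; input sentences are assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sentences):
    """sentences: list of [(word, tag), ...]. Returns counts for decoding."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    tags, vocab = set(), set()
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            vocab.add(word)
            prev = tag
    return trans, emit, sorted(tags), len(vocab)


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, model):
    trans, emit, tags, vocab_size = model
    n_tags, n_words = len(tags), vocab_size + 1  # +1 slot for unknown words
    # best[i][t] = (log score of the best path ending in t at i, previous tag)
    best = [{t: (log_prob(trans["<s>"], t, n_tags)
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, n_tags), p)
                for p in tags)
            column[t] = (score + log_prob(emit[t], w, n_words), prev)
        best.append(column)
    # Backtrace from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


train = [[("英特尔", "A_Agent"), ("在", "O"), ("中国", "A_Place"),
          ("成立", "T_Business"), ("了", "O"), ("研究中心", "A_Org")]]
model = train_hmm(train)
predicted = viterbi([w for w, _ in train[0]], model)
print(predicted)
```

An HMM only conditions on the previous tag and the current word, which is one reason its results on this task can be much weaker than those of a CRF with richer features.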
Formal Definition
Input: A sequence of segmented Chinese words.
Output: Label each word with ‘T_type’ (trigger), ‘A_type’ (argument) or ‘O’ (neither trigger nor argument). Save your predicted label after the real label, separated by a tab.
Fig. 1: input    Fig. 2: training instance    Fig. 3: testing result
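The input and output formats can be handled with two small helpers. This sketch assumes the layout used by the provided files (one word and its label per line, separated by a tab, with a blank line between instances) and writes the predicted label as a third tab-separated column. The function names are hypothetical.

```python
def read_instances(lines):
    """Group 'word<TAB>label' lines into instances separated by blank lines."""
    instances, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():           # a blank line closes the instance
            if current:
                instances.append(current)
                current = []
            continue
        word, label = line.split("\t")[:2]
        current.append((word, label))
    if current:
        instances.append(current)
    return instances


def format_results(instances, predictions):
    """Emit 'word<TAB>real label<TAB>predicted label' lines."""
    out = []
    for inst, pred in zip(instances, predictions):
        out.extend(f"{w}\t{gold}\t{p}" for (w, gold), p in zip(inst, pred))
        out.append("")                 # blank line between instances
    return "\n".join(out)


sample = ["英特尔\tA_Agent\n", "成立\tT_Business\n", "\n", "了\tO\n"]
instances = read_instances(sample)
result_text = format_results(instances, [["A_Agent", "O"], ["O"]])
print(result_text)
```

In practice `lines` would come from an open file handle, and `result_text` would be written to the result file.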
Provided Files
• trigger_train.txt & trigger_test.txt:
  • These two files contain 1,918 and 669 instances for training and testing, respectively.
  • Each line contains one word and its label, separated by a tab.
  • Instances are separated by a blank line.
• argument_train.txt & argument_test.txt:
  • These two files contain 2,131 and 997 instances for training and testing, respectively.
• Your job is to predict the sequence labels for the instances in the test files and write your predictions to the result files. The labels in the test files are only for evaluation.
• eval.py
  • This file can help you evaluate your model’s recall, accuracy, precision and F1-score.
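The contents of eval.py are not shown in the slides, so the following is only a sketch of how the four metrics might be computed at the token level, treating every non-'O' label as a positive. The provided script may define them differently; the function name `token_metrics` and the example labels are hypothetical.

```python
def token_metrics(gold, pred):
    """Token-level precision, recall, F1 and accuracy; 'O' is the negative class."""
    correct_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    precision = correct_pos / n_pred if n_pred else 0.0
    recall = correct_pos / n_gold if n_gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (sum(g == p for g, p in zip(gold, pred)) / len(gold)
                if gold else 0.0)
    return precision, recall, f1, accuracy


gold = ["A_Agent", "O", "A_Place", "T_Business", "O", "A_Org"]
pred = ["A_Agent", "O", "O", "T_Business", "A_Org", "A_Org"]
precision, recall, f1, accuracy = token_metrics(gold, pred)
print(precision, recall, f1, accuracy)
```

Because most tokens are 'O', accuracy alone can look high even for a weak model, which is why precision, recall and F1 on the non-'O' labels are reported alongside it.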
Submission
• Generate a zip file and name it “sid_homework-3.zip”.
• It should include a Python file named “extraction.py”, two output files named “trigger_result.txt” and “argument_result.txt”, and a written report named “chinese event extraction.pdf”.
• Program: the code should be written in Python.
• Report: the report needs to be written in English, with no more than 4 pages.
Evaluation
• We will mark your homework based on four criteria:
  • Final accuracy (20%)
  • Program (30%)
  • Report (40%)
  • HMM implementation (10%)
Due
• Submit your homework via the E-learning system.
• Deadline: midnight on December 8th, 2017
• If you have any questions about this homework, send an email to the TA or our course mailbox.
• TA in charge
  • 杨依莹 ([email protected])
3. CRF++: Yet Another CRF toolkit
CRF++: Yet Another CRF toolkit
• CRF++ (http://taku910.github.io/crfpp/) is a simple, customizable, and open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
• CRF++ is designed for generic purposes and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction and Text Chunking.
CRF++: Yet Another CRF toolkit
• Template basic
  • Each line in the template file denotes one template. In each template, the special macro %x[row,col] is used to specify a token in the input data: row is the position relative to the current token, and col is the absolute column position.
  • Here are some examples of the replacements:
Input: Data
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP << CURRENT
current   JJ   I-NP
account   NN   I-NP

template            expanded feature
%x[0,0]             the
%x[0,1]             DT
%x[-1,0]            reckons
%x[-2,1]            PRP
%x[0,0]/%x[0,1]     the/DT
ABC%x[0,1]123       ABCDT123
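The macro expansion in the table can be reproduced with a few lines of Python. This is a sketch of the substitution rule only, not CRF++'s own implementation; the function name `expand` and the `<PAD>` placeholder for positions outside the sentence are assumptions of this example.

```python
import re

# Matches the %x[row,col] macro; row may be negative, col is a column index.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, current):
    """Expand every %x[row,col] macro relative to the token at index `current`."""
    def replace(m):
        row = current + int(m.group(1))
        col = int(m.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"  # placeholder chosen for this sketch only
    return MACRO.sub(replace, template)


rows = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],      # current token
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]
for template in ["%x[0,0]", "%x[0,1]", "%x[-1,0]", "%x[-2,1]",
                 "%x[0,0]/%x[0,1]", "ABC%x[0,1]123"]:
    print(template, "->", expand(template, rows, current=2))
```

Text outside the macro is copied unchanged, which is why `ABC%x[0,1]123` becomes `ABCDT123`.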
CRF++: Yet Another CRF toolkit
• Training (encoding)
  • Use the crf_learn command:
% crf_learn template_file train_file model_file
• There are 4 major parameters to control the training condition:
  • -a CRF-L2 or CRF-L1: Changes the regularization algorithm. The default setting is L2. Generally speaking, L2 performs slightly better than L1.
  • -c float: With this option, you can change the hyper-parameter for the CRFs. This parameter trades off overfitting against underfitting; with a larger value, the CRF tends to overfit the training data.
  • -f NUM: This parameter sets the cut-off threshold for the features. CRF++ uses the features that occur no fewer than NUM times in the given training data. The default value is 1.
  • -p NUM: If the PC has multiple CPUs, you can make the training faster by using multi-threading. NUM is the number of threads.
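Since the assignment code is in Python, the training command can be assembled and launched from a script. The sketch below only builds the argument list from the four options described above; the file names `template` and `trigger.model` are hypothetical, and running the command requires CRF++ to be installed and on the PATH.

```python
import shlex


def crf_learn_command(template, train, model,
                      algorithm="CRF-L2", c=1.0, min_freq=1, threads=1):
    """Build the crf_learn argument list from the four options on the slide."""
    return ["crf_learn",
            "-a", algorithm,      # regularization: CRF-L2 or CRF-L1
            "-c", str(c),         # overfitting / underfitting trade-off
            "-f", str(min_freq),  # feature frequency cut-off
            "-p", str(threads),   # number of threads
            template, train, model]


cmd = crf_learn_command("template", "trigger_train.txt", "trigger.model",
                        c=1.5, min_freq=2, threads=4)
print(shlex.join(cmd))
# To actually train (assumes CRF++ is installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.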
CRF++: Yet Another CRF toolkit
• Testing (decoding)
  • Use the crf_test command:
% crf_test -m model_file test_files
• where model_file is the file crf_learn creates. test_file is the test data to which you want to assign sequential tags. This file has to be written in the same format as the training file.
• The -v option sets the verbose level. The default value is 0. With a higher level, you can also obtain the marginal probability for each tag and a conditional probability for the output.
% crf_test -v1 -m model test.data | head
Rockwell        NNP  B  B/0.992465
International   NNP  I  I/0.979089
Corp.           NNP  I  I/0.954883
's              POS  B  B/0.986396
Tulsa           NNP  I  I/0.991966
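The verbose output can be post-processed in Python, for example to find tokens that the model tagged with low confidence. This sketch assumes the column layout shown in the example above (feature columns, then the real tag, then predicted tag and probability joined by a slash); the function name and the 0.96 threshold are hypothetical.

```python
def parse_verbose_line(line):
    """Split one 'crf_test -v1' token line into its parts."""
    *features, gold, tagged = line.split()
    tag, prob = tagged.rsplit("/", 1)
    return {"features": features, "gold": gold,
            "predicted": tag, "prob": float(prob)}


output = """\
Rockwell NNP B B/0.992465
International NNP I I/0.979089
Corp. NNP I I/0.954883
's POS B B/0.986396
Tulsa NNP I I/0.991966
"""
# Skip blank lines and any comment lines starting with '#', if present.
tokens = [parse_verbose_line(line) for line in output.splitlines()
          if line.strip() and not line.startswith("#")]
uncertain = [t["features"][0] for t in tokens if t["prob"] < 0.96]
print(uncertain)
```

Inspecting the least confident tokens is a quick way to see where additional template features might help.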
Thanks for your attention!