Towards Constructing a Chinese Information Extraction System to Support Innovations in Library Services
World Library and Information Congress: 72nd IFLA General Conference and Council, 20-24 August 2006, Seoul, Korea
Library of Chinese Academy of Sciences
Zhang Zhixiong, Li Sa, Wu Zhengxin, Lin Ying
Outline
1. Introduction
2. What is IE (Information Extraction)?
3. Potential functions in Innovations of Library Services
4. Constructing a Chinese Information Extraction System
5. Tests and Evaluation
1. Introduction
Library of Chinese Academy of Sciences
– Now changing its name to National Science Library of China
– About 400 staff; headquarters in Beijing, with 3 branches in Lanzhou, Chengdu, and Wuhan
– Serves 90 CAS research institutes across the country
– In 2001, initiated the Chinese National Science Digital Library (CSDL) program
1. Introduction
CSDL (Chinese National Science Digital Library)
– Provided abundant digital information resources for users (e-journals: 6,000 Western, 11,000 Chinese; 15,000 in one day)
– Developed information systems to support networked services:
  Union Catalogs & Document Delivery
  Federated database search
  Digital reference
  Remote authentication
1. Introduction
CSDL (Chinese National Science Digital Library), continued
– Carried out many training and promotion programs
1. Introduction
CSDL has become one of the key research facilities for researchers and graduate students of CAS.
However:
– The information requirements of researchers and graduate students have changed rapidly
– Traditional information retrieval methods are not sufficient
1. Introduction
The users of CSDL want to:
– get rid of information noise
– effectively get a comprehensive view of recent developments in a domain
– disclose significant relationships between pieces of information
The librarians of CSDL want to:
– improve the service standard of CSDL
– turn the digital library into a knowledge repository
1. Introduction
Information Extraction (IE) is an emerging technology that serves our needs
2. What is IE (Information Extraction)?
NLP Research Group, University of Sheffield
– Information extraction (IE) is a term that has come to be applied to the activity of automatically extracting pre-specified sorts of information from natural language texts
2. What is IE (Information Extraction)?
Dr. Hamish Cunningham
– IE is a process that takes texts (and sometimes speech) as input and produces fixed-format, unambiguous data as output
– Input: unstructured free text
– Output: fixed-format, unambiguous data
2. What is IE (Information Extraction)?
Output (a structured information source) can be used for:
– searching
– analysis
– generating summaries
– constructing indices
##### ####### NHS TRUST - PATIENT CASE NOTE ########:######### ####### DOB: 1944 CLEF-RMH-Entry-Key: 52A4F6DB2B46E
AB 1992 Seen in General Surgical This lady who has had a mastectomy and left open capsulotomy and removal of her prosthesis was seen by me in the clinic today on behalf of XXXXXXXXXXX. She has extensive bony lymphoedema in her left arm which does not seem to be getting any better although she is more or less reconciled to the problem. The original problem was that she complained of shooting pain in the direction of ulna nerve and although there does not seem to be any evidence of local, regional or distant recurrence the pain itself warrants management in a pain clinic. XXXXXXXXX could be seen in the pain clinic at the XXXXXXX but as this would involve a lot of travelling would like to be treated nearer her home. I wonder whether it would be possible for you to investigate if there is a pain clinic available at XXXXXXXXXXX as I am sure XXXXX could be treated and benefit from its management. I have otherwise arranged for her to be seen in the clinic again in a year's time. There are no signs of recurrence at this time.
5213A4F612F1
IE: An example
Extracted information could be collected…
Interventions: mastectomy; left open capsulotomy; removal of her prosthesis; management
Problems: recurrence; no signs of recurrence; bony lymphoedema; shooting pain in the direction of ulna nerve; pain
Problem Site: left arm; local, regional or distant
Locations: General Surgical; clinic; pain clinic
Time: today; a year's time; at this time
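The phrases collected in the example above can be held as fixed-format records and then grouped for searching or index construction. The following is a minimal Python sketch of that idea; the record layout and the `build_index` helper are assumptions made for illustration and are not part of the system described in these slides.

```python
from collections import defaultdict

# Fixed-format records, one per extracted phrase. Categories and phrases are
# taken from the case-note example; the record layout itself is an assumption.
records = [
    {"category": "Problems", "text": "bony lymphoedema"},
    {"category": "Problems", "text": "shooting pain in the direction of ulna nerve"},
    {"category": "Problem Site", "text": "left arm"},
    {"category": "Interventions", "text": "mastectomy"},
    {"category": "Locations", "text": "pain clinic"},
    {"category": "Time", "text": "a year's time"},
]


def build_index(recs):
    """Group extracted phrases by category so they can be looked up directly."""
    index = defaultdict(list)
    for rec in recs:
        index[rec["category"]].append(rec["text"])
    return dict(index)


index = build_index(records)
print(index["Problems"])
```

Because every record has the same fields, the same list could equally feed a database table or a summary generator.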
2. What is IE (Information Extraction)?
5 kinds of Information Extraction tasks
– Named Entity recognition (NE)
– Coreference resolution (CO)
– Template Element construction (TE)
– Template Relation construction (TR)
– Scenario Template production (ST)
2. What is IE (Information Extraction)?
NE is about finding entities
CO is about which entities and references (such as pronouns) refer to the same thing
TE is about what attributes entities have
TR is about what relationships there are between entities
ST is about events that the entities participate in
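To make the five task types concrete, here is a small Python sketch of what each task might output for one short text. The sentence, the entity ids, and all field names are hypothetical and invented for illustration only; they do not come from the slides or from any particular IE toolkit.

```python
# Hypothetical input text, invented for illustration only.
text = "Dr. Smith joined Acme Corp in 2005. She now leads its research lab."

# NE: entities found in the text, with their types.
named_entities = [
    {"id": "e1", "text": "Dr. Smith", "type": "Person"},
    {"id": "e2", "text": "Acme Corp", "type": "Organization"},
    {"id": "e3", "text": "2005", "type": "Date"},
]

# CO: references (here, pronouns) linked to the entity they refer to.
coreferences = [
    {"mention": "She", "refers_to": "e1"},
    {"mention": "its", "refers_to": "e2"},
]

# TE: attributes attached to each entity.
template_elements = {
    "e1": {"name": "Smith", "title": "Dr."},
    "e2": {"name": "Acme Corp"},
}

# TR: relationships between entities.
template_relations = [
    {"relation": "employee_of", "subject": "e1", "object": "e2"},
]

# ST: an event that the entities participate in.
scenario_template = {
    "event": "joining",
    "person": "e1",
    "organization": "e2",
    "date": "e3",
}


def resolve(mention):
    """Return the entity text that a mention refers to, or None if unknown."""
    by_id = {e["id"]: e["text"] for e in named_entities}
    for link in coreferences:
        if link["mention"] == mention:
            return by_id[link["refers_to"]]
    return None


print(resolve("She"))
```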
2. What is IE (Information Extraction)?
Information Extraction will:
– play a very important role in coping with the huge collections of digital information
– bring innovations in library services
3. Potential functions in Innovations of Library Services
1. Automatic annotation and metadata creation
– automatic annotation of digital materials
– automatic acquisition of metadata
– For example, MnM, S-CREAM, AERODAML, SemTag, KIM, hTechsight
– ontology-based IE techniques
3. Potential functions in Innovations of Library Services
2. Improving data mining in information analysis
– Large-scale data analysis
– Detection of many types of evidence
– Get enough structured data for analysis
3. Potential functions in Innovations of Library Services
3. Developing knowledge bases from free text
– statistical and numeric databases
– terminological databases
– fact sheets
– SOBA (SmartWeb Ontology-Based Annotation)
3. Potential functions in Innovations of Library Services
4. Generating answers in digital reference systems
– Most research libraries have established a digital reference service
– Can we get answers directly from information systems?
– Natural language QA (Question Answering)
SO…
IE is very important
How to build an IE system (Chinese)?
CSDL is trying to find an effective way
4. Constructing a Chinese Information Extraction System
A Chinese IE solution
– which makes full use of GATE
– trying to develop a Chinese IE plug-in to process Chinese information resources, based on the GATE framework
4. Constructing a Chinese Information Extraction System
GATE
– (General Architecture for Text Engineering)
– Open source, developed since 1995
– GATE, a framework: Language Resources (LRs), Processing Resources (PRs), Visual Resources (VRs)
– ANNIE (A Nearly-New IE system): tokeniser, sentence splitter, POS tagger, gazetteer, finite state transducer and orthomatcher
ANNIE Pipeline
GATE: good for English
GATE: Not so good for Chinese
4. Constructing a Chinese Information Extraction System
Key difficulties for Chinese information extraction
– Chinese tokenizing
– Chinese gazetteers
– Chinese named entity recognition
Chinese tokenizing
English language
– words are separated by white space and punctuation
Chinese language
– no separation between words
A simple sentence
(I am a Chinese)
can be broken into several forms by a segmenter:
(I am a Chinese)
(I am China person)
(I am center country person)
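The ambiguity above can be reproduced with a few lines of Python. This is only a sketch: the Chinese characters are missing from this transcript, so the sentence 我是中国人 is an assumed reconstruction from the English glosses, and the tiny lexicon and the `segmentations` function are hypothetical, not the segmenter used in the system.

```python
def segmentations(sentence, lexicon):
    """Return every way to split `sentence` into words found in `lexicon`."""
    if not sentence:
        return [[]]
    results = []
    for end in range(1, len(sentence) + 1):
        word = sentence[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(sentence[end:], lexicon):
                results.append([word] + rest)
    return results


# Assumed lexicon: I, am, center, country, person, China, Chinese (person).
lexicon = {"我", "是", "中", "国", "人", "中国", "中国人"}
sentence = "我是中国人"  # assumed original of "(I am a Chinese)"

all_segs = segmentations(sentence, lexicon)
for seg in all_segs:
    print(" / ".join(seg))
```

With this lexicon the function finds three candidate segmentations, matching the three readings listed on the slide. A real segmenter has to choose among such candidates, which is why tokenizing is a harder step for Chinese than for English.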
Chinese gazetteers
GATE gazetteer lists for English
– very abundant
GATE gazetteer lists for Chinese processing
– simple and short gazetteers such as date, time, organization, location, money, province, etc.
– for a flexible language like Chinese, the lists are very limited
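A gazetteer is essentially a list of known names with a type attached, and lookup means finding the longest run of tokens that matches an entry. The sketch below illustrates the idea in Python; it is not GATE's gazetteer implementation, and the entries, the token list, and the `gazetteer_lookup` function are assumptions made for illustration.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=4):
    """Annotate token spans that match a gazetteer entry, longest match first.

    Returns tuples of (start_token, end_token, matched_text, entry_type).
    """
    annotations = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate span first, then shrink it.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + n])
            if candidate in gazetteer:
                match = (i, i + n, candidate, gazetteer[candidate])
                break
        if match:
            annotations.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return annotations


# Hypothetical entries: Beijing, Lanzhou, Chinese Academy of Sciences.
gazetteer = {"北京": "location", "兰州": "location", "中国科学院": "organization"}
# Already segmented text: "Chinese Academy of Sciences headquarters is in Beijing".
tokens = ["中国", "科学院", "总部", "在", "北京"]

found = gazetteer_lookup(tokens, gazetteer)
print(found)
```

The example also shows why short lists are limiting: any name that is not in the list, or that the segmenter splits in an unexpected way, is simply not found.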
Chinese named entity recognition
The GATE system uses JAPE (Java Annotation Patterns Engine) rules to recognize NEs
JAPE rules
The grammar of Chinese is quite different from that of English
The JAPE rules provided by GATE are not suitable for Chinese texts
We need to rewrite JAPE rules to implement Chinese information extraction
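JAPE rules pair a pattern over existing annotations with an action that creates a new annotation. The Python sketch below imitates that pattern-and-action style for one toy Chinese rule; it is not JAPE syntax and not one of the rules written for this system. The suffix list, the token format, and the `apply_org_rule` function are all hypothetical.

```python
# Hypothetical organization suffixes: university, library, research institute.
ORG_SUFFIXES = {"大学", "图书馆", "研究所"}


def apply_org_rule(annotated_tokens):
    """Toy rule: a location Lookup followed by an organization suffix
    is annotated as an Organization."""
    entities = []
    for i in range(len(annotated_tokens) - 1):
        word, lookup_type = annotated_tokens[i]
        next_word, _ = annotated_tokens[i + 1]
        # Pattern: {Lookup.type == location} {Token in ORG_SUFFIXES}
        if lookup_type == "location" and next_word in ORG_SUFFIXES:
            # Action: create one Organization annotation over both tokens.
            entities.append((word + next_word, "Organization"))
    return entities


# Segmented tokens with gazetteer types: "I study at Beijing University".
tokens = [("我", None), ("在", None), ("北京", "location"), ("大学", None), ("学习", None)]

orgs = apply_org_rule(tokens)
print(orgs)
```

The rule depends on both earlier steps: segmentation must yield the right tokens, and the gazetteer must mark the location, before the pattern can match.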
Solutions to the problems
Three main tasks we have done
1. Integrating ICTCLAS to perform word segmentation
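One way to picture this integration step is as a conversion from the segmenter's output into token annotations with character offsets, which later components can consume. The Python sketch below assumes the segmenter returns space-separated `word/tag` pairs; that output format, the tags shown, and the `to_token_annotations` function are assumptions for illustration, not the actual ICTCLAS or GATE API.

```python
def to_token_annotations(segmented):
    """Convert 'word/tag word/tag ...' text into tokens with character offsets.

    Offsets refer to positions in the original, unsegmented sentence.
    """
    tokens = []
    offset = 0
    for item in segmented.split():
        # Split on the last slash so a word containing "/" stays intact.
        word, _, tag = item.rpartition("/")
        tokens.append({
            "start": offset,
            "end": offset + len(word),
            "string": word,
            "pos": tag,
        })
        offset += len(word)
    return tokens


# Assumed segmenter output for the sentence 我是中国人.
segmented = "我/r 是/v 中国/ns 人/n"

token_annotations = to_token_annotations(segmented)
for tok in token_annotations:
    print(tok)
```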
Three main tasks we have done
2. Developing Chinese gazetteers to enrich GATE language resources
Three main tasks we have done
3. Rewriting JAPE rules to recognize Chinese NEs
Chinese JAPE rule
5. Tests and Evaluation
After one year of work, we implemented the system
Carried out experiments
Same piece of article
Our output
Conclusions
Brought forth a solution for a Chinese information extraction system
Carried out a valuable experiment
Much work still needs to be done
Laid a good foundation for our future work
Thanks!