Information Extraction from Text
Why study it?
• It is a particularly complex application.
• It draws on most of the resources used in text analysis tasks.
• Studying it therefore gives a good overview of the problems and technologies involved in natural language analysis.

What is Information Extraction from Text?
• Information retrieval (IR): searching for and finding information in texts in response to specific requests.
• Passage retrieval: searching for and finding passages (paragraphs, sentences) within a text that can answer given questions.
• Information extraction (IE): finding information that can fill predefined schemas (templates).
• Question answering: answering general questions posed by a user: IE + IR.
• Text comprehension: modeling how humans understand texts.

Type of questions
• IR
• Passage retrieval
• IE
• Question answering
• Text comprehension
Predefined. Fixed aspects of the textual information.

What is “Information Extraction”?
As a task: filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

IE output:

| NAME | TITLE | ORGANIZATION |
| --- | --- | --- |
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
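
As a rough illustration of "filling slots from sub-segments of text", the sketch below uses three hand-written regular expressions to fill NAME, TITLE, and ORGANIZATION from an abridged copy of the article. The `Record` class, the `PATTERNS` list, and the `extract` function are hypothetical names introduced here, and the patterns are tuned to this one passage; they are not the method of any system described in the slides.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One row of the NAME / TITLE / ORGANIZATION table (hypothetical type)."""
    name: str
    title: str
    organization: str


# Abridged copy of the article used on the slides.
TEXT = (
    "For years, Microsoft Corporation CEO Bill Gates railed against the "
    "economic philosophy of open-source software. "
    '"We love the concept of shared source," said Bill Veghte, a Microsoft VP. '
    "Richard Stallman, founder of the Free Software Foundation, countered saying..."
)

# Assumption: a person name is exactly two capitalized words.
NAME = r"[A-Z][a-z]+ [A-Z][a-z]+"

PATTERNS = [
    # "<Organization> [Corporation] CEO <Name>"
    re.compile(rf"(?P<organization>[A-Z][a-z]+)(?: Corporation)? (?P<title>CEO) (?P<name>{NAME})"),
    # "<Name>, a <Organization> <TITLE>"
    re.compile(rf"(?P<name>{NAME}), a (?P<organization>[A-Z][a-z]+) (?P<title>[A-Z]+)\b"),
    # "<Name>, founder of the <Organization>"
    re.compile(rf"(?P<name>{NAME}), (?P<title>founder) of the (?P<organization>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
]


def extract(text):
    """Return one Record per pattern match, in pattern order."""
    records = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            records.append(Record(m["name"], m["title"], m["organization"]))
    return records


for record in extract(TEXT):
    print(record.name, "|", record.title, "|", record.organization)
```

Patterns of this kind are brittle: each new phrasing needs a new rule, which is one reason IE is usually decomposed into several more general steps.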

What is “Information Extraction”?
As a family of techniques: Information Extraction = segmentation + classification + association + clustering

Segments found in the article above:
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
aka “named entity extraction”

Resulting table:

| NAME | TITLE | ORGANIZATION |
| --- | --- | --- |
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
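
The clustering step, which groups mentions that refer to the same entity (for example "Gates" and "Bill Gates"), can be sketched with a very simple heuristic: a shorter mention joins the cluster of a longer mention whose words contain all of its words. The function name `cluster_mentions` and the heuristic itself are assumptions made for illustration; the slides do not say how clustering is actually performed.

```python
# Segments as listed on the slide, in order of appearance.
SEGMENTS = [
    "Microsoft Corporation", "CEO", "Bill Gates", "Microsoft", "Gates",
    "Microsoft", "Bill Veghte", "Microsoft", "VP",
    "Richard Stallman", "founder", "Free Software Foundation",
]


def cluster_mentions(segments):
    """Group mentions whose words are a subset of a longer mention's words.

    Returns a dict mapping the fullest form of each mention to the list of
    distinct surface forms assigned to it.
    """
    # Longest mentions first, so each cluster is keyed by its fullest form.
    unique = sorted(set(segments), key=lambda s: (-len(s.split()), s))
    clusters = {}
    for mention in unique:
        words = set(mention.split())
        for canonical in clusters:
            if words <= set(canonical.split()):
                clusters[canonical].append(mention)
                break
        else:
            # No existing cluster covers this mention: start a new one.
            clusters[mention] = [mention]
    return clusters


for canonical, mentions in cluster_mentions(SEGMENTS).items():
    print(canonical, "<-", mentions)
```

The heuristic is deliberately naive: a bare mention "Bill" would be attached to whichever of "Bill Gates" or "Bill Veghte" is tried first, so a real system needs context to resolve such cases.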

An example: FASTUS (1993)

Bridgestone Sports Co. said Friday it had set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be supplied to Japan. The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with production of 20,000 iron and “metal wood” clubs a month.

TIE-UP-1
Relationship: TIE-UP
Entities: “Bridgestone Sports Co.”, “a local concern”, “a Japanese trading house”
Joint Venture Company: “Bridgestone Sports Taiwan Co.”
Activity: ACTIVITY-1
Amount: NT$20000000

ACTIVITY-1
Activity: PRODUCTION
Company: “Bridgestone Sports Taiwan Co.”
Product: “iron and ‘metal wood’ clubs”
Start Date: DURING: January 1990

How FASTUS works
1. Complex words and proper names, e.g. “set up”, “new Taiwan dollars”
2. Simple phrases (noun groups, verb groups, particles), e.g. “a Japanese trading house”, “had set up”
3. Complex phrases, e.g. “production of 20,000 iron and metal wood clubs”
4. Relevant events and construction of simple templates, e.g. [company] [set up] [Joint-Venture] with [company]
5. Merging of templates that carry information about the same event
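
The cascade above can be sketched in a few lines of Python on the first Bridgestone sentence. This is only a toy under stated assumptions: the word list `COMPLEX_WORDS`, the regular expressions, and the function names are hypothetical and written for this one sentence, stages 2 and 3 are collapsed into a single company-tagging pass, and stage 5 (template merging) is not implemented. It is not the FASTUS implementation.

```python
import re

SENTENCE = (
    "Bridgestone Sports Co. said Friday it had set up a joint venture "
    "in Taiwan with a local concern and a Japanese trading house "
    "to produce golf clubs to be supplied to Japan."
)

# Stage 1: complex words (multiword expressions) become single tokens.
COMPLEX_WORDS = ["set up", "joint venture", "trading house"]


def join_complex_words(text):
    for phrase in COMPLEX_WORDS:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text


# Stages 2-3 (collapsed): noun groups denoting companies are wrapped in
# [company ...] brackets.
COMPANY_GROUP = re.compile(
    r"(?:[A-Z][a-z]+ )+Co\."                            # "Bridgestone Sports Co."
    r"|\ban? (?:[A-Za-z]+ )?(?:concern|trading_house)"  # "a local concern", ...
)


def tag_companies(text):
    return COMPANY_GROUP.sub(lambda m: f"[company {m.group(0)}]", text)


# Stage 4: the event pattern [company] [set up] [Joint-Venture] with [company].
TIE_UP = re.compile(
    r"\[company (?P<first>[^\]]+)\].*?set_up a joint_venture.*?"
    r" with (?P<partners>\[company [^\]]+\](?: and \[company [^\]]+\])*)"
)


def extract_tie_up(sentence):
    """Return a partial TIE-UP template, or None if the pattern does not match."""
    tagged = tag_companies(join_complex_words(sentence))
    m = TIE_UP.search(tagged)
    if m is None:
        return None
    partners = re.findall(r"\[company ([^\]]+)\]", m["partners"])
    entities = [e.replace("_", " ") for e in [m["first"], *partners]]
    return {"Relationship": "TIE-UP", "Entities": entities}


print(extract_tie_up(SENTENCE))
```

Each stage works only on the output of the previous one, which is the point of a cascade: later patterns can be written over phrases and company tags instead of raw words.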

Another example: a wrong template

………. Jurgen Pfrang, 51, reportedly stumbled upon the robbers on the second floor of his Nanjing home early on Sunday. The deputy general manager of Yaxing Benz, a Sino-German joint venture that makes buses and bus chassis in nearby Yangzhou, was hacked to death with 45 cm watermelon knives. ……….

Wrong template:
Name of the Venture: Yaxing Benz
Products: buses and bus chassis
Location: Yangzhou, China
Companies involved:
(1) Name: X? Country: German
(2) Name: Y? Country: China

Right template for the same passage, introduced by “A German vehicle-firm executive was stabbed to death ….”:
Crime-Type: Murder
Type: Stabbing
The killed:
  Name: Jurgen Pfrang
  Age: 51
  Profession: Deputy general manager
Location: Nanjing, China

Who performs the interpretation?

| Task | Interpretation performed by |
| --- | --- |
| (1) IR | User |
| (2) Passage retrieval | User |
| (3) IE | System |
| (4) Question answering | System |
| (5) Text comprehension | System |

[Diagram: an IR system, with the labels “collection of texts”, “request”, and “characterization of the texts”.]
[Diagram: the same IR system, with the added labels “interpretation” and “knowledge”.]
[Diagram: passage retrieval added to IR, with the same labels.]
[Diagram: an IE system, from texts to templates, based on natural language processing and shown together with IR and passage retrieval; labels: “collection of texts”, “queries”, “characterization of the texts”, “interpretation”, “knowledge”.]
[Diagram: IE system, from texts to templates; labels: “interpretation”, “knowledge”.]
[Diagram: IE, from texts to predefined templates; labels: “interpretation”, “knowledge”.]
General approach to natural language processing/understanding
IE: a pragmatic approach to NLP

Performance evaluation
(1) IR
(2) Passage retrieval
(3) IE
(5) Text comprehension
(4) Question answering
Clear methodology
Unclear methodology
Clear methodology
Fairly vague methodology
Very vague methodology

[Diagram: the set of documents for a question, with the regions N, M, and C.]
N: correct documents
M: retrieved documents
C: retrieved documents that are correct

Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-value: $F = \frac{2 P \cdot R}{P + R}$
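
A minimal sketch of how precision, recall, and F-value follow from the counts N, M, and C, assuming documents are identified by hashable ids. The function name `precision_recall_f` and the document ids are made up for the example.

```python
def precision_recall_f(correct, retrieved):
    """Compute (P, R, F) from the set of correct and the set of retrieved items."""
    correct, retrieved = set(correct), set(retrieved)
    c = len(correct & retrieved)                  # C: retrieved and correct
    p = c / len(retrieved) if retrieved else 0.0  # P = C / M
    r = c / len(correct) if correct else 0.0      # R = C / N
    f = 2 * p * r / (p + r) if p + r else 0.0     # F = 2PR / (P + R)
    return p, r, f


# Hypothetical example: N = 4 correct documents, M = 3 retrieved, C = 2 in common.
correct_docs = {"d1", "d2", "d3", "d4"}
retrieved_docs = {"d2", "d3", "d5"}
p, r, f = precision_recall_f(correct_docs, retrieved_docs)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With these counts, P = 2/3, R = 1/2, and F = 4/7. The zero-division guards return 0.0 when nothing is retrieved or nothing is correct, which is a convention chosen here, not something stated on the slides.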

[Diagram: the set of documents for a question, with the regions N, M, and C.]
N: correct templates
M: retrieved templates
C: correct templates that have been retrieved

Precision: $P = \frac{C}{M}$
Recall: $R = \frac{C}{N}$
F-value: $F = \frac{2 P \cdot R}{P + R}$

The picture is more complicated because templates may be only partially filled.
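
One possible way to handle partially filled templates, offered here only as an assumption since the slides do not specify a scoring scheme, is to count slots instead of whole templates: N becomes the number of slots filled in the reference template, M the number of slots filled by the system, and C the number of slots the system filled with the reference value. The function `score_slots` and the system output below are hypothetical.

```python
def score_slots(reference, predicted):
    """Slot-level (P, R, F) for one reference template and one system template."""
    n = len(reference)   # N: slots filled in the reference
    m = len(predicted)   # M: slots filled by the system
    c = sum(
        1
        for slot, value in predicted.items()
        if slot in reference and reference[slot] == value
    )                    # C: slots filled with the reference value
    p = c / m if m else 0.0
    r = c / n if n else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Reference slots taken from the "right template" example in the slides.
reference = {
    "Name": "Jurgen Pfrang",
    "Age": "51",
    "Profession": "Deputy general manager",
    "Location": "Nanjing, China",
}
# Hypothetical system output: one slot missing, one slot wrong.
predicted = {
    "Name": "Jurgen Pfrang",
    "Age": "51",
    "Location": "Yangzhou, China",
}
print(score_slots(reference, predicted))
```

Here the system gets credit for the two slots it filled correctly even though the template as a whole is neither complete nor fully right.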