Noisy Text Analytics: An Exercise in Futility?
Rohini Srihari
Janya, Inc.
www.janyainc.com
8 January 2007
Overview: Noisy Text Analytics
• All text is noisy!
– Does not fit shrink-wrapped processing; adaptation is necessary
• Business and national security interests in processing:
– Open source data (e.g. web pages)
– Consumer generated media (blogs, newsgroups, chat, text messaging, etc.)
• Key is to identify analysis requirements clearly
– Not necessary to understand everything
Challenging Problems
• Mixed modalities
– Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections
– Text with images, figures
• Improve within-document and cross-document information consolidation
• World models for discourse processing
– Need to bring in more context; relate text analytics to semantic web activities (DAML/OWL)
– Dynamic use of online resources
• Adaptive text analytics
– Extraction requirements are constantly changing, and so is the data!
– Corpus-based learning
• Flexible architectures
– Integrating additional preprocessing, handling streaming data, etc.
USMTF Document Structure
OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
BDAREPID/BEN:1111-22222/REPCOUNT:1//
ICOD/011630ZJAN2002//
BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//
Sample Document
The same message is shown again with three parts called out in turn:
• Sets
• Fields
• Free-text field
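
The sets, fields, and free-text field called out on this slide can be pulled apart mechanically. The following is a minimal sketch, assuming that sets end with `//`, that fields within a set are separated by `/`, and that everything after the title of a `GENTEXT` set is free text. The function names `parse_message` and `split_field` are hypothetical, and the sample string is an abridged copy of the message on the slide with the free text shortened. It is not the parser used by any particular system.

```python
from typing import List, Tuple

# Abridged from the sample message on the slide; the free text is shortened.
SAMPLE = (
    "OPER/BRAVE CHILD//"
    "MSGID/BDAREP PHASE2/NMJIC/F-0005//"
    "BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//"
    "GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT.//"
)


def parse_message(message: str) -> List[Tuple[str, List[str]]]:
    """Split a message into (set identifier, fields) pairs."""
    parsed = []
    for raw_set in message.split("//"):      # assumption: '//' ends a set
        raw_set = raw_set.strip()
        if not raw_set:
            continue
        set_id, *fields = raw_set.split("/")  # assumption: '/' separates fields
        if set_id == "GENTEXT" and len(fields) > 1:
            # Keep the title, and rejoin the rest as a single free-text field.
            fields = [fields[0], "/".join(fields[1:])]
        parsed.append((set_id, fields))
    return parsed


def split_field(field: str) -> Tuple[str, str]:
    """Split a 'NAME:VALUE' field; fields without a colon get an empty name."""
    name, sep, value = field.partition(":")
    return (name, value) if sep else ("", field)


for set_id, fields in parse_message(SAMPLE):
    print(set_id, [split_field(f) for f in fields])
```

Only the `GENTEXT` free-text field would then need to go through natural-language analysis; the remaining fields are already structured.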
Sample Document
TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0//
ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON/MAXRECUP:6MON//
GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//
Entity Description/Name Field
Reference to Structured Sets from Free Text
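
One way to picture a reference from free text back to a structured set: the `TGTEL` field names the target as `C2 OPERATIONS BLDG`, while the damage narrative speaks of the `C2 OPERATIONS BUILDING`. The sketch below links the two by expanding abbreviations before matching. The abbreviation table and the function names are hypothetical, and exact phrase matching is a deliberate simplification; real messages would need fuzzier matching.

```python
import re
from typing import Dict

# Hypothetical abbreviation table; a real system would need a far larger one.
ABBREVIATIONS: Dict[str, str] = {"BLDG": "BUILDING"}

# Field value and narrative excerpt taken from the sample message.
TGTEL_VALUE = "C2 OPERATIONS BLDG"
NARRATIVE = (
    "ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING "
    "HAS SUFFERED SEVERE INTERNAL DAMAGE"
)


def normalize(text: str) -> str:
    """Upper-case, tokenize on letters/digits, and expand abbreviations."""
    tokens = re.findall(r"[A-Z0-9]+", text.upper())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def mentions(field_value: str, narrative: str) -> bool:
    """True if the normalized field value occurs as a phrase in the narrative."""
    # Padding with spaces keeps the match on whole-token boundaries.
    return f" {normalize(field_value)} " in f" {normalize(narrative)} "


print(mentions(TGTEL_VALUE, NARRATIVE))
```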
Cross-Document Entity Profile
Corpus-Based Learning
• Training phase requires four inputs
– Document repository (unlabeled training data)
– Config file 1 for DTL Context (how to create unlabeled training data)
– Seed file (how to label a small amount of unlabeled training data)
– Config file 2 for Learning Tool
• How to learn a model
• How to use learned model in Semantex
Figure: training pipeline diagram. Components shown: Document Repository, Config File 1, DTL Context, Seed File, Training Data, Trainer, Config File 2, Learning Tool, Learned Model.
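
The four training inputs can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the Semantex or Learning Tool implementation: the `window` parameter stands in for config file 1, the `SEEDS` dictionary stands in for the seed file, and a small Naive Bayes classifier with add-one smoothing is swapped in for whatever learner config file 2 would select. The documents and seed words are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Tuple

# Made-up stand-ins for the document repository and the seed file.
DOCS = [
    "the explosion occurred during the night",
    "the attack occurred during the morning",
    "the table stood in the kitchen",
    "the chair stood in the hall",
]
SEEDS: Dict[str, str] = {
    "explosion": "event", "attack": "event",
    "table": "nonevent", "chair": "nonevent",
}


def extract_instances(docs: List[str], window: int = 2) -> Iterator[Tuple[str, List[str]]]:
    """Yield (word, context words) pairs; `window` plays the role of config file 1."""
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            yield tok, tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def label_with_seeds(instances, seeds):
    """Keep only instances whose word is a seed, labeled with the seed's class."""
    return [(ctx, seeds[word]) for word, ctx in instances if word in seeds]


def train(labeled):
    """Count labels and per-label context features (Naive Bayes statistics)."""
    label_counts, feature_counts, vocab = Counter(), defaultdict(Counter), set()
    for ctx, label in labeled:
        label_counts[label] += 1
        feature_counts[label].update(ctx)
        vocab.update(ctx)
    return label_counts, feature_counts, vocab


def classify(model, ctx: List[str]) -> str:
    """Return the label with the highest add-one-smoothed log probability."""
    label_counts, feature_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        denom = sum(feature_counts[label].values()) + len(vocab)
        scores[label] = math.log(n / total) + sum(
            math.log((feature_counts[label][f] + 1) / denom) for f in ctx if f in vocab
        )
    return max(scores, key=scores.get)


MODEL = train(label_with_seeds(extract_instances(DOCS), SEEDS))
# Classify an unseen noun by its context in "the meeting occurred during lunch".
print(classify(MODEL, ["the", "occurred", "during"]))
```

The point of the sketch is the division of labor: the seed list labels only a small part of the repository, and the learner generalizes from the contexts of those seeds to words that were never listed.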
Versatility of learning tool applied to different tasks
• Example: Nominal Event Classifier
– Seed file: 95 unambiguous event nominals, 295 unambiguous nonevent nominals
– Repository: News texts processed by Semantex
– Config file (DTL): Look at features surrounding nouns
– Config file (LearningTool): Learn using a mixture model
• Example: Disease Outbreak Classifier
– Seed file: 10 verb types representative of disease outbreak
– Repository: Medical reports processed by Semantex
– Config file (DTL): Look at features surrounding verbs
– Config file (LearningTool): Learn using distributional similarity
Example: Name Disambiguation
• Are two instances of Tom Smith the same individual?
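
A common baseline for this question compares the words that surround each mention of the name. The sketch below is one such baseline, assuming bag-of-words context vectors and cosine similarity; it is not presented as the method behind the slide. The three mentions, the stopword list, and the 0.5 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and decision threshold.
STOPWORDS = {"a", "as", "at", "the", "will", "said", "more", "two"}
THRESHOLD = 0.5

# Made-up mentions of the same name in three documents.
M1 = "Tom Smith, a cardiologist at the city hospital, treated heart patients."
M2 = "The hospital said cardiologist Tom Smith will treat more heart patients."
M3 = "Tom Smith scored two goals as the football club won the league match."


def context_vector(text: str, name: str) -> Counter:
    """Bag of context words, excluding the name itself and stopwords."""
    skip = STOPWORDS | set(name.lower().split())
    return Counter(t for t in re.findall(r"[a-z]+", text.lower()) if t not in skip)


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def same_individual(text_a: str, text_b: str, name: str) -> bool:
    """Guess that two mentions co-refer when their contexts are similar enough."""
    return cosine(context_vector(text_a, name), context_vector(text_b, name)) >= THRESHOLD


print(same_individual(M1, M2, "Tom Smith"))
print(same_individual(M1, M3, "Tom Smith"))
```

Exact word overlap is brittle on noisy text ("treated" and "treat" do not match here), which is one reason richer features such as extracted entities and relations are attractive for this task.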
Conclusions
• Dealing with noisy text is not a futile exercise!
– Commercial applications are already available
– Need to specify analysis requirements clearly
– Adapt IE technology appropriately