One Tool, Many Industries Text Mining with Oracle Omar Alonso Chuck Adams Oracle Corp. Text Mining Summit, Boston, 2005


One Tool, Many Industries

Text Mining with Oracle

Omar Alonso, Chuck Adams

Oracle Corp.

Text Mining Summit, Boston, 2005

Agenda

• Introduction
• Text mining
• Define problems
• Present solutions
• A look at Oracle’s technology stack
• Oracle’s roadmap
• A case study
• Conclusions

Data mining and Text mining

[Figure. Axis labels: Structured Data, Unstructured Data. Boxes: OLTP, OLAP, DM, Keyword search, BK, TM]

TM:
• Classification
• Clustering
• Ontologies
• NLP
• Inexact match

An analogy

• RFID and robot vision
– Put tags on everything instead of having the robot do the vision
• Similar approach for text mining
– Language is very social, not technical
– Instead, start with a unified storage model
– Then do mining

What about text mining?

• Text mining is one of many features in text technology
• Real future of text technology is business intelligence (BI)
• What is BI?
– Ability to make better decisions
• What are the obstacles today?
– Structured data is well understood
– Unstructured data is different

Text and XML

[Figure: increased exploitation of structure]
• Plain Old File System
• File System on Steroids (WinFS)
• Records Mgmt, ECM
• Dynamic Doc Generation
• Traditional Content Mgmt
• XML Content Mgmt.

First problem: access

• No uniform access over all sources
• Each source has separate storage and algebra
• Examples
– Email
– Databases
– Applications
– Web

Second problem: management

• Management of unstructured data is very poor compared with structured data
• Cleaning
• Noise is larger than in structured data
• Security
• Multilingual

Third problem – user needs

• Perception with current search engines
• Large data -> 80/20 rule
• Doesn't provide uniform information
• Two users type same query and get the same results
– Cricket the game or cricket the bug?

Foundations

• XML as the common model
• XML allows:
– Manipulating data with standards
– Mining becomes more data mining
– RDF emerging as a complementary model
• The more structure you can explore the better you can do mining
• Integration use cases
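
A minimal sketch of "mining becomes more data mining": once documents share an XML model, their structured parts can be flattened into ordinary rows and counted like any other data. The element names (`doc`, `author`, `category`, `body`) and the sample records are hypothetical, and the Python standard library stands in here for whatever XML tooling a real deployment would use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical documents sharing one XML model (element names are assumptions).
SAMPLE = """
<docs>
  <doc id="1"><author>lee</author><category>bug</category><body>Index build fails</body></doc>
  <doc id="2"><author>kim</author><category>sales</category><body>Renewal deal closed</body></doc>
  <doc id="3"><author>lee</author><category>bug</category><body>Query returns stale rows</body></doc>
</docs>
"""


def to_rows(xml_text):
    """Flatten each <doc> into a dict so it can be treated as structured data."""
    root = ET.fromstring(xml_text.strip())
    return [
        {
            "id": doc.get("id"),
            "author": doc.findtext("author"),
            "category": doc.findtext("category"),
            "body": doc.findtext("body"),
        }
        for doc in root.iter("doc")
    ]


rows = to_rows(SAMPLE)
# With rows in hand, a plain aggregate replaces a text-mining step.
by_category = Counter(row["category"] for row in rows)
```

The free-text `body` still needs text techniques; the point is only that every element the model exposes can be handled with ordinary structured-data tools.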

Foundations - II

• Unstructured data is too AI
• Too easy to get fooled by the complexity
• Hybrid solution
• Domain knowledge
– You know your domain
– You own the content
– You can do better

Remember?

Personalization problem

• Lack of personalization
• You own the content, you own the user
• Two users type the same query: “financials”
– Sales rep looks for customers and other deals
– Tech guy looks for bugs, architecture, etc.
• LDAP shows who they are
• Combination with query logs shows patterns in the same peer group
• Recommendation systems
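
The idea on this slide can be sketched as follows: look up the user's peer group, count which documents that group opened for the same query, and move those documents to the top of the hit list. This is an illustration under stated assumptions, not the product's method: the `USER_GROUP` dictionary stands in for an LDAP lookup, and the click log, user names, and document names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: a directory lookup (standing in for LDAP) and a click log.
USER_GROUP = {"ann": "sales", "bob": "sales", "cy": "engineering", "di": "engineering"}
QUERY_LOG = [
    # (user, query, document the user opened)
    ("ann", "financials", "customer-deals"),
    ("bob", "financials", "customer-deals"),
    ("bob", "financials", "quarterly-forecast"),
    ("cy", "financials", "bug-list"),
    ("di", "financials", "architecture-notes"),
    ("cy", "financials", "bug-list"),
]


def build_group_clicks(user_group, query_log):
    """Count opened documents per (peer group, query)."""
    clicks = defaultdict(Counter)
    for user, query, doc in query_log:
        clicks[(user_group[user], query)][doc] += 1
    return clicks


def personalize(hits, user, query, user_group, clicks):
    """Reorder a hit list so documents popular in the user's peer group come first."""
    counts = clicks[(user_group[user], query)]
    # sorted() is stable, so hits with equal counts keep their original order.
    return sorted(hits, key=lambda doc: -counts[doc])


clicks = build_group_clicks(USER_GROUP, QUERY_LOG)
hits = ["architecture-notes", "bug-list", "customer-deals", "quarterly-forecast"]
for_sales = personalize(hits, "ann", "financials", USER_GROUP, clicks)
for_engineering = personalize(hits, "cy", "financials", USER_GROUP, clicks)
```

Both users typed "financials" and received the same four hits, but each sees the documents their own peer group opened most at the top.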

Better Answers: Beyond Keywords

• Noise theory
– As you cast your nets ever wider, you catch disproportionately more junk
• Must develop new models of Quality in the face of comprehensiveness
– Combine Link-Analysis with Context-sensitive relevance
– Personalization
• Must summarize information
– Theme Maps, Gists
• Show patterns in information vs. many pages of hit-lists
– Tree Maps, Stretch Viewer
• Ability to post-process and refine search hit lists
– Dynamic categories for navigation
– Reorder by date
• Progressive query relaxation
– Nearest inexact match
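
Progressive query relaxation can be illustrated with a small sketch: run the strictest form of the query first, and fall back to looser forms only while too few hits have been found. The four stages chosen here (phrase, all terms, any term, fuzzy match via the standard-library `difflib`), the `wanted` threshold, and the sample documents are assumptions made for illustration; they are not taken from the slides or from any product API.

```python
import difflib

# Hypothetical corpus; document 3 contains a misspelling on purpose.
DOCS = {
    1: "oracle text mining summit",
    2: "text mining with oracle database",
    3: "data minning tutorial",
}


def exact_phrase(query, text):
    return query in text


def all_terms(query, text):
    words = text.split()
    return all(term in words for term in query.split())


def any_term(query, text):
    words = text.split()
    return any(term in words for term in query.split())


def fuzzy_any_term(query, text, cutoff=0.8):
    """Nearest inexact match: any query term close to some word in the text."""
    words = text.split()
    return any(
        difflib.get_close_matches(term, words, n=1, cutoff=cutoff)
        for term in query.split()
    )


# Ordered from strictest to loosest.
STAGES = [
    ("phrase", exact_phrase),
    ("all terms", all_terms),
    ("any term", any_term),
    ("fuzzy", fuzzy_any_term),
]


def relaxed_search(query, docs, wanted=2):
    """Return (doc_id, stage) pairs, relaxing the query until enough hits exist."""
    results, seen = [], set()
    for name, matches in STAGES:
        for doc_id, text in sorted(docs.items()):
            if doc_id not in seen and matches(query, text):
                seen.add(doc_id)
                results.append((doc_id, name))
        if len(results) >= wanted:
            break
    return results
```

Hits found by stricter stages always rank ahead of hits found by looser ones, so widening the net does not push junk above the good results.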

Technology Stack

Better Answers

Relevance Toward BI

Progressive Relaxation

Multi-Criterion Support

Visualization

Classification

Personalization

Direct Answers

Link Analysis

Query Log Analysis

Metadata Extraction

Keyword Ranking

Intelligent Match

Duplicate Elimination

Oracle’s position

• Text mining is one of many tools for information retrieval and discovery in many assets
• Text mining is best used in the context of other techniques
– Personalization
– Search query logs
– Visualization

Product: one integrated platform

Oracle platform

Integrated platform vs. niche technology

Capability: niche products
• Full-text searching: Google, FAST
• XML: Tamino
• Classification: Autonomy
• Clustering: Vivisimo
• Visualization: Inxight
• Application search: SAP/TREX

Integrated platform: one platform, low cost, low complexity

Niche technology: several products, different APIs, performance, maintenance cost, etc.

Oracle platform

“If I can see further than anyone else, it is only because I am standing on the shoulders of giants” – Isaac Newton

• Oracle provides you all the functionality
– Plus you get backup, recovery, scalability, and other benefits
• You build the mining application

Case study

• Federal customer
• High Performance Text Information Mining and Entity Extraction

Business Need

• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability

Challenges

• Search quality
• Performance
• Scalability
• Document formats
• Integration
• Operations and maintenance

Solutions Architecture

• Oracle 10g Integrated Framework
• 10g release 2
– Oracle Real Application Clusters
– Oracle Text
    Full text and rule based indexing
    Extensible thesauri
    Document classification
    Document filters
– Oracle Partitioning
– Oracle Virtual Private Database
– Oracle Advanced Security

Technical Architecture

[Figure: architecture diagram]

Components:
• EDL Portal Users
• Load Balancer
• Application Servers
• Oracle 10g RAC and Oracle 10g RAC Interconnect
• Enterprise MetaData Layer
• Scalar, Domain, and B*Tree Indices
• ADS OID
• Existing Mission Systems

Annotations:
• Process isolated RAC DB nodes: one tuned for user query and the other for data synchronization
• Key metadata consolidated and indexed for enterprise data layer access
• CIA PKI authentication from ADSN clients
• ADS LDAP integrated for client and server authentication
• Network based integration hub and EDL synchronization services
• Federated data access J2EE services for mission system drill

Scalable load and indexing

[Figure: load and indexing pipeline]

Stages:
• Data Collection
• Preprocess & Filtering
• Java Load Distribution Process
• Java Load Threads (eight shown)
• Oracle 9i & 9i Text

Data passed between stages:
• UTF8 text extracted from collection
• Standardized XML DTD

Stored in Oracle:
• Raw Payload
• Payload Index
• Scalar Indexes
• XML Indexes
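
The shape of this pipeline (preprocess, wrap in a standard XML form, fan out across parallel load threads) can be sketched in a few lines. The slide names Java load threads; Python's `concurrent.futures` is substituted here purely for illustration. The `document`/`payload` envelope, the whitespace-collapsing filter, and the default of eight workers (mirroring the eight threads drawn on the slide) are assumptions, and a real loader would write to the database rather than return strings.

```python
from concurrent.futures import ThreadPoolExecutor
from xml.sax.saxutils import escape


def preprocess(raw: bytes) -> str:
    """Decode to UTF-8 text and collapse whitespace (a stand-in for real filtering)."""
    text = raw.decode("utf-8", errors="replace")
    return " ".join(text.split())


def to_standard_xml(doc_id: int, text: str) -> str:
    """Wrap the extracted text in a hypothetical standardized envelope."""
    return f'<document id="{doc_id}"><payload>{escape(text)}</payload></document>'


def load_one(item):
    """Work done by a single load thread for one document."""
    doc_id, raw = item
    return to_standard_xml(doc_id, preprocess(raw))


def load_all(collection, workers=8):
    """Distribute documents across worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_one, enumerate(collection)))
```

Because each document is processed independently, throughput can be raised by adding workers until the downstream store becomes the bottleneck.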

Real world results

• Single search for user
• Profiles and alerts
• Couple second query response
• 80,000,000 + documents indexed
• 1.2 TB raw text and growing
• 700 Gig index size
• Incremental index 1-2 Gig / day
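
A rough back-of-the-envelope reading of these figures, as a sketch only: it assumes decimal units (1 TB = 10^12 bytes, 1 Gig = 10^9 bytes), treats "80,000,000 +" as exactly 80 million, and reads the 1-2 Gig per day as index growth. The derived values are estimates, not numbers reported by the presenters.

```python
def summarize(docs, raw_bytes, index_bytes, daily_low, daily_high):
    """Derive average document size, index-to-raw ratio, and daily growth range."""
    return {
        "avg_doc_kb": raw_bytes / docs / 1e3,
        "index_to_raw": index_bytes / raw_bytes,
        "daily_growth_pct": (
            100 * daily_low / index_bytes,
            100 * daily_high / index_bytes,
        ),
    }


stats = summarize(
    docs=80_000_000,
    raw_bytes=1.2e12,    # 1.2 TB raw text
    index_bytes=700e9,   # 700 Gig index
    daily_low=1e9,       # 1 Gig / day
    daily_high=2e9,      # 2 Gig / day
)
```

Under these assumptions the average document is about 15 KB, the index is a little under 60% of the raw text, and the daily increment is well below 1% of the index.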

Next Steps

[Figure: entity extraction and relationship awareness]

Components:
• Oracle 10g Text Index structure
• Entity identification and extraction engine
• Language specific dictionaries (four shown)
• Extracted Entities
• XML Interface
• Relationship detection engine

• Entity Extraction and Relationship Awareness
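
To make the components on this slide concrete, here is a minimal sketch of dictionary-driven entity extraction with an XML output and a naive relationship step. It is not the engine described on the slide: the dictionaries, entity types, the `entities`/`entity` element names, and the co-occurrence rule used for relationships are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical language-specific dictionaries: lowercase surface form -> entity type.
DICTIONARIES = {
    "en": {"oracle": "ORGANIZATION", "boston": "LOCATION"},
    "es": {"oracle": "ORGANIZATION", "madrid": "LOCATION"},
}


def extract_entities(text, lang):
    """Return (surface form, type, offset) for every dictionary term found in text."""
    found = []
    for match in re.finditer(r"\w+", text):
        entity_type = DICTIONARIES[lang].get(match.group().lower())
        if entity_type:
            found.append((match.group(), entity_type, match.start()))
    return found


def cooccurring_pairs(entities):
    """Naive relationship detection: pair up distinct entities from the same text."""
    names = sorted({name for name, _, _ in entities})
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]


def to_xml(entities):
    """Serialize extracted entities through a simple XML interface."""
    root = ET.Element("entities")
    for name, entity_type, offset in entities:
        node = ET.SubElement(root, "entity", type=entity_type, offset=str(offset))
        node.text = name
    return ET.tostring(root, encoding="unicode")


entities = extract_entities("Oracle spoke in Boston.", "en")
pairs = cooccurring_pairs(entities)
xml_out = to_xml(entities)
```

Swapping the `lang` argument selects a different dictionary without changing the extraction code, which is the role the per-language dictionaries play in the diagram.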

Oracle Database 10g Release 2

• Enterprise Search Capability
• Information Fusion
• Profiles and alerting
• Security – user need to know
• Entity identification and extraction
• High Performance ingestion, search, and indexing
• Scalability

Conclusions

Text mining is one of many features needed for BI on unstructured data

– Not a silver bullet in itself

Must exploit other approaches – metadata (XML, RDF), personalization, classification, entity extraction, full-text search, …

– Hybrid solution

Focus on an integrated platform that gives you all the functionality

Drive the platform for your information need
