the sas text mining solution - · pdf fileqthis presentation provides an overview of the sas...

Post on 12-Mar-2018

216 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Copyright © 2001, SAS Institute Inc. All rights reserved.

The SAS Text Mining SolutionBernd.Drewes@eur.sas.com

Senior Data Mining ConsultantSAS International

Wayne.Thompson@sas.comProduct Manager, Enterprise Miner

SAS World Wide Marketing

Copyright © 2001, SAS Institute Inc. All rights reserved.

SAS is a registered trademark or trademark of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or Trademarks of their respective companies.

OverviewThis presentation provides an overview of the SAS text mining

solution including example applications.

Copyright © 2001, SAS Institute Inc. All rights reserved.

3

Text Mining TimelinesExperimental

Enterprise Miner 4.1, SAS 8.2, currently availableEarly Adopter Text Mining Program

ProductionAdd-on to Enterprise Miner 4.2 and 5.0, SAS 9.0, 1st quarter

of 2002. Separate API to potentially include text mining into other SAS

products and solutions.

Copyright © 2001, SAS Institute Inc. All rights reserved.

4

Early Adopter ProgramPlease note:

Enterprise Miner for Text is experimental and has a separate setinit. It will be made available on a restricted basis under an Early Adopter program. Please turn to your local SAS office for details."

Copyright © 2001, SAS Institute Inc. All rights reserved.

5

Why Text Mining ?

Copyright © 2001, SAS Institute Inc. All rights reserved.

6

What is Text Mining?Discovering and using knowledge that

exists in the document collection as a whole

Uncovering patterns within the collectionEstablish connections between documents and the

terms in the collection as a wholeCombining free-form text and quantitative variables

to derive information

Knowledge

Copyright © 2001, SAS Institute Inc. All rights reserved.

7

Text Mining is notNatural Language Processing/Understanding

Information Retrieval

Knowledge extraction from a single document

Copyright © 2001, SAS Institute Inc. All rights reserved.

8

Text Mining ApplicationsAutomatic Classification of Documents

Email filteringOrganize documents by topic into categoriesNews item routingGenerate classification codes from purchase descriptions

ClusteringSurvey data analysisCustomer complaints/comments

Copyright © 2001, SAS Institute Inc. All rights reserved.

9

Text Mining Applications (cont.)Other forms of Prediction

Stock market prices from business news announcementsCustomer satisfaction from customer commentsCost based on call center log

Copyright © 2001, SAS Institute Inc. All rights reserved.

10

How Does It Work?Document pre-processing

Data cleansingData reduction

Document analysisclassificationclusteringpredictionetc.

Copyright © 2001, SAS Institute Inc. All rights reserved.

11

Leveraging Enterprise Miner

Copyright © 2001, SAS Institute Inc. All rights reserved.

12

Data InputAs a variable

contained in a SAS data set.

As documents residing on the file system.

Copyright © 2001, SAS Institute Inc. All rights reserved.

13

Text Parsing (cont.)Stop lists

Filter out low information words• articles (e.g. the, a, this)• prepositions (e.g. of, from, by)• conjunctions (e.g. and, but, or)• etc.

Consider document subject matter

Copyright © 2001, SAS Institute Inc. All rights reserved.

14

Text Parsing

Create Term-Document Frequency MatrixRecord number of times term i appears in document jUses compressed representation in order to handle

large collections

Copyright © 2001, SAS Institute Inc. All rights reserved.

15

Term FrequenciesWhich terms should be

stopped?

Copyright © 2001, SAS Institute Inc. All rights reserved.

16

Weighing wordsWeigh words differently. Actually done in the SVD node

focus on words that help to distinguish between textsde-emphasize words that

• are common to most documents• only occur rarely

Copyright © 2001, SAS Institute Inc. All rights reserved.

17

Singular Value Decomposition (SVD)Dimension reductionFrom thousands to hundreds or less

Terms and documents projected into same spaceDocuments with similar meaning (but dissimilar

word patterns) can still be near one anotherOther variables can be combined with the results

of the SVDQuantitative variablesCategorical variables

Copyright © 2001, SAS Institute Inc. All rights reserved.

18

Benefits of SVD

Assumes that there is something conceptual underlying words (obscured by word choice)

these concepts can be expressed as collections or combination of words

SVD is used to estimate the structure of these concepts and find their most relevant words

Copyright © 2001, SAS Institute Inc. All rights reserved.

19

Text Clustering (E-M Cluster)

Generates groups of similar documents

from output of SVD

fast clustering of many documents

each document may belong to several classes (“fuzzy clustering”)

Copyright © 2001, SAS Institute Inc. All rights reserved.

20

Mining the TextUtilize Enterprise Miner’s tools

Memory-Based ReasoningNeural NetworkRegression

etc.Combine other variables with the results of the SVD

Quantitative variablesCategorical variablesTarget variables

Copyright © 2001, SAS Institute Inc. All rights reserved.

21

Review of the Text Mining ProcessText parsing extracts terms and creates a term-

document frequency matrixEach document is characterized by a vector in

multi-dimensional spaceThe vectors are then reduced down to100 or so

dimensions by the SVDResults can be classified, clustered or used for

prediction with existing Enterprise Miner tools

Copyright © 2001, SAS Institute Inc. All rights reserved.

22

top related