a presentation by w h inmon bridging the gap between unstructured data and structured data
Post on 18-Dec-2015
A presentation by W H Inmon
BRIDGING THE GAP BETWEEN UNSTRUCTURED DATA AND STRUCTURED DATA
The informal systems of the corporation:
- unstructured data
- .doc files
- .txt files
- .xls files
- email
- transcribed telephone conversations

The formal systems of a corporation:
- structured systems
- structured data
- corporate transactions
- corporate reports
- corporate databases
- customer files
- audit reports
It is estimated that less than 20% of corporate systems are structured.
[Figure: 80% unstructured (.txt, .doc) versus 20% structured (programs)]
The unstructured world (.txt, .doc):
- search engines
- legal discovery
- email archive
- taxonomy
- ontology
- document mgmt
- web content

The structured world (programs):
- dbms
- business intelligence
- applications
- transactions (OLTP)
- ERP
- compliance
Imagine what would happen if the two worlds could be integrated:
the world of DBMS, analytics, and other processing opens up.
Tight integration between the two types of data.
There is a gulf between the two worlds:
- technology
- business practice
- organizational
- historical
Think of the possibilities!
Imagine this -
Reports and visualization show a lot.
Have you ever wondered why you can’t hook up your Business Objects to email? Or to telephone conversations?
[Figure: text (.txt, .doc) versus numbers]
There is a fundamental disconnect between unstructured data and business intelligence.
So what would happen if we had powerful visualization for text?
[Figure: business intelligence visualization of the terms liver cancer, skin cancer, thirst, diabetes, blood pressure]
Correlative information becomes very easy to spot.
Doing analysis on subpopulations of women:
- for the general population
- for women
- for women who smoke
- for women who smoke over the age of 50

Comparing the general population with women who smoke over the age of 50: the contrast between the different correlations of different populations leads to great insight.
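The subpopulation comparison above can be sketched as a count of which terms appear together, computed once for everyone and once for a filtered group. This is only an illustration: the notes, attribute names, and the `co_occurrence` helper are hypothetical and are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical patient notes: (attributes, terms found in the text).
NOTES = [
    ({"sex": "F", "smoker": True, "age": 61}, {"thirst", "diabetes"}),
    ({"sex": "F", "smoker": True, "age": 55}, {"thirst", "diabetes", "blood pressure"}),
    ({"sex": "F", "smoker": False, "age": 34}, {"skin cancer"}),
    ({"sex": "M", "smoker": False, "age": 47}, {"blood pressure", "thirst"}),
]


def co_occurrence(notes, keep=lambda attrs: True):
    """Count how often each pair of terms appears in the same note."""
    counts = Counter()
    for attrs, terms in notes:
        if keep(attrs):
            # Sort so that each pair is always stored in the same order.
            counts.update(combinations(sorted(terms), 2))
    return counts


general = co_occurrence(NOTES)
women_smokers_over_50 = co_occurrence(
    NOTES, lambda a: a["sex"] == "F" and a["smoker"] and a["age"] > 50
)
```

Comparing `general` with `women_smokers_over_50` pair by pair shows where a subpopulation differs from the whole.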
What about looking at customer feedback and complaints? Now you can see the broader picture of what is happening.
[Figure: complaint terms: service, delivery, late, broken, installation, salesman attitude, wait too long, did not fit]
But there are plenty of other places where the technology applies:
- manufacturing warranties – (what patterns of defects are there?)
- Weblogs (marketing – who is saying what?)
- customer complaints – (what are the problem products?)
- general email – (What’s the buzz? what is on people’s minds?)
- insurance claims (what are the circumstances of accidents?)
Another possibility is the monitoring of email and the transport of email to the structured environment.
Monitoring emails and other corporate conversations -
Sarbanes Oxley, HIPAA, BASEL II
Compliance: making sure that email is being used properly
- compliance
- corporate standard for language
A bunch of emails and conversations:
Jan 3 – vp to vp: “This is going to be a real barn burner of a quarter…”
Jan 5 – finance to vp: “It looks like we are going to do $9,000,000 this quarter…”
Jan 5 – president to analyst: “This quarter looks like we are going to break new records…”
Feb 1 – employee to employee: “Did you see the stock market? Everything is going down…”
Feb 3 – president to vp: “What is happening to sales in the midwest? We didn’t expect this…”
Feb 3 – vp to vp: “It looks like we are going to be a little short this quarter…”
Feb 4 – sales manager to vp: “The sales cycle looks like it is extending. The economy is tanking…”
Feb 6 – president to vp: “What are we going to do to get sales up? Do we need to do some discounting?”
Mar 2 – sales person to vp: “Demand has dried up. We aren’t going to close as many sales this quarter as we thought…”
What do you do with them?
Examining emails (“combing” them) for important corporate information:
External categories (Sarbanes Oxley): quarter, stock, sales, discount, demand, sales cycle
- sales: email – Feb 2; email – Mar 5; phone – Mar 8; …
- quarter: email – Jan 2; email – Jan 4; email – Feb 5; …
- discount: phone conversation – Jan 6; email – Jan 12; email – Jan 14; …
- sales cycle: email – Feb 24; phone conversation – Mar 14; meeting notes – Mar 18; …
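A minimal sketch of what "combing" might look like: scan each message for the external category terms and record where each term was found. The category terms and message texts come from the slides; the `comb` function, the prefix-matching rule, and the assumption that every message is an email are illustrative choices, not the author's method.

```python
import re
from collections import defaultdict

# Category terms as listed on the slide.
CATEGORIES = ["quarter", "stock", "sales", "discount", "demand", "sales cycle"]

# Messages quoted on the slides; the "email" channel is an assumption.
MESSAGES = [
    ("email", "Jan 5", "It looks like we are going to do $9,000,000 this quarter"),
    ("email", "Feb 3", "What is happening to sales in the midwest? We didn't expect this"),
    ("email", "Feb 6", "What are we going to do to get sales up? Do we need to do some discounting?"),
    ("email", "Mar 2", "Demand has dried up. We aren't going to close as many sales this quarter as we thought"),
]


def comb(messages, categories):
    """Build an index: category term -> list of (channel, date) where it occurs."""
    index = defaultdict(list)
    for channel, date, text in messages:
        lowered = text.lower()
        for term in categories:
            # Match at the start of a word so "discounting" counts as "discount".
            if re.search(r"\b" + re.escape(term), lowered):
                index[term].append((channel, date))
    return dict(index)


index = comb(MESSAGES, CATEGORIES)
```

The resulting `index` has the same shape as the category listing above: one entry per term, each pointing back to the communications that mention it.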
The “combed” information is brought over to the structured environment.
Now you can use standard tools, such as Cognos, Business Objects, Crystal Reports, and MicroStrategy, to do analysis.
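Once the combed information sits in a relational table, ordinary SQL applies to it. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the corporate DBMS; the table name `combed_text` and its columns are hypothetical, and the rows are copied from the category listing in the slides.

```python
import sqlite3

# Rows copied from the category index on the slides: (category, source, date).
ROWS = [
    ("sales", "email", "Feb 2"),
    ("sales", "email", "Mar 5"),
    ("sales", "phone", "Mar 8"),
    ("quarter", "email", "Jan 2"),
    ("quarter", "email", "Jan 4"),
    ("quarter", "email", "Feb 5"),
    ("discount", "phone conversation", "Jan 6"),
]

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combed_text (category TEXT, source TEXT, sent_on TEXT)")
conn.executemany("INSERT INTO combed_text VALUES (?, ?, ?)", ROWS)

# The kind of query a reporting tool would issue: mentions per category.
mentions = conn.execute(
    "SELECT category, COUNT(*) FROM combed_text GROUP BY category ORDER BY category"
).fetchall()
conn.close()
```

Any tool that can issue a query like this one can then report on text the same way it reports on transactions.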
Emails and telephone conversations can be linked to CDI/CRM data through a probabilistic match against customer data.
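One way to picture a probabilistic match is as a similarity score between the name on a communication and each customer record, accepted only above a threshold. The slides do not say how the match is computed, so the sketch below substitutes plain string similarity from Python's `difflib`; the customer records, the `best_match` helper, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical customer master records: id -> name.
CUSTOMERS = {101: "Mary Jones", 102: "Marty Johnson", 103: "Maria Jonas"}


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def best_match(name, customers, threshold=0.8):
    """Return (customer id, score) for the closest record, or (None, score)."""
    scored = [
        (SequenceMatcher(None, normalize(name), normalize(candidate)).ratio(), cid)
        for cid, candidate in customers.items()
    ]
    score, cid = max(scored)
    if score >= threshold:
        return cid, round(score, 2)
    return None, round(score, 2)


matched_id, matched_score = best_match("Mrs. Mary Jones", CUSTOMERS)
unmatched_id, _ = best_match("Bob Smith", CUSTOMERS)
```

The threshold is the design choice that matters: set it too low and communications attach to the wrong customer, too high and genuine matches are dropped.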
But there are other ways that communications can be used.
A true 360 degree view of the customer can be formed.
“I placed an order last week and when it arrived it was the wrong size. And then your company would not take it back. I’m mad.”
How easy is it going to be to engage Mrs Jones until she has satisfaction about her order?
A true 360 degree view of the customer combines communications and demographics, delivering on the promise of CDI.
Can’t I just use a search engine to link the two worlds?
Search engines do not integrate textual information.
Text doesn’t need to be searched; it needs to be integrated.
“ha” can mean “head ache”, “heart attack”, or “Hepatitis A”.
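Resolving an abbreviation like “ha” needs context. The slides do not describe a specific technique, so the sketch below shows one simple possibility: score each candidate meaning by how many of its cue words appear in the same note. The cue words and the `resolve_ha` helper are invented for illustration.

```python
# Hypothetical context cues for each expansion of "ha"; not from the slides.
EXPANSIONS = {
    "heart attack": {"chest", "cardiac", "ecg", "cardiologist"},
    "head ache": {"migraine", "aspirin", "dizzy"},
    "Hepatitis A": {"liver", "jaundice", "vaccine"},
}


def resolve_ha(note):
    """Pick the meaning of 'ha' whose cue words overlap most with the note."""
    words = set(note.lower().replace(",", " ").replace(".", " ").split())
    if "ha" not in words:
        return None
    scores = {meaning: len(cues & words) for meaning, cues in EXPANSIONS.items()}
    best = max(scores, key=scores.get)
    # With no supporting cue at all, leave the abbreviation unresolved.
    return best if scores[best] > 0 else None


cardiac_note = resolve_ha("Patient admitted with ha, chest pain, abnormal ECG")
liver_note = resolve_ha("ha after vaccine, liver enzymes elevated")
bare_note = resolve_ha("ha noted")
```

A plain search for “ha” would return all three kinds of note without telling them apart, which is the gap that integration is meant to close.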
“oblique fractured ulna”, “oblique fractured tibia”, and “obliq fractured tarsi” all mean “broken bone”.
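The roll-up from specific phrases to a common term can be sketched with two small lookup tables: one that repairs alternate spellings and one that recognizes the specific items. Both tables and the `generalize` helper are hypothetical and cover only the three phrases on the slide.

```python
import re

# Hypothetical lookup tables built for this example only.
SPELLING = {"obliq": "oblique"}
BONES = {"ulna", "tibia", "tarsi"}


def generalize(phrase):
    """Map a specific fracture phrase to the general term 'broken bone'."""
    words = [SPELLING.get(w, w) for w in re.findall(r"[a-z]+", phrase.lower())]
    if "fractured" in words and BONES & set(words):
        return "broken bone"
    # Anything unrecognized is returned with spelling repaired but otherwise unchanged.
    return " ".join(words)


phrases = ["oblique fractured ulna", "oblique fractured tibia", "obliq fractured tarsi"]
general_terms = [generalize(p) for p in phrases]
```

After this step a query for “broken bone” finds all three records, even though none of them contains those words.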
What is meant by editing, integrating text?
1 – stop word editing
2 – stemming
3 – synonym replacement
4 – synonym concatenation
5 – homograph resolution
6 – alternate spelling resolution
7 – external category classification
8 – theming
9 – probabilistic matching
10 – negation exclusion
11 – concept clustering
12 – mid process editing
13 – change sensitivity
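A few of the steps listed above can be chained into a small pipeline to show how raw text becomes a list of comparable terms. The sketch covers only stop word editing, stemming, synonym replacement, and negation exclusion; the word lists, the naive `stem` function, and the rule that a negation drops only the next word are simplifying assumptions.

```python
import re

# Hypothetical word lists for this example only.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "it"}
SYNONYMS = {"bust": "broken", "tardy": "late"}  # keyed by stemmed form
NEGATIONS = {"not", "no", "never"}


def stem(word):
    """Naive suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def integrate(text):
    """Reduce free text to a list of normalized terms."""
    terms, skip_next = [], False
    for word in re.findall(r"[a-z]+", text.lower()):
        if skip_next:  # negation exclusion: drop the word right after a negation
            skip_next = False
            continue
        if word in NEGATIONS:
            skip_next = True
            continue
        if word in STOP_WORDS:  # stop word editing
            continue
        word = stem(word)  # stemming
        terms.append(SYNONYMS.get(word, word))  # synonym replacement
    return terms


complaint_terms = integrate("The delivery was tardy and the box arrived busted")
negated_terms = integrate("The part was not broken")
```

The order of the steps matters: here the synonym table is keyed by stemmed forms because stemming runs first, matching the order of the list.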
[Figure: DW 2.0, the architecture for the next generation of data warehousing. Sectors: Interactive, Integrated, Near line, Archival. Currency labels: Very current, Current++, Less than current, Older. Structured component: transaction data, application (Appl) data, subject (Subj) data, detailed data, summary data, continuous snapshot data, profile data, reference and master data. Unstructured component: captured text, textual subjects (internal, external), text id, text to subj, linkage, simple pointers. Other labels: Business, Technical.]
© Copyright 2006 Bill Inmon and Inmon Data Systems. DW 2.0 is a trademark of Bill Inmon and Inmon Data Systems. All rights reserved. “The architecture for the next generation of data warehousing” is copyrighted by Bill Inmon and Inmon Data Systems, 2006.
For a detailed description of how the unstructured environment should be linked to the structured environment, go to www.inmoncif.com and look for DW 2.0 TM, or go to www.inmondatasystems.com
[Figure: Unstructured Data; probabilistic match; Structured Environment (DB2); Query (Business Objects, Cognos, MicroStrategy, Crystal Reports); visualization]