Text Analytics End to End
Gary Robinson, IBM
© 2013 IBM Corporation
Scenario
• Source and analyze blogs and news articles about a popular brand or service across various social media sites
  − “IBM Watson”
  − Analytics include
    • Watson applications by industry and within an industry
    • Watson association with Jeopardy!
    • Simple sentiment/tone scoring
Scenario
• Process
  − Collect data
  − Transform and subset
  − Develop and test a Text Analytics extractor using Eclipse
  − Publish and deploy the extractor to a BigInsights cluster
  − Apply the Text Analytics extractor from BigSheets
  − Analyze and chart the results
Text Analytics
• Identify and extract structured information from unstructured and semi-structured text
• To enable analytics
  − chart, report, join, aggregate, slice, dice and drill, model, mine…
Text Analytics
• 80% of the world’s data is unstructured or semi-structured text
• Social media is rife with information about products and services
  − Discussions, blogs, tweets…
• Applications often lock up useful information in blobs, description fields and semi-structured records that are difficult or impossible to open up for analysis
  − Call center records, log files…
• How do you get a metrics-based understanding of facts from unstructured text?
I had an iphone, but it's dead @JoaoVianaa.
(I've no idea where it's) !Want a blackberry now !!!
@rakonturmiami im moving to miamiin 3 months.
i look foward to the new lifestyle
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR
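To make the question above concrete, here is a minimal Python sketch (not AQL, and not how the BigInsights extractors work internally) that turns the three sample messages into flat records that could be counted or charted. The field names and the small product list are assumptions made up for this illustration.

```python
import re

# The three sample messages from the slide, each joined onto one line.
MESSAGES = [
    "I had an iphone, but it's dead @JoaoVianaa. (I've no idea where it's) !Want a blackberry now !!!",
    "@rakonturmiami im moving to miamiin 3 months. i look foward to the new lifestyle",
    "I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Des Moines) w/ 2 others http://4sq.com/gbsaYR",
]

# Hypothetical product list, chosen only for this illustration.
PRODUCTS = ("iphone", "blackberry")


def extract_record(text):
    """Turn one free-text message into a flat record of countable fields."""
    lowered = text.lower()
    return {
        "mentions": re.findall(r"@(\w+)", text),        # user handles
        "urls": re.findall(r"https?://\S+", text),      # links
        "products": [p for p in PRODUCTS if p in lowered],  # naive substring match
    }


records = [extract_record(m) for m in MESSAGES]
for record in records:
    print(record)
```

Once the text is reduced to records like these, ordinary tools (grouping, counting, charting) apply. A real extractor would need far more careful rules than substring matching.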
BigInsights & Streams Text Analytics
• High-performance, rule-based information extraction engine
• Highly scalable solution for at-rest and in-motion analytics
• Pre-built extractors and a toolkit for building custom extractors
• Declarative Information Extraction (IE) system based on an algebraic framework
• Sophisticated tooling to help build, test, and refine rules
• Developed at IBM Research since 2004
• Embedded in several IBM products
Applications of Text Analytics
• Broad range of applications in many industries
  − CRM Analytics - Voice of customer, product and services gap analysis, customer churn
  − Social Media Analytics - Purchase intent, customer churn prediction, reputational risk
  − Digital Piracy - Illegal broadcast of streaming and video content
  − Log Analytics - Failure analysis and root cause identification, availability assurance
  − Regulatory Compliance - Data redaction to identify and protect sensitive information
Deploy to Streams and BigInsights
[Diagram: AQL Language → Optimizer → Compiled Plan (Text Analytics Module). The Extractor in the Text Analytics Module runs on Streams and on a BigInsights Cluster: Input Documents → Extracted Information → Downstream Integration and Processing]
Developing an Extractor
• Select documents to work with
• Label examples of interesting text
• Label clues or elements within or around the examples
• Create or refine AQL to extract basic features
• Create or refine AQL to generate candidate concepts
• Create or refine AQL to filter and consolidate
[Diagram arrows alongside the steps: Top Down, Bottom Up]
AQL
• Annotation Query Language
  − SQL-like
    • Familiar syntax and concepts make it easier to learn and understand
  − Declarative
    • Describes what computation should be performed, not how to compute it
    • Separates semantics from implementation
  − Compiled and optimized for execution
    • The Text Analytics Module (TAM) is deployed to the cluster for execution by the Text Analytics runtime
AQL
• Fundamental concepts
  − Views
    • Created with select or extract expressions
    • Not materialized unless explicitly requested using ‘output view <name>’ or ‘select into’
    • The ‘Document’ view identifies the set of input documents
      − select… from Document d
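The idea that views are defined but not materialized until output is requested can be pictured with a loose Python analogy. This is not AQL and does not reflect how the Text Analytics runtime is implemented; the `Catalog` class and its method names are hypothetical, invented for this sketch.

```python
class Catalog:
    """Stores each view as a recipe and computes it only when output is requested."""

    def __init__(self, documents):
        # The 'Document' view is the set of input documents.
        self._recipes = {"Document": lambda: list(documents)}
        self.evaluated = []  # names of views that were actually computed

    def create_view(self, name, source, row_fn):
        # Defining a view only records how to compute it; nothing runs here.
        def recipe():
            self.evaluated.append(name)
            return [row_fn(row) for row in self._recipes[source]()]

        self._recipes[name] = recipe

    def output_view(self, name):
        # Requesting output is what triggers the computation.
        return self._recipes[name]()


catalog = Catalog([{"text": "IBM Watson"}])
catalog.create_view("Upper", "Document", lambda row: {"text": row["text"].upper()})
catalog.create_view("Unused", "Document", lambda row: row)

before = list(catalog.evaluated)       # still empty: no view has been computed
result = catalog.output_view("Upper")  # only 'Upper' is computed
print(before, result, catalog.evaluated)
```

The view named `Unused` is never computed because nothing asks for its output, which is the point of the analogy.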
AQL
• Fundamental concepts
  − Extract expressions
    • Typically used to extract basic features
    • Extract from columns in other views, including the text column in the Document view
    • Basic capabilities include extraction using regex, dictionary and sequence
    • Other operations include splits, blocks and parts of speech
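As a rough illustration of two of the capabilities listed above, regex extraction and splitting, the following Python sketch produces spans (begin offset, end offset, covered text) from a document string. It is an analogy written in Python, not AQL. The `Span` type, the function names, and the sample sentence are all assumptions made up for this example.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    begin: int  # offset of the first character
    end: int    # offset one past the last character
    text: str   # the covered text


def extract_regex(pattern, text):
    """Return one span per regex match, in document order."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


def split_on(boundaries, text):
    """Return the spans that lie between consecutive boundary spans."""
    pieces, start = [], 0
    for b in boundaries:
        if b.begin > start:
            pieces.append(Span(start, b.begin, text[start:b.begin]))
        start = b.end
    if start < len(text):
        pieces.append(Span(start, len(text), text[start:]))
    return pieces


# Invented sample text for the illustration.
doc = "Watson plays Jeopardy! Watson also reads medical records."

capitalized = extract_regex(r"\b[A-Z][a-z]+\b", doc)         # basic feature: capitalized words
sentences = split_on(extract_regex(r"[.!?]\s*", doc), doc)   # split on sentence punctuation

print([s.text for s in capitalized])
print([s.text for s in sentences])
```

Keeping offsets alongside the matched text is what lets later steps reason about adjacency and containment between extracted features.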
AQL
• Fundamental concepts
  − Select expressions
    • Typically used to combine, aggregate and filter extracted fields to create candidate concepts and final values
    • Select existing columns and extract from columns
      − Specified using <from list>
    • Rich set of operators and clauses
      − The where, consolidate, group by, order by, and limit clauses are optional
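The combine-then-consolidate pattern described above can be sketched in Python as a join on an adjacency predicate followed by removal of spans contained in other spans. This is an analogy, not AQL. The function names, the character-gap predicate, and the containment rule are assumptions chosen for the illustration rather than the behavior of any specific AQL policy.

```python
from typing import NamedTuple


class Span(NamedTuple):
    begin: int
    end: int


def follows_within(a, b, max_gap):
    """True if span b starts at or after the end of span a, at most max_gap characters later."""
    return 0 <= b.begin - a.end <= max_gap


def join_adjacent(left, right, max_gap):
    """Join two span lists on the adjacency predicate and merge each matching pair."""
    return [
        Span(a.begin, b.end)
        for a in left
        for b in right
        if follows_within(a, b, max_gap)
    ]


def consolidate_contained(spans):
    """Drop every span that lies inside another, different span."""
    unique = sorted(set(spans))
    return [
        s
        for s in unique
        if not any(o != s and o.begin <= s.begin and s.end <= o.end for o in unique)
    ]


# Offsets refer to the text "IBM Watson in healthcare".
brand = [Span(0, 3)]      # "IBM"
product = [Span(4, 10)]   # "Watson"

candidates = join_adjacent(brand, product, max_gap=1) + product
final = consolidate_contained(candidates)
print(final)  # the merged "IBM Watson" span survives; the bare "Watson" span is dropped
```

Consolidation matters because overlapping candidates would otherwise be counted twice in downstream charts.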
Select vs Extract
• Which do I use when?
  − Both have a <select list>
  − But you can only specify an <extraction specification> in an extract expression
  − Both have a <from list>
  − You can apply simple predicate-based filters in the <having clause> of an extract expression or in the <where clause> of a select expression
  − But you can only use predicates to combine rows from views (a join) in the <where clause> of a select expression
  − You can apply a <consolidation policy> or a <limit> in either an extract or a select expression
  − But you can only <group> and <order> using a select expression

extract
  <select list>,
  <extraction specification>
  from <from list>
  [having <having clause>]
  [consolidate on <column> [using '<policy>' [with priority from <column> [priority order]]]]
  [limit <maximum number of output tuples for each document>];

select
  <select list>
  from <from list>
  [where <where clause>]
  [consolidate on <column> [using '<policy>' [with priority from <column> [priority order]]]]
  [group by <group by list>]
  [order by <order by list>]
  [limit <maximum number of output tuples for each document>];
Select vs Extract
• If you need to extract, use an extract expression
• If you need to group, order or join, use a select expression
Scenario
Acquire the Data
Source social media data from BoardReader, an IBM business partner with a commercial offering that provides a searchable archive of various web-based data sources
BoardReader App
Transform and Export using BigSheets
Extract a subset of social media data from a BigSheets workbook populated with data from IBM’s sample BoardReader application.
Inside a BigSheets workbook, press the 'Export As' button and export the workbook to DFS using the specified settings.
Download this file to the local file system of the Eclipse development environment to use as sample input data for text analytics development.
Building a Text Analytics Extractor
• Working in the Eclipse environment, you will build an Extraction Plan and use the Extraction Tasks workflow to develop and test a simple extractor
Building a Text Analytics Extractor
• Using the Eclipse tools
Developing Simple AQL
• Simple dictionary-based extraction
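The slide's AQL is not reproduced in this text, so the following is only a Python sketch of what dictionary-based extraction does conceptually: match a fixed list of terms against the document text and report where each match occurs. The dictionary entries, the function name, and the sample post are assumptions invented for the example.

```python
import re

# Hypothetical dictionary entries for the "IBM Watson" scenario.
WATSON_DICT = ("Watson", "IBM Watson")


def dictionary_matches(entries, text):
    """Case-insensitive, whole-word matching that prefers longer entries.

    Returns a list of (begin, end, matched_text) tuples in document order.
    """
    ordered = sorted(entries, key=len, reverse=True)  # try longer entries first
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]


# Invented sample post.
post = "ibm watson is in the news again. Watsons Bay is not."
matches = dictionary_matches(WATSON_DICT, post)
print(matches)
```

The whole-word boundary keeps "Watsons" from matching the entry "Watson", which is the kind of false positive a dictionary extractor has to guard against.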
Testing the Extractor
• Run from the workflow and examine the results
Publish the Extractor
Configure and Deploy Application
• Back in the BigInsights Web Console, the extractor is available to be deployed
Run the Extractor from BigSheets
Additional Analytics
• Develop and deploy additional extractors
  − Understand Watson applications in Healthcare
  − Understand the link with Jeopardy!
  − Understand the tone/sentiment
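For the tone/sentiment item, one very simple approach is to count words from a positive list and a negative list. The Python sketch below illustrates that idea only. The word lists are made up for the example, and the deck does not say how its own sentiment extractor scores text.

```python
import re

# Hypothetical word lists, invented for this illustration.
POSITIVE = {"great", "impressive", "love"}
NEGATIVE = {"dead", "broken", "hate"}


def tone_score(text):
    """Return (# positive words) minus (# negative words) for one piece of text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


print(tone_score("I had an iphone, but it's dead"))    # one negative word
print(tone_score("Watson is impressive, I love it"))   # two positive words
```

A score above zero suggests a positive tone and below zero a negative one. Word counting of this kind ignores negation and sarcasm, so it is a starting point at best.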
Additional Resources
• Big Data Hub
  http://www.ibmbigdatahub.com/
• developerWorks
  http://www.ibm.com/developerworks/bigdata/
• Big Data and Analytics on YouTube
  http://www.youtube.com/ibmbigdata
• Big Data University
  http://www.bigdatauniversity.com/