Big Data and Internet Thinking
Chentao Wu, Associate Professor
Dept. of Computer Science and [email protected]
Download lectures
• ftp://public.sjtu.edu.cn
•User: wuct
•Password: wuct123456
•http://www.cs.sjtu.edu.cn/~wuct/bdit/
Schedule
• lec1: Introduction on big data, cloud computing & IoT
• lec2: Parallel processing framework (e.g., MapReduce)
• lec3: Advanced parallel processing techniques (e.g., YARN, Spark)
• lec4: Cloud & Fog/Edge Computing
• lec5: Data reliability & data consistency
• lec6: Distributed file system & object-based storage
• lec7: Metadata management & NoSQL Database
• lec8: Big Data Analytics
Collaborators
Contents
1. Big Data Analytics
Big Data Challenges
It’s not just about the data…
Big Data (refers to the DATA only) + Big Data Analytics (methods of using Big Data to generate insight)
1. Machine Learning/Deep Learning
• Leveraging a computer’s ability to learn without being explicitly programmed to solve business problems
2. IoT (Internet of Things) & Sensor Analytics
• Understanding value drivers from the ever-growing network of connected physical objects and the communication between them
3. Modeling Willingness-to-Pay
• Mining product reviews to estimate willingness-to-pay for product features
4. Natural Language Processing
• Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics
5. Analyzing Data @ Scale
• Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data
6. Creating a Lake Streaming Consumer Behavior Data
• Mining social data in real time to understand when and where consumers are making choices
• It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and ‘Big Data Analytics’.
It’s also about what, how, and why you use it
• Big Data Analytics – the process of harnessing Big Data to yield actionable insights – is a combination of five key elements:
• Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.
• Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.
• Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.
• Technology: To store, manage, and use Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.
• Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.
Big Data Analytical Capabilities
• Continuing increases in processing capacity have opened the door to a range of advanced algorithms and modeling techniques that can produce valuable insights from Big Data.
[Figure: capabilities arranged from Traditional to Emerging (vertical axis) and from Structured to Unstructured data (horizontal axis)]
• A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
• Sentiment Analysis: Extract consumer reactions based on social media behavior
• Complex Event Processing: Combine data sources to recognize events
• Predictive Modeling: Use data to forecast or infer behavior
• Regression: Discover relationships between variables
• Time Series Analysis: Discover relationships over time
• Classification: Organize data points into known categories
• Simulation Modeling: Experiment with a system virtually
• Spatial Analysis: Extract geographic or topological information
• Cluster Analysis: Discover meaningful groupings of data points
• Signal Analysis: Distinguish between noise and meaningful information
• Visualization: Use visual representations of data to find and communicate info
• Network Analysis: Discover meaningful nodes and relationships on networks
• Optimization: Improve a process or function based on criteria
• Deep QA: Find answers to human questions using artificial intelligence
• Natural Language Processing: Extract meaning from human speech or writing
Forward-Looking vs. Rear-View Analytics
• Big Data Analytics improves the speed and efficiency with which we understand the past, and opens up entirely new avenues for preparing for and adapting to the future.
[Figure axes: Increasing Business Value (vertical); Increasing Sophistication of Data & Analytics, from Rear-view to Forward-looking (horizontal)]
Descriptive Analytics: What happened? Describe, summarize and analyze historical data
Diagnostic Analytics: Why did it happen? Identify causes of trends and outcomes
• Observed behavior or events
• Non-traditional data sources such as social listening and web crawling
• Statistical and regression analysis
• Dynamic visualization
Predictive Analytics: What could happen? Predict future outcomes based on the past
Prescriptive Analytics: What should be done? Recommend ‘right’ or optimal actions or decisions
Continuous Analytics: How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously
Data sources and techniques listed for the other stages:
• Observed behavior or events
• Non-traditional data sources such as social listening and web crawling
• Forward-looking view of current and future value
• Sentiment Scoring
• Graph analysis and Natural Language Processing to identify hidden relationships and themes
• Dual objective models
• Behavioral economics
• Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
• Rapid evaluation of multiple ‘what-if’ scenarios
• Optimization decisions and actions
• Monitor results on a continuous basis
• Dynamically adjust strategies based on changing environment and improved predictions
• Agent-based and dynamic simulation models, time-series analysis
Examples of Big Data Analytics in Action
• Market leaders are leveraging Big Data Analytics to generate value by starting with a business need and focusing on implementing actionable insights quickly and decisively.
Each example lists: Business Need, Data and Analytics, Impact (the company appears only as a logo on the slide)
Example 1
• Business need: Greater tailoring of credit card offers to fit customer needs
• Data and analytics: Statistical model based on public credit and demographic data to target customized products to customers
• Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics
Example 2
• Business need: Data-enabled engine prognostics, monitoring, maintenance and repair
• Data and analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance
• Impact: Over 70% of annual revenue from the aircraft engine division attributable to this service
Example 3
• Business need: Search-to-purchase conversion by anticipating the intent of a shopper’s search and delivering relevant results
• Data and analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web
• Impact: Increases the likelihood that a customer will complete their purchase by 10-15%, translating to millions of dollars in revenue
Example 4
• Business need: Transformation from subscription streaming service to original content producer
• Data and analytics: Analysis of data from 66 million subscribers’ viewing habits and preferences
• Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013
Example 5
• Business need: Leverage Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency and reduce downtime
• Data and analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost
• Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years
Big Data Analytics in Development
• Big Data Analytics is making an equally impressive impact on Development interventions – allowing decision-makers to reach and serve previously neglected populations.
Each example lists: Business Need, Data and Analytics, Impact (the organization appears only as a logo on the slide)
Example 1
• Business need: More transparent, reliable, and low-cost method to track inflation in Argentina
• Data and analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies
• Impact: Government statistical offices shifting to accept Big Data. Central banks using Big Data to see day-to-day volatility.
Example 2
• Business need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium
• Data and analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.)
• Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior
Example 3
• Business need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding
• Data and analytics: The city combines data from 30 city agencies – including weather, satellite, video, GPS, historic rainfall, and topographic survey data – in a central Operations Center
• Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis
Example 4
• Business need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique
• Data and analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers’ credit worthiness, and incubate new mobile businesses with greater predictors of success
• Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information
Big Data Landscape
Creating a Big Data Organization
Step 1: Be Yourself
• Beginning with a clear understanding of the specific questions you intend to use Big Data Analytics to address can help guide where and which data solutions are deployed.
[Figure axis: from value enablement to value enhancement]
Operational: Day to day operations
• Struggle to move from narrow focus on reactive operations to more proactive, comprehensive management of daily operations
• High value for digitization of operational processes across program units
• Often already proficient in traditional business intelligence
Tactical: Enabling strategy and improving performance
• Use analytics to reduce political divergence and drive consensus
• Real-time analytics to enable quick responses to events
• Use data to develop personalized services
• Need for more objective and higher quality data
Strategic: Delivering future value
• Data-driven decision-making in real time
• Use analytics to develop new programs/opportunities
• Relies heavily on data supplied by others
• Often struggles to move away from exclusively intuitive decision-making
Creating a Big Data Organization
Step 2: Secure People & Skills
• The competencies required of “data scientists” within an analytics organization or project converge from multiple skill domains.
• Organization-specific Information Knowledge: Organization-specific knowledge about data assets – including enterprise “metadata” – their location and appropriate business context for use in advanced analytics
• Computer Science & Programming: Comfort in programming across various languages, a thorough understanding of external and internal data sources, data gathering, storing, and retrieving methods which help combine disparate data sources to generate unique insights
• Statistical & Mathematical: Expertise in statistical techniques, tools and languages used to run analyses that generate insights to effectively determine and communicate actionable insights
• Subject Area or Domain Expertise: Deep understanding of industry, subject area, or research domain to help determine which questions need answering and on what frequency, specificity, or geography
Creating a Big Data Organization
Step 3: Let objectives dictate structure, not vice versa
• How analytics efforts or organizations are structured – whether reporting is vertically or horizontally aligned, how interconnected or autonomous separate units are, how resources and successes are shared – can influence efficiency and impact.
[Figure: Distributed, Federated, and Centralized Analytics models, showing how the ETL, Data Warehouse, BI Applications, Metadata Repository, and Data Mart components are placed across LOCAL units and a CENTRAL Analytics Competency Center]
Objectives
• Adopt previously proven practices
• Highly focused analytics support
• Subject area-specific innovations
• Repeatable models
• Governance
• Aligning analytics to organization-wide strategy
Data Warehouses, Marts, etc.
• Distributed: Deployed locally
• Federated: Deployed locally; some data and models shared across groups
• Centralized: Deployed and managed centrally
Analytics Tools
• Distributed: Managed locally
• Federated: Managed locally, but connected to group framework
• Centralized: Controlled centrally, with units having access to shared resources
Analytics Staff/Competencies
• Distributed: Placed within individual units
• Federated: Placed within individual units; skills tailored to specific region or subject matter
• Centralized: Placed within central analytics team, available as needed to support individual units
The ‘Hub-Spoke’ operating model often serves as a well-synchronized, connected system
[Figure: Sample Hub-Spoke Interaction Model]
1. Central Decision Hub
2. Competency Center (‘Standards’ / ‘Standardization’)
3. Centers of Excellence (Regional)
4. Local ‘Spokes’
Other labels in the figure: Global Business Strategy, Local Business Operations, Local Adoption of Practices
Creating a Big Data Organization
Step 4: Invest in Appropriate Infrastructure
• Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support.
Objective | Considerations | Impact
Analytics Capabilities: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed
• Analysis Type: Dictates performance needs along with data structures and processing architecture
• Analysis Flexibility: Interface could restrict the ability to perform analysis ad hoc and restrict ability to update
• Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort
Data Variety: Define the data set that will be used for the analysis including its sources, size, and structure
• Size: Size of data sets introduces need for scalable infrastructure and performance
• Structure: Variability of source data models and data set structure requires data model flexibility
• Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces
Application: Define the timeliness and frequency of the analysis results for reporting and downstream systems
• Frequency: Frequency of analysis will dictate the processing architecture (batch or real time)
• Speed: The timeliness of the analysis will impact the need for scalability and performance
• Interfaces: Inbound and outbound interfaces are defined by the use of data and required flexibility
Contents
2. Architecture Design
Emerging Infrastructure Options
• To harness Big Data, storage solutions must be able to support targeted analytics capabilities, data diversity and performance needs
• Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware
• NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures
• Cloud Computing: Cloud computing can improve flexibility, scalability and cost management and enable a cohesive business strategy across an organization
Traditional challenges being addressed…
• Scalability issues
• Big Data set information extraction and queries require large volumes of processing cycles that can quickly scale
• Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data
• Need to combine and link multiple data sources
Building an Analytics Organization: Critical Components
Distributed Processing Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware
Introduction to Hadoop
• Hadoop is based on work done by Google in early 2000s (combination of Google File System (GFS) and MapReduce)
• Useful for analyzing copious amounts of complex data across multiple data sources
• Distributes data as it is initially stored in the system
• Applications are written in high-level code
• Computation happens where data is stored, whenever possible
• Data is replicated multiple times on the system for increased availability and reliability
Faster and Lower Cost Analysis
Linear Scalability
Greater flexibility
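The MapReduce idea behind Hadoop can be sketched on a single machine. The example below is a minimal illustration in plain Python, not Hadoop code: each string in `blocks` stands in for a data block held on a different node, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one locally stored block."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(blocks):
    # The map step runs once per block, mimicking "computation happens
    # where data is stored"; only the small (word, 1) pairs are moved.
    emitted = []
    for block in blocks:
        emitted.extend(map_phase(block))
    return reduce_phase(shuffle(emitted))

# Hypothetical input: two blocks as they might be split across two nodes.
blocks = ["big data big insight", "data analytics"]
counts = word_count(blocks)
```

In a real cluster the map calls would run in parallel on the nodes holding each block, which is where the linear scalability comes from.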
Emerging Infrastructure – Computing/Storage Options
Emerging Infrastructure – Storage Options
NoSQL Embedded and persisted storage that implement data models through document, graph, and dictionary structures
NoSQL - Storage Types
• Key-Value Store. Pros: Simplicity & Scalability. Cons: Lack of advanced features/queries
• Columnar Store. Pros: Scalability & Flexibility. Cons: Complexity
• Document Store. Pros: Easy to Use. Cons: Scalability
• Graph Store. Pros: Graph Joins. Cons: Flexibility
[Figure: solution examples for each store type, ordered by increasing data complexity]
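To make the four storage types concrete, the sketch below models the same small, made-up data set in each style using plain Python structures. It is only an illustration of the data models, not the API of any particular NoSQL product; all records and names are hypothetical.

```python
import json

# Key-value store: an opaque value looked up by a single key.
key_value = {"user:42": '{"name": "Ann", "city": "Shanghai"}'}

# Document store: the value is a structured document whose fields can be queried.
document = {"_id": 42, "name": "Ann", "city": "Shanghai",
            "orders": [{"sku": "A1", "qty": 2}]}

# Columnar store: values are grouped by column and keyed by row id.
columnar = {"name": {42: "Ann", 43: "Bo"},
            "city": {42: "Shanghai", 43: "Beijing"}}

# Graph store: nodes plus explicit, typed edges between them.
graph_nodes = {42: {"name": "Ann"}, 43: {"name": "Bo"}}
graph_edges = [(42, "FOLLOWS", 43)]

def followers_of(node_id):
    """Traverse edges to find who follows the given node (a 'graph join')."""
    return [src for src, rel, dst in graph_edges
            if rel == "FOLLOWS" and dst == node_id]

# The key-value store must fetch and decode the whole value to read one field.
city_from_kv = json.loads(key_value["user:42"])["city"]
# The columnar layout reads one column without touching the others.
city_from_columns = columnar["city"][42]
```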
Emerging Infrastructure – System Options
Cloud Computing The model is compelling; cloud computing can improve flexibility, scalability and cost management. Businesses best able to realize the potential will establish a cohesive business strategy as cloud computing can transform your entire organization — people, processes, and systems
Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”
Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs.
The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.
Relational Reference Architecture
Extended Relational Reference Architecture
Non-Relational Reference Architecture
Data Discovery: Non-Relational Architecture
Business Reporting: Hybrid Architecture
Contents
3. Big Data Algorithms
Key components of Mahout in Hadoop (1)
Key components of Mahout in Hadoop (2)
Key Components of Spark MLlib
Spark ML Basic Statistics
◼ Correlation: Calculating the correlation between two series of data is a common operation in statistics
➢ Pearson’s Correlation
➢ Spearman’s Correlation
Example of Popular Similarity Measurements
◆ Pearson Correlation Similarity
◆ Euclidean Distance Similarity
◆ Cosine Measure Similarity
◆ Spearman Correlation Similarity
◆ Tanimoto Coefficient Similarity (Jaccard coefficient)
◆ Log-Likelihood Similarity
Pearson Correlation Similarity
Data:
Missing Data
On Pearson Similarity
Three problems with the Pearson Similarity:
1. It does not take into account the number of items on which two users’ preferences overlap (e.g., 2 overlapping items can yield a correlation of 1, while a larger overlap may score lower).
2. If two users overlap on only one item, no correlation can be computed.
3. The correlation is undefined if all preference values in either series are identical.
Adding Weighting. WEIGHTED as 2nd parameter of the constructor can cause the
resulting correlation to be pushed towards 1.0, or -1.0, depending on how many
points are used.
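The three problems can be reproduced with a small sketch of Pearson similarity over co-rated items. This is a plain-Python illustration, not Mahout's implementation; the function name `pearson_similarity` and the sample preferences are hypothetical, and the weighting option is not modeled.

```python
import math

def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation over the items both users rated; None when undefined."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return None  # problem 2: one overlapping item gives no correlation
    xs = [prefs_a[item] for item in common]
    ys = [prefs_b[item] for item in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # problem 3: a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Problem 1: only two overlapping items, yet the similarity is a "perfect" 1.0.
two_items = pearson_similarity({"a": 1.0, "b": 5.0}, {"a": 2.0, "b": 3.0})
# Problem 2: a single overlapping item.
one_item = pearson_similarity({"a": 1.0, "b": 5.0}, {"b": 3.0, "c": 4.0})
# Problem 3: one user gave identical values to every shared item.
constant = pearson_similarity({"a": 3.0, "b": 3.0}, {"a": 1.0, "b": 5.0})
```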
Spearman Correlation Similarity
Example for ties
Pearson value on the relative ranks
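A minimal sketch of this idea: replace each value by its rank, give tied values the average of the ranks they span, and apply Pearson to the ranks. Average ranking for ties is one common convention and is an assumption here; the function names are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Plain Pearson correlation; assumes neither series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman(xs, ys):
    """Spearman correlation: the Pearson value on the relative ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

tied_ranks = average_ranks([10, 20, 20, 40])
monotone = spearman([1, 2, 3, 4], [1, 4, 9, 16])
```

Because only the ordering matters, any monotonically increasing relationship scores 1 even when it is not linear.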
Basic Spark Data Format
Data: 1.0, 0.0, 3.0
// dense vector: straightforward
// sparse vector: number of parameters, locations of non-zero indices, and non-zero values
// sparse vector: number of parameters, sequence of non-zero values as (index, value) pairs
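The two representations can be converted into each other with a few lines of plain Python. In PySpark the corresponding constructors are `Vectors.dense` and `Vectors.sparse` from `pyspark.ml.linalg`; the sketch below deliberately avoids a Spark dependency, and the helper names `to_sparse` and `to_dense` are hypothetical.

```python
def to_sparse(dense):
    """Return (size, indices of non-zero entries, non-zero values)."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full vector, filling unspecified positions with 0.0."""
    dense = [0.0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense

size, indices, values = to_sparse([1.0, 0.0, 3.0])
# The same vector as a sequence of (index, value) pairs.
pairs = list(zip(indices, values))
```

The sparse form pays off when most entries are zero, since only the non-zero positions are stored.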
Correlation Example in Spark
1.0, 0.0, 0.0, -2.0
4.0, 5.0, 0.0, 3.0
6.0, 7.0, 0.0, 8.0
9.0, 0.0, 0.0, 1.0
Euclidean Distance Similarity
Similarity = 1 / ( 1 + d )
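As a small worked example of this formula, the sketch below computes the Euclidean distance d between two preference vectors and maps it to a similarity in (0, 1]. The function name is hypothetical.

```python
import math

def euclidean_similarity(a, b):
    """Similarity = 1 / (1 + d), where d is the Euclidean distance."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + d)

identical = euclidean_similarity([3.0, 4.0], [3.0, 4.0])  # d = 0
apart = euclidean_similarity([0.0, 0.0], [3.0, 4.0])      # d = 5
```

Identical vectors give 1, and the similarity shrinks towards 0 as the distance grows.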
Cosine Similarity
Cosine similarity and Pearson similarity get the same results if data are normalized (mean == 0).
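This equivalence can be checked numerically. The sketch below computes Pearson from its covariance definition and cosine similarity from the dot product, then compares them on raw and on mean-centered data; the sample vectors and function names are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation from covariance and variances."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def center(v):
    """Subtract the mean so that the vector has mean 0."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

a = [1.0, 2.0, 3.0, 5.0]
b = [2.0, 1.0, 4.0, 6.0]
raw_gap = abs(cosine(a, b) - pearson(a, b))                       # differ on raw data
centered_gap = abs(cosine(center(a), center(b)) - pearson(a, b))  # agree once centered
```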
Spearman Correlation Similarity is time consuming. Caching is needed: remember the user-user similarities that were previously computed.
Caching User Similarity
Tanimoto (Jaccard) Coefficient Similarity
Discard preference values
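A minimal sketch of the Tanimoto (Jaccard) coefficient: only the sets of items matter, so the preference values stored in the dictionaries are ignored. The function name and sample data are hypothetical.

```python
def tanimoto_similarity(prefs_a, prefs_b):
    """|A ∩ B| / |A ∪ B| over the items each user has expressed a preference for."""
    items_a, items_b = set(prefs_a), set(prefs_b)
    union = items_a | items_b
    if not union:
        return 0.0  # convention chosen here for two empty users
    return len(items_a & items_b) / len(union)

# The ratings differ widely, but only the item sets are compared.
user_a = {"a": 5.0, "b": 1.0, "c": 3.0}
user_b = {"b": 4.0, "c": 2.0, "d": 5.0}
similarity = tanimoto_similarity(user_a, user_b)  # 2 shared items out of 4
```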
Log-Likelihood Similarity
Assess how unlikely it is that the overlap between the two users is just due to chance.
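One way to quantify "unlikely to be chance" is the log-likelihood ratio (G-test) on a 2x2 table of item counts: items both users have, items only one of them has, and items neither has. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and mapping the ratio to a score with 1 - 1/(1 + LLR) is an assumed convention, not something stated on the slide.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G = 2 * sum(O * ln(O / E)) over the 2x2 table [[k11, k12], [k21, k22]]."""
    total = k11 + k12 + k21 + k22
    if total == 0:
        return 0.0
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:
                expected = rows[i] * cols[j] / total  # count expected under independence
                g += o * math.log(o / expected)
    return 2.0 * g

def log_likelihood_similarity(items_a, items_b, num_items):
    """Score in [0, 1): higher means the overlap is less likely to be chance."""
    a, b = set(items_a), set(items_b)
    k11 = len(a & b)               # items both users have
    k12 = len(a - b)               # only user A
    k21 = len(b - a)               # only user B
    k22 = num_items - len(a | b)   # neither user
    llr = max(0.0, log_likelihood_ratio(k11, k12, k21, k22))
    return 1.0 - 1.0 / (1.0 + llr)

chance_level = log_likelihood_similarity({"a", "b"}, {"a", "c"}, 4)
strong = log_likelihood_similarity({"a", "b"}, {"a", "b"}, 4)
```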
Performance Measurements
Using GroupLens data (http://grouplens.org): 10 million rating MovieLens dataset.
• Spearman: 0.8
• Tanimoto: 0.82
• Log-Likelihood: 0.73
• Euclidean: 0.75
• Pearson (weighted): 0.77
• Pearson: 0.89
Spark ML Basic Statistics
• Hypothesis testing: Hypothesis testing is a powerful tool in statistics to determine whether a result is statistically significant. Spark ML currently supports Pearson’s Chi-squared (χ2) tests for independence.
• ChiSquareTest conducts Pearson’s independence test for every feature against the label.
Chi-Square Tests (1)
Chi-Square Tests (2)
Chi-Square Tests (3)
We would reject the null hypothesis that there is no relationship between location and type of malaria. Our data tell us there is a relationship between type of malaria and location.
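The test statistic behind such a conclusion can be computed directly from a contingency table. The sketch below uses made-up counts, not the malaria data from the slides, and a hypothetical function name; in Spark the equivalent test is provided by `ChiSquareTest` in `pyspark.ml.stat`.

```python
def chi_square_statistic(table):
    """Pearson's chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical 2x2 table: rows are two locations, columns are two types.
statistic, dof = chi_square_statistic([[10, 20], [20, 10]])

# Critical value of the chi-squared distribution for dof = 1 at the 5% level.
CRITICAL_5_PERCENT_DOF_1 = 3.841
reject_independence = statistic > CRITICAL_5_PERCENT_DOF_1
```

Here every expected count is 15, the statistic is about 6.67, and the null hypothesis of independence is rejected at the 5% level.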
Chi-Square Tests in Spark
Spark ML Clustering
Example: Clustering
FeatureSpace
Clustering
Clustering – on feature plane
Clustering example
Steps on Clustering
Making Initial Cluster Centers
K-means Clustering
HelloWorld Clustering Scenario Result
Testing different distance measures
Manhattan and Cosine distances
Tanimoto distance and weighted distance
Results Comparison
Sample Code of K-Means Clustering in Spark
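As a hedged illustration of the algorithm itself, here is a minimal k-means (Lloyd's algorithm) in plain Python rather than Spark; in PySpark the corresponding estimator is `KMeans` in `pyspark.ml.clustering`. The points, the initial centers, and the function names are made up for this sketch.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, iterations=10):
    """Alternate between assigning points and recomputing centers."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for point in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(point, centers[c]))
            clusters[nearest].append(point)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean_point(cluster) if cluster else centers[i]
                   for i, cluster in enumerate(clusters)]
    return centers, clusters

points = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
# Deliberately poor initial centers, both inside the first group.
centers, clusters = k_means(points, centers=[(1, 1), (2, 2)])
```

Even with both initial centers placed in the same group, the second center is pulled to the far group after one update step.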
Vectorization Example
0: Weight
1: Color
2: Size
Canopy Clustering (estimate the number of clusters)
Tell the algorithm what size of clusters to look for. The algorithm will find the number of clusters that have approximately that size. The algorithm uses two distance thresholds. This method prevents all points close to an already existing canopy from being the center of a new canopy.
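The two-threshold idea can be sketched as follows, assuming a loose threshold T1 and a tight threshold T2 with T1 > T2: points within T1 of a center join its canopy, and points within T2 are removed from the candidate list so they cannot start a new canopy. The function name, thresholds, and points are hypothetical.

```python
import math

def canopy_clustering(points, t1, t2):
    """Return a list of (center, members) canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose) must be larger than t2 (tight)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)  # next candidate becomes a canopy center
        members = [center]
        still_candidates = []
        for point in remaining:
            d = math.dist(center, point)
            if d < t1:
                members.append(point)           # loosely close: joins this canopy
            if d >= t2:
                still_candidates.append(point)  # not tightly close: may still seed a canopy
        remaining = still_candidates
        canopies.append((center, members))
    return canopies

points = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
canopies = canopy_clustering(points, t1=3.0, t2=2.0)
estimated_k = len(canopies)
```

The number of canopies found can then serve as an estimate of k for k-means, with the canopy centers as initial centers.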
Other Clustering Algorithms
Hierarchical clustering
Different Clustering Algorithms
https://github.com/HewlettPackard/cacti
Spark ML Classification
Classification - definition
Classification example: using SVM to recognize a Toyota Camry
When to use Big Data System for Classification?
The advantage of using Big Data System for Classification
How does a classification system work?
Key Terminology for Classification
Input and Output of a classification model
Four types of values for predictor variables
Sample data that illustrates all four values
Supervised vs. Unsupervised Learning
Work flow in a typical classification project
Classification Example – Color-Fill
Position looks promising, especially the x-axis ==> predictor variable. Shape seems to be irrelevant. Target variable is “color-fill” label.
Classification Example – Color-Fill (another feature)
Fundamental classification algorithm
Examples of fundamental classification algorithms:
• Naive Bayesian
• Complementary Naive Bayesian
• Stochastic Gradient Descent (SGD)
• Random Forest
• Support Vector Machines
Choose algorithm
Support Vector Machine (SVM)
maximize boundary distances; remembering “support vectors”
nonlinear kernels
Example SVM code in Spark
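As a hedged stand-in that runs without Spark, the sketch below trains a linear SVM by sub-gradient descent on the regularized hinge loss; in PySpark the corresponding estimator is `LinearSVC` in `pyspark.ml.classification`. The toy data, hyperparameters, and function names are assumptions, and the bias term is omitted for brevity, so the separating line passes through the origin.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear_svm(samples, labels, lr=0.01, lam=0.01, epochs=200):
    """Minimize lam/2 * |w|^2 + hinge loss with per-sample sub-gradient steps."""
    w = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            if y * dot(w, x) < 1:
                # Inside the margin or misclassified: the hinge term is active.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
            else:
                # Correct with enough margin: only the regularizer shrinks w.
                w = [wi - lr * lam * wi for wi in w]
    return w

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

# Hypothetical, linearly separable toy data with labels in {-1, +1}.
samples = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]
weights = train_linear_svm(samples, labels)
predictions = [predict(weights, x) for x in samples]
```

Only points with margin below 1 change the weights, which mirrors the idea that the boundary is determined by the support vectors; nonlinear kernels are not covered by this sketch.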
Contents
4. Tools Support
Data Mining, Text Mining, and Natural Language Processing
Data Mining
Extraction of implicit, previously unknown, and potentially useful information from data
Text Mining
Analysis of large quantities of natural language text and detecting lexical or linguistic usage patterns to extract probably useful information
Natural Language Processing
NLP is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications.
NLP Tools
Tool | Description | Analysis Type
• OpenNLP: A machine learning based toolkit for the processing of natural language text. Analysis types: tokenization; sentence segmentation; part-of-speech tagging; named entity extraction; chunking; parsing; coreference resolution.
• GATE: A Java suite of tools that can perform natural language processing tasks for multiple languages. Analysis types: information extraction; part-of-speech tagging; tokenizer; sentence splitter.
• NLTK: A suite of libraries and programs for symbolic and statistical natural language processing in Python. Analysis types: information extraction; part-of-speech tagging; tokenizer; word categorization; text classification.
• Stanford NLP: Statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs. Analysis types: tokenization; part-of-speech tagging; named entity recognition; parsing; classification; segmentation; coreference resolution.
• LingPipe: A tool kit for processing text using computational linguistics. Analysis types: sentiment analysis; entity recognition; clustering; topic classification; part-of-speech tagging; sentence detection; disambiguation.
• MontyLingua: A suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java. Analysis types: information extraction; part-of-speech tagging; tokenizer; word categorization; text generation; stemming; phrase chunking.
• Rosetta Linguistic Platform: A suite of linguistic analysis components that integrate into applications for mining unstructured data. Analysis types: language identification; name, places, and key concept extraction; name matching; name translation.
Text Mining/Analytics Tools
Tool | Description | Analysis Type
• RapidMiner: An open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics. Analysis types: document classification; sentiment analysis; topic tracking; data mining; traditional analytics.
• SAS Text Miner: A suite of text processing and analysis tools. Analysis types: text parsing; filtering; feature extraction; topic clustering.
• VisualText: Integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. Analysis types: information extraction; summarization; categorization; data mining; document filtering; natural language search.
• SAS Sentiment Analysis: Commercial tool that is dedicated to customer sentiment analysis. Analysis types: customer sentiment monitoring; sentiment discovery.
• Textifier: Tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT). Analysis types: topic modeling; information retrieval; document analysis; social media analysis.
• InfiniteInsight: System for automatically preparing and transforming unstructured text attributes into a structured representation. Analysis types: term frequency; term frequency-inverse document frequency; root word coding; synonym identification; customization of stop words; stemming rules; concepts merging.
• Clustify: Software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization. Analysis types: document clustering.
Text Mining/Analytics Tools Cont.
Tool | Description | Analysis Type
• Attensity Analyze: Customer analytics applications that help analyze high volumes of customer conversations across multiple channels. Analysis types: unstructured communication analysis; sentiment analysis; consumer profiling.
• ReVerb: A program that automatically identifies and extracts binary relationships from English sentences. Analysis types: information extraction; topic identification; topic linking.
• Open Text Summarizer: Open source tool for summarizing texts. Analysis types: document summarization.
• Open Calais: Web-based API that is used to analyze content and extract topics or information. Analysis types: attribute/feature extraction; fact identification.
• KnowledgeSearch: Family of techniques and tools for searching and organizing large data collections. Analysis types: semantic analysis.
• KH Coder: Free software for quantitative content analysis or text mining. Analysis types: text parsing; document search; network analysis.
Image Analytics Overview
Overview
• The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis
• Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information
Advantages
• Provides a method to structure, organize, and search information that is stored within images
• Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content
Image Analytics Tools
Tool | Overview | Image Processing | Computer Vision | Machine Learning
• OpenCV: Open source library of computer vision functions that is accessible via C, Java, and Python. Marks: X X X
• PAXit Image Analysis: Integrated image analysis platform that provides basic feature identification functions. Marks: X X
• ImageJ: Java based image processing platform that can be accessed via an API and expanded with custom plugins. Marks: X
• PIL: Python image processing library. Marks: X
• PyBrain: A modular machine learning library for Python. Marks: X
Audio Analytics Overview
Overview
• The process of capturing audio and analyzing its features so as to extract the content and context of an event
• Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques
Advantages
• Provides a method for identifying events or common patterns within sound bites
• Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context
Audio Analytics Tools
Tool | Overview | Audio Processing | Information Retrieval
• Clam: A C++ library that provides varying levels of audio processing and information retrieval capabilities. Marks: X X
• CallMiner: A tool that is capable of translating calls to a more structured text data set and combining with other communication forms. Marks: X
• Nuance: Logs calls and structures audio for text based search and retrieval. Marks: X
• yaafe: Audio feature extraction toolkit with wrappers for several languages. Marks: X
• PRAAT: Multiple platform audio analysis toolkit. Marks: X
Social Network → Applications (1)
Analysis Objectives
Collaboration Analysis
Evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures
• Identify team structures that are not effective
• Identify informal organizational structures
• Identify individuals/roles or groups that are influential to collaborative work environments
Content/Knowledge Management
Evaluate how knowledge or content is diffused and accessed within an organization
• Improve content and knowledge distribution
• Identify content bottlenecks, open communication flows, and establish channels
• Explore impact of new communication methods
Community Mining
Identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks
• Improved structures for key organizational functions.
• Improved information flows
• Identify potential bottlenecks for organizational functions
• Identify cultural patterns to build other communities
Organization Development
Explore formal and informal organization structures and how individuals work with one another to improve the design of the organization
• Improve hierarchy and structure of organization to better align with the informal practices
• Identify team members that are effective leaders and would impact the organization if promoted
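One objective above, identifying individuals who are influential in a collaborative environment, is often approximated with a centrality measure. The following is a minimal sketch using degree centrality only; the edge list and the names in it are hypothetical, and a real analysis would likely use one of the network libraries listed later in this lecture and richer measures.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: fraction of the other nodes it is directly connected to}."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical "who exchanges information with whom" edges for one team.
edges = [
    ("ana", "ben"), ("ana", "cho"), ("ana", "dev"),
    ("ben", "cho"), ("dev", "eli"),
]

scores = degree_centrality(edges)
most_connected = max(scores, key=scores.get)
print(most_connected, scores[most_connected])
```

Degree centrality only counts direct contacts; it says nothing about who bridges otherwise separate groups, which is a different question.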
Social Network → Applications (2)
Analysis Objectives
Disaster recovery planning
Assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans
• Identify communication improvements to disaster recovery teams
• Identify weak links among functional groups to improve collaboration during recovery plan execution
Data/Information Dissemination
Assess how data points or information sets originate or are distributed across the enterprise to their intended targets
• Identify overlapping information sets and bottlenecks for information dissemination
• Assess how organization structures or information architecture impact the flow of information to its targets
Fraud Detection/Prevention
Assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity
• Identify network agents that collaborate with known fraudulent agents
• Identify activities that align with known fraudulent behavior
Process Discovery/Improvement
Analyze the organization structure and communication patterns to uncover process improvements or identify new processes
• Identify process improvements through discovery of hidden process steps, communication flows, and actors
• Discover undocumented or informal processes that are hidden within frequent collaboration and communication paths
Supply Chain Analysis
Evaluate the structure of a supply network and the interactions among the entities that comprise the network to identify gaps, bottlenecks and sourcing strategies
• Identify communication gaps that could impact dependent processes or operations
• Identify strategic relationships to optimize the supply network
• Identify supply nodes that create inefficiencies
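The fraud detection objective above (identify network agents that collaborate with known fraudulent agents) can be illustrated with a one-hop neighbor lookup. This is a sketch under simple assumptions: the agent IDs, the edge list, and the set of known fraudulent agents are all hypothetical, and a direct link is treated as "collaboration".

```python
from collections import defaultdict

def flag_contacts(edges, known_fraud):
    """Map each non-flagged agent to the known fraudulent agents it links to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    flagged = {}
    for agent, adj in neighbors.items():
        if agent in known_fraud:
            continue  # already known, nothing new to report
        hits = adj & known_fraud
        if hits:
            flagged[agent] = sorted(hits)
    return flagged

# Hypothetical communication links and known fraudulent agents.
edges = [("a1", "a2"), ("a2", "a3"), ("a3", "a4"), ("a4", "a5"), ("a2", "a5")]
known_fraud = {"a3", "a5"}

suspects = flag_contacts(edges, known_fraud)
print(suspects)
```

A flagged agent is only a candidate for review; a shared link is not evidence of fraud on its own.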
Social Network → Applications (3)
Analysis Objectives
Novelty/Sentiment Diffusion Analysis
Observe how a specific topic, news article, or sentiment diffuses through a consumer network
• Assess how target consumers/market will react to a piece of news or campaign
• Evaluate how long news, data, or sentiment will be retained within a system and how far it will spread
Market Influencer Identification
Monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities
• Identify individuals or groups that influence markets and adoption
• Identify untapped markets
• Identify market segments as targets for ad campaigns to improve product/service adoption
Consumer Segmentation
Analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics
• Improve product or service offerings based on attributes that connect the consumer market
• Develop strategies to target new or existing consumers based on identified segmentation characteristics
Product or Brand Diffusion Analysis
Analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse
• Identify segments or individuals that will be likely early adopters
• Identify incentives or campaigns that will improve product/service adoption
Recommendation Systems
Analyze consumer network connections and common features among consumers to develop recommendations
• Identify new feature sets for products and services
• Assess new markets for selling similar or new products
• Target consumers with specific products or services
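The diffusion objectives above ask how far a piece of news or a product will spread through a consumer network. A rough way to reason about this is a breadth-first spread from a set of seed consumers. The sketch below assumes, unrealistically, that every contact passes the item on in the next round; real diffusion models are probabilistic. The network and seed names are hypothetical.

```python
from collections import defaultdict, deque

def diffusion_rounds(edges, seeds):
    """Return {node: round in which it is first reached} via breadth-first spread."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    reached = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in reached:
                reached[nxt] = reached[node] + 1
                queue.append(nxt)
    return reached

# Hypothetical consumer network; "e" and "f" are not connected to the seed.
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("c", "d"), ("e", "f")]

rounds = diffusion_rounds(edges, ["s"])
print(rounds)
```

Consumers missing from the result were never reached, which is one simple way to spot segments a campaign would not touch.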
Social Network → Tools (1)
Tool | Overview | Network Analysis | Network Visual | Network Manipulation
SNAP
A general-purpose network analysis and graph mining library for C++.
X X
Statnet
A package for R that provides capabilities for statistical analysis of social networks.
X
libSNA, graph-tool, NetworkX
Python libraries for network analysis and manipulation.
X X
JUNG
Java package for network analysis and modeling.
X X X
NodeXL
Excel plug-in that provides an easy-to-use, interactive interface for exploring and visualizing networks.
X X
Social Network → Tools (2)
Tool | Overview | Network Analysis | Network Visual | Network Manipulation
Gephi
Interactive open source platform for network analysis and visualization.
X X X
Ucinet
Commercial social network analysis tool with a separate visualization component.
X X
Graphviz
Open source graph visualization package.
X
NetMiner
Proprietary package that provides the ability to develop and implement custom algorithms.
X X X
KXEN SNA
Network analysis package that provides predictive analytics and customer MDM integration.
X X X
ProM
Open source package for mining business process networks.
X X X
Cytoscape
Open source tool for network modeling and analysis. Can connect to external data sources.
X X X
Network Workbench
Large-scale network analysis, modeling, and visualization toolkit for biomedical, social science, and physics research.
X X X
Contents
5. Deep QA/Mind/Brain Systems
DeepQA/Mind/Brain
What is DeepQA?
• DeepQA forms the core of Watson, the open-domain question analysis and answering system
• The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms
• DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)
What is the target problem set?
• Understanding the meaning and context of human language
• Searching and retrieving information from a large library of unstructured information
• Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set
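The Map/Reduce pattern that this slide says DeepQA's infrastructure leverages can be imitated in a single process to show its three steps: map, shuffle (group by key), and reduce. This is only an illustration of the pattern on a word count; it is not the Hadoop API and not a description of how Watson itself is implemented. The function names and the two sample documents are hypothetical.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, 1) pair per word in the document."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

# Hypothetical input documents.
docs = {"d1": "Watson answers questions", "d2": "questions need answers"}

mapped = [pair for doc_id, text in docs.items() for pair in map_phase(doc_id, text)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a distributed setting the map and reduce calls run on different machines and the shuffle moves data between them; here everything runs in one loop for clarity.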
DeepQA Infrastructure Technology
Data Management and Search
Area | Technology
Unstructured Information Architecture | UIMA
SQL Server | MySQL, Apache Derby
Java Natural Language Toolkit | OpenNLP, Stanford NLP
Map/Reduce | Apache Hadoop
Commonsense Knowledgebase | OpenCyc, Open Mind Common Sense
Triple Store | Apache Jena, OpenAnzo
Text Search | Lucene, Open FTS
DeepQA Infrastructure Technology
Platform and Administration
Area | Technology
Web Server | Apache
Virtualization Host | VMware, Xen
Distributed File System | Apache Hadoop, OpenAFS
File Management/Archival | rsync
OS | Fedora
Cloud Management | Extreme Cloud Administration, OpenNebula
Business Applications
Overview Objectives
Knowledge Discovery
Search internal and external unstructured/structured information assets to uncover previously unknown knowledge
• Identify information about a subject through deep analysis of internal and external information sources
• Answer questions about a business problem or trend that may be difficult to analyze within traditional data sources
E-Discovery
Search documents and communications to uncover relevant information associated with a specific topic
• Identify business topics and trends within communication and documents
• Search for non-compliance activities within internal and external data sources
Contract Evaluations
Search through single or multiple contracts to answer specific questions about the nature of the contract
• Identify key facts or issues that comprise a contract or sets of contracts
• Identify contracts or legal documents that contain similar entities or features
Relationship Management
Provide the ability to interact with consumers by giving precise responses to technical and open-domain questions
• Provide a platform for automatically answering consumer questions about products or services
• Reduce reliance on call centers and improve interaction with consumers
Consumer Discovery
Search consumer communications, social media, and sales information to identify opportunities and demographics
• Identify background information about consumers
• Identify consumer qualities that create risks or represent opportunities
Technical Troubleshooting
Find answers to technical and process problems through
• Utilize unstructured data and communications to identify solutions or root causes to system and process problems
Areas for Further Research
Infrastructure/Tools and Search Technologies/Concepts
Topic Research
Tools
Hadoop Map/Reduce
The tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).
OpenNLP
A Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps as well as how it can be incorporated into UIMA.
OpenCyc
An open commonsense reasoning platform. Need to better understand the tool's role as well as how it fits with the other technologies.
UIMA
An architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities.
Lucene
A text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.
Search
Text Search Scoring
Algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.
Triple Store Search
Triple stores maintain data in a subject-predicate-object structure and are used to return quick facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.
Commonsense Reasoning
Research is required to understand this branch of AI, its technologies, and its role within DeepQA.
Document/Information Retrieval
Generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.
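The subject-predicate-object structure described under Triple Store Search above can be sketched with an in-memory set of tuples and a pattern match in which unspecified positions act as wildcards. This is a toy illustration, not the API of Apache Jena or any other real triple store; the class name, method names, and predicate labels are hypothetical.

```python
class TripleStore:
    """A toy in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        self._triples.add((subject, predicate, obj))

    def match(self, s=None, p=None, o=None):
        """Return sorted triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )

store = TripleStore()
store.add("Watson", "built_by", "IBM")
store.add("Watson", "uses", "DeepQA")
store.add("DeepQA", "uses", "UIMA")

# "What does Watson use?" expressed as a pattern with the object left open.
answers = store.match(s="Watson", p="uses")
print(answers)
```

Quick fact lookups of this kind are why the slide describes triple stores as useful for returning short factual answers; production stores add indexing and a query language on top of the same idea.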
Areas for Further Research
Machine Learning and Natural Language Processing
Topic Description
Machine Learning
Meta-Learners
Research the concept and how they are used to evaluate learning models and assign a confidence score based on the learning models that are used to rank search results
Question Classification
Identify techniques and models that can be employed to analyze and classify questions
Search Ranking Models
Research which models are available for ranking search results based on the various search and recall techniques that are employed for a question
NLP
Logical Form Analysis
Research how SNA is used to discover logical relationships within text and produce an understanding of the information within the text
Semantic Structure Analysis
Identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases and how these relationships can be applied to extract relevant information for question analysis and search
Relationship Analysis
Research techniques and tools for uncovering temporal, geospatial, and spatial relationships within a knowledge set
Feature Extraction
Evaluate tools and algorithms that are used to extract features of entities from text and identify methods for structuring the data for search
Phrase Analysis
Identify algorithms and tools that can be applied to extract key phrases from text based on a search context
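As a starting point for the Search Ranking Models topic above, the simplest possible baseline scores each candidate passage by how many of the question's terms it contains. The sketch below is only that baseline, assuming plain term overlap; it is not the scoring used by DeepQA, and the function names and sample passages are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(question, passage):
    """Fraction of question terms that also appear in the passage."""
    q_terms = tokenize(question)
    if not q_terms:
        return 0.0
    return len(q_terms & tokenize(passage)) / len(q_terms)

def rank(question, passages):
    """Order passages from highest to lowest overlap with the question."""
    return sorted(passages, key=lambda p: overlap_score(question, p), reverse=True)

passages = [
    "Lucene is a text search library",
    "Watson answers questions",
    "IBM built Watson",
]
ranked = rank("Who built Watson", passages)
print(ranked)
```

Term overlap ignores word order, synonyms, and term rarity, which is exactly the gap that learned ranking models and the NLP topics in this table are meant to close.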
Thank you!