Data Mining & Machine Learning Applications
David TJ Huang
Learning Outcomes
Understand what is a pattern and what is noise
Recognize the influence of input data and
preprocessing on the mining results
Connect data mining and machine learning
algorithms to real world problems
Know how pattern & noise are defined differently in
different problems
Real World Applications
PART III
Algorithms and the World All of these algorithms and techniques…
• Association Rule Mining (ARM)
• Clustering
• Classification
• Change Detection
• Artificial Neural Nets (ANN)
• Genetic Algorithms (GA)
• Natural Language Processing (NLP)
• Graph Theory
• Regression
• Etc.
Algorithms and the World How are they applied to real-world data to improve our lives?
• Google search queries
• Tweets
• Facebook posts / check-ins / messages
• Web click streams
• Browsing history
• Emails
The Akinator Similar to the game of 20 Questions…
• Using a series of 20 yes-or-no questions, the goal is to guess
the person you have in mind
http://en.akinator.com
The Akinator The actual algorithm of the Akinator is unknown, but let’s try and
build one…
• How can we use the knowledge we have on machine learning
algorithms to build a replica of the Akinator?
Let’s start with the data input…
• What is the data input and what does it consist of?
• How to represent the data input?
• Is there any preprocessing needed?
• If so, what sort of preprocessing should we do?
• Cleaning, Integration, Reduction, Transformation
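One possible answer to the data-input questions above, as a minimal sketch: treat each character as a row of yes/no answers. The characters, question names, and helper functions below are all hypothetical, chosen only for illustration; the real Akinator data is not known.

```python
# Hypothetical toy knowledge base: character -> {question: answer}.
# 1 = yes, 0 = no; a missing key means the answer is unknown.
QUESTIONS = ["is_real", "is_female", "is_wizard", "is_human"]

RAW = {
    "Harry Potter":     {"is_real": 0, "is_female": 0, "is_wizard": 1, "is_human": 1},
    "Hermione Granger": {"is_real": 0, "is_female": 1, "is_wizard": 1, "is_human": 1},
    "Albert Einstein":  {"is_real": 1, "is_female": 0, "is_wizard": 0, "is_human": 1},
    "Marie Curie":      {"is_real": 1, "is_female": 1, "is_human": 1},  # is_wizard unknown
}


def to_matrix(raw, questions):
    """Transformation: turn per-character dicts into fixed-length rows.

    Unknown answers become None so that later steps can handle them explicitly.
    """
    names = sorted(raw)
    rows = [[raw[name].get(q) for q in questions] for name in names]
    return names, rows


def drop_uninformative(rows, questions):
    """Reduction: drop questions whose known answers are identical for everyone."""
    keep = []
    for j in range(len(questions)):
        known = {row[j] for row in rows if row[j] is not None}
        if len(known) > 1:
            keep.append(j)
    return [questions[j] for j in keep], [[row[j] for j in keep] for row in rows]


names, rows = to_matrix(RAW, QUESTIONS)
kept_questions, reduced_rows = drop_uninformative(rows, QUESTIONS)
```

Here "is_human" is dropped because every character answers yes, so asking it can never separate two candidates. The None entry for Marie Curie is a missing value that a cleaning step would still need to fill in or tolerate.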
The Akinator Things to consider…
• What are the patterns that we are finding/using?
• What are the potential sources of noise in the data/system?
• How to deal with the noise?
• Supervised or Unsupervised?
• If supervised, what task? If unsupervised, what task?
• Can we do it using both supervised and unsupervised?
• How do you build up your model?
• How do you evaluate your model?
• Is there a need to do Change Detection?
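As one way to think through the questions above, here is a minimal sketch of a guessing loop. It is not the Akinator's actual algorithm, which is unknown. It assumes a small hypothetical knowledge base, treats a wrong answer from the player as noise with an assumed error rate, keeps a probability for each character, and always asks the question expected to reduce uncertainty the most.

```python
import math

# Hypothetical knowledge base: character -> answers (1 = yes, 0 = no).
QUESTIONS = ["is_real", "is_female", "is_scientist", "is_british"]
KB = {
    "Harry Potter":     {"is_real": 0, "is_female": 0, "is_scientist": 0, "is_british": 1},
    "Hermione Granger": {"is_real": 0, "is_female": 1, "is_scientist": 0, "is_british": 1},
    "Albert Einstein":  {"is_real": 1, "is_female": 0, "is_scientist": 1, "is_british": 0},
    "Marie Curie":      {"is_real": 1, "is_female": 1, "is_scientist": 1, "is_british": 0},
}
ERROR_RATE = 0.1  # assumed chance that the player answers a question wrongly


def likelihood(name, question, answer):
    """P(answer | character), allowing for noisy answers."""
    return 1 - ERROR_RATE if KB[name][question] == answer else ERROR_RATE


def update(belief, question, answer):
    """Bayes update of P(character) after one possibly noisy answer."""
    new = {n: p * likelihood(n, question, answer) for n, p in belief.items()}
    total = sum(new.values())
    return {n: p / total for n, p in new.items()}


def entropy(belief):
    """Uncertainty (in bits) left in the current belief."""
    return -sum(p * math.log2(p) for p in belief.values() if p > 0)


def best_question(belief, asked):
    """Pick the unasked question with the lowest expected remaining entropy."""
    best, best_h = None, float("inf")
    for q in QUESTIONS:
        if q in asked:
            continue
        h = 0.0
        for ans in (0, 1):
            p_ans = sum(p * likelihood(n, q, ans) for n, p in belief.items())
            h += p_ans * entropy(update(belief, q, ans))
        if h < best_h:
            best, best_h = q, h
    return best


def play(answers):
    """answers: {question: 0 or 1} from the player. Returns (guess, belief)."""
    belief = {name: 1 / len(KB) for name in KB}
    asked = set()
    for _ in QUESTIONS:
        q = best_question(belief, asked)
        belief = update(belief, q, answers[q])
        asked.add(q)
    return max(belief, key=belief.get), belief


# The player thinks of Marie Curie but answers "is_real" wrongly (noise).
guess, belief = play({"is_real": 0, "is_female": 1, "is_scientist": 1, "is_british": 0})
```

Because no single answer rules a character out completely, one wrong answer lowers the probability of the right character without eliminating it, and the remaining answers can still recover it. Evaluation could then be as simple as the fraction of simulated games in which the top guess is correct.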
Google Flu Trends A web-based tool for real-time surveillance of disease outbreaks
• Uses IP addresses and search keywords related to the flu, such as:
• Symptoms of an influenza complication
• Influenza complication
• Specific influenza symptom
• General influenza symptom
• Cold/flu remedy
• Term for influenza
• Antibiotic medication
• Related disease
http://www.google.com/flutrends
Google Flu Trends Again, the actual models are unknown…
• So can we go through the same process and come up with a possible
replica of the model?
Let’s start with the data input…
• What is the data input and what does it consist of?
• How to represent the data input?
• Is there any preprocessing needed?
• If so, what sort of preprocessing should we do?
• Cleaning, Integration, Reduction, Transformation
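One possible answer to the data-input questions above, as a minimal sketch: reduce raw search logs to the fraction of flu-related queries per week and region. The log rows, the keyword list, and the function name are hypothetical; in practice the region would come from an IP address lookup.

```python
from collections import defaultdict

# Hypothetical keyword list. Substring matching is crude: "flu" also matches
# "fluent", which is one source of noise in this kind of input.
FLU_TERMS = ("flu", "influenza", "fever")

# Hypothetical raw log rows: (week, region, query text).
LOG = [
    ("2009-W40", "NY", "flu symptoms"),
    ("2009-W40", "NY", "cheap flights"),
    ("2009-W40", "CA", "weather"),
    ("2009-W41", "NY", "influenza vaccine"),
    ("2009-W41", "NY", "FLU remedy"),
    ("2009-W41", "NY", ""),  # empty query, dropped during cleaning
    ("2009-W41", "CA", "pizza near me"),
]


def flu_fraction(log):
    """Return {(week, region): share of queries that mention a flu term}."""
    total = defaultdict(int)
    flu = defaultdict(int)
    for week, region, query in log:
        q = query.strip().lower()  # cleaning: normalize case and whitespace
        if not q:
            continue               # cleaning: skip empty queries
        key = (week, region)
        total[key] += 1            # reduction: aggregate to weekly counts
        if any(term in q for term in FLU_TERMS):
            flu[key] += 1
    return {key: flu[key] / total[key] for key in total}


fractions = flu_fraction(LOG)
```

Using a fraction rather than a raw count is a transformation step: it keeps regions with very different search volumes comparable.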
Google Flu Trends Things to consider…
• What are the patterns that we are finding/using?
• What are the potential sources of noise in the data/system?
• How to deal with the noise?
• Supervised or Unsupervised?
• If supervised, what task? If unsupervised, what task?
• Can we do it using both supervised and unsupervised?
• How do you build up your model?
• How do you evaluate your model?
• Is there a need to do Change Detection?
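One simple supervised option for the questions above is regression: predict the share of doctor visits for flu-like illness from the share of flu-related searches. The sketch below fits a straight line on logit-transformed values. The numbers are synthetic and made up for illustration only, and nothing here is claimed to be the actual Google Flu Trends model.

```python
import math


def logit(p):
    """Map a fraction in (0, 1) to the whole real line."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Inverse of logit: map a real number back to a fraction in (0, 1)."""
    return 1 / (1 + math.exp(-x))


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b


# SYNTHETIC weekly data, invented for this example:
# share of searches that are flu-related, and share of doctor visits
# that are for flu-like illness in the same week.
query_frac = [0.002, 0.004, 0.008, 0.012, 0.006, 0.003]
visit_frac = [0.010, 0.018, 0.035, 0.050, 0.027, 0.014]

a, b = fit_line([logit(q) for q in query_frac],
                [logit(v) for v in visit_frac])


def predict(q):
    """Estimated visit share for a week with flu-query share q."""
    return inv_logit(a + b * logit(q))


# Mean absolute error on the training weeks (a held-out set would be better).
mae = sum(abs(predict(q) - v) for q, v in zip(query_frac, visit_frac)) / len(query_frac)
```

A model like this would be evaluated against reported surveillance figures on weeks it was not fitted on. If search behaviour shifts, for example during a heavily reported pandemic, the fitted relationship may stop holding, which is where change detection becomes relevant.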
Recommending and Overfitting When the system is making recommendations AND updating the
model based on the results of those recommendations…
• It is very likely that you are going to overfit
• Input data is not clean per se: it is shaped by the model's own earlier recommendations
• Other examples:
• PageRank & website suggestions
• Netflix
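The feedback loop described on this slide can be illustrated with a small simulation. This is a sketch under strong simplifying assumptions: a hypothetical recommender always shows the item with the most clicks so far and only learns from the items it shows. The appeal values are invented.

```python
import random


def simulate(true_appeal, rounds=1000, seed=0):
    """Greedy recommender that updates itself from its own recommendations.

    true_appeal[i] is the (hidden) chance that a user clicks item i when shown.
    Returns how often each item was shown and how often it was clicked.
    """
    rng = random.Random(seed)
    n = len(true_appeal)
    shows, clicks = [0] * n, [0] * n
    for _ in range(rounds):
        # Recommend the item with the most clicks so far (ties: lowest index).
        item = max(range(n), key=lambda i: clicks[i])
        shows[item] += 1
        # Only the shown item can collect feedback.
        if rng.random() < true_appeal[item]:
            clicks[item] += 1
    return shows, clicks


# Item 0 is actually the least appealing of the three.
shows, clicks = simulate([0.3, 0.5, 0.7])
```

Item 0 is shown in every round simply because it was shown first, so the model's data ends up confirming its own choice while the better items never collect any clicks. A common remedy, not covered in the slides, is to show a non-top item now and then so that every item gets some feedback.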
References • The Akinator
• http://en.akinator.com
• Google Flu Trends
• http://www.google.com/flutrends
• Assessing Google Flu Trends Performance in the United States
during the 2009 Influenza Virus A (H1N1) Pandemic
• http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0023610