big data — your new best friend

Upload: reuven-lerner

Post on 12-Apr-2017

Big Data: Your New Best Friend

Reuven M. Lerner, PhD MegaComm 2016 • February 18th, 2016

1 Big Data.key - February 18, 2016

Who am I?

• Long-time programmer, consultant, trainer

• Python, Git, PostgreSQL, Ruby

• Linux Journal columnist

My stuff

• Newsletter: http://lerner.co.il/newsletter

• Blog: http://blog.lerner.co.il/

• Daily Tech Video: http://dailytechvideo.com/

• Or @DailyTechVideo on Twitter

• Mandarin Weekly: http://MandarinWeekly.com

• Or @MandarinWeekly on Twitter

Elections!

• Israel had elections last year

• The United States has elections this year

• Rumor has it, the world contains some other countries, many of which also hold elections

Polls

• Before an election, politicians, reporters, and political junkies (like me) look at the polls.

• We want to know who is ahead, and who is behind

• The politicians want to know which groups like (and dislike) them, so that they can focus their rhetoric and campaigning

Are the polls always right?

Polls are statistical models

• Polls use math to predict the likelihood of a particular outcome, based on a number of inputs

• Models are toy versions of reality

• They allow us to explore and understand reality, and should bear some connection to it

• But there will always be a distinction between a model and the real world

Models are important!

• They allow us to explore and understand the world

• They enable us to make predictions

• They reduce costs, and allow us to do things that are otherwise impossible or unethical

• The 2013 Nobel Prize in Chemistry was awarded to scientists who developed computer models of chemical systems

Testing models

• Elections are unusual: You only have one shot at testing your model to see if it’s accurate

• But in your business, you can create and test models every day, modifying the number, type, and weights of the inputs

Big data

• 90% of the data ever created was generated in the last two years (according to IBM):

• Writing, video, audio

• Travel, e-commerce, electricity use, phone calls

• Metadata, as well

• Maybe people aren’t just numbers… but given how often we’re quantified, we’re not that far away

But numbers are good (if you’re a computer)

• Modern computers can hold billions of them

• They can store not only information about people, but also their characteristics and traits, as well as dates and times

Your business

• When you make business decisions, what factors are you considering?

• Are you trying to check all of the possible correlations, across all of the data?

• Or are you sampling, and hoping that your sample is an accurate and representative one?

Enter big data!

• Your business is now collecting lots and lots of data

• Who is buying your products and services?

• How often do they visit your Web site?

• Which of your e-mail messages do they open?

• What do they buy?

• How old are they, and where do they live?

Why “big” data?

• It sounds sophisticated and high-tech.

• There really is a lot of it.

• Often, there’s more than we can fit (or process) on a single computer

• But often, it’s not really that big

Enter data science

• Data scientists come up with ways to turn raw data into useful information

• They create and use models to find correlations among the many pieces of data you’re collecting

• They can help you use these correlations to improve your marketing, sales, and production

What is a data scientist?

A person employed to analyze and interpret complex digital data, such as the usage statistics of a website, especially in order to assist a business in its decision-making.

— Oxford Dictionary

More realistically…

Data scientist (noun): Person who is better at statistics than any software engineer and better at software engineering than any statistician.

— Josh Wills

Graphically…

From Drew Conway, 2010

Look for correlations

• Data scientists look for correlations

• Using those correlations, we know where we have been successful (and not)

• These can be interesting, useful, or crucial

• Being able to analyze lots of factors, and thus find correlations in them, allows our models to be more sophisticated — and also predictive
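
To make "finding a correlation" concrete, here is a minimal sketch that computes the Pearson correlation coefficient between two columns of numbers. The data (e-mails opened and purchases per customer) and the function name `pearson` are made up for illustration; a value near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two columns vary together...
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...relative to how much each varies on its own
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: e-mails opened vs. purchases made, one entry per customer
emails_opened = [1, 3, 4, 6, 8]
purchases = [0, 1, 1, 2, 3]

r = pearson(emails_opened, purchases)
print(f"correlation: {r:.3f}")
```

A high coefficient only says that the two columns move together in this sample; it does not say that one causes the other.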

Spurious correlations

• http://tylervigen.com/spurious-correlations

Data scientists’ tools

• Programming languages + libraries

• Data sets

• Machine learning

• Distributed processing systems

Programming languages

• R

• Python

• Julia

• Clojure

Data sets

• Your own

• Public ones

What do data sets look like?

• Excel spreadsheets

• CSV files

• Multiple CSV files (e.g., separated by date)

• Databases you can clone — but this is rare

Cleaning the data

• Remove bad, incomplete data

• Remove data that isn’t relevant for the investigation you’re doing

• But don’t remove too much, ruining your data!
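
As a rough illustration of the cleaning step, the sketch below reads a small CSV, keeps only rows that are complete and have a numeric age, and reports how many rows were dropped so that over-aggressive cleaning is easy to spot. The column names, the sample rows, and the `clean_rows` helper are all hypothetical.

```python
import csv
import io

# Hypothetical raw data: one row lacks an age, another has a non-numeric age
RAW = """name,age,city
Alice,34,Haifa
Bob,,Tel Aviv
Carol,abc,Jerusalem
Dan,51,Eilat
"""


def clean_rows(csv_text, required=("name", "age", "city")):
    """Return (kept, dropped) so we can see how much the cleaning removed."""
    kept, dropped = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        values = {key: (row.get(key) or "").strip() for key in required}
        # Keep the row only if every field is present and the age is a number
        if all(values.values()) and values["age"].isdigit():
            values["age"] = int(values["age"])
            kept.append(values)
        else:
            dropped.append(row)
    return kept, dropped


kept, dropped = clean_rows(RAW)
print(f"kept {len(kept)} rows, dropped {len(dropped)}")
```

Returning the dropped rows alongside the kept ones is a deliberate choice: if half of the data disappears, that is a sign the rules are too strict.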

Machine learning

• The computer can learn to categorize things as well as humans do

• Then, when given new data, it can decide into which category to put the new item

Spam filters

• Spam filters use a simple form of machine learning

• Is a particular e-mail message spam?

• Check the contents, using a variety of factors

• If the factors make this document similar to other spam documents, then mark it as spam
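
The idea can be sketched with a toy naive Bayes classifier, one simple technique that spam filters have historically used. This is an illustration only, under the assumption that the only "factors" are the words in the message; the class name `TinySpamFilter` and the training messages are invented, and real filters look at far more than this.

```python
import math
from collections import Counter


def tokenize(text):
    """Split a message into lowercase words."""
    return text.lower().split()


class TinySpamFilter:
    """Toy naive Bayes classifier; assumes both labels have been trained."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record the words of one message known to be 'spam' or 'ham'."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def _log_score(self, words, label):
        counts = self.word_counts[label]
        total_words = sum(counts.values())
        vocabulary = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        # Prior: how common is this label among the training messages?
        score = math.log(self.message_counts[label] / sum(self.message_counts.values()))
        for word in words:
            # Add-one smoothing so an unseen word does not zero out the score
            score += math.log((counts[word] + 1) / (total_words + vocabulary))
        return score

    def is_spam(self, text):
        """True if the message looks more like the spam we saw than the ham."""
        words = tokenize(text)
        return self._log_score(words, "spam") > self._log_score(words, "ham")


spam_filter = TinySpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("lunch meeting tomorrow", "ham")
spam_filter.train("project report attached for the meeting", "ham")

print(spam_filter.is_spam("claim your free money"))
print(spam_filter.is_spam("meeting about the project report"))
```

Calling `train` again with a corrected label is how such a filter would "learn over time" from a user's feedback.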

Aha!

• Wondering why e-mail from certain people always gets put into the “junk” e-mail box?

• Because those people send mail that looks (to the machine-learning system) too much like junk

• Mark the messages as not being junk, so your spam-control system can learn over time

Experience is important

• In people, learning is a matter of experience

• Machine learning is all about computers also gaining that experience

Models

• Machine learning employs many models

• Each model uses a different technique to train the computer to decide which category a piece of data belongs in

• Supervised vs. unsupervised learning

• The computer can then be given new data

Example: K nearest neighbors

• One common machine-learning algorithm finds the k items (k is a number we choose) that are closest to a new piece of data

• We then hold an election: to which category do most of those k neighbors belong?

• Our new data point joins the majority category
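
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming each item is a pair of numbers and that "closest" means straight-line (Euclidean) distance; the customer data and the `knn_classify` function are hypothetical.

```python
import math
from collections import Counter


def knn_classify(new_point, labeled_points, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.

    labeled_points is a list of (coordinates, category) pairs.
    """
    # Sort the existing data by distance from the new point
    by_distance = sorted(labeled_points, key=lambda item: math.dist(item[0], new_point))
    # Hold the election among the k closest items
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical customers: (age, site visits per month) and what they turned out to be
customers = [
    ((25, 2), "browser"),
    ((27, 3), "browser"),
    ((30, 1), "browser"),
    ((45, 12), "buyer"),
    ((50, 10), "buyer"),
    ((48, 14), "buyer"),
]

print(knn_classify((47, 11), customers))
print(knn_classify((26, 2), customers))
```

Choosing an odd k with two categories avoids tied votes. In practice the inputs are usually rescaled first, so that a column with large numbers does not dominate the distance.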

Lots of other models

• Linear regression

• Logistic regression

• Neural networks

• Deep learning

• K-means clustering

• And many, many others — with lots in active development!

Data science use cases

• So, where is data science being used?

• And how can we apply it to our businesses?

A/B testing

• Find out what your users respond to

• Try two (or more) different versions of your Web site

• Compare to see which one has greater conversions (i.e., e-commerce success)

• Use the better one… and then do another experiment, ad infinitum
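
One common way to decide whether version B really beat version A, rather than just getting lucky, is a two-proportion z-test. The sketch below is one possible approach, not the only one; the visitor and conversion numbers are invented, and the `ab_test` function is hypothetical.

```python
import math


def ab_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Compare two conversion rates with a two-proportion z-test.

    Returns (rate_a, rate_b, z, p_value). Assumes at least one conversion
    and at least one non-conversion overall, so the standard error is nonzero.
    """
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    # Conversion rate if both versions were really the same
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical experiment: 5,000 visitors saw each version of the page
rate_a, rate_b, z, p_value = ab_test(200, 5000, 260, 5000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.4f}")
```

A small p-value (a common threshold is 0.05) suggests the difference is unlikely to be chance alone, at which point B becomes the new baseline for the next experiment.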

Correlations!

• Amazon is one of the most successful data-science shops

• They’re always collecting information on what people look at and buy — and they suggest other products based on that behavior

• How often are they right? (Very often, actually)

Fraud detection

• What behavior is correlated with a stolen credit card?

• What language is correlated with a research paper that was already written and submitted?

Interact with data

• Visualizations provide us (humans) with insights

• Many data scientists spend their time helping others create powerful, useful visualizations

• GIS (geographic information systems) allow us to take data and put it on maps. Some maps are even interactive, letting us explore data in new ways

Add GIS, and create maps

• https://openaccess9000.cartodb.com/viz/3459b348-8212-11e5-b022-0e8c56e2ffdb/public_map

This fire hydrant might earn more than you

From I Quant NY

“Half the money I spend on advertising is wasted; the trouble is I don't know which half.”

— John Wanamaker

Advertising

• We can show ads online, and know who has clicked on them.

• But we can do better: Show ads to the people for whom they’re most relevant, and most likely to be appropriate

• How can we do that?

Some ideas

• Show people ads based on text searches

• Show people ads based on what they have explicitly told us

• Show people ads based on what content they have indicated they like

• Show people ads based on their friends’ preferences and demographics

Aha!

• No wonder Google and Facebook are pioneers in the area of big data

• They’re using enormous amounts of data to display ads that people like

• And they get lots of additional data points every day, thanks to searches and “likes”

Data sets

• UCI’s Machine Learning Repository

• https://archive.ics.uci.edu/ml/datasets/Housing

• Newsletter with new data sets:

• http://tinyletter.com/data-is-plural/

Really big data

• What do we do when the data is too big?

• What if it will take too long to process, or the data is too big to store on a single machine?

• Then we call in the truly big guns — distributed processing systems

Map-reduce

• Map-reduce has been around for decades on individual computers

• But only now, thanks to Google’s implementation for distributed systems, does everyone want to use it

• map: apply a function to every element of a sequence

• reduce: turn a sequence of values into a single (or small) value

• Not all data can be broken apart easily!
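
On a single computer, the two steps look like this in Python, using the built-in `map` and `functools.reduce`. The three documents and the `contains_phrase` function are made up for illustration; a distributed system would run the map step on many machines at once and then combine the partial results.

```python
from functools import reduce

# Hypothetical documents we want to search
documents = [
    "Machine learning is a tool data scientists use",
    "Polls are statistical models",
    "Spam filters use a simple form of machine learning",
]


def contains_phrase(document, phrase="machine learning"):
    """Map step: 1 if the document contains the phrase, otherwise 0."""
    return 1 if phrase in document.lower() else 0


# map: apply the function to every element of the sequence
mapped = list(map(contains_phrase, documents))

# reduce: turn the sequence of values into a single value (here, a sum)
total = reduce(lambda running_total, value: running_total + value, mapped, 0)

print(mapped)  # [1, 0, 1]
print(total)   # 2
```

This works well here because each document can be checked independently; data whose pieces depend on one another is much harder to split up this way.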

• Create a Hadoop cluster, including storage of the data you want to understand there

• Run a map-reduce query on your data: apply a function to each document (e.g., “does it contain the phrase ‘machine learning’?”) and then reduce the results into an HTML page

• Use virtual machines in the cloud to make your cluster bigger or smaller, as necessary

• More modern, real-time, in-memory analysis system

• Open-source system that’s increasingly popular

• Built on the same filesystem as Hadoop

• Connections from Java, Python, R

• Has a suite of highly parallel machine-learning models

• Because your data is in memory (and split across multiple virtual machines), it runs much faster

How old are you?

• http://how-old.net

Thanks! Any questions?

• You can always find me at:

[email protected]

• http://www.lerner.co.il/

• http://blog.lerner.co.il/

• http://lerner.co.il/newsletter

• @reuvenmlerner on Twitter
