Post on 06-Mar-2018

Staying Ahead of the Data Avalanche

Challenges and Opportunities in Analytics

Prof. Dr. Seppe vanden Broucke

SAS Analytics Experience Rome – 8 November 2016

Presenter: Seppe vanden Broucke

• Assistant professor in Data and Process Science at the Department of Decision Sciences and Information Management at KU Leuven (Belgium)

• PhD in Applied Economics at KU Leuven, Belgium in 2014

• Title: Advances in Process Mining: Artificial Negative Events and Other Techniques

• Research: business data mining and analytics, machine learning, process management, process mining

• Contact: www.dataminingapps.com, www.seppe.net, seppe.vandenbroucke@kuleuven.be

BIG DATA

“We live in a data flooded world”

“Making sense of mountains of data” aka “Scale your data mountain”

“The data avalanche”

“Data is the new oil”

“The data tsunami”

“It all sounds kind of dangerous”

BIG DATA + DATA SCIENCE & ANALYTICS =

But so many success stories…

“We live in magical times”

Uber

Contextual RNN-GANs for Abstract Reasoning Diagram Generation

Arnab Ghosh*, Viveka Kulharia*, Amitabha Mukerjee, Vinay Namboodiri, Mohit Bansal

Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test For Young Children

Stellan Ohlsson, Robert H. Sloan, György Turán, Aaron Urasky

BIG DATA + DATA ANALYTICS

“Let the good times roll”

So why do so many projects fail?

“During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage” – Gartner

“Data maturity of companies is very disparate, and the most advanced of them start doubting.” – Christophe Bourguignat

“75% have invested in Big Data, but only 10% have projects in production.”

Companies face disillusionment. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?

Machine learning and data science have (just) reached “peak hype”

The challenges ahead

TALENT

PROCESS

TOOLS, FILES, FEEDS

COMMUNICATION

MEASURING

PRIVACY, COMPLIANCE

ETHICS

QUALITY

TALENT: “A data scientist is like a gold-coloured unicorn: mythical powers, but impossible to find”

TALENT: Or a spider with 25 legs?

PROCESS: Data science as a straight-through process?

Adhering to a data science workflow is A-OK:

• CRISP-DM

• The KDD process

• SEMMA

• BinaryEdge

Data → Selection → Selected Data → Cleaning → Cleaned/Processed Data → Transformation → Transformed Data → Discovery → Mined Model/Patterns → Interpretation/Evaluation → Knowledge/Insights

PROCESS: Not really...

PROCESS: More like a loop
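The loop can be sketched in a few lines of Python. This is an illustration only, not taken from the talk: the stage functions, the toy data, and the acceptance threshold are all hypothetical placeholders. What it shows is the control flow, where evaluation feeds back into an earlier stage until the result is good enough or the iteration budget runs out.

```python
from statistics import mean

# Hypothetical toy data: (customer_id, monthly_spend); None marks a missing value.
RAW = [(1, 120.0), (2, None), (3, 95.0), (4, 4000.0), (5, 110.0), (6, 105.0)]


def select(rows):
    """Selection: keep only the field we want to analyse."""
    return [spend for _, spend in rows]


def clean(values, cap):
    """Cleaning: drop missing values and anything above the outlier cap."""
    return [v for v in values if v is not None and v <= cap]


def transform(values):
    """Transformation: centre the values on their mean."""
    centre = mean(values)
    return [v - centre for v in values]


def discover(values):
    """Discovery: a stand-in 'model' that is just the spread of the data."""
    return max(values) - min(values)


def evaluate(spread, max_spread):
    """Interpretation/evaluation: is the result usable?"""
    return spread <= max_spread


def run_loop(rows, cap=10_000.0, max_spread=100.0, max_iterations=5):
    """Repeat the stages, tightening the cleaning rule after each failed evaluation."""
    for iteration in range(1, max_iterations + 1):
        spread = discover(transform(clean(select(rows), cap)))
        if evaluate(spread, max_spread):
            return iteration, cap, spread
        cap /= 10  # feedback: the evaluation changes an earlier stage
    return None  # budget exhausted without an acceptable result


result = run_loop(RAW)
```

With this toy data the first pass fails evaluation because of the outlier, and the second pass succeeds after the cleaning rule is tightened. The explicit `max_iterations` and `max_spread` are one possible way to state up front what "finished" means.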

PROCESS: Experiments can take a while…

PROCESS: These things are hard

• How to create a sense of urgency?

• What does it mean to be finished?

• You can’t predict the future.

COMMUNICATION: Throw it over the wall projects

“I want to put this GBM into production, though some steps are done using R and SAS”

“Anyone know what this XGBoost thing is?”

“Why aren’t we deployed yet?”

“We have all this data, why can’t we find interesting customers?”

COMMUNICATION: Talking helps

• Learn each other’s language

• Think with your business hat

• Teach semantics (why a shorter lead list is not easier to produce)

• Convert hard problems into simpler ones

• Use examples, metaphors, analogies

• Show them and show them often

• IT and data science can live together
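The remark about lead lists can be made concrete. In the sketch below (hypothetical customers and a made-up scoring rule, not from the talk), asking for a shorter list does not reduce the work: every candidate still has to be scored before the best few can be picked.

```python
import heapq

# Hypothetical customers; in practice the score would come from a trained model.
customers = [
    {"name": "A", "purchases": 12, "days_since_last_order": 3},
    {"name": "B", "purchases": 2, "days_since_last_order": 1},
    {"name": "C", "purchases": 30, "days_since_last_order": 59},
    {"name": "D", "purchases": 8, "days_since_last_order": 0},
]


def score(customer):
    """Stand-in scoring rule: frequent, recent buyers score higher."""
    return customer["purchases"] / (1 + customer["days_since_last_order"])


def top_leads(candidates, k):
    """Return the k best lead names and the number of candidates that were scored."""
    scored = [(score(c), c["name"]) for c in candidates]  # everyone is scored
    shortlist = heapq.nlargest(k, scored)                 # only the cut gets shorter
    return [name for _, name in shortlist], len(scored)


short_list, scored_for_short = top_leads(customers, 2)
long_list, scored_for_long = top_leads(customers, 4)
```

Both calls score all four customers; only the length of the returned list differs.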

MEASURING: “Not everything that counts can be counted… and not everything that can be counted counts”

• Show before and after

• “When are you happy?”

• Accept failures

• Manual measuring can be a good thing

• Hard to automate subjective feelings…

TOOLS, FILES, FEEDS: “No one ever got fired for installing Hadoop on a cluster… right?”

TOOLS, FILES, FEEDS: A fool can ask more questions in an hour than a wise man can answer in a hundred years

• Focus on the files

• What are we going to use it for?

A data scientist can find, love, and ditch more tools/libraries/… in an hour than a procurement officer can vet in a hundred years

TOOLS, FILES, FEEDS: Focus on feeds, files, data

• Let them (us) own the data

• Ship fast, ship often

• Focus on format and storage standards, not on technology:

“Can I get information on X for months A and B with only those columns that changed?”

... “Can I get it myself?”

• Where’s your golden data set?

• Trust your experts
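The question about months A and B describes an operation that does not depend on any particular platform. A minimal sketch, assuming each month is available as a list of records keyed by a hypothetical `id` field (all field names and values are invented for illustration):

```python
def changed_columns(snapshot_a, snapshot_b, key="id"):
    """Return, per record key, only the columns whose value differs between two snapshots."""
    index_a = {row[key]: row for row in snapshot_a}
    changes = {}
    for row_b in snapshot_b:
        row_a = index_a.get(row_b[key])
        if row_a is None:
            continue  # record only exists in B; a real feed would report it separately
        diff = {
            column: (row_a.get(column), value)  # (old value, new value)
            for column, value in row_b.items()
            if column != key and row_a.get(column) != value
        }
        if diff:
            changes[row_b[key]] = diff
    return changes


# Hypothetical monthly snapshots of the same customer table.
month_a = [
    {"id": 1, "segment": "retail", "balance": 1200, "city": "Leuven"},
    {"id": 2, "segment": "retail", "balance": 300, "city": "Rome"},
]
month_b = [
    {"id": 1, "segment": "retail", "balance": 1500, "city": "Leuven"},
    {"id": 2, "segment": "premium", "balance": 300, "city": "Rome"},
    {"id": 3, "segment": "retail", "balance": 50, "city": "Ghent"},
]

delta = changed_columns(month_a, month_b)
```

If the data sits in an agreed format with a stable key, an analyst can answer the question without waiting for anyone, which is the point of “Can I get it myself?”.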

TOOLS, FILES, FEEDS: Technology moves too fast anyway…

• HDFS?

• What about HDF5, or Kudu?

• Do we even have unstructured data?

• Do we know what to do with it?

• V’s of Big Data – yeah right!

• BigSQL, or Hive, or Slurp?

• Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?

• What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X

• We did these things before… they weren’t hard then

• True, but…

TOOLS, FILES, FEEDS: It’s a difficult balance

TOOLS, FILES, FEEDS: The wall of deployment

• Versioning

• Collaboration

• Scalable execution

• Multiple language support

• Multiple kernel support

• Monitoring

• Scheduling

• Acyclic dependency graphs

• Quite different from playing in a notebook

• Vendors are starting to help out

• SAS, SPSS, Domino Data Lab, sense.io, ScienceOps <-> Jupyter, Rodeo, your 3GB pip packages

• Familiar neither to most data scientists (too messy) nor to IT shops (too unfamiliar)
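Scheduling and acyclic dependency graphs are the parts of this list that notebooks hide most. As a rough sketch (the task names are hypothetical, and a production scheduler does far more), Python's standard `graphlib` module can derive a valid run order from declared dependencies and reject a graph that contains a cycle:

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "score": {"train", "features"},
    "report": {"score"},
}


def execution_order(dependencies):
    """Return a valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def is_acyclic(dependencies):
    """True when the tasks can be scheduled at all."""
    try:
        execution_order(dependencies)
    except CycleError:
        return False
    return True


order = execution_order(pipeline)
broken = {"train": {"score"}, "score": {"train"}}  # each task waits for the other
```

Declaring dependencies as data, rather than relying on the order in which notebook cells happened to be run, is what lets a scheduler rerun, monitor, and parallelize the steps.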

TOOLS, FILES, FEEDS: The “Joel Test” for Data Science

• Can new hires get set up in the environment to run analyses on their first day?

• Can data scientists utilize the latest tools/packages without help from IT?

• Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?

• Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?

• Does collaboration happen through a system other than email or copying files?

• Can predictive models be deployed to production without custom engineering or infrastructure work?

• Is there a single place to search for past research and reusable data sets, code, etc?

• Do your data scientists use the best tools money can buy?

Source: https://blog.dominodatalab.com/joel-test-data-science/
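The question about reproducing past experiments can be made tangible with a small manifest written alongside every run. The sketch below is one possible shape, not a prescribed tool: the function names and manifest fields are hypothetical, and it records only the Python version, whereas a fuller version would also pin package versions.

```python
import hashlib
import json
import platform
import sys


def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest used to pin the exact code and data that were used."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code: bytes, data: bytes, params: dict) -> dict:
    """Collect what is needed to find and rerun an experiment later."""
    return {
        "code_sha256": fingerprint(code),
        "data_sha256": fingerprint(data),
        "params": params,
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }


def same_experiment(a: dict, b: dict) -> bool:
    """Two runs are comparable only if code, data, and parameters all match."""
    return all(a[k] == b[k] for k in ("code_sha256", "data_sha256", "params"))


# Hypothetical runs: identical code and parameters, but one data value differs.
run_1 = build_manifest(b"model = fit(X, y)", b"id,spend\n1,120\n", {"depth": 3})
run_2 = build_manifest(b"model = fit(X, y)", b"id,spend\n1,125\n", {"depth": 3})
manifest_json = json.dumps(run_1, sort_keys=True, indent=2)
```

Storing `manifest_json` next to the results gives a searchable record, and `same_experiment` flags when a "rerun" silently used different data.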

QUALITY: Garbage in…

“This model is gonna be great!”

QUALITY: Sometimes they are…

• Really: everyone has bad data

• But: more “bad” means more time

• Do make sure to get a continuous source to the “bad” data

• Survey: 50+ banks participating world-wide

• Most banks indicated that between 10–20 percent of their data suffer from data quality problems

• Manual data entry is one of the key problems

• Diversity of data sources and consistent corporate wide data representation are the main challenges for data quality

• Regulatory compliance is the key motive to improve data quality
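A figure such as "10–20 percent of the data has quality problems" is only useful if it can be measured repeatedly. A minimal sketch of such a measurement follows, with invented records and rules (the survey's own methodology is not described here): it counts records with a missing required field or a failed validity check.

```python
def quality_report(records, required, validators):
    """Count records with at least one missing required field or failed check."""
    bad = 0
    for record in records:
        missing = any(record.get(field) in (None, "") for field in required)
        invalid = any(
            record.get(field) is not None and not check(record[field])
            for field, check in validators.items()
        )
        if missing or invalid:
            bad += 1
    share = bad / len(records) if records else 0.0
    return {"total": len(records), "bad": bad, "bad_share": share}


# Hypothetical manually entered customer records.
records = [
    {"id": 1, "country": "BE", "age": 34},
    {"id": 2, "country": "", "age": 41},      # missing country
    {"id": 3, "country": "IT", "age": -5},    # impossible age
    {"id": 4, "country": "NL", "age": 29},
    {"id": 5, "country": "FR", "age": None},  # missing age
]

report = quality_report(
    records,
    required=("country", "age"),
    validators={"age": lambda age: 0 <= age <= 120},
)
```

Running the same report on every new delivery turns "bad data" from an anecdote into a trend that can be tracked.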

PRIVACY, COMPLIANCE: Oh boy…

• Datensparsamkeit (data minimisation)

• Cookie law

• Basel II / III

• Who knows where the cloud is anyways?

• EU directives outdated

• “It’s all on Facebook anyway”

PRIVACY, COMPLIANCE: Academics are just getting started…

PRIVACY, COMPLIANCE: In more ways than one...

PRIVACY, COMPLIANCE: “If only we didn’t have to worry about this”

PRIVACY, COMPLIANCE: Use it as a competitive advantage?

https://backchannel.com/an-exclusive-look-at-how-ai-and-machine-learning-work-at-apple-8dbfb131932b#.crky6nt6k

ETHICS: Data science for good?

• Can an algorithm be racist? Sexist?

• “Will Predictive Models Outliers Be The New Socially Excluded?”

• Companies like DataKind, or Bayes Impact

• Concept of open models
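One simple, and contested, way to start probing whether an algorithm treats groups differently is to compare its decision rates per group, often called a demographic parity check. The sketch below uses invented decisions and hypothetical group labels; it is not from the talk, and a gap on its own does not prove discrimination, it only shows where to look.

```python
from collections import defaultdict


def positive_rates(decisions):
    """Share of positive decisions per group, from (group, approved) pairs."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, approved in decisions:
        counts[group][0] += int(approved)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def parity_gap(decisions):
    """Difference between the highest and lowest group approval rate."""
    rates = positive_rates(decisions)
    return max(rates.values()) - min(rates.values())


# Hypothetical model decisions for two groups of applicants.
decisions = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", False), ("group_b", False), ("group_b", False),
]

rates = positive_rates(decisions)
gap = parity_gap(decisions)
```

Publishing such checks together with the model is one practical reading of the "open models" idea.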

The challenges today

TALENT

PROCESS

TOOLS, FILES, FEEDS

COMMUNICATION

MEASURING

PRIVACY, COMPLIANCE

ETHICS

QUALITY

Thank you
