![Page 1: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/1.jpg)
EXPERIENCES WITH BIG DATA
SRINIVASAN SESHADRI, FOUNDER, ZETTATA
![Page 2: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/2.jpg)
WORLD BEFORE BIG DATA
It is a capital mistake to theorize before one has data – Sherlock Holmes
![Page 3: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/3.jpg)
HOWEVER, DO NOT WANT TO BE HERE
![Page 4: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/4.jpg)
AXIOMS
• Measure, Measure, Measure
• Garbage in, Garbage out
• Correlation is not Causation
• More Data Beats Cleverer Algorithms
• Algorithms that do better with more data are more interesting
• Independent Sources of Data Add New Signals
• Feature Engineering is the key to being a good data scientist
• How do Machines and Humans Interplay in Big Data?
• Learn many models ‐ ensembles
• Outliers are always interesting
![Page 5: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/5.jpg)
MEASURE, MEASURE, MEASURE
• Have a Hypothesis
• Create a metric to determine if the hypothesis is correct
• Build a solution that can be measured
• Iterate
If you cannot measure it, you cannot improve it – Lord Kelvin
![Page 6: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/6.jpg)
GARBAGE IN GARBAGE OUT
![Page 7: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/7.jpg)
WHAT DO YOU WANT THE ANSWER TO BE?
![Page 8: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/8.jpg)
CORRELATIONS
![Page 9: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/9.jpg)
CORRELATION IS NOT CAUSATION
• Correlation in Data Need Not Imply Correlation in Real Life
• Can find random correlations in large amounts of data
• Correlation Does Not Imply Causation
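The point about random correlations in large data can be illustrated with a small simulation. This is a minimal sketch, not something from the talk: the function names, the sample size of 20, the 1000 random features, and the seed are all assumptions chosen for illustration. It generates a random target and many unrelated random features, then reports the strongest correlation found purely by chance.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def strongest_random_correlation(n_samples=20, n_features=1000, seed=0):
    """Largest |r| between a random target and unrelated random features.

    Every series is independent Gaussian noise, so any correlation
    found here is spurious by construction.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(target, feature)))
    return best


if __name__ == "__main__":
    print("strongest chance correlation:", strongest_random_correlation())
```

With few samples and many candidate features, the strongest chance correlation is typically large, which is why a correlation mined from a wide dataset needs an independent check before it is trusted.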
![Page 10: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/10.jpg)
CORRELATION IS NOT CAUSATION
![Page 11: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/11.jpg)
CORRELATION STRIKES AGAIN!!
![Page 12: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/12.jpg)
MORE DATA BEATS CLEVERER ALGORITHMS
• Adding IMDb data for the Netflix Prize
• Adding Protein Expression Data or Patient Data to Gene Expression Data
• Bag of Words Approach for Word Sense Disambiguation
![Page 13: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/13.jpg)
WORD SENSE DISAMBIGUATION
• Bank
  • Sloping land alongside a river or a lake. It typically has thick vegetation growing on it.
  • A financial institution that takes deposits from some customers and gives loans to others who require the money.
To disambiguate a word in a typical sentence, look for co‐occurrences of the sentence's words with the words in each definition. This is unsupervised learning: the definitions are used to bootstrap a model.
The pilot landed the plane on the Hudson River amongst several boats and an appreciative audience cheered from the banks of the river.
He issued a check and took it to the bank so he could transfer money.
Can look for frequent co‐occurrences with each sense of the word (boats and check respectively) and build a larger bag of words in which to disambiguate.
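The bag-of-words idea above can be sketched as follows. This is a simplified, Lesk-style illustration under stated assumptions, not the speaker's implementation: the sense labels `river` and `finance`, the stopword list, and the function names are hypothetical. Each sense starts with the words of its definition, sentences are labeled by word overlap, and the words of confidently labeled sentences are added to that sense's bag.

```python
import re

# Hypothetical stopword list; the ambiguous word itself is excluded too.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "on", "to", "from", "it", "that",
    "so", "he", "she", "at", "some", "who", "has", "could", "bank", "banks",
}

# Seed bags built from the two definitions of "bank" on the slide.
DEFINITIONS = {
    "river": "Sloping land alongside a river or a lake. "
             "It typically has thick vegetation growing.",
    "finance": "A financial institution that takes deposits from some "
               "customers and gives loans to others who require the money.",
}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(sentence, bags):
    """Return the sense whose bag overlaps the sentence most, or None on a tie or zero overlap."""
    words = tokenize(sentence)
    scores = {sense: len(words & bag) for sense, bag in bags.items()}
    top = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == top]
    return winners[0] if top > 0 and len(winners) == 1 else None


def bootstrap(sentences, bags):
    """Grow each sense's bag with the words of sentences assigned to that sense."""
    grown = {sense: set(bag) for sense, bag in bags.items()}
    for sentence in sentences:
        sense = disambiguate(sentence, bags)
        if sense is not None:
            grown[sense] |= tokenize(sentence)
    return grown


SEED_BAGS = {sense: tokenize(text) for sense, text in DEFINITIONS.items()}
CORPUS = [
    "The pilot landed the plane on the Hudson River amongst several boats "
    "and an appreciative audience cheered from the banks of the river.",
    "He issued a check and took it to the bank so he could transfer money.",
]

if __name__ == "__main__":
    grown = bootstrap(CORPUS, SEED_BAGS)
    print(disambiguate("She wrote a check at the bank", SEED_BAGS))
    print(disambiguate("She wrote a check at the bank", grown))
```

The seed bags cannot place a sentence that mentions only "check", because that word is in neither definition. After one bootstrap pass over the two example sentences, "check" belongs to the financial sense and "boats" to the river sense, so more sentences become decidable. More unlabeled text grows the bags further, which is the sense in which more data helps here.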
![Page 14: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/14.jpg)
WORD SENSE DISAMBIGUATION
![Page 15: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/15.jpg)
FEATURE ENGINEERING
Cannot expect the computer to learn arbitrarily complex models on its own
![Page 16: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/16.jpg)
FEATURE ENGINEERING
| CITY 1 LAT. | CITY 1 LNG. | CITY 2 LAT. | CITY 2 LNG. | DRIVABLE? |
| --- | --- | --- | --- | --- |
| 123.24 | 46.71 | 121.33 | 47.34 | Yes |
| 123.24 | 56.91 | 121.33 | 55.23 | Yes |
| 123.24 | 46.71 | 121.33 | 55.34 | No |
| 123.24 | 46.71 | 130.99 | 47.34 | No |
![Page 17: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/17.jpg)
FEATURE ENGINEERING
| DISTANCE (MI.) | DRIVABLE? |
| --- | --- |
| 14 | Yes |
| 28 | Yes |
| 705 | No |
| 2432 | No |
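The step from four raw coordinates to a single distance feature can be sketched with the haversine great-circle formula. This is an illustrative sketch under assumptions: the slides do not say how the distance was computed, the function names are hypothetical, and the 100-mile cutoff is an arbitrary value chosen only because it separates the slide's "Yes" rows (14 and 28 miles) from its "No" rows (705 and 2432 miles).

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def drivable(lat1, lng1, lat2, lng2, max_miles=100.0):
    """One-feature rule: drivable when the engineered distance is under a cutoff.

    max_miles is a hypothetical threshold, not a value from the talk.
    """
    return haversine_miles(lat1, lng1, lat2, lng2) <= max_miles


if __name__ == "__main__":
    # Hypothetical coordinates: one degree of longitude apart on the equator.
    print(round(haversine_miles(0.0, 0.0, 0.0, 1.0), 1), "miles")
    print(drivable(0.0, 0.0, 0.0, 1.0))
```

With the distance as the input, a single threshold separates the classes. Learning the same rule from four raw columns would require the model to rediscover the spherical-distance formula by itself.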
![Page 18: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/18.jpg)
OF HUMANS AND MACHINES
• Partnership is important
• The aha moment and the strategy come from humans
• Machines do the hard work of calculating fast and do not tire
• Maybe some day machines will be able to do more than they are explicitly asked to do. Today, explicit instructions are the norm.
![Page 19: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/19.jpg)
ENSEMBLES ‐ OUTLIERS ARE NOT INTERESTING – FOR CLASSIFIERS
• Learn many models from random subsets of training data
• Effect of outliers is reduced on a majority of the models
• Random Forests
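The effect described in the bullets above can be sketched with a toy ensemble. This is a minimal illustration, not a Random Forest: a real Random Forest combines decision trees with random feature selection, while this sketch swaps in 1-nearest-neighbor models on one-dimensional data to stay short. The data, the function names, the subset size of one third, and the 101 models are all assumptions.

```python
import random
from collections import Counter


def nearest_label(train, x):
    """1-nearest-neighbor prediction on (value, label) pairs."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def bagged_label(train, x, n_models=101, subset_size=None, seed=0):
    """Majority vote of 1-NN models, each trained on a random subset."""
    rng = random.Random(seed)
    k = subset_size or max(1, len(train) // 3)
    votes = Counter(
        nearest_label(rng.sample(train, k), x) for _ in range(n_models)
    )
    return votes.most_common(1)[0][0]


def demo_data():
    """Two well-separated classes plus one mislabeled outlier."""
    low = [(0.5 * i, 0) for i in range(10)]           # 0.0 .. 4.5, class 0
    high = [(10.0 + 0.5 * i, 1) for i in range(10)]   # 10.0 .. 14.5, class 1
    outlier = [(1.2, 1)]                              # sits inside class 0
    return low + high + outlier


if __name__ == "__main__":
    data = demo_data()
    print("single model:", nearest_label(data, 1.25))
    print("ensemble:    ", bagged_label(data, 1.25))
```

A single model trained on all the data is misled by the outlier next to the query point. Each subset model sees the outlier only about a third of the time, so most models vote for the surrounding class and the majority vote ignores it.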
![Page 20: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/20.jpg)
OUTLIERS ARE ALWAYS INTERESTING FOR RANKING PROBLEMS
• You have to be so good that they cannot ignore you
• My personal thesis: Average in everything is boring. Be outstanding in something.
• Outliers along some dimension always have interesting information – whenever you are combining multiple variables to come up with one global rank
  • Search
  • Job Interviews!
![Page 21: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/21.jpg)
UNKNOWN UNKNOWNS – VERY INTERESTING TO A BUSINESS – OUTLIERS
![Page 22: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/22.jpg)
BIG DATA AND HEALTHCARE
![Page 23: Experiences with big data by Srinivasan Seshadri](https://reader031.vdocuments.mx/reader031/viewer/2022031914/55d4f9b2bb61eb3f428b4572/html5/thumbnails/23.jpg)
ARE YOU IN THE JOB MARKET?