1
Data Science for Tackling the Challenges of Big Data
Dr. Brand Niemann
Director and Senior Data Scientist/Data Journalist
Semantic Community
http://semanticommunity.info/
http://www.meetup.com/Federal-Big-Data-Working-Group/
http://semanticommunity.info/Data_Science/Federal_Big_Data_Working_Group_Meetup
November 14, 2014
2
Overview
• Six Week MIT Online Course:
  – Started November 4th and completed November 12th.
• Mined this MIT Online Course for Data Sets and Ideas:
  – Found a subset of the slides that contained data sets and ideas and were interesting and useful visualizations in themselves.
• Professor Karger's Lecture Slides on Visualization User Interfaces Were All About My Heroes:
  – Tukey, Tufte, Shneiderman, and Spotfire. (In fact it was everything leading up to Spotfire, but Spotfire itself!)
• Preserve My Work & Present Tutorial to the Federal Big Data Working Group Meetup:
  – MindTouch Knowledge Base, Excel Spreadsheet Index, and Spotfire Interactive Visualizations.
3
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Assessment
Web Site (private)
4
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Course Progress
https://mitprofessionalx.edx.org/courses/MITProfessionalX/6.BDX/2T2014/progress
5
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Big Data Storage
Web Site (private)
6
MITProfessionalX 6.BDx Tackling the Challenges of Big Data: Modern Databases
Web Site (private) and Script (Public)
Script
7
Courseware: Big Data Storage
• I was especially interested in the following since both Professors Stonebraker and Madden presented to our Federal Big Data Working Group Meetup:
  – This module begins with an overview of a number of these technologies by renowned database professor Mike Stonebraker. In his unique and ardent fashion, Mike expresses his skepticism about many new technologies, particularly Hadoop/MapReduce and NoSQL, and voices support for many new relational technologies, including column stores and main memory databases.
  – After that, Professors Matei Zaharia and Samuel Madden provide a more nuanced view of the tradeoffs between the various approaches, discussing Hadoop and its derivatives, as well as NoSQL and its tradeoffs, in more detail.
  – Professor Stonebraker expresses a number of strong opinions in this module. Which of them do you agree with? Which do you disagree with? Why?
3.0 Introduction to Big Data Storage and Discussion 3
8
Selected Slides: Professor Sam Madden
What Is This Course Going to Cover? Other Techniques We'll Cover
9
Selected Slides: Professor David Karger
Overview Interaction Strategy
10
Selected Slides: Professor Daniela Rus
Case Study: Transportation in Singapore
1.1 Case Study: Transportation - PDF of Presentation slides (Rus)
11
Google Search: Singapore Taxi Data
12
Think Business: Why can’t I find a taxi when I really need one?
http://thinkbusiness.nus.edu/smart-finance/item/131-why-can%E2%80%99t-i-find-a-taxi-when-i-really-need-one?
Based on: Labor Supply Decisions of Singaporean Cab Drivers, May 8, 2013
Newer Paper: Labor Supply Decisions of Singaporean Cab Drivers, September 2014
13
Labor Supply Decisions of Singaporean Cab Drivers: Table 1: Summary Statistics by Days
http://www.ushakrisna.com/Cabdrivers.pdf
14
MIT Big Data Knowledge Base: Table 1 Spreadsheet
Spreadsheet
My Note: The PDF is image-only, so I had to build the spreadsheet by hand!
15
Singapore Land Transport Authority: Traffic Info Service Providers
http://www.lta.gov.sg/content/ltaweb/en/industry-matters/traffic-info-service-providers.html
16
Singapore Land Transport Authority: MyTransport.sg
http://www.mytransport.sg/content/mytransport/home/dataMall.html#All_Datasets
Screen Scrape
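My Note: A screen scrape of a dataset listing like this one can be sketched in a few lines of Python. This is only a minimal illustration using the standard library's `html.parser`; the `LinkCollector` class, the `scrape_links` helper, and the sample HTML are all hypothetical stand-ins, not the actual markup of the DataMall page or the method used to build the spreadsheet.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (label, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (label, href) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces in the label.
            label = " ".join("".join(self._text).split())
            self.links.append((label, self._href))
            self._href = None


def scrape_links(html):
    """Return all (label, href) pairs found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup standing in for a page of dataset links.
SAMPLE_HTML = """
<ul>
  <li><a href="/data/example-a.csv">Example Dataset A</a></li>
  <li><a href="/data/example-b.csv">Example
      Dataset B</a></li>
  <li>No link here</li>
</ul>
"""

if __name__ == "__main__":
    for label, href in scrape_links(SAMPLE_HTML):
        print(label, href, sep="\t")
```

The tab-separated output pastes directly into a spreadsheet, one row per dataset link.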
17
Singapore Land Transport Authority: All Datasets Spreadsheet
Spreadsheet
18
MIT Big Data Knowledge Base: MindTouch
Data Science for Tackling the Challenges of Big Data
Labor Supply Decisions of Singaporean Cab Drivers, September 2014, as a Data Science Data Publication
19
MIT Big Data: Knowledge Base Spreadsheet
Spreadsheet
20
MIT Big Data: Course Participant Spreadsheet
Spreadsheet
My Note: This was mapped in Spotfire after data curation (cleaning of the country names). Spotfire has built-in data curation functions.
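My Note: For readers without Spotfire, the same kind of country-name cleaning can be approximated in plain Python. This is a rough sketch, not Spotfire's own curation logic; the `CANONICAL` list, the `ALIASES` table, and the `clean_country` function are hypothetical and would need to be extended to match the spellings actually present in the participant spreadsheet.

```python
import unicodedata

# Hypothetical canonical names (assumed for illustration only).
CANONICAL = ["United States", "United Kingdom", "Singapore", "India"]

# Hypothetical alias table: variant spelling -> canonical name.
ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
}

# One lookup keyed by case-folded text, covering canonical names and aliases.
LOOKUP = {name.casefold(): name for name in CANONICAL}
LOOKUP.update(ALIASES)


def clean_country(raw):
    """Map a free-text country entry to a canonical name when one is known.

    Unknown entries are returned with whitespace tidied but otherwise
    unchanged, so they can be reviewed by hand.
    """
    text = unicodedata.normalize("NFKC", raw)  # fold compatibility characters
    text = " ".join(text.split())              # trim and collapse whitespace
    return LOOKUP.get(text.casefold(), text)


if __name__ == "__main__":
    for entry in ["  usa ", "SINGAPORE", "United  States of America", "Atlantis"]:
        print(repr(entry), "->", clean_country(entry))
```

Leaving unrecognized names untouched is a deliberate choice: a map that silently guesses a country is harder to audit than one with a short list of leftovers to fix manually.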
21
MIT Big Data: Spotfire Cover Page
Web Player
22
MIT Big Data: Student Enrollment
Web Player
23
MIT Big Data: Singaporean Cab Drivers
Web Player
24
New York City Open Data: Socrata
https://nycopendata.socrata.com/
25
New York City Open Data: Search Results
Web Site
My Note: Could Only Find Taxi Drivers Data.
26
New York City Open Data: Data Table
Web Site and Medallion_Drivers_-_Active.xlsx
Download: XLSX
27
Visualizing NYC’s Open Data: Socrata Beta
https://nycopendata.socrata.com/viz
28
MIT Big Data Assessment: Questions and Answers
• Big Data Collection
  – 2) Data science requires:
    • Knowledge of statistics
    • Knowledge of data management
    • Knowledge of curation
    • All of the above (correct)
• Big Data Systems
  – 13) For which of the following tasks is interactive visualization most useful? (choose all that apply)
    • Developing a hypothesis about data (correct)
    • Formally confirming a hypothesis
    • Communicating a conclusion about data (correct)
    • All of the above
• Big Data Analytics:
  – 13) Big Data means that there's no shortage of useful data.
    • True
    • False (correct)
Story