1.2 visualizing twitter - university of...

13
1 Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor and Director of Big Data at CSAIL Massachusetts Institute of Technology Tackling The Challenges of Big Data Visualizing Twitter Introduction to Twitter Data Samuel Madden Professor and Director of Big Data at CSAIL Massachusetts Institute of Technology © 2014 Massachusetts Institute of Technology Tackling the Challenges of Big Data This Module 1. Understanding Twitter data 2. Demonstration of MapD, a Twitter visualization system 3. Technology behind MapD 4. Other Approaches to Interactive Analytics

Upload: others

Post on 21-Jun-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

1

Tackling The Challenges of Big Data Visualizing Twitter

Samuel Madden

Professor and Director of Big Data at CSAIL

Massachusetts Institute of Technology

Tackling The Challenges of Big Data Visualizing Twitter

Introduction to Twitter Data

Samuel Madden Professor and Director of Big Data at CSAIL

Massachusetts Institute of Technology

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

This Module

1.  Understanding Twitter data!!2.  Demonstration of MapD, a Twitter visualization system!

3. Technology behind MapD!!4. Other Approaches to Interactive Analytics!

Page 2: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

2

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Part 1: Twitter

•  500 million tweets a day!– > 8 million/day “geocoded”!

•  More than just 140 characters:!– Geo Coordinates!– Timestamp!– User and Follower information!– Reply Information!– Hashtags!– Device/Platform Used to Post!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Geocoded Data

Geocoded Tweets containing “snow” on Oct 4, 2013!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

“Big Data” and Twitter

•  Volumes and rates are massive; need new tools to interactively visualize data!

Page 3: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

3

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

“Big Data” and Twitter

Not the best way to visualize 500 M Tweets!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

“Big Data” and Twitter

•  Volumes and rates are massive; need new tools to interactively visualize data!

•  Want to correlate with external and internal data sets!– E.g., brand preference vs census district income!

•  Want to do deep analysis of content!– What product, show, or person is being discussed!– What opinion is being expressed (“sentiment analysis”)!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Spatial Correlations

Source: http://www.richblockspoorblocks.com!

0!0.005!

0.01!0.015!

0.02!0.025!

0.03!0.035!

0.04!0.045!

0.05!

0! 100,000! 200,000! 300,000! 400,000! 500,000!

% T

wee

ts

Men

tioni

ng

Star

buck

s!

Zip Code Income!

Requirement: Interactive learning and statistics!

Page 4: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

4

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

“Big Data” and Twitter

•  Volumes and rates are massive; need new tools to interactively visualize data!

•  Want to correlate with external and internal data sets!– E.g., brand preference vs census district income!

•  Want to do deep analysis of content!– What product, show, or person is being discussed!– What opinion is being expressed (“sentiment analysis”)!

Tackling The Challenges of Big Data Use Case: Visualizing Twitter

Introduction to Twitter

THANK YOU

Tackling The Challenges of Big Data Visualizing Twitter

MapD Demo

Samuel Madden Professor and Director of Big Data at CSAIL

Massachusetts Institute of Technology

Page 5: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

5

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Part 2: MapD Demo

•  What is MapD?!– GPU Accelerated Database!– With a Twitter Visualization in Front of it!

•  What Will You See!– ~ 50 M tweets displayed!– Nothing is pre-computed!

•  Designed to tackle the “volume” and “velocity” challenges!

Tackling The Challenges of Big Data Visualizing Twitter

MapD Demo

THANK YOU

Tackling The Challenges of Big Data Visualizing Twitter

How MapD Works

Samuel Madden Professor and Director of Big Data at CSAIL

Massachusetts Institute of Technology

Page 6: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

6

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

How Does MapD Work?

•  Key insight: GPUs have enough memory that a cluster of them can store substantial amounts of data!

•  Not an accelerator, but a full blown SQL Database!!

WHAT IS MAPD? MapD is: � A GPU (Graphics Processing Unit)-

accelerated SQL column store database � Scales to any number of Nvidia

GPUs � A real-time map generator � Uses GPUs to render point and

heatmaps of query results in milliseconds

� A WMS web-server � Can serve out of the box as the

backend for a web mapping client, allowing for querying and visualization of billions of features

� Fast and cost-effective � 4 Nvidia commodity GPUs provide

provide over 12 Teraflops of compute power and nearly 1 TB/sec of memory bandwidth

147,201,658 tweets from Oct 1, 2012 to Nov 6, 2012

Relative intensity of “tornado” on Twitter (with point overlay) from Febuary 29, 2012 to March 1, 2012

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

GPUs

•  Massive parallelism enables interactive browsing interfaces!– High End GPUs can provide!

250 GB/sec of bandwidth!5X conventional microprocessors!

4 Teraflops compute !10X conventional multi-core microprocessor!

– 7 – 70x speedup in database ops!•  Challenges!

– Limited Memory on GPUs – But growing!!– Limited bandwidth between CPU and GPU – Will Change!!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

“Shared Nothing” Processing Multiple GPUs, with data partitioned between them!

Node 1! Node 2! Node 3!

Query: Heatmap tweets containing “rain”!

Filter!text ILIKE ‘rain’!

Filter!text ILIKE ‘rain’!

Filter!text ILIKE ‘rain’!

Tweets 0 – 100M! Tweets 100M – 200M! Tweets 200M – 300M!Images:http://graffletopia.com!

Page 7: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

7

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Tweet Indexing on GPU •  Encode tweets using a “dictionary”!

Word! Encoding!…! …!

Rain! 57663!

Rainbow! 57664!

Rainman! 57665!

Rainy! 57666!

…! …!

Filter!text ILIKE ‘rain’!

!

Filter!SELECT tweetid FROM words

WHERE id = 57663!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  Column-oriented execution!

– Avoids wasting memory bandwidth!•  Plan:!

– Produce bitmap of tweets to read!– Read tweets, increment output bins

in bitmap!

Filter!SELECT tweet id FROM words WHEREid = 57663!

TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!

TweetId! Lat! Lon!…! …!1! -41.5! 23.1!2! -41.7! 77.4!3! -37.4! 48.2!4! 28.4! -44.0!…! …!Data Tables Reside in GPU Memory!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Balanced data per thread improves performance!

Bitmap!0!0!0!0!0!0!…!

Tweet 1!

Tweet n!

Warp 2!

Warp 3!

Warp 1!

TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!

Page 8: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

8

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!0!0!1!0!…!

Tweet 1!

Tweet n!

Warp 2!

Warp 3!

Warp 1!

TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!0!0!1!0!…!

Tweet 1!

Tweet n!

Warp 2!

Warp 3!

Warp 1!

TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!1!1!1!0!…!

Tweet 1!

Tweet n!

TweetId! WordId!…! …!1! 57663!2! 57664!2! 27!3! 8841!…! …!

Page 9: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

9

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!1!1!1!0!…!

Tweet 1!

Tweet n!

Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!1!1!1!0!…!

Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!

Warp 2!

Warp 3!

Warp 1!

Output buffer!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Example: Filtering in Parallel !•  1000+ GPU Threads!•  Running in “Warps”!•  Threads in same warp run exactly the same instructions!

– Need same amount of data to be efficient!

Bitmap!1!0!1!1!1!0!…!

Lat! Lon!…!-41.5! 23.1!-41.7! 77.4!-37.4! 48.2!28.4! -44.0!…!

Warp 2!

Warp 3!

Warp 1!

Output buffer!

Page 10: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

10

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Parallel Plumbing

Once parallel versions of all operators are built, programmers can just think in terms of SQL!

E.g.,!!SELECT heatmap(lat,lon) !!WHERE tweettext ILIKE “rain”!

!Scaling to 100’s of millions of points in milliseconds!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Summary

•  Interactivity can create a qualitative difference!

•  GPUs can dramatically accelerate some tasks!– Compute intensive tasks via parallelism!– Data intensive tasks via increased memory bandwidth!

•  Clever use of hardware enables dramatic speedups!•  Arrays of multiple GPUs help solve large problems!

Try it yourself:!http://mapd.csail.mit.edu!

Tackling The Challenges of Big Data Visualizing Twitter

How MapD Works

THANK YOU

Page 11: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

11

Tackling The Challenges of Big Data Visualizing Twitter

Other Approaches to Interactive Analytics

Samuel Madden Professor and Director of Big Data at CSAIL

Massachusetts Institute of Technology

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Other Approaches to Interactive Analytics •  Interactive Analytics Important in Many Situations!

•  Examples!– Fault Diagnostics!–  Intrusion Detection!– Financial Analysis!– Online Advertising!

•  GPUs & Massive Parallelism are Just One Way Approach!

!

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Other Techniques We’ll Cover

•  Brute Force / Massive Parallelism!– MapD (Hadoop, and many others)!

•  Partitioning!– Split the data, operate on the part you need!

•  Sampling!– Operate on a Subset of the Data!

•  Summarization!– Operate on a Summary of the Data!

Page 12: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

12

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Partitioning

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Sampling

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Summarization

Page 13: 1.2 Visualizing Twitter - University of Baltimorehome.ubalt.edu/ntsbagga/BI/1.2_Visualizing_Twitter.pdf · Tackling The Challenges of Big Data Visualizing Twitter Samuel Madden Professor

13

© 2014 Massachusetts Institute of Technology!Tackling the Challenges of Big Data!

Other Techniques We’ll Cover

•  Brute Force / Massive Parallelism!– MapD (Hadoop, and many others)!

•  Partitioning!– Split the data, operate on the part you need!

•  Sampling!– Operate on a Subset of the Data!

•  Summarization!– Operate on a Summary of the Data!

Tackling The Challenges of Big Data Visualizing Twitter

THANK YOU