6 things to expect when you are visualizing

Post on 12-Apr-2017


Krist Wongsuphasawat / @kristw

6 THINGS TO EXPECT WHEN YOU ARE VISUALIZING

Computer Engineer Bangkok, Thailand

Chulalongkorn University

Programming + Soccer

(P.S. These are actually not my robots, but our competitors’.)

PhD in Computer Science, Information Visualization, Univ. of Maryland

IBM, Microsoft

Data Visualization Scientist, Twitter

#interactive visualizations

Open-source projects

Visual Analytics Tools

DATA =ME+ VIS

Data, I’m ready!

Here I come!

WHAT TO EXPECT?

1. EXPECT TO FIND THE REAL NEED

INPUT (DATA)

What clients think they have / What they usually have

YOU

What clients think you are / What they will get

OUTPUT (VIS)

What clients ask for / What they really need

COMMUNICATE

GOALS

Present data: Communicate information effectively. Who is the audience? What do you want to tell?

Analyze data: Exploratory data analysis. What are the questions?

Tools to analyze data: Reusable tools for exploration. Who will use this? What would they use this for?

Enjoy: Who is the audience?

Combination of above

I need this. Take this.

I need this. Here you are.

& COMPROMISE

2. EXPECT TO CLEAN DATA A LOT

70-80% of time cleaning data

“DATA JANITOR”

Collect + Clean + Transform

DATA WRANGLING

WHY DOES IT TAKE SO MUCH TIME?

2.1 Many sources and data formats

DATA SOURCES

Open data: Publicly available

Internal data: Private, owned by clients’ organization

Self-collected data: Manual, site scraping, etc.

Combine the above

DATA FORMAT

Standalone files: txt, csv, tsv, json, Google Docs, …, pdf*

Databases: doesn’t necessarily mean they are organized

API: better quality with more overhead

Website

Big data*

NEED TO…

Change format: e.g. tsv => json

Combine data

Resolve multiple sources of truth
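
As an illustration of the "tsv => json" step, here is a minimal sketch using only the Python standard library. The function name and the sample rows are made up for this example; note that every value stays a string after conversion, so numeric columns would still need to be cast.

```python
import csv
import io
import json


def tsv_to_json(tsv_text: str) -> str:
    """Convert tab-separated text with a header row into a JSON array of objects."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return json.dumps([dict(row) for row in reader], ensure_ascii=False, indent=2)


# Hypothetical sample input (not real data).
sample = "user\trestaurant\trating\nE\tIHOP\t4\nF\tSUBWAY\t4\n"
print(tsv_to_json(sample))
```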

2.2 Data transformation is needed.

EXAMPLES

Convert latitude/longitude into zip code

Change country code from 3-letter (USA) to 2-letter (US)

Correct time of day based on users’ timezone

etc.
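
Two of these transformations can be sketched in a few lines of Python. This is only an illustration: the lookup table holds three entries instead of the full ISO 3166-1 list, and the timezone is assumed to be available as a fixed UTC offset in hours, which ignores daylight saving time (a real pipeline would likely use a timezone database).

```python
from datetime import datetime, timedelta, timezone

# Tiny illustrative lookup; a real project would load the complete country table.
ALPHA3_TO_ALPHA2 = {"USA": "US", "THA": "TH", "GBR": "GB"}


def to_alpha2(code: str) -> str:
    """Map a 3-letter country code to its 2-letter form; raises KeyError if unknown."""
    return ALPHA3_TO_ALPHA2[code.strip().upper()]


def local_hour(utc_timestamp: datetime, utc_offset_hours: float) -> int:
    """Hour of day in the user's timezone, given a timezone-aware UTC timestamp."""
    user_tz = timezone(timedelta(hours=utc_offset_hours))
    return utc_timestamp.astimezone(user_tz).hour


# Hypothetical event: 02:30 UTC, posted by a user in Bangkok (UTC+7).
posted = datetime(2017, 4, 12, 2, 30, tzinfo=timezone.utc)
print(to_alpha2("usa"), local_hour(posted, 7))
```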

2.3 Data collection issues

EXAMPLES

Typos

Incorrect values

Incorrect timestamps

Missing data

2.4 Definition of “clean” data

IS THIS CLEAN?

USER  RESTAURANT  RATING
========================
A     MCDONALD’S  3
B     MCDONALDS   3
C     MCDONALD    4
D     MCDONALDS   5
E     IHOP        4
F     SUBWAY      4

How many reviews are there? Clean.

How many restaurants are there? Not clean. McDonald, McDonald’s, McDonalds
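
The same table answers one question as-is and needs cleaning for the other. A small sketch, assuming a hand-written alias table (the `CANONICAL` mapping is hypothetical and would have to be curated per dataset):

```python
import re

reviews = [
    ("A", "MCDONALD’S", 3), ("B", "MCDONALDS", 3), ("C", "MCDONALD", 4),
    ("D", "MCDONALDS", 5), ("E", "IHOP", 4), ("F", "SUBWAY", 4),
]

# Hypothetical alias table: cleaned spellings that refer to the same restaurant.
CANONICAL = {"MCDONALD": "MCDONALDS"}


def normalize(name: str) -> str:
    """Uppercase, drop apostrophes and other punctuation, then apply known aliases."""
    key = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return CANONICAL.get(key, key)


review_count = len(reviews)                                   # no cleaning needed
raw_restaurants = len({name for _, name, _ in reviews})       # counts spelling variants
clean_restaurants = len({normalize(name) for _, name, _ in reviews})
print(review_count, raw_restaurants, clean_restaurants)
```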

2.5 Bigger data, bigger problems

HAVING ALL TWEETS

How people think I feel. How I really feel.

GETTING BIG DATA

Hadoop Cluster: Data Storage => Tool: Scalding (slow) => Smaller dataset

Your laptop: Smaller dataset => Tool: node.js / python / excel (fast) => Final dataset

CHALLENGES

Slow: Long processing time (hours)

Get relevant Tweets: hashtag: #oscars, keywords: “moonlight” (movie name)

Too big: Need to aggregate & reduce size

Harder to spot problems
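
The "get relevant Tweets" and "aggregate & reduce size" steps can be pictured with the sketch below. It shows the logic only, in plain Python on a handful of made-up tweets; on the cluster the equivalent job would be written in Scalding, and the keyword list and per-minute bucketing are assumptions for this example.

```python
from collections import Counter

# Assumed filter terms, taken from the examples on the slide.
KEYWORDS = ("#oscars", "moonlight")


def is_relevant(text: str) -> bool:
    """Case-insensitive substring match against the hashtag/keyword list."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


def count_per_minute(tweets) -> Counter:
    """tweets: iterable of (iso_timestamp, text). Keys look like 'YYYY-MM-DDTHH:MM'."""
    counts = Counter()
    for timestamp, text in tweets:
        if is_relevant(text):
            counts[timestamp[:16]] += 1  # truncate the timestamp to the minute
    return counts


# Hypothetical sample (not real Tweets).
sample = [
    ("2017-02-26T20:00:05", "Watching the #Oscars tonight"),
    ("2017-02-26T20:00:40", "Moonlight deserves it"),
    ("2017-02-26T20:01:10", "What's for dinner?"),
    ("2017-02-26T20:01:30", "#oscars red carpet"),
]
print(count_per_minute(sample))
```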

RAMSAY & RAMSEY

2.6 New issues can show up any time.

RECOMMENDATIONS

Always think that you will have to do it again: document the process, automation

Reusable scripts: break a gigantic do-it-all function into smaller ones

Reusable data: keep for future project
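
One way to picture "smaller ones": each step becomes a function that can be tested and reused on its own, and the pipeline is just the steps chained together. The record layout and function names below are invented for this sketch.

```python
from collections import Counter


def parse(line: str) -> dict:
    """Split one tab-separated line into a record."""
    user, restaurant, rating = line.rstrip("\n").split("\t")
    return {"user": user, "restaurant": restaurant, "rating": int(rating)}


def clean(record: dict) -> dict:
    """Return a copy with the restaurant name trimmed and uppercased."""
    return {**record, "restaurant": record["restaurant"].strip().upper()}


def count_by_restaurant(records) -> Counter:
    """Count reviews per restaurant."""
    return Counter(record["restaurant"] for record in records)


def run(lines) -> Counter:
    """The whole pipeline: parse, clean, then aggregate."""
    return count_by_restaurant(clean(parse(line)) for line in lines)


print(run(["A\tihop \t4\n", "B\tIHOP\t5\n", "C\tSubway\t4\n"]))
```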

3. EXPECT TRIALS AND ERRORS

It’s gonna be legen-

Celebrate your trials

#D3BrokeAndMadeArt

When your vis starts working

“Necessity is the mother of invention.”

— English Proverb

DEADLINE

EXAMPLE PROJECTS

PROJECT 1:

GAME OF THRONES #INTERACTIVE

INTERACTIVE.TWITTER.COM

WHAT TO EXPECT

timely: Deadline is strict. There can also be unexpected events.

wide audience: easy to explain and understand, multi-device support

one-off project

scope: analyze data to find stories and find the best way to present them

Reveal the talking points of every episode of Game of Thrones from fans’ conversations

Problem is coming.

CHAPTER I

Problem

Want to know what the audience says about a TV show from Tweets

HBO’s Game of Thrones

Based on a book series “A Song of Ice and Fire” Medieval Fantasy. Knights, magic and dragons.

Brief Story

A King dies. 

A lot of contenders wage a war to reclaim the throne.

Minor characters with no claim to the throne set their own plans in action to gain power

when all the major characters end up killing each other.

Brave/Honest/Honorable characters die.

Intelligent but shady characters and characters who know nothing

continue to live.

While humans are busy killing each other, ice zombies “White walkers” are invading from the North.

The only group that seems to care about this is a neutral group called the Night’s Watch.

Many characters. Anybody can die.

6 seasons (60 episodes) so far

Multiple storylines in each episode

Ideas

Common words: Too much noise

Characters: How often was each character mentioned?

I demand a trial by prototyping.

CHAPTER II

Prototyping

Pull sample data from Twitter API

Entity recognition and counting naive approach

List of names

Daenerys Targaryen,Khaleesi

Jon Snow

Sansa Stark

Tyrion Lannister

Arya Stark

Cersei Lannister

Khal Drogo

Gregor Clegane,Mountain

Margaery Tyrell

Joffrey Baratheon

Bran Stark

Theon Greyjoy

Jaime Lannister

Brienne

Eddard Stark,Ned Stark

Ramsay Bolton

Sandor Clegane,Hound

Ygritte

Stannis Baratheon

Petyr Baelish,Little Finger

Robb Stark

Bronn

Varys

Catelyn Stark

Oberyn Martell

Daario Naharis

Davos Seaworth

Jorah Mormont

Melisandre

Myrcella Baratheon

Tywin Lannister

Tommen Baratheon

Grey Worm

Tyene Sand

Rickon Stark

Missandei

Roose Bolton

Robert Baratheon

Jojen Reed

Jeor Mormont

Tormund Giantsbane

Lysa Arryn

Yara Greyjoy,Asha Greyjoy

Samwell Tarly,Sam

Hodor

Victarion Greyjoy

High Sparrow

Dragon

Winter

Dothraki

Sample Tweet

Sample data

Character   Count
Hodor       10000
Jon Snow    5000
Daenerys    4000
Bran Stark  3000
…           …

*These numbers are made up for presentation, not real data.
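
A naive count along these lines could look like the sketch below. It is an assumption about how such a prototype might work, not the code used in the project: each character gets a few aliases (here a hand-picked subset, with first names added as extra aliases), and a tweet counts once per character it mentions.

```python
import re
from collections import Counter

# Canonical name -> aliases to search for (small illustrative subset).
ALIASES = {
    "Daenerys Targaryen": ["Daenerys", "Khaleesi"],
    "Jon Snow": ["Jon Snow"],
    "Hodor": ["Hodor"],
}

# One case-insensitive, whole-word pattern per character.
PATTERNS = {
    name: re.compile(r"\b(?:" + "|".join(map(re.escape, aliases)) + r")\b", re.IGNORECASE)
    for name, aliases in ALIASES.items()
}


def count_mentions(tweets) -> Counter:
    """Count tweets mentioning each character (at most once per tweet)."""
    counts = Counter()
    for text in tweets:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


# Hypothetical tweets (not real data).
tweets = ["HODOR!!! hodor", "Khaleesi and Jon Snow need to meet", "jon snow knows nothing"]
print(count_mentions(tweets).most_common())
```

Being naive, this approach also matches ambiguous words such as "Winter" or "Dragon" wherever they appear, which is part of why the counts need a sanity check.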

When you play the game of vis, you iterate or you die.

CHAPTER III

Where to go from here?

+ episodes

The Guardian & Google Trends

http://www.theguardian.com/news/datablog/ng-interactive/2016/apr/22/game-of-thrones-the-most-googled-characters-episode-by-episode

+ emotion

+ connections

Gain insights from a single episode: emotion & connections

Sample data

INDIVIDUALS (+ top emojis)

Character  Count
Hodor      10000
Jon Snow   5000
Daenerys   4000
…          …

CONNECTIONS (+ top emojis)

Character         Count
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …

*These numbers are made up for presentation, not real data.

Graph

NODES (+ top emojis)

Character  Count
Hodor      1000
Jon Snow   500
Daenerys   400
…          …

LINKS (+ top emojis)

Character         Count
Jon Snow+Sansa    1000
Tormund+Brienne   500
Bran Stark+Hodor  300
…                 …

*These numbers are made up for presentation, not real data.
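
Turning per-tweet mentions into nodes and links can be sketched as below, assuming each tweet has already been reduced to the set of characters it mentions. The input format and function name are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def build_graph(mentions_per_tweet):
    """mentions_per_tweet: iterable of sets of character names, one set per tweet."""
    nodes, links = Counter(), Counter()
    for mentioned in mentions_per_tweet:
        nodes.update(mentioned)
        # Sort first so that (A, B) and (B, A) are counted as the same link.
        links.update(combinations(sorted(mentioned), 2))
    return nodes, links


# Hypothetical input (not real data).
nodes, links = build_graph([
    {"Jon Snow", "Sansa"},
    {"Tormund", "Brienne"},
    {"Jon Snow", "Sansa", "Hodor"},
    {"Hodor"},
])
print(nodes.most_common(3))
print(links.most_common(3))
```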

Network Visualization

Node-link diagram

Force-directed layout http://blockbuilder.org/kristw/762b680690e4b2b2666dfec15838a384

Issue: Hairball

Issue: Occlusions

Tried: Fixed positions

+ Collision Detection

http://blockbuilder.org/kristw/2850f65d6329c5fef6d5c9118f1de6e6
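
To show the idea behind collision detection outside of D3, here is a simplified pairwise sketch in Python. It is not the code from the linked block: it checks every pair of circles and pushes overlapping ones apart by half the overlap each, repeating until nothing moves or an iteration limit is reached.

```python
import math


def resolve_collisions(circles, iterations=50):
    """circles: list of dicts with x, y, r. Pushes overlapping circles apart in place."""
    for _ in range(iterations):
        moved = False
        for i in range(len(circles)):
            for j in range(i + 1, len(circles)):
                a, b = circles[i], circles[j]
                dx, dy = b["x"] - a["x"], b["y"] - a["y"]
                dist = math.hypot(dx, dy)
                overlap = a["r"] + b["r"] - dist
                if overlap <= 0:
                    continue
                if dist == 0:
                    ux, uy = 1.0, 0.0  # identical centers: pick an arbitrary direction
                else:
                    ux, uy = dx / dist, dy / dist
                # Move each circle half of the overlap along the line between centers.
                a["x"] -= ux * overlap / 2
                a["y"] -= uy * overlap / 2
                b["x"] += ux * overlap / 2
                b["y"] += uy * overlap / 2
                moved = True
        if not moved:
            break
    return circles


print(resolve_collisions([{"x": 0.0, "y": 0.0, "r": 10.0}, {"x": 5.0, "y": 0.0, "r": 10.0}]))
```

Checking all pairs is quadratic in the number of circles, which is fine for a few dozen characters; larger graphs would want a spatial index.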

+ Community Detection

https://github.com/upphiminn/jLouvain

+ Collision Detection (with clusters)

https://bl.ocks.org/mbostock/7881887

Tormund + Brienne

Issue: Convex hull

http://bl.ocks.org/mbostock/4341699

x & y only, no radius

Example

Fix it
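
One possible fix, sketched here in Python and not necessarily the one used in the project, is to feed the hull points sampled on each circle's edge instead of the centers, so the radius is accounted for. The hull itself uses Andrew's monotone chain algorithm; the number of samples per circle is an arbitrary choice.

```python
import math


def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def hull_around_circles(circles, samples=16):
    """circles: iterable of (x, y, r). Hull of points sampled on each circle's edge."""
    points = []
    for x, y, r in circles:
        for k in range(samples):
            angle = 2 * math.pi * k / samples
            points.append((x + r * math.cos(angle), y + r * math.sin(angle)))
    return convex_hull(points)


circles = [(0.0, 0.0, 10.0), (100.0, 0.0, 10.0)]
print(len(convex_hull([(x, y) for x, y, _ in circles])), len(hull_around_circles(circles)))
```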

Let’s get other episodes.

Hadoop remembers.

CHAPTER IV

More data

Hadoop

Rewrite the scripts in Scalding to get archived data

How much data do we need?

Whole week?

5 days?

2 days?

A day?

etc.

Transitions

Changing episode

After switching episode

1. Store old positions for existing characters.

2. Assign positions for new characters.
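
The two steps can be sketched as follows. How new characters are positioned is not stated on the slide, so the default starting spot here is an assumption, as are the function names; the interpolation helper shows how a stored position can then be animated toward the new layout.

```python
def next_positions(old_positions, characters, default=(0.0, 0.0)):
    """Keep stored positions for characters seen before; start new ones at `default`."""
    return {name: old_positions.get(name, default) for name in characters}


def interpolate(start, end, t):
    """Linear interpolation between two (x, y) positions for 0 <= t <= 1."""
    return (start[0] + (end[0] - start[0]) * t, start[1] + (end[1] - start[1]) * t)


# Hypothetical positions from the previous episode.
old = {"Jon Snow": (10.0, 20.0), "Sansa": (30.0, 40.0)}
start = next_positions(old, ["Jon Snow", "Hodor"])
print(start)
print(interpolate(start["Jon Snow"], (50.0, 60.0), 0.5))
```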

Community transition

t=0 t=1

Smoother

t=0 t=0.5 t=0.51 t=1

Colors

Default: D3 category10. Distinct but nothing about the context

Custom palette: Colors related to the groups/houses.

Black = Night’s Watch, Blue = North, Red = Daenerys, Gold = Lannister, …

Hold the vis.

CHAPTER V

The vis is not enough.

Legend

Navigation

Top 3

Adjust threshold

Recap

Filtered Recap

Tooltip

Demo

https://interactive.twitter.com/game-of-thrones

Mobile Support

A visualizer always evaluates his work.

CHAPTER VI

Self & Peer

Does it solve the problem?

Google Analytics

Pageviews

Visitors

Actions

Referrals Sites/Social

Feedback

PROJECT 2:

VISUAL ANALYTICS TOOLS FOR LOGGING

WHAT TO EXPECT

richer: more features to support exploration of complex data

more technical audience: product managers, engineers, data scientists

accuracy

designed for dynamic input

long-term projects

Data sources

Output

explore

analyze

present

get

*

*

ad-hoc scripts tools for exploration

USER ACTIVITY LOGS

[Diagram, built up over several slides: Users use Twitter. Product Managers are curious. Engineers instrument Twitter, which writes log data in Hadoop.]

WHAT IS BEING LOGGED?

Activities

tweet from home timeline on twitter.com

tweet from search page on iPhone

sign up, log in, retweet, etc.

ORGANIZE?

LOG EVENT A.K.A. “CLIENT EVENT”

1) User ID 2) Timestamp 3) Event name 4) Event detail

Event name: client : page : section : component : element : action

Example: web : home : timeline : tweet_box : button : tweet

[Lee et al. 2012]
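
The six-part event name lends itself to a small parser. The sketch below is illustrative only; the class and function names are made up and this is not Twitter's logging code.

```python
from typing import NamedTuple


class ClientEvent(NamedTuple):
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name: str) -> ClientEvent:
    """Split 'client:page:section:component:element:action' into its six levels."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 levels, got {len(parts)}: {name!r}")
    return ClientEvent(*parts)


event = parse_event_name("web : home : timeline : tweet_box : button : tweet")
print(event.client, event.component, event.action)
```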

LOG DATA

bigger than Tweet data

[Diagram, built up over several slides: Users use Twitter. Engineers instrument Twitter, which writes log data in Hadoop. Curious Product Managers ask Data Scientists, who find, clean, and analyze the log data. Engineers and Data Scientists monitor it. Two parts of the diagram are numbered 1 and 2.]

Event: client : page : section : component : element : action

50,000+ event types

one graph / event x 50,000

DESIGN

See

Client event collection

Engineers & Data Scientists

Interactions: narrow down, search box => filter

HOW TO VISUALIZE?

client : page : section : component : element : action

CLIENT EVENT HIERARCHY

[Tree: iphone => home => -, which branches into (- => - => impression) and (tweet => tweet => click)]

iphone:home:-:-:-:impression

iphone:home:-:tweet:tweet:click
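
Event names like these can be folded into a tree by splitting on the colon, with each node accumulating the counts of everything beneath it. A minimal sketch with made-up counts; the node layout (`count` plus `children`) is an assumption for this example.

```python
def build_hierarchy(event_counts):
    """event_counts: dict of 'client:page:...:action' -> count. Returns a nested dict tree."""
    root = {"count": 0, "children": {}}
    for name, count in event_counts.items():
        node = root
        node["count"] += count
        for level in name.split(":"):
            node = node["children"].setdefault(level, {"count": 0, "children": {}})
            node["count"] += count  # every ancestor sums the events below it
    return root


# Hypothetical counts (not real data).
tree = build_hierarchy({
    "iphone:home:-:-:-:impression": 100,
    "iphone:home:-:tweet:tweet:click": 20,
})
home = tree["children"]["iphone"]["children"]["home"]
print(home["count"], sorted(home["children"]["-"]["children"]))
```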

DETECT CHANGES

[The same tree shown twice: TODAY compared to 7 DAYS AGO]

CALCULATE CHANGES

+5% +5% +5%

+10% +10% +10%

-5% -5% -5%

DIFF
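
The comparison amounts to a percent change per event between the two days. A sketch with made-up counts follows; treating a missing baseline as "no value" is a choice made for this example, not something stated on the slides.

```python
def percent_change(today: float, week_ago: float):
    """Relative change in percent; None when there is no baseline to compare against."""
    if week_ago == 0:
        return None
    return (today - week_ago) * 100.0 / week_ago


def diff_counts(today_counts, week_ago_counts):
    """Percent change for every event name seen on either day."""
    names = sorted(set(today_counts) | set(week_ago_counts))
    return {
        name: percent_change(today_counts.get(name, 0), week_ago_counts.get(name, 0))
        for name in names
    }


# Hypothetical counts (not real data).
today = {"iphone:home:-:-:-:impression": 105, "iphone:home:-:tweet:tweet:click": 95}
week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 100}
print(diff_counts(today, week_ago))
```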

DISPLAY CHANGES

[The same tree, with the changes displayed on it]

Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]

Demo

Demo / Scribe Radar

Twitter for Banana

PROJECT 3:

VISUAL ANALYTICS TOOLS FOR EXPERIMENTATION

A/B TESTING

RUN AN EXPERIMENT

Develop feature

Track metrics: 1. No. of Tweets read 2. No. of Tweets sent 3. No. of Users 4. …

Set bucket size: How many users?

RETROSPECTIVE ANALYSIS

Data scientist analyzed 100+ past experiments.

Many useful insights.
- We could move metric A by X% on average.
- Experiment 18 moved metric A the most.
- Which team was able to move metric A successfully?
- etc.

Amount of knowledge transfer = slide deck + wiki page.

Reproduce for recent experiments? Manually.

Make results more accessible and convenient to use.

Automatic
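
As a rough idea of what automating such a summary could involve, the sketch below averages the change of one metric over past experiment buckets, skipping control buckets and statistically insignificant results. The record fields and all numbers are invented for illustration; this is not the actual Metric Mover implementation.

```python
from statistics import mean


def summarize(results, metric):
    """results: list of dicts with experiment, metric, pct_change, significant, is_control."""
    moves = [
        r for r in results
        if r["metric"] == metric and r["significant"] and not r["is_control"]
    ]
    if not moves:
        return None
    top = max(moves, key=lambda r: abs(r["pct_change"]))
    return {
        "average_pct_change": mean(r["pct_change"] for r in moves),
        "biggest_mover": top["experiment"],
    }


# Hypothetical records (fake data).
results = [
    {"experiment": "Exp. 1", "metric": "posts", "pct_change": 1.0, "significant": True, "is_control": False},
    {"experiment": "Exp. 2", "metric": "posts", "pct_change": 2.0, "significant": True, "is_control": False},
    {"experiment": "Exp. 3", "metric": "posts", "pct_change": 5.0, "significant": False, "is_control": False},
    {"experiment": "Exp. 4", "metric": "posts", "pct_change": 0.0, "significant": True, "is_control": True},
]
print(summarize(results, "posts"))
```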

Metric Mover

I like to move it, move it

Krist Wongsuphasawat, Joseph Liu, Matthew Schreiner, Andy Schlaikjer, Lucile Lu and Busheng Lou

Process

Set OKRs (# of posts) => Implement a feature => Setup experiment => Run experiment => Interpret results (# of posts +1.0%)

Challenges

How easy/hard is it to move this metric? How much change to aim for?

How much to expect from one experiment? What were the successful features? Who had experience with this?

How good is this?

Metric Mover

Past experiments

[Chart, built up over several slides. Metric: No. of Posts. Rows: Exp. 1, Exp. 2, Exp. 3, Exp. 4. Annotations: Control buckets, Insignificant buckets. Axes: % change (-2% to 2%) and |scaled impact| (100 to 100,000,000). Labels: Users who watch cat GIFs, Users who like cat GIFs, Users who post cat GIFs.]

**These are fake data.**

WORKFLOW

Identify needs

Design and prototype: Make it work for sample dataset

Refine, generalize and productionize: Make it work for other cases

Document and release

Maintain and support: Keep it running, feature requests & bug fixes

4. EXPECT TIME FOR REFINEMENT

What separates good and great work

REFINE & POLISH

UX / UI + Mobile Support

Color

Animation / Transition

Performance: Loading time, Data file size

“The little of visualisation design” by Andy Kirk http://www.visualisingdata.com/2016/03/little-visualisation-design/

“The first 90% of the code accounts for the first 90% of the development time.

The remaining 10% of the code accounts for the other 90% of the development time.”

— Tom Cargill, Bell Labs

5. EXPECT FEEDBACK

or find ways to get some

“Feedback is the breakfast of champions.”

— Ken Blanchard

FEEDBACK

During development: Feedback sessions with clients/potential users

After release: Logging; User study; Forum, User group; Office hours

6. EXPECT TO IMPROVE

HOW TO BE BETTER?

Time is limited.

Learn from the past

Expand skills

Get help / Grow the team

Improve tooling: Solve a problem once and for all

Automate repetitive tasks

http://twitter.github.io/labella.js

Demo / Labella.js

https://github.com/twitter/d3kit

Demo / d3Kit

http://www.slideshare.net/kristw/d3kit

yeoman.io

Demo / Yeoman

SUMMARY

EXPECT…

1. to find the real need

2. to clean data a lot

3. trials and errors

4. time for refinement

5. feedback

6. to improve

Krist Wongsuphasawat / @kristw

kristw.yellowpigz.com

THANK YOU

QUESTIONS?

ACKNOWLEDGEMENT

My colleagues at Twitter for their collaboration and support in these projects; and my wife for taking care of the baby while I made these slides.

RESOURCES

Images: Banana phone http://goo.gl/GmcMPq; Bar chart https://goo.gl/1G1GBg; Boss https://goo.gl/gcY8Kw; Champions League http://goo.gl/DjtNKE; Database http://goo.gl/5N7zZz; Fishing shark http://goo.gl/2fp4zW; Frustrated programmer https://goo.gl/ZLDNny; Globe visualization http://goo.gl/UiGMMj; Harry Potter http://goo.gl/Q9Cy64; Holding phone http://goo.gl/It2TzH; Jon Snow https://goo.gl/CACWxE; Jon Snow lightsaber https://goo.gl/CJt1Tn; Kiwi orange http://goo.gl/ejQ73y; Kiwi http://goo.gl/9yk7o5; Library https://goo.gl/HVeE6h; Library earthquake http://goo.gl/rBqBrs; Minion http://goo.gl/I19Ijg; Nemo https://goo.gl/m0pmzC; Orange & Apple http://goo.gl/NG6RIL; Pile of paper http://goo.gl/mGLQTx; Scrooge McDuck https://goo.gl/aKv8D7; Trash pile http://goo.gl/OsFfo3; Watercolor Map by Stamen Design; Yes GIF https://goo.gl/agvlAE
