challenges of visualizing and exploring big data big data visualizationchallenges.pdf · 4/23/2013...

22
Challenges of Visualizing and Exploring Big Data © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555 Will Gorman, Chief Architect, Pentaho @wpgorman April 23, 2013

Upload: others

Post on 26-May-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Challenges of Visualizing and Exploring Big Data

© 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Will Gorman, Chief Architect, Pentaho

@wpgorman

April 23, 2013

Page 2: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

About Me

• Started Career at GE Research • 6 years at Pentaho • Hobbies include LEGO

2 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 3: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

About Pentaho

Delivering the future of analytics: modern, unified data integration and business intelligence platform

• Full business analytics & data integration • Native integration into big data ecosystem • Embeddable, cloud-ready analytics

Open source development model enables fast and broad innovation Critical mass achieved:

• Over 1,200 commercial customers • Over 10,000 production deployments • Over 185 countries

One Download Every 30 Seconds

© 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555 3

Page 4: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Why?

• Traditional Approaches to OLAP don’t cut it in Big Data • There’s a need to analyze large amounts of data • The data isn’t in a form designed for traditional analysis • There’s an opportunity for software to solve the problem

4 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 5: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Traditional OLAP Stack

5 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Visualization

Storage

Query Processing

• MDX, SQL • MOLAP • ROLAP • MPP

• In-Memory • Columnar • Compression

• Crosstab, Bar, Line, Scatter • Drilldown, Drill Through • Filtering

Facts

Dim

Dim

Dim

Dim Dim

Page 6: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Challenges

• Big Data tends to be… – Nested – Schemaless – Unstructured – LARGE

• But we still want the types of experiences we had with regular

data… – Near real time analytical querying

6 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 7: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

MPP Databases

• Databases like Vertica and Greenplum give us a taste of Big Data – 10 to 100 times better performance than traditional systems

• Scale out architectures

– Address volume and velocity, but not variety and value

7 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 8: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Approach with Current Technologies

Hybrid Architecture • A combination of Data Integration and Analytics

8 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Marts

Adhoc

Warehouse

Data Lake

Data Sources

Page 9: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Pentaho Instaview

9 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Configuration • Kettle Input Dialogs • Parameterized Input

Auto-Modeling

Bulk Loading • Agile Mart Step

Tem

plat

e D

efin

ition

Visualization

Transformation Engine

Analyzer

Mondrian - ROLAP

Prompting Library

Agile BI Modeler

Spoon

Instaview Components

Powered by Existing Technologies

Mon

etD

B

Page 10: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Limitations of Current Approach

• Manually Intensive – Cleansing and Prep are done at a low level

• Analysis limited to traditional data sizes • Expensive

10 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 11: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

But folks are working to solve these problems

• Google Dremel / Big Query – “A letter from the future”

• Cloudera Impala • Berkeley’s Spark and Shark • Mapr’s Apache Drill • MetaMarket’s Druid • HortonWork’s Stinger Initiative

11 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 12: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

With Some Interesting Algorithms and Datastructures

• Parquet – Dremel-based Columnar format for Hadoop • Database Cracking – Continuous Physical Reorganization

12 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 13: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

What about MDX?

• MDX is useful for representing business questions

• Pentaho is investing in Mondrian, an Open Source ROLAP engine, to play nice with Big Data – Support for Impala and other big data query engines – Iterators over lists – Batch cell processing over recursion – Query planning for better native pushdown – Distributed Member Cache – Skunkworks: NOROLAP?

13 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

select {[Customer].[All Customers].Children} on columns, {[Measures].[Sales]} on rows

from [SalesFact] where {[Year].[2013]}

Page 14: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

But queries aren’t everything

Once these systems can provide the OLAP performance that we’ve been wanting, there are still some issues… • We still need to get data into these systems • We still need to explore and visualize the data

14 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

? ?

Page 15: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

How do you Deal with a Petabyte of Dirty Data?

• Cleanse and augment data as it flows into the system – Streaming and Messaging Systems – Storm, Spark, Kafka – Descriptive and Predictive Systems – Mahout, WEKA – ETL – Kettle – Mashups – Bring data together

• Post process data via Map Reduce • “Some Coding Required”

15 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 16: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Visualizing Big Data

• Some things never change… – We still want to see high level aggregations like we always have

• But new visualizations and interactions will unlock insights into these unique types of datasets

16 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 17: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

We Need More Flexibility

• Visualization libraries like D3 are becoming more prevalent over traditional charting packages

• These tools work a layer below traditional charting, where developers work with SVG primitives

• “Some Coding Required”

17 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 18: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

To Implement Unique Visualizations

• Graphs - Relationships • Heatmaps • Treemaps

18 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

• Powered by WebDetails C-Tools – Community Chart Components – Community Graphics Generator – Based on Protovis

Page 19: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

And to Enable Big Data Interactions

Traditional Exploration – Filter – Pivot – Drill Down

New Interactions

– Zoom – Tiling / Paging – Travel in Time – Google Motion Chart

19 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 20: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Join the Conversation…

irc.freenode.net ##pentaho http://www.github.com/pentaho http://forums.pentaho.com

20 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 21: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

Thank You

Join the conversation. You can find us on:

http://blog.pentaho.com @Pentaho Facebook.com/Pentaho Pentaho Business Analytics

© 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555

Page 22: Challenges of Visualizing and Exploring Big Data Big Data VisualizationChallenges.pdf · 4/23/2013  · How do you Deal with a Petabyte of Dirty Data? • Cleanse and augment data

References

D3 - http://d3js.org/ Data Lake – http://blog.pentaho.com/2010/10/15/pentaho-hadoop-and-data-lakes/ Database Cracking – http://pdf.aminer.org/000/094/728/database_cracking.pdf Dremel – http://research.google.com/pubs/pub36632.html Drill – http://incubator.apache.org/drill/ Druid - https://github.com/metamx/druid/wiki Google Big Query - https://developers.google.com/bigquery/ Google Motion Chart - https://developers.google.com/chart/interactive/docs/gallery/motionchart Impala - http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise-core/cloudera-enterprise-RTQ.html Parquet – http://parquet.io/ Spark - http://spark-project.org/ Stinger - http://hortonworks.com/blog/100x-faster-hive/ WebDetails C-Tools: http://www.webdetails.pt/ctools.html Will’s LEGO Models – http://www.battlebricks.com

22 © 2012, Pentaho. All Rights Reserved. pentaho.com. Worldwide +1 (866) 660-7555