what is big data discovery, and how it complements traditional business analytics
TRANSCRIPT
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or +61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : [email protected] W : www.rittmanmead.com
What is Big Data Discovery, and how it complements traditional Business Analytics
Mark Rittman, CTO, Rittman Mead
November 2015
About the Speaker
•Mark Rittman, Co-Founder of Rittman Mead
•Oracle ACE Director, specialising in Oracle BI&DW
•14 Years Experience with Oracle Technology
•Regular columnist for Oracle Magazine
•Author of two Oracle Press Oracle BI books
•Oracle Business Intelligence Developers Guide
•Oracle Exalytics Revealed
•Writer for Rittman Mead Blog : http://www.rittmanmead.com/blog
•Email : [email protected]
•Twitter : @markrittman
About Rittman Mead
•Oracle BI and DW Gold partner
•Winner of five UKOUG Partner of the Year awards in 2013 - including BI
•World leading specialist partner for technical excellence, solutions delivery and innovation in Oracle BI
•Approximately 80 consultants worldwide
•All expert in Oracle BI and DW
•Offices in US (Atlanta), Europe, Australia and India
•Skills in broad range of supporting Oracle tools:
  ‣OBIEE, OBIA, ODIEE
  ‣Big Data, Hadoop, NoSQL & Big Data Discovery
  ‣Essbase, Oracle OLAP
  ‣GoldenGate
  ‣Endeca
Many Organisations are Running Big Data Initiatives
•Many customers and organisations are running initiatives around “big data”
•Some are IT-led and are looking for cost-savings around data warehouse storage + ETL
•Others are “skunkworks” projects in the marketing department that are now scaling-up
•Projects now emerging from pilot exercises
•And design patterns starting to emerge
Common Big Data Design Pattern : “Data Reservoir”
•Typical implementation of Hadoop and big data in an analytic context is the “data lake”
•Additional data storage platform with cheap storage, flexible schema support + compute
•Data lands in the data lake or reservoir in raw form, then minimally processed
•Data then accessed directly by “data scientists”, or processed further into DW
[Diagram: Hadoop data reservoir (online, scalable, flexible, cost effective) alongside the data warehouse, marts and business intelligence]
Key Differentiator from DW Storage : Flexible Cheap Storage
•HDFS storage and “schema-on-read” technology gives you ability to store data in “raw” form
  ‣Cheap expandable storage makes it feasible to retain all incoming data at detail level
  ‣Schema-on-read technology gives you ability to apply formats and schema on-demand
•A useful complement to the curated, structured and focused datasets in the DW
•Underlying Hadoop platform supports compute and analysis on these raw datasets
•But how do you make sense of what’s in there?
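The schema-on-read idea on this slide can be sketched in a few lines of Python. This is only an illustration, not how Hive or HDFS implement it: the raw records, the `PAGEVIEW_SCHEMA` mapping and the `read_with_schema` helper are all hypothetical names made up for the example. The point is that the raw lines are stored untouched and a schema is only applied when they are read.

```python
import json

# Hypothetical raw events, kept exactly as they arrived (nothing enforced on write).
RAW_LINES = [
    '{"host": "10.0.0.1", "page": "/blog/obiee", "bytes": "5120"}',
    '{"host": "10.0.0.2", "page": "/blog/bdd", "referrer": "twitter.com"}',
]

# A schema chosen at read time: attribute name -> conversion function.
PAGEVIEW_SCHEMA = {"host": str, "page": str, "bytes": int}


def read_with_schema(lines, schema):
    """Project each raw record onto the schema; missing attributes become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for name, convert in schema.items():
            value = record.get(name)
            row[name] = convert(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, PAGEVIEW_SCHEMA)
```

A different schema (for example one that includes `referrer`) could be applied to the same raw lines later without reloading them.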
Data Integration and BI Tools can Access Data Reservoir
•Data integration tools such as Oracle Data Integrator can now load and process Hadoop data
•BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
•Great for answering questions you know about
•Fantastic for navigating hierarchies and drilling
•Superb when you understand the data and the value that’s within it
  ‣But what if you don’t yet?
Access Hive directly, or extract using ODI12c, for structured OBIEE dashboard analysis
What pages are people visiting? Who is referring to us on Twitter? What content has the most reach?
Different Skills and Techniques Required for Raw Datasets
But … Getting Value and Information Out of Hadoop is Hard
•Specialist skills typically needed to ingest and understand data coming into Hadoop
•Data loaded into the reservoir needs preparation and curation before presenting to users
•But we’ve heard a similar story before, from the world of BI…
Tool Complexity
•Early Hadoop tools only for experts
•Existing BI tools not designed for Hadoop
•Emerging solutions lack broad capabilities
80% effort typically spent on evaluating and preparing data
Data Uncertainty
•Not familiar and overwhelming
•Potential value not obvious
•Requires significant manipulation
Overly dependent on scarce and highly skilled resources
Back to 2012…
What Was Oracle Endeca Information Discovery?
•Part of the acquisition of Endeca back in 2012 by Oracle Corporation
•Based on search technology and concept of “faceted search”
•Data stored in flexible NoSQL-style in-memory database called “Endeca Server”
•Added aggregation, text analytics and text enrichment features for “data discovery”
  ‣Explore data in raw form, loose connections, navigate via search rather than hierarchies
  ‣Useful to find out what is relevant and valuable in a dataset before formal modeling
Endeca Server Technology Combined Search + Analytics
•Proprietary database engine focused on search and analytics
•Data organized as records, made up of attributes stored as key/value pairs
•No over-arching schema, no tables, self-describing attributes
•Endeca Server hallmarks:
  ‣Minimal upfront design
  ‣Support for “jagged” data
  ‣Administered via web service calls
  ‣“No data left behind”
  ‣“Load and Go”
•But … limited in scale (>1m records)
  ‣… what if it could be rebuilt on Hadoop?
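To make the "records of key/value attributes" and "jagged data" ideas concrete, here is a small Python sketch. It is not the Endeca Server API; the sample records and the `facet_counts` and `refine` helpers are hypothetical and only illustrate how faceted navigation can work when records do not share a schema.

```python
from collections import Counter

# Hypothetical "jagged" records: each one carries whatever attributes it has.
RECORDS = [
    {"author": "Mark Rittman", "category": "Big Data", "keywords": ["hadoop", "bdd"]},
    {"author": "Mark Rittman", "category": "OBIEE"},
    {"title": "Untitled draft", "keywords": ["obiee"]},
]


def _values(record, attribute):
    """Return the attribute's values as a list (empty if the attribute is absent)."""
    value = record.get(attribute)
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def facet_counts(records, attribute):
    """Count how many records carry each value of an attribute (one facet)."""
    counts = Counter()
    for record in records:
        counts.update(_values(record, attribute))
    return counts


def refine(records, attribute, wanted):
    """Keep only the records whose attribute includes the selected value."""
    return [r for r in records if wanted in _values(r, attribute)]


by_author = facet_counts(RECORDS, "author")
mark_posts = refine(RECORDS, "author", "Mark Rittman")
```

Records without the attribute are simply skipped rather than rejected, which is the sense in which no up-front schema is needed.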
Oracle Big Data Discovery
•A visual front-end to the Hadoop data reservoir, providing end-user access to datasets
•Catalog, profile, analyse and combine schema-on-read datasets across the Hadoop cluster
•Visualize and search datasets to gain insights, potentially load in summary form into DW
Typical Big Data Discovery Use-Cases
•Provide a visual catalog and search function across data in the data reservoir
•Profile and understand data, relationships, data quality issues
•Apply simple changes, enrichment to incoming data
•Visualize datasets including combinations (joins)
  ‣Bring in additional data from JDBC + file sources
•Prepare more structured Hadoop datasets for use with OBIEE
BDD for Reporting and Cataloging the Data Reservoir
•Very good tool for analysing and cataloging the data in the production data reservoir
•Basic BI-type reporting against datasets, joined together, filtered, transformed etc
•Cross-functional and cross-subject area analysis and visualization
•Great complement to OBIEE
  ‣Able to easily access semi-structured data
  ‣Familiar BI-type graph components
  ‣Ideal way to give analysts access to semi-structured customer datasets
Comparing BDD to Oracle Visual Analyzer
•Visual Analyzer also provides a form of “data discovery” for BI users
  ‣Similar to Tableau, Qlikview etc
  ‣Inspired by BI elements of OEID
•Uses OBIEE RPD as the primary datasource, so data needs to be curated + structured
•Probably a better option for users who aren’t concerned that it’s “big data”
•But can still connect to Hadoop via Hive, Impala and Oracle Big Data SQL
Managed vs. Free-Form Data Discovery
•Data in the data reservoir typically is raw, hasn’t been organised into facts, dimensions yet
•In this initial phase, you don’t want it to be - too much up-front work with unknown data
•Later on though, users will benefit from structure and hierarchies being added to data
•But this takes work, and you need to understand cost/benefit of doing it now vs. later
Supporting the Transition From Data Lab to Data Factory
•Most data science and big data projects start small, using R and a single data scientist
•Big Data Discovery helps open up data science and the data reservoir to the enterprise
[Diagram: transition from Data Lab to Data Factory, with the labels Invent, Commercialize, Research and Develop]
Enabling Hadoop Datasets for Access by Big Data Discovery
•Relies on datasets in Hadoop being registered with Hive Catalog
•Presents semi-structured and other datasets as tables, columns
•Hive SerDe and Storage Handler technologies allow Hive to run over most datasets
•Hive tables need to be defined before dataset can be used by BDD
CREATE EXTERNAL TABLE apachelog_parsed (
  host STRING,
  identity STRING,
  …
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE
LOCATION '/user/flume/rm_website_logs';
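To see what a regular-expression SerDe of this kind does to each log line, the same style of pattern can be tried out in Python. This is a sketch under assumptions: the sample log line is made up, the pattern is written in Python syntax rather than Hive's escaped form, and the field names beyond host, identity and agent are the conventional Apache combined-log names rather than anything taken from the slide.

```python
import re

# One capture group per column, in the style of the Hive RegexSerDe pattern.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|"[^"]*") ([^ "]*|"[^"]*"))?'
)

# Assumed column names (only host, identity and agent appear in the slide).
FIELDS = ["host", "identity", "user", "time", "request",
          "status", "size", "referer", "agent"]

# Hypothetical log line in Apache combined format.
SAMPLE = ('192.0.2.10 - - [16/Nov/2015:10:15:32 +0000] '
          '"GET /blog/ HTTP/1.1" 200 5120 "http://example.com/" "Mozilla/5.0"')


def parse_line(line):
    """Return a dict of column values, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


parsed = parse_line(SAMPLE)
```

As with the SerDe, every value comes back as a string (including the surrounding quotes and brackets), so type conversion is a later step.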
Today’s Layered Data Warehouse Architecture
[Architecture diagram, labels as recovered]
Data Sources
•Data Engines & Poly-structured sources
•Content: Docs, Web & Social Media, SMS
•Structured Data Sources: Operational Data, COTS Data, Master & Ref. Data, Streaming & BAM
Data Ingestion
Information Interpretation
•Raw Data Reservoir: Immutable raw data reservoir. Raw data at rest is not interpreted
•Foundation Data Layer: Immutable modelled data. Business Process Neutral form. Abstracted from business process changes
•Access & Performance Layer: Past, current and future interpretation of enterprise data. Structured to support agile access & navigation
•Discovery Lab Sandboxes: Project based data stores to support specific discovery objectives
•Rapid Development Sandboxes: Project based data stores to facilitate rapid content / presentation delivery
Virtualization & Query Federation
Information Services
•Enterprise Performance Management
•Pre-built & Ad-hoc BI Assets
•Data Science
Data Reservoir within the Big Data Management Platform
Oracle Big Data Discovery Architecture
•Adds additional nodes into the CDH5.3 cluster, for running DGraph and Studio
•DGraph engine based on Endeca Server technology, can also be clustered
•Hive (HCatalog) used for reading table metadata, mapping back to underlying HDFS files
•Apache Spark then used to upload (ingest) data into DGraph, typically 1m row sample
•Data then held for online analysis in DGraph
•Option to write-back transformations to underlying Hive/HDFS files using Apache Spark
Ingesting & Sampling Datasets for the DGraph Engine
•Datasets in Hive have to be ingested into DGraph engine before analysis, transformation
•Can either define an automatic Hive table detector process, or manually upload
•Typically ingests 1m row random sample
  ‣1m row sample provides > 99% confidence that answer is within 2% of value shown, no matter how big the full dataset (1m, 1b, 1q+)
  ‣Makes interactivity cheap - representative dataset
[Chart: cost and accuracy against amount of data queried, highlighting “the 100% premium”]
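The sampling claim can be sanity-checked with a standard margin-of-error calculation. The sketch below assumes a simple random sample and the normal approximation for an estimated proportion, which may not be exactly how the figure on the slide was derived; the `margin_of_error` function is a hypothetical helper written for this example.

```python
import math
from statistics import NormalDist


def margin_of_error(sample_size, confidence=0.99, proportion=0.5):
    """Half-width of the confidence interval for a proportion.

    proportion=0.5 is the worst case (widest interval). The size of the
    full dataset does not appear in the formula at all.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


moe = margin_of_error(1_000_000)
```

Under these assumptions the worst-case margin for a 1m row sample is roughly 0.13 percentage points at 99% confidence, comfortably inside the 2% quoted, and it stays the same whether the full dataset holds a million rows or a billion.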
Search, Catalog, Join and Analyse Data Using BDD Studio
•Big Data Discovery Studio then provides web-based tools for organising, searching datasets
•Visualize Data, Transform and Export back to Hadoop
Example Scenario : Social Media Analysis
•Rittman Mead want to understand drivers and audience for their website
  ‣What is our most popular content? Who are the most in-demand blog authors?
  ‣Who are the influencers? What do they read?
•Three data sources in scope:
  ‣RM Website Logs
  ‣Twitter Stream
  ‣Website Posts, Comments etc
Objective of Exercises
•Enrich and transform incoming schema-on-read social media data to add value
•Do some initial profiling, analysis to understand datasets
•Provide a user interface for the Big Data team
•Provide a means to identify candidate dimensions and facts for a more structured (managed) data discovery tool like Visual Analyzer
Combine with site content, semantics, text enrichment
Catalog and explore using Oracle Big Data Discovery
Data Sources used for Data Discovery Exercise
[Diagram: Cloudera CDH5.3 BDA Hadoop cluster, shown as three nodes each running Spark, Hive, HDFS and BDD Data Processing, connected to a BDD node running the DGraph Gateway, the Studio web UI, and Hive and HDFS clients]
•Ingest semi-processed logs (1m rows)
•Ingest processed Twitter activity
•Upload site page and comment contents
•Persist uploaded DGraph content in Hive / HDFS
•Write-back transformations to full datasets
•Data discovery using Studio web-based app
Ingesting Site Activity and Tweet Data into DGraph
•Tweets and Website Log Activity stored already in data reservoir as Hive tables
•Upload triggered by manual call to BDD Data Processing CLI
  ‣Runs Oozie job in the background to profile, enrich and then ingest data into DGraph

[oracle@bddnode1 ~]$ cd /home/oracle/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bddnode1 edp_cli]$ ./data_processing_CLI -t rm_linked_tweets

[Diagram: Hive (pageviews, X rows) to Apache Spark (pageviews >1m rows; Profiling; Enrichment) to BDD (pageviews >1m rows), steps 1 to 3]

{
  "@class" : "com.oracle.endeca.pdi.client.config.workflow.ProvisionDataSetFromHiveConfig",
  "hiveTableName" : "rm_linked_tweets",
  "hiveDatabaseName" : "default",
  "newCollectionName" : "edp_cli_edp_a5dbdb38-b065…",
  "runEnrichment" : true,
  "maxRecordsForNewDataSet" : 1000000,
  "languageOverride" : "unknown"
}
Ingesting Site Activity and Tweet Data into DGraph
•Two output datasets from ODI process have to be ingested into DGraph engine
•Upload triggered by manual call to BDD Data Processing CLI
  ‣Runs Oozie job in the background to profile, enrich and then ingest data into DGraph
Ingesting and Sampling Hive Data into Big Data Discovery
[oracle@bigdatalite ~]$ cd /home/oracle/movie/Middleware/BDD1.0/dataprocessing/edp_cli
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t access_per_post_cat_author
[oracle@bigdatalite edp_cli]$ ./data_processing_CLI -t rm_linked_tweets
View Ingested Datasets, Create New Project
•Ingested datasets are now visible in Big Data Discovery Studio
•Create new project from first dataset, then add second
Automatic Enrichment of Ingested Datasets
•Ingestion process has automatically geo-coded host IP addresses
•Other automatic enrichments run after initial discovery step, based on datatypes, content
Initial Data Exploration On Uploaded Dataset Attributes
•For the ACCESS_PER_POST_CAT_AUTHORS dataset, 18 attributes now available
•Combination of original attributes, and derived attributes added by enrichment process
Explore Attribute Values, Distribution using Scratchpad
•Click on individual attributes to view more details about them
•Add to scratchpad, automatically selects most relevant data visualisation
Filter (Refine) Visualizations in Scratchpad
•Click on the Filter button to display a refinement list
Display Refined Data Visualization
•Select refinement (filter) values from refinement pane
•Visualization in scratchpad now filtered by that attribute
  ‣Repeat to filter by multiple attribute values
Save Scratchpad Visualization to Discovery Page
•For visualisations you want to keep, you can add them to Discovery page
•Dashboard / faceted search part of BDD Studio - we’ll see more later
Select Multiple Attributes for Same Visualization
•Select AUTHOR attribute, see initial ordered values, distribution
•Add attribute POST_DATE
  ‣choose between multiple instances of first attribute split by second
  ‣or one visualisation with multiple series
Data Transformation & Enrichment
•Data ingest process automatically applies some enrichments - geocoding etc
•Can apply others from Transformation page - simple transformations & Groovy expressions
Standard Transformations - Simple & Using Editor
•Group and bin attribute values; filter on attribute values, etc
•Use Transformation Editor for custom transformations (Groovy, incl. enrichment functions)
Datatype Conversion Example : String to Date / Time
•Datatypes can be converted into other datatypes, with data transformed if required
•Example : convert Apache Combined Format Log date/time to Java date/time
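The conversion on this slide is done in BDD Studio with a Groovy transformation that is not reproduced in the transcript. As a language-neutral illustration of the same idea, the Python sketch below parses an Apache combined-log timestamp string into a proper date/time value. The sample value and the `parse_apache_time` helper are assumptions made for this example.

```python
from datetime import datetime

# Apache combined log timestamps look like: [16/Nov/2015:10:15:32 +0000]
APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_time(value):
    """Convert a log timestamp string (with or without brackets) to a datetime.

    Note: %b matches English month abbreviations under the default C locale.
    """
    return datetime.strptime(value.strip("[]"), APACHE_TIME_FORMAT)


ts = parse_apache_time("[16/Nov/2015:10:15:32 +0000]")
```

Once the value is a real date/time rather than a string, it can be sorted, binned by day or hour, and used on a time axis.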
Transformations using Text Enrichment / Parsing
•Uses Salience text engine under the covers
•Extract terms, sentiment, noun groups, positive / negative words etc
Create New Attribute using Derived (Transformed) Values
•Choose option to Create New Attribute, to add derived attribute to dataset
•Preview changes, then save to transformation script
Commit Transforms to DGraph, or Create New Hive Table
•Transformation changes have to be committed to DGraph sample of dataset
  ‣Project transformations kept separate from other project copies of dataset
•Transformations can also be applied to full dataset, using Apache Spark
  ‣Creates new Hive table of complete dataset
Upload Additional Datasets
•Users can upload their own datasets into BDD, from MS Excel or CSV file
•Uploaded data is first loaded into Hive table, then sampled/ingested as normal
Join Datasets On Common Attributes
•Used to create a dataset based on the intersection (typically) of two datasets
•Not required to just view two or more datasets together - think of this as a JOIN and SELECT
Join Example : Add Post + Author Details to Tweet URL
•Tweets ingested into data reservoir can reference a page URL
•Site Content dataset contains title, content, keywords etc for RM website pages
•We would like to add these details to the tweets where an RM web page was mentioned
  ‣And also add page author details missing from the site contents upload
Main “driving” dataset: contains tweet user details, tweet text, hashtags, URL referenced, location of tweeter etc
Contains full details of each site page, including URL, title, content, category
Join on URL referenced in tweet
Contains the post author details missing from the Site Content dataset
Join on internal Page ID
Select From Multiple Visualisation Types
Multi-Dataset Join Step 1 : Join Site Contents to Posts
•Site contents dataset needs to gain access to the page author attribute only found in Posts
•Create join in the Dataset Relationships panel, using Post ID as the common attribute
•Join from Site contents to Posts, to create left-outer join from first to second table
Previews rows from the join, based on post_id = a (post_id column)
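The left-outer join described in this step can be sketched outside Studio as follows. This is plain Python, not what BDD runs internally: the sample rows, attribute values and the `left_outer_join` helper are hypothetical, and the join key is assumed to be unique in the Posts dataset.

```python
# Hypothetical sample rows; only post_id and the author attribute come from the slide.
SITE_CONTENTS = [
    {"post_id": 1, "title": "Post A"},
    {"post_id": 2, "title": "Post B"},
]
POSTS = [
    {"post_id": 1, "author": "Mark Rittman"},
]


def left_outer_join(left, right, key):
    """Keep every left row; add right-side attributes where the key matches."""
    right_by_key = {row[key]: row for row in right}  # assumes key is unique on the right
    right_columns = sorted({name for row in right for name in row if name != key})
    joined = []
    for row in left:
        merged = dict(row)
        match = right_by_key.get(row[key])
        for name in right_columns:
            # Unmatched left rows still appear, with empty right-side attributes.
            merged[name] = match.get(name) if match else None
        joined.append(merged)
    return joined


joined = left_outer_join(SITE_CONTENTS, POSTS, "post_id")
```

Every site page survives the join; pages with no matching post simply carry an empty author.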
Multi-Dataset Join Step 2 : Standardise URL Formats
•URLs in Twitter dataset have trailing '/', whereas URLs in RM site data do not
•Use the Transformation feature in Studio to add trailing '/' to RM site URLs
•Select option to replace the current URL values and overwrite within project dataset
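The standardisation step amounts to a one-line transformation. The slide applies it through the Studio Transformation feature; the Python sketch below only illustrates the logic, using a hypothetical `add_trailing_slash` helper and made-up example URLs.

```python
def add_trailing_slash(url):
    """Append '/' unless the URL already ends with one (safe to apply twice)."""
    return url if url.endswith("/") else url + "/"


# Hypothetical values standing in for the site and tweet URL attributes.
site_url = "http://www.example.com/2015/11/some-post"
tweet_url = "http://www.example.com/2015/11/some-post/"

normalised = add_trailing_slash(site_url)
```

After normalisation the two attributes compare equal, so an exact-match join on URL finds the page.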
Multi-Dataset Join Step 3 : Join Tweets to Site Content
•Join on the standardised-format URL attributes in the two datasets
•Data view will now contain the page content and author for each tweet mentioning RM
Key BDD Studio Differentiator : Faceted Search Across Hadoop
•BDD Studio dashboards support faceted search across all attributes, refinements
•Auto-filter dashboard contents on selected attribute values - for data discovery
•Fast analysis and summarisation through Endeca Server technology
Further refinement on “OBIEE” in post keywords
Results now filtered on two refinements
Create Discovery Pages for Dataset Analysis
•Select from palette of visualisation components
•Select measures, attributes for display
Faceted Search Across Project
•Search on attribute values, text in attributes across all datasets
•Extracted keywords, free text field search
Key BDD Studio Differentiator : Faceted Search Across Hadoop
•BDD Studio dashboards support faceted search across all attributes, refinements
•Auto-filter dashboard contents on selected attribute values
•Fast analysis and summarisation through Endeca Server technology
“Mark Rittman” selected from Post Authors
Results filtered on selected refinement
Export Prepared Datasets Back to Hive, for OBIEE + VA
•Transformations within BDD Studio can then be used to create curated fact + dim Hive tables
•These can then be used as a more suitable dataset for OBIEE RPD + Visual Analyzer
•Or then exported into Exadata or Exalytics to combine with main DW datasets
Further Analyse in Visual Analyzer for Managed Dataset
•Users in Visual Analyzer then have a more structured dataset to use
•Data organised into dimensions, facts, hierarchies and attributes
•Can still access Hadoop directly through Impala or Big Data SQL
•Big Data Discovery though was key to initial understanding of data
… And Finally
•Additional Resources
•How to Contact Us
Additional Resources
•Articles on the Rittman Mead Blog
  ‣http://www.rittmanmead.com/category/oracle-big-data-appliance/
  ‣http://www.rittmanmead.com/category/big-data/
  ‣http://www.rittmanmead.com/category/oracle-big-data-discovery/
•Slides will be on the BI Forum USB sticks
•Rittman Mead offer consulting, training and managed services for Oracle Big Data
  ‣Oracle & Cloudera partners
  ‣http://www.rittmanmead.com/bigdata
Rittman Mead and Big Data on Oracle Case Study - Rittman Mead
Mark Rittman, CTO, Rittman Mead
March 2015