Sydney Spark Meetup - September 2015
Apache Spark
JDBC Extraction
Sandbox Data Discovery
Who am I
• Works for Servian as a managing consultant
• Specializes in data warehousing and business intelligence
– Informatica, SSIS, MicroStrategy, Cognos, OBIEE, Tableau etc.
• Knows how to code
– Java, .NET, C/C++, Python, Scala etc.
• Evangelizing Spark within Servian and our clients
• Certified Spark Developer
Spark at Servian
• 2 Full Day Trainings (Core, SQL, Streaming, MLlib, GraphX)
• 30+ Consultants trained in Spark and Scala
• Certified Spark Developers
• Various other clients on different big data technologies
• Accelerator Solution for Data Warehousing using Spark
• Big Data Hackathon (6 teams use Spark)
Data Warehousing
• Dealing with Files (60%+) – Extraction is done for you
• Dealing with RDBMS (35%+) – You need to do extraction
• Data Discovery (Sandbox Area)
– Visual Discovery
– Other Analytical Discovery (Machine Learning/Graph)
– Little to no logic binding (Source Aligned)
JDBC from Spark
• DataFrame API from version 1.3+
• JDBC Connection String
• Table Name
• Partition By Column
• Lower Bound
• Upper Bound
• Number of Partitions
• What about some certainty in Lower/Upper Bound? – Use Modulo
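A minimal Python sketch of how the inputs listed above (partition column, lower bound, upper bound, number of partitions) could become one WHERE clause per partition. It approximates what Spark's JDBC source does rather than reproducing it; `range_predicates` and the column name `id` are hypothetical. The second call illustrates the modulo idea: partitioning on `MOD(id, 4)` makes the bounds known in advance (0 and 4) whatever the real range of `id` is. The spelling of `MOD` varies by database.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Split [lower, upper) into one WHERE clause per partition.

    The first and last clauses are open-ended, so rows outside the
    bounds are still read: the bounds shape the split, they do not filter.
    """
    if num_partitions <= 1 or lower >= upper:
        return ["1=1"]  # a single partition reads everything
    stride = (upper - lower) // num_partitions or 1
    predicates = []
    current = lower
    for i in range(num_partitions):
        # No lower limit on the first partition, no upper limit on the last.
        low_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        high_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if low_clause and high_clause:
            predicates.append(f"{low_clause} AND {high_clause}")
        elif high_clause:
            # NULLs would match no range, so the first partition collects them.
            predicates.append(f"{high_clause} OR {column} IS NULL")
        else:
            predicates.append(low_clause)
    return predicates


# Plain range split: the caller has to know (or guess) the bounds of id.
by_range = range_predicates("id", 0, 100, 4)

# Modulo split: the bounds are always 0 and the number of partitions.
by_modulo = range_predicates("MOD(id, 4)", 0, 4, 4)

for clause in by_range + by_modulo:
    print(clause)
```

From Spark 1.4, PySpark's `DataFrameReader.jdbc` also accepts a list of such clauses through its `predicates` argument, which is one way a hand-built split like this could be used.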
JDBC from Spark – Take 2
• Partition by different mechanism?
• Transform into integer
• Custom Query
• select * from [tablename] where [partition_column] between [_low] and [_hi]
• [tablename] can be a subquery
• [partition_column] can have expressions
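The custom-query approach above can be sketched in plain Python: fill the query template once per partition, with a subquery standing in for the table name and an expression that turns a date into an integer as the partition column. Everything named here (`partition_queries`, the `orders` table, `order_date`, `year(...)`) is hypothetical, and SQL function names differ between databases. Because BETWEEN is inclusive at both ends, the sketch makes each range start one past the previous range's end so no row is read twice.

```python
def partition_queries(table, partition_expr, low, high, num_partitions):
    """Build one query per partition over the inclusive range [low, high].

    `table` may be a table name or a parenthesised subquery with an alias;
    `partition_expr` may be a column or any integer-valued SQL expression.
    """
    total = high - low + 1
    base, extra = divmod(total, num_partitions)
    queries = []
    start = low
    for i in range(num_partitions):
        # Spread any remainder over the first `extra` partitions.
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more partitions requested than distinct values
        end = start + size - 1
        queries.append(
            f"select * from {table} "
            f"where {partition_expr} between {start} and {end}"
        )
        start = end + 1  # keep the inclusive ranges from overlapping
    return queries


# Subquery as the "table", date-to-integer expression as the partition column.
queries = partition_queries(
    "(select * from orders where status = 'OPEN') t",
    "year(order_date)",
    2010,
    2015,
    3,
)
for query in queries:
    print(query)
```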
JDBC from Spark – Take 3
• Have lots of datasets
• Parallel data landing
• Various relational data sources
• Want to quickly run discovery (machine learning/graph)
• Benefit from columnar format compression and performance
• Quickly build a source aligned ODS on Hadoop
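One possible shape for parallel data landing, sketched in plain Python with a thread pool. This is an assumption about how the goals above might be met, not the speaker's implementation. `land_all`, `fake_land`, the dataset names and the `/landing/...` paths are all hypothetical; in a real job the per-dataset function would read the source over JDBC and write the result as Parquet.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def land_all(datasets, land_one, max_workers=4):
    """Call land_one(name) for every dataset concurrently.

    Returns {name: result}, or {name: exception} for a dataset that
    failed, so one bad source does not stop the others from landing.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(land_one, name): name for name in datasets}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                results[name] = exc
    return results


def fake_land(name):
    """Stand-in for a real extract: returns the path it would have written."""
    if name == "broken_table":
        raise ValueError(f"cannot connect to source of {name}")
    return f"/landing/{name}.parquet"


results = land_all(["customers", "orders", "broken_table"], fake_land)
for name in sorted(results):
    print(name, "->", results[name])
```

Threads are enough here because each call mostly waits on the database and the cluster; the driver only coordinates.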
Next Generation Warehousing
[Architecture diagram. Labels recoverable from the slide:]
• Sources: Data Sources, Files, Structured Data Sources, Unstructured, Semi-Structured
• Frameworks: Extraction Framework, Ingestion/Streaming Framework, Time Variant Load Framework, Analytics Framework
• Storage: Distributed Hierarchical Storage (HDFS + Tachyon), Raw DW/Lake
• Content: Structured Content (Parquet), Metadata (Parquet), Unstructured, Derivations, Aggregates (Parquet), Machine Learning Models (Parquet)
• Other labels: Time Variance, Batch Control, Real Time Engines, Data Marts, BI Tools, Visualizations
• Legend: Spark Components, Architecture Components, Data Flow (Directed)
Questions?
Thank you all!
• https://au.linkedin.com/in/andyhuangyh