making sense of big data with hadoop
Making Sense of
BIG DATA with Hadoop
© 2012 Pythian
● 13 years with a pager
● Oracle ACE Director
● Oak Table member
● Senior consultant for Pythian
● @gwenshap
● http://www.pythian.com/news/author/shapira/
● [email protected]
Pythian
Recognized Leader:
• Global industry-leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and Microsoft SQL Server
• Work with over 165 multinational companies such as LinkShare Corporation, IGN Entertainment, CrowdTwist, TinyCo and Western Union to help manage their complex IT deployments
Expertise:
• One of the world’s largest concentrations of dedicated, full-time DBA expertise. Employs 7 Oracle ACEs/ACE Directors. Heavily involved in the MySQL community, driving the MySQL Professionals Group and sitting on the IOUG Advisory Board for MySQL.
• Hold 7 Specializations under Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC
Global Reach & Scalability:
• 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response
What is Big Data?
MORE DATA THAN YOU CAN HANDLE
MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
Data Arriving at Fast Rates
Typically Unstructured
Stored without Aggregation
Analyzed in Real Time
For Reasonable Cost
Complex Data Architecture
Your Data is NOT as BIG as you think
Why Big Data?
Why Hadoop?
BECAUSE WE CAN
More Data Beats Smarter Algorithms
Email
Photos
Tweets
Job postings
Blog posts
Medical imaging
Sensors
Video
Tags
Scanned docs
Data is Messy
An Imperial College team found:
• 3,000 patients under 19 were treated in geriatric clinics
• Between 15,000 and 20,000 men have been admitted to obstetric wards
• And almost 10,000 to gynecology wards
http://www.straightstatistics.org/blog/2012/04/06/why-are-so-many-men-pregnant
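Findings like these are the kind of thing a simple plausibility check can surface. A minimal sketch, assuming hypothetical admission records with `age`, `sex` and `unit` fields (the field names and the sample records are invented for illustration and do not come from the study):

```python
def implausible_reasons(record):
    """Return the reasons, if any, why an admission record looks wrong."""
    reasons = []
    if record["unit"] == "geriatric" and record["age"] < 19:
        reasons.append("patient under 19 in a geriatric clinic")
    if record["unit"] in ("obstetric", "gynecology") and record["sex"] == "M":
        reasons.append("male patient admitted to " + record["unit"])
    return reasons

# Hypothetical sample records, made up for this example.
records = [
    {"age": 12, "sex": "F", "unit": "geriatric"},
    {"age": 40, "sex": "M", "unit": "obstetric"},
    {"age": 81, "sex": "F", "unit": "geriatric"},
    {"age": 35, "sex": "M", "unit": "gynecology"},
]

flagged = [r for r in records if implausible_reasons(r)]
for r in flagged:
    print(r, implausible_reasons(r))
```

Rules like these only flag records for review; they cannot say whether the age, the sex or the unit code is the field that was entered wrongly.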
Unstructured Eventually Structured
Data
Scalable Storage
+ Massive Parallel Processing
+ Reasonable Cost
Hadoop: Platform for distributed computing
Hadoop is Scalable. But not fast.
Much Ado about Hadoop
Assumptions
• Lots of data
• Large files
• Unstructured
• Scan entire files
• Unreliable hardware
• Adding servers = increased capacity
Principles
• Bring Code to Data
• Share Nothing
HDFS
• Distributed
• Replicated
• Big Files
• Write Once
• Read Entire File
/users/shapira/log-1, blocks {1,4,5}
/users/shapira/log-2, blocks {2,3,6}
[Diagram: blocks 1 through 6 spread across the data nodes, three copies of each block]
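The file-to-block listing above can be sketched in a few lines. This is a simplified illustration, not how HDFS actually chooses nodes: real placement is rack-aware, while the sketch below just assigns replicas round-robin. The node names are hypothetical, and the replication factor of 3 is an assumption based on each block number appearing three times on the slide.

```python
ASSUMED_REPLICATION = 3  # assumption: three copies of every block

def place_blocks(files, nodes, replication=ASSUMED_REPLICATION):
    """Assign every block of every file to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {node: [] for node in nodes}
    start = 0
    for blocks in files.values():
        for block in blocks:
            # Consecutive node indexes are distinct because replication <= len(nodes).
            for i in range(replication):
                node = nodes[(start + i) % len(nodes)]
                placement[node].append(block)
            start += 1
    return placement

files = {
    "/users/shapira/log-1": [1, 4, 5],
    "/users/shapira/log-2": [2, 3, 6],
}
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names

placement = place_blocks(files, nodes)
for node, blocks in placement.items():
    print(node, sorted(blocks))
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block it held.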
Hadoop Job
[Diagram: Start Job → parallel Map tasks → Combine → Reduce (optional) → Stop Job → Results]
Map Reduce
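The map, combine and reduce phases can be imitated in a single process to show how the data flows. This is a sketch of the programming model only, not the Hadoop API: the function names and the sample log lines are made up, and a real job would run each input split on the node that stores it.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit (log_level, 1) for one log line."""
    parts = line.split()
    if parts:
        yield parts[0], 1

def combine(pairs):
    """Combine: pre-sum counts locally, before anything crosses the network."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

def reduce_counts(key, values):
    """Reduce: sum all partial counts for one key."""
    return key, sum(values)

def run_job(splits):
    combined = []
    for split in splits:  # each split stands in for one Map task
        mapped = [pair for line in split for pair in map_line(line)]
        combined.extend(combine(mapped))
    groups = defaultdict(list)  # shuffle: group partial counts by key
    for key, value in combined:
        groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())

# Hypothetical log lines, two input splits.
splits = [
    ["ERROR disk full", "INFO job started", "ERROR disk full"],
    ["INFO job finished", "WARN slow node"],
]
result = run_job(splits)
print(result)
```

The combine step is optional: it changes how much data is shuffled, not the final counts.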
Implementation
• Balance disks, cores and RAM
• High Bandwidth
• More nodes or better nodes?
It’s about the Ecosystem
• Sqoop
• Flume
• Hive
• Pig
• HBase
Use Cases
Use Case: Log processing
Use Case: ETL
OLTP DWH
BI
Use Case: Recommendations
Use Case: Listening to the crowd
Our customers use Hadoop for:
• Storing lots of pre-processed data
• Merging different data types
• Scalable data processing
• Advanced data processing
Big Data in your Company
Easy case:
Your CTO heard about Big Data
And is eager to invest.
You have a Big Budget.
Require
Acquire
Organize
Analyze
Serve
Measure
Require
Hadoop, NoSQL, OLTP, RDBMS
Hadoop
BI, R
BI, NoSQL, Oracle
Measure
Data Scientist =
Sneaky BI
Disregards Silos
Cool Toys
Mining Tools:
• Machine Learning
• Cluster Detection
• Regression
• Graph Analysis
• Visualization
http://nicolasrapp.com/?p=1118
http://www.orgnet.com/slumlords.html
Want to do more with your data?
Don’t know where to start?
No budget?
No problem!
Sneak Hadoop into Your Business
• Find an important business problem
• Acquire data (be sneaky!)
• Get the tools: R, Hadoop, Tableau
• Laptops, desktops, test servers
• Analyze data
• Make pretty charts
• Get business used to it
• Wait for an Outage
• PROFIT!
Oracle Big Data
The “ETL Machine”
Hardware
18 servers
216 cores
864 GB RAM
648 TB disks
InfiniBand
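A quick back-of-the-envelope breakdown of these totals per server. The per-server figures follow directly from dividing the slide's numbers by 18. The usable-capacity line is an assumption: it supposes all of the disk is given to HDFS with three copies of every block, and ignores space needed for temporary data.

```python
servers = 18
totals = {"cores": 216, "ram_gb": 864, "disk_tb": 648}

# Divide each rack-level total evenly across the servers.
per_server = {name: total // servers for name, total in totals.items()}

ASSUMED_REPLICATION = 3  # assumption: three copies of each HDFS block
usable_disk_tb = totals["disk_tb"] // ASSUMED_REPLICATION

print(per_server)      # {'cores': 12, 'ram_gb': 48, 'disk_tb': 36}
print(usable_disk_tb)  # 216
```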
Software
Oracle NoSQL
Cloudera Hadoop Distribution
Oracle Loader for Hadoop
Data Integrator for Hadoop
Direct Connector for Hadoop
Oracle Connector for R
Cores, Storage, InfiniBand and Software
Make Oracle Big Data
The Ultimate ETL Machine
Thank you & Q&A
http://www.pythian.com/news/
http://www.facebook.com/pages/The-Pythian-Group/
http://twitter.com/pythian
http://www.linkedin.com/company/pythian
1-866-PYTHIAN
To contact us…
To follow us…