Hadoop Summit 2010 Keynote
![Page 1: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/1.jpg)
Hadoop
Trends, Opportunities, Challenges
Hemanth Yamijala
Committer, Hadoop
Technical Lead, Map/Reduce, Yahoo!
![Page 2: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/2.jpg)
What is Hadoop?
• Distributed computing framework
– Offers storage and batch processing for petabytes of data
– Very suitable for ad-hoc textual processing applications
• Components
– Hadoop Distributed File System
– Map/Reduce programming framework
• Apache Software Foundation project
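The Map/Reduce programming model named above can be illustrated with a minimal in-memory sketch. This is not Hadoop's API: the function names `map_words`, `shuffle`, `reduce_counts`, and `run_job` are hypothetical, and the whole job runs in one process, whereas Hadoop distributes the same three phases across a cluster.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the job the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: combine all values seen for one key into a single result."""
    return word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data", "hadoop processes data"]))
```

The programmer writes only the map and reduce functions; grouping by key, distribution, and retries are the framework's responsibility.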
![Page 3: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/3.jpg)
Hadoop on your Yahoo! page …
![Page 4: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/4.jpg)
Hadoop Adoption Trends - Yahoo!
• Runs the Yahoo! Distribution of Hadoop
• http://github.com/yahoo/hadoop
• 230 jobs/hour on average
• 4.38 Tb/hour of input, 936 Gb/hour of output
![Page 5: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/5.jpg)
Hadoop on your FB, Twitter pages
– Reporting, analytics, machine learning
• Amazon
– Hosted Hadoop on top of EC2 and S3
– Product search index
– Analytics, social network graphs
• AOL, Microsoft (PowerSet), IBM, …
• http://wiki.apache.org/hadoop/PoweredBy
![Page 6: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/6.jpg)
Support of a vibrant community
Hadoop contributions:
Core: HDFS, Map/Reduce; Non-core: sub-projects
Hadoop mailing list traffic
Cloudera Distribution of Hadoop – paid, supported service offering
from Cloudera
![Page 7: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/7.jpg)
Support from Academia, Research
• PSG Tech, Coimbatore
– Semantic search, information retrieval, scheduling, applications in molecular biology
– Deep dive on this later
• IIIT, Hyderabad
– Applications in Indian language content processing, scheduling
• IISc, Bangalore
– Modeling a simulator for Hadoop
• Many more – M45, OpenCirrus, …
![Page 8: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/8.jpg)
Hadoop – a RAD tool?
• Without Hadoop
– Build-out and maintenance of hardware
– Transfer, storage of data (deep dive on this later on)
– Handling failures, efficiency
• Enables rapid experimentation, iteration,
repeatability, low cost of failure
• Great Ecosystem: Streaming, Pig, Hive, HBase,
Oozie, Avro, …
![Page 9: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/9.jpg)
Technical focus areas at Yahoo!
• Security
– Kerberos based authentication
• Backwards Compatibility – 1.0
– APIs cannot be broken between major releases
– A new API in Map/Reduce that enables this
• Robustness
– Multiple bug fixes
– Map/Reduce framework refactoring for better
concurrency, simplifying control flow logic
![Page 10: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/10.jpg)
Technical focus areas at Yahoo!
• Append / Sync / Flush
– Until Hadoop 0.20, files were write once
– Append is going to open Hadoop up to more applications
• Efficiency in scheduling, data processing
– Task scheduling for better utilization, better
sharing policies
– Zero data copy – usage of direct I/O buffers
• Quality engineering
– Automated distributed system testing,
performance benchmarks (deep dive coming)
![Page 11: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/11.jpg)
Agenda for Hadoop Summit
• Lightning Talk by Hari Vasudev (VP Platform
Tech Group, Yahoo!)
• Data Management on Grid by Srikanth
Sundarrajan (Yahoo!)
• Machine Learning using Hadoop: Real Case
Study by Krishna Prasad Chitrapura (Yahoo!)
• Multiple Sequence Alignment using Hadoop
by Dr. Sudha Sadhasivam (PSG Tech,
Coimbatore)
![Page 12: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/12.jpg)
Agenda for Hadoop Summit
• Benchmarking and Optimizing Hadoop
deployments (benchmarking on HiBench) by Mukesh
Gangadhar (Intel)
• Challenges and Uniqueness of QE and RE processes in Hadoop
by Jayant Mahajan (Yahoo!)
• Tuning Hadoop to deliver performance to your application by
Srigurunath Chakravarthi (Yahoo!)
• Panel Discussion: Moderator: Basant Verma (Yahoo!);
Panelists: T. S. Mohan (Infosys), Sudha Sadhasivam (PSG Tech),
Chidambaran Kollengode (Yahoo!) & Jothi Padmanabhan
(Yahoo!)
• Yahoo! booth throughout the day: win cool prizes ☺
![Page 14: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/14.jpg)
Backup Slides
![Page 15: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/15.jpg)
Challenges for Yahoo!
• No longer just a wildly successful cool project!
– People are demanding we deliver!
• Production usage, availability, SLAs
– Jobs that MUST finish in 15 minutes, or revenue is
lost, and the time limits are going down
• Usability, Operability
• Scale, Performance
– Ever increasing demands mean we need larger
clusters, faster throughput
![Page 16: Hadoop Summit 2010 Keynote](https://reader033.vdocuments.mx/reader033/viewer/2022052904/557c2f65d8b42ad8478b4d07/html5/thumbnails/16.jpg)
Design considerations
• Cost Effectiveness
– Runs on commodity hardware, Linux
• Linear Scale
• Fault Tolerance
– Block replication, checksums
– Transparent monitoring and re-execution of tasks
• Efficiency
– Data locality
– Efficient resource usage
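The fault-tolerance bullets above (block replication, checksums) can be sketched roughly as follows. This is an illustration under stated assumptions, not HDFS code: `store_block` and `read_block` are hypothetical names, the replica count of 3 is an assumed default, CRC32 from Python's `zlib` stands in for whatever checksum the file system actually uses, and the replicas are kept in a list rather than on separate machines.

```python
import zlib


def store_block(data, replicas=3):
    """Return `replicas` copies of a block, each paired with its CRC32 checksum."""
    checksum = zlib.crc32(data)
    return [{"data": data, "checksum": checksum} for _ in range(replicas)]


def read_block(copies):
    """Return the first copy whose data still matches its stored checksum."""
    for copy in copies:
        if zlib.crc32(copy["data"]) == copy["checksum"]:
            return copy["data"]
    raise IOError("all replicas failed checksum verification")


if __name__ == "__main__":
    copies = store_block(b"hello hadoop")
    copies[0]["data"] = b"hellX hadoop"  # simulate corruption on one replica
    print(read_block(copies))            # falls back to an intact replica
```

A reader that hits a corrupt copy simply moves on to the next replica, so a single bad disk or node does not surface as an error to the application.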