20100128ebay

31
Thursday, January 28, 2010

Upload: jeff-hammerbacher

Post on 27-Jan-2015


Page 1: 20100128ebay

Thursday, January 28, 2010

Page 2: 20100128ebay

Hadoop, Cloudera, and eBay
Managing Petabytes with Open Source

Jeff Hammerbacher
Chief Scientist and Vice President of Products, Cloudera
January 28, 2010

Page 3: 20100128ebay

My Background
Thanks for Asking

[email protected]
▪ Studied Mathematics at Harvard
▪ Worked as a Quant on Wall Street
▪ Conceived, built, and led Data team at Facebook
  ▪ Nearly 30 amazing engineers and data scientists
  ▪ Several open source projects and research papers
▪ Founder of Cloudera
  ▪ Vice President of Products and Chief Scientist
▪ Also, check out the book “Beautiful Data”

Page 4: 20100128ebay

Presentation Outline
▪ What is Hadoop?
  ▪ HDFS and MapReduce
  ▪ Hive, Pig, Avro, ZooKeeper, HBase
▪ From Steve
  ▪ Why Hadoop?
  ▪ Hadoop for machine learning and modeling
  ▪ Other things I find interesting
▪ What we’re building at Cloudera
▪ Questions and Discussion

Page 5: 20100128ebay

What is Hadoop?
▪ Apache Software Foundation project, mostly written in Java
▪ Inspired by Google infrastructure
▪ Software for programming warehouse-scale computers (WSCs)
▪ Hundreds of production deployments
▪ Project structure
  ▪ Hadoop Distributed File System (HDFS)
  ▪ Hadoop MapReduce
  ▪ Hadoop Common
  ▪ Other subprojects
    ▪ Avro, HBase, Hive, Pig, ZooKeeper

Page 6: 20100128ebay

Anatomy of a Hadoop Cluster
▪ Commodity servers
  ▪ 1 RU, 2 x 4 core CPU, 8 GB RAM, 4 x 1 TB SATA, 2 x 1 gE NIC
  ▪ Or: 2 RU, 2 x 8 core CPU, 32 GB RAM, 12 x 1 TB SATA
▪ Typically arranged in 2 level architecture
▪ Inexpensive to acquire and maintain

ApacheCon US 2008

Commodity Hardware Cluster

• Typically in 2 level architecture
  – Nodes are commodity Linux PCs
  – 40 nodes/rack
  – Uplink from rack is 8 gigabit
  – Rack-internal is 1 gigabit all-to-all
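
As a rough illustration of what these rack figures imply, the sketch below works out the oversubscription ratio and the raw disk capacity of a single rack. It uses only the numbers quoted on the slide (40 nodes per rack, 1 gigabit per node inside the rack, an 8 gigabit uplink, 4 x 1 TB disks per node). The function names are made up for this example, and real clusters will differ.

```python
# Figures taken from the slide; function names are illustrative only.
NODES_PER_RACK = 40
NODE_LINK_GBIT = 1      # rack-internal bandwidth per node
RACK_UPLINK_GBIT = 8    # uplink from the rack to the second level
DISKS_PER_NODE = 4
DISK_TB = 1


def rack_oversubscription(nodes, node_link_gbit, uplink_gbit):
    """Ratio of aggregate node bandwidth to the rack's uplink bandwidth."""
    return nodes * node_link_gbit / uplink_gbit


def rack_raw_storage_tb(nodes, disks_per_node, disk_tb):
    """Raw (unreplicated) disk capacity of one rack, in TB."""
    return nodes * disks_per_node * disk_tb


oversub = rack_oversubscription(NODES_PER_RACK, NODE_LINK_GBIT, RACK_UPLINK_GBIT)
raw_tb = rack_raw_storage_tb(NODES_PER_RACK, DISKS_PER_NODE, DISK_TB)

print(f"oversubscription: {oversub:.0f}:1")
print(f"raw storage per rack: {raw_tb} TB")
```

With these numbers, traffic leaving a rack competes for about one fifth of the bandwidth available inside it, which is one reason to keep computation close to the data.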

Page 7: 20100128ebay

HDFS
▪ Pool commodity servers into a single hierarchical namespace
▪ Break files into 128 MB blocks and replicate blocks
▪ Designed for large files written once but read many times
  ▪ Files are append-only via a single writer
▪ Two major daemons: NameNode and DataNode
  ▪ NameNode manages file system metadata
  ▪ DataNode manages data using local filesystem
▪ HDFS manages checksumming, replication, and compression
▪ Throughput scales nearly linearly with node cluster size
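
The block-and-replica idea can be sketched in a few lines of Python. This is a toy model, not HDFS code. The 128 MB block size comes from the slide. The replication factor of 3, the server names, and the round-robin placement are assumptions made for illustration, and the real NameNode uses a rack-aware placement policy instead.

```python
from itertools import combinations

BLOCK_SIZE_MB = 128  # from the slide
REPLICATION = 3      # assumption: a typical HDFS replication factor


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block a file would be broken into."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def place_replicas(num_blocks, servers, replication=REPLICATION):
    """Toy placement: block b goes to `replication` consecutive servers."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        b: [servers[(b + r) % len(servers)] for r in range(replication)]
        for b in range(num_blocks)
    }


def survives(placement, failed):
    """True if every block still has a replica on some live server."""
    return all(
        any(server not in failed for server in replicas)
        for replicas in placement.values()
    )


blocks = split_into_blocks(600)  # a hypothetical 600 MB file
servers = ["dn1", "dn2", "dn3", "dn4", "dn5"]  # hypothetical DataNodes
placement = place_replicas(len(blocks), servers)

# With three replicas per block, losing any two servers loses no block.
all_two_failures_ok = all(
    survives(placement, set(pair)) for pair in combinations(servers, 2)
)
print(blocks, all_two_failures_ok)
```

Losing three servers can still lose a block in this model, which is why a real system re-replicates missing blocks as soon as it notices a failure.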

Page 8: 20100128ebay

HDFS
HDFS distributes file blocks among servers

What is Hadoop? Big Data in the Enterprise

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally and survive the failure of significant parts of the storage infrastructure without losing data.

Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster.

HDFS manages storage on the cluster by breaking incoming files into pieces, called “blocks,” and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each piece to three different servers:

Figure 1: HDFS distributes file blocks among servers

HDFS has several useful features. In the very simple example shown, any two servers can fail, and the entire file will still be available. HDFS notices when a block or a node is lost, and creates a new copy of missing data from the replicas it [...]

Major Internet properties like Google, Amazon, Facebook and Yahoo! have pioneered the use of networks of inexpensive computers for large-scale data storage and processing. HDFS uses these techniques to store enterprise data.

Page 9: 20100128ebay

Hadoop MapReduce
▪ Fault tolerant execution layer and API for parallel data processing
▪ Can target multiple storage systems
▪ Key/value data model
▪ Two major daemons: JobTracker and TaskTracker
▪ Many client interfaces
  ▪ Java
  ▪ C++
  ▪ Streaming
  ▪ Pig
  ▪ SQL (Hive)
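
To make the key/value data model concrete, here is a minimal in-memory sketch of the map, shuffle, and reduce steps, using word count as the job. It runs in a single Python process and does not use any Hadoop API. The helper names (`run_job`, `wc_mapper`, `wc_reducer`) are invented for this example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every (key, value) input record."""
    for key, value in records:
        yield from mapper(key, value)


def shuffle(pairs):
    """Group intermediate values by key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    for key, values in groups:
        yield from reducer(key, values)


def run_job(records, mapper, reducer):
    return list(reduce_phase(shuffle(map_phase(records, mapper)), reducer))


def wc_mapper(offset, line):
    # Input key is the line's byte offset (unused); value is the line text.
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    yield word, sum(counts)


lines = [(0, "hadoop stores data"), (19, "hadoop processes data")]
result = run_job(lines, wc_mapper, wc_reducer)
print(result)
```

In a real cluster the map and reduce calls would run as separate tasks on many machines, and the shuffle would move data over the network. The shape of the computation is the same.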

Page 10: 20100128ebay

MapReduce
MapReduce pushes work out to the data

What is Hadoop? Big Data in the Enterprise

Figure 2: Hadoop pushes work out to the data

Hadoop takes advantage of HDFS’ data distribution strategy to push work out to many nodes in a cluster. This allows analyses to run in parallel and eliminates the bottlenecks imposed by monolithic storage systems.
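
One way to picture "pushing work out to the data" is a scheduler that prefers to run each map task on a node that already holds a replica of the task's input block. The sketch below is a toy greedy scheduler written for illustration only. It is not how the JobTracker is implemented, and the node names, block names, and slot counts are hypothetical.

```python
def assign_map_tasks(block_locations, nodes, slots_per_node):
    """Assign one map task per block, preferring nodes that hold a replica.

    Returns {block: (node, is_local)}. Falls back to any node with a free
    slot (a remote read) when all replica holders are busy.
    """
    load = {node: 0 for node in nodes}
    assignment = {}
    for block, replicas in sorted(block_locations.items()):
        local = [n for n in replicas if n in load and load[n] < slots_per_node]
        candidates = local or [n for n in nodes if load[n] < slots_per_node]
        if not candidates:
            raise RuntimeError("no free task slots")
        chosen = min(candidates, key=lambda n: load[n])  # least-loaded first
        load[chosen] += 1
        assignment[block] = (chosen, chosen in replicas)
    return assignment


nodes = ["n1", "n2", "n3"]
block_locations = {
    "b1": ["n1", "n2"],
    "b2": ["n1", "n3"],
    "b3": ["n1", "n2"],
    "b4": ["n2", "n3"],
}
assignment = assign_map_tasks(block_locations, nodes, slots_per_node=2)
for block, (node, is_local) in assignment.items():
    print(block, node, "local" if is_local else "remote")
```

When every task lands on a node that holds its block, no input data has to cross the network, which is the effect the slide describes.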

Page 11: 20100128ebay

Hadoop Subprojects
▪ Avro
  ▪ Cross-language framework for RPC and serialization
▪ HBase
  ▪ Table storage on top of HDFS, modeled after Google’s BigTable
▪ Hive
  ▪ SQL interface to structured data stored in HDFS
▪ Pig
  ▪ Language for data flow programming; also Owl, Zebra, SQL
▪ ZooKeeper
  ▪ Coordination service for distributed systems

Page 12: 20100128ebay

Hadoop Community Support
▪ 185+ contributors to the open source code base
  ▪ ~60 engineers at Yahoo!, ~15 at Facebook, ~15 at Cloudera
▪ Over 500 (paid!) attendees at Hadoop World NYC
▪ Regular user group meetups in many cities
  ▪ Bay Area Meetup group has 534 members
▪ Three books (O’Reilly, Apress, Manning)
▪ Training videos free online
▪ University courses across the world
▪ Growing consultant and systems integrator expertise
▪ Training, certification, support, and services from Cloudera

Page 13: 20100128ebay

Hadoop Project Mechanics
▪ Trademark owned by ASF; Apache 2.0 license for code
▪ Rigorous unit, smoke, performance, and system tests
▪ Release cycle of 9 months
  ▪ Last major release: 0.20.0 on April 22, 2009
  ▪ 0.21.0 will be last release before 1.0; nearly complete
  ▪ Subprojects on different release cycles
▪ Releases put to a vote according to Apache guidelines
▪ Releases made available as tarballs on Apache and mirrors
▪ Cloudera packages a distribution for many platforms
  ▪ RPM and Debian packages; AMI for Amazon’s EC2

Page 14: 20100128ebay

Hadoop at Facebook
Early 2006: The First Research Scientist
▪ Source data living on horizontally partitioned MySQL tier
▪ Intensive historical analysis difficult
▪ No way to assess impact of changes to the site
▪ First try: Python scripts pull data into MySQL
▪ Second try: Python scripts pull data into Oracle
▪ ...and then we turned on impression logging

Page 15: 20100128ebay

Facebook Data Infrastructure
2007

Oracle Database Server
Data Collection Server
MySQL Tier
Scribe Tier

Page 16: 20100128ebay

Facebook Data Infrastructure
2008

MySQL Tier
Scribe Tier
Hadoop Tier
Oracle RAC Servers

Page 17: 20100128ebay

Major Data Team Workloads
▪ Data collection
  ▪ server logs
  ▪ application databases
  ▪ web crawls
▪ Thousands of multi-stage processing pipelines
  ▪ Summaries consumed by external users
  ▪ Summaries for internal reporting
  ▪ Ad optimization pipeline
  ▪ Experimentation platform pipeline
▪ Ad hoc analyses

Page 18: 20100128ebay

Workload Statistics
Facebook 2010
▪ Largest cluster running Hive: 8,400 cores, 12.5 PB of storage
▪ 12 TB of compressed new data added per day
▪ 135 TB of compressed data scanned per day
▪ 7,500+ Hive jobs per day
▪ 80K compute hours per day
▪ Around 200 people per month run Hive jobs

(data from Ashish Thusoo’s Bay Area ACM DM SIG presentation)
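
A few back-of-the-envelope figures follow from these statistics. The sketch below derives them under assumptions that are not on the slide: that "compute hours" means core-hours, that units are decimal (1 PB = 1,000 TB), that 12.5 PB is raw capacity, and that new data is stored with three replicas. Treat the results as rough orders of magnitude only.

```python
# Numbers from the slide.
CORES = 8_400
STORAGE_PB = 12.5
NEW_DATA_TB_PER_DAY = 12
SCANNED_TB_PER_DAY = 135
JOBS_PER_DAY = 7_500
COMPUTE_HOURS_PER_DAY = 80_000

# Assumptions made for this sketch (not stated on the slide).
TB_PER_PB = 1_000
GB_PER_TB = 1_000
REPLICATION = 3

# Average compressed data scanned by one Hive job.
avg_scan_gb_per_job = SCANNED_TB_PER_DAY * GB_PER_TB / JOBS_PER_DAY

# Fraction of available core-hours consumed, if compute hours are core-hours.
utilization = COMPUTE_HOURS_PER_DAY / (CORES * 24)

# Days of ingest an empty cluster of this size could absorb.
days_until_full = STORAGE_PB * TB_PER_PB / (NEW_DATA_TB_PER_DAY * REPLICATION)

print(f"average scan per job: {avg_scan_gb_per_job:.0f} GB")
print(f"core utilization: {utilization:.1%}")
print(f"days of ingest at this rate: {days_until_full:.0f}")
```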

Page 19: 20100128ebay

Why Did Facebook Choose Hadoop?
1. Demonstrated effectiveness for primary workload
2. Proven ability to scale past any commercial vendor
3. Easy provisioning and capacity planning with commodity nodes
4. Data access for all: engineers, business analysts, sales managers
5. Single system to manage XML/JSON, text, and relational data
6. No schemas enabled data collection without involving Data team
7. Simple, modular architecture
8. Easy to build, deploy, and monitor
9. Apache-licensed open source code granted to ASF

Page 20: 20100128ebay

Why Did Facebook Choose Hadoop?
▪ Most importantly: the community
  ▪ Broad and deep commitment to future development from multiple organizations
  ▪ Interaction with a community often useful for recruiting
  ▪ Growing body of users and operators with prior expertise meant lower cost of training new users
  ▪ Learn about best practices from other organizations
  ▪ Widely available public materials for improving skills
▪ Not then, but now
  ▪ Commercial training, certification, support, and services
  ▪ Growing body of complementary software

Page 21: 20100128ebay

Hadoop and Machine Learning/Modeling
▪ Data preparation using familiar programming tools
  ▪ Scalable historical storage of data for training and validation
  ▪ Field coding, aggregation, and data quality assertions
  ▪ Feature extraction over massive or complex data sets
  ▪ Efficient sampling and extraction to other tools
  ▪ Combination with other data sets
  ▪ Extensible metadata for organizing data sets
▪ Fundamental operations
  ▪ Matrix multiplication and other linear algebra
  ▪ Statistical tests of significance
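
Matrix multiplication is a useful example of how a linear algebra operation can be phrased as map and reduce steps. The sketch below uses a common one-pass formulation: each entry of A or B is sent to every output cell it contributes to, and the reducer for a cell joins the entries on the shared index and sums the products. It runs in memory and is meant only to show the data flow. The function names are invented here, and practical implementations usually work on blocks of the matrix to limit the replication this scheme causes.

```python
from collections import defaultdict


def matmul_mapper(name, i, j, value, a_rows, b_cols):
    """Emit ((row, col) of C, tagged entry) for one nonzero entry."""
    if name == "A":  # A[i][j] contributes to C[i][c] for every column c
        for c in range(b_cols):
            yield (i, c), ("A", j, value)
    else:            # B[i][j] contributes to C[r][j] for every row r
        for r in range(a_rows):
            yield (r, j), ("B", i, value)


def matmul_reducer(cell, tagged):
    """Join A and B entries on the shared index k and sum the products."""
    a = {k: v for tag, k, v in tagged if tag == "A"}
    b = {k: v for tag, k, v in tagged if tag == "B"}
    yield cell, sum(a[k] * b[k] for k in a.keys() & b.keys())


def mapreduce_matmul(a_entries, b_entries, a_rows, b_cols):
    groups = defaultdict(list)  # stands in for the shuffle
    for name, entries in (("A", a_entries), ("B", b_entries)):
        for i, j, v in entries:
            for key, val in matmul_mapper(name, i, j, v, a_rows, b_cols):
                groups[key].append(val)
    result = {}
    for cell, tagged in groups.items():
        for key, total in matmul_reducer(cell, tagged):
            result[key] = total
    return result


def to_entries(matrix):
    """Convert a dense list-of-lists into sparse (row, col, value) triples."""
    return [(i, j, v) for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = mapreduce_matmul(to_entries(A), to_entries(B), a_rows=2, b_cols=2)
print(C)
```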

Page 22: 20100128ebay

Hadoop and Machine Learning/Modeling
▪ Scoring
  ▪ eHarmony matching users
  ▪ Fraud detection for billing platforms
▪ Genetic Algorithms
  ▪ Mailchimp’s Project Omnivore
  ▪ Xavier Llorà’s research
▪ Collaborative filtering
  ▪ Google News personalization
  ▪ Yahoo! front page personalization (Cokeheads)

Page 23: 20100128ebay

Hadoop and Machine Learning/Modeling
▪ Model fitting
  ▪ EM algorithm and HMMs (Jimmy Lin)
▪ Graph analysis
  ▪ Finding largest connected component (Jeff Hodges)
  ▪ Social graph analysis (Jake Hofman)
▪ Document analysis
  ▪ Named entity extraction (Evri)
  ▪ Document similarity (Jimmy Lin)
▪ Image similarity: Google paper
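
As an illustration of the graph analysis item, the following sketch finds the largest connected component with iterative minimum-label propagation, a technique that fits the map/reduce style because each round only needs messages along edges followed by a per-node minimum. This is a generic textbook approach chosen for the example, not a reconstruction of the work cited on the slide. It runs in memory, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def cc_round(labels, edges):
    """One map/reduce-style round of minimum-label propagation."""
    messages = defaultdict(list)
    for node, label in labels.items():  # each node keeps its own label
        messages[node].append(label)
    for u, v in edges:                  # "map": send labels along each edge
        messages[v].append(labels[u])
        messages[u].append(labels[v])
    # "reduce": every node adopts the smallest label it received
    return {node: min(received) for node, received in messages.items()}


def connected_components(nodes, edges):
    """Iterate rounds until no label changes; returns {node: component label}."""
    labels = {n: n for n in nodes}
    while True:
        new_labels = cc_round(labels, edges)
        if new_labels == labels:
            return labels
        labels = new_labels


def largest_component(nodes, edges):
    labels = connected_components(nodes, edges)
    biggest_label, _ = Counter(labels.values()).most_common(1)[0]
    return sorted(n for n, label in labels.items() if label == biggest_label)


nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (3, 4), (5, 6)]
print(largest_component(nodes, edges))
```

The number of rounds grows with the diameter of the graph, so each round would be one MapReduce job in a real deployment.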

Page 24: 20100128ebay

Hadoop and Machine Learning/Modeling
▪ Classification
  ▪ Google’s PLANET for building decision trees
  ▪ eBay’s linear Poisson regression for behavioral targeting
  ▪ Sessionization of clickstream logs and path prediction
▪ Bioinformatics
  ▪ Cloudburst
  ▪ Crossbow
▪ Computer vision
  ▪ Face detection
  ▪ Face recognition
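
Sessionization of clickstream logs maps naturally onto the key/value model: key each click by user, then have the per-user step sort clicks by time and cut a new session whenever the gap between clicks is too long. The sketch below shows that logic in memory. The 30-minute timeout, the event layout `(user_id, timestamp_seconds, url)`, and the function name are assumptions for illustration.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumption: 30-minute inactivity timeout


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp_seconds, url) events into per-user sessions.

    Returns {user_id: [session, ...]} where each session is a list of
    (timestamp_seconds, url) tuples in time order.
    """
    by_user = defaultdict(list)  # "map" + shuffle: key every click by user
    for user, ts, url in events:
        by_user[user].append((ts, url))

    sessions = {}
    for user, clicks in by_user.items():  # "reduce": one call per user
        clicks.sort()
        user_sessions = [[clicks[0]]]
        for prev, cur in zip(clicks, clicks[1:]):
            if cur[0] - prev[0] > gap:    # long pause starts a new session
                user_sessions.append([])
            user_sessions[-1].append(cur)
        sessions[user] = user_sessions
    return sessions


events = [
    ("u1", 0, "/home"),
    ("u2", 100, "/search"),
    ("u1", 600, "/item/1"),
    ("u1", 4000, "/home"),
]
sessions = sessionize(events)
for user, user_sessions in sorted(sessions.items()):
    print(user, user_sessions)
```

The resulting sessions are the usual input to path prediction, since each one is an ordered sequence of pages visited.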

Page 25: 20100128ebay

Hadoop and Machine Learning/Modeling
▪ Simulation
  ▪ Protein folding
  ▪ Particle-swarm optimization
▪ Crazy stuff
  ▪ Factoring integers
  ▪ Solving Boggle
  ▪ Generating fractals
▪ Books and conferences
  ▪ MDAC 2010
  ▪ “Data Intensive Text Processing with MapReduce”

Page 26: 20100128ebay

Hadoop at Cloudera
Cloudera’s Distribution for Hadoop
▪ Open source distribution of Apache Hadoop for enterprise use
▪ Includes HDFS, MapReduce, Pig, Hive, and ZooKeeper
▪ Ensures cross-subproject compatibility
▪ Adds backported patches and customer-specific patches
▪ Adds Cloudera utilities like MRUnit and Sqoop
▪ Better integration with daemon administration utilities
▪ Follows the Filesystem Hierarchy Standard (FHS) for file layout
▪ Tools for automatically generating a configuration
▪ Packaged as RPM, DEB, AMI, or tarball

Page 27: 20100128ebay

Hadoop at Cloudera
Training and Certification
▪ Free online training
  ▪ Basic, Intermediate (including Hive and Pig), and Advanced
  ▪ Includes a virtual machine with software and exercises
▪ Live training sessions
  ▪ One live session per month somewhere in the world
  ▪ If you have a large group, we may come to you
▪ Certification
  ▪ Exams for Developers, Administrators, and Managers
  ▪ Administered online or in person

Page 28: 20100128ebay

Hadoop at Cloudera
Services and Support
▪ Professional Services
  ▪ Get Hadoop up and running in your environment
  ▪ Optimize an existing Hadoop infrastructure
  ▪ Design new algorithms to make the most of your data
▪ Support
  ▪ Unlimited questions for Cloudera’s technical team
  ▪ Access to our Knowledge Base
  ▪ Help prioritize feature development for CDH
  ▪ Early access to upcoming Cloudera software products

Page 29: 20100128ebay

Hadoop at Cloudera
Commercial Software
▪ General thesis: build commercially-licensed software products which complement CDH for data management and analysis
▪ Current products
  ▪ Cloudera Desktop
    ▪ Extensible interface for users of Cloudera software
▪ Upcoming products for data collection
  ▪ Talk to me offline

Page 30: 20100128ebay

Cloudera Desktop
Big Data can be Beautiful

Page 31: 20100128ebay

(c) 2009 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
