Hadoop: Successes and Failures to Drive Deployment Evolution
Hadoop Hands On
Successes and failures to drive evolution
Benoit PERROUD
Software Engineer @Verisign & Apache Committer
GITI BigData, EPFL, November 6, 2012
Verisign Public
Disclaimer
• I apologize for speaking “Frenglish”
• The views and statements expressed in this talk do not necessarily reflect the
views of VeriSign, Inc and any other person involved in the company do not
warrant the accuracy, reliability, currency or completeness of those views or
statements and do not accept any legal liability whatsoever arising from any
reliance on the views, statements and subject matter of the talk.
• Apache, Apache Hadoop, Hadoop, Cassandra, Apache Cassandra, Solr, Apache
Solr, HBase, Apache HBase, Tomcat, Apache Tomcat, ZooKeeper, Apache
ZooKeeper, Lucene, Apache Lucene and the yellow elephant logo are either
registered trademarks or trademarks of the Apache Software Foundation in the
United States and/or other countries.
• Java, GlassFish and the Java logo are registered trademarks of Oracle and/or its
affiliates
• Python and the Python logo are either registered trademarks or trademarks of the
Python Software Foundation
• MongoDB, Mongo and the leaf logo are registered trademarks of 10gen, Inc.
• All other marks are the property of their respective owners.
Hadoop 10k Feet View
1. MapReduce Processing Framework
• Map → Combine → Shuffle → Reduce
2. Distributed File System (HDFS)
Credit: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
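The four MapReduce phases named on this slide can be illustrated with a small in-memory word count. This is only a sketch in plain Python, not Hadoop code: the function names (map_phase, combine_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real job would run each phase distributed across the cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine: pre-aggregate pairs locally, on the mapper side."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def shuffle_phase(mapper_outputs):
    """Shuffle: group values by key across all mappers, sorted by key."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapper_outputs):
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)


def word_count(splits):
    """Run the four phases over a list of input splits (lists of lines)."""
    mapper_outputs = []
    for split in splits:  # one hypothetical mapper per split
        pairs = [pair for line in split for pair in map_phase(line)]
        mapper_outputs.append(combine_phase(pairs))
    return dict(reduce_phase(k, v) for k, v in shuffle_phase(mapper_outputs))


if __name__ == "__main__":
    splits = [["the quick brown fox", "the lazy dog"], ["The fox"]]
    print(word_count(splits))
```

The combine step is what keeps the shuffle cheap: the first split sends one ("the", 2) pair across the network instead of two ("the", 1) pairs.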
NameNode Single Point of Failure
• NameNode crashes: configuring a primary NameNode (PNN) and a secondary NameNode (SNN)
Note: the NFS HA setup is not detailed here.
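To make the recovery idea concrete, here is a toy model of checkpointing, assuming that SNN refers to the Hadoop 1.x Secondary NameNode, which periodically merges the edit log into the namespace image rather than acting as a hot standby. Everything in the sketch (ToyNameNode, replay, checkpoint, recover) is hypothetical and is not Hadoop's API; it only shows why a checkpoint kept away from the NameNode host limits what a crash can lose.

```python
class ToyNameNode:
    """Hypothetical in-memory stand-in for NameNode metadata (not Hadoop's API)."""

    def __init__(self):
        self.fsimage = {}  # last checkpointed namespace: path -> replication
        self.edits = []    # operations logged since the last checkpoint

    def create(self, path, replication=3):
        self.edits.append(("create", path, replication))

    def delete(self, path):
        self.edits.append(("delete", path, None))


def replay(fsimage, edits):
    """Apply an edit log to a copy of an image and return the new image."""
    image = dict(fsimage)
    for op, path, replication in edits:
        if op == "create":
            image[path] = replication
        elif op == "delete":
            image.pop(path, None)
    return image


def checkpoint(namenode):
    """Fold the edit log into the image, then truncate the log."""
    namenode.fsimage = replay(namenode.fsimage, namenode.edits)
    namenode.edits = []
    return dict(namenode.fsimage)  # copy that would be stored on another host


def recover(checkpoint_image, surviving_edits):
    """Rebuild the namespace after a crash from a checkpoint plus any edits."""
    return replay(checkpoint_image, surviving_edits)


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.create("/logs/a")
    nn.create("/logs/b")
    saved = checkpoint(nn)
    nn.delete("/logs/a")
    print(recover(saved, nn.edits))  # edit log survived the crash
    print(recover(saved, []))        # edit log lost: namespace is stale
```

In this model, whatever happened after the last checkpoint is recoverable only if the edit log also survives, which is presumably why the metadata is additionally written to shared storage such as NFS.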
Bringing Data into the Cluster
• Data can come from inside the company, but also from external sources.
Note: Data Retrieval and Stream Ingestion are oversimplified.
Dealing with API Changes
• Integration/Validation Cluster setup
Note: the Validation Cluster is omitted from the following slides for clarity.
Future Evolutions
• Hadoop Next Gen
  • YARN (2.0)
• Graph processing
  • Neo4j
  • Google Pregel / Apache Hama
• Incremental Updates
• Real-time ad hoc queries
  • Cloudera Impala / Google Dremel
Conclusion
• Hadoop has gained huge momentum
• Technologies (around Hadoop) are evolving really fast
• There is no “one size fits all” solution
• Design is heavily driven by customer needs
• Data quality is a hidden requirement
Conclusion #2
• Data Scientists cost a lot
• Running on commodity hardware still costs a lot
• No one fully understands the entire data flow
• And you need several FTEs just to track the architecture
• There is a high risk of misusing these tools
• Hiring engineers with deep knowledge (meaning: hands-on experience) of some
of these tools is already a challenge
Recommended Reading
Hadoop in Practice
by Alex Holmes
Senior Software Engineer @Verisign
Thank You
© 2012 VeriSign, Inc. All rights reserved. VERISIGN and other trademarks, service marks, and
designs are registered or unregistered trademarks of VeriSign, Inc. and its subsidiaries in the United
States and in foreign countries. All other trademarks are property of their respective owners.