vBACD July 2012 - Apache Hadoop, Now and Beyond

Post on 10-May-2015

Category:

Technology

DESCRIPTION

Apache Hadoop, Now and Beyond. Jim Walker, Director of Product Marketing, Hortonworks. Hadoop is an open source project that allows you to gain insight from massive amounts of structured and unstructured data quickly and without significant investment. It is shifting the way many traditional organizations think of analytics and business models. While it is designed to take advantage of cheap commodity hardware, it is also well suited to the cloud, as it is built to scale up or down without system interruption. In this presentation, Jim Walker provides an overview of Apache Hadoop and its current state of adoption in and out of the cloud.

TRANSCRIPT

<ul><li>1. Apache Hadoop &amp; the Cloud. Jim Walker, Dir. Product Marketing, Hortonworks. Twitter: @jaymce. July 10, 2012. Hortonworks Inc. 2012</li></ul><p>2. 1941 2012
3. Big data market segments: Software, Hardware, Services. Distributions: OSS Apache Hadoop; Enterprise Distributions; Non-Hadoop big data frameworks. Storage: Distributed file stores; NoSQL databases. ETL &amp; Mgmnt: Data integration; Data quality &amp; governance. Analytics: Analytic application development platforms; Advanced analytics applications. Applications: Data visualization tools; Business intelligence applications. Services: Consulting; Training; Tech support; Software maintenance; Hardware maintenance; hosting. Hardware: Servers; Networking. Next Generation Data Warehouse: MPP columnar data warehouse appliances; In-memory analytics engines; Fast data loading.
4. Big data market segments (repeat of slide 3 with four "cloud" callouts added).
5. Analytics started with basic purchase history. Megabytes: ERP (Purchase detail, Purchase record, Payment record). Axis: Increasing Data Variety and Complexity. Source: Created in conjunction with Teradata, Inc.
6. then we added customer information. Gigabytes: CRM (Segmentation, Customer Touches, Support Contacts, Offer details). Megabytes: ERP (Purchase detail, Purchase record, Payment record). Axis: Increasing Data Variety and Complexity. Source: Created in conjunction with Teradata, Inc.
7.
and the web started to impact. Terabytes: WEB (Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Search Marketing, Affiliate Networks, Dynamic Funnels, Offer history). Gigabytes: CRM (Segmentation, Customer Touches, Support Contacts, Offer details). Megabytes: ERP (Purchase detail, Purchase record, Payment record). Axis: Increasing Data Variety and Complexity. Source: Created in conjunction with Teradata, Inc.
8. Big data changes the game. Transactions + Interactions + Observations = BIG DATA. Petabytes: BIG DATA (Mobile Web, Sentiment, User Click Stream, SMS/MMS, Speech to Text, Social Interactions &amp; Feeds, Spatial &amp; GPS Coordinates, Sensors / RFID / Devices, Business Data Feeds, External Demographics, User Generated Content, HD Video, Audio, Images, Product/Service Logs). Terabytes: WEB (Web logs, A/B testing, Behavioral Targeting, Dynamic Pricing, Search Marketing, Affiliate Networks, Dynamic Funnels, Offer history). Gigabytes: CRM (Segmentation, Customer Touches, Support Contacts, Offer details). Megabytes: ERP (Purchase detail, Purchase record, Payment record). Axis: Increasing Data Variety and Complexity. Source: Created in conjunction with Teradata, Inc.
9. Next-gen data architecture drivers. Business Drivers: Enable new business models &amp; drive faster growth (20%+). Find insights for competitive advantage &amp; optimal returns. Technical Drivers: Data continues to grow exponentially. Data is increasingly everywhere and in many formats. Legacy solutions unfit for new requirements growth. Financial Drivers: Cost of data systems, as % of IT spend, continues to grow. Cost advantages of commodity hardware &amp; open source.
10. Apache Hadoop: Open Source Data Management Software. One of the best examples of open source driving innovation and creating a market. Foundation for big data solutions. Enables a rational economics model. Powers data-driven business. Commodity hardware. Loosely coupled, ship early/ship often. Consists of many specialized sub-projects.
11.
Apache Hadoop &amp; Cloud Makes Sense. Broader access of Hadoop to end users, IT professionals, and developers. Easy installation and configuration and simplified programming. Enterprise-ready distribution with greater security, performance, ease of management and options for Hybrid IT usage. Integrate with everything via RESTful API. Spin up a cluster on demand. Ease management.
12. 5 Reasons for Hadoop in the Cloud. People say "should you run Hadoop in the cloud?" I say "it depends". http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
13. 5 Reasons for Hadoop in the Cloud. (1) If your data is stored in a cloud, local analysis may make more sense: "work near the data". (2) For periodic processing (nightly, etc.) it might make sense to just rent. (3) No upfront capital expense, fund from success. (4) Easier to expand a cluster; no need to buy, just find. (5) Eliminate networking concerns. http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
14. What is Apache Hadoop? (1) PROCESSING: Map/Reduce. Splits a task across processors near the data &amp; assembles results. 2004 white paper "MapReduce: Simplified Data Processing on Large Clusters". Base of much new tech. (2) STORAGE: Hadoop Distributed File System. Distributed across nodes. Natively redundant. Name node tracks locations.
15. Apache Hadoop related projects: (3) Hive, (4) HBase, (5) HCatalog, (6) Pig, (7) Oozie, (8) Ambari, (9) Sqoop, (10) Zookeeper. Hive: Apache Hive is a data warehouse infrastructure built on top of Hadoop (originally by Facebook) for providing data summarization, ad-hoc query, and analysis of large datasets. It provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL (HQL).
16. Apache Hadoop related projects: HBase. HBase is a non-relational database.
It is columnar and provides fault-tolerant storage and quick access to large quantities of sparse data. It also adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts and deletes.
17. Apache Hadoop related projects: HCatalog. HCatalog is a metadata management service for Apache Hadoop. It opens up the platform and allows interoperability across data processing tools such as Pig, Map Reduce and Hive. It also provides a table abstraction so that users need not be concerned with where or how their data is stored. Aster SQL-H interfaces with HCatalog.
18. Apache Hadoop related projects: Pig. Apache Pig allows you to write complex map reduce transformations using a simple scripting language. Pig Latin (the language) defines a set of transformations on a data set such as aggregate, join and sort, among others. Pig Latin is sometimes extended using UDFs (User Defined Functions), which the user can write in Java and then call directly from the language.
19. Apache Hadoop related projects: Oozie. Oozie coordinates jobs written in multiple languages such as Map Reduce, Pig and Hive. It is a workflow system that links these jobs and allows specification of order and dependencies between them.
20. Apache Hadoop related projects: Ambari. Apache Ambari operationalizes Hadoop. It provides a mechanism to monitor and manage a cluster. It also provisions nodes. Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop clusters.
21.
Apache Hadoop related projects: Sqoop. Sqoop is a set of tools that allow non-Hadoop data stores to interact with traditional relational databases and data warehouses.
22. Apache Hadoop related projects: Zookeeper. ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
23. Hadoop in Action. (1) Web log files via WebHDFS APIs. (2) Customer &amp; order data via Talend &amp; HCatalog for schema. (3) Pre-processes, refines, and joins data via Talend, Pig, &amp; HCatalog. (4) Interfaces with HCatalog to analyze website visits by the type of end results. Diagram labels: Website Interactions, Web Logs, Order DB Data, Customer DB Data, Big Data Refinery.
24. Hortonworks Vision &amp; Role. We believe that by the end of 2015, more than half the world's data will be processed by Apache Hadoop. (1) Be diligent stewards of the open source core. (2) Be tireless innovators beyond the core. (3) Provide robust data platform services &amp; open APIs. (4) Enable the ecosystem at each layer of the stack. (5) Make the platform enterprise-ready &amp; easy to use.
25. Balancing Innovation &amp; Stability. Chart of relative % of customers over time, marking "The CHASM". Segments: Innovators (technology enthusiasts); Early adopters (visionaries); Early majority (pragmatists); Late majority (conservatives); Laggards (skeptics). Customers want technology &amp; performance; customers want solutions &amp; convenience. Source: Geoffrey Moore, Crossing the Chasm.
26.
Enabling Hadoop as Enterprise Big Data Platform. Ecosystem layers: Applications, Business Tools, Development Tools, Open APIs and access, Data Movement &amp; Integration, Data Management Systems, Systems Management. Operations: Installation &amp; Configuration, Administration, Monitoring, High Availability, Replication, Multi-tenancy, .. Hortonworks Data Platform. DEVELOPER. Data Platform Services &amp; Open APIs: Metadata, Indexing, Search, Security, Management, Data Extract &amp; Load, APIs.
27. Hortonworks Data Platform. The ONLY 100% open source data platform for Hadoop. Tightly aligned with core Apache code line. All code committed back to open source. Most complete Apache Hadoop platform. Comprehensive management and monitoring. Intuitive graphical data integration tools. Centralized metadata services for easy data sharing.
28. Hortonworks Data Platform. Simplify deployment to get started quickly and easily. Monitor, manage any size cluster with familiar console and tools. Only platform to include data integration services to interact with any data source. Metadata services open the platform for integration with existing applications. Dependable high availability architecture. Hortonworks Data Platform delivers enterprise grade functionality on a proven Apache Hadoop distribution to ease management, simplify use and ease integration into the enterprise. The only 100% open source data platform for Apache Hadoop.
29. Apache Distribution Stack. Built on Hadoop 1.0 (a.k.a.
0.20.205). Proven at large scale enterprise implementations. Most stable and reliable version of Hadoop to date. First Apache line supporting security, HBase, WebHDFS. Driven by core committers and architects at Hortonworks. Includes necessary components already integrated and tested together. Most stable versions of all components are chosen. Tested, Hardened &amp; Proven Distribution Reduces Risk. Hortonworks Distribution stack diagram, component labels: Zookeeper, HCatalog, Ambari, HBase, Talend, Sqoop, Oozie, Core, Hive, Pig. Version labels: 1.0.3, 0.4.0, 0.9.2, 0.9.0+, 0.92.1+, 0.9.0+, 3.1.3, 3.3.4, beta, 5.1.1.
30. Management &amp; Monitoring Svcs: Hortonworks Management Center. View the health of cluster operations, server utilization and performance levels. Customizable dashboards. APIs for integration into 3rd party monitoring tools. 100% open source management &amp; monitoring, powered by Apache Ambari, Puppet, Nagios and Ganglia. Simple wizard-based installation, configuration &amp; provisioning of any size Hadoop cluster. Optimize performance for your Hadoop cluster. Simplify installation and provisioning.
31. Data Integration Services. Intuitive graphical data integration tools for HDFS, Hive, HBase, HCatalog and Pig. Oozie scheduling allows you to manage and stage jobs. Connectors for any database, business application or system. Integrated HCatalog storage. Bridge the gap between legacy data &amp; Hadoop. Simplify and speed development.
32. Which is best for the cloud? vs.
33.
Metadata Services. Apache HCatalog provides flexible metadata services across tools and external access. Consistency of metadata and data models across tools (MapReduce, Pig, HBase and Hive). Accessibility: share data as tables in and out of HDFS. Availability: enables flexible, thin-client access via REST API. Diagram: HCatalog, shared table and schema management, opens the platform. Raw Hadoop data: inconsistent, unknown; tool specific access. Table access: aligned metadata; REST API.
34. Services Integration. Provides RESTful API as front door for Hadoop. Opens the door to languages other than Java. Thin clients via web services vs. fat-clients in gateway. Insulation from interface changes release to release. Opens Hadoop to integration with existing and new applications. Diagram labels: Existing &amp; New Applications; RESTful Web Services (WebHDFS, HCatalog); MapReduce, Pig, Hive; HCatalog; HDFS, HBase, External Store.
35. Use cases: optimize outcomes at scale. Media: optimize Content. Intelligence: optimize Detection. Investment: optimize Algorithms. Advertising: optimize Performance. Fraud: optimize Prevention. Regulation: optimize Compliance. Retail / Wholesale: optimize Inventory turns. Manufacturing: optimize Supply chains. Healthcare: optimize Patient outcomes. Education: optimize Learning outcomes. Government: optimize Citizen services. Source: Geoffrey Moore, Hadoop Summit 2012 keynote presentation.
36.
Connecting Transactions + Interactions + Observations. Data sources: Audio, Video, Images; Docs, Text, XML; Web Logs, Clicks; Social, Graph, Feeds; Sensors, Devices, RFID; Spatial, GPS; Events, Other. Center: Big Data Refinery. Business Transactions &amp; Interactions: Web, Mobile, CRM, ERP, SCM. Business Intelligence &amp; Analytics: Dashboards, Reports, Visualization. Callouts: (1) Classic ETL processing. (2) Store, aggregate, and transform multi-structured data to unlock value. (3) Share refined data &amp; runtime models. (4) Data discovery &amp; investigative analytics, interactive data exploration. (5) Retain runtime models and historical data for ongoing refinement &amp; analysis. (6) Retain historical data to unlock additional value.
37. 5 Reasons for Hadoop in the Cloud (repeat of slide 13). http://steveloughran.blogspot.com/2012/03/hadoop-in-cloud-infrastructures.html
38. THANK YOU. Jim Walker, jim@hortonworks.com, @jaymce. (1) Get Hortonworks Data Platform: hortonworks.com/download. (2) Use the getting started guide: hortonworks.com/get-started. (3) Learn mo...</p>
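
Slide 14 describes Map/Reduce as splitting a task across processors near the data and then assembling the results. A minimal single-process sketch of that idea in Python follows. It is not Hadoop's API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the list of strings stands in for the input splits that a real cluster would read from HDFS blocks on separate nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(splits: List[str]) -> Dict[str, int]:
    """Run map over every split, shuffle by key, then reduce each group.

    Here the splits are processed one after another in a single process;
    on a cluster each split would be mapped on a node that holds its data.
    """
    mapped = (pair for split in splits for pair in map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

The sketch keeps the three stages separate so the data flow is visible; because each call to `map_phase` depends only on its own split, the map work can in principle be moved to wherever that split is stored.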