Cloudera Development Kit (CDK): Hadoop Application Development Made Easier

Post on 20-Aug-2015





  1. Cloudera Development Kit: Hadoop Application Development Made Easier. E. Sammer, Engineering Manager. May 2013.
  2. "[I]t's not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use."
  3. Hadoop is incredibly powerful.
  4. Hadoop is incredibly flexible.
  5. Hadoop is incredibly low-level.
  6. Hadoop is incredibly complex.
  7. A typical system (zoom 100:1)
  8. A typical system (zoom 10:1)
  9. A typical system (zoom 5:1)
  10. What you actually care about: getting data from A to B; using it later.
  11. Infrastructure details: serialization, file formats, and compression; metadata capture and maintenance; dataset organization and partitioning; durability and delivery guarantees; well-defined failure semantics; performance and health instrumentation.
  12. Cloudera Development Kit: make Hadoop accessible to the enterprise developer; codify expert patterns and practices; make the right thing easy and obvious; address the most common cases; let developers focus on business logic, not infrastructure.
  13. Cloudera Development Kit: an open source set of libraries, guides, and examples for building data-oriented systems and applications. Provides higher-level APIs atop existing components of CDH. Supports piecemeal adoption via loosely coupled modules.
  14. CDK Data Module: high-level APIs for interacting with datasets in HDFS; configuration-based format and schema management; consistent data model and serialization semantics; metadata system integration and support; automatic dataset partitioning and file management.
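  15. DatasetRepository repo = new FileSystemDatasetRepository.Builder()
          .fileSystem(FileSystem.get(new Configuration()))
          .directory(new Path("/data"))
          .get();

      // Create the "events" dataset, hash-partitioned on userId into 53 buckets.
      Dataset events = repo.create("events",
          new DatasetDescriptor.Builder()
              .schema(new File("event.avsc"))
              .partitionStrategy(new PartitionStrategy.Builder()
                  .hash("userId", 53)
                  .get())
              .get());

      // Reuse the Avro schema held by the dataset descriptor (accessor chain assumed).
      Schema schema = events.getDescriptor().getSchema();

      // Write one record, then close the writer to release its files.
      DatasetWriter writer = events.getWriter();
      writer.write(new GenericRecordBuilder(schema)
          .set("userId", 1)
          .set("timeStamp", System.currentTimeMillis())
          .build());
      writer.close();

      /data/events/.metadata/schema.avsc/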
  16. Under development: configuration-based record transformation and filtering engine; data pipeline deployment, discovery, and management; working with customers, partners, and the community on new modules and features.
  17. Getting started: CDK code repo; example repo; artifacts available from Cloudera's Maven repository; mailing list.
  18. Thank you for attending! Submit questions in the Q&A panel. Watch this webinar on demand. Follow Cloudera (@Cloudera) and Cloudera Engineering (@ClouderaEng). Learn more about the CDK on GitHub.
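
The hash partition strategy on slide 15, `hash("userId", 53)`, spreads records across 53 buckets keyed on the `userId` field. The slides do not show how the CDK computes the bucket or names the partition directories, so the following is only a minimal sketch of the general idea in plain Java. The class `HashPartitionSketch`, its methods, the use of `Objects.hashCode`, and the `userId=<bucket>` directory naming are all assumptions made for illustration, not the CDK's implementation.

```java
import java.util.Objects;

/** Illustrative sketch of hash partitioning; not the CDK's actual code. */
final class HashPartitionSketch {

    /** Hypothetical helper: maps a field value to a bucket in [0, buckets). */
    static int bucketFor(Object fieldValue, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(Objects.hashCode(fieldValue), buckets);
    }

    /** Hypothetical helper: builds a partition directory path (naming is assumed). */
    static String partitionPath(String datasetRoot, String fieldName,
                                Object fieldValue, int buckets) {
        return datasetRoot + "/" + fieldName + "=" + bucketFor(fieldValue, buckets);
    }

    public static void main(String[] args) {
        // Long.hashCode(1L) is 1, and 1 mod 53 is 1.
        System.out.println(partitionPath("/data/events", "userId", 1L, 53));
        // prints /data/events/userId=1

        // 54 mod 53 is also 1, so this record lands in the same bucket.
        System.out.println(partitionPath("/data/events", "userId", 54L, 53));
        // prints /data/events/userId=1
    }
}
```

A fixed bucket count keeps the number of partition directories bounded no matter how many distinct user IDs appear, which is one plausible reason to hash a high-cardinality field rather than partition on its raw value.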