hadoop jute record python

HADOOP RECORD READER IN PYTHON HUG: Nov 18 2009 Paul Tarjan http://paulisageek.com @ptarjan http://github.com/ptarjan/hadoop_record

Upload: paul-tarjan

Post on 30-Nov-2014


DESCRIPTION

My talk for the Hadoop User Group on Nov 18, 2009, about parsing Hadoop records using Python

TRANSCRIPT

Page 1: Hadoop Jute Record Python

HADOOP RECORD READER IN

PYTHON

HUG: Nov 18 2009

Paul Tarjan

http://paulisageek.com

@ptarjan

http://github.com/ptarjan/hadoop_record

Page 2: Hadoop Jute Record Python

Hey Jute…

Tabs and newlines are good and all

For lots of data, don’t do that

Page 3: Hadoop Jute Record Python

don’t make it bad...

Hadoop has a native data storage format called Hadoop Record or “Jute”

org.apache.hadoop.record

http://en.wikipedia.org/wiki/Jute

Page 4: Hadoop Jute Record Python

take a data structure…

There is a Data Definition Language!

module links {

class Link {

ustring URL;

boolean isRelative;

ustring anchorText;

};

}

Page 5: Hadoop Jute Record Python

and make it better…

And a compiler

$ rcc -l c++ inclrec.jr testrec.jr

namespace inclrec {

class RI :

public hadoop::Record {

private:

int32_t I32;

double D;

std::string S;

Page 6: Hadoop Jute Record Python

remember, to only use C++/Java

$ rcc --help

Usage: rcc --language [java|c++] ddl-files

Page 7: Hadoop Jute Record Python

then you can start to make it better…

I wanted it in Python

Need 2 parts: a parsing library and a DDL translator

I only did the first part

If you need the second part, let me know
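As a rough idea of what the missing second part could look like, here is a minimal sketch of a DDL translator. It is not part of hadoop_record: the `translate` function, the regular expressions, and the choice of `namedtuple` as the output type are all assumptions made for illustration, and it only handles simple `type name;` fields like the ones in the Link example from page 4.

```python
import re
from collections import namedtuple

# The DDL from the "Link" slide, used here as sample input.
DDL = """
module links {
    class Link {
        ustring URL;
        boolean isRelative;
        ustring anchorText;
    };
}
"""

# Hypothetical patterns: one class body, and simple "type name;" fields.
CLASS_RE = re.compile(r"class\s+(\w+)\s*\{(.*?)\}", re.DOTALL)
FIELD_RE = re.compile(r"(\w+)\s+(\w+)\s*;")


def translate(ddl):
    """Return {class name: namedtuple type} for each class in the DDL text."""
    classes = {}
    for name, body in CLASS_RE.findall(ddl):
        field_names = [field for _type, field in FIELD_RE.findall(body)]
        classes[name] = namedtuple(name, field_names)
    return classes
```

Calling `translate(DDL)["Link"]` gives a type with the fields `URL`, `isRelative`, and `anchorText`. Composite types such as vectors and maps would need a real grammar rather than regular expressions.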

Page 8: Hadoop Jute Record Python

Hey Jute don't be afraid…

Page 9: Hadoop Jute Record Python

you were made to go out and get her…

http://github.com/ptarjan/hadoop_record

Page 10: Hadoop Jute Record Python

the minute you let her under your skin…

I bet you thought I was done with “Hey Jude” references, eh?

How I built it

Ply == lex and yacc

Parser == 234 lines including tests!

Outputs generic data types

You have to do the class transform yourself

You can use my lex and yacc stuff in your language of choice
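To illustrate the "class transform" step, here is a minimal sketch that turns generic parsed output into a named record. It assumes the parser hands back a plain sequence of values in field order; that shape, the `to_link` helper, and the hand-written `Link` type (mirroring the DDL on page 4) are assumptions for illustration, not the library's actual output format.

```python
from collections import namedtuple

# Hand-written counterpart of the "Link" class from the DDL slide.
Link = namedtuple("Link", ["URL", "isRelative", "anchorText"])


def to_link(fields):
    """Turn a generic 3-item sequence into a Link (field order is assumed)."""
    url, is_relative, anchor_text = fields
    return Link(URL=url, isRelative=bool(is_relative), anchorText=anchor_text)
```

For example, `to_link(["http://example.com/a", False, "a page"])` yields a `Link` whose fields can then be read by name instead of by position.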

Page 11: Hadoop Jute Record Python

and any time you feel the pain…

Parsing the binary format is hard

Vector vs struct???

struct = "s{" record *("," record) "}"

vector = "v{" [record *("," record)] "}"
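The two grammar rules above differ only in that a vector may be empty while a struct needs at least one record. A minimal sketch of a hand-written recursive-descent parser for just these two rules follows. It is not the Ply-based parser from the repository: the function names are hypothetical, leaf values are left as raw undecoded strings, and other record types are not handled.

```python
def parse_record(text, pos=0):
    """Parse one record starting at text[pos]; return (value, next_pos)."""
    if text.startswith("s{", pos):
        items, pos = _parse_items(text, pos + 2)
        if not items:
            # The struct rule requires at least one record.
            raise ValueError("struct needs at least one record")
        return tuple(items), pos
    if text.startswith("v{", pos):
        # The vector rule allows zero records.
        return _parse_items(text, pos + 2)
    # Leaf: keep everything up to the next ',' or '}' as a raw string.
    end = pos
    while end < len(text) and text[end] not in ",}":
        end += 1
    return text[pos:end], end


def _parse_items(text, pos):
    """Parse comma-separated records up to and including the closing '}'."""
    items = []
    if pos < len(text) and text[pos] == "}":
        return items, pos + 1
    while True:
        value, pos = parse_record(text, pos)
        items.append(value)
        if pos >= len(text):
            raise ValueError("unterminated struct or vector")
        if text[pos] == "}":
            return items, pos + 1
        if text[pos] != ",":
            raise ValueError("expected ',' or '}' at offset %d" % pos)
        pos += 1
```

Here structs come back as tuples and vectors as lists, so the two stay distinguishable after parsing. That mapping is a design choice of this sketch.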

LazyString – don’t decode if not needed

99% of my Hadoop time was decoding strings I didn’t need

Binary on disk -> CSV -> Python == wasteful

Hadoop unpacks zip files – name it .mod
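The LazyString idea can be sketched as a small wrapper that keeps the raw bytes and decodes them only when the text is actually requested. This is an illustrative guess at the technique, not the class from hadoop_record: the attribute names, the `decoded` property, and the UTF-8 assumption are all hypothetical.

```python
class LazyString:
    """Hold raw bytes and decode them only on first use (UTF-8 assumed)."""

    __slots__ = ("_raw", "_text")

    def __init__(self, raw):
        self._raw = raw
        self._text = None  # filled in on the first str() call

    def __str__(self):
        if self._text is None:
            self._text = self._raw.decode("utf-8")
        return self._text

    @property
    def decoded(self):
        """True once the bytes have been decoded at least once."""
        return self._text is not None
```

A job that reads millions of records but only looks at one field would pay the decoding cost for that field alone, since the other values stay as untouched bytes.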

Page 12: Hadoop Jute Record Python

na na na na na

Future work

DDL Converter

Integrate it officially

Record writer (should be easy)

SequenceFileAsOutputFormat

Integrate your feedback