Hadoop Record Reader In Python

Post on 20-Jan-2015

TRANSCRIPT

<ul><li> 1. Hadoop Record Reader in Python<br />HUG: Nov 18 2009<br />Paul Tarjan<br />http://paulisageek.com<br />@ptarjan<br />http://github.com/ptarjan/hadoop_record<br /></li></ul> <p>
 2. Hey Jute<br />Tabs and newlines are good and all<br />For lots of data, don't do that<br />
 3. don't make it bad...<br />Hadoop has a native data storage format called Hadoop Record or Jute<br />org.apache.hadoop.record<br />http://en.wikipedia.org/wiki/Jute<br />
 4. take a data structure<br />There is a Data Definition Language!<br />module links {<br />class Link {<br />ustring URL;<br />boolean isRelative;<br />ustring anchorText;<br />};<br />} <br />
 5. and make it better<br />And a compiler<br />$ rcc -lc++ inclrec.jr testrec.jr<br />namespace inclrec {<br />class RI :<br />public hadoop::Record {<br />private:<br />int32_t I32;<br />double D;<br />std::string S;<br />
 6. remember, to only use C++/Java<br />$ rcc --help<br />Usage: rcc --language<br />[java|c++] ddl-files<br />
 7. then you can start to make it better<br />I wanted it in Python<br />Need 2 parts:<br />Parsing library and <br />DDL translator<br />I only did the first part<br />If you need the second part, let me know<br />
 8. Hey Jute don't be afraid<br />
 9. you were made to go out and get her<br />http://github.com/ptarjan/hadoop_record<br />
 10. the minute you let her under your skin<br />I bet you thought I was done with Hey Jude references, eh?<br />How I built it<br />Ply == lex and yacc<br />Parser == 234 lines including tests!<br />Outputs generic data types<br />You have to do the class transform yourself<br />You can use my lex and yacc stuff in your language of choice<br />
 11. and any time you feel the pain<br />Parsing the binary format is hard<br />Vector vs struct???<br />struct = "s{" record *("," record) "}"<br />vector = "v{" [record *("," record)] "}"<br />LazyString: don't decode if not needed<br />99% of my Hadoop time was decoding strings I didn't need<br />Binary on disk -&gt; CSV -&gt; Python == wasteful<br />Hadoop unpacks zip files; name it .mod<br />
 12. nanananana<br />Future work<br />DDL Converter<br />Integrate it officially<br />Record writer (should be easy)<br />SequenceFileAsOutputFormat<br />Integrate your feedback<br /></p>
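
The struct/vector grammar and the LazyString idea from slide 11 can be sketched together in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the hadoop_record repository (which uses Ply): the names `LazyString`, `parse`, `_record` and `_items` are hypothetical, structs become tuples and vectors become lists as a stand-in for "generic data types", and every leaf is treated as raw bytes running up to the next "," or "}", ignoring whatever escaping or type prefixes the real serialization uses.

```python
class LazyString:
    """Keeps the raw bytes and decodes them to str only when asked."""
    __slots__ = ("_raw", "_text")

    def __init__(self, raw):
        self._raw = raw
        self._text = None

    @property
    def decoded(self):
        # True once str() has been called at least once.
        return self._text is not None

    def __str__(self):
        if self._text is None:
            self._text = self._raw.decode("utf-8")
        return self._text

    def __repr__(self):
        return "LazyString(%r)" % (self._raw,)


def parse(data):
    """Parse one record from bytes into tuples, lists and LazyStrings."""
    value, pos = _record(data, 0)
    if pos != len(data):
        raise ValueError("trailing data at offset %d" % pos)
    return value


def _record(data, pos):
    head = data[pos:pos + 2]
    if head == b"s{":
        # struct = "s{" record *("," record) "}"  (at least one record)
        items, pos = _items(data, pos + 2)
        if not items:
            raise ValueError("struct needs at least one record")
        return tuple(items), pos
    if head == b"v{":
        # vector = "v{" [record *("," record)] "}"  (may be empty)
        return _items(data, pos + 2)
    # Assumption: a leaf is everything up to the next "," or "}".
    end = pos
    while end < len(data) and data[end] not in b",}":
        end += 1
    return LazyString(data[pos:end]), end


def _items(data, pos):
    """Read comma-separated records up to and including the closing brace."""
    items = []
    if data[pos:pos + 1] == b"}":
        return items, pos + 1
    while True:
        value, pos = _record(data, pos)
        items.append(value)
        nxt = data[pos:pos + 1]
        if nxt == b"}":
            return items, pos + 1
        if nxt != b",":
            raise ValueError("expected ',' or '}' at offset %d" % pos)
        pos += 1


# Usage: only the field that is actually read gets decoded.
link = parse(b"s{http://a,T,v{x,y}}")
print(str(link[0]))       # http://a
print(link[1].decoded)    # False
```

The only difference between the two rules is that a vector may be empty while a struct may not, which is why the sketch shares one helper for the item list and checks the struct case separately.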
