avro
![Page 1: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/1.jpg)
Apache Avro
![Page 2: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/2.jpg)
Avro
- Etymology & History
- Sexy Tractors
- Project Drivers & Overview
- Serialization
- RPC
- Hadoop Support
![Page 3: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/3.jpg)
Etymology
- British aircraft manufacturer
- 1910-1963
![Page 4: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/4.jpg)
History
- Doug Cutting – Cloudera, Hadoop project founder
- 2002 – Nutch
- 2004 – Google GFS, MapReduce whitepapers
- 2005 – NDFS & MR, Writable & SequenceFile
- 2006 – Hadoop split from Nutch, renamed NDFS to HDFS
- 2007 – Yahoo gets involved, HBase, Pig, ZooKeeper
- 2008 – TeraSort contest winner, Hive, Mahout, Cassandra
- 2009 – Oozie, Flume, Hue
![Page 5: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/5.jpg)
History
- Underlying serialization system basically unchanged
- Additional language support and data formats
- Language, data format combinatorial explosion
  - C++ JSON to Java BSON
  - Python Smile to PHP CSV
- Apr 2009 – Avro proposal
- May 2010 – Top-level project
![Page 6: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/6.jpg)
Sexy Tractors
- Data serialization tools, like tractors, aren’t sexy
- They should be!
- Dollar for dollar, storage capacity has increased exponentially, doubling every 1.5-2 years
- Throughput of magnetic storage and network has not maintained this pace
- Distributed systems are the norm
- Efficient data serialization techniques and tools are vital
![Page 7: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/7.jpg)
Project Drivers
- Common data format for serialization and RPC
- Dynamic
- Expressive
- Efficient
- File format
  - Well defined
  - Standalone
  - Splittable & compressed
![Page 8: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/8.jpg)
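Biased Comparison

| | CSV | XML/JSON | SequenceFile | Thrift & PB | Avro |
|---|---|---|---|---|---|
| Language Independent | Yes | Yes | No | Yes | Yes |
| Expressive | No | Yes | Yes | Yes | Yes |
| Efficient | No | No | Yes | Yes | Yes |
| Dynamic | Yes | Yes | No | No | Yes |
| Standalone | ? | Yes | No | No | Yes |
| Splittable | ? | ? | Yes | ? | Yes |
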
![Page 9: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/9.jpg)
Project Overview
- Specification based design
- Dynamic implementations
- File format
- Schemas
  - Must support JSON implementation
  - IDL often supported
  - Evolvable
- First class Hadoop support
![Page 10: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/10.jpg)
Specification Based Design
- Schemas
- Encoding
- Sort order
- Object container files
- Codecs
- Protocol
- Protocol wire format
- Schema resolution
![Page 11: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/11.jpg)
Specification Based Design
- Schemas
  - Primitive types
    - Null, boolean, int, long, float, double, bytes, string
  - Complex types
    - Records, enums, arrays, maps, unions and fixed
  - Named types
    - Records, enums, fixed
    - Name & namespace
  - Aliases
- http://avro.apache.org/docs/current/spec.html#schemas
![Page 12: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/12.jpg)
Schema Example

    {
      "namespace": "com.emoney",
      "name": "LogMessage",
      "type": "record",
      "fields": [
        {"name": "level", "type": "string", "comment": "this is ignored"},
        {"name": "message", "type": "string", "description": "this is the message"},
        {"name": "dateTime", "type": "long"},
        {"name": "exceptionMessage", "type": ["null", "string"]}
      ]
    }

log-message.avpr
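
To make the binary form of this record concrete, here is a minimal hand-rolled sketch. It assumes the encoding rules from the Avro specification: a long is a zig-zag varint, a string is a length followed by UTF-8 bytes, a record is just its fields in declaration order, and a union is the branch index followed by the value. The class and method names (`LogMessageEncodingSketch`, `encode`) are invented for illustration and are not part of the Avro API; real code would go through a DatumWriter.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Illustrative only: hand-encodes the LogMessage record from the schema above.
final class LogMessageEncodingSketch {

    // int and long values are written as zig-zag varints (7 bits per byte, low bits first).
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);            // zig-zag: small magnitudes become small unsigned values
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80)); // high bit set means "more bytes follow"
            z >>>= 7;
        }
        out.write((int) z);
    }

    // A string is its UTF-8 byte length (as a long) followed by the bytes.
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    // Fields are written in schema order; no names or type tags appear in the data.
    static byte[] encode(String level, String message, long dateTime, String exceptionMessage) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, level);
        writeString(out, message);
        writeLong(out, dateTime);
        if (exceptionMessage == null) {
            writeLong(out, 0);                    // union branch 0 is "null": no payload
        } else {
            writeLong(out, 1);                    // union branch 1 is "string"
            writeString(out, exceptionMessage);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encode("INFO", "hi", 1L, null);
        System.out.println(bytes.length + " bytes");
    }
}
```

Under these assumptions, `encode("INFO", "hi", 1L, null)` produces the 10 bytes `08 49 4E 46 4F 04 68 69 02 00`. Because nothing but values is written, the reader needs the writer's schema to make sense of them.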
![Page 13: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/13.jpg)
Specification Based Design
- Encodings
  - JSON – for debugging
  - Binary
- Sort order
  - Efficient sorting by system other than writer
  - Sorting binary-encoded data without deserialization
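
As a rough illustration of sorting without deserialization, the sketch below compares two binary-encoded strings directly on their bytes. It assumes the specification's rules that a string is a zig-zag varint length followed by UTF-8 bytes, and that strings sort by Unicode code point, which coincides with unsigned byte order for UTF-8. `BinaryCompareSketch` and its methods are hypothetical names, not the library's own comparator.

```java
// Illustrative only: orders two Avro-encoded strings without building String objects.
final class BinaryCompareSketch {

    // Reads a zig-zag varint starting at pos[0] and advances pos[0] past it.
    static long readLong(byte[] buf, int[] pos) {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos[0]++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);              // undo zig-zag
    }

    // Negative, zero, or positive, like Comparator.compare.
    static int compareEncodedStrings(byte[] a, byte[] b) {
        int[] pa = {0};
        int[] pb = {0};
        int la = (int) readLong(a, pa);
        int lb = (int) readLong(b, pb);
        int n = Math.min(la, lb);
        for (int i = 0; i < n; i++) {
            int x = a[pa[0] + i] & 0xFF;          // compare as unsigned bytes
            int y = b[pb[0] + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return la - lb;                           // equal prefix: the shorter string sorts first
    }

    public static void main(String[] args) {
        byte[] ab = {4, 97, 98};                  // "ab": length 2 (zig-zag 4), then 'a', 'b'
        byte[] b = {2, 98};                       // "b": length 1 (zig-zag 2), then 'b'
        System.out.println(compareEncodedStrings(ab, b) < 0);
    }
}
```

A full comparator would walk the schema and apply the matching rule per field; this sketch covers only the string case.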
![Page 14: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/14.jpg)
Specification Based Design
- Object container files
  - Schema
  - Serialized data written to binary-encoded blocks
  - Blocks may be compressed
  - Synchronization markers
- Codecs
  - Null
  - Deflate
  - Snappy (optional)
  - LZO (future)
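
To show what "blocks may be compressed" amounts to for the Deflate codec, here is a small round-trip sketch using only `java.util.zip`. It assumes, per the Avro specification, that the codec uses raw deflate (RFC 1951) with no zlib header or checksum, hence the `nowrap` flag. `DeflateBlockSketch` is a made-up name; it stands in for what a data file writer would do to each serialized block.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative only: compresses and restores one block of serialized data.
final class DeflateBlockSketch {

    static byte[] compress(byte[] block, int level) {
        Deflater deflater = new Deflater(level, true); // true = raw deflate, no zlib wrapper
        deflater.setInput(block);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] data) {
        Inflater inflater = new Inflater(true);        // must match the raw format used above
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break;                             // no more progress possible; avoid spinning
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt block", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] block = new byte[1000];                 // highly repetitive sample data
        byte[] packed = compress(block, 1);
        System.out.println(block.length + " -> " + packed.length + " bytes");
    }
}
```

Compressing per block rather than per file is what keeps the file splittable: each block can be decompressed on its own once a reader has found its start.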
![Page 15: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/15.jpg)
Specification Based Design
- Protocol
  - Protocol name
  - Namespace
  - Types
    - Named types used in messages
  - Messages
    - Uniquely named message
    - Request
    - Response
    - Errors
- Wire format
  - Transports
  - Framing
  - Handshake
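
The framing layer can be sketched in a few lines. This assumes the scheme described in the Avro specification: a message is a series of buffers, each prefixed by a four-byte big-endian length, terminated by a zero-length buffer. `FramingSketch` and its methods are hypothetical names for illustration; the real transports handle this internally.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: frames and unframes a message as length-prefixed buffers.
final class FramingSketch {

    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] b : buffers) {
            if (b.length == 0) {
                continue;                              // zero length is reserved for the terminator
            }
            // ByteBuffer is big-endian by default
            out.write(ByteBuffer.allocate(4).putInt(b.length).array(), 0, 4);
            out.write(b, 0, b.length);
        }
        out.write(new byte[4], 0, 4);                  // zero-length buffer ends the message
        return out.toByteArray();
    }

    // Expects a complete, terminated message; throws BufferUnderflowException otherwise.
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed);
        List<byte[]> buffers = new ArrayList<>();
        for (int len = in.getInt(); len != 0; len = in.getInt()) {
            byte[] b = new byte[len];
            in.get(b);
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        byte[] framed = frame(Arrays.asList(new byte[]{1, 2, 3}));
        System.out.println(Arrays.toString(framed));
    }
}
```

Framing lets a receiver read a whole message off a stream without understanding its contents, which is why it sits below the handshake and the call encoding.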
![Page 16: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/16.jpg)
Protocol

    {
      "namespace": "com.acme",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "types": [
        {"name": "Greeting", "type": "record", "fields": [
          {"name": "message", "type": "string"}]},
        {"name": "Curse", "type": "error", "fields": [
          {"name": "message", "type": "string"}]}
      ],
      "messages": {
        "hello": {
          "doc": "Say hello.",
          "request": [{"name": "greeting", "type": "Greeting"}],
          "response": "Greeting",
          "errors": ["Curse"]
        }
      }
    }

![Page 17: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/17.jpg)
Schema Resolution & Evolution
- Writer’s schema always provided to reader
- Compare schema used by writer & schema expected by reader
  - Fields that match name & type are read
  - Fields written that don’t match are skipped
  - Expected fields not written can be identified
    - Error or provide default value
- Same features as provided by numeric field ids
  - Keeps fields symbolic, no index IDs written in data
- Allows for projections
  - Very efficient at skipping fields
- Aliases
  - Allows projections from 2 different types using aliases
    - User transaction: count, date
    - Batch: count, date
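
The resolution rules above can be sketched with plain maps. This is a simplified model, not the library's resolver: records are maps from field name to value, the reader's schema is a map from field name to default, and type matching and aliases are left out. `ResolutionSketch`, `resolve`, and the `NO_DEFAULT` sentinel are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: name-based field resolution between a written record and a reader's expectations.
final class ResolutionSketch {

    // Marks a reader field that declares no default value.
    static final Object NO_DEFAULT = new Object();

    static Map<String, Object> resolve(Map<String, Object> written, Map<String, Object> readerFields) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : readerFields.entrySet()) {
            String name = field.getKey();
            if (written.containsKey(name)) {
                result.put(name, written.get(name));   // written and expected: read it
            } else if (field.getValue() != NO_DEFAULT) {
                result.put(name, field.getValue());    // expected but not written: use the default
            } else {
                throw new IllegalStateException("no value or default for field: " + name);
            }
        }
        return result;                                 // written fields the reader did not ask for are skipped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "INFO");
        written.put("message", "hi");
        written.put("thread", "main");                 // unknown to the reader, so it is dropped

        Map<String, Object> reader = new LinkedHashMap<>();
        reader.put("level", NO_DEFAULT);
        reader.put("message", NO_DEFAULT);
        reader.put("host", "unknown");                 // added later, with a default

        System.out.println(resolve(written, reader));
    }
}
```

A projection is the same mechanism with a reader schema that simply lists fewer fields.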
![Page 18: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/18.jpg)
Implementations

| Implementation | Core | Data file | Codec | RPC/HTTP |
|---|---|---|---|---|
| C | Yes | Yes | Deflate | Yes |
| C++ | Yes | Yes | ? | Yes |
| C# | Yes | No | N/A | No |
| Java | Yes | Yes | Deflate, Snappy | Yes |
| Python | Yes | Yes | Deflate | Yes |
| Ruby | Yes | Yes | Deflate | Yes |
| PHP | Yes | Yes | ? | No |

- Core – parse schemas, read & write binary data for a schema
- Data file – read & write Avro data files
- Codec – supported codecs
- RPC/HTTP – make and receive calls over HTTP
![Page 19: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/19.jpg)
API
- Generic
  - Generic attribute/value data structure
  - Best suited for dynamic processing
- Specific
  - Each record corresponds to a different kind of object in the programming language
  - RPC systems typically use this
- Reflect
  - Schemas generated via reflection
  - Converting an existing codebase to use Avro
![Page 20: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/20.jpg)
API
- Low-level
  - Schema
  - Encoders
  - DatumWriter
  - DatumReader
- High-level
  - DataFileWriter
  - DataFileReader
![Page 21: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/21.jpg)
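Java Example

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;

    Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
    OutputStream outputStream = new FileOutputStream("data.avro");
    DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
    // The codec has to be chosen before create() writes the file header.
    writer.setCodec(CodecFactory.deflateCodec(1));
    writer.create(schema, outputStream);
    writer.append(new Message());
    writer.close();
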
![Page 22: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/22.jpg)
Java Example

    // The writer's schema is read from the file header, so none is passed here.
    DataFileReader<Message> reader = new DataFileReader<Message>(
        new File("data.avro"), new GenericDatumReader<Message>());
    for (Message next : reader) {
      System.out.println("next: " + next);
    }
    reader.close();

![Page 23: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/23.jpg)
RPC
- Server
  - SocketServer (non-standard)
  - SaslSocketServer
  - HttpServer
  - NettyServer
  - DatagramServer (non-standard)
- Responder
  - Generic
  - Reflect
  - Specific
- Client
  - Corresponding Transceiver
  - LocalTransceiver
- Requestor
![Page 24: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/24.jpg)
RPC
- Client
  - Corresponding Transceiver for each server
  - LocalTransceiver
- Requestor
![Page 25: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/25.jpg)
RPC Server

    Protocol protocol = Protocol.parse(new File("protocol.avpr"));
    InetSocketAddress address = new InetSocketAddress("localhost", 33333);
    GenericResponder responder = new GenericResponder(protocol) {
      @Override
      public Object respond(Protocol.Message message, Object request) throws Exception {
        . . .
      }
    };
    new SocketServer(responder, address).join();

![Page 26: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/26.jpg)
Hadoop Support
- File writers and readers
- Replacing RPC with Avro
  - In Flume already
- Pig support is in
- Splittable
  - Set block size when writing
- Tether jobs
  - Connector framework for other languages
  - Hadoop Pipes
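
Splittability rests on the synchronization markers in the container file. The sketch below shows the idea under one assumption taken from the Avro specification: every block is followed by the file's 16-byte sync marker, so a reader handed an arbitrary byte offset can scan forward to the next marker and start at the block after it. `SyncScanSketch` and `nextBlockStart` are hypothetical names, and a real reader would scan a stream rather than an in-memory array.

```java
// Illustrative only: finds where the next block begins after an arbitrary split offset.
final class SyncScanSketch {

    static final int SYNC_SIZE = 16;

    // Returns the offset just past the first sync marker found at or after 'from', or -1 if none.
    static int nextBlockStart(byte[] file, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + SYNC_SIZE <= file.length; i++) {
            int j = 0;
            while (j < SYNC_SIZE && file[i + j] == sync[j]) {
                j++;
            }
            if (j == SYNC_SIZE) {
                return i + SYNC_SIZE;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        for (int i = 0; i < SYNC_SIZE; i++) {
            sync[i] = (byte) (i + 1);                  // stand-in marker; real ones are random per file
        }
        byte[] file = new byte[5 + SYNC_SIZE];         // 5 bytes of block data, then the marker
        System.arraycopy(sync, 0, file, 5, SYNC_SIZE);
        System.out.println(nextBlockStart(file, sync, 0));
    }
}
```

The block size chosen when writing therefore bounds how far a reader may have to scan before it finds a usable starting point.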
![Page 27: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/27.jpg)
Future
- RPC
  - HBase, Cassandra, Hadoop core
- Hive in progress
- Tether jobs
  - Actual MapReduce implementations in other languages
![Page 28: Avro](https://reader035.vdocuments.mx/reader035/viewer/2022062418/554f8e61b4c905d25b8b507b/html5/thumbnails/28.jpg)
Avro
- Dynamic
- Expressive
- Efficient
- Specification based design
- Language implementations are fairly solid
- Serialization or RPC or both
- First class Hadoop support
- Currently 1.5.1
- Sexy tractors