Apache Avro

Upload: eric-turcotte

Post on 10-May-2015


TRANSCRIPT

Page 1: Avro

Apache Avro

Page 2: Avro

Avro
  Etymology & History
  Sexy Tractors
  Project Drivers & Overview
  Serialization
  RPC
  Hadoop Support

Page 3: Avro

Etymology
  British aircraft manufacturer
  1910-1963

Page 4: Avro

History
  Doug Cutting – Cloudera, Hadoop project founder
  2002 – Nutch
  2004 – Google GFS, MapReduce whitepapers
  2005 – NDFS & MR, Writable & SequenceFile
  2006 – Hadoop split from Nutch, renamed NDFS to HDFS
  2007 – Yahoo gets involved, HBase, Pig, ZooKeeper
  2008 – TeraSort contest winner, Hive, Mahout, Cassandra
  2009 – Oozie, Flume, Hue

Page 5: Avro

History
  Underlying serialization system basically unchanged
  Additional language support and data formats
  Language, data format combinatorial explosion
    C++ JSON to Java BSON
    Python Smile to PHP CSV
  Apr 2009 – Avro proposal
  May 2010 – Top-level project

Page 6: Avro

Sexy Tractors
  Data serialization tools, like tractors, aren’t sexy
  They should be!
  Dollar for dollar storage capacity has increased exponentially, doubling every 1.5-2 years
  Throughput of magnetic storage and network has not maintained this pace
  Distributed systems are the norm
  Efficient data serialization techniques and tools are vital

Page 7: Avro

Project Drivers
  Common data format for serialization and RPC
  Dynamic
  Expressive
  Efficient
  File format
    Well defined
    Standalone
    Splittable & compressed

Page 8: Avro

Biased Comparison

                      CSV   XML/JSON   SequenceFile   Thrift & PB   Avro
Language Independent  Yes   Yes        No             Yes           Yes
Expressive            No    Yes        Yes            Yes           Yes
Efficient             No    No         Yes            Yes           Yes
Dynamic               Yes   Yes        No             No            Yes
Standalone            ?     Yes        No             No            Yes
Splittable            ?     ?          Yes            ?             Yes

Page 9: Avro

Project Overview
  Specification based design
  Dynamic implementations
  File format
  Schemas
    Must support JSON implementation
    IDL often supported
    Evolvable
  First class Hadoop support

Page 10: Avro

Specification Based Design
  Schemas
  Encoding
  Sort order
  Object container files
  Codecs
  Protocol
  Protocol wire format
  Schema resolution

Page 11: Avro

Specification Based Design
  Schemas
    Primitive types
      Null, boolean, int, long, float, double, bytes, string
    Complex types
      Records, enums, arrays, maps, unions and fixed
    Named types
      Records, enums, fixed
      Name & namespace
    Aliases

http://avro.apache.org/docs/current/spec.html#schemas

Page 12: Avro

Schema Example

    {
      "namespace": "com.emoney",
      "name": "LogMessage",
      "type": "record",
      "fields": [
        {"name": "level", "type": "string", "comment": "this is ignored"},
        {"name": "message", "type": "string", "description": "this is the message"},
        {"name": "dateTime", "type": "long"},
        {"name": "exceptionMessage", "type": ["null", "string"]}
      ]
    }

log-message.avpr

Page 13: Avro

Specification Based Design
  Encodings
    JSON – for debugging
    Binary
  Sort order
    Efficient sorting by system other than writer
    Sorting binary-encoded data without deserialization
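To make the binary encoding concrete, here is a minimal hand-rolled sketch in plain Java (not the Avro library's own encoder) of how a record for the LogMessage schema from the earlier slide could be laid out. It assumes the rules from the Avro specification: longs are zig-zag coded and written as variable-length integers, strings are a length followed by UTF-8 bytes, and a union is a branch index followed by the value. The names BinaryEncodingSketch, writeLong, writeString and encodeLogMessage are illustrative, not part of any Avro API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hand-rolled sketch of Avro's binary encoding; not the Avro library's encoder. */
final class BinaryEncodingSketch {

    /** int and long: zig-zag coded, then written 7 bits at a time, low bits first. */
    static void writeLong(ByteArrayOutputStream out, long value) {
        long zigzag = (value << 1) ^ (value >> 63);    // 0->0, -1->1, 1->2, -2->3, ...
        while ((zigzag & ~0x7FL) != 0) {
            out.write((int) ((zigzag & 0x7F) | 0x80)); // high bit set: more bytes follow
            zigzag >>>= 7;
        }
        out.write((int) zigzag);
    }

    /** string: byte length written as a long, followed by the UTF-8 bytes. */
    static void writeString(ByteArrayOutputStream out, String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** A LogMessage record: field values in schema order, with no names or tags. */
    static byte[] encodeLogMessage(String level, String message, long dateTime,
                                   String exceptionMessage) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, level);
        writeString(out, message);
        writeLong(out, dateTime);
        // union ["null", "string"]: branch index first, then the value (null has no bytes)
        if (exceptionMessage == null) {
            writeLong(out, 0);
        } else {
            writeLong(out, 1);
            writeString(out, exceptionMessage);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] encoded = encodeLogMessage("INFO", "started", 64L, null);
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02x ", b));
        }
        System.out.println(encoded.length + " bytes: " + hex.toString().trim());
    }
}
```

No field names or numeric tags appear in the output, which is why a reader needs the writer's schema to decode it.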

Page 14: Avro

Specification Based Design
  Object container files
    Schema
    Serialized data written to binary-encoded blocks
    Blocks may be compressed
    Synchronization markers
  Codecs
    Null
    Deflate
    Snappy (optional)
    LZO (future)
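The role of the synchronization markers can be illustrated with a small sketch. It assumes, as the Avro specification describes, that each block in a container file is followed by the file's 16-byte sync marker, so a reader starting at an arbitrary offset can scan forward to the next block boundary. The sketch deliberately omits the object count and byte size that prefix each real block, and the names SyncMarkerSketch, joinBlocks and nextBlockStart are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch of locating block boundaries by sync marker; not the Avro file reader. */
final class SyncMarkerSketch {

    static final int SYNC_SIZE = 16; // Avro data files use a 16-byte marker

    /** Concatenates payloads, writing the sync marker after each one. */
    static byte[] joinBlocks(byte[] sync, byte[]... payloads) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] payload : payloads) {
            out.write(payload, 0, payload.length);
            out.write(sync, 0, sync.length);
        }
        return out.toByteArray();
    }

    /**
     * Returns the offset just past the first sync marker that starts at or after
     * `from`, or -1 if there is none. A reader handed a split that begins in the
     * middle of a block would resume decoding at the returned offset.
     */
    static int nextBlockStart(byte[] data, byte[] sync, int from) {
        for (int i = Math.max(from, 0); i + sync.length <= data.length; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + sync.length), sync)) {
                return i + sync.length;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] sync = new byte[SYNC_SIZE];
        for (int i = 0; i < SYNC_SIZE; i++) {
            sync[i] = (byte) (0xA0 + i); // fixed marker for the demo; real files use random bytes
        }
        byte[] data = joinBlocks(sync,
                "aaaa".getBytes(StandardCharsets.US_ASCII),
                "bbbbbb".getBytes(StandardCharsets.US_ASCII));
        System.out.println("next block after offset 0 starts at " + nextBlockStart(data, sync, 0));
        System.out.println("next block after offset 5 starts at " + nextBlockStart(data, sync, 5));
    }
}
```

This boundary scan is what lets a file be divided among several readers without any of them parsing it from the beginning.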

Page 15: Avro

Specification Based Design
  Protocol
    Protocol name
    Namespace
    Types
      Named types used in messages
    Messages
      Uniquely named message
        Request
        Response
        Errors
  Wire format
    Transports
    Framing
    Handshake
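The framing step of the wire format can be sketched in a few lines. This assumes the layout given in the Avro specification: a message is a series of buffers, each prefixed by a four-byte big-endian length, and the message ends with a zero-length buffer. The class and method names (FramingSketch, frame, unframe) are illustrative only and are not the Avro library's API.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of Avro-style message framing; not the Avro library's transport code. */
final class FramingSketch {

    /** Writes each buffer as a 4-byte big-endian length plus data, then a zero-length terminator. */
    static byte[] frame(List<byte[]> buffers) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] buffer : buffers) {
            if (buffer.length == 0) {
                continue; // a zero length is reserved for the end-of-message terminator
            }
            writeLength(out, buffer.length);
            out.write(buffer, 0, buffer.length);
        }
        writeLength(out, 0);
        return out.toByteArray();
    }

    /** Reads buffers until the zero-length terminator is reached. */
    static List<byte[]> unframe(byte[] framed) {
        ByteBuffer in = ByteBuffer.wrap(framed); // ByteBuffer is big-endian by default
        List<byte[]> buffers = new ArrayList<>();
        while (true) {
            int length = in.getInt();
            if (length == 0) {
                return buffers;
            }
            byte[] buffer = new byte[length];
            in.get(buffer);
            buffers.add(buffer);
        }
    }

    private static void writeLength(ByteArrayOutputStream out, int length) {
        out.write(ByteBuffer.allocate(4).putInt(length).array(), 0, 4);
    }

    public static void main(String[] args) {
        byte[] framed = frame(Arrays.asList(
                "hello".getBytes(StandardCharsets.UTF_8),
                "hi".getBytes(StandardCharsets.UTF_8)));
        System.out.println(framed.length + " framed bytes, "
                + unframe(framed).size() + " buffers recovered");
    }
}
```

Framing lets a receiver read a complete message from a stream without understanding its contents, so the same layer can sit under different transports.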

Page 16: Avro

Protocol

    {
      "namespace": "com.acme",
      "protocol": "HelloWorld",
      "doc": "Protocol Greetings",
      "types": [
        {"name": "Greeting", "type": "record", "fields": [
          {"name": "message", "type": "string"}]},
        {"name": "Curse", "type": "error", "fields": [
          {"name": "message", "type": "string"}]}
      ],
      "messages": {
        "hello": {
          "doc": "Say hello.",
          "request": [{"name": "greeting", "type": "Greeting"}],
          "response": "Greeting",
          "errors": ["Curse"]
        }
      }
    }

Page 17: Avro

Schema Resolution & Evolution
  Writer's schema always provided to reader
  Compare schema used by writer & schema expected by reader
  Fields that match name & type are read
  Fields written that don’t match are skipped
  Expected fields not written can be identified
    Error or provide default value
  Same features as provided by numeric field ids
    Keeps fields symbolic, no index IDs written in data
  Allows for projections
    Very efficient at skipping fields
  Aliases
    Allows projections from 2 different types using aliases
    User transaction
      Count, date
    Batch
      Count, date
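The resolution rules above can be sketched with plain Java collections. This is a simplified model, not Avro's resolver: it matches fields by name only (type checks and aliases are left out), reads matching fields, skips written fields the reader did not ask for, and falls back to a default or raises an error for expected fields that were not written. ResolutionSketch, ReaderField and resolve are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of reader/writer schema resolution; name matching only. */
final class ResolutionSketch {

    /** A field the reader expects; hasDefault separates "no default" from a null default. */
    static final class ReaderField {
        final String name;
        final boolean hasDefault;
        final Object defaultValue;

        private ReaderField(String name, boolean hasDefault, Object defaultValue) {
            this.name = name;
            this.hasDefault = hasDefault;
            this.defaultValue = defaultValue;
        }

        static ReaderField required(String name) {
            return new ReaderField(name, false, null);
        }

        static ReaderField withDefault(String name, Object defaultValue) {
            return new ReaderField(name, true, defaultValue);
        }
    }

    /** Builds the reader's view of a record from the fields the writer wrote. */
    static Map<String, Object> resolve(Map<String, Object> written, List<ReaderField> expected) {
        Map<String, Object> result = new LinkedHashMap<>();
        for (ReaderField field : expected) {
            if (written.containsKey(field.name)) {
                result.put(field.name, written.get(field.name)); // matched by name: read
            } else if (field.hasDefault) {
                result.put(field.name, field.defaultValue);      // not written: use default
            } else {
                throw new IllegalStateException("No value or default for field: " + field.name);
            }
        }
        return result; // written fields the reader did not ask for are simply skipped
    }

    public static void main(String[] args) {
        Map<String, Object> written = new LinkedHashMap<>();
        written.put("level", "INFO");
        written.put("message", "started");
        written.put("dateTime", 64L);
        List<ReaderField> expected = Arrays.asList(
                ReaderField.required("level"),
                ReaderField.withDefault("host", "unknown"));
        System.out.println(resolve(written, expected));
    }
}
```

Here the reader's view is a projection: it keeps level, fills in host from its default, and never materializes message or dateTime.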

Page 18: Avro

Implementations

Implementation  Core  Data file  Codec            RPC/HTTP
C               Yes   Yes        Deflate          Yes
C++             Yes   Yes        ?                Yes
C#              Yes   No         N/A              No
Java            Yes   Yes        Deflate, Snappy  Yes
Python          Yes   Yes        Deflate          Yes
Ruby            Yes   Yes        Deflate          Yes
PHP             Yes   Yes        ?                No

Core – parse schemas, read & write binary data for a schema
Data file – read & write Avro data files
Codec – supported codecs
RPC/HTTP – make and receive calls over HTTP

Page 19: Avro

API
  Generic
    Generic attribute/value data structure
    Best suited for dynamic processing
  Specific
    Each record corresponds to a different kind of object in the programming language
    RPC systems typically use this
  Reflect
    Schemas generated via reflection
    Converting an existing codebase to use Avro

Page 20: Avro

API
  Low-level
    Schema
    Encoders
    DatumWriter
    DatumReader
  High-level
    DataFileWriter
    DataFileReader

Page 21: Avro

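Java Example

    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import org.apache.avro.Schema;
    import org.apache.avro.file.CodecFactory;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumWriter;

    // Message is the application's record class (not shown on the slide).
    Schema schema = Schema.parse(getClass().getResourceAsStream("schema.avpr"));
    OutputStream outputStream = new FileOutputStream("data.avro");
    DataFileWriter<Message> writer =
        new DataFileWriter<Message>(new GenericDatumWriter<Message>(schema));
    writer.setCodec(CodecFactory.deflateCodec(1)); // must be set before create()
    writer.create(schema, outputStream);           // writes the file header, including the schema
    writer.append(new Message());
    writer.close();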

Page 22: Avro

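Java Example

    import java.io.File;
    import org.apache.avro.file.DataFileReader;
    import org.apache.avro.generic.GenericDatumReader;

    // No schema is passed in: the reader takes the writer's schema from the file header.
    DataFileReader<Message> reader = new DataFileReader<Message>(
        new File("data.avro"), new GenericDatumReader<Message>());
    for (Message next : reader) {
        System.out.println("next: " + next);
    }
    reader.close();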

Page 23: Avro

RPC
  Server
    SocketServer (non-standard)
    SaslSocketServer
    HttpServer
    NettyServer
    DatagramServer (non-standard)
  Responder
    Generic
    Reflect
    Specific
  Client
    Corresponding Transceiver
    LocalTransceiver
  Requestor

Page 24: Avro

RPC
  Client
    Corresponding Transceiver for each server
    LocalTransceiver
  Requestor

Page 25: Avro

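RPC Server

    import java.io.File;
    import java.net.InetSocketAddress;
    import org.apache.avro.Protocol;
    import org.apache.avro.ipc.SocketServer;
    import org.apache.avro.ipc.generic.GenericResponder;

    Protocol protocol = Protocol.parse(new File("protocol.avpr"));
    InetSocketAddress address = new InetSocketAddress("localhost", 33333);

    // The responder receives each decoded request and returns the response datum.
    GenericResponder responder = new GenericResponder(protocol) {
        @Override
        public Object respond(Protocol.Message message, Object request) throws Exception {
            . . .
        }
    };

    new SocketServer(responder, address).join();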

Page 26: Avro

Hadoop Support
  File writers and readers
  Replacing RPC with Avro
    In Flume already
  Pig support is in
  Splittable
    Set block size when writing
  Tether jobs
    Connector framework for other languages
    Hadoop Pipes

Page 27: Avro

Future
  RPC
    HBase, Cassandra, Hadoop core
  Hive in progress
  Tether jobs
    Actual MapReduce implementations in other languages

Page 28: Avro

Avro
  Dynamic
  Expressive
  Efficient
  Specification based design
  Language implementations are fairly solid
  Serialization or RPC or both
  First class Hadoop support
  Currently 1.5.1
  Sexy tractors