
XML Data Binding: Encoding for High-Performance Content-Based Event Routing

Gail Kaiser
Phil Gross
Columbia University
Programming Systems Lab

Overview

PSL Intro
MEET Project
Encoding Conversion Efficiency
Encoding Size Efficiency
Encoding Classification Efficiency

Programming Systems Lab

“PSL conducts research on Web technologies, collaborative work, virtual worlds, process/workflow, extended transaction models, software development environments and tools, software engineering, information management, and distributed programming systems”

Lately, lots of XML stuff

PSL XML-related Research

FlexML: Flexible XML
– Open-ended XML streams that may include “new” tags
– Dynamic schema and semantics discovery and composition
XUES: XML-based Universal Event Service
– Event Packager: Data mining over XML structured data
– Event Distiller: XML event poset pattern matching
– Learning new application-domain events to recognize
DISCUS: Decentralized Information Spaces for Composition and Unification of Services
– Rapid and secure application composition using Web Services
– Trust Evolution: PGP Trust + KeyNote + real-world business

MEET

Multiply Extensible Event Transport
Content-based multicast routing
Must be efficient enough for embedded and high-performance applications

MEET Motivations

Personal Life Recorder (sensor oriented)
GroupWork Recorder (computer/DB oriented)
Parallel/Grid computing
Distributed simulation
Battlefield C4I
Last, but not least:
– Dissertation submission

Relationship to Other Work

Generally modeling communication like this:
[Figure: Machine A (Relational) and Machine B (XML)]
What actually goes over the line is an afterthought
But with N-way Internet-scale communication
– Millions of publishers and subscribers
We can (must!) do better than ASCII text…
– Line speed => ≈250 assembly instructions per packet

MEET Extensibility

Want to scale up, to millions of pubs and subs

Want to scale down, to embedded and wireless

No single solution satisfactory at all scales
Composed of hot-swappable subsystems
– Router, transports, clock/causality, types, etc.

Why Types

Event data is not just an opaque bag of bits

Subscriptions are Boolean functions over events

Type safety would be nice
What type system to use?
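As a minimal sketch of "subscriptions are Boolean functions over events", the snippet below treats a subscription as a typed predicate. The `TemperatureEvent` class, its fields, and the `both` and `route` helpers are hypothetical names chosen for illustration; they are not part of MEET.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class TemperatureEvent:
    # Hypothetical typed event; the field names are illustrative only.
    sensor_id: int
    celsius: float


# A subscription is any Boolean function over a typed event.
Subscription = Callable[[TemperatureEvent], bool]


def both(a: Subscription, b: Subscription) -> Subscription:
    """Combine two subscriptions with logical AND."""
    return lambda event: a(event) and b(event)


is_hot: Subscription = lambda e: e.celsius > 30.0
from_sensor_7: Subscription = lambda e: e.sensor_id == 7
hot_on_sensor_7 = both(is_hot, from_sensor_7)


def route(event: TemperatureEvent, subscriptions: Dict[str, Subscription]) -> List[str]:
    """Return the names of the subscriptions whose predicate accepts the event."""
    return [name for name, pred in subscriptions.items() if pred(event)]
```

Because the event has a declared type, a checker can reject a predicate that reads a field the event does not have, which is one reading of "type safety would be nice".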

Initial MEET Type Design

Initial design calls for supporting Java, C#, and XML Schema defined objects “out of the box”

XML Schema used as Ur-language/Esperanto for conversions

Subscriptions are arbitrary boolean functions on datatypes

XML Schema is not ideal ur-type
– Excessively complex, verbose, etc.

Encodings for Efficiency

Java, C#, XML, ASN.1 have well-defined but proprietary encodings for instances

Would be nice to have an independent encoding scheme with some desirable properties missing from the above
– Fast serialization/deserialization
– Elimination of redundant information from message sequences
– Data organized for rapid classification/routing

Conversion Efficiency

Need to get to and from wire format as fast as possible

Leverage homogeneity to eliminate unnecessary conversions, e.g., network byte order

ECho system from Eisenhauer et al., Georgia Tech
– Using “native data” for ultra-low latency
– Necessary for HPC
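One way to picture "leverage homogeneity" is to let the sender write in its own byte order and tag the message, so the receiver swaps bytes only when the orders differ. The sketch below assumes a hypothetical one-byte order flag and a payload of 32-bit integers; it is an illustration of the idea, not ECho's actual wire format or API.

```python
import struct
import sys

# Hypothetical one-byte flag recording the sender's byte order.
LITTLE, BIG = 0, 1
NATIVE_FLAG = LITTLE if sys.byteorder == "little" else BIG
PREFIX = {LITTLE: "<", BIG: ">"}


def encode(values, flag=NATIVE_FLAG):
    """Write the flag, then the 32-bit integers in the sender's own byte order."""
    body = struct.pack(f"{PREFIX[flag]}{len(values)}i", *values)
    return bytes([flag]) + body


def decode(message):
    """Read the integers using the byte order named by the flag.

    When the flag matches the receiver's native order, the unpack step
    does not need to swap any bytes.
    """
    flag = message[0]
    count = (len(message) - 1) // 4
    return list(struct.unpack_from(f"{PREFIX[flag]}{count}i", message, 1))
```

Between two machines of the same architecture, `encode` and `decode` both run with the native prefix, so the always-convert-to-network-order step is avoided.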

Size Efficiency

Ideal for single message is self-describing data
With multiple messages of same type, one can pull out redundant type info, e.g., schema
Goal is to go further: If 90% of content of messages is the same, generate a new subtype with fixed values

From self-describing to all-schema is a continuum
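The continuum can be illustrated with three encodings of the same event: fully self-describing, values only against a shared schema, and a subtype whose constant fields are not sent at all. The `SCHEMA` tuple, the field names, and the use of JSON as the self-describing form are assumptions made for this sketch.

```python
import json
import struct

# Hypothetical schema, agreed once per stream instead of sent in every message.
# Each entry is (field name, struct format code).
SCHEMA = (("sensor_id", "H"), ("unit", "B"), ("reading", "f"))


def encode_self_describing(event):
    """Every message carries its own field names."""
    return json.dumps(event).encode("utf-8")


def _layout(schema, fixed):
    """Fields that still travel on the wire, plus their packed format."""
    fields = [(name, code) for name, code in schema if name not in fixed]
    return fields, "!" + "".join(code for _, code in fields)


def encode_with_schema(event, schema=SCHEMA, fixed=None):
    """Pack values only; fields listed in `fixed` are implied by the subtype."""
    fields, fmt = _layout(schema, fixed or {})
    return struct.pack(fmt, *(event[name] for name, _ in fields))


def decode_with_schema(data, schema=SCHEMA, fixed=None):
    """Rebuild the full event from packed values plus the subtype's fixed values."""
    fixed = fixed or {}
    fields, fmt = _layout(schema, fixed)
    event = dict(zip((name for name, _ in fields), struct.unpack(fmt, data)))
    event.update(fixed)
    return event
```

With this layout the schema-only form is 7 bytes, and a subtype that fixes `sensor_id` and `unit` shrinks each message to the 4-byte reading.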

Classification Efficiency

When bits start arriving serially at the router, would like to begin cut-through routing as soon as possible
– Avoid the curse of IP/IPv6: source address first
Want key routing bits as close to the front as possible
Want data in fixed locations

Fast Classifying: First Things First

In the packet, type info first (after magic)
– Would like to represent type codes as bit string with “most significant” info, e.g. parent type first, followed by subtype identifier, sub-subtype, etc.
– Need access to type hierarchy
Popular classification fields at the front
– Need to tag with popularity metadata
– “subscribers will want to select on me”
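A parent-first type code turns a subtype test into a prefix test, which a router can decide from the first few bits of a packet. The sketch below assumes a made-up hierarchy with two bits per level; the `TYPE_CODES` table and both function names are hypothetical.

```python
# Hypothetical type hierarchy: each level contributes two bits,
# with the most general type first.
TYPE_CODES = {
    "Event": "",
    "Event/Sensor": "01",
    "Event/Sensor/Temperature": "0101",
    "Event/Sensor/Pressure": "0110",
    "Event/Log": "10",
}


def is_subtype(packet_code: str, subscribed_code: str) -> bool:
    """A packet matches a subscribed type if that type's code is a prefix of its own."""
    return packet_code.startswith(subscribed_code)


def match_as_bits_arrive(bits, subscribed_code):
    """Decide from as few leading bits as possible.

    Returns (matched, bits_read), where bits_read counts the bits that
    had to be examined before the decision was known.
    """
    for i, bit in enumerate(bits):
        if i == len(subscribed_code):
            return True, i  # whole prefix already matched
        if bit != subscribed_code[i]:
            return False, i + 1  # first mismatching bit settles it
    return len(bits) >= len(subscribed_code), len(bits)
```

A subscriber to `Event/Sensor` is matched after two bits of a temperature packet and rejected after one bit of a log packet, without reading the rest of the type code.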

Fast Classifying: Fixed Positions

Would like to avoid scanning through long or variable-length fields

Long/Variable data needs to be in a separate channel/section

Primitives and fixed-length references at the front
– References point into data section
– Classifier can jump large, uninteresting data quickly
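The fixed-position layout might look like the following sketch: a fixed-size header holding the magic number, type code, one primitive field, and an (offset, length) reference into a trailing data section. The header layout, the `MAGIC` value, and the `priority` field are assumptions for illustration only.

```python
import struct

# Hypothetical header: magic, type code, one popular numeric field, then a
# fixed-length (offset, length) reference into the trailing data section.
HEADER = struct.Struct("!HHiII")
MAGIC = 0x4D45


def build_packet(type_code, priority, payload):
    """Place fixed-size fields first and the variable-length payload last."""
    offset = HEADER.size  # the data section starts right after the header
    return HEADER.pack(MAGIC, type_code, priority, offset, len(payload)) + payload


def classify(packet, wanted_type, min_priority):
    """Inspect only fixed-offset header fields; the payload is never scanned."""
    magic, type_code, priority, _, _ = HEADER.unpack_from(packet, 0)
    return magic == MAGIC and type_code == wanted_type and priority >= min_priority


def payload_of(packet):
    """Follow the reference to fetch variable-length data only when needed."""
    _, _, _, offset, length = HEADER.unpack_from(packet, 0)
    return packet[offset:offset + length]
```

The cost of `classify` depends only on the 16-byte header, however large the payload is.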

Plus: Schema Format

We’d like the schema format to be amenable to programmatic manipulation and analysis

For instance, when negotiating formats, we’d like to be able to compute how our original format offer differs from the counter-offer

XML Schema is pretty good for this
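Because an XML Schema document is itself XML, comparing an offer with a counter-offer can be done with an ordinary XML parser. The sketch below only compares named `xs:element` declarations that carry a `type` attribute, a large simplification of XML Schema; the `fields`, `diff`, and `schema` helpers and the sample field names are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def fields(xsd_text):
    """Map element name to declared type for each xs:element that has both."""
    root = ET.fromstring(xsd_text)
    return {
        e.get("name"): e.get("type")
        for e in root.iter(XS + "element")
        if e.get("name") and e.get("type")
    }


def diff(offer, counter):
    """Summarize how the counter-offer differs from the original offer."""
    a, b = fields(offer), fields(counter)
    return {
        "removed": sorted(a.keys() - b.keys()),
        "added": sorted(b.keys() - a.keys()),
        "retyped": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }


def schema(*pairs):
    """Build a tiny illustrative schema from (name, type) pairs."""
    body = "".join(f'<xs:element name="{n}" type="{t}"/>' for n, t in pairs)
    return (
        '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
        '<xs:element name="event"><xs:complexType><xs:sequence>'
        f"{body}</xs:sequence></xs:complexType></xs:element></xs:schema>"
    )
```

For example, diffing an offer with `id: xs:int` and `note: xs:string` against a counter-offer with `id: xs:long` and `unit: xs:string` reports `note` as removed, `unit` as added, and `id` as retyped.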

Conclusions

Efficient instance transfer is an interesting case for data-binding

Special needs for efficiency
But we can negotiate our own format among the communicating parties
Some explicit support for this in a general data-binding solution could help acceptance
