XML Data Binding: Encoding for High-Performance Content-Based Event Routing
Gail Kaiser
Phil Gross
Columbia University
Programming Systems Lab
Overview
PSL Intro
MEET Project
Encoding Conversion Efficiency
Encoding Size Efficiency
Encoding Classification Efficiency
Programming Systems Lab
“PSL conducts research on Web technologies, collaborative work, virtual worlds, process/workflow, extended transaction models, software development environments and tools, software engineering, information management, and distributed programming systems”
Lately, lots of XML stuff
PSL XML-related Research
FlexML: Flexible XML
– Open-ended XML streams that may include “new” tags
– Dynamic schema and semantics discovery and composition
XUES: XML-based Universal Event Service
– Event Packager: Data mining over XML structured data
– Event Distiller: XML event poset pattern matching
– Learning new application-domain events to recognize
DISCUS: Decentralized Information Spaces for Composition and Unification of Services
– Rapid and secure application composition using Web Services
– Trust Evolution: PGP Trust + KeyNote + real-world business
MEET
Multiply Extensible Event Transport
Content-based multicast routing
Must be efficient enough for embedded and high-performance applications
MEET Motivations
Personal Life Recorder (sensor oriented)
GroupWork Recorder (computer/DB oriented)
Parallel/Grid computing
Distributed simulation
Battlefield C4I
Last, but not least:
– Dissertation submission
Relationship to Other Work
Generally modeling communication like:
[Figure: Machine A (Relational) and Machine B (XML)]
What actually goes over the line is an afterthought
But with N-way Internet-scale communication
– Millions of publishers and subscribers
We can (must!) do better than ASCII text…
– Line speed => ≈250 assembly instructions per packet
MEET Extensibility
Want to scale up, to millions of pubs and subs
Want to scale down, to embedded and wireless
No single solution satisfactory at all scales
Composed of hot-swappable subsystems
– Router, transports, clock/causality, types, etc.
Why Types
Event data is not just an opaque bag of bits
Subscriptions are Boolean functions over events
Type safety would be nice
What type system to use?
Initial MEET Type Design
Initial design calls for supporting Java, C#, and XML Schema defined objects “out of the box”
XML Schema used as Ur-language/Esperanto for conversions
Subscriptions are arbitrary boolean functions on datatypes
XML Schema is not an ideal ur-type
– Excessively complex, verbose, etc.
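To make "subscriptions are Boolean functions on datatypes" concrete, here is a minimal Python sketch. The `TemperatureReading` type, the `route` helper, and the subscriber names are hypothetical illustrations, not part of MEET.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TemperatureReading:
    """Hypothetical event type, used only for illustration."""
    sensor_id: int
    celsius: float


# A subscription is a Boolean function over events.
Subscription = Callable[[object], bool]


def hot_reading(event: object) -> bool:
    # The type check comes first, so fields are only read on events
    # that are known to have them.
    return isinstance(event, TemperatureReading) and event.celsius > 30.0


def route(event: object, subscriptions: dict[str, Subscription]) -> list[str]:
    """Return the names of the subscribers whose predicate accepts the event."""
    return [name for name, accepts in subscriptions.items() if accepts(event)]


subs = {"alarm": hot_reading, "logger": lambda event: True}
print(route(TemperatureReading(sensor_id=7, celsius=35.5), subs))
print(route(TemperatureReading(sensor_id=7, celsius=20.0), subs))
```

Because the predicate is arbitrary code, a router can only evaluate it quickly if the fields it reads are cheap to find in the encoded event, which is what the encoding slides address.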
Encodings for Efficiency
Java, C#, XML, ASN.1 have well-defined but proprietary encodings for instances
Would be nice to have an independent encoding scheme with some desirable properties missing from the above
– Fast serialization/deserialization
– Elimination of redundant information from message sequences
– Data organized for rapid classification/routing
Conversion Efficiency
Need to get to and from wire format as fast as possible
Leverage homogeneity to eliminate unnecessary conversions, e.g., network byte order
ECho system from Eisenhauer et al., Georgia Tech
– Using “native data” for ultra-low latency
– Necessary for HPC
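One way to avoid unconditional conversion to network byte order is to send data in the sender's native order and tag it, so that a receiver with the same byte order has nothing to swap. The sketch below shows that idea only. The one-byte flag and the field layout are assumptions made for illustration and are not ECho's or MEET's actual wire format.

```python
import struct
import sys

# Hypothetical layout: 1 flag byte (0 = little-endian, 1 = big-endian),
# then a 32-bit int and a 64-bit float in the SENDER's native byte order.
NATIVE_FLAG = 0 if sys.byteorder == "little" else 1


def encode_native(seq: int, value: float) -> bytes:
    # "=" selects native byte order with standard sizes and no padding.
    return bytes([NATIVE_FLAG]) + struct.pack("=id", seq, value)


def decode(message: bytes) -> tuple[int, float]:
    # The flag says which order the sender used; when it matches the
    # receiver's own order, no byte swapping is needed.
    order = "<" if message[0] == 0 else ">"
    return struct.unpack(order + "id", message[1:])


print(decode(encode_native(42, 1.5)))
# A message from a (simulated) big-endian sender is still readable.
print(decode(bytes([1]) + struct.pack(">id", 7, 2.0)))
```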
Size Efficiency
Ideal for single message is self-describing data
With multiple messages of same type, one can pull out redundant type info, e.g., schema
Goal is to go further: If 90% of content of messages is the same, generate a new subtype with fixed values
From self-describing to all-schema is a continuum
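The continuum can be illustrated with three encodings of the same event. This is a rough sketch under assumed names: the event fields, the schema IDs, and the use of JSON as the self-describing form are all hypothetical stand-ins.

```python
import json
import struct

event = {"sensor_id": 7, "celsius": 21.5, "unit": "C", "site": "lab-1"}

# 1. Self-describing: field names travel with every message.
self_describing = json.dumps(event).encode()

# 2. Schema pulled out: both sides already know that schema 1 means
#    (sensor_id: int32, celsius: float64, unit: 1 byte, site: 5 bytes).
SCHEMA_ID = 1
with_schema = struct.pack(
    "<Hid1s5s", SCHEMA_ID, event["sensor_id"], event["celsius"],
    event["unit"].encode(), event["site"].encode(),
)

# 3. Subtype with fixed values: schema 2 is schema 1 with unit and site
#    fixed, so those values move into the schema and leave the message.
SUBTYPE_ID = 2
FIXED = {"unit": "C", "site": "lab-1"}
with_subtype = struct.pack("<Hid", SUBTYPE_ID, event["sensor_id"], event["celsius"])


def decode_subtype(message: bytes) -> dict:
    """Rebuild the full event from the subtype message plus its fixed values."""
    _, sensor_id, celsius = struct.unpack("<Hid", message)
    return {"sensor_id": sensor_id, "celsius": celsius, **FIXED}


print(len(self_describing), len(with_schema), len(with_subtype))
print(decode_subtype(with_subtype) == event)
```

Each step moves information out of the message and into shared state, so the savings only pay off when many messages share that state.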
Classification Efficiency
When bits start arriving serially at the router, would like to begin cut-through routing as soon as possible
– Avoid the curse of IP/IPv6: source address first
Want key routing bits as close to the front as possible
Want data in fixed locations
Fast Classifying: First Things First
In the packet, type info first (after magic)
– Would like to represent type codes as bit string with “most significant” info, e.g. parent type first, followed by subtype identifier, sub-subtype, etc.
– Need access to type hierarchy
Popular classification fields at the front
– Need to tag with popularity metadata
– “subscribers will want to select on me”
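With parent-first type codes, a subscription to a parent type becomes a prefix match on the first bytes of the packet. The sketch below assumes one byte per hierarchy level, a four-level fixed-width code, and a placeholder magic value. All of these are illustrative choices; the slides call for a bit string and do not specify a layout.

```python
MAGIC = b"ME"   # placeholder magic number (assumption)
TYPE_DEPTH = 4  # fixed-width type code: one byte per level, zero-padded


def type_code(*levels: int) -> bytes:
    """Encode a path through the type hierarchy, parent first."""
    if len(levels) > TYPE_DEPTH or any(not 1 <= level <= 255 for level in levels):
        raise ValueError("each level must be 1..255, at most TYPE_DEPTH levels")
    return bytes(levels) + bytes(TYPE_DEPTH - len(levels))


# Hypothetical hierarchy: sensor > temperature > outdoor, plus chat.
SENSOR = type_code(1)
TEMPERATURE = type_code(1, 2)
OUTDOOR_TEMPERATURE = type_code(1, 2, 1)
CHAT = type_code(2)


def make_packet(code: bytes, payload: bytes) -> bytes:
    return MAGIC + code + payload


def matches(packet: bytes, subscribed: bytes) -> bool:
    """True if the packet's type is the subscribed type or one of its subtypes."""
    prefix = subscribed.rstrip(b"\x00")
    # Only the first len(MAGIC) + len(prefix) bytes are inspected, so the
    # decision can be made before the rest of the packet has arrived.
    return packet.startswith(MAGIC + prefix)


packet = make_packet(OUTDOOR_TEMPERATURE, b"21.5")
print(matches(packet, SENSOR), matches(packet, TEMPERATURE), matches(packet, CHAT))
```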
Fast Classifying: Fixed Positions
Would like to avoid scanning through long or variable-length fields
Long/Variable data needs to be in a separate channel/section
Primitives and fixed-length references at the front
– References point into data section
– Classifier can jump large, uninteresting data quickly
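A fixed-position layout of this kind might look like the following sketch: a fixed-length header holds the primitives plus an (offset, length) reference, and the variable-length body sits in a separate data section. The field names, the header format, and the region value 42 are hypothetical.

```python
import struct

# Hypothetical little-endian layout:
#   header: priority (uint8), region (uint16),
#           body_offset (uint32), body_length (uint32)
#   data section: variable-length bytes referenced from the header
HEADER = struct.Struct("<BHII")
REGION_OFFSET = 1  # region always starts right after the 1-byte priority


def encode(priority: int, region: int, body: bytes) -> bytes:
    # The body reference is fixed-length even though the body is not.
    return HEADER.pack(priority, region, HEADER.size, len(body)) + body


def classify(packet: bytes, wanted_region: int) -> bool:
    # Reads two bytes at a known position; the body is never scanned.
    (region,) = struct.unpack_from("<H", packet, REGION_OFFSET)
    return region == wanted_region


def read_body(packet: bytes) -> bytes:
    # Only a consumer that needs the body follows the reference.
    _, _, offset, length = HEADER.unpack_from(packet)
    return packet[offset:offset + length]


packet = encode(priority=3, region=42, body=b"x" * 10_000)
print(classify(packet, 42), len(read_body(packet)))
```

The cost of `classify` is the same for a 10-byte body and a 10,000-byte one, which is the property the slide asks for.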
Plus: Schema Format
We’d like the schema format to be amenable to programmatic manipulation and analysis
For instance, when negotiating formats, we’d like to be able to compute how our original format offer differs from the counter-offer
XML Schema is pretty good for this
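As a rough illustration of computing how an offer differs from a counter-offer, the sketch below treats a schema as a flat mapping from field name to type name. That mapping is a deliberately simplified stand-in for an XML Schema document, and `schema_diff` is a hypothetical helper.

```python
def schema_diff(offer: dict[str, str], counter: dict[str, str]) -> dict[str, dict]:
    """Describe how the counter-offer differs from the original offer."""
    shared = offer.keys() & counter.keys()
    return {
        # Fields the counter-offer dropped.
        "removed": {f: offer[f] for f in offer.keys() - counter.keys()},
        # Fields the counter-offer introduced.
        "added": {f: counter[f] for f in counter.keys() - offer.keys()},
        # Fields present in both but with a different type: (old, new).
        "changed": {f: (offer[f], counter[f]) for f in shared if offer[f] != counter[f]},
    }


offer = {"sensor_id": "int", "celsius": "double", "site": "string"}
counter = {"sensor_id": "int", "celsius": "float", "unit": "string"}
print(schema_diff(offer, counter))
```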
Conclusions
Efficient instance transfer is an interesting case for data-binding
Special needs for efficiency
But we can negotiate our own format among the communicating parties
Some explicit support for this in a general data-binding solution could help acceptance