XML + Query Processing: A Foundation for Intelligent Networks

Michael Franklin, UC Berkeley, September 2003

Outline

Earlier (non-XML) Projects
– Client-Server EXODUS -> SHORE
– DIMSUM - Distributed Query Architecture
– DBIS - Dissemination-Based Information Systems
– Telegraph and TelegraphCQ
– Lessons Learned

The XML-enabled Computing Landscape

Some Research Suggestions


Client-Server Exodus

Issue: How to split the functionality of an OODB across Clients and Servers?

[Figure: client stacks (Applications, Object Access/QP) and the system components (Access Methods, Buffer Manager, Transaction Mgr, Disk Manager) to be divided between clients and servers]


Distribution of OODB Functions

Server is the owner of data.
– Shared resources: data and log disks, server memory.

Clients cache second-class (i.e., soft state) copies to reduce latency.
– Can share client caches too…

Query vs. Data Shipping.

For Data Shipping:
– Object or Page granularity.

Ref: [Sigmod 91,92,94; VLDB 92,93]

[Figure: a server (Buffer Manager, Disk Manager, Transaction Mgr) connected through the Client-Server Protocol to clients (Applications, Object Access/QP)]


SHORE - A Peer Server (P2P?) Model

Follow-on to Exodus [Sigmod 94]

Among other things, took caching to its logical conclusion: all can be clients and servers.
– You manage the data you own (server)
– You cache data owned by others (client)

Wide-area is a reasonable next step
– But massive scale changes everything (more on this later).


So, What Happened?

Well, all the OODB/ORDB stuff
– But isn’t XML DB just OODB redux?

More to the point:
– Models were tightly coupled: synchronous; need intimate knowledge of the schema
– Limited (and late) standardization for query languages, data model, and schema interchange

This is bad for: Scalability, Interoperability, Incremental Deployment, Resilience to Change

Also, some people really did want queries (vs. navigation).


DIMSUM - Adding Queries to the Mix

Goal - mix declarative specification & caching.
– raises mapping problems similar to materialized view maintenance, but more dynamic.

“Hybrid-Shipping” - Sometimes neither pure strategy is best.

Semantic Caching - remainder queries, semantic replacement functions, …

Query Scrambling - query re-optimization for wide-area delays (vague “deep web” theme)

XJoin - Adaptive, pipelined join operator.

Cache Investment - Multiple query cache optimization.

Ref: [Sigmod 96,98; VLDB 96,01; TODS 00]
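The remainder-query idea behind semantic caching can be illustrated with a minimal sketch. It assumes, purely for illustration, that queries and cached regions are closed integer ranges on a single attribute; `split_query` and the `Interval` type are hypothetical names, not part of DIMSUM. The part of the query covered by the cache is answered locally (the probe), and only the uncovered part (the remainder) would be shipped to the server.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed integer range [lo, hi]; an assumption of this sketch


def split_query(query: Interval, cached: List[Interval]) -> Tuple[List[Interval], List[Interval]]:
    """Split a range query into probe ranges (answerable from the cache)
    and remainder ranges (not cached, so they must go to the server)."""
    lo, hi = query
    probe: List[Interval] = []
    remainder: List[Interval] = []
    cursor = lo  # smallest value of the query not yet classified
    for c_lo, c_hi in sorted(cached):
        if c_hi < cursor or c_lo > hi:
            continue  # no overlap with the unclassified part of the query
        if c_lo > cursor:
            remainder.append((cursor, c_lo - 1))  # gap before this cached region
        start, end = max(c_lo, cursor), min(c_hi, hi)
        probe.append((start, end))
        cursor = end + 1
        if cursor > hi:
            break
    if cursor <= hi:
        remainder.append((cursor, hi))  # tail of the query beyond all cached regions
    return probe, remainder


probe, remainder = split_query((10, 50), [(0, 20), (30, 40)])
print(probe)      # ranges served from the client cache
print(remainder)  # ranges sent to the server as remainder queries
```

Describing cache contents by predicate rather than by page or object id is what lets the client compute the remainder without contacting the server.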


So, What Happened?

Still Tightly Coupled:
– Synchronous (modulo Query Scrambling delay tolerance).
– Need to know (and exchange) schemata

Basically, a federated database approach with caching added.
– But, federated databases still haven’t caught on.
– Q: Why is data warehousing so popular?

Still, some interesting issues raised:
– adaptivity for networked query processing.
– semantic cache content descriptors raise duality of queries and data.
– pipelined operators for incoming data.


DBIS Framework
Dissemination-Based Information Systems

Outgrowth of “Broadcast Disks” project. [SIGMOD 95]
Framework in OOPSLA 97, SIGMOD 98 (Franklin & Zdonik)
Toolkit Developed and Demonstrated at SIGMOD 99

The DBIS Framework is based on three fundamental principles:

1) No one data delivery mechanism is best for all situations (e.g., apps, workloads, topologies).
2) Network Transparency: Must allow different mechanisms for data delivery to be applied at different points in the system.
3) Topology, routing, and delivery mechanism should vary adaptively in response to system changes.


Dissemination Network Components

[Figure: Data Sources, Information Brokers, and Client Proxies exchanging profiles, queries, and responses]


Data Delivery Mechanisms

Pull
– Aperiodic: request/response (Unicast), on-demand broadcast (1-to-n)
– Periodic: polling (Unicast), polling w/ snoop (1-to-n)

Push
– Aperiodic: Email lists (Unicast), publish/subscribe (1-to-n)
– Periodic: Personalized News (Unicast), Broadcast disks (1-to-n)

Dimensions are largely orthogonal – all combinations are potentially useful.


Network Transparency

[Figure: Clients, Brokers, Sources]

A fundamental principle for systems design: Type of a link matters only to nodes on each end.


More on Brokers

Brokers are middleware components that can act as both clients and servers.

Must support data caching
– Needed to convert pushed-data to pulled-data
– Also allows implementation of hierarchical caching

Profile Management
– Allow informed data management: push, prefetch, staging, etc.

Profile Matching
– Our assumptions were: No profile language sufficient for all applications. Need an API for adding app-specific profiling
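The broker roles listed above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the DBIS toolkit: the `Broker` class, its method names, and the choice to model a profile as an application-supplied predicate (standing in for the "API for adding app-specific profiling") are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Profile = Callable[[str, str], bool]  # hypothetical app-specific matcher over (key, value)


class Broker:
    """Toy broker: a client of upstream sources and a server to downstream clients."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}         # latest pushed value per key
        self.profiles: Dict[str, Profile] = {}  # client id -> registered profile
        self.outbox: Dict[str, List[str]] = {}  # keys pushed onward, per client

    def register_profile(self, client: str, profile: Profile) -> None:
        self.profiles[client] = profile
        self.outbox.setdefault(client, [])

    def on_push(self, key: str, value: str) -> None:
        """Upstream pushed an item: cache it, then push it on to matching profiles."""
        self.cache[key] = value
        for client, profile in self.profiles.items():
            if profile(key, value):
                self.outbox[client].append(key)

    def pull(self, key: str) -> Optional[str]:
        """A downstream client pulls: the cache turns pushed data into pulled data."""
        return self.cache.get(key)


broker = Broker()
broker.register_profile("alice", lambda key, value: key.startswith("stock/"))
broker.on_push("stock/IBM", "84.10")
broker.on_push("weather/SF", "fog")
print(broker.outbox["alice"])    # keys forwarded to alice by push
print(broker.pull("weather/SF")) # served later, on demand, from the cache
```

Because the cache sits between the push side and the pull side, the delivery mechanism used upstream is invisible downstream, which is the network transparency principle in miniature.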


So, What Happened?

Focus on combo of Push and Pull.

Big deal: Integration of Database and Networking
– If I had a Euro for every review that said “why is this a db problem?”
– Published in DB and Comms venues.

But, we were missing 2 big pieces of the puzzle:
– How to deploy this stuff (in the routers?)?
– What should the language for profiles and queries be?

These have since been answered


Telegraph: Querying the Networked World

Increasingly ubiquitous networking at all scales.
– ad hoc sensor nets, wireless, global Internet

Explosion in number, types, and locations of data sources and sinks.
– mobile devices, P2P networks, data centers

Emerging software infrastructure to put it all together.

“When processing, storage, and transmission cost micro-dollars, then the only real value is the data and its organization.” (Jim Gray’s 1998 Turing Award Paper)


Telegraph Overview

An adaptive system for large-scale shared dataflow processing.
– Sharing and adaptivity go hand-in-hand

Based on an extensible set of operators:
1) Ingress (data access) operators: File readers, Sensor Proxies, Screen-Scrapers
2) Non-Blocking Data processing operators: Selections (filters), XJoins, …
3) Adaptive Routing Operators: Eddies, STeMs, FLuX, etc.

Operators connected through “Fjords” [MF02]
– queue-based framework unifying push&pull.


The Telegraph Project

We’ve explored sharing and adaptivity in …
– Eddies: Continuously adaptive queries
– Fjords: Inter-module communication
– CACQ: Sharing, Tuple-lineage
– PSoup: Query=Data duality
– STeMs: Half-a-symmetric-join, tuple store
– FLuX: Fault tolerance, load balancing

… and built a first generation prototype [SIGMODRec01]
– Built from scratch in Java

Rewrote as “TelegraphCQ” [CIDR 03]
– In “C”, based on open-source PostgreSQL
– Focus on continuous queries over streams
– Released in July 2003
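The description of STeMs as "half a symmetric join" is easier to picture next to a plain pipelined symmetric hash join. The sketch below is a generic textbook version, not the Telegraph or XJoin code; the function name and the tagged-tuple encoding are assumptions made for illustration. Each arriving tuple is built into its own side's hash table and immediately probes the other side's table, so results stream out without either input having to finish first. Each of the two hash tables plays roughly the role a STeM-like tuple store would.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

Tagged = Tuple[str, Hashable, object]  # (side, join key, payload); side is "L" or "R"


def symmetric_hash_join(stream: Iterable[Tagged]) -> Iterator[Tuple[Hashable, object, object]]:
    """Non-blocking equi-join over an interleaved stream of left and right tuples."""
    tables: Dict[str, Dict[Hashable, List[object]]] = {
        "L": defaultdict(list),  # tuples seen so far from the left input
        "R": defaultdict(list),  # tuples seen so far from the right input
    }
    for side, key, payload in stream:
        other = "R" if side == "L" else "L"
        tables[side][key].append(payload)         # build: remember this tuple
        for match in tables[other].get(key, []):  # probe: join with the other side
            if side == "L":
                yield (key, payload, match)
            else:
                yield (key, match, payload)


arrivals = [("L", 1, "l1"), ("R", 1, "r1"), ("R", 2, "r2"), ("L", 2, "l2")]
for result in symmetric_hash_join(arrivals):
    print(result)  # emitted as soon as both matching tuples have arrived
```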


The TelegraphCQ Architecture

[Figure: TelegraphCQ Front End (Listener, Parser, Planner, Mini-Executor, Catalog, Proxy); TelegraphCQ Back End processes (CQEddy, Modules, Scans, Split); TelegraphCQ Wrapper ClearingHouse (Wrappers); Shared Memory (Query Plan Queue, Eddy Control Queue, Query Result Queues, Buffer Pool); Disk]


Queries Need Windows: Landmark query

[Figure: a timeline from 0 to 60 in steps of 5, showing the window (start ST, end NOW) at NOW = 40, 41, …, 45, …, 50]
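A landmark window keeps one edge fixed while the other follows NOW, so the window only grows. The sketch below assumes the stream is an iterable of `(timestamp, value)` pairs in timestamp order and uses a running maximum as the aggregate; the function name `landmark_max` and the choice of aggregate are illustrative assumptions, not TelegraphCQ syntax.

```python
from typing import Iterable, Iterator, Optional, Tuple


def landmark_max(stream: Iterable[Tuple[int, float]], landmark: int) -> Iterator[Tuple[int, float]]:
    """For each arriving (timestamp, value), yield (NOW, max over [landmark, NOW])."""
    best: Optional[float] = None
    for now, value in stream:
        if now < landmark:
            continue  # tuple falls before the fixed start of the window
        best = value if best is None else max(best, value)
        yield now, best  # the window's right edge has advanced to NOW


readings = [(38, 5.0), (40, 3.0), (41, 7.0), (45, 6.0), (50, 9.0)]
for now, running_max in landmark_max(readings, landmark=40):
    print(now, running_max)
```

Because the left edge never moves, an aggregate such as max can be maintained incrementally without storing the window contents; a sliding window, where old tuples expire, does not allow this shortcut in general.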


So, What Happened?

Decision was made to do relational first.
– Enough hard problems w/o XML
– Our early apps weren’t XML
  Q: Will they eventually be?
– Note: Streams and Aurora made same choice

Developed lots of stream-related technology

Project still going strong
– Storage manager, archives, and historical queries
– Adaptive Adaptivity
– Performance Tuning
– Query Language and Window semantics
– Distribution


Summary So Far

4 projects over 14 or so years. All exploring aspects of networked data management.

Exodus/SHORE - centrality of caching, work sharing and work splitting paradigms.

DIMSUM - Benefits and challenges of declarative specifications via queries.

DBIS - Push, Profiles, broader notion of integrating networking and data management.

Telegraph - Adaptivity, Sharing, CQs, Stream processing.

But, they all suffer to some extent from the problem of tight coupling in terms of both timing and semantics.


Meta Lessons Learned

1. You don’t have to predict the technology correctly to get a bunch of papers published.

2. Sometimes you actually get it right, but the timing is a bit off.

A lot of pieces have to fall into place before a new technology or architecture clicks.

XML is one such piece, and it’s a BIG one.


How to Make Systems More Network-Friendly

Messaging enables distributed communication that is loosely coupled. A component sends a message to a destination, and the recipient can retrieve the message from the destination. However, the sender and the receiver do not have to be available at the same time in order to communicate. In fact, the sender does not need to know anything about the receiver; nor does the receiver need to know anything about the sender. The sender and the receiver need to know only what message format and what destination to use.

Java Message Service (JMS) API Tutorial

Sun Microsystems
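The decoupling described in the quote can be shown with a tiny in-memory sketch. This is not the JMS API: the `Destinations` class and its `send`/`receive` methods are hypothetical Python stand-ins. The point is only that sender and receiver share a destination name and a message format, and need not run at the same time.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class Destinations:
    """Named message queues; senders and receivers never reference each other."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = defaultdict(deque)

    def send(self, destination: str, message: str) -> None:
        self._queues[destination].append(message)  # stored until someone retrieves it

    def receive(self, destination: str) -> Optional[str]:
        queue = self._queues[destination]
        return queue.popleft() if queue else None  # None when nothing is waiting


bus = Destinations()
bus.send("orders", "<order id='1'/>")  # the receiver does not exist yet
print(bus.receive("orders"))           # retrieved later by an unrelated component
```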


Preaching to the Choir

XML (not JMS!) solves both these issues.

– Senders and Receivers can agree on message format (or at least figure most of it out).

– Destinations should be encoded by value not by address. (Didn’t we learn anything during the OODB battles?).

Database people live and breathe both of these. So who better to fix the networked application infrastructure problem?

(Ahem, but, better keep that slow DBMS out of the message flow! e.g., FedEx tracking involves 100,000,000 transactions a day, and RFID will be even more fun.)

XML Message Brokers

• A platform for dynamic, loosely-coupled integration of enterprise applications and data.
• Interaction accomplished through exchange of messages in the wide area.

(e.g., Adam Bosworth’s VLDB 02 keynote: http://www.cs.ust.hk/vldb2002/VLDB2002-proceedings/slides/S01P01slides.pdf)

The challenge is to efficiently and quickly match incoming XML documents against the potentially huge set of user profiles.

Underlying Technology: Filtering

[Figure: Data Sources -> XML Conversion -> XML Documents -> Filter Engine (matching against User Profiles) -> Filtered Data -> Users]

Our View on Message Brokers (YFilter)

Message Brokers perform three main tasks:
– Filtering - matching of interests.
– Transformation - format conversion for app integration and preferences.
– Routing - moving bits through the overlay network

Must be lightweight and scalable.
– Effectively they are high-function routers.
– Large-scale deployments may entail handling 10’s or 100’s of thousands of queries (subscriptions)

XML is a natural substrate.

YFilter: Shared Path Matching
Yanlei Diao et al., ACM TODS, Dec. 2003

For large-scale systems, shared processing is essential.

YFilter uses an NFA-based approach to share path matching work among queries.

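Location step -> NFA fragment
/a -> transition on a
//a -> ε-transition to a state with a * self-loop, then a transition on a
/* -> transition on *
//* -> ε-transition to a state with a * self-loop, then a transition on *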

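Constructing a Query NFA

Concatenate NFA fragments for location steps in a path expression.

[Figure: the fragment for /a (transition on a) concatenated with the fragment for //b (* self-loop, then transition on b) gives the NFA for query “/a//b”]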

Constructing the Combined NFA

Q1=/a/b
Q2=/a/c
Q3=/a/b/c
Q4=/a//b/c
Q5=/a/*/b
Q6=/a//c
Q7=/a/*/*/c
Q8=/a/b/c

[Figure: the combined NFA, in which queries share states for common path prefixes; accepting states are labeled {Q1}, {Q2}, {Q3, Q8}, {Q4}, {Q5}, {Q6}, {Q7}]

NFA Execution

An XML fragment: <a> <b> <c> </c> </b> </a>

[Figure: the combined NFA (states 1 to 13) and the runtime stack of active state sets while the fragment is parsed. The stack starts as {1}; read <a> pushes {2}; read <b> pushes {3, 9, 7, 6} and matches Q1; read <c> pushes the next state set and reports further matches (Q3 and Q8 among them); each end tag pops the top set.]
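The construction and stack-based execution described on the last three slides can be sketched compactly. This is a simplified illustration, not the YFilter implementation: it handles only `/`, `//`, named steps, and `*` (no predicates), it applies ε-transitions eagerly when a state set is pushed, and the names `State`, `add_query`, `build`, and `run` are hypothetical. Queries that share a path prefix share the corresponding states, which is the source of the work sharing.

```python
import re
from typing import Dict, Iterable, List, Optional, Set, Tuple


class State:
    """One NFA state; states on a common path prefix are shared by all queries."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}  # element name or "*" -> next state
        self.eps: Optional["State"] = None   # epsilon edge into a "//" state
        self.self_loop = False               # "//" states stay active at any depth
        self.accepts: List[str] = []         # ids of the queries accepted here


def add_query(root: State, qid: str, path: str) -> None:
    """Concatenate one fragment per location step, reusing existing states."""
    state = root
    for axis, name in re.findall(r"(//|/)([^/]+)", path):
        if axis == "//":
            if state.eps is None:
                state.eps = State()
                state.eps.self_loop = True
            state = state.eps
        state = state.edges.setdefault(name, State())
    state.accepts.append(qid)


def build(queries: Dict[str, str]) -> State:
    root = State()
    for qid, path in queries.items():
        add_query(root, qid, path)
    return root


def expand(states: Set[State]) -> Set[State]:
    """Add every state reachable through epsilon edges."""
    out, work = set(states), list(states)
    while work:
        nxt = work.pop().eps
        if nxt is not None and nxt not in out:
            out.add(nxt)
            work.append(nxt)
    return out


def run(root: State, events: Iterable[Tuple[str, str]]) -> Set[str]:
    """Process ("start", name) / ("end", name) events; return ids of matched queries."""
    stack = [expand({root})]  # runtime stack of active state sets
    matched: Set[str] = set()
    for kind, name in events:
        if kind == "end":
            stack.pop()       # backtrack to the parent element's states
            continue
        active: Set[State] = set()
        for state in stack[-1]:
            for label in (name, "*"):
                if label in state.edges:
                    active.add(state.edges[label])
            if state.self_loop:
                active.add(state)
        active = expand(active)
        for state in active:
            matched.update(state.accepts)
        stack.append(active)
    return matched


QUERIES = {"Q1": "/a/b", "Q2": "/a/c", "Q3": "/a/b/c", "Q4": "/a//b/c",
           "Q5": "/a/*/b", "Q6": "/a//c", "Q7": "/a/*/*/c", "Q8": "/a/b/c"}
EVENTS = [("start", "a"), ("start", "b"), ("start", "c"),
          ("end", "c"), ("end", "b"), ("end", "a")]
print(sorted(run(build(QUERIES), EVENTS)))
```

Q3 and Q8 are identical paths, so they end at the same accepting state, and one transition serves both. The stack makes end tags cheap: backtracking is a single pop rather than a re-evaluation.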


Performance Overview

• Sharing provides order-of-magnitude improvements.
• In our experiments, even with 100,000 concurrent queries, filtering was faster than the parser.
• No exponential blow-up of active states in NFA execution
• Little sensitivity to occurrence of ‘*’ and “//”
  YFilter shows little sensitivity to these two parameters because effective prefix sharing keeps the machine size small
• Efficient for query updates
  Tens of milliseconds for inserting 1000 queries, and stabilizes at 5 msec after 50,000 queries exist in the system.

Message Transformation

Shred FLWR expressions into paths that can be pushed down into the path matching engine.

Post-process the output using relational-style operators to produce customized messages.
– Can apply MQO techniques to these post-plans

Three approaches (differ in the extent to which they push work to the engine)
– PathSharing-F: For clause paths only
– PathSharing-FW: For & Where clause paths
– PathSharing-FWR: For, Where & Return

Inherent tension between path sharing and result customization!

See Yanlei Diao’s VLDB 03 paper (Thursday afternoon)

Message Broker – Wrap Up

Sharing is the key to performance
– NFA provides excellent scalability/performance
– PathSharing-FWR performs best, when combined with optimizations based on the queries and DTD.
– When the post-processing is shared, even more scalability can be achieved. This sharing is facilitated by using relational-like query plans.

On-going work - How to deploy in the wide area?
– Distributed Filtering and Content Delivery Network
  Combining distributed query processing and state-of-the-art application-level multicast protocols.
  What semantics can/should be provided?

For more information see: www.cs.berkeley.edu/~daioyl/yfilter


Beyond Message-Based Systems

Distributed systems need traceability
– Particularly highly dynamic (loosely-coupled) ones
– Need to carry provenance information with data

Workflow description
– XML-based workflow languages with appropriate versioning models can provide the platform for the above.

Data needs to be long-lived - Archiving
– Marked up data provides an opportunity for future interpretation?
– Schema versioning needed for this.

Semantic Web?
– Try it if you like…


Deep/Hidden Web Querying

XML is a great way to describe sources.

Routing queries to sources is the inverse of the data dissemination problem.

Yet another instance of the query and data duality.

Stream query processing can help here too.


Self-Publishing/Crawling

Following the query routing idea further…

Queries can be continuously crawling through the network acquiring new data.

This can be random or focused (e.g., navigating your Friendster chains).

Even more fun: Mutant Queries (Papadimos et al. OGI)
– Queries are partially evaluated and bound as they traverse the network.
– “Hybrid Shipping” on steroids


Topics in Need of Work

Query Languages and semantics in streaming, loosely-coupled, semi-structured environments.

Update consistency models, transactions, exactly-once delivery - How 80’s!

Dynamism and on-the-fly modifications

User interaction

Platform questions: In or out of the DBMS?

Making XML appropriate for other environments (e.g., sensor networks).

…


Conclusions

Two technologies are combining to make distributed/decentralized computing a reality: overlay networks and XML.

Query processing is a way to route data through a network by value.
– This is the right way to build an overlay network.
– We are the right people to do it.
– XML is the common substrate that enables it.

My plan: revisit many earlier distributed data management ideas in light of this new reality.
– And do some new stuff too!
