real-time query systems for complex data · pdf file support complex analyses over large...

Click here to load reader

Post on 22-Jul-2020

1 views

Category:

Documents

0 download

Embed Size (px)

TRANSCRIPT

  • Real-Time Query Systems for Complex Data Sources

    A dissertation presented

    by

    Ian Thomas Rose

    to

    The School of Engineering and Applied Sciences

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in the subject of

    Computer Science

    Harvard University

    Cambridge, Massachusetts

    May 2011

  • c©2011 - Ian Thomas Rose

    All rights reserved.

  • Thesis advisor Author

    Matt Welsh Ian Thomas Rose

    Real-Time Query Systems for Complex Data Sources

    Abstract

    This dissertation presents techniques for building scalable systems that allow real-

    time querying of complex data sources. In recent years, networking and sensing

    advances have dramatically increased the volume of information available to data

    consumers. However, coping with large scales and high data rates often requires

    processing data in real time, as it arrives, rather than storing it for later analysis. Our

    thesis is that by including the data acquisition process in the overall system design,

    it is possible to build scalable, real-time stream processing systems for complex data

    sources.

    We have built two systems to demonstrate a number of unique design features

    required for scalable operation in our chosen domains. Cobra is a system that taps

    online RSS feeds (such as blogs, news articles and websites’ user comments) as its

    data source. Cobra repeatedly crawls a set of RSS feeds, matching the contents to

    keyword-based user queries, similar to those used in Web search engines. As RSS-

    based content can change frequently, the design ensures that the latency between

    crawls is low, while still scaling to a large number of RSS feeds and many concurrent

    user queries.

    Secondly, Argos is a system for widely-distributed, outdoor wireless network mon-

    iii

  • Abstract iv

    itoring. Capturing 802.11 WiFi traffic across a large urban area, Argos enables a

    wide range of user queries, such as mobile node tracking, malware detection, and

    traffic characterization. Use of a wireless mesh network to connect the deployed snif-

    fer nodes introduces additional challenges due to its limited bandwidth capacity. To

    address this restriction, we designed a novel in-network packet merging process and

    demonstrate its bandwidth savings. Additionally, Argos provides a variety of channel

    management schemes; 802.11 defines up to 14 radio channels but each sniffer can only

    capture from one channel at a time, necessitating policies for when to capture from

    which channel.

    These systems are built around three design principles that aid in the real-time

    querying of complex data sources: query interfaces tailored to the application’s spe-

    cific data types, optimized data collection processes, and allowing queries to provide

    feedback to the collection process.

  • Contents

    Title Page . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

    1 Introduction 1 1.1 Stream Processing Engines . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.3 Dissertation Summary and Contributions . . . . . . . . . . . . . . . . 4 1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2 Background and Related Work 8 2.1 A Taxonomy of Streaming Solutions . . . . . . . . . . . . . . . . . . . 9 2.2 General Purpose SPEs . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.2.1 Aurora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.2 TelegraphCQ . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.3 Borealis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.4 STREAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.5 System S . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.6 Summary and Limitations . . . . . . . . . . . . . . . . . . . . 19

    2.3 Domain-Specific SPEs . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.1 Example 1: TinyDB . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.2 Example 2: Gigascope . . . . . . . . . . . . . . . . . . . . . . 27

    2.4 Other Query System Architectures . . . . . . . . . . . . . . . . . . . 28 2.4.1 Traditional Distributed Pub-Sub Systems . . . . . . . . . . . . 28 2.4.2 Web Search Engines . . . . . . . . . . . . . . . . . . . . . . . 32 2.4.3 Content-Delivery Networks . . . . . . . . . . . . . . . . . . . . 34 2.4.4 MapReduce-related Systems . . . . . . . . . . . . . . . . . . . 35

    2.5 Network Monitoring and Measurement . . . . . . . . . . . . . . . . . 36 2.5.1 Wired Network Monitoring . . . . . . . . . . . . . . . . . . . . 37

    v

  • Contents vi

    2.5.2 Indoor WLAN Monitoring . . . . . . . . . . . . . . . . . . . . 39 2.5.3 Wardriving Studies . . . . . . . . . . . . . . . . . . . . . . . . 41 2.5.4 Wireless Sensor Networks . . . . . . . . . . . . . . . . . . . . 41 2.5.5 Mesh Network Measurements . . . . . . . . . . . . . . . . . . 42

    2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    3 Cobra Architecture 44 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Cobra System Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

    3.2.1 Crawler service . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Filter service . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.3 Reflector service . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.2.4 Hosting model . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

    3.3 Service provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.3.1 Service instantiation and monitoring . . . . . . . . . . . . . . 59 3.3.2 Source feed mapping . . . . . . . . . . . . . . . . . . . . . . . 60

    3.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

    4 Cobra System Evaluation 64 4.1 Properties of Web feeds . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.2 Microbenchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

    4.2.1 Memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.2 Crawler performance . . . . . . . . . . . . . . . . . . . . . . . 69 4.2.3 Filter performance . . . . . . . . . . . . . . . . . . . . . . . . 71

    4.3 Scalability measurements . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.4 Comparison to other search engines . . . . . . . . . . . . . . . . . . . 78 4.5 Comparison to Prior Work . . . . . . . . . . . . . . . . . . . . . . . . 79 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

    5 Designing and Measuring an Outdoor Wireless Testbed 81 5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81 5.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 5.3 Urban Deployment Considerations . . . . . . . . . . . . . . . . . . . . 86

    5.3.1 Physical environment . . . . . . . . . . . . . . . . . . . . . . . 87 5.3.2 Network Coverage . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3.3 Network Security . . . . . . . . . . . . . . . . . . . . . . . . . 90

    5.4 Network Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.4.1 TCP Throughput . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.4.2 Ping Latencies . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.4.3 Implications for Applications . . . . . . . . . . . . . . . . . . . 95

    5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

  • Contents vii

    6 Argos Architecture 99 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.2 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 101

    6.2.1 Why Argos? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.2.2 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

    6.3 Argos Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.3.1 User Queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.3.2 Sniffer Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.3.3 In-network Traffic Processing . . . . . . . . . . . . . . . . . . 109 6.3.4 Protocol Stack Emulation . . . . . . . . . . . . . . . . . . . . 114 6.3.5 Sniffer Channel Management . . . . . . . . . . . . . . . . . . . 115

    6.4 Example Query: Stolen Laptop Finder . . . . . . . . . . . . . . . . . 117 6.5 Implementation and Deployment . . . . . . . . . . . . . . . . . . . . 118 6.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

    7 Argos System Evaluation 120 7.1 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 120

    7.1.1 In-network traffic processing . . . . . . . . . . . . . . . . . . . 121 7.1.2 Coordinated channel focusing . . . . . . . . . . . . . . . . . . 124

    7.2 Urban Wireless Traffic Characterization . . .