
Page 1: Nondeterministic Queries in a Relational Grid Information Service

Nondeterministic Queries in a Relational Grid Information Service

Peter A. Dinda
Dong Lu

Prescience Lab
Department of Computer Science

Northwestern University

http://plab.cs.northwestern.edu

Page 2: Nondeterministic Queries in a Relational Grid Information Service

Overview
• RGIS: GIS system based on the relational data model using SQL
• Complex compositional queries can be posed
  – “Find me 16 hosts on the same LAN that together have 32 GB of RAM”
• Can be very expensive to answer
  – Joins: worst case O(n^m) for m tables of size n
• Introduce nondeterminism
  – User gets random sample of result set
  – Automated query transformation
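To make the join cost concrete, the arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, the 50,000-host grid size is borrowed from the test grids later in the talk, and the selection probability 0.001 is an arbitrary choice.

```python
def worst_case_rows(n, m):
    """Worst-case number of row combinations a join of m tables of size n considers."""
    return n ** m


def sampled_rows(n, m, p):
    """Same bound when each input table is first sampled with probability p."""
    return (p * n) ** m


# Two-way self-join over a 50,000-host table.
full = worst_case_rows(50_000, 2)
# The same join when each input keeps roughly 0.1% of its rows (about 50 hosts each).
sampled = sampled_rows(50_000, 2, 0.001)

print(f"full cross product:    {full:,} combinations")
print(f"sampled cross product: about {sampled:,.0f} combinations")
```

Sampling each input with probability p shrinks the worst-case bound by a factor of p^m, which is why a small p can bound query time even for many-way joins.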

Page 3: Nondeterministic Queries in a Relational Grid Information Service

Outline
• Overview
• Model
• Implementation
• Nondeterministic queries
• Performance evaluation
• Related work
• Conclusions

References
D. Lu and P. Dinda, Synthesizing Realistic Computational Grids, SC 2003
D. Lu, J. Skicewicz, and P. Dinda, Scoped and Approximate Queries in a Relational Grid Information Service, Grid 2003

Page 4: Nondeterministic Queries in a Relational Grid Information Service

RGIS Model of a Grid
• Annotated network topology graph
• Annotation examples
  – Hosts: memory, disk, OS, NICs, etc.
  – Router/Switch: backplane bandwidth, ports
  – Link: latency and bandwidth
• Highly dynamic data in streams, not DB
• Virtualization, Futures, Leases
  – Virtual machines

[Figure: layered topology graph. Software: module, endpoint. Network: host, router, iplink. Data link: macswitch, maclink. Physical: connectorswitch, connectorlink.]


Page 6: Nondeterministic Queries in a Relational Grid Information Service

[Figure: RGIS schema diagram. Tables by group, listed as name: columns.]

Resource metadata
– insertids: id => time, rand
– virtuals: id => {id}
– futures: id => id
– leases: id => endtime

Security
– Users, Groups, Capabilities, Sessions, Resource limits…
– Essential views: user => {capability}, user => {resource limit}

Software layer
– moduleexecs: id, type, name, blob
– datasources: id, type
– endpoints: id, type, module, datasource
– modules: id, moduleexec

Network layer
– hosts: id, typeinfo, distip
– routers: id, typeinfo, distip
– routerbenchmarks: id, typeinfo, blob
– hostbenchmarks: id, typeinfo, blob
– iplinks: id, src, dest
– ippaths: id, src, dest
– ipassocs: distip => {ip}

Data link layer
– macswitches: id, typeinfo, distadx
– maclinks: id, typeinfo, src, dest
– macassocs: distadx => {adx}
– ipmacassoc: ip => macaddr

Physical layer
– connectorswitches: id, typeinfo, distadx
– connectorlinks: id, typeinfo, src, dest
– connectorassocs: distadx => {adx}
– connectormacassoc: adx => macaddr

Valid types (each table: id, name, desc)
– archtypes, routertypes, switchtypes, linktypes, ostypes, osvendors, hardwarevendors, pathtypes, moduletypes, endpointtypes, datatypes, hostbenchmarktypes, routerbenchmarktypes


Page 8: Nondeterministic Queries in a Relational Grid Information Service

RGIS Design (Per Site)

[Figure: per-site architecture diagram. Components and annotations:]
• Users and Applications
• Web Interface, SOAP Interface, Authenticated Direct Interface
• Query Manager and Rewriter
• Update Manager
• Content Delivery Network Interface: for loose consistency
• site-to-site (tentative)
• Oracle 9i Front End: transactional inserts and updates using stored procedures, queries using select statements (uses database’s access control)
• Oracle 9i Back End: Windows, Linux, Parallel Server, etc.
• RDBMS: schema, type hierarchy, indices, PL/SQL stored procedures for each object
• Use of Oracle is not a requirement of approach
• Updates encrypted using asymmetric cryptography on network. Only those with appropriate keys have access

Page 9: Nondeterministic Queries in a Relational Grid Information Service

RGIS Design (Intersite)

[Figure: three RGIS servers at sites A, B, and C, with "Update Push To Friend Site" arrows between them]

• Site RGIS server pushes local updates to friend sites
• Site RGIS server consolidates updates from site and friend sites
• Site RGIS server answers all queries originating from its site

Page 10: Nondeterministic Queries in a Relational Grid Information Service

Insert/Update/Delete

[Figure: insert, update, and delete performance versus Number of Hosts in Database (50,000; 500,000; 5,000,000). Series: Insert (Single), Insert (Bulk), Update (Single), Update (Batches of 100), Update (Bulk), Delete (Bulk).]

Testbed: Dual Xeon 1 GHz, 2 GB, 8x36 GB RAID5, Oracle 9i

Page 11: Nondeterministic Queries in a Relational Grid Information Service

• 2,700 lines of authored SQL
• 4,000 lines of generated PL/SQL
• 22,000 lines of authored Perl
• Main dependencies
  – DBI to Oracle 9i
  – SOAP::Lite
  – CGI
• Not finished yet!

Page 12: Nondeterministic Queries in a Relational Grid Information Service

RGIS Design (Per Site)

[Figure: the per-site architecture diagram shown earlier, repeated with a "This talk" label added]


Page 14: Nondeterministic Queries in a Relational Grid Information Service

Motivation

• Queries for compositions of resources easily expressed in SQL:

  “Find 2 hosts with Linux that together have 3 GB of RAM”

    select h1.insertid, h2.insertid
    from hosts h1, hosts h2
    where h1.os='LINUX' and h2.os='LINUX'
      and h1.mem_mb+h2.mem_mb>=3072

• But such queries can be very expensive to execute
• However, we typically don’t need the entire result set, just some rows, and not always the same ones
• And we need them in a bounded amount of time
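The compositional query above can be tried on a toy table. The sketch below uses Python's `sqlite3` module rather than the Oracle back end described in the talk, and the four sample hosts are invented for illustration; only the `hosts` columns used by the slide's query are modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (insertid INTEGER PRIMARY KEY, os TEXT, mem_mb INTEGER)"
)
# Invented sample data: three Linux hosts and one Windows host.
conn.executemany(
    "INSERT INTO hosts VALUES (?, ?, ?)",
    [(1, "LINUX", 1024), (2, "LINUX", 2048), (3, "WINDOWS", 4096), (4, "LINUX", 512)],
)

# The query from the slide: a self-join of hosts with itself.
QUERY = """
    SELECT h1.insertid, h2.insertid
    FROM hosts h1, hosts h2
    WHERE h1.os = 'LINUX' AND h2.os = 'LINUX'
      AND h1.mem_mb + h2.mem_mb >= 3072
"""
pairs = conn.execute(QUERY).fetchall()

# Number of (h1, h2) combinations the join ranges over before filtering.
candidate_pairs = conn.execute(
    "SELECT COUNT(*) FROM hosts h1, hosts h2"
).fetchone()[0]

print(sorted(pairs), "out of", candidate_pairs, "candidate pairs")
```

Note that the query as written also pairs a host with itself and returns each pair in both orders; with 4 hosts the join already ranges over 16 combinations, and that count grows as the square of the table size.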

Page 15: Nondeterministic Queries in a Relational Grid Information Service

Why Not Just Limit?
• Oracle rownum, MySQL limit clause
• “Return first k rows of result set”
• Problem: Always get the SAME answer
• Problem: May STILL take a long time
  – Results not discovered until near the end
• Problem: Query time related to DATA as well as k
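The first problem is easy to reproduce. The following sketch uses `sqlite3` and its `LIMIT` clause as a stand-in for Oracle's `rownum`; the table contents are invented, and the repeatability shown is what one observes in practice on unchanged data, not something the SQL standard promises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hosts (insertid INTEGER PRIMARY KEY, os TEXT, mem_mb INTEGER)"
)
# Invented data: 200 Linux hosts with memory sizes from 512 MB to 4096 MB.
conn.executemany(
    "INSERT INTO hosts VALUES (?, 'LINUX', ?)",
    [(i, 512 * (1 + i % 8)) for i in range(1, 201)],
)

LIMITED = """
    SELECT h1.insertid, h2.insertid
    FROM hosts h1, hosts h2
    WHERE h1.os = 'LINUX' AND h2.os = 'LINUX'
      AND h1.mem_mb + h2.mem_mb >= 3072
    LIMIT 3
"""

# Issue the identical limited query several times against unchanged data.
runs = [conn.execute(LIMITED).fetchall() for _ in range(5)]
same_every_time = all(run == runs[0] for run in runs)

print(runs[0], "repeated on every run:", same_every_time)
```

Every caller is steered to the same few hosts, which is the behavior the nondeterministic query is meant to avoid.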

Page 16: Nondeterministic Queries in a Relational Grid Information Service

Query Approaches
• All results
• Scoped results (available in Grid 2003 paper)
• Approximate results (available in Grid 2003 paper)
• Nondeterministic results (this paper)
  – Return random sample of result set

Page 17: Nondeterministic Queries in a Relational Grid Information Service

Nondeterministic Version of Query

    select nondeterministically h1.insertid, h2.insertid
    from hosts h1, hosts h2
    where h1.os='LINUX' and h2.os='LINUX'
      and h1.mem_mb+h2.mem_mb>=3072
    within 2 seconds

Page 18: Nondeterministic Queries in a Relational Grid Information Service

Implementing non-deterministic queries

User query:

    select nondeterministically h1.insertid, h2.insertid
    from hosts h1, hosts h2
    where h1.os='LINUX' and h2.os='LINUX'
      and h1.mem_mb+h2.mem_mb>=3072
    within 2 seconds

Output of the Query Manager and Rewriter, using Oracle-specific extensions:

    SELECT H1.INSERTID, H2.INSERTID
    FROM HOSTS H1 SAMPLE(P), HOSTS H2 SAMPLE(P)
    WHERE (H1.OS='LINUX' AND H2.OS='LINUX' AND H1.MEM_MB+H2.MEM_MB>=3072)

• Random sample of input tables with selection probability P determined by time constraint and server load

Page 19: Nondeterministic Queries in a Relational Grid Information Service

Implementing non-deterministic queries

User query:

    select nondeterministically h1.insertid, h2.insertid
    from hosts h1, hosts h2
    where h1.os='LINUX' and h2.os='LINUX'
      and h1.mem_mb+h2.mem_mb>=3072
    within 2 seconds

Output of the Query Manager and Rewriter, using our schema (not Oracle-specific):

    SELECT H1.INSERTID, H2.INSERTID
    FROM HOSTS H1, HOSTS H2, INSERTIDS TEMP_H1, INSERTIDS TEMP_H2
    WHERE (H1.OS='LINUX' AND H2.OS='LINUX' AND H1.MEM_MB+H2.MEM_MB>=3072)
      AND (H1.INSERTID=TEMP_H1.INSERTID
           AND TEMP_H1.rand > 982663452.975047 AND TEMP_H1.rand <= 1025613125.93505)
      AND (H2.INSERTID=TEMP_H2.INSERTID
           AND TEMP_H2.rand > 1877769069.94039 AND TEMP_H2.rand <= 1920718742.90039)

• Random sample of input tables with selection probability P determined by time constraint and server load
• This schema-based approach is the subject of the rest of the talk
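A rewriter producing queries of this shape might look roughly like the sketch below. It is not the RGIS implementation (which the talk says is written in Perl): the function name and argument layout are hypothetical, and the range of `rand` is an assumption. The window widths in the slide's example both equal 0.01 × 2^32, which suggests `rand` is uniform over [0, 2^32) and P was 0.01, but that is an inference from the numbers shown.

```python
import random

RAND_RANGE = 2 ** 32  # assumption: insertids.rand is uniform over [0, 2^32)


def rewrite_nondeterministic(select_list, tables, where, p, rng=None):
    """Build a sampled version of a query (hypothetical helper).

    tables is a list of (table_name, alias) pairs. Each aliased table is
    joined to INSERTIDS and restricted to rows whose rand value falls in
    a randomly placed window of width p * RAND_RANGE.
    """
    rng = rng or random.Random()
    width = p * RAND_RANGE
    from_items = [f"{table} {alias}" for table, alias in tables]
    predicates = [f"({where})"]
    for _table, alias in tables:
        temp = f"TEMP_{alias}"
        # Independent window per table; kept inside the range (no wraparound).
        low = rng.uniform(0, RAND_RANGE - width)
        from_items.append(f"INSERTIDS {temp}")
        predicates.append(
            f"({alias}.INSERTID={temp}.INSERTID"
            f" AND {temp}.rand > {low!r} AND {temp}.rand <= {low + width!r})"
        )
    return (
        f"SELECT {', '.join(select_list)} FROM {', '.join(from_items)}"
        f" WHERE {' AND '.join(predicates)}"
    )


sql = rewrite_nondeterministic(
    ["H1.INSERTID", "H2.INSERTID"],
    [("HOSTS", "H1"), ("HOSTS", "H2")],
    "H1.OS='LINUX' AND H2.OS='LINUX' AND H1.MEM_MB+H2.MEM_MB>=3072",
    p=0.01,
    rng=random.Random(0),
)
print(sql)
```

Because the sampling is expressed as ordinary range predicates on a regular table, the rewritten query needs no vendor-specific sampling support.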

Page 20: Nondeterministic Queries in a Relational Grid Information Service

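Implementing non-deterministic queries

[Figure: each host insertid is associated with a random_number in the range 0 to N. A random starting point x is chosen, and the rows with random_number between x and x+y are selected, where y=P*N.]

• Reshuffling requirement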
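The window selection in the figure can be sketched end to end with `sqlite3`. Everything beyond the slide's x, y, P, and N is an assumption made for illustration: the value of N, the index on `rand`, the choice to keep the window inside [0, N] instead of wrapping around, and the `reshuffle` helper, which shows one plausible reading of the reshuffling requirement (reassigning the random numbers so the same hosts do not always fall in a window together).

```python
import random
import sqlite3

N = 2 ** 32  # assumed upper end of the random_number range


def build(conn, n_hosts, rng):
    """Create insertids and give every id a uniform random number in [0, N]."""
    conn.execute("CREATE TABLE insertids (insertid INTEGER PRIMARY KEY, rand REAL)")
    conn.executemany(
        "INSERT INTO insertids VALUES (?, ?)",
        ((i, rng.uniform(0, N)) for i in range(n_hosts)),
    )
    # An index lets the range predicate avoid a full scan.
    conn.execute("CREATE INDEX insertids_rand ON insertids (rand)")


def sample_ids(conn, p, rng):
    """Pick a random start x and return ids with rand in (x, x + y], y = p * N."""
    y = p * N
    x = rng.uniform(0, N - y)
    rows = conn.execute(
        "SELECT insertid FROM insertids WHERE rand > ? AND rand <= ?", (x, x + y)
    )
    return [row[0] for row in rows]


def reshuffle(conn, rng):
    """Assign fresh random numbers to every id."""
    ids = [row[0] for row in conn.execute("SELECT insertid FROM insertids")]
    conn.executemany(
        "UPDATE insertids SET rand = ? WHERE insertid = ?",
        [(rng.uniform(0, N), i) for i in ids],
    )


rng = random.Random(1)
conn = sqlite3.connect(":memory:")
build(conn, 10_000, rng)
before = sample_ids(conn, 0.05, rng)
reshuffle(conn, rng)
after = sample_ids(conn, 0.05, rng)
print(len(before), len(after))  # each should be near 0.05 * 10,000
```

One caveat of the no-wraparound choice is that ids with random numbers very close to 0 or N are selected slightly less often than the rest.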

Page 21: Nondeterministic Queries in a Relational Grid Information Service

Deadlines
• Hard-limiting
  – Time-limited thread or process forked
• Climbing
  – Start with low probability p and issue the query; if there are no results, double the probability and try again; keep going until there is no more time or there are results
• Estimation
  – Like climbing, but do polynomial estimation over previous runs to estimate if next run will exceed deadline
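The climbing and estimation strategies can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the RGIS code: `run_query` is a hypothetical callback that executes the sampled query at probability p and returns its rows, the starting probability is arbitrary, and the predictor fits a polynomial through the last three (p, time) measurements by Lagrange interpolation, which is only one possible reading of "polynomial estimation". Hard-limiting, which interrupts a query in flight, is not shown.

```python
import time


def lagrange_predict(history, p_next):
    """Extrapolate run time at p_next from the last (up to) three (p, seconds) points."""
    points = history[-3:]
    total = 0.0
    for i, (p_i, t_i) in enumerate(points):
        term = t_i
        for j, (p_j, _) in enumerate(points):
            if j != i:
                term *= (p_next - p_j) / (p_i - p_j)
        total += term
    return total


def climb(run_query, deadline_s, p0=1e-4, predict=None, clock=time.monotonic):
    """Double p until the query returns rows or the deadline is reached.

    With predict=None this is plain climbing; with a predictor it stops
    early when the next run is estimated to overrun the remaining time.
    Returns (rows, last probability tried).
    """
    start = clock()
    p, history = p0, []
    while True:
        t0 = clock()
        rows = run_query(p)
        history.append((p, clock() - t0))
        if rows or p >= 1.0:
            return rows, p
        remaining = deadline_s - (clock() - start)
        if remaining <= 0:
            return [], p
        p_next = min(1.0, 2 * p)
        if predict and len(history) >= 2 and predict(history, p_next) > remaining:
            return [], p
        p = p_next


# Usage with a stand-in query that only finds rows once p reaches 0.001.
attempts = []


def fake_query(p):
    attempts.append(p)
    return ["row"] if p >= 0.001 else []


rows, final_p = climb(fake_query, deadline_s=5.0, predict=lagrange_predict)
print(rows, final_p, len(attempts))
```

Checking the deadline only between runs means a single slow run can still overshoot it, which is presumably why the evaluation also combines these mechanisms with hard limiting.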


Page 23: Nondeterministic Queries in a Relational Grid Information Service

GridG: Synthesizing Realistic Computational Grids

http://www.cs.northwestern.edu/~urgis/GridG

• Generates a Grid as an annotated layer 3 topology
  – Hosts, routers, links
• Graph conforms to power laws of Internet topology
• Annotations include:
  – memory, clock speed, CPU type, number of CPUs, operating system type, link bandwidths, router bandwidths, etc.
  – Memory distribution according to Smith study of MDS contents

Page 24: Nondeterministic Queries in a Relational Grid Information Service

Test Grids

Grid Size (Hosts)    Query
50,000               “Find n hosts with 3 GB of memory”
500,000              “Find n hosts with 3 GB of memory”
5,000,000            “Find n hosts with 3 GB of memory”
10,000               “Find 2 close hosts”
50,000               “Find 2 close hosts”
100,000              “Find 2 close hosts”

Page 25: Nondeterministic Queries in a Relational Grid Information Service

Nondeterministic query performance

[Figure: Query Time and Number of Results, both on logarithmic axes, versus Selection Probability (0.0001, 0.001, 0.002). Query: select two hosts that together have >3 GB of RAM.]

• Meaningful tradeoff between query processing time and result set size is possible

Page 26: Nondeterministic Queries in a Relational Grid Information Service

Nondeterministic query performance

[Figure: Query Time and Number of Results versus Number of Hosts (0 to 16), with points labeled p=0.0001, p=0.00001, p=0.000005, and p=0.00000001. Query: select n hosts that together have >3 GB of RAM, holding query time constant.]

• Can use tradeoff to control query time independent of query complexity

Page 27: Nondeterministic Queries in a Relational Grid Information Service

Deadlines

[Figure: Max and Min values for each Mechanism (Climbing, Climbing+Hard Limiting, Estimation, Estimation+Hard Limiting), plotted on a 0 to 2.5 scale against the Target Deadline. Query: find 2 hosts with collective 600 GB RAM (VERY RARE) in 50K host grid.]

Page 28: Nondeterministic Queries in a Relational Grid Information Service

Extending RGIS to Support Grid Computing On Virtual Machines

• Virtuals
  – Each RGIS object has a unique id
  – Virtualization table associates unique id of virtual resources with unique ids of their constituent physical resources
  – Virtual nature of resource is hidden unless query explicitly requests it
• Futures
  – An RGIS object that does not exist yet
  – Futures table of unique ids
  – Future nature of resource hidden unless query explicitly requests it
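The virtuals mechanism can be illustrated with a small `sqlite3` sketch. The column names (`insertid`, `physical_id`) and the sample rows are assumptions made for this example; the talk only says that a virtualization table maps the id of a virtual resource to the ids of its constituent physical resources. Futures are not modeled here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hosts (insertid INTEGER PRIMARY KEY, os TEXT, mem_mb INTEGER);
    -- One row per (virtual resource, constituent physical resource) pair.
    CREATE TABLE virtuals (insertid INTEGER, physical_id INTEGER);

    -- Hosts 1 and 2 are physical; hosts 10 and 11 are virtual machines on host 1.
    INSERT INTO hosts VALUES (1, 'LINUX', 4096), (2, 'LINUX', 2048),
                             (10, 'LINUX', 1024), (11, 'LINUX', 1024);
    INSERT INTO virtuals VALUES (10, 1), (11, 1);
""")

# An ordinary query never touches virtuals, so virtual hosts look like any other.
all_linux = [row[0] for row in conn.execute(
    "SELECT insertid FROM hosts WHERE os = 'LINUX' ORDER BY insertid"
)]

# A query that explicitly asks about virtualization joins against virtuals.
virtual_map = conn.execute("""
    SELECT h.insertid, v.physical_id
    FROM hosts h JOIN virtuals v ON v.insertid = h.insertid
    ORDER BY h.insertid
""").fetchall()

# Likewise, restricting to physical hosts is an explicit request.
physical_only = [row[0] for row in conn.execute("""
    SELECT insertid FROM hosts
    WHERE insertid NOT IN (SELECT insertid FROM virtuals)
    ORDER BY insertid
""")]

print(all_linux, virtual_map, physical_only)
```

Keeping the virtual/physical relationship in a side table means existing queries keep working unchanged when virtual machines are added to the grid.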

Page 29: Nondeterministic Queries in a Relational Grid Information Service

Related Work
• SLP, X.500, LDAP
• Condor ClassAds
• MDS
• R-GMA
• Redline
• Random sampling from databases
  – Olken, others

Page 30: Nondeterministic Queries in a Relational Grid Information Service

Conclusions
• GIS system based on relational data model
• Powerful queries, but expensive to execute
• Nondeterminism to control query time
  – Can be implemented without RDBMS support
  – Automated query translation in RGIS
• Several techniques to implement deadlines for queries

Page 31: Nondeterministic Queries in a Relational Grid Information Service

People and Acknowledgements

• Students
  – Jason Skicewicz, Andrew Weinrich (Web + SOAP), Jack Lange (CDN)
• Collaborator
  – Relational Grid Resources Project at Indiana
    • Beth Plale
    • http://www.cs.indiana.edu/~plale/projects/RGR
• Funder
  – NSF

Page 32: Nondeterministic Queries in a Relational Grid Information Service

For More Information

• URGIS Site
  – http://www.cs.northwestern.edu/~urgis
• Prescience Lab
  – http://plab.cs.northwestern.edu

Join The User Comfort Study! http://comfort.cs.northwestern.edu

Special Advertising Section