Searching Technology For a Large Number Of Objects Kurt Stockinger and John Wu Lawrence Berkeley National Laboratory


Searching Technology For a Large Number Of Objects

Kurt Stockinger and John Wu

Lawrence Berkeley National Laboratory


SDM All-hands, October 2005 2

Outline

• Current work

— FastBit: a compressed bitmap indexing package

— Applications:

• Grid Collector

• DEX

• TBitmapIndex

• Network Flow Data Analysis

• Future Plans

— Extending the searching technology

— Integrating with other SDM center technologies


FastBit

A compressed bitmap indexing technology for efficient searching of read-only data

John Wu, Ekow Otoo, Arie Shoshani

Kurt Stockinger, Doron Rotem

http://sdm.lbl.gov/fastbit


FastBit Overview

• FastBit is designed to search multi-dimensional data

— Conceptually in table format

• rows correspond to objects

• columns correspond to attributes

• FastBit uses vertical (column-oriented) organization for the data

— Efficient for analysis of read-only data

• FastBit uses compressed bitmap indices to speed up searches

— Proven in analysis to be optimal for single-attribute queries

— Superior to other optimal indices because they are also efficient for multi-attribute queries
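As a rough illustration of the idea on this slide, the sketch below stores a small table column by column and builds one bitmap per distinct value of each column. It is a toy in plain Python, not FastBit's API: the helper names `build_bitmap_index` and `rows_from_bitmap` and the sample data are made up for illustration, and the bitmaps are left uncompressed.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Return {value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def rows_from_bitmap(bitmap):
    """List the row numbers whose bits are set, in increasing order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Column-oriented storage: one list per attribute, one position per object (row).
# The attribute names and values are hypothetical.
table = {
    "detector": ["A", "B", "A", "C", "B", "A"],
    "trigger":  [1,   0,   1,   1,   0,   0],
}
indexes = {name: build_bitmap_index(col) for name, col in table.items()}

# Single-attribute query: detector == "A" is answered by one bitmap lookup.
hits_a = indexes["detector"]["A"]

# Multi-attribute query: detector == "A" AND trigger == 1 is a bitwise AND
# of two bitmaps, with no need to touch the other columns.
hits = hits_a & indexes["trigger"][1]

print(rows_from_bitmap(hits_a))
print(rows_from_bitmap(hits))
```

The point of the sketch is that each extra condition costs one more bitwise operation on bitmaps that already exist, which is the reason the slide gives for bitmap indices handling multi-attribute queries well.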


Grid Collector

Put FastBit and SRM together to improve the efficiency of STAR analysis jobs

John Wu, Junmin Gu, Jerome Lauret, Arthur M. Poskanzer, Arie Shoshani, Alexander Sim, Wei-Ming Zhang

http://www.star.bnl.gov/


Grid Collector Features

Key features of the Grid Collector:

— Providing transparent object access

— Selecting objects based on their attribute values

— Improving analysis system’s throughput

— Enabling interactive distributed data analysis


Grid Collector Speeds up Analyses

• Legend

— Selectivity: fraction of events needed by the analysis

— Speedup: ratio of the time to read events without the Grid Collector (GC) to the time with GC

— Speedup = 1: speed of the existing system (without GC)

• Results

— When searching for rare events, say, selecting one event out of 1000 (selectivity = 0.001), using GC is 20 to 50 times faster

— Even when GC is used to read half of the events (selectivity = 0.5), the speedup is greater than 1.5

[Figure: two plots of speedup versus selectivity for Sample 1, Sample 2 and Sample 3. One uses linear axes (selectivity 0 to 1, speedup 0 to 5); the other uses logarithmic axes (selectivity 0.00001 to 1, speedup 1 to 1000). The selectivity axis is annotated "less selective" and "more selective".]
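The two quantities in the legend are simple ratios, and a minimal sketch of the arithmetic follows. The function names and the timings are hypothetical and are not measurements from the talk.

```python
def selectivity(events_selected, events_total):
    """Fraction of events needed by the analysis."""
    return events_selected / events_total

def speedup(seconds_without_gc, seconds_with_gc):
    """Ratio of the time to read events without GC to the time with GC."""
    return seconds_without_gc / seconds_with_gc

# Hypothetical job: one event in a thousand is needed, and the read time
# drops from 600 s to 20 s when the Grid Collector (GC) is used.
s = selectivity(1_000, 1_000_000)
x = speedup(600.0, 20.0)

print(f"selectivity = {s}, speedup = {x}")
```

A speedup of exactly 1 means GC neither helps nor hurts relative to the existing system.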


DEX: Using Efficient Bitmap Indices to Accelerate Scientific Visualization

Kurt Stockinger, John Shalf, Wes Bethel, John Wu

Computational Research Division

Lawrence Berkeley National Laboratory

Berkeley, California


DEX: Dexterous Data Explorer

[Diagram labels: Data; Query; Visualization Toolkit (VTK); 3D visualization of a supernova explosion]


Performance Results with Scientific Data

Isosurface Extraction for Combustion Data

[Figure: time in seconds (0 to 2) versus isovalue (1 to 11) for vtkMarchingCubes, vtkContourFilter, vtkKitwareContourFilter and DEX.]

One of the simplest tasks DEX performs is finding an isosurface.

DEX is on average three to four times faster than the best isosurface algorithm in VTK.

VTK rendering time: 0.2 – 2 seconds.


Query-Driven Visualization of Combustion Data Set

a) Query: CH4 > 0.3

b) Query: temp < 3

c) Query: CH4 > 0.3 AND temp < 3

d) Query: CH4 > 0.3 AND temp < 4
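The four queries above can be sketched as bitmaps over grid cells that are combined with a bitwise AND. This is only an illustration of the query logic under made-up data: the cell values and the helpers `condition_bitmap` and `selected_cells` are hypothetical, and the bitmaps here are built by a plain scan rather than looked up in a FastBit index.

```python
def condition_bitmap(column, predicate):
    """Return an int whose bit i is set when predicate(column[i]) holds."""
    bitmap = 0
    for cell, value in enumerate(column):
        if predicate(value):
            bitmap |= 1 << cell
    return bitmap

def selected_cells(bitmap):
    """List the cell numbers whose bits are set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical values for eight grid cells of a combustion data set.
ch4  = [0.10, 0.35, 0.50, 0.05, 0.40, 0.31, 0.20, 0.90]
temp = [2.0,  2.5,  3.5,  1.0,  4.5,  2.9,  3.2,  3.9]

ch4_high = condition_bitmap(ch4, lambda v: v > 0.3)   # query a
temp_lt3 = condition_bitmap(temp, lambda v: v < 3)    # query b
temp_lt4 = condition_bitmap(temp, lambda v: v < 4)

query_c = ch4_high & temp_lt3   # CH4 > 0.3 AND temp < 3
query_d = ch4_high & temp_lt4   # CH4 > 0.3 AND temp < 4

print(selected_cells(query_c))
print(selected_cells(query_d))
```

Only the selected cells would then be handed to the visualization stage. Relaxing the temperature threshold from 3 to 4 can only add cells, so the result of query c is always a subset of the result of query d.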


TBitmapIndex: An attempt to introduce FastBit to ROOT

Kurt Stockinger (1), John Wu (1), Rene Brun (2), Philippe Canal (3)

(1) Berkeley Lab, Berkeley, USA

(2) CERN, Geneva, Switzerland

(3) Fermilab, Batavia, USA


Current Status

• Built a prototype wrapper on FastBit called TBitmapIndex

— Read one variable at a time into memory to build index

— Each index is currently stored in a binary file

• Integrated bitmap indices to support:

— TTree::Draw

— TTree::Chain

• Verified the performance advantage of FastBit vs. ROOT’s TTreeFormula


Experiments With BaBar Data

• Software/Hardware:

— Bitmap Index Software is implemented in C++

— Tests carried out on:

• Linux CentOS

• 2.8 GHz Intel Pentium 4 with 1 GB RAM

• Hardware RAID with SCSI disk

• Data:

— BaBar data set

— 7.6 million records with ~100 attributes each

• Bitmap Indices (FastBit):

— 10 out of ~100 attributes

— 1000 equality-encoded bins

— 100 range-encoded bins
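The difference between the two encodings named above can be sketched as follows, using four bins instead of the 1000 and 100 used in the experiments. In an equality-encoded index, bitmap k marks the rows that fall in bin k; in a range-encoded index, bitmap k marks the rows that fall in any bin up to and including k. The code is a plain-Python illustration with made-up data and helper names, not FastBit's implementation.

```python
from bisect import bisect_right

def bin_of(value, edges):
    """Bin k holds values with edges[k] <= value < edges[k + 1]."""
    return bisect_right(edges, value) - 1

def equality_encode(column, edges):
    """Bitmap k has bit i set when row i falls in bin k."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(column):
        bitmaps[bin_of(value, edges)] |= 1 << row
    return bitmaps

def range_encode(equality_bitmaps):
    """Bitmap k has bit i set when row i falls in any of bins 0..k."""
    cumulative, acc = [], 0
    for bitmap in equality_bitmaps:
        acc |= bitmap
        cumulative.append(acc)
    return cumulative

edges = [0, 10, 20, 30, 40]            # four bins; hypothetical boundaries
column = [5, 37, 12, 25, 19, 3, 31]    # hypothetical values, all in [0, 40)

ee_bitmaps = equality_encode(column, edges)
re_bitmaps = range_encode(ee_bitmaps)

# Range query "value < 30" covers bins 0..2.
via_equality = ee_bitmaps[0] | ee_bitmaps[1] | ee_bitmaps[2]   # three bitmaps ORed
via_range = re_bitmaps[2]                                      # one bitmap read

# Equality query "value in bin 2" is one lookup with equality encoding,
# and a difference of two bitmaps with range encoding.
bin2_via_range = re_bitmaps[2] & ~re_bitmaps[1]

print(bin(via_equality), bin(via_range), bin(bin2_via_range))
```

The sketch shows the usual trade-off: range encoding answers one-sided range conditions from a single bitmap, while equality encoding needs one bitmap per bin touched.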


Size of Compressed Bitmap Indices

Total size of all 10 attributes

[Figure: bar chart of size in bytes (0 to 6.E+08) for the base data, EE-BMI and RE-BMI.]

EE-BMI: equality-encoded bitmap index

RE-BMI: range-encoded bitmap index
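To give a feel for why bitmaps can be stored compactly, the sketch below compresses a sparse bitmap with plain run-length encoding. This is a stand-in chosen for brevity: the slides do not describe the compression scheme FastBit uses, and the function names and the sample bitmap are hypothetical.

```python
def run_length_encode(bits):
    """Compress a list of 0/1 values into [bit, run_length] pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs

def run_length_decode(runs):
    """Expand [bit, run_length] pairs back into a list of 0/1 values."""
    bits = []
    for bit, length in runs:
        bits.extend([bit] * length)
    return bits

# With many bins, each bitmap is mostly zeros: here 3 set bits out of 100.
bitmap = [0] * 40 + [1] * 3 + [0] * 57
runs = run_length_encode(bitmap)

print(len(bitmap), "bits stored as", len(runs), "runs:", runs)
```

Long runs of identical bits collapse into a single pair, so the more bins an index has, the sparser and more compressible each individual bitmap tends to be.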


Query Performance - TTreeFormula vs. Bitmap Indices

[Figure: three panels (1-Dimensional Queries, 5-Dimensional Queries, 10-Dimensional Queries) plotting time in seconds against "Query box" (0.00001 to 1) on logarithmic axes for TTreeFormula, BMI-EE and BMI-RE.]

Bitmap indices 10X faster than TTreeFormula


An Application of TBitmapIndex: Network Flow Data Analysis

Kurt Stockinger, John Wu, Scott Campbell, Stephen Lau, Mike Fisk, Eugene Gavrilov, Alex Kent, Christopher E. Davis, Rick Olinger, Rob Young, Jim Prewett, Paul Weber, Thomas P. Caudell, E. Wes Bethel, Steve Smith

LBNL, LANL, UNM


Chasing the Track of a Network Scan

• IDS log shows

— Jul 28 17:19:56 AddressScan 221.207.14.164 has scanned 19 hosts (62320/tcp)

— Jul 28 19:19:56 AddressScan 221.207.14.88 has scanned 19 hosts (62320/tcp)

• Using FastBit/ROOT to explore what else might be going on

• Queries prepared by Scott Campbell. More details at http://www.nersc.gov/~scottc/papers/ROOT/rootuse.prod.html


Are There More Scans?

• Query: select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=211 and IPS_B=207

• More scans from the same subnet


Who Is Doing It?

• Query: select IPS_C, IPS_D where IPS_A==211 and IPS_B==207

• Picture: the histogram of the IPS_C and IPS_D

• Five IP addresses started most of the scans!
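The two network-flow queries can be mimicked in plain Python as a sketch of what they compute. Several things here are assumptions, since the slides do not define the schema: IPS_A to IPS_D are taken to be the four octets of the source address, IPR_C and IPR_D the last two octets of the destination address, and ts a Unix timestamp in seconds. The flow records and function names are made up.

```python
from collections import Counter
from datetime import date, timedelta

DAY = 60 * 60 * 24
DAY_OFFSET = 12843   # constant taken from the query on the earlier slide

def octets(ip):
    """Split a dotted IPv4 address into its four integer octets (A, B, C, D)."""
    a, b, c, d = (int(part) for part in ip.split("."))
    return a, b, c, d

# Hypothetical flow records: (ts, source address, destination address).
flows = [
    (12992 * DAY + 62396, "211.207.14.164", "192.0.2.10"),
    (12992 * DAY + 69596, "211.207.14.88",  "192.0.2.77"),
    (12993 * DAY + 100,   "211.207.14.164", "192.0.2.11"),
    (12993 * DAY + 200,   "198.51.100.7",   "192.0.2.12"),
]

def scans_from_subnet(flows, a, b):
    """select ts/(60*60*24)-12843, IPR_C, IPR_D where IPS_A=a and IPS_B=b"""
    rows = []
    for ts, src, dst in flows:
        src_octets, dst_octets = octets(src), octets(dst)
        if src_octets[0] == a and src_octets[1] == b:
            rows.append((ts / DAY - DAY_OFFSET, dst_octets[2], dst_octets[3]))
    return rows

def source_histogram(flows, a, b):
    """Count flows per (IPS_C, IPS_D) where IPS_A=a and IPS_B=b."""
    counts = Counter()
    for ts, src, dst in flows:
        src_octets = octets(src)
        if src_octets[0] == a and src_octets[1] == b:
            counts[(src_octets[2], src_octets[3])] += 1
    return counts

rows = scans_from_subnet(flows, 211, 207)
counts = source_histogram(flows, 211, 207)

# If ts is in Unix seconds, subtracting 12843 days makes 1 March 2005 day zero.
day_zero = date(1970, 1, 1) + timedelta(days=DAY_OFFSET)

print(rows)
print(counts.most_common())
print(day_zero)
```

The histogram step is what surfaces the handful of addresses responsible for most of the scans: each distinct (IPS_C, IPS_D) pair is one host inside the subnet, and `most_common()` ranks them by flow count.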


Future Plans

Meet the challenges of searching in data intensive sciences


Types of Searching Problems

• Not practical to work on many terabytes of data simultaneously; work on a subset instead

— Analyze the data collected last month

— Analyze the data collected by Joe

• Find the objects of interest

— Find the flame front in combustion simulation

— Find the top-talker in network communication

• Knowledge discovery

— Association rules

— Cliques/connection subgraphs


Searching Problems From SciDAC2 Appendix

• B.1 Experimental Combustion Science: feature identification and tracking; 20 TB

• B.8 Empowering RHIC users with new analysis tools: analyze subsets; ~GB/s

• B.10 U.S. LHC Experiments: analyze subsets; ~GB/s

• B.13 The Solenoid Tracker at RHIC (STAR): analyze subsets; 1 GB/s

• B.2 Advanced Computing for LCLS: ?, classification; 200 MB/s

• B.3 An Earth Science Knowledge System: locating dataset of interest; PB

• B.5 Enabling Discovery in Experimental Biological Science: high-dimensional data search, data versioning, semantic graphs (ontology), multiple sources


Searching Problems From SciDAC2 Appendix

• B.4 Remote operations of LHC, CMS and ITER: streaming data

• B.9 ARM/ACRF Program: instrument data streams

• B.6 Enhancing Material Science Beamline: ND data array, real-time processing; 1 GB/h ?

• B.7 Large-Scale Computation for ITER: data management

• B.11 Nanoscience: mining simulation data together with experimental data

• B.12 The Spallation Neutron Source: real-time image analysis, data comparison; 20 MB/s


Features of These Search Problems

• Large: many datasets are petabytes in size, with billions of records

• Complex data: multi-dimensional arrays, user-defined data types, simulation data mixed with experimental data, regular data with attributes defined by ontologies (semantic networks)

• Complex searching: data versioning, provenance-based search, catalog matching

• Beyond searching: data mining and knowledge discovery

• Real-time response: instrument control, interactive design of experiments, computational steering

• Integrated: searching is only a part of the overall data analysis; need to improve the overall throughput


Improve Existing Searching Tools

• FastBit is efficient for range queries; need to support other types of queries, e.g., joins

• FastBit is efficient for read-only data; need to support update

• FastBit supports up to 2^32 (about 4 billion) records; need to support at least 2^64 (about 18 quintillion) records

• FastBit allows the user to choose from many different types of indices; need to choose one automatically for the user


Expand The Repertoire Of Searching Tools

• Support parallel index building and searching

• Support search of semantic networks, combining ontology with structured data

• Support data versioning (time stamps, provenance, …)

• Support robust recovery (a la POSTGRES)

• Support user-defined data types (ROOT)

• Support user-defined functions

• Support commonly used B-trees and R-trees

• Support combined searching of structured and semi-structured data, extend


Extend The Accessibility Of The Tools

• Extend the collaboration with ROOT to make FastBit seamlessly available to users

— Implemented a prototype; need a more integrated way to read and write ROOT files

• Read data from other common file formats; write indices to the same file formats

— netCDF, HDF (4/5)

• Extend the advantage of searching to other steps of analysis

— Feature tracking; extending it to higher dimensions; more general image analysis

• Make FastBit available in other forms

— Web service, an actor in Kepler, …


Summary

• FastBit is efficient for range queries on read-only data

• Integration of FastBit with ROOT is under way

— TBitmapIndex prototype

• Integration with other systems possible

— Need to develop a short list based on target application area

• Plan to extend FastBit

— Integration with ROOT will bring up a list of requirements

— Intend to target biological applications