MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices
Demetris Zeinalipour [ [email protected] ]
School of Pure and Applied Sciences
Open University of Cyprus
http://www.cs.ucy.ac.cy/~dzeina/
IBM Research, Zurich, Switzerland, Dec. 12th, 2008
Presentation Goals
• To provide an overview of Wireless Sensor Networks and related Data Acquisition Frameworks.
• To highlight some important storage and retrieval challenges that arise in this context.
Acknowledgements
• This is a joint work with my collaborators at the University of California – Riverside.
• Our results were presented in the following papers:
– "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos and W. Najjar, The 4th USENIX Conference on File and Storage Technologies (FAST'05), San Francisco, USA, December 2005.
– "Efficient Indexing Data Structures for Flash-Based Sensor Devices", S. Lin, D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos, W. Najjar, ACM Transactions on Storage (TOS), ACM Press, Vol. 2, No. 4, pp. 468-503, November 2006.
Presentation Outline
1. Overview of Wireless Sensor Networks
2. Overview of Data Acquisition Frameworks
3. The MicroHash Index Structure
4. MicroHash Experimental Evaluation
5. Conclusions and Future Work
Wireless Sensor Devices
• Resource-constrained devices utilized for monitoring and researching the physical world at high fidelity.
Examples: Xbow's i-mote2, UC-Riverside's RISE, Xbow's TelosB, UC-Berkeley's Mica2dot, Xbow's Mica
Wireless Sensor Device
• Radio: used for transmitting the acquired data to some storage site (SINK) (9.6Kbps-250Kbps)
• Storage
• Sensors: numeric readings in a limited range (e.g., temperature -40F..+250F with one decimal point precision) at a high frequency (2-2000Hz)
Wireless Sensor Network
Wireless Sensor Networks
• Applications have already emerged in:
– Environmental and habitat monitoring
– Seismic and structural monitoring
– Understanding animal migrations & interactions between species
– Automation, tracking, hazard-monitoring scenarios, urban monitoring, etc.
Great Duck Island, Maine (Temperature, Humidity, etc.)
Golden Gate, SF (Vibration and displacement of the bridge structure)
Zebranet, Kenya (GPS trajectory)
Wireless Sensor Networks
The Great Duck Island Study (Maine, USA)
• Large-scale deployment by Intel Research, Berkeley in 2002-2003 (Maine, USA).
• Focuses on monitoring the microclimate in and around the nests of endangered species which are sensitive to disturbance.
• They deployed more than 166 motes installed in remote locations (such as 1000 feet into the forest).
Wireless Sensor Networks
[Diagram: data collected by the sensor network is made available through a WebServer]
Wireless Sensor Networks
The James Reserve Project, CA, USA
Available at: http://dms.jamesreserve.edu/
Wireless Sensor Networks
Microsoft's SenseWeb/SensorMap Technology
Available at: http://research.microsoft.com/nec/SenseWeb/
• SenseWeb: a peer-produced sensor network that consists of sensors deployed by contributors across the globe.
• SensorMap: a mashup of SenseWeb's data on a map interface.
Examples: Swiss Experiment (SwissEx), 6 sites in the Swiss Alps; Chicago (Traffic, CCTV Cameras, Temperature, etc.)
Characteristics
1. The Energy Source is limited.
Energy sources: AA batteries, Solar Panels.
2. Local Processing is cheaper than transmitting over the radio.
Transmitting 1 Byte over the Radio consumes as much energy as ~1200 CPU instructions.
3. Local Storage is cheaper than transmitting over the radio.
Transmitting 512B over a single-hop 9.6Kbps (915MHz) radio requires 82,000μJ, while writing the same 512B to local flash requires only 760μJ (roughly a 108x difference).
Presentation Outline
1. Overview of Wireless Sensor Networks (WSN)
2. Overview of Data Acquisition Frameworks
3. The MicroHash Index Structure
4. MicroHash Experimental Evaluation
5. Conclusions and Future Work
Centralized Storage
• A Database that collects readings from many Sensors.
• Centralized: Storage, Indexing, Query Processing, Triggers, etc.
Centralized Storage I
Crossbow's MoteView software
Available at: http://www.xbow.com/
• NO in-network Aggregation/Filtering
• NO in-network Storage
Centralized Storage II
TinyDB: A Declarative Interface for Data Acquisition in Sensor Networks.
Available at: http://telegraph.cs.berkeley.edu/tinydb/
• In-Network Aggregation/Filtering
• Limited In-Network Storage (No Indexes)
[Diagram: in-network aggregation over a routing tree for the query below: nodes v1..v5 report their temperature readings (65-90) and each hop forwards only the partial maximum, so the sink receives MAX = 90]
e.g., SELECT MAX(temp) FROM sensors
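To make the in-network aggregation concrete, here is a minimal C sketch (not TinyDB's actual implementation) of how a MAX aggregate is combined hop-by-hop up a collection tree; the node structure and the readings are illustrative:

    #include <stdio.h>

    #define MAX_CHILDREN 4

    /* Hypothetical routing-tree node: each sensor holds its own
     * reading and pointers to its children in the collection tree. */
    typedef struct node {
        int reading;                         /* local temperature sample  */
        struct node *children[MAX_CHILDREN]; /* NULL-terminated child list */
    } node_t;

    /* Each node forwards only the partial MAX of its subtree, so a
     * single value (not every reading) travels over each radio link. */
    int subtree_max(const node_t *n) {
        int best = n->reading;
        for (int i = 0; i < MAX_CHILDREN && n->children[i]; i++) {
            int child = subtree_max(n->children[i]);
            if (child > best)
                best = child;
        }
        return best;
    }

    int main(void) {
        node_t v5 = { 90, { 0 } }, v4 = { 85, { 0 } };
        node_t v3 = { 75, { &v5, 0 } }, v2 = { 70, { &v4, 0 } };
        node_t v1 = { 65, { &v2, &v3, 0 } };       /* root of the tree  */
        printf("MAX(temp) = %d\n", subtree_max(&v1)); /* prints 90      */
        return 0;
    }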
Centralized Storage: Conclusions
• Frameworks such as TinyDB:
– Are suitable for continuous queries.
– Push aggregation into the network but keep much of the processing at the sink.
• New Challenges:
– Many applications DON'T require the continuous evaluation of user queries (e.g., historic query: Find the average temperature for the last 6 months).
– In many applications there is no sink (e.g., remote deployments and mobile sensor networks).
– Local Storage on sensor devices keeps growing:
- RISE supports a 1GB external SD Card
- The i-mote2 supports 32MB Flash / 32MB SRAM
Our Model: In-Situ Data Storage
1. Data remains in-situ (at the generating site) in a sliding-window fashion.
2. When required, users conduct on-demand queries to retrieve information of interest.
A Network of Sensor Databases
[Diagram: sensor databases queried on demand from the sink / programming board]
In-Situ Data Storage: Motivation
Center for Conservation Biology @ UCR: Research on Soil Organisms
– A set of sensors monitors the CO2 levels in the soil over a large window of time.
– Not a real-time application.
– Most acquired values are not of particular interest.
D. Zeinalipour-Yazti, S. Neema, D. Gunopulos, V. Kalogeraki and W. Najjar, "Data Acquisition in Sensor Networks with Large Memories", IEEE Intl. Workshop on Networking Meets Databases NetDB (ICDE'2005), Tokyo, Japan, 2005.
Presentation Outline
1. Overview of Wireless Sensor Networks
2. Overview of Data Acquisition Frameworks
3. The MicroHash Index Structure
4. MicroHash Experimental Evaluation
5. Conclusions and Future Work
Flash Memory at a Glance
• The most prevalent storage medium used for Sensor Devices is Flash Memory (NAND Flash).
• Fastest growing memory market ('05: $8.7B, '06: $11B).
(NAND) Flash Advantages
• Simple Cell Architecture (high capacity on a small surface) => Economical Reproduction
• Fast Random Access (50-80μs), compared to 10-20ms for Disks
• Shock Resistant
• Power Efficient
[Photos: surface-mount NAND flash and removable NAND devices]
Asymmetric Read/Write Energy Cost! (Measurements using RISE)
Flash Memory at a Glance
1. Write Constraint: Writing can only be performed at a page granularity (256B~512B) and only to an empty page (if occupied, its content must first be erased).
2. Delete Constraint: Erasure can only be performed at a block granularity (i.e., 8KB~64KB).
3. Wear Constraint: Each page can only be written a limited number of times (typically 10,000-100,000).
[Diagram: the flash media organized as blocks 1..n, each consisting of occupied and empty pages]
Energy per operation (page size = 512B): Read = 24μJ, Write = 763μJ, Block Erase = 425μJ.
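As an illustration, the following toy C model enforces the three constraints above (page-granularity writes, block-granularity erases, and a wear counter); all names and sizes are assumptions for the sketch, not MicroHash code:

    #include <stdbool.h>
    #include <string.h>

    #define PAGE_SIZE       512                 /* bytes per page        */
    #define PAGES_PER_BLOCK 16                  /* 8KB erase unit        */
    #define NUM_PAGES       256                 /* toy media: 16 blocks  */
    #define MAX_ERASES      10000               /* wear limit per block  */

    static unsigned char media[NUM_PAGES][PAGE_SIZE];
    static bool occupied[NUM_PAGES];
    static int  erase_count[NUM_PAGES / PAGES_PER_BLOCK];

    /* Write constraint: one whole page, and only onto an empty page. */
    bool write_page(int page, const unsigned char buf[PAGE_SIZE]) {
        if (occupied[page])
            return false;            /* must erase the whole block first */
        memcpy(media[page], buf, PAGE_SIZE);
        occupied[page] = true;
        return true;
    }

    /* Delete constraint: erasure happens a block at a time.
     * Wear constraint: each erase counts toward the block's lifetime. */
    bool erase_block(int block) {
        if (erase_count[block] >= MAX_ERASES)
            return false;            /* block is worn out */
        for (int p = block * PAGES_PER_BLOCK;
             p < (block + 1) * PAGES_PER_BLOCK; p++) {
            memset(media[p], 0xFF, PAGE_SIZE);  /* erased state */
            occupied[p] = false;
        }
        erase_count[block]++;
        return true;
    }

    int main(void) {
        unsigned char buf[PAGE_SIZE] = { 0 };
        write_page(0, buf);          /* ok: page 0 was empty        */
        write_page(0, buf);          /* rejected: page is occupied  */
        erase_block(0);              /* frees pages 0..15 at once   */
        return 0;
    }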
MicroHash Index Objectives
General Objectives
• Provide efficient access to any record stored on flash by timestamp or value.
• Execute a wide spectrum of queries based on our index, similarly to generic DB indexes.
Design Objectives (Adhere to Flash Constraints):
• Avoid wearing out specific pages.
• Minimize random-access deletions of pages.
• Minimize main memory (SRAM) structures:
– SRAM is extremely limited (8KB-64KB).
– Small memory footprint => quick initialization.
Main Structures
• 4 Page Types: a) Root Page, b) Directory Page, c) Index Page and d) Data Page
• 4 Phases of Operation: a) Initialization, b) Growing, c) Repartition and d) Garbage Collection
Growing the MicroHash Index
• Collect data in an SRAM buffer page Pwrite.
• When Pwrite is full, flush it out to the flash media.
• Next, create index records for each data record in Pwrite.
• If SRAM gets full, Index pages are forced out to the flash media by an LRU policy (sketched below).
[Diagram: a reading (ts, 74F) is hashed through the in-SRAM directory (buckets 40-90) to its index page; full Pwrite buffers are flushed sequentially to the flash media, and idx points to the next empty page of the populated media]
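The write path sketched in C, under assumed record and page sizes (a simplification standing in for the nesC implementation): records accumulate in Pwrite, a full buffer is flushed with one sequential page write, and one index record is created per data record:

    #define RECORDS_PER_PAGE 25    /* ~512B page / ~20B records (assumed) */

    typedef struct { unsigned long ts; short value; } data_record_t;

    /* Stubs standing in for the flash driver and the directory (assumptions): */
    static int  flash_append_page(const void *page) { (void)page; static int next = 0; return next++; }
    static int  hash_bucket(short value) { return value / 10; }  /* equiwidth buckets */
    static void index_add(int bucket, int page_id) { (void)bucket; (void)page_id; }

    /* SRAM write buffer Pwrite: records accumulate here before one
     * sequential page write to the flash media. */
    static data_record_t pwrite[RECORDS_PER_PAGE];
    static int nbuf = 0;

    void add_reading(unsigned long ts, short value) {
        pwrite[nbuf].ts = ts;
        pwrite[nbuf].value = value;
        if (++nbuf < RECORDS_PER_PAGE)
            return;                              /* keep buffering in SRAM */

        /* Flush the full buffer with a single sequential page write ...   */
        int page_id = flash_append_page(pwrite);

        /* ... then create one index record per data record; index pages   */
        /* live in SRAM and are LRU-evicted to flash when SRAM fills up.   */
        for (int i = 0; i < RECORDS_PER_PAGE; i++)
            index_add(hash_bucket(pwrite[i].value), page_id);
        nbuf = 0;
    }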
Garbage Collection in MicroHash
• When the media gets full, some pages need to be deleted => delete the oldest pages.
• Oldest Block? The next block following the idx pointer.
Note:
• This might create invalid index records.
• These will be handled by our search algorithm.
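Because the media is written as a circular log, reclaiming space is a single block erase right after the write pointer; a minimal sketch that continues the toy flash model above (names are assumptions):

    /* Continues the toy flash model sketched earlier. */
    extern int  idx;                     /* next empty page (from the Root Page) */
    extern bool erase_block(int block);  /* block-granularity erase, see above   */

    void garbage_collect(void) {
        int nblocks = NUM_PAGES / PAGES_PER_BLOCK;
        int oldest  = (idx / PAGES_PER_BLOCK + 1) % nblocks;  /* block after idx */
        /* Erasing may orphan index records that point into this block;
         * the search algorithm detects and skips such invalid records. */
        erase_block(oldest);
    }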
Directory Repartition in MicroHash
• MicroHash starts out with a directory that is segmented into equiwidth buckets
– e.g., divide the temperature range [0,100] into c buckets.
• This is not efficient, as certain buckets will never be utilized
– Consider the first few or last few buckets below.
Directory Repartition in MicroHash
• If bucket A links to more than τ index pages, evict the least-used bucket B and split the full bucket A into A and A'.
• We want to avoid bucket reassignments of old records, as this would be very expensive.
Each bucket maintains C (#entries since last split) and S (timestamp of last addition). [Diagram: example with τ=3, where Add(18) triggers the split of an overflowing bucket A.]
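A sketch of the split rule in C; the C and S bookkeeping follows the slide, while interpreting "least used" as smallest S and halving the value range are simplifying assumptions:

    #define TAU 3                 /* split threshold τ (as in the example) */

    typedef struct {
        short lo, hi;             /* value range covered by the bucket */
        int   pages;              /* # index pages linked to it        */
        int   c;                  /* C: #entries since last split      */
        unsigned long s;          /* S: timestamp of last addition     */
    } bucket_t;

    /* If bucket a links to more than TAU index pages, reuse the slot of
     * the least-recently-used bucket (smallest S) to split a's range in
     * half. Old records are NOT rehashed; only future insertions see
     * the refined ranges. */
    void maybe_split(bucket_t dir[], int ndir, int a) {
        if (dir[a].pages <= TAU)
            return;
        int b = (a == 0) ? 1 : 0;             /* find the LRU bucket != a */
        for (int i = 0; i < ndir; i++)
            if (i != a && dir[i].s < dir[b].s)
                b = i;
        short mid = (short)((dir[a].lo + dir[a].hi) / 2);
        dir[b] = dir[a];                      /* b becomes A' (upper half) */
        dir[b].lo = mid;
        dir[a].hi = mid;                      /* A keeps the lower half    */
        dir[a].pages = dir[b].pages = 0;      /* both start fresh chains   */
        dir[a].c = dir[b].c = 0;
    }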
Searching in MicroHash
• Searching by value
"Find the timestamp(s) on which the temperature was 100F"
– A simple operation in MicroHash.
– We simply find the right Directory Bucket, from there the respective index page, and then the data record (page-by-page).
• Searching by timestamp
"Find the temperature of some sensor on a given timestamp tq"
– Problem: Index pages are mixed together with data pages.
– Solutions:
1. Binary Search (O(logn), 18 page reads for a 128MB media)
2. LBSearch (fewer than 10 page reads for a 128MB media)
3. ScaleSearch (better than LBSearch, ~4.5 page reads for a 128MB media)
LBSearch and ScaleSearch
Solutions to the search-by-timestamp problem:
A) LBSearch: We recursively create a lower bound on the position of tq until the given timestamp is located.
B) ScaleSearch: Quite similar to LBSearch; however, in the first step we proceed more aggressively (by exploiting the data distribution).
[Diagram: searching for tq=500; successive lower-bound jumps land on pages with timestamps 300, 350, 420 and 490 before reaching 500]
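A sketch of the LBSearch iteration: if every data page holds at most R records and readings arrive about once per time unit, a forward jump of (tq - t)/R pages can only undershoot, since index pages interleaved among the data pages push the target further away; each step therefore lands on a new, tighter lower bound. This is a simplified model under those assumptions, not the paper's exact procedure:

    #define R 25                  /* records on a full data page (assumed) */

    unsigned long page_anchor(int page);   /* newest timestamp on a page */

    int lb_search(int low_page, unsigned long tq) {
        int page = low_page;               /* known lower bound on tq's position */
        unsigned long t = page_anchor(page);
        while (t < tq) {
            int jump = (int)((tq - t) / R);  /* treat all pages as full data pages */
            if (jump == 0)
                jump = 1;                    /* final page-by-page steps */
            page += jump;
            t = page_anchor(page);
        }
        return page;                         /* first page with anchor >= tq */
    }

ScaleSearch differs mainly in the first jump, which it scales using the observed data distribution rather than the worst case.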
Search Bottlenecks
• Index Pages written to flash might not be fully occupied.
• When we access these pages, we transfer a lot of empty bytes (padding) between the flash media and SRAM.
• Proposed Solutions:
– Solution 1: Two-Phase Page Reads
– Solution 2: ELF-like Chaining of Index Pages
Improving Search Performance
• Solution 1: Utilize Two-Phase Page Reads.
– First read the 8B page header from the flash media.
– Then read only the occupied payload in the second phase.
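A sketch of the two-phase read, assuming a driver that can read a byte range within a page; the header layout and function names are illustrative, not the actual on-flash format:

    #include <stdint.h>

    #define HDR_SIZE 8

    /* Assumed 8B on-page header: the record count tells us how much of
     * the payload is actually occupied. */
    typedef struct {
        uint16_t nrecords;       /* # of index records on the page     */
        uint16_t reserved;
        uint32_t anchor_hint;    /* e.g., part of the anchor timestamp */
    } page_header_t;

    void flash_read(int page, int offset, void *dst, int len);  /* assumed driver call */

    /* Phase 1 fetches only the header; phase 2 fetches only the live
     * payload, skipping the padding of a non-full index page. */
    int read_index_page(int page, void *buf, int record_size) {
        page_header_t hdr;
        flash_read(page, 0, &hdr, HDR_SIZE);                         /* phase 1 */
        flash_read(page, HDR_SIZE, buf, hdr.nrecords * record_size); /* phase 2 */
        return hdr.nrecords;
    }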
Improving Search Performance
• Solution 2: Avoid non-full index pages using ELF*.
– ELF: a linked list in which each page, other than the last one, is completely full.
– It keeps copying the last non-full page into a newer page when new records are requested to be added.
*Dai et al., "ELF: An Efficient Log-Structured Flash File System for Micro Sensor Nodes", SenSys 2004.
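The copy-forward behavior can be sketched as follows, reusing the assumed append-only flash call from the earlier sketches:

    #define IDX_PER_PAGE 60          /* index records per page (assumed) */

    typedef struct { int data_page; } index_record_t;

    extern int flash_append_page(const void *page);   /* append-only write */

    /* In-SRAM image of the chain's tail page. Flash pages cannot be
     * updated in place, so appending to a non-full tail means copying
     * the old records plus the new one into a fresh page; the stale
     * copy becomes garbage. */
    static index_record_t tail[IDX_PER_PAGE];
    static int tail_count = 0;

    int elf_append(index_record_t rec) {
        tail[tail_count++] = rec;
        int new_page = flash_append_page(tail);  /* copy-forward rewrite */
        if (tail_count == IDX_PER_PAGE)
            tail_count = 0;       /* page sealed full; start a new tail  */
        return new_page;
    }

Every page except the tail stays full (hence the ~10% fewer reads during search in our evaluation), at the price of rewriting the tail on each addition (~15% more writes).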
Presentation Outline
1. Overview of Wireless Sensor Networks
2. Overview of Data Acquisition Frameworks
3. The MicroHash Index Structure
4. MicroHash Experimental Evaluation
5. Conclusions and Future Work
Experimental Evaluation
• Implemented MicroHash in nesC.
• We tested it using TinyOS along with a trace-driven experimental methodology.
• Datasets:
– Washington State Climate: a 268MB dataset containing readings from 2000-2005.
– Great Duck Island: 97,000 readings between October and November 2002.
• Evaluation Parameters: i) Space Overhead, ii) Energy Overhead, iii) Search Performance
Space Overhead of Index
• Measure: IndexPages / (DataPages + IndexPages)
• Two index page layouts:
– Offset: an index record has the form {datapageid, offset}
– NoOffset: an index record has the form {datapageid}
• 128MB flash media (256,000 pages)
Conclusions: Space overhead is minimized with a) a larger buffer, b) NoOffset, and c) pressure data.
Space Overhead of Index
[Bitmap representations of the flash media, black denoting index pages, for a 10KB buffer vs. a 2.5KB buffer: increasing the buffer decreases the index overhead]
Search Performance
• Measure: # of page reads to find a record by timestamp
• Two index page layouts (128MB flash, varying SRAM):
– Anchor: Index Pages store the last known timestamp
– NoAnchor: the timestamp is only stored in Data Pages
Conclusions: Search performance improves with a) a larger write buffer during indexing, b) Anchors, and c) ScaleSearch.
Indexing the Great Duck Island Trace
• Used a 3KB index buffer and a 4MB flash card to store all 97,000 20-byte data readings.
– The index never requires more than 28% additional space.
– Indexing the records incurs only a small increase in energy demand (the energy cost of storing the records on flash without an index is 3042mJ).
– We were able to find any record by its timestamp with 4.75 page reads on average.
Presentation Outline
1. Overview of Wireless Sensor Networks
2. Overview of Data Acquisition Frameworks
3. The MicroHash Index Structure
4. MicroHash Experimental Evaluation
5. Conclusions and Future Work
Conclusions & Future Work
• We proposed the MicroHash index, an efficient external-memory hash index for sensor devices that addresses the distinct characteristics of flash memory.
• Our experimental evaluation shows that the structure we propose is both efficient and practical.
• Future work:
– Develop a complete library of indexes and data structures (stacks, queues, B+-trees, etc.)
– Buffer optimizations and online compression
– Support for Range Queries
MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices
Demetris Zeinalipour
Thank you! Questions?
Related Publications:
• "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", D. Zeinalipour, S. Lin, V. Kalogeraki, D. Gunopulos, W. Najjar, USENIX FAST'05.
• "Efficient Indexing Data Structures for Flash-Based Sensor Devices", ACM Transactions on Storage (TOS), November 2006.
Presentation and publications available at: http://www.cs.ucy.ac.cy/~dzeina/
Backup Slides
The Anatomy of a Sensor Device
• Processor, in various modes (sleep, idle, active)
• Power source: AA or coin batteries, solar panels
• SRAM, used for the program code and for in-memory buffering
• LEDs, used for debugging
• Radio, used for transmitting the acquired data to some storage site (SINK) (9.6Kbps-250Kbps)
• Sensors: numeric readings in a limited range (e.g., temperature -40F..+250F with one decimal point precision) at a high frequency (2-2000Hz)
• Storage
Sensor Devices & Capabilities
Sensing Capabilities:
• Light
• Temperature
• Humidity
• Pressure
• Tone Detection
• Wind Speed
• Soil Moisture
• Location (GPS)
• etc.
In-Network Storage: Data-Centric Storage
1. Data is stored on specific nodes in the network (e.g., humidity on node A and temperature on node B).
2. Locating data can be performed without flooding (e.g., temperature-related data is stored on node B).
[Diagram: the sink / programming board issues Q: SELECT nodeid WHERE temp=100F directly against the temperature store]
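The essence of data-centric storage is a deterministic attribute-to-node mapping that every node can compute, so producers and queries route to the same place without flooding; a toy sketch (the hashing scheme is an assumption, not a specific DCS protocol):

    #include <stddef.h>

    #define NUM_NODES 16

    /* Deterministic attribute-to-node mapping: every sensor computes the
     * same hash, so both the node storing "temperature" readings and a
     * query for them arrive at the same home node, with no flooding. */
    int home_node(const char *attribute) {
        unsigned h = 5381;                          /* djb2 string hash */
        for (size_t i = 0; attribute[i] != '\0'; i++)
            h = h * 33 + (unsigned char)attribute[i];
        return (int)(h % NUM_NODES);
    }
    /* e.g., home_node("temperature") names the node that answers
     * Q: SELECT nodeid WHERE temp=100F. */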
The Programming Cycle
• The Operating System
TinyOS (UC-Berkeley): Component-based architecture that allows programmers to wire together the minimum required components in order to minimize code size and energy consumption.
(The operating system is really a number of libraries that can be statically linked to the sensor binary at compile time.)
• The Programming Language
nesC (Intel Research, Berkeley): an event-based C variant optimized for programming sensor devices.
"Hello World": Blinking the red LED!

    event result_t Clock.fire() {
        state = !state;                  // toggle on every clock event
        if (state) call Leds.redOn();    // light the red LED ...
        else       call Leds.redOff();   // ... or turn it off again
    }
The Programming Cycle
The Testing Environment
• Debugging code directly on a sensor device is a tedious procedure.
• nesC allows programmers to compile their code to:
– A binary file that is burnt to the sensor
– A binary file that runs on a PC
• TOSSIM (TinyOS Simulator) is the environment that allows programmers to run the PC binary directly on a PC.
• This enables accurate simulations, fine-grained energy modeling (with PowerTOSSIM) and visualization (TinyViz).
The Programming Cycle
The Pre-deployment Environment
• Once you have created and debugged your code, you can perform a deployment in a laboratory environment.
• Harvard's MoteLab uses 190 sensors, powered from wall power and interconnected with an Ethernet connection.
• The Ethernet is just for debugging and reprogramming, while the radio is used for actual communication between motes.
• Motes can be reprogrammed through a web interface.
Available at: http://motelab.eecs.harvard.edu/
Page Types in MicroHash
• Root Page
– contains information related to the state of the flash media, e.g., the position of the last write (idx), the current write cycle (cycle) and meta-information about the various indexes stored on the flash media.
• Directory Page (the hash table)
– contains a number of directory records (buckets), each of which contains the address of the last known index page mapped to this bucket.
• Index Page
– contains a fixed number of index records and the 8-byte timestamp of the last known data record. The latter field, denoted as the anchor, is exploited by timestamp searches.
• Data Page
– contains a fixed number of data records.
Search Performance
• We compared MicroHash vs. ELF Index Page Chaining by searching for all values in the range [20,100].
• Keeping index pages full in ELF improves search performance (~10% fewer reads) but decreases insertion performance (~15% more writes).