Inserts At Drive Speed, Ben Haley, Research Director, NetQoS
TRANSCRIPT
![Page 1: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/1.jpg)
Inserts At Drive Speed
Ben Haley
Research Director
NetQoS
![Page 2: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/2.jpg)
Overview
Introduction
Our problem
Why use a storage engine?
How to implement a read-only storage engine
Optimization
Goal: Provide a new tool that might help solve your issues.
![Page 3: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/3.jpg)
Who is NetQoS?
Commercial software vendor
Network Traffic Analysis
o Who is on the network?
o What applications are they using?
o Where is the traffic going?
o How is the network running?
o Can the users get their work done?
Built on MySQL
![Page 4: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/4.jpg)
Who Are Our Customers?
![Page 5: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/5.jpg)
Problem Domain
Collected, analyzed and reported on network data
Each collector received >100k records/second
Data was stored for the top IP addresses, applications, ToS for each interface
Data was stored at 15-minute resolution
Kept data for 6 weeks – 13 months
![Page 6: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/6.jpg)
What Did Customers Want?
Greater resolution
New ways to look at data
More detail
Use existing hardware
![Page 7: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/7.jpg)
Key Observations
Information was in optional or temporary files
Data is unchanging
Large data volumes (100s of GB/day)
Data collectors scattered over the enterprise
Expensive to pull data to a central analysis box
Most analysis focused on short timeframes
Small subset of the data was interesting
Hierarchical data
Flexible formats
![Page 8: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/8.jpg)
1st Approach – Custom Service
C++ service to query data
Create a result set to pull back to reporting console
Advantages
o Fast
o Leveraged existing software
Issues
o Not very flexible
o Only access through the console UI
![Page 9: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/9.jpg)
2nd Approach – Traditional DB
Insert data into database
Reporting console queries database
Advantages
o Easy
o Somewhat flexible
o Access from standard DB tools
Issues
o Hard to maintain insert/delete rates
o Database load operations tax CPU and I/O
o Not as flexible as desired
![Page 10: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/10.jpg)
3rd Approach – Storage Engine
Manage data outside the database
Create storage engine to retrieve data into MySQL
Advantages
o Fast
o Extremely flexible
o Only pay CPU and I/O overhead in queries
o Access from standard DB tools
Issues
o Learning curve
o Multiple moving parts
![Page 11: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/11.jpg)
What Does This Look Like?
[Diagram labels: Queries; MySQL; storage engines (MyISAM, InnoDB, Archive, …, Custom); Data Files; Data Collection and Management]
![Page 12: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/12.jpg)
Collector Provides
Collect data
Create data files
Age out old data
Indexing
Compression
Collector manages data
![Page 13: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/13.jpg)
MySQL Provides
Remote Connectivity
SSL Encryption
SQL Support
o Queries (select)
o Aggregations (group by)
o Sorting (order by)
o Integration with other data (join operations)
o Functions
o UDF Support
MySQL gives us a SQL stack
![Page 14: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/14.jpg)
Storage Engine Provides
Map data into MySQL
Provides optimization information on indexes
Efficient data extraction
Flatten data structure
Decompression
Storage engine provides the glue between collector and MySQL
![Page 15: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/15.jpg)
How To
A great document on writing storage engines:
http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine#Writing_a_Custom_Storage_Engine
I am going to concentrate on where our approach diverges from that document
![Page 16: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/16.jpg)
Overview of Our Approach
Singleton data storage
Storage engine maps to the data storage
Table schema is a view into storage
Table name for unique view
Column names map to data elements
Indices may be real or virtual
![Page 17: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/17.jpg)
Storage Management
External process creates/removes data
Storage engine indicates the data
During query, storage engine locks data range
![Page 18: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/18.jpg)
Simple Create Table Example
CREATE TABLE `testtable` (
`Router` int(10) unsigned,
`Timestamp` int(10) unsigned,
`Srcaddr` int(10) unsigned,
`Dstaddr` int(10) unsigned,
`Inpkts` int(10) unsigned,
`Inbytes` int(10) unsigned,
index `routerNDX`(`Router`)
) ENGINE=NFA;
![Page 19: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/19.jpg)
Behind the Scenes
MySQL creates a .frm file (defines table)
Storage engine validates the DDL
No data tables are created
No indices are created
Table create/delete is almost free
![Page 20: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/20.jpg)
Validation Example – Static Format
Restricted to specific table names
Each table name maps to a subset of data
Fixed set of columns for table name
Fixed index definitions
![Page 21: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/21.jpg)
Validation Example – Dynamic Format
Table name can be anything
Column names must match known definitions
o Physical Columns
o Virtual Columns
Indices may be real or artificial
o Realized Indices
o Virtual Indices
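The dynamic-format check can be sketched roughly as follows. This is only an illustration in Python (a real storage engine would do this in C++ while validating the DDL), and the column catalog is an assumption assembled from the column names that appear on other slides, not the engine's actual list.

```python
# Hypothetical catalog of column names the engine knows how to serve.
# Physical columns exist in the data files; virtual columns are derived.
PHYSICAL_COLUMNS = {"router", "timestamp", "srcaddr", "dstaddr",
                    "inpkts", "inbytes", "ipaddress", "ipmask"}
VIRTUAL_COLUMNS = {"ipmaskbits", "ipsubnet"}


def classify_columns(column_names):
    """Accept a table definition only if every column name is known.

    Returns a dict mapping each requested column to "physical" or
    "virtual"; raises ValueError listing any unknown names.
    """
    kinds = {}
    unknown = []
    for name in column_names:
        key = name.lower()
        if key in PHYSICAL_COLUMNS:
            kinds[name] = "physical"
        elif key in VIRTUAL_COLUMNS:
            kinds[name] = "virtual"
        else:
            unknown.append(name)
    if unknown:
        raise ValueError("unknown column(s): " + ", ".join(unknown))
    return kinds


# The table name is not checked at all: any name is acceptable.
print(classify_columns(["Router", "Timestamp", "ipSubnet"]))
```

Rejecting unknown names at CREATE TABLE time means a bad definition fails immediately rather than at query time.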
![Page 22: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/22.jpg)
Virtual Columns
Present the data in an alternate form or as derived values
Provides a shortcut instead of using functions
Examples:
o Actual columns
• ipAddress – IP address
• ipMask – subnet CIDR mask (0-32)
o Virtual columns
• ipMaskBits – bit pattern described by ipMask
• ipSubnet – ipAddress & ipMaskBits
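As a rough illustration of how the two virtual columns follow from the actual ones, here is a small Python sketch. The function names are hypothetical; only the column names and the `ipAddress & ipMaskBits` relationship come from the slide, and IPv4 (32-bit) addresses are assumed.

```python
import ipaddress


def ip_mask_bits(ip_mask):
    """Virtual column ipMaskBits: the 32-bit pattern for a CIDR mask (0-32)."""
    if not 0 <= ip_mask <= 32:
        raise ValueError("ipMask must be between 0 and 32")
    # Shift the all-ones pattern left, then trim back to 32 bits.
    return (0xFFFFFFFF << (32 - ip_mask)) & 0xFFFFFFFF


def ip_subnet(ip_address, ip_mask):
    """Virtual column ipSubnet: ipAddress & ipMaskBits."""
    return ip_address & ip_mask_bits(ip_mask)


addr = int(ipaddress.IPv4Address("10.1.2.3"))
print(hex(ip_mask_bits(24)))                        # 0xffffff00
print(ipaddress.IPv4Address(ip_subnet(addr, 24)))   # 10.1.2.0
```

Exposing these as columns lets a query filter on `ipSubnet` directly instead of repeating the bit arithmetic in every statement.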
![Page 23: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/23.jpg)
Optimizing Columns
Our storage engine supports many columns
Storage engines have to return the entire row defined in the table
MySQL uses only columns referenced in select statement
Table acts as view, so make view narrower
![Page 24: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/24.jpg)
Index Optimization Options
MySQL parser
o Define indices in schema
o Provide guidance to MySQL
Roll your own
o Limits table interoperability
o Best left to the experts
![Page 25: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/25.jpg)
Real Index
Expose the internal file format
Example:
o Data organized by timestamp, srcAddr
o Query: select router, count(*) from testtable
where srcAddr = inet_aton('10.1.2.3')
and timestamp > '2009-04-20'
group by router;
o Add index timestamp
o Storage engine walks data by timestamp
![Page 26: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/26.jpg)
Real Index #2
Example:
o Data organized by timestamp, srcAddr
o Query: select router, count(*) from testtable
where srcAddr = inet_aton('10.1.2.3')
and timestamp > '2009-04-20'
group by router;
o Add index timestamp, srcAddr
o Storage engine still walks data by timestamp
Database is unable to leverage the full index!
![Page 27: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/27.jpg)
Virtual Index
Not completely supported, but good enough
Example:
o Data organized by timestamp, srcAddr
o Query: select router, count(*) from testtable
where srcAddr = inet_aton('10.1.2.3')
and timestamp > '2009-04-20'
group by router;
o Add index srcAddr, timestamp
o Storage engine still walks data by timestamp, but filters on srcAddr
o Would fail on range scan of srcAddr
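The "walk by timestamp, filter on srcAddr" behaviour can be sketched in Python as below. This only illustrates the access pattern, not the engine's code: the record layout, the sample values, and the function name are all assumptions, and a real engine would read from its data files in C++.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical records as (timestamp, srcaddr, router), kept sorted by
# (timestamp, srcaddr) to mirror how the data is organized.
RECORDS = [
    (100, 1, 7),
    (100, 2, 7),
    (200, 1, 7),
    (200, 1, 8),
    (300, 1, 8),
    (300, 2, 9),
]


def count_by_router(records, after_timestamp, src_addr):
    """Count matching records per router for timestamp > after_timestamp."""
    # Seek using the real ordering: (after_timestamp, inf) sorts after every
    # record carrying that timestamp, so bisect_right lands on the first
    # record with a strictly later timestamp.
    start = bisect_right(records, (after_timestamp, float("inf")))
    counts = Counter()
    # Walk the remaining data in timestamp order; srcaddr is not a seekable
    # key here, so it is applied as a per-record filter.
    for _timestamp, srcaddr, router in records[start:]:
        if srcaddr == src_addr:
            counts[router] += 1
    return dict(counts)


print(count_by_router(RECORDS, 100, 1))  # {7: 1, 8: 2}
```

An equality test on srcAddr works as a cheap filter during the walk, which is why the virtual index is "good enough" here; a range condition on srcAddr gives the walk nothing to seek on.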
![Page 28: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/28.jpg)
Virtual Index #2
Example:
o Data organized by timestamp, srcAddr
o Query: select router, count(*) from testtable
where srcAddr = inet_aton('10.1.2.3')
and timestamp > '2009-04-20'
group by router;
o Add index timestamp
o Add index srcAddr
o Storage engine still walks data by timestamp, but filters on srcAddr
![Page 29: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/29.jpg)
Index Optimization
Leverage storage format
Add virtual index support where helpful
Don’t overanalyze
o Be accurate when accuracy is fast to compute
o Estimates are fine
o Heuristics are often great
o Be careful about mixing approaches
![Page 30: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/30.jpg)
Index Heuristics Example
Start with large estimate for number of rows returned
Adjust estimate based on expected value of column constraints
o Time – great – files are organized by time
o srcAddr
• Good for equality
• Terrible for range
o Bytes – poor
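A heuristic of this shape might look like the following Python sketch. Every number in it is an invented placeholder chosen only to reflect the slide's ranking (time is great, srcAddr equality is good, srcAddr range is terrible, bytes is poor); the starting estimate and the function name are likewise assumptions.

```python
# Hypothetical reduction factors: how much each kind of constraint is
# expected to shrink the row estimate. A factor of 1 means "no help".
REDUCTION = {
    ("timestamp", "eq"): 1000,   # great: files are organized by time
    ("timestamp", "range"): 100,
    ("srcaddr", "eq"): 100,      # good for equality
    ("srcaddr", "range"): 1,     # terrible for range
    ("inbytes", "eq"): 2,        # poor
    ("inbytes", "range"): 2,
}


def estimate_rows(constraints, start_estimate=1_000_000_000):
    """Start from a large estimate and shrink it per column constraint.

    constraints is a list of (column_name, "eq" or "range") pairs taken
    from the WHERE clause. Unknown combinations leave the estimate alone.
    """
    estimate = start_estimate
    for column, op in constraints:
        estimate //= REDUCTION.get((column.lower(), op), 1)
    return max(estimate, 1)  # never report zero rows


print(estimate_rows([("timestamp", "range"), ("srcaddr", "eq")]))  # 100000
print(estimate_rows([("srcaddr", "range")]))                       # 1000000000
```

The estimate only has to rank candidate indices sensibly for the optimizer, so integer division on rough factors is enough; no data is touched to produce it.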
![Page 31: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/31.jpg)
Typical Query Pattern
Create temp table
o Specify only necessary columns
o Specify optimal indices for where clause and engine
Select …
Drop table
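The create/select/drop cycle could be driven from client code along these lines. This Python sketch only builds the three statements as strings; the helper name, the table name `tmp_report`, and the index name `timeNDX` are hypothetical, the column type is copied from the earlier create-table example, and identifiers are not escaped, so it is not safe for untrusted input.

```python
def build_query_pattern(table, columns, indexes, select_sql):
    """Return [create, select, drop] statements for one throwaway table.

    columns: only the column names the query needs.
    indexes: dict of index name -> list of column names.
    """
    column_defs = ["`%s` int(10) unsigned" % name for name in columns]
    index_defs = []
    for index_name, index_columns in indexes.items():
        quoted = ", ".join("`%s`" % c for c in index_columns)
        index_defs.append("index `%s`(%s)" % (index_name, quoted))
    body = ",\n  ".join(column_defs + index_defs)
    create = "CREATE TABLE `%s` (\n  %s\n) ENGINE=NFA;" % (table, body)
    drop = "DROP TABLE `%s`;" % table
    return [create, select_sql, drop]


statements = build_query_pattern(
    "tmp_report",
    ["Router", "Timestamp", "Srcaddr"],
    {"timeNDX": ["Timestamp"]},
    "SELECT `Router`, COUNT(*) FROM `tmp_report` "
    "WHERE `Srcaddr` = INET_ATON('10.1.2.3') "
    "AND `Timestamp` > UNIX_TIMESTAMP('2009-04-20') "
    "GROUP BY `Router`;",
)
for statement in statements:
    print(statement)
```

Because creating and dropping a table touches no data or index files, defining a narrow, purpose-built table per query costs very little.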
![Page 32: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/32.jpg)
KISS
Only support what you have to:
o Do you need multiple datasets?
o Do you need flexible table definitions?
o Do you need insert/delete/alter support?
o How will data be accessed?
It’s OK to limit functionality to just solving your problem
![Page 33: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/33.jpg)
Conclusion
Why we used a storage engine
Storage engine pattern
Optimization
o Columns
o Indices
o Virtual indices
![Page 34: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/34.jpg)
Application
Log analysis
Transaction records
ETL alternative
Custom database
![Page 35: Inserts At Drive Speed Ben Haley Research Director NetQoS](https://reader036.vdocuments.mx/reader036/viewer/2022062308/56649d925503460f94a797d2/html5/thumbnails/35.jpg)
Questions