big data - in-memory index / sub second query engine - roxie - hpcc systems
DESCRIPTION
Roxie, the best-kept Big Data secret for high performance. Leverage Roxie's multi-threaded processing and use tools such as in-memory indexes, in-memory data, SSDs and more to do sub-second querying.
TRANSCRIPT
![Page 1: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/1.jpg)
HPCC Systems - Big Data
Roxie - In-Memory Data & Index, Sub-Second Query Cluster
By Fujio Turner
@FujioTurner
![Page 2: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/2.jpg)
Who is LexisNexis?
LexisNexis is a provider of legal, tax, regulatory, news and business information and analysis to the legal, corporate, government, accounting and academic markets.
LexisNexis has been in business since 1977 and has over 30,000 employees worldwide.
What is HPCC Systems?
LexisNexis Risk is the division of LexisNexis that focuses on data, Big Data processing, linking and vertical expertise. It supports HPCC Systems as an open source project under the Apache 2.0 License.
![Page 3: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/3.jpg)
Comparison
| Java | C++ |
| --- | --- |
| Petabytes | Exabytes |
| 1-80,000 Jobs/day | Non-Indexed: 4X-13X; Indexed: 2K-3K Jobs/sec |
| Since 2005 | Since 2000 |
| ? ? ? ? ? ? | Thor, Roxie |
| Block Based | File Based |
![Page 4: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/4.jpg)
Non-Indexed Full Data Set (benchmark chart; labels: Business, Development, Customers)
http://hpccsystems.com/why-hpcc/benchmarks
![Page 5: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/5.jpg)
How do I Query HPCC Systems?
ECL (Enterprise Control Language) is a C++ based query language for use with the HPCC Systems Big Data platform. ECL's syntax and format are simple and easy to learn.
Note - ECL is very similar to Hadoop's Pig, but more expressive and feature rich.
![Page 6: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/6.jpg)
Map/Reduce
SQL w/ JOINS
GraphDB
Machine Learning
Simple to Complex Queries
ECL (Enterprise Control Language) C++ based query language
![Page 7: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/7.jpg)
Cluster Architecture
"I'm sub-second fast."
"I can query all or part of your data."
Thor: Hard Disk; Index (optional)
Roxie (Rapid Online XML Inquiry Engine): Hard Disk, SSD, or Either/Both; Index (optional), In-memory Index
![Page 8: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/8.jpg)
Roxie
Date: 2004
Query Languages: ECL
In-Memory: Index w/ Part or All Data, or Index Only
Data Type: Normalized, DeNormalized or Unstructured
Query Methods: REST, SOAP or Direct TCP
![Page 9: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/9.jpg)
Example 1
Thor: File -> Load Into Thor -> Query
VS
Roxie: File -> Load Into Roxie -> Index -> Publish -> Query
![Page 10: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/10.jpg)
HPCC Systems Sample Data for Example 1
http://hpccsystems.com/download/docs/learning-ecl
![Page 11: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/11.jpg)
Administrator Web GUI on Port 8010
IP / URL of HPCC install
![Page 12: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/12.jpg)
Load Data
1. Upload file*
2. Distribute to cluster
3. Name of file in cluster
4. Size of each row
5. Push to cluster
*2GB file size limit through the web. No limit if uploaded via SOAP.
![Page 13: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/13.jpg)
Loaded In Thor Cluster
![Page 14: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/14.jpg)
Query Example 1
Schema
Layout_People := RECORD
  STRING15 FirstName;
  STRING25 LastName;
  STRING15 MiddleName;
  STRING5 Zip;
  STRING42 Street;
  STRING20 City;
  STRING2 State;
END;
Data (File Location, like "USE DATABASE;" and "FROM Table")
allPeople := DATASET('~test::originalperson', Layout_People, THOR);
Query (like WHERE `LastName` = 'Smith')
smith := allPeople(LastName = 'Smith');
Output (like "SELECT * ....")
smith; //Output
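Step 4 of Load Data asks for the size of each row. As a rough sketch, assuming each `STRINGn` field in `Layout_People` occupies exactly n bytes (fixed-width, single-byte characters), the row size is the sum of the field widths. The helper names below are hypothetical and exist only for illustration.

```python
# Field widths copied from the Layout_People record shown in the slide.
# Assumption: each STRINGn field occupies exactly n bytes in a fixed-width row.
LAYOUT_PEOPLE = [
    ("FirstName", 15),
    ("LastName", 25),
    ("MiddleName", 15),
    ("Zip", 5),
    ("Street", 42),
    ("City", 20),
    ("State", 2),
]


def record_size(layout):
    """Sum the byte widths of all fixed-width fields in a layout."""
    return sum(width for _name, width in layout)


def field_offsets(layout):
    """Return {field name: (start, end)} byte offsets within one row."""
    offsets = {}
    start = 0
    for name, width in layout:
        offsets[name] = (start, start + width)
        start += width
    return offsets


ROW_SIZE = record_size(LAYOUT_PEOPLE)  # 15+25+15+5+42+20+2 = 124
```

Under that assumption the value to enter for "Size of each row" would be 124.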
![Page 15: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/15.jpg)
Copy Data From Thor to Roxie
![Page 16: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/16.jpg)
1. Select Thor File(s)
2. Copy Tab
3. Pick Roxie cluster
4. Click Copy
![Page 17: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/17.jpg)
Indexing In-Memory
Make Index
Ex. Creating an index by "LastName"
allPeople := DATASET('~test::originalperson_copy', {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}}, FLAT);
"Alter Table" (new column): RecPtr holds the File Position Number, a pseudo recordID
rx := INDEX(allPeople, {LastName, RecPtr}, '~test::key_person_copy', PRELOAD);
Index Filename: '~test::key_person_copy'; PRELOAD: In-Memory
BUILDINDEX(rx);
![Page 18: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/18.jpg)
Indexing In-Memory with Luggage
Index Only
rx := INDEX(allPeople, {LastName, RecPtr}, '~test::key_person_copy', PRELOAD);
Index + Part or All Data (+ Store Data In-Memory with Index)
rx := INDEX(allPeople, {FirstName, MiddleName}, {LastName, RecPtr}, '~test::key_person_copy', PRELOAD);
![Page 19: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/19.jpg)
Query w/ Index
Data
allPeople := DATASET('~test::originalperson_copy', {Layout_People, UNSIGNED8 RecPtr {virtual(fileposition)}}, FLAT);
datax := INDEX(allPeople, {LastName, RecPtr}, '~test::key_person_copy');
Query (WHERE `LastName` = 'Smith' from Index)
filterdata := FETCH(allPeople, datax(LastName = 'Smith'), RIGHT.RecPtr);
filterdata; //Output
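The index-then-fetch pattern on this slide can be illustrated outside ECL. The sketch below is not how Roxie is implemented; it is a minimal Python model under stated assumptions: a hypothetical 16-byte fixed-width row (8 bytes of last name, 8 bytes of first name), an index that is a sorted list of (LastName, RecPtr) pairs, and a RecPtr that is the row's byte offset in the base file. All function names are made up for the example.

```python
import bisect

KEY_WIDTH = 8   # hypothetical width of the LastName field
ROW_SIZE = 16   # hypothetical row: 8-byte LastName + 8-byte FirstName


def make_file(people):
    """Pack (last, first) pairs into one bytes object of fixed-width rows."""
    return b"".join(
        (last.ljust(KEY_WIDTH) + first.ljust(ROW_SIZE - KEY_WIDTH)).encode("ascii")
        for last, first in people
    )


def build_index(data):
    """Return sorted (LastName, RecPtr) pairs; RecPtr is the row's byte offset."""
    entries = [
        (data[pos:pos + KEY_WIDTH].decode("ascii").rstrip(), pos)
        for pos in range(0, len(data), ROW_SIZE)
    ]
    return sorted(entries)


def lookup(index, last_name):
    """Binary-search the sorted index and return every RecPtr for last_name."""
    lo = bisect.bisect_left(index, (last_name, -1))
    hi = bisect.bisect_right(index, (last_name, float("inf")))
    return [ptr for _key, ptr in index[lo:hi]]


def fetch(data, rec_ptrs):
    """Read the full rows from the base file at the given byte offsets."""
    rows = []
    for ptr in rec_ptrs:
        row = data[ptr:ptr + ROW_SIZE].decode("ascii")
        rows.append((row[:KEY_WIDTH].rstrip(), row[KEY_WIDTH:].rstrip()))
    return rows
```

A call such as `fetch(data, lookup(index, "Smith"))` mirrors the shape of the `FETCH(allPeople, datax(LastName = 'Smith'), RIGHT.RecPtr)` line: the filter runs against the small sorted index, and only the matching rows are read from the base file.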
![Page 20: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/20.jpg)
“I’m sub-second fast.” Publish Your Code
What is publishing your code?
ECL is built on C++, so your ECL needs to be compiled before it runs. When you publish a query, you pre-compile your ECL and send it to the ESP server, where it is stored. ESP, on port 8002, listens for requests and executes the published query.
Can I do ad-hoc queries without publishing?
Yes, but they will not be sub-second fast, because they are not pre-compiled.
![Page 21: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/21.jpg)
How Querying ESP Works
![Page 22: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/22.jpg)
How to Publish Your ECL
1. Select your ad-hoc "lastname" query from before
2. Name your query "lastname"
3. Publish
![Page 23: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/23.jpg)
ESP (Enterprise Service Platform) on port 8002
1. Select "lastname" query
2. Select your data output format "Output JSON"
3. Click Submit button
![Page 24: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/24.jpg)
JSON Format
1. Send Request = Query, or hit this URL
![Page 25: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/25.jpg)
JSON Format
Results - in less than a second
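The URL itself is only visible in the slide image, so it is not reproduced here. As a hedged sketch of what "hit this URL" could look like from a client, the helper below assembles a request URL for the published "lastname" query on ESP port 8002. The `/WsEcl/submit/query/<target>/<query>/json` path layout, the `roxie` target name and the `lastname` parameter are all assumptions for illustration and should be checked against your own HPCC install.

```python
from urllib.parse import quote, urlencode


def build_query_url(host, query_name, params=None, target="roxie", port=8002):
    """Build a URL for a published query that returns JSON.

    Assumption: the path layout /WsEcl/submit/query/<target>/<query>/json.
    The target name and any parameter names are hypothetical.
    """
    path = "/WsEcl/submit/query/{}/{}/json".format(quote(target), quote(query_name))
    url = "http://{}:{}{}".format(host, port, path)
    if params:
        url += "?" + urlencode(params)
    return url


# Example: the published "lastname" query on a local install (hypothetical host).
EXAMPLE_URL = build_query_url("127.0.0.1", "lastname", {"lastname": "Smith"})
```

Any HTTP client can then request that URL; the sketch deliberately stops at building the string so that it makes no claims about the response format.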
![Page 26: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/26.jpg)
SuperFile: Organizing Your Files
Logical File: 2013-06 Twitter (use this file name to get all the real files: 2013-06-06, -07, -08)
Real Files: 2013-06-06 Twitter, 2013-06-07 Twitter, 2013-06-08 Twitter
+ Append new real files on the fly
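To make the SuperFile idea concrete, here is a small Python model, not HPCC code: one logical name that maps to an ordered list of real files, where reading the logical name yields the rows of every real file and new real files can be appended at any time. The class and method names are hypothetical.

```python
class SuperFile:
    """Toy model of a superfile: one logical name over many real files."""

    def __init__(self, name):
        self.name = name
        self.subfiles = []  # ordered list of (real file name, rows)

    def add(self, subfile_name, rows):
        """Append a new real file on the fly."""
        self.subfiles.append((subfile_name, list(rows)))

    def names(self):
        """Return the names of the real files behind the logical name."""
        return [subfile_name for subfile_name, _rows in self.subfiles]

    def read(self):
        """Yield every row from every real file, in the order they were added."""
        for _subfile_name, rows in self.subfiles:
            yield from rows


# Usage: daily files grouped under one monthly logical name.
june = SuperFile("2013-06 Twitter")
june.add("2013-06-06 Twitter", ["tweet a"])
june.add("2013-06-07 Twitter", ["tweet b", "tweet c"])
```

A query written against the logical name does not need to change when another day's file is added.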
![Page 27: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/27.jpg)
SuperKeys: Organizing Your Indexes
SuperKey: 2013 Twitter
Keys: 2013-06-06 Twitter, 2013-06-07 Twitter, 2013-06-08 Twitter
No Sub-Super Files or Keys in Roxie
![Page 28: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/28.jpg)
For More HPCC "How To's" Go to SlideShare
http://www.slideshare.net/FujioTurner/
![Page 29: Big Data - In-Memory Index / Sub Second Query engine - Roxie - HPCC Systems](https://reader035.vdocuments.mx/reader035/viewer/2022081414/54b718a04a7959286f8b468e/html5/thumbnails/29.jpg)
Watch how to install HPCC Systems in 5 Minutes
http://www.youtube.com/watch?v=8SV43DCUqJg
Download HPCC Systems Open Source
Community Edition: http://hpccsystems.com/download/
or
Source Code: https://github.com/hpcc-systems