A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud”
![Page 1: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/1.jpg)
A discussion on “Distributed Indexing of Web Scale Datasets for the Cloud”[1]
Speakers: Vasileios Komianos,
Georgios Tsoumanis,
Eleni Moustaka
Supervisor: Spyridon Sioutas
Ionian University, Dept. of Informatics, Postgraduate
For the course: Advanced Topics in Database Systems
![Page 2: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/2.jpg)
This presentation focuses on a distributed architecture for indexing large datasets, referred to from now on as the System. Hadoop, MapReduce, HBase and NoSQL databases are recurring terms here, as they are the key technologies that make such tasks possible.
![Page 3: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/3.jpg)
Why Cloud?
• Cost
• Device and Location Independence
• Virtualization
• Performance
• Scalability
• Infrastructure as a Service
• Platform as a Service
• Software as a Service
![Page 4: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/4.jpg)
Why Web Scale?
• Wikipedia
• Amazon
• Internet Archive
![Page 5: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/5.jpg)
Why Distributed?
• Huge volumes of data
• Computational problems
• Failure tolerance
• Scalability
![Page 6: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/6.jpg)
What Hadoop[2] is
Hadoop is an open-source Java framework for distributed processing of large data sets. It builds on a distributed file system called HDFS[3] and on the MapReduce[4] programming model.
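To make the MapReduce model concrete, here is a minimal single-process sketch of its three stages (map, shuffle, reduce) building a tiny inverted index. It does not use Hadoop; the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job`, and the sample documents, are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct word of a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(term, doc_ids):
    """Reduce: merge all values of one key into a sorted posting list."""
    return term, sorted(doc_ids)


def run_job(documents):
    """Run map, shuffle and reduce in sequence inside a single process."""
    pairs = [pair
             for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text)]
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical input: document id -> text.
docs = {"d1": "hadoop stores data", "d2": "hbase stores tables"}
index = run_job(docs)
print(index["stores"])
```

In a real cluster the map and reduce calls would run in parallel on different nodes, and the shuffle would move data over the network; the sketch only shows the data flow.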
![Page 7: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/7.jpg)
Hadoop Architecture
• HDFS: NameNode, DataNodes
• MapReduce: JobTracker, TaskTrackers
Usually the machine running the NameNode also runs the JobTracker, and the DataNodes also act as TaskTrackers.
![Page 8: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/8.jpg)
What HBase[5] is
HBase is an open-source distributed data store belonging to the NoSQL family of databases. It can store large data sets that are structured, semi-structured or unstructured, and it offers rapid query execution.
![Page 9: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/9.jpg)
HBase Architecture
• HMaster
• Region Servers
HBase runs on top of Hadoop and is modelled after Google’s Bigtable[6]. ACID* guarantees are sacrificed to improve performance and scalability.
*ACID: Atomicity, Consistency, Isolation and Durability
![Page 10: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/10.jpg)
HBase characteristics
• NoSQL
• Schema free
• Very large tables
• Scalable
• Sharding
• JSON enabled
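The sharding item can be illustrated with a small sketch. HBase splits a table into regions by contiguous row-key ranges and assigns each region to a region server. The class `RegionMap`, the split keys and the server names below are hypothetical and only mimic that lookup; this is not the HBase client API.

```python
from bisect import bisect_right


class RegionMap:
    """Toy lookup from row key to the server holding its key range."""

    def __init__(self, split_keys, servers):
        # n split keys define n + 1 contiguous row-key ranges (regions).
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = sorted(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # A split key is the inclusive start of the region to its right.
        return self.servers[bisect_right(self.split_keys, row_key)]


# Hypothetical layout: three regions split at "g" and "p".
regions = RegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
for key in ("apple", "hbase", "zebra"):
    print(key, "->", regions.locate(key))
```

Because rows are kept sorted by key, a range of adjacent keys usually lands in one region, while unrelated keys spread across servers.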
![Page 11: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/11.jpg)
NoSQL Paradigm: MongoDB[7]
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({id: "123", name: "Vasileios"})
> db.test.insert({Presentation: "NoSQL databases"})
> db.test.find()
{ "_id" : ObjectId("4fbac827f119ef630e74638d"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac835f119ef630e74638e"), "id" : "123", "name" : "Vasileios" }
{ "_id" : ObjectId("4fbac85df119ef630e74638f"), "Presentation" : "NoSQL databases" }
>
MongoDB is an easy-to-use NoSQL database. It is free and supported by a large community, which makes it a suitable starting point for anyone without previous NoSQL experience.
Key traits shown above: NoSQL, JSON, schema free.
![Page 12: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/12.jpg)
System Architecture
• Datasets → Uploader (MapReduce task) → Content table
• Content table → Indexer (MapReduce task) → Index table
• Client API: Search, Get
Cluster: 1 master and 11 worker nodes, running 66 Mappers and 22 Reducers.
Dataset: 23GB of structured data, 300GB of semi-structured data and 20GB of unstructured data.
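The flow from raw datasets to a content table, then to an index table that a client can search, can be sketched in a few lines. This is a single-process illustration under stated assumptions, not the System's code: plain dictionaries stand in for the HBase tables, and the names `upload`, `build_index`, `Client`, the row-id scheme and the index key layout `(attribute, value)` are all hypothetical.

```python
from collections import defaultdict


def upload(records):
    """Uploader step: store each record in the content table under a row id."""
    return {f"row{i}": record for i, record in enumerate(records)}


def build_index(content_table, indexed_attributes):
    """Indexer step: map (attribute, value) to the row ids containing it."""
    index_table = defaultdict(set)
    for row_id, record in content_table.items():
        for attr in indexed_attributes:  # the index rule
            if attr in record:
                index_table[(attr, record[attr])].add(row_id)
    return index_table


class Client:
    """Toy client offering the two operations named on the slide."""

    def __init__(self, content_table, index_table):
        self.content_table = content_table
        self.index_table = index_table

    def search(self, attr, value):
        # Search consults only the index table and returns row ids.
        return sorted(self.index_table.get((attr, value), ()))

    def get(self, row_id):
        # Get fetches the full record from the content table.
        return self.content_table[row_id]


# Hypothetical records.
records = [
    {"title": "Hadoop", "lang": "en"},
    {"title": "HBase", "lang": "en"},
    {"title": "Hadoop", "lang": "el"},
]
content = upload(records)
index = build_index(content, indexed_attributes=["title"])
client = Client(content, index)
print(client.search("title", "Hadoop"))
print(client.get("row1"))
```

Only attributes named in the index rule become searchable: here a search on `lang` returns nothing because `lang` was not indexed.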
![Page 13: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/13.jpg)
The experiment
The purpose was to test the System’s performance in various conditions such as:
• several dataset sizes,
• different dataset types,
• varying number of nodes,
• different index rules.
![Page 14: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/14.jpg)
![Page 15: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/15.jpg)
![Page 16: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/16.jpg)
![Page 17: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/17.jpg)
Index creation time
The TXT dataset demands the most processing when indexed.
![Page 18: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/18.jpg)
![Page 19: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/19.jpg)
[Figure: 5GB HTML dataset index creation time for different index rules. X axis: iteration number (1: 7 indexed tags, 2: 14, 3: 19, 4: 27). Y axis: time (min), scale 0 to 12.]
![Page 20: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/20.jpg)
[Figure: 5GB HTML index size for different index rules. X axis: iteration number (1: 7 tags indexed (table, li, p, b, i, u, title), 2: 14 tags, 3: 19, 4: 27). Y axis: index size (GB), scale 0 to 1.4.]
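An index rule here is a set of HTML tags whose text gets indexed. The following sketch, which uses Python's standard `html.parser` and is not the System's indexer, shows how the number of index entries depends on the rule. The class `TagIndexer`, the helper `count_entries` and the sample page are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagIndexer(HTMLParser):
    """Collect (tag, word) entries for text inside indexed tags."""

    def __init__(self, indexed_tags):
        super().__init__()
        self.indexed_tags = set(indexed_tags)
        self.open_tags = []   # stack of currently open tags
        self.entries = []     # emitted (tag, word) index entries

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to the most recent matching open tag.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        # Attribute the text to the innermost enclosing tag that is indexed.
        for tag in reversed(self.open_tags):
            if tag in self.indexed_tags:
                self.entries.extend((tag, w.lower()) for w in data.split())
                break


def count_entries(html, indexed_tags):
    parser = TagIndexer(indexed_tags)
    parser.feed(html)
    parser.close()
    return len(parser.entries)


# Hypothetical page and two hypothetical rules.
page = ("<html><head><title>Cloud Index</title></head><body>"
        "<p>web scale <b>data</b></p><h1>Intro</h1></body></html>")
small_rule = {"title", "p"}
large_rule = {"title", "p", "b", "h1"}
print(count_entries(page, small_rule), count_entries(page, large_rule))
```

In this sketch, adding tags to a rule can only add entries or relabel existing ones, never remove them, so a larger rule never yields a smaller index.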
![Page 21: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/21.jpg)
System performance under query load
• Client instances ran concurrently on 14 machines, sending queries to the System.
• Types of queries: exact on a specific attribute, exact on any attribute, range on any attribute.
• Range query loads above 140 queries/sec failed.
• Tests were run with a load of 14 queries/sec.
Response time per request:
• Exact queries on a specific attribute: 20 ms
• Exact queries on any attribute: 150 ms
• Range queries on any attribute: 27 s
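The three query types can be contrasted with a small in-memory sketch. It is not how the System stores or scans its index; the class `ToyIndex`, its method names and the sample rows are assumptions, and all values are assumed to be strings so that they sort consistently.

```python
from bisect import bisect_left, bisect_right


class ToyIndex:
    """In-memory index supporting the three query types from the slide."""

    def __init__(self, rows):
        self.by_attr = {}   # (attribute, value) -> set of row ids
        by_value = []       # (value, row id) pairs, sorted below
        for row_id, record in rows.items():
            for attr, value in record.items():
                self.by_attr.setdefault((attr, value), set()).add(row_id)
                by_value.append((value, row_id))
        by_value.sort()
        self.values = [value for value, _ in by_value]
        self.row_ids = [row_id for _, row_id in by_value]

    def exact_specific(self, attr, value):
        """Exact match on one named attribute: a single key lookup."""
        return sorted(self.by_attr.get((attr, value), ()))

    def exact_any(self, value):
        """Exact match on any attribute: all entries equal to the value."""
        lo = bisect_left(self.values, value)
        hi = bisect_right(self.values, value)
        return sorted(set(self.row_ids[lo:hi]))

    def range_any(self, low, high):
        """Range match on any attribute: all entries with low <= value <= high."""
        lo = bisect_left(self.values, low)
        hi = bisect_right(self.values, high)
        return sorted(set(self.row_ids[lo:hi]))


# Hypothetical rows.
rows = {
    "r1": {"title": "hadoop", "author": "dean"},
    "r2": {"title": "hbase", "topic": "hadoop"},
    "r3": {"title": "mongodb", "author": "chang"},
}
idx = ToyIndex(rows)
print(idx.exact_specific("title", "hadoop"))
print(idx.exact_any("hadoop"))
print(idx.range_any("h", "i"))
```

A range query touches every entry between its bounds, while an exact query on a named attribute touches a single key. That difference in work is one plausible reason, though not one stated on the slide, for the gap between the reported response times.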
![Page 22: A discuss on “Distributed Indexing of Web Scale Datasets for the Cloud](https://reader033.vdocuments.mx/reader033/viewer/2022051608/53ff995c8d7f724c088b46c0/html5/thumbnails/22.jpg)
References
[1] Ioannis Konstantinou, Evangelos Angelou, Dimitrios Tsoumakos and Nectarios Koziris: Distributed Indexing of Web Scale Datasets for the Cloud. In MDAC ’10, April 26, 2010 Raleigh, NC, USA.
[2] http://hadoop.apache.org/
[3] K. V. Shvachko: HDFS Scalability: The limits to growth. The USENIX Magazine, vol. 35, no. 2, 2010.
[4] Jeffrey Dean and Sanjay Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (January 2008), 107-113.
[5] Ankur Khetrapal, Vinay Ganesh: HBase and Hypertable for large scale distributed storage systems, Dept. of Computer Science, Purdue University
[6] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. 2008. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst. 26, 2, Article 4 (June 2008), 26 pages.
[7] http://www.mongodb.org