Large-Scale Data and Computation

Yifu Huang

School of Computer Science, Fudan University

COMP620003 Advanced Computer Networks Report, 2013

Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 1 / 27

Outline

1 Motivation

2 GFS

3 MapReduce

4 Bigtable

5 Discussion

Motivation

Why did we choose these papers?

Big data
Google

Why do we need GFS, MapReduce, and Bigtable?

GFS: a scalable distributed file system for large distributed data-intensive applications
MapReduce: a programming model for processing and generating large datasets that is amenable to a broad variety of real-world tasks
Bigtable: a distributed storage system for managing structured data that is designed to scale to a very large size

What can we learn from these papers?

The design and implementation ideas behind GFS, MapReduce, and Bigtable
How to get started with the open source equivalents: HDFS, Hadoop MapReduce (on YARN), and HBase

Ecosystem

GFS

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Google, Inc.

Design Overview

The system is built from many inexpensive commodity components that often fail
The system stores a modest number of large files
The workloads primarily consist of two kinds of reads: large streaming reads and small random reads
The workloads also have many large, sequential writes that append data to files
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file
High sustained bandwidth is more important than low latency

Design Overview (cont.)

Architecture

Design Overview (cont.)

Chunk Size: 64 MB
Metadata
  The file and chunk namespaces
  The mapping from files to chunks
  The locations of each chunk’s replicas
Chunk Locations
Operation Log
Consistency Model
  Guarantees by GFS
  Implications for Applications
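To make the fixed chunk size and the file-to-chunk mapping concrete, the sketch below splits a byte range of a file into per-chunk pieces, which is roughly the translation a client performs before asking the master for chunk locations. Only the 64 MB chunk size comes from the slide; the helper name `chunk_pieces` is hypothetical and is not part of any GFS API.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given on the slide


def chunk_pieces(offset, length):
    """Split a byte range into (chunk_index, offset_in_chunk, nbytes) pieces.

    Hypothetical helper for illustration only; not a real GFS client call.
    """
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        nbytes = min(CHUNK_SIZE - within, end - offset)
        pieces.append((index, within, nbytes))
        offset += nbytes
    return pieces


# A read that straddles the boundary between chunk 0 and chunk 1.
print(chunk_pieces(CHUNK_SIZE - 4, 8))
```

Because the chunk size is fixed, this translation needs no lookup; only the resulting chunk indexes have to be resolved to replica locations.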

System Interactions

Leases and Mutation Order

System Interactions (cont.)

Data Flow

We decouple the flow of data from the flow of control to use the network efficiently
Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data
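The GFS paper estimates the ideal time to push B bytes through a pipelined chain of R replicas as B/T + R*L, where T is the per-link throughput and L is the latency between two machines. The sketch below only evaluates that expression; the function name and the 1 ms hop latency in the example are illustrative assumptions, not measured values.

```python
def ideal_pipeline_seconds(num_bytes, replicas, throughput_bps, hop_latency_s):
    """Ideal elapsed time B/T + R*L for a pipelined push along a replica chain."""
    if throughput_bps <= 0:
        raise ValueError("throughput must be positive")
    transfer = (num_bytes * 8) / throughput_bps  # B/T, with B converted to bits
    return transfer + replicas * hop_latency_s   # plus one latency term per replica


# 1 MB to 3 replicas over 100 Mbps links, assuming 1 ms per hop (illustrative).
print(ideal_pipeline_seconds(1_000_000, 3, 100e6, 0.001))
```

The transfer term dominates, which is why forwarding data along a chain, rather than fanning it out from one sender, keeps each machine's outbound link fully used.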

Atomic Record Appends

GFS provides an atomic append operation called record append
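A toy, single-replica model may help show what "the system chooses the offset" means for record append. Following the paper, a record that would not fit in the current chunk causes the chunk to be padded and the record to go into the next chunk, and records are limited to one quarter of the chunk size. The class name `ToyAppendFile` is hypothetical, and the real protocol (primary replica, secondaries, client retry) is not modeled here.

```python
class ToyAppendFile:
    """Single-replica toy model of record append; not GFS's real protocol."""

    def __init__(self, chunk_size=64 * 1024 * 1024):
        self.chunk_size = chunk_size
        self.chunks = [bytearray()]

    def record_append(self, record):
        """Append `record` as one unit and return the offset the system chose."""
        if len(record) > self.chunk_size // 4:
            raise ValueError("record larger than one quarter of a chunk")
        last = self.chunks[-1]
        if len(last) + len(record) > self.chunk_size:
            # Pad the current chunk to full size and move on to a new one,
            # so a record never straddles a chunk boundary.
            last.extend(b"\0" * (self.chunk_size - len(last)))
            self.chunks.append(bytearray())
            last = self.chunks[-1]
        offset = (len(self.chunks) - 1) * self.chunk_size + len(last)
        last.extend(record)
        return offset


f = ToyAppendFile(chunk_size=16)  # tiny chunk size, for demonstration only
print([f.record_append(r) for r in (b"aaaa", b"bbbb", b"cccc", b"dd", b"eeee")])
```

The caller never supplies an offset; it only learns where the record landed, which is what lets many writers append to one file without coordinating among themselves.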

Snapshot

The snapshot operation makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions of ongoing mutations
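The paper implements snapshots with standard copy-on-write techniques. The in-memory sketch below shows the idea under simplifying assumptions: a snapshot duplicates only the file's chunk list and bumps reference counts, and a shared chunk is copied the first time it is written afterwards. All names (`ToyNamespace`, `snapshot`, `write`) are hypothetical, and lease revocation is not modeled.

```python
import itertools


class ToyNamespace:
    """Copy-on-write snapshot over in-memory 'chunks'; names are hypothetical."""

    def __init__(self):
        self._next = itertools.count()
        self.chunks = {}  # handle -> bytes
        self.refs = {}    # handle -> number of files referencing the chunk
        self.files = {}   # file name -> list of chunk handles

    def _new_chunk(self, data):
        handle = next(self._next)
        self.chunks[handle] = data
        self.refs[handle] = 1
        return handle

    def create(self, name, pieces):
        self.files[name] = [self._new_chunk(p) for p in pieces]

    def snapshot(self, src, dst):
        # Only metadata is duplicated; chunk data is shared until written.
        self.files[dst] = list(self.files[src])
        for handle in self.files[dst]:
            self.refs[handle] += 1

    def write(self, name, index, data):
        handle = self.files[name][index]
        if self.refs[handle] > 1:
            # Shared chunk: give this file its own copy before writing.
            self.refs[handle] -= 1
            handle = self._new_chunk(self.chunks[handle])
            self.files[name][index] = handle
        self.chunks[handle] = data

    def read(self, name):
        return [self.chunks[h] for h in self.files[name]]


ns = ToyNamespace()
ns.create("/a", [b"x", b"y"])
ns.snapshot("/a", "/a.snap")
ns.write("/a", 0, b"z")
print(ns.read("/a"), ns.read("/a.snap"))
```

The snapshot itself touches no chunk data, which is why it can appear almost instantaneous; the copying cost is deferred to later writes.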

Fault Tolerance And Diagnosis

High Availability
  Fast Recovery
  Chunk Replication
  Master Replication
Data Integrity
Diagnostic Tools
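For the data integrity point, the paper describes chunkservers splitting each chunk into 64 KB blocks, each with its own 32-bit checksum, and verifying them on reads. The sketch below illustrates that scheme; `zlib.crc32` is used only as a convenient 32-bit checksum and is an assumption, since the slide does not say which checksum function GFS uses. The function names are hypothetical.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB checksum blocks, as described in the GFS paper


def block_checksums(chunk):
    """Return one 32-bit checksum per 64 KB block of `chunk` (a bytes object)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupted_blocks(chunk, expected):
    """Return indexes of blocks whose checksum no longer matches `expected`."""
    actual = block_checksums(chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


data = bytes(3 * BLOCK_SIZE)          # three blocks of zero bytes
sums = block_checksums(data)
damaged = bytearray(data)
damaged[BLOCK_SIZE + 1] ^= 0xFF       # flip bits inside the second block
print(corrupted_blocks(bytes(damaged), sums))
```

Checksumming per block rather than per chunk means a read only has to verify the blocks it actually touches.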

MapReduce

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

Model

Map
  Takes an input pair and produces a set of intermediate key/value pairs
  The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function
Reduce
  Accepts an intermediate key and a set of values for that key and merges these values together to form a possibly smaller set of values
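The model above can be sketched with the word-count example the MapReduce paper itself uses. This is a single-process illustration, not the distributed library: `run_mapreduce` is a hypothetical driver that plays the role of the grouping step between map and reduce.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in the document."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: merge all counts for one word into a single total."""
    yield sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Hypothetical in-memory driver: map, group by intermediate key, reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in mapper(key, value):
            groups[ikey].append(ivalue)
    return {ikey: list(reducer(ikey, values))
            for ikey, values in sorted(groups.items())}


docs = [("doc1", "big data big tables"), ("doc2", "big clusters")]
print(run_mapreduce(docs, map_fn, reduce_fn))
```

The user writes only `map_fn` and `reduce_fn`; everything in the driver is what the library takes over, which in the real system also includes distribution and fault tolerance.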

Implementation

User Program
Master
Worker (Map/Reduce)
Fault tolerance
Locality
Task granularity
Backup tasks
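One implementation detail behind the worker and task-granularity items is how intermediate keys are assigned to the R reduce tasks. The paper gives `hash(key) mod R` as an example partitioning function. The sketch below follows that idea; using `zlib.crc32` as the hash is an assumption made so the result is stable across Python runs, and both function names are hypothetical.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Pick a reduce task for `key` using hash(key) mod R (crc32 is assumed)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def split_for_reducers(pairs, num_reduce_tasks):
    """Distribute intermediate (key, value) pairs into R buckets."""
    buckets = [[] for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[partition(key, num_reduce_tasks)].append((key, value))
    return buckets


pairs = [("big", 1), ("data", 1), ("big", 1), ("tables", 1)]
for task_id, bucket in enumerate(split_for_reducers(pairs, 3)):
    print(task_id, bucket)
```

Since the partition depends only on the key, every value for a given key reaches the same reduce task, whichever map task produced it.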

Performance

Grep

Performance (cont.)

Sort

Bigtable

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber

Google, Inc.

Model

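Rows
  Atomic: reads and writes under a single row key are atomic
  Tablets: row ranges are partitioned into tablets
Column Families
  Column keys are named family:qualifier
  Access control is applied per column family
Timestamps
  Versions are stored in decreasing timestamp order
  Old versions are garbage-collected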

Cell

(row : string, column : string, time : int64) -> string
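The cell mapping above can be sketched as a small in-memory structure. The class below keeps versions of each cell in decreasing timestamp order and garbage-collects all but the newest few, which is one of the per-column-family settings the paper describes. `ToyTable` and its `max_versions` parameter are hypothetical names for illustration, not Bigtable's API.

```python
class ToyTable:
    """In-memory sketch of (row, column, time) -> value; not Bigtable's API."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self.cells = {}  # (row, column) -> [(timestamp, value), ...], newest first

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # decreasing timestamp
        del versions[self.max_versions:]                   # drop the oldest versions

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one at or before `timestamp`."""
        for ts, value in self.cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = ToyTable(max_versions=2)
for ts, html in [(3, "v3"), (5, "v5"), (6, "v6")]:
    table.put("com.cnn.www", "contents:", ts, html)
print(table.get("com.cnn.www", "contents:"))
print(table.get("com.cnn.www", "contents:", timestamp=5))
```

Storing versions newest first means the common case, reading the latest value, stops at the first entry.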

Model (cont.)

Storage

Region: Store (MemStore, StoreFile)
File: Blocks
Log: Write-ahead log
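The storage terms on this slide follow HBase naming. As a rough sketch of how the pieces relate, the toy store below records every write in a write-ahead log, keeps recent writes in an in-memory MemStore, and flushes the MemStore to an immutable sorted StoreFile once it reaches a threshold. The class name, the `flush_threshold` parameter, and the flush policy are assumptions for illustration; log truncation, blocks, and compaction are not modeled.

```python
class ToyStore:
    """Toy write path: write-ahead log, then MemStore, then sorted StoreFiles."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold
        self.wal = []         # write-ahead log: every mutation is recorded first
        self.memstore = {}    # recent writes, held in memory
        self.storefiles = []  # immutable sorted files, oldest first

    def put(self, key, value):
        self.wal.append((key, value))
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted, immutable StoreFile."""
        if self.memstore:
            self.storefiles.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for storefile in reversed(self.storefiles):  # the newest file wins
            for k, v in storefile:
                if k == key:
                    return v
        return None


store = ToyStore(flush_threshold=2)
store.put("a", 1)
store.put("b", 2)  # reaches the threshold and triggers a flush
store.put("a", 3)  # newer value for "a" stays in the MemStore
print(store.get("a"), store.get("b"), len(store.storefiles))
```

Reads have to consult the MemStore and then the files from newest to oldest, because a key may have a stale value in an older file.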

Implementation

Tablet Location

Implementation (cont.)

Tablet Serving

Evaluation

Benchmarks

Evaluation (cont.)

Scaling

Evaluation (cont.)

Applications

Discussion

Contributions of GFS, MapReduce, Bigtable

Support large-scale data processing workloads on commodity hardware
Easy to use: hide the details of parallelization, fault tolerance, locality optimization, and load balancing
Scale well in terms of both data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving)

Drawbacks of GFS, MapReduce, Bigtable

The single master may still be a potential bottleneck
Disk I/O is time consuming
Many database operations are not supported (multi-row transactions, secondary indexes, ...)

Appendix

References I

[1] The Google File System. SOSP. 2003.

[2] MapReduce: Simplified Data Processing on Large Clusters. OSDI. 2004.

[3] Bigtable: A Distributed Storage System for Structured Data. OSDI. 2006.

[4] The Hadoop Distributed File System. MSST. 2010.

[5] Hadoop: the definitive guide. 2012.

[6] HBase: the definitive guide. 2011.

[7] Data-intensive text processing with MapReduce. 2010.

[8] The Chubby lock service for loosely-coupled distributed systems. OSDI. 2006.
