Large-Scale Data and Computation

Yifu Huang

School of Computer Science, Fudan University
huangyifu@fudan.edu.cn

COMP620003 Advanced Computer Networks Report, 2013

Yifu Huang (FDU CS) COMP620003 Report 2013/12/5 1 / 27

Outline

1 Motivation

2 GFS

3 MapReduce

4 Bigtable

5 Discussion

Motivation

Motivation

Why do we choose these papers?

Big data; Google

Why do we need GFS, MapReduce, and Bigtable?

GFS: a scalable distributed file system for large distributed data-intensive applications
MapReduce: a programming model for processing and generating large data sets that is amenable to a broad variety of real-world tasks
Bigtable: a distributed storage system for managing structured data that is designed to scale to a very large size

What can we learn from these papers?

The design and implementation ideas behind GFS, MapReduce, and Bigtable
Getting ready to enjoy their open-source equivalents: HDFS, YARN, and HBase

Motivation

Ecosystem

GFS

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Google, Inc.

GFS

Design Overview

The system is built from many inexpensive commodity components that often fail
The system stores a modest number of large files
The workloads primarily consist of two kinds of reads: large streaming reads and small random reads
The workloads also have many large, sequential writes that append data to files
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file
High sustained bandwidth is more important than low latency

GFS

Design Overview (cont.)

Architecture

GFS

Design Overview (cont.)

Chunk Size: 64 MB
Metadata

The file and chunk namespaces
The mapping from files to chunks
The locations of each chunk’s replicas

Chunk Locations
Operation Log
Consistency Model

Guarantees by GFS
Implications for Applications
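One consequence of the fixed 64 MB chunk size: a client can translate any file byte offset into a chunk index locally, and only then ask the master for that chunk's handle and replica locations. A minimal sketch of that translation (illustrative, not actual GFS client code):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size used by GFS

def locate(byte_offset: int) -> tuple[int, int]:
    """Translate a file byte offset into (chunk index, offset within chunk).

    The client sends the chunk index to the master; the master replies with
    the chunk handle and replica locations, and the data itself is read from
    a chunkserver, never from the master.
    """
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE
```

For example, a read at byte 200,000,000 falls in chunk index 2, since two full chunks cover only the first 134,217,728 bytes.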

GFS

System Interactions

Leases and Mutation Order

GFS

System Interactions (cont.)

Data Flow

We decouple the flow of data from the flow of control to use the network efficiently
Our goals are to fully utilize each machine’s network bandwidth, avoid network bottlenecks and high-latency links, and minimize the latency to push through all the data

Atomic Record Appends

GFS provides an atomic append operation called record append

Snapshot

The snapshot operation makes a copy of a file or a directory tree almost instantaneously, while minimizing any interruptions of ongoing mutations
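The key property of record append is that the client supplies only the data; the primary replica chooses the offset and returns it, so concurrent appenders never overwrite one another. A toy single-replica sketch of that contract (names are illustrative, not the real GFS interface):

```python
class ChunkSketch:
    """Toy model of one chunk's primary replica handling record appends."""

    def __init__(self) -> None:
        self.data = bytearray()

    def record_append(self, record: bytes) -> int:
        # The primary, not the client, picks the write offset: it appends at
        # the current end of the chunk and returns that offset to the client.
        offset = len(self.data)
        self.data += record
        return offset

chunk = ChunkSketch()
a = chunk.record_append(b"client-1 record")
b = chunk.record_append(b"client-2 record")
# The two appends land at distinct, non-overlapping offsets.
```

In real GFS the primary also forwards the append to the secondary replicas and may pad a chunk and retry on the next one, which is why records are delivered at least once rather than exactly once.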

GFS

Fault Tolerance And Diagnosis

High Availability

Fast Recovery
Chunk Replication
Master Replication

Data Integrity
Diagnostic Tools

MapReduce

MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat

Google, Inc.

MapReduce

Model

Map

Takes an input pair and produces a set of intermediate key/value pairs
The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to the reduce function

Reduce

Accepts an intermediate key and a set of values for that key and merges these values together to form a possibly smaller set of values
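The canonical example of this model is word count: map emits a (word, 1) pair per occurrence, and reduce sums the counts for each word. A self-contained sketch, with the library's grouping ("shuffle") step done by a plain dictionary:

```python
from collections import defaultdict

def map_fn(name: str, contents: str):
    # Map: take an input pair (document name, contents) and emit
    # intermediate key/value pairs, one (word, 1) per occurrence.
    for word in contents.split():
        yield word, 1

def reduce_fn(word: str, counts):
    # Reduce: merge all values for one intermediate key into a
    # possibly smaller set of values, here a single sum.
    return sum(counts)

def word_count(documents):
    # The MapReduce library's job: group intermediate values by key,
    # then hand each group to the reduce function.
    groups = defaultdict(list)
    for name, contents in documents:
        for word, one in map_fn(name, contents):
            groups[word].append(one)
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

counts = word_count([("doc1", "to be or not to be")])
# counts == {"to": 2, "be": 2, "or": 1, "not": 1}
```

The real library runs map and reduce on many workers in parallel and does the grouping with a distributed sort, but the user-visible contract is exactly these two functions.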

MapReduce

Implementation

User Program
Master
Worker (Map/Reduce)
Fault tolerance
Locality
Task granularity
Backup tasks

MapReduce

Performance

Grep

MapReduce

Performance (cont.)

Sort

Bigtable

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber

Google, Inc.

Bigtable

Model

Rows

Atomic
Tablets

Column Families

Family : qualifier
Access control

Timestamps

Decreasing order
Garbage collection

Cell

(row : string, column : string, time : int64) -> string

Bigtable

Model (cont.)

Bigtable

Storage

Region -> Store -> MemStore / StoreFile
File -> Blocks
Log -> Write-ahead log
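In the HBase terms this slide uses, a write goes to the write-ahead log first, then into the in-memory MemStore, which is flushed to an immutable, sorted StoreFile once it grows large enough; reads consult the MemStore before falling back to StoreFiles. A toy sketch of that path, with an artificially small flush threshold:

```python
class StoreSketch:
    """Toy model of the HBase-style write path: WAL -> MemStore -> StoreFile."""

    def __init__(self, flush_threshold: int = 2) -> None:
        self.wal = []            # write-ahead log: durability before anything else
        self.memstore = {}       # in-memory buffer of recent writes
        self.storefiles = []     # immutable, sorted flushed files
        self.flush_threshold = flush_threshold

    def put(self, key: str, value: str) -> None:
        self.wal.append((key, value))   # log first, so a crash can be replayed
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            # Flush: write out a sorted, immutable StoreFile, empty the MemStore.
            self.storefiles.append(dict(sorted(self.memstore.items())))
            self.memstore.clear()

    def get(self, key: str):
        # Newest data wins: check the MemStore, then StoreFiles newest-first.
        if key in self.memstore:
            return self.memstore[key]
        for sf in reversed(self.storefiles):
            if key in sf:
                return sf[key]
        return None

s = StoreSketch()
s.put("row1", "v1")
s.put("row2", "v2")   # triggers a flush: both rows move to a StoreFile
s.put("row1", "v1b")  # the newer version lives in the MemStore
```

This is also where the "Disk I/O, time consuming" drawback shows up: reads that miss the MemStore may have to check several StoreFiles, which is why the real systems periodically compact them.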

Bigtable

Implementation

Tablet Location

Bigtable

Implementation (cont.)

Tablet Serving

Bigtable

Evaluation

Benchmarks

Bigtable

Evaluation (cont.)

Scaling

Bigtable

Evaluation (cont.)

Applications

Discussion

Discussion

Contributions of GFS, MapReduce, Bigtable

Supporting large-scale data processing workloads on commodity hardware
Easy to use: hides the details of parallelization, fault tolerance, locality optimization, and load balancing
Scale well both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving)

Drawbacks of GFS, MapReduce, Bigtable

The single master may still be a potential bottleneck
Disk I/O is time-consuming
Many database operations are not supported (multi-row transactions, secondary indexes, . . . )

Appendix

References I

[1] The Google File System. SOSP. 2003.

[2] MapReduce: Simplified Data Processing on Large Clusters. OSDI. 2004.

[3] Bigtable: A Distributed Storage System for Structured Data. OSDI. 2006.

[4] The Hadoop Distributed File System. MSST. 2010.

[5] Hadoop: The Definitive Guide. 2012.

[6] HBase: The Definitive Guide. 2011.

[7] Data-Intensive Text Processing with MapReduce. 2010.

[8] The Chubby lock service for loosely-coupled distributed systems. OSDI. 2006.
