DryadLINQ

Upload: paulina-anderson

Post on 19-Jan-2016


Page 1: Definition DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC

DryadLINQ

Page 2

Definition

• DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data parallel applications running on large PC clusters.

• The goal of DryadLINQ is to make distributed computing on large compute clusters simple enough for every programmer.

• DryadLINQ combines two important pieces of Microsoft technology: the Dryad distributed execution engine and the .NET Language Integrated Query (LINQ).

Page 3

The Evolution of DryadLINQ

• Dryad had its roots in an idea developed in October 2004 by Michael Isard.

• Yuan Yu recognized the potential of LINQ to serve as the front-end programming tool for Dryad, and started the DryadLINQ project in September 2006.

• By early 2008, the Dryad/DryadLINQ combination was made available within Microsoft.

• The DryadLINQ research paper won a best-paper award in 2008 at the eighth USENIX Symposium on Operating Systems Design and Implementation (OSDI).

Page 4

Current Status

• Works with any LINQ-enabled language
– C#, VB, F#, IronPython, …

• Works with multiple storage systems
– NTFS, SQL, Windows Azure, Cosmos DFS

• Released internally within Microsoft
– Used on a variety of applications

• External academic release announced at PDC
– DryadLINQ in source, Dryad in binary
– UW, UCSD, Indiana, ETH, Cambridge, …

Page 5

Dryad Definition

• Dryad is an infrastructure which allows a programmer to use the resources of a computer cluster or a data center for running data-parallel programs. A Dryad programmer can use thousands of machines, each of them with multiple processors or cores, without knowing anything about concurrent programming.

Page 6

The Structure of Dryad Jobs

Page 7

Dryad services

• An API to create distributed applications (jobs) by specifying which processes have to be executed and the communication channels linking them.

• Scheduling of the processes on the cluster machines.

• Fault-tolerance through re-execution of processes after transient failures.

• Monitoring of the computation and statistics collection.

• Job visualization.

• An API for run-time resource management policies.

• Support for efficient bulk data transfer between processes.
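To make the first three services concrete, here is a minimal in-process sketch of a job described as vertices linked by channels, scheduled once their inputs are ready, and re-executed after a failure. Every name in it (Vertex, TinyJobManager, Execute, maxAttempts) is hypothetical and chosen for illustration; this is not Dryad's API and it runs on a single machine.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-process model of a job graph. These types are illustrative
// only and are not Dryad's API.
public sealed class Vertex
{
    public string Name { get; }
    public List<Vertex> Inputs { get; } = new List<Vertex>();   // incoming channels
    public Func<IReadOnlyList<int>, int> Run { get; }           // code executed at this vertex

    public Vertex(string name, Func<IReadOnlyList<int>, int> run, params Vertex[] inputs)
    {
        Name = name;
        Run = run;
        Inputs.AddRange(inputs);
    }
}

public static class TinyJobManager
{
    // Runs each vertex once all of its inputs are available, and re-executes a
    // vertex that throws, up to maxAttempts times.
    public static Dictionary<string, int> Execute(IEnumerable<Vertex> job, int maxAttempts = 3)
    {
        var results = new Dictionary<string, int>();
        var pending = job.ToList();
        while (pending.Count > 0)
        {
            // First throws InvalidOperationException if no vertex is ready
            // (a cycle, or an input that is not part of the job).
            var ready = pending.First(v => v.Inputs.All(i => results.ContainsKey(i.Name)));
            var inputs = ready.Inputs.Select(i => results[i.Name]).ToList();
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    results[ready.Name] = ready.Run(inputs);
                    break;
                }
                catch (Exception) when (attempt < maxAttempts)
                {
                    // Treated as a transient failure: loop and run the vertex again.
                }
            }
            pending.Remove(ready);
        }
        return results;
    }

    public static void Main()
    {
        int failuresLeft = 1;   // the first run of "b" fails, the retry succeeds
        var a = new Vertex("a", _ => 2);
        var b = new Vertex("b", _ =>
        {
            if (failuresLeft-- > 0) throw new Exception("transient");
            return 3;
        });
        var total = new Vertex("sum", xs => xs.Sum(), a, b);

        Console.WriteLine(Execute(new[] { total, a, b })["sum"]);   // 5
    }
}
```

The vertices are passed in an arbitrary order on purpose: the scheduler, not the caller, decides what can run next.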

Page 8

Software Stack

[Figure: software stack. Applications (image processing, machine learning, graph analysis, data mining, and other applications) sit on top of DryadLINQ and other languages, which run on Dryad. Below Dryad are cluster services, the Azure platform, and storage (CIFS/NTFS, SQL Servers, Cosmos DFS), all on Windows Server machines.]

Page 9

Dryad System Architecture

[Figure: Dryad system architecture. Control plane: the job manager holds the job schedule and talks to the cluster's name server (NS) and process daemons (PD). Data plane: vertices (V) exchange data over files, TCP, FIFO, and network channels.]

Page 10

LINQ Framework

[Figure: LINQ framework. A .NET program (C#, VB, F#, etc.) on the local machine sends queries and receives objects through the LINQ provider interface. Execution engines behind the interface: PLINQ, LINQ-to-SQL, LINQ-to-XML, DryadLINQ. Scalability ranges from single-core through multi-core to cluster.]

• Extremely open and extensible

Page 11

DryadLINQ Operators

• Operators present in LINQ which are implemented by DryadLINQ.

• Adaptations of operators present in LINQ which return scalar values (i.e., not IQueryable), but which are modified to return an IQueryable instead. For example, Count returns an integer, while CountAsQueryable returns an IQueryable whose actual contents will be a single integer. The AsQueryable variants can be chained together to produce complex queries, while using the scalar variants would require breaking queries into small sub-queries, which could decrease efficiency.

• New operators, which exist only in DryadLINQ. We have added new operators which cannot be synthesized efficiently from compositions of primitive LINQ operators, and which can substantially improve the performance of queries in the context of a distributed execution environment like Dryad.
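The difference between a scalar operator and its sequence-returning adaptation can be illustrated locally with plain LINQ-to-Objects. The sketch below is an assumption-laden stand-in: CountAsSequence is a hypothetical helper written for this example, not DryadLINQ's CountAsQueryable, and it returns IEnumerable rather than IQueryable to stay small. It shows that the scalar Count forces evaluation at the call site, while the wrapped variant stays part of a deferred, chainable query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ScalarVersusSequence
{
    // Hypothetical local stand-in for an "AsQueryable"-style operator: it wraps
    // the count in a one-element sequence and defers the work until enumeration.
    public static IEnumerable<int> CountAsSequence<T>(this IEnumerable<T> source)
    {
        yield return source.Count();
    }

    public static void Main()
    {
        var data = new List<int> { 1, 2, 3 };

        // Scalar variant: evaluated immediately, so the query ends here.
        int scalar = data.Count();

        // Sequence variant: still a query, so further operators can be chained.
        IEnumerable<int> deferred = data.CountAsSequence().Select(n => n * 10);

        data.Add(4);   // change the input before the deferred query is enumerated

        Console.WriteLine(scalar);             // 3
        Console.WriteLine(deferred.Single());  // 40
    }
}
```

In a distributed setting the same property is what lets the whole chain be planned as one job instead of several small ones.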

Page 12

Combining with LINQ-to-SQL

[Figure: DryadLINQ splits a query into subqueries; subqueries are handed to LINQ-to-SQL instances.]

Page 13

DryadLINQ and LINQ

• C# and LINQ data objects become distributed partitioned files.

• LINQ queries become distributed Dryad jobs.

• C# methods become code running on the vertices of a Dryad job.

Page 14

DryadLINQ representation

Page 15

DryadLINQ features

• Declarative programming: computations are expressed in a high-level language similar to SQL.

• Automatic parallelization: from sequential declarative code, the DryadLINQ compiler generates highly parallel query plans spanning large computer clusters. To exploit multi-core parallelism on each machine, DryadLINQ relies on the PLINQ parallelization framework.

• Integration with Visual Studio: DryadLINQ programmers take advantage of the comprehensive Visual Studio tool set: IntelliSense, code refactoring, integrated debugging, build, and source code management.

• Integration with .NET: all .NET libraries, including Visual Basic, and dynamic languages are available.

• Type safety: distributed computations are statically type-checked.

• Automatic serialization: data transport mechanisms automatically handle all .NET object types.

• Job graph optimizations
– static: a rich set of term-rewriting query optimization rules is applied to the query plan, optimizing locality and improving performance.
– dynamic: run-time query plan optimizations automatically adapt the plan, taking into account the statistics of the data set processed.

• Conciseness: the following one-statement method is a complete implementation of the Map-Reduce computation framework in DryadLINQ:

public static IQueryable<R> MapReduce<S,M,K,R>(
    this IQueryable<S> source,
    Expression<Func<S,IEnumerable<M>>> mapper,
    Expression<Func<M,K>> keySelector,
    Expression<Func<K,IEnumerable<M>,R>> reducer)
{
    return source.SelectMany(mapper).GroupBy(keySelector, reducer);
}
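As a usage sketch, the MapReduce method from this slide can be exercised on a single machine with the standard LINQ-to-Objects provider, since it only relies on Queryable.SelectMany and Queryable.GroupBy. The class name MapReduceSketch and the WordCount helper are hypothetical, and nothing here runs on Dryad; the point is only to show how the mapper, key selector, and reducer fit together.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

public static class MapReduceSketch
{
    // Same shape as the slide's method, repeated so the sketch compiles on its own.
    public static IQueryable<R> MapReduce<S, M, K, R>(
        this IQueryable<S> source,
        Expression<Func<S, IEnumerable<M>>> mapper,
        Expression<Func<M, K>> keySelector,
        Expression<Func<K, IEnumerable<M>, R>> reducer)
    {
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }

    // Hypothetical helper: counts words using the local LINQ-to-Objects provider.
    public static Dictionary<string, int> WordCount(IEnumerable<string> lines)
    {
        return lines.AsQueryable()
            .MapReduce(
                line => line.Split(new[] { ' ' }),   // mapper: line -> words
                word => word,                        // key selector: group by the word itself
                (word, words) => new KeyValuePair<string, int>(word, words.Count()))
            .ToDictionary(pair => pair.Key, pair => pair.Value);
    }

    public static void Main()
    {
        foreach (var pair in WordCount(new[] { "a b a", "b a" }))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

The mapper passes an explicit char array to Split because expression trees cannot contain calls that rely on optional arguments, which the single-char overload in newer .NET versions does.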

Page 16

DryadLINQ System Architecture

[Figure: DryadLINQ system architecture. Client machine: a .NET program builds a query expression; invoking the query (ToTable or foreach) passes it to DryadLINQ, which produces a distributed query plan and vertex code. Data center: the job manager (JM) runs the Dryad execution over the input tables and writes the output tables. Results return to the client as an output DryadTable and are read as .NET objects.]

Page 17

Query Provider

A query provider translates IQueryable objects to a suitable format and ships them to a remote execution engine. It also transforms the remote data into C# objects. DryadLINQ is just an instance of such a provider which interfaces with the Dryad remote execution framework.

[Figure: query provider, steps (1) to (12). The local C# program hands a query object to the query provider, which transforms it and invokes the query on the remote execution engine; the resulting data is transformed back into C# objects for the local program.]
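What a provider receives can be seen with the standard .NET expression-tree API: an IQueryable carries its query as a data structure that can be walked before anything executes. The sketch below uses only System.Linq.Expressions.ExpressionVisitor and the local LINQ-to-Objects provider; the class names OperatorCollector and QueryInspection are hypothetical, and a real provider such as DryadLINQ does far more than list operator names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Collects the names of the query operators found in an expression tree.
public sealed class OperatorCollector : ExpressionVisitor
{
    public List<string> Operators { get; } = new List<string>();

    protected override Expression VisitMethodCall(MethodCallExpression node)
    {
        Operators.Add(node.Method.Name);
        return base.VisitMethodCall(node);   // keep walking into the arguments
    }
}

public static class QueryInspection
{
    // Returns the operators of a query in pipeline order (source side first).
    public static List<string> OperatorsOf(IQueryable query)
    {
        var collector = new OperatorCollector();
        collector.Visit(query.Expression);
        collector.Operators.Reverse();       // the outermost call is visited first
        return collector.Operators;
    }

    public static void Main()
    {
        IQueryable<int> query = new[] { 1, 2, 3 }.AsQueryable()
            .Where(x => x > 1)
            .Select(x => x * 2);

        // The query is inspected as data before it is executed.
        Console.WriteLine(string.Join(" -> ", OperatorsOf(query)));   // Where -> Select

        // Enumerating it asks the provider to execute the tree.
        Console.WriteLine(string.Join(", ", query));                  // 4, 6
    }
}
```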

Page 18

Execution stages of a Dryad Job

Page 19

Partitioned File Structure

Page 20

Reductions (Aggregations)

• var result = input.Aggregate((x,y) => x+y);

[Associative]
int Add(int x, int y);

var sum = input.Aggregate((x,y) => Add(x,y));
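Why associativity matters for a reduction can be shown with a small local experiment: an associative function can be applied to each partition separately and the partial results combined, giving the same answer as a sequential pass. The helper AggregateByPartition below is a hypothetical illustration using plain LINQ-to-Objects, not the way DryadLINQ plans aggregations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedAggregation
{
    // Hypothetical helper: reduce each partition on its own, then reduce the
    // partial results. Every partition must be non-empty, because Aggregate
    // without a seed throws on an empty sequence.
    public static T AggregateByPartition<T>(
        IEnumerable<IEnumerable<T>> partitions, Func<T, T, T> combine)
    {
        return partitions.Select(p => p.Aggregate(combine)).Aggregate(combine);
    }

    public static void Main()
    {
        var partitions = new[] { new[] { 1, 2 }, new[] { 3, 4 } };
        var all = partitions.SelectMany(p => p);

        // Addition is associative: both evaluation orders agree.
        Console.WriteLine(all.Aggregate((x, y) => x + y));                         // 10
        Console.WriteLine(AggregateByPartition<int>(partitions, (x, y) => x + y)); // 10

        // Subtraction is not: the partitioned plan computes a different value.
        Console.WriteLine(all.Aggregate((x, y) => x - y));                         // -8
        Console.WriteLine(AggregateByPartition<int>(partitions, (x, y) => x - y)); // 0
    }
}
```

This is the reason a user-defined function such as Add has to be declared associative before it can safely be split across partitions.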

Page 21

Apply

The Select delegate receives each element individually, while the Apply delegate receives the whole stream.
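The contrast can be sketched locally. ApplyLocal below is a hypothetical stand-in written for this example (it is not DryadLINQ's Apply operator and does nothing distributed): it simply hands the entire sequence to the delegate, which makes computations such as a running sum possible, where each output depends on every earlier element.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ApplySketch
{
    // Hypothetical local stand-in for Apply: the delegate is handed the whole stream.
    public static IEnumerable<R> ApplyLocal<T, R>(
        this IEnumerable<T> source, Func<IEnumerable<T>, IEnumerable<R>> transform)
    {
        return transform(source);
    }

    // A whole-stream computation: each output depends on all earlier elements.
    public static IEnumerable<int> RunningSum(IEnumerable<int> values)
    {
        int total = 0;
        foreach (int v in values)
        {
            total += v;
            yield return total;
        }
    }

    public static void Main()
    {
        var input = new[] { 1, 2, 3, 4 };

        // Select: the delegate sees one element at a time.
        Console.WriteLine(string.Join(", ", input.Select(x => x * x)));                            // 1, 4, 9, 16

        // Apply-style: the delegate sees the stream and can carry state across elements.
        Console.WriteLine(string.Join(", ", input.ApplyLocal(stream => RunningSum(stream))));      // 1, 3, 6, 10
    }
}
```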

Page 22

MapReduce in DryadLINQ

MapReduce(source,        // sequence of Ts
          mapper,        // T -> Ms
          keySelector,   // M -> K
          reducer)       // (K, Ms) -> Rs
{
    var map = source.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.SelectMany(reducer);
    return result;       // sequence of Rs
}

Page 23

Map-Reduce Plan (When reduce is combiner-enabled)

[Figure: map-reduce plan in three tiers. Map: map (M), sort (Q), groupby (G1), combine (C), distribute (D). Dynamic aggregation: mergesort (MS), groupby (G2), reduce (R). Reduce: mergesort (MS), groupby (G2), reduce (R).]

Page 24

Page 25

An Example: PageRank

Ranks web pages by propagating scores along the hyperlink structure.

Each iteration as an SQL query:

1. Join edges with ranks
2. Distribute ranks on edges
3. GroupBy edge destination
4. Aggregate into ranks
5. Repeat
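The five steps map almost one-to-one onto standard LINQ operators. The following is a minimal local sketch with LINQ-to-Objects, under simplifying assumptions that are not on the slide: there is no damping factor, each page splits its rank evenly over its outgoing edges, and a page with no incoming edge drops out of the result. The names PageRankSketch and Step are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PageRankSketch
{
    // One iteration, following the numbered steps. Simplifying assumptions:
    // no damping factor, and a page without incoming edges drops out.
    public static List<(string Page, double Rank)> Step(
        IReadOnlyList<(string Src, string Dst)> edges,
        IEnumerable<(string Page, double Rank)> ranks)
    {
        // Number of outgoing edges per page, used to split a page's rank.
        var outDegree = edges.GroupBy(e => e.Src).ToDictionary(g => g.Key, g => g.Count());

        return edges
            // 1. Join edges with ranks, and 2. distribute each rank over the page's edges.
            .Join(ranks, e => e.Src, r => r.Page,
                  (e, r) => (Dst: e.Dst, Share: r.Rank / outDegree[e.Src]))
            // 3. Group by edge destination.
            .GroupBy(c => c.Dst)
            // 4. Aggregate the shares into the new rank of each page.
            .Select(g => (Page: g.Key, Rank: g.Sum(c => c.Share)))
            .ToList();
    }

    public static void Main()
    {
        var edges = new List<(string Src, string Dst)>
        {
            ("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")
        };
        var ranks = new List<(string Page, double Rank)>
        {
            ("A", 1.0 / 3), ("B", 1.0 / 3), ("C", 1.0 / 3)
        };

        // 5. Repeat.
        for (int i = 0; i < 3; i++)
            ranks = Step(edges, ranks);

        foreach (var entry in ranks.OrderBy(r => r.Page))
            Console.WriteLine($"{entry.Page}: {entry.Rank:F3}");
    }
}
```

Starting from a uniform rank of 1/3, one call to Step on this graph gives A = 1/3, B = 1/6, and C = 1/2, and the total stays at 1 because every page has at least one outgoing edge.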

Page 26

Multi-Iteration PageRank

[Figure: three chained PageRank iterations (Iteration 1, Iteration 2, Iteration 3) over the pages and ranks inputs, connected by memory FIFO channels.]

Page 27

Dryad Enters the Market

• A big step is coming, as Dryad and DryadLINQ become fully productized as part of the Microsoft HPC Server suite.

• It will be integrated with Microsoft SQL Server and Windows Azure to give customers from academia to the business community a new, powerful computing tool.

• Offering an easy-to-use but powerful, data-intensive computing tool

• It benefits a whole new set of Microsoft customers

Page 28

Windows Azure and DryadLINQ

• Windows Azure is a platform for building scalable, highly reliable, multi-tiered web service applications. It is hosted in Microsoft’s large data centers in the United States, Europe, and Asia. Windows Azure has both compute and data resources. The compute resources are designed to allow applications to scale to thousands of servers and data resources.

• No port of Hadoop or DryadLINQ to Windows Azure is currently available. However, Windows Azure is an excellent platform for experimenting with new variations on large-scale map-reduce algorithms, as these patterns are easily coded as worker role networks.

Page 29

“We’re convinced that we will delight our customers, both with the pure capability of the system, as well as its ease of use. What I really like about Dryad is that it is not just about handling a problem in a better way, it is also about new possibilities in computing that you couldn’t imagine before.”

Page 30

Resources

• http://research.microsoft.com/en-us/projects/dryadlinq/

• http://research.microsoft.com/en-us/projects/dryad/

• DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

• Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations - Yuan Yu, Pradeep Kumar Gunda, Michael Isard

• Distributed Data-Parallel Computing Using a High-Level Programming Language - Michael Isard, Yuan Yu

• Some sample programs written in DryadLINQ - Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey, Frank McSherry, and Kannan Achan

• http://blogs.msdn.com/b/dryad/archive/2009/11/24/what-is-dryad.aspx

• http://research.microsoft.com/en-us/projects/azure/faq.aspx

• http://research.microsoft.com/en-us/news/features/dryad-012611.aspx

Page 31

Questions?

Page 32

Thank you for your attention!