Parallel worlds of CRuby's GC

Post on 15-Jan-2015


DESCRIPTION

I gave this presentation at RubyConf 2011. Yay!

TRANSCRIPT

Parallel worlds of CRuby's GC Powered by Rabbit 0.9.3

Parallel worlds of CRuby's GC

nari/Narihiro Nakamura/@nari_en

Network Applied Communication Laboratory Ltd.

I'm very happy now.

Today is my first presentation in English.

My English is not good.

But, I'll do my best. Please bear with me :)

Self introduction


Ice-cream factory

I worked on an assembly line.

For example, I made many cardboard boxes.

I was a professional cardboard box maker :)


Ice-cream factory

I made 150 boxes per hour (ZOMG)

http://www.flickr.com/photos/kevincollins123/5887984753/

I was like a machine!!


Working with Java

I worked in a big company.

This work was similar to assembly line work.

I made a part of a product. I didn't understand the whole product.

http://www.flickr.com/photos/kevincollins123/5887984753/

I was still like a machine!!


My current work

Currently, I work at NaCl.

matz, shyouhei, and takaokouji are my co-workers.

shugo is my boss.

They are CRuby committers.


When I started Ruby programming

I felt free.

This work wasn't similar to assembly line work.

I could make the whole product.

http://www.flickr.com/photos/danzden/121379782/

I was no longer a machine!!


Garbage Collection for me

GC technology is very interesting to me.

GC is a garbage collecting machine.

I've been creating it since then. It's a lot of fun!!


I'm making a machine!!

My relationship to GC


I'm a CRuby Committer

I work on GC.

And, I wrote a book about GC.

But, it's only in Japanese :(

And, I've been creating GC with RDD.

What is RDD?

RDD = RubyKaigi Driven Development


My RDD history

LazySweepGC - RubyKaigi2008

LonglifeGC - 2009

LazySweepGC - 2010

ParallelMarkingGC - 2011


LonglifeGC

It treats long-lived objects as a special case.

It is similar to Generational GC.

LonglifeGC was rejected for CRuby 1.9.2 for some reason.

:'(

http://www.flickr.com/photos/conifer/2389654222/

But, LonglifeGC has been used in Kiji :-)


Kiji

Kiji is an optimized version of REE made by Twitter developers.

The Twitter team substantially extended LonglifeGC.

It's cool!!


But, Kiji will be rejected also... :'(


LazySweepGC

A traditional Mark & Sweep (M&S) GC executes mark and sweep atomically.

The Ruby application stops during GC (stop-the-world).

With lazy sweeping, the sweep is not done all at once. It is postponed and done little by little.


LazySweepGC

Each object allocation sweeps Ruby's heap only until it finds an appropriate free object.
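
The idea can be sketched in a few lines. This is only a toy model, not CRuby's code: the class name `LazySweepHeap`, the list of mark bits, and the `cursor` field are all assumptions made for illustration. Each call to `allocate()` continues the sweep from where the last call stopped and returns as soon as one dead slot is found.

```python
class LazySweepHeap:
    """Toy heap (hypothetical): one mark bit per slot, swept on demand."""

    def __init__(self, marked):
        self.marked = list(marked)  # mark bits left behind by the mark phase
        self.cursor = 0             # where the next sweep step resumes

    def allocate(self):
        """Sweep only until one reusable slot is found, then stop."""
        while self.cursor < len(self.marked):
            index = self.cursor
            self.cursor += 1
            if self.marked[index]:
                self.marked[index] = False  # live object: reset its mark bit
            else:
                return index                # dead object: reuse this slot
        return None  # nothing left to sweep; a real GC would mark again


heap = LazySweepHeap([True, True, False, True, False])
first = heap.allocate()   # examines slots 0, 1, 2
second = heap.allocate()  # examines slots 3, 4
third = heap.allocate()   # the sweep is finished
```

The sweep work is spread over the allocations, so no single allocation has to pay for sweeping the whole heap.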


Improvements

This improves the response time of GC.

I.e., the worst-case pause time of GC decreases.


LazySweepGC

LazySweepGC has been available since Ruby 1.9.3.


Today's topics


Why do we need Parallel Marking?

What to consider?

How to implement?

How much did performance improve?


Why do we need Parallel Marking?

This is CRuby's current GC.


CRuby's current GC

GC operates on only 1 core.

In a multi-core environment, the other cores don't help GC.

http://www.flickr.com/photos/hortont/2698261070/

GC: "I'm alone, it's so hard."

http://www.flickr.com/photos/knallaerbse/2863161933/

We should run GC in parallel!!

First, let me explain a few GC-related concepts.


What is GC?

GC collects all dead objects.


What is a dead object?

A dead object is an object that will never be referenced by the program again.

In GC terms, we say that a dead object is unreachable from Roots.


What are Roots?

Roots are a set of pointers that directly reference objects in the program.

e.g. Ruby's local variables, etc.


For example


Please remember that

GC collects objects that are unreachable from Roots.


Next, let me explain the current CRuby GC algorithm.


CRuby's GC algorithm summary

CRuby adopts the Mark & Sweep algorithm.

The collector works in separate Mark and Sweep phases.


In the Mark phase

the collector marks live objects that are reachable from Roots.

For example

Mark phase with GC.start

Ruby Heap after marking

In the Sweep phase

the collector sweeps "dead" objects.

"dead" means unmarked

"dead" means unreachable from Roots

Sweep phase
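
The two phases can be sketched as follows. This is a minimal illustration, not CRuby's implementation: the object names, the `references` table, and the functions `mark` and `sweep` are hypothetical. Note that "c" and "d" reference each other but are still dead, because nothing in Roots reaches them.

```python
def mark(roots, references):
    """Mark phase: collect every object reachable from the roots."""
    marked = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(references.get(obj, ()))
    return marked


def sweep(heap, marked):
    """Sweep phase: split the heap into live (marked) and dead (unmarked)."""
    live = [obj for obj in heap if obj in marked]
    dead = [obj for obj in heap if obj not in marked]
    return live, dead


heap = ["a", "b", "c", "d", "e"]
references = {"a": ["b"], "b": ["a"], "c": ["d"], "d": ["c"]}
roots = ["a"]

marked = mark(roots, references)
live, dead = sweep(heap, marked)
```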

Characteristics of CRuby's GC


Characteristics

Stop-the-world algorithm

Single-threaded execution


These days, PCs have multi-core processors. But,

GC executes on a single thread.

Other cores don't work during GC.

What a waste!!


How can we fix this?

Use Parallel Marking, Luke

What is Parallel Marking?


The collector runs several marking processes in parallel by using native threads.

We will be happy on multi-core machines.

Flow diagram for Parallel Marking

BTW: Why not perform sweeping in parallel?

Sweeping is much faster than marking.

See ko1's research:

<URL:http://www.atdot.net/~ko1/diary/201011.html#d4>


So, Mark phase improvement = GC improvement

And, we already have lazy sweeping.


What to consider when implementing Parallel Marking?

We should consider two problems

Workload balancing

Wait-free algorithm


Workload balancing

How can we divide the marking task into sub-tasks?

I tried thinking about a simple approach.

1 branch of Roots is marked by 1 thread.


This means..

Tasks are distributed to multiple threads.

The task of marking the entire heap is divided into several tasks, each marking a single branch.


This seems to be fine.

But actually, this solution suffers from a workload balancing problem.

Each thread doesn't know what the other threads are doing.

For instance, if A and B finish their work early,

then they will stop doing anything :(
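
A small worked example shows how much time is wasted. The branch sizes below are made-up numbers in arbitrary units of marking work, and `static_partition` is a hypothetical helper: each thread only marks the branches it was given at the start.

```python
def static_partition(branch_sizes_per_thread):
    """Return (busy time per thread, total elapsed time, idle time per thread)."""
    busy = [sum(sizes) for sizes in branch_sizes_per_thread]
    total = max(busy)                  # marking ends when the slowest thread ends
    idle = [total - b for b in busy]   # time each thread spends doing nothing
    return busy, total, idle


# Threads A and B each got one small branch, C got three large ones.
busy, total, idle = static_partition([[1], [1], [5, 5, 5]])
```

Here A and B are idle for 14 of the 15 time units, while C does almost all of the marking alone.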

I think "machines should work forever" :D

So, I think A and B should ...

http://www.flickr.com/photos/ryanr/157458385/

Parallel Marking with Task Stealing.

If A and B finish their work early, they take tasks from a thread that is still busy.

This is called "Task Stealing".


Wait-free algorithm


What does "wait-free" mean?

A wait-free program executes without blocking.

It guarantees per-thread progress.


Why is wait-free important?

Amdahl's law

Amdahl's law is used to find the maximum expected improvement to an overall system when only part of the system is improved.

[cited from `Amdahl's law - Wikipedia']


Amdahl's law is used in parallel computing.

If the parallel portion of the system is X%

and the number of processors is Y,

how much speedup can we expect?
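
The standard form of Amdahl's law answers this question: with parallel fraction p and n processors, the speedup is 1 / ((1 - p) + p / n). The sketch below evaluates it. The function name `amdahl_speedup` and the sample values (80%, 8 processors) are chosen only for illustration and do not come from the slides.

```python
def amdahl_speedup(parallel_percent, processors):
    """Expected speedup when parallel_percent of the work runs on `processors` cores."""
    p = parallel_percent / 100.0
    return 1.0 / ((1.0 - p) + p / processors)


# 80% parallel on 8 processors gives about 3.3x, not 8x.
speedup_80_on_8 = amdahl_speedup(80, 8)

# Even with a huge number of processors, 80% parallel never reaches 5x,
# because the remaining 20% still runs on one core.
speedup_80_on_many = amdahl_speedup(80, 10**9)
```

This is why the non-parallel part matters so much: it puts a hard limit on the speedup, no matter how many cores are added.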


It's worse than expected, right?

The conclusion so far


We should consider how we can efficiently balance workloads.

So, we use Task Stealing.

We should eliminate non-parallel parts by using a wait-free algorithm.


How to implement Parallel Marking?


Task Stealing

In Task Stealing, threads steal tasks from each other.

Task Stealing is achieved with Arora's Deque.


Arora's Deque

Deque stands for double-ended queue.

In Arora's Deque, the deque contains tasks as elements.

It's a wait-free data structure.


Arora's Deque has only three operations: push(), pop(), and shift().

Each mark worker has a single deque.

Only the owner can call pop() and push().

A worker can call shift() to steal a task from another worker's deque.
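
The interface can be sketched like this. Please note the assumptions: the class name `TaskDeque` is hypothetical, and this stand-in protects the deque with a lock, so it is NOT wait-free and is not Arora's algorithm. It only shows which end each operation works on: the owner pushes and pops at one end, and a thief takes from the opposite end.

```python
import threading
from collections import deque


class TaskDeque:
    """Lock-based stand-in for a task-stealing deque (illustration only)."""

    def __init__(self):
        self._tasks = deque()
        self._lock = threading.Lock()

    def push(self, task):
        """Owner only: add a task at the owner's end."""
        with self._lock:
            self._tasks.append(task)

    def pop(self):
        """Owner only: take the newest task, or None if empty."""
        with self._lock:
            return self._tasks.pop() if self._tasks else None

    def shift(self):
        """Other workers: steal the oldest task, or None if empty."""
        with self._lock:
            return self._tasks.popleft() if self._tasks else None


worker1 = TaskDeque()
worker1.push("B")
worker1.push("C")
stolen = worker1.shift()  # a thief takes from the far end
own = worker1.pop()       # the owner takes from its own end
empty = worker1.pop()     # nothing is left
```

Because the owner and the thieves work on opposite ends, they only get in each other's way when the deque is almost empty.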

"Hey wait a minute, doesn't shift() have

contention problems?"


In what ways could shift() cause contention problems?

e.g...

Multiple threads (workers) may call shift() on the same deque at the same time.


e.g...

shift() and pop() could be called at the same time when the deque has only one element.


But, Arora's Deque avoids these contention problems.


Serialization

shift() is serialized by using CAS.

CAS = Compare And Swap

And, this serialization doesn't use a lock.

It's wait-free!!
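
The CAS pattern can be illustrated as below. This is a simulation under stated assumptions: Python has no CAS primitive, so the hypothetical `AtomicInt` class emulates one with an internal lock, whereas real CAS is a single atomic hardware operation. `try_shift` is also a made-up name, and the real deque has more details than this. The point is only that when two thieves read the same index, exactly one CAS succeeds, and the loser returns immediately instead of waiting.

```python
import threading


class AtomicInt:
    """Simulated atomic integer (the lock only emulates hardware atomicity)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to `new` only if it still equals `expected`."""
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def try_shift(tasks, top):
    """One steal attempt: a task, or None if empty or if another thief won."""
    index = top.load()
    if index >= len(tasks):
        return None
    task = tasks[index]
    if top.compare_and_swap(index, index + 1):
        return task
    return None  # lost the race; the caller does not block


tasks = ["B", "C"]
top = AtomicInt(0)

# Two thieves read the same index before either one updates it.
seen_by_thief1 = top.load()
seen_by_thief2 = top.load()
thief1_won = top.compare_and_swap(seen_by_thief1, seen_by_thief1 + 1)
thief2_won = top.compare_and_swap(seen_by_thief2, seen_by_thief2 + 1)

next_task = try_shift(tasks, top)
after_last = try_shift(tasks, top)
```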


I will omit the details of how the serialization is implemented.

For the sake of this presentation, let's assume that Arora's Deque avoids contention problems.


Summary for Arora's Deque

A simple data structure for Task Stealing.

Each worker has a single deque.

Stealing (shift operation) is wait-free!


How to use Arora's Deque in Parallel Marking?

First try: A task is an object.

Let's say that worker A has a branch that is composed of 4 objects.

We start by marking A and pushing it to the deque.

pop A, mark B and C, push B and C.

pop C, mark D, push D

pop D, pop B

This is how a branch is marked.
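
The steps above can be replayed in code. The shape of the branch (A references B and C, and C references D) is an assumption reconstructed from the steps, and `mark_branch` is a hypothetical name. A plain list stands in for the owner's end of the deque.

```python
def mark_branch(root, children):
    """Mark one branch, using a stack of already-marked objects."""
    marked = {root}
    stack = [root]        # mark the root and push it
    pops = []             # order in which objects are popped
    while stack:
        obj = stack.pop()
        pops.append(obj)
        for child in children.get(obj, ()):
            if child not in marked:
                marked.add(child)    # mark the child ...
                stack.append(child)  # ... then push it
    return pops, marked


children = {"A": ["B", "C"], "C": ["D"]}
pops, marked = mark_branch("A", children)
```

The pop order is A, C, D, B, the same as in the steps above.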

How do you steal?

Suppose that Worker1 has tasks B and C, and Worker2 has no tasks.

Worker2 steals task B from Worker1 by using shift().


Summary

The marker uses Arora's Deque as a marking stack.

A "task" means an object.

The granularity of the task is very fine.

This is a naive implementation.


I implemented this approach.

But..

It's slower than the original GC.

http://www.flickr.com/photos/emariephotos/4958245676/

OMG...

I fell into the Pitfalls of Parallel Processing (PPP!!!)

Why slow?


pop(), push(), and shift() are called frequently,

because the deque holds fine-grained tasks.

Their overhead is too big.


How to fix this?

We can make the tasks less fine-grained.

A task is a branch

All branches in Roots are divided roughly among the deques.

Each Worker marks a branch in its deque.

When the deque is empty, the worker steals a branch from another worker.
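
The effect of stealing whole branches can be simulated. Everything here is an assumption for illustration: the branch sizes are made-up work units, `mark_with_stealing` is a hypothetical name, the thief simply picks the worker with the most branches left, and the simulation runs on one thread with a virtual clock instead of real marking threads.

```python
import heapq
from collections import deque


def mark_with_stealing(branch_sizes_per_worker):
    """Return the time at which each worker runs out of work."""
    deques = [deque(sizes) for sizes in branch_sizes_per_worker]
    ready = [(0, worker) for worker in range(len(deques))]  # (free at, worker)
    heapq.heapify(ready)
    finish = [0] * len(deques)
    while ready:
        now, worker = heapq.heappop(ready)
        if deques[worker]:
            size = deques[worker].pop()          # owner end of its own deque
        else:
            victim = max(range(len(deques)), key=lambda v: len(deques[v]))
            if not deques[victim]:
                finish[worker] = now             # nothing left anywhere
                continue
            size = deques[victim].popleft()      # steal from the other end
        heapq.heappush(ready, (now + size, worker))
    return finish


# Workers A and B start with one small branch, C with three large ones.
finish = mark_with_stealing([[1], [1], [5, 5, 5]])
```

With these numbers all workers are done by time 6. If every worker marked only its own branches, worker C would need 15 time units.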

like this!!


Good point & Bad point

The number of calls to the deque's operations is reduced.

The marking speed of each worker is improved.

However, coarse-grained tasks decrease parallelism.


Why do coarse-grained tasks decrease parallelism?

Tasks may involve a large branch.

If an object in B's branch has many child objects...

... then A can't steal it while B is marking the large branch.

So, the worker needs to treat large branches as special cases.

Almost all large branches hold large Array objects and/or large Hash objects.


Treatment for large Array objects and Hash objects

Each marker has a special deque to manage them.

A marker divides them into fixed-size tasks.

e.g. elements 0-9 of the Array, elements 10-19 of the Array...
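
The division can be sketched as follows. The function name `split_into_tasks` is hypothetical, and the chunk size of 10 is taken only from the example above. The actual size used in the implementation is not stated here.

```python
def split_into_tasks(length, chunk_size=10):
    """Split indexes 0..length-1 into (first, last) ranges of at most chunk_size."""
    return [
        (start, min(start + chunk_size, length) - 1)
        for start in range(0, length, chunk_size)
    ]


# An Array with 25 elements becomes three separate tasks.
tasks = split_into_tasks(25)
```

Each range is a separate task in the special deque, so an idle worker can steal one range while the owner marks another.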


By doing this, other workers can steal divided tasks.

This improves parallelism.


Summary

The naive implementation was slow.

The task granularity was too fine.

A "task" now means a branch in Roots.

The task granularity is coarse.

It's faster!!


How much did performance improve?


These are my machine specs

My machine has only 2 cores

Memory: 8GB

OS: Linux


Parallel marking uses 4 marking threads.


The first benchmark program is make benchmark.

This is the benchmark used in CRuby development.


Why does this seem so slow?

I think it's affected by the setup cost of Parallel Marking.

e.g. creating the marking threads and allocating the deques.


In most of the benchmarks, there are few objects to mark.

In this case, the cost of Parallel Marking is larger than its benefit.


The next benchmark program is make rdoc.

make rdoc generates the Ruby documentation.

This benchmark measures the total execution time and the GC execution time of make rdoc.


make rdoc

It takes about 80 seconds on my machine.

In fact, 30% of that time is spent on GC!!

How much did performance improve?


Total GC time is improved by 40%!

So fast!!
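
To put these numbers together, here is a rough calculation. It assumes that "improved by 40%" means the GC time became 40% shorter, and that the non-GC part of make rdoc did not change. Both are assumptions, and the figures are rounded ones from the slides, so the result is only an estimate.

```python
total_seconds = 80.0  # make rdoc on the 2-core machine
gc_share = 0.30       # fraction of that time spent in GC
gc_reduction = 0.40   # assumption: GC time shrinks by 40%

gc_before = total_seconds * gc_share               # about 24 s of GC
gc_after = gc_before * (1.0 - gc_reduction)        # about 14.4 s of GC
total_after = total_seconds - gc_before + gc_after # about 70.4 s overall
overall_gain = 1.0 - total_after / total_seconds   # about 12% faster overall
```

So under these assumptions, a 40% shorter GC time makes the whole run roughly 12% shorter, because GC is only 30% of the total.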


In a many-core environment

I expect we would get a large improvement.

e.g. 8 cores, 16 cores...

But, my machine has just 2 cores.

I can't see it :(


Best case for Parallel GC

When there are many objects.

In this case, there are also many mark targets.

When the objects are long-lived.

Server-side applications?


Demo


Demonstration

I want to show the performance improvement with Parallel GC.

This demonstration is in the style of a video game.


Let me explain this game.

The character has HP.

When GC runs,

the character loses HP while waiting for the GC to finish.

We must reach the goal before the HP runs out.


Other characteristics of SUPER NARIO GC

GC runs at fixed intervals.

A lot of objects are generated to increase GC's burden.

Burden = Game Level


Let's compare Original GC and Parallel GC

Original GC pause time is long.

This game will be difficult.

Parallel GC pause time is short.

This game will be easy.


OK, let's try!

DEMO: Original GC version

Oops.. so difficult!!!

DEMO: Parallel GC version

Wow!! Easy!!!!

Let's compare the average GC times

Fast!!

Remaining Problems


Windows OS is not supported

The mark workers use pthreads as native threads.

They also use some gcc built-in functions.

But, I'll support Windows eventually.


Increased memory usage.

The size of 1 deque is roughly 32KB.

But generally, multi-core machines have plenty of memory.

So, I think it's OK :P


Conclusion


I implemented Parallel Marking GC.

GC was improved!

I'll report to ruby-core soon.


But, Parallel Marking has some problems.

I'll fix these.


Source code

Parallel Marking GC: <URL:https://github.com/authorNari/ruby/tree/pmark_div_root2>

SUPER NARIO GC: <URL:https://github.com/authorNari/nario/>


Acknowledgments

The following people helped me make this presentation!!

Tor-san!!

matz, shugo, yhara, sada, takaokouji, other co-workers!!


Thank you!!!

Do you have any questions?

Please keep your questions short and simple :)


Sorry

It's too difficult for me to understand/answer the question.

Could you send the question on Twitter (@nari_en)?
