© 2010 Autodesk

Massive Parallelism in AI: Throughput versus Realtime

Pierre Pontevia, 10th March 2010

Agenda

Where are we today

The pathfinding challenge: from throughput to realtime

MASAI: the premises of a massively parallel AI solution

WHERE ARE WE TODAY?

Where are we today?

Parallel programming has become a reality for game developers since the arrival of "next gen" consoles (2005-2006)

Since then, a lot of new languages and programming models have been suggested to better tackle parallelism,

And new hardware is being announced, shaping the future of consoles…

So this is a good moment to see how parallelism could be revisited for the games of tomorrow… with a special focus on pathfinding

As a start, the 13 dwarves should help us to find the right parallel pattern

The 13 dwarves is an initiative from the University of California, Berkeley to help achieve high parallelism

A dwarf is an algorithmic method that captures a pattern of computation and communication

The 1st exercise is to identify which dwarves match the problems involved in pathfinding

As a start, the 13 dwarves should help us to find the right parallel pattern (cont’d)

Dwarf: Description

1. Dense Linear Algebra: Data are dense matrices or vectors

2. Sparse Linear Algebra: Data sets include many zero values. Data is usually stored in compressed matrices to reduce the storage and bandwidth requirements to access all of the nonzero values

3. Spectral Methods: Data are in the frequency domain, as opposed to time or spatial domains

4. N-Body Methods: Depends on interactions between many discrete points. Variations include particle-particle methods

5. Structured Grids: Represented by a regular grid; points on the grid are conceptually updated together. It has high spatial locality

6. Unstructured Grids: An irregular grid where data locations are selected, usually by underlying characteristics of the application

7. Monte Carlo: Calculations depend on statistical results of repeated random trials

As a start, the 13 dwarves should help us to find the right parallel pattern (cont’d)

Dwarf: Description

8. Combinational Logic: Functions that are implemented with logical functions and stored state

9. Graph Traversal: Visits many nodes in a graph by following successive edges. These applications typically involve many levels of indirection, and a relatively small amount of computation

10. Dynamic Programming: Computes a solution by solving simpler overlapping sub problems. Particularly useful in optimization problems with a large set of feasible solutions

11. Backtrack and Branch + Bound: Finds an optimal solution by recursively dividing the feasible region into sub domains, and then pruning sub problems that are suboptimal

12. Construct Graphical Models: Constructs graphs that represent random variables as nodes and conditional dependencies as edges. Examples include Bayesian networks and Hidden Markov Models

13. Finite State Machine: A system whose behavior is defined by states, transitions defined by inputs and the current state, and events associated with transitions or states

Recent languages and programming models provide guidance for parallel implementation

Data Parallelism for homogeneous architectures
• OpenMP
• TBB
• Ct

Data Parallelism for heterogeneous architectures
• CUDA
• OpenCL
• DirectCompute
• SPURS
• RapidMind

PC clusters
• MPI
• MapReduce

Concurrent Programming
• PPL, Asynchronous Agents
• Grand Central Dispatch

However, video games impose specific constraints that impact parallel design…

Memory resource constraints: how much scratch memory is required by the solver

Concurrent memory access: computations are done on data that can change significantly from frame to frame

Data lifetime / persistence: things are volatile by nature

Reactivity / time delay / frequency constraints: when do you really need the result of your computation?

Interruptibility: the system can change its mind; 80% of the path goals are never reached

…and even more constraints when you develop middleware

Multiple cohabitant models
• Several middleware packages with several threading models
• Not blocking is not enough -> fine tuning issues
• SPURS everywhere?

Multiple HW targets
• PC is different from the Xbox 360 console, which is different from a PlayStation® 3 (PS3) console
• Multiple exclusive programming languages

A gap analysis on existing solutions shows that no one solution fits the video game context perfectly

No model really treats memory as a limiting resource in the design of parallel solutions

No model takes into account time as a dimension of the problem

All the approaches are very throughput oriented

THE PATHFINDING CHALLENGE: FROM THROUGHPUT TO REALTIME

Pathfinding in a nutshell

Three stages: Path Planning, Path Smoothing, and DA(*) & Steering

Path Planning: LOW FREQUENCY (0.1 Hz)
• Input: topology, current position, destination
• Output: valid path

Path Smoothing: MEDIUM FREQUENCY (2 Hz)
• Input: current position, destination
• Output: target point

DA(*) & Steering: HIGH FREQUENCY (10 Hz)
• Input: current position, target point
• Output: new target point

(*): DA - Dynamic Avoidance

Pathfinding is made of different solvers with different characteristics

3 categories of solvers:
• A*, graph traversal: low frequency / large input and work memory
• Trajectory smoothing: medium frequency / optional
• DA / steering: high frequency / critical

[Figure: work memory requirements plotted against frequency. A* and graph traversals sit at low frequency with > 500 K of work memory; smoothing, DA, and steering sit at higher frequency with < 5 K.]

There are 2 kinds of data parallelism in pathfinding

Number of characters: all solver jobs increase linearly with the number of characters

Size of graph : Graph Traversal related solvers can use a Dwarf 9 pattern solving approach

A first approach could be a single frame batch paradigm (throughput) compatible with most programming models

[Figure: inside the framework, the Pathfinder of each entity (Entity 1, …) posts to four middleware queues: Path Request Queue, Target Request Queue, DA Request Queue, and Steering Request Queue. The corresponding tasks (SearchPath Task, SelectTarget Task, ComputeDA Task, ComputeSteering Task) go through the queue of the PPM (Parallel Programming Model), which runs them on compute kernels.]

Each task request has a context composed of character data, global data, and potentially customized objects

Searching Path
• Context: Start & Destination, Movement Model, Constraint, LPF(*) Shortcut, Pathdata, potentially all PathObjects
• Output: Path

Selecting Target
• Context: Current Pos, Current Target, Path, Movement Model, Constraint, LPF(*) Shortcut, PathObjects of the path, Pathdata
• Output: Target Pos

Computing DA Target
• Context: Current Pos, Current Target, Movement Model, Cluster of entities, Pathdata
• Output: DA Target Pos

Steering
• Context: Current Pos, Current DA Target, Movement Model, Current PathObject, LPF Shortcut
• Output: Wanted Speed & Yaw

Legend (categories used in the original diagram): Character Context, Global Data, Customizable, Output

(*): LPF – Obstacle Avoidance

However, as the number of solvers can be limited by memory…

[Figure: ComputePath, ComputeTargetPoint, ComputeDA TgtPoint, and ComputeSteering jobs laid out over Thread 1 and Thread 2.]

…a throughput maximization approach to parallelization can be capped by Amdahl’s law

[Figure: three timelines compared. Serial - No memory limitation; Parallel - No memory limitation; Parallel - Memory constrained environment.]
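
As a rough illustration of this cap, Amdahl's law gives the speedup on a number of threads when only a fraction of the frame's pathfinding work can run in parallel. The sketch below assumes that memory limits show up as a smaller parallel fraction; the function name and the fractions are illustrative assumptions, not measurements from the talk.

```python
def amdahl_speedup(parallel_fraction: float, threads: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


# Hypothetical figures: with no memory limit, assume 95% of the work is
# parallel; when scratch memory forces large solvers to take turns, assume 50%.
for label, fraction in [("no memory limit", 0.95), ("memory constrained", 0.50)]:
    for threads in (2, 6):
        speedup = amdahl_speedup(fraction, threads)
        print(f"{label:>18}, {threads} threads: x{speedup:.2f}")
```

Whatever the thread count, the speedup stays below 1 / (1 - parallel fraction), which is why adding threads alone does not help once memory serializes the solvers.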

To avoid that, the Pathfinding solution needs to find more task parallelism along the time dimension

Moving from

“How to solve all the work within a frame”

To

“How to distribute work across several frames”
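
One way to picture the shift to several frames is a search that can be suspended and resumed. The sketch below is a minimal illustration under stated assumptions: it uses a plain breadth-first search as a stand-in for A*, a Python generator to hold the search state between frames, and a hypothetical `budget` of node expansions per frame. None of these names come from the talk.

```python
from collections import deque


def sliced_search(graph, start, goal, budget):
    """Breadth-first search as a generator.

    Expands at most `budget` nodes per resume, yields None while the search
    is unfinished, then yields the path (or [] if the goal is unreachable).
    """
    parent = {start: None}
    frontier = deque([start])
    expanded = 0
    while frontier:
        node = frontier.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            yield path[::-1]
            return
        for neighbour in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                frontier.append(neighbour)
        expanded += 1
        if expanded % budget == 0:
            yield None  # budget spent for this frame; resume on the next one
    yield []


# Hypothetical 6-node graph; each step of the loop stands for one frame.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": ["F"], "F": []}
frames = 0
for result in sliced_search(graph, "A", "F", budget=2):
    frames += 1
    if result is not None:
        print(f"path {result} found after {frames} frames")
```

Because the caller decides when to resume, a request whose goal has become irrelevant can simply be dropped between frames.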

A good illustration is describing Pathfinding as a statechart with 4 orthogonal states

[Figure: pathfinding statechart.]

Top-level states: Stopped (entered on Path Not Found or Has Arrived) and Active

Active contains four orthogonal states:
• Path Planning: No Path, Searching Path, Path Found (events: New Destination, Path Found, Has arrived)
• Target Selection: No Target, Selecting Target, Target Found (events: Path Updated, Target Found, Has arrived)
• DA Target: No DA Target, Computing DA Target, DA Target Computed (events: Target Updated, DA Target Found, Has arrived)
• Steering: No Steering, Computing Steering, Steering Computed (events: DA Target Updated, Steering Computed, Has arrived)

Inputs shown on the diagram: New Destination, New Pos, Path Updated, Target Updated, DA Target Updated, Pos updated
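
The four orthogonal states can be sketched as four small state machines chained by events. This is only one reading of the diagram: the assumption here is that each region moves from an idle state to a computing state to a done state, and that finishing one region triggers the next. The class and method names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    IDLE = "idle"            # No Path / No Target / No DA Target / No Steering
    COMPUTING = "computing"  # Searching Path / Selecting Target / ...
    DONE = "done"            # Path Found / Target Found / ...


REGIONS = ["path_planning", "target_selection", "da_target", "steering"]


class PathfindingChart:
    def __init__(self):
        self.active = False
        self.phase = {region: Phase.IDLE for region in REGIONS}

    def new_destination(self):
        """'New Destination': restart planning and invalidate downstream results."""
        self.active = True
        for region in REGIONS:
            self.phase[region] = Phase.IDLE
        self.phase["path_planning"] = Phase.COMPUTING

    def solver_finished(self, region):
        """A solver returned: mark its region done and trigger the next region."""
        if not self.active or self.phase[region] is not Phase.COMPUTING:
            return  # stale result, for example the destination changed meanwhile
        self.phase[region] = Phase.DONE
        index = REGIONS.index(region)
        if index + 1 < len(REGIONS):
            self.phase[REGIONS[index + 1]] = Phase.COMPUTING

    def has_arrived(self):
        """'Has arrived': leave Active and reset every region."""
        self.active = False
        for region in REGIONS:
            self.phase[region] = Phase.IDLE


chart = PathfindingChart()
chart.new_destination()
chart.solver_finished("path_planning")   # Path Found -> target selection starts
print(chart.phase["target_selection"])
```

Since each region only advances when its own solver returns, the regions can complete on different frames.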

It is still compatible with the previous approach, but multiframe (no longer capped by Amdahl’s law)

[Figure: the four middleware queues of the framework (Path Request Queue, Target Request Queue, DA Request Queue, Steering Request Queue) and their tasks (SearchPath Task, SelectTarget Task, ComputeDA Task, ComputeSteering Task) are shown next to several copies of the pathfinding statechart (Path Planning, Target Selection, DA Target, Steering).]

But now we have 3 new problems

Problem 1 : How to guarantee that high frequency steering solvers return a value on time?

Problem 2 : How to deal with multiframe volatility and dynamicity of data?

Problem 3 : What computation triggering logic do we want?

Problem 1 is a scheduling problem for realtime systems

Problem 1 can be reworded as follows: “How to guarantee a deadline for each pathfinding solver request compatible with the frequency of the solver”

This is very close to the definition of realtime software as found on Wikipedia:

“In computer science, real-time computing (RTC), or "reactive computing", is the study of hardware and software systems that are subject to a "real-time constraint", i.e., operational deadlines from event to system response”

The good news is that there is a substantial body of literature on realtime scheduling!

To answer problem 1 we restate pathfinding solvers in a realtime formalism…

Realtime formalism: a task X is defined by 4 parameters
• X.s : starting time
• X.d : deadline
• X.e : execution requirement
• X.p : execution period

Adapting to pathfinding solvers:
• Need to assume all tasks are periodic:
  - Easy for smoothing, steering or DA solvers
  - More tricky for A* and other graph traversal solvers
• Need to have an estimate of each core solver job duration:
  - Again quite simple for smoothing, steering or DA solvers
  - Much less easy for A* and other graph traversal solvers -> need to decompose graph traversal tasks into subtasks of constant duration
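
The formalism above maps naturally onto a small record type. The sketch below is a minimal illustration, assuming time is measured in fixed scheduling quanta; the class, the helper names, and all numbers are hypothetical. The `fits` check uses the standard condition that the total weight (the sum of e / p) must not exceed the number of processors.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class PeriodicTask:
    name: str
    s: int  # starting time, in scheduling quanta
    d: int  # deadline relative to the start of each period, in quanta
    e: int  # execution requirement per period, in quanta
    p: int  # execution period, in quanta

    @property
    def weight(self) -> Fraction:
        """Share of one processor the task needs (e / p)."""
        return Fraction(self.e, self.p)


def split_into_subtasks(total_quanta: int, quanta_per_subtask: int) -> list[int]:
    """Cut a long graph traversal into subtasks of constant duration.

    The last subtask may be shorter when the total is not a multiple.
    """
    full, rest = divmod(total_quanta, quanta_per_subtask)
    return [quanta_per_subtask] * full + ([rest] if rest else [])


def fits(tasks: list[PeriodicTask], processors: int) -> bool:
    """Necessary condition: total weight must not exceed the processor count."""
    return sum(task.weight for task in tasks) <= processors


# Hypothetical workload for one character.
tasks = [
    PeriodicTask("steering", s=0, d=3, e=1, p=3),
    PeriodicTask("da", s=0, d=3, e=1, p=3),
    PeriodicTask("smoothing", s=0, d=15, e=2, p=15),
    PeriodicTask("astar", s=0, d=300, e=30, p=300),
]
print(sum(task.weight for task in tasks), fits(tasks, processors=1))
print(split_into_subtasks(30, 8))
```

Treating the A* job as a periodic task only works once it has been cut into bounded pieces, which is what `split_into_subtasks` stands in for.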

…and select a scheduling algorithm

P-fairness scheduling scheme (S.K. Baruah, N.K. Cohen, C.G. Plaxton, D.A. Varvel):
• Defines a notion of proportionate progress called P-fairness
• Uses it to define an efficient algorithm solving the periodic scheduling problem

Cache-aware P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino, U.M. Devi):
• Extends the P-fairness approach to avoid scheduling co-existent threads that would worsen the performance of shared caches

Task-grouping P-fair based scheduling scheme (J.H. Anderson, J.M. Calandrino):
• Extends the P-fairness approach to encourage grouping of tasks that share a common working set
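
To make proportionate progress concrete, the sketch below splits each task into unit-length subtasks with a release time and a pseudo-deadline, and runs the ready subtask with the earliest pseudo-deadline in each slot. It is a simplified earliest-pseudo-deadline-first rule, not the full tie-breaking rules of the published algorithms, and the task names and weights are hypothetical. The lag of a task at time t is (e / p) * t minus the quanta it has received; a schedule is P-fair when every lag stays strictly between -1 and 1.

```python
from fractions import Fraction
from math import ceil, floor


def subtask_window(e, p, i):
    """Release slot and pseudo-deadline of the i-th subtask (1-based) of a task of weight e/p."""
    weight = Fraction(e, p)
    return floor((i - 1) / weight), ceil(i / weight)


def epdf_schedule(tasks, processors, horizon):
    """tasks maps a name to (e, p). Returns, for each slot, the names that run in it."""
    done = {name: 0 for name in tasks}  # quanta allocated so far
    schedule = []
    for t in range(horizon):
        ready = []
        for name, (e, p) in tasks.items():
            release, deadline = subtask_window(e, p, done[name] + 1)
            if release <= t:
                ready.append((deadline, name))
        ready.sort()  # earliest pseudo-deadline first; ties broken by name
        chosen = [name for _, name in ready[:processors]]
        for name in chosen:
            done[name] += 1
        schedule.append(chosen)
    return schedule


def max_abs_lag(tasks, schedule):
    """Largest |lag| reached by any task at any slot boundary."""
    worst = Fraction(0)
    done = {name: 0 for name in tasks}
    for t, slot in enumerate(schedule, start=1):
        for name in slot:
            done[name] += 1
        for name, (e, p) in tasks.items():
            lag = Fraction(e, p) * t - done[name]
            worst = max(worst, abs(lag))
    return worst


# Hypothetical weights: steering 1/2, DA 1/3, smoothing 1/6 (total weight 1).
tasks = {"steering": (1, 2), "da": (1, 3), "smoothing": (1, 6)}
schedule = epdf_schedule(tasks, processors=1, horizon=6)
print(schedule, max_abs_lag(tasks, schedule))
```

The cache-aware and task-grouping variants listed above change which ready subtasks are chosen together in a slot; this sketch does not attempt that.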

Answering problem 2 (volatile data) requires a better description of memory models

Programming models differ in the way they manage memory space

Homogeneous models: unified memory

Heterogeneous models: Host / Device space

Today only homogeneous models offer transparent memory management

For heterogeneous models, the developer still has to do a lot of work

Programming models differ in the way they manage memory space

[Figure: Framework, Request Queue, Task, OpenCL Queue, and Compute Kernels placed across the Host Memory Space and the Device Memory Space.]

There is a need for a locking mechanism between the framework and the kernel

Data state: Framework Request / Task Request / Kernel Request / Kernel Execution / Task Update / Framework Update

Inserting Data: OK / OK / LOCK (if Kernel uses data) / OK / OK / OK

Data Ready: OK / OK / OK / OK / OK / OK

Data Locked: OK / OK / OK / LOCK (if Kernel accesses host memory) / OK / OK

Removing Data: OK / OK / LOCK (if Kernel uses data) / OK / OK / OK
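
The locking table can be encoded as a small lookup, which makes the three conditional cases explicit. This is a sketch of the table only, not of an actual locking implementation; the identifiers are hypothetical spellings of the row and column names.

```python
STAGES = (
    "framework_request", "task_request", "kernel_request",
    "kernel_execution", "task_update", "framework_update",
)
DATA_STATES = ("inserting_data", "data_ready", "data_locked", "removing_data")

# Only three (data state, stage) pairs may need a lock; each has a condition.
CONDITIONAL_LOCKS = {
    ("inserting_data", "kernel_request"): "kernel_uses_data",
    ("removing_data", "kernel_request"): "kernel_uses_data",
    ("data_locked", "kernel_execution"): "kernel_accesses_host_memory",
}


def needs_lock(data_state, stage, *, kernel_uses_data=False,
               kernel_accesses_host_memory=False):
    """Return True when the table says LOCK for this state, stage, and kernel."""
    if data_state not in DATA_STATES:
        raise ValueError(f"unknown data state: {data_state}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    condition = CONDITIONAL_LOCKS.get((data_state, stage))
    if condition is None:
        return False  # every other cell of the table is OK
    flags = {
        "kernel_uses_data": kernel_uses_data,
        "kernel_accesses_host_memory": kernel_accesses_host_memory,
    }
    return flags[condition]


print(needs_lock("inserting_data", "kernel_request", kernel_uses_data=True))
print(needs_lock("data_ready", "kernel_execution", kernel_uses_data=True))
```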

It also requires a better description of user data

There are 3 types of user data:

Read Only Memory (e.g. navmesh in a static world)
• Needs to be aware of when user data is available and when it is garbage

Read / Write Memory (e.g. navmesh in a dynamic world)
• Same as the Read Only approach, extended to secure the data modification stages

Work Memory (e.g. open & closed sets for an A* solver)
• Located where the solver is really called

Data Lifecycle States

Data Life cycle States are introduced to handle R/O and R/W data volatility and dynamicity

[Figure: data lifecycle state diagram.]

States: Notifying Data To Be Inserted, Data in Insertion, Data Ready, Data Locked, Data in Removal, Notifying Data Removed, End

Transitions: LOAD Notification, Ready for insertion, Data Inserted, On Dependency Insertion / Removal, Dependency Inserted / Removed, UNLOAD Notification, Data Removed

CRITICAL when data are not owned by middleware

Problem 3 (triggering logic) requires choosing between a pull and a push triggering mechanism

To limit computations over time, it is important to decide whether we want a pull or push triggering model
• In a push model, the system polls over all the characters to get a new steering policy
• In a pull model, the system gets update requirements from the game engine and only performs computations on the related characters

The pull model better controls the amount of computation, but it is not really compatible with a realtime approach

The push model makes it possible to optimize for cache use and task grouping
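
The difference between the two models, as defined above, can be shown on the work list each one produces. This is a minimal sketch: the character records, the `cell` field used for grouping, and the function names are assumptions made for illustration.

```python
from itertools import groupby


def push_batches(characters):
    """Push model: every character is visited on each update, so the whole
    work list is known up front and can be grouped, here by navmesh cell."""
    def cell_of(name):
        return characters[name]["cell"]

    ordered = sorted(characters, key=cell_of)
    return [list(group) for _, group in groupby(ordered, key=cell_of)]


def pull_worklist(characters, requested):
    """Pull model: only the characters named in the engine's update requests."""
    return [name for name in requested if name in characters]


# Hypothetical characters and the navmesh cell each one stands in.
characters = {"orc": {"cell": 2}, "elf": {"cell": 1}, "imp": {"cell": 2}}
print(push_batches(characters))
print(pull_worklist(characters, ["imp", "ghost"]))
```

Grouping by cell is one possible proxy for "tasks that share a common working set"; a real system would pick whatever key matches its data layout.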

Guidelines for a new parallel programming model for realtime AI

• Extends to the full AI the rationale described in the previous slides

• Data / Message Flow based system
• Realtime P-fair Scheduling algorithm
• Compatible with heterogeneous programming models
• Push Triggering Mechanism

Introducing the concept of Working Unit

• A WU receives requests to process
• A WU communicates with another WU ONLY through strongly typed requests
• Requests are explicitly exposed in the WU interface
• A request can be synchronous or asynchronous (2 different implementations of the request)
• A WU is responsible for the serialization Host<->Device of its context

[Figure: a Working Unit, split into Host Code and Device Code, with these parts: Owner / Children, Event Handler, Incoming Requests Queues, Context, Context Serializer, Requests Interface, Context Accessors.]
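
The Working Unit rules above can be illustrated with a toy unit. This is a sketch under stated assumptions, not the design from the talk: the request type, the class, the blocked-cell context, and the use of JSON as a stand-in for Host<->Device serialization are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class CanGoRequest:
    """Strongly typed request: can a character go from `start` to `goal`?"""
    start: tuple
    goal: tuple


class CanGoWU:
    """Toy Working Unit that exposes a single request type in its interface."""

    accepted_requests = (CanGoRequest,)

    def __init__(self, blocked_cells):
        self.context = {"blocked": sorted(blocked_cells)}
        self.incoming = deque()

    def post(self, request):
        """Asynchronous form: queue the request for a later process() call."""
        self._check(request)
        self.incoming.append(request)

    def call(self, request):
        """Synchronous form: answer immediately."""
        self._check(request)
        return self._handle(request)

    def process(self):
        """Drain the incoming queue and return (request, answer) pairs."""
        results = []
        while self.incoming:
            request = self.incoming.popleft()
            results.append((request, self._handle(request)))
        return results

    def serialize_context(self):
        """Stand-in for the Host<->Device context serializer."""
        return json.dumps(self.context)

    def _check(self, request):
        if not isinstance(request, self.accepted_requests):
            raise TypeError(f"unsupported request: {type(request).__name__}")

    def _handle(self, request):
        # Toy rule: the goal is reachable unless its cell is blocked.
        return request.goal not in self.context["blocked"]


wu = CanGoWU({(1, 1)})
print(wu.call(CanGoRequest((0, 0), (1, 1))))
wu.post(CanGoRequest((0, 0), (2, 2)))
print(wu.process())
```

Rejecting anything that is not a declared request type is what keeps communication between units limited to the exposed interface.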

The system works on a mixture of events and requests

[Figure: the Game Engine and the per-instance objects (Entity 1, 2, …; Brain 1, 2, …; PF 1, 2, …; World 1, …) exchange requests and events with Working Units, each with its own queue: Entity Update WU, Brain Update WU, Pathfinding WU, CanGo WU, IsVisible WU, and World Update WU. The Pathdata Mgr and the Geometry Mgr also appear in the diagram.]

The underlying architecture would rely on an event broadcaster and communicating components

[Figure: a Global Events Broadcaster above several Local Events Broadcasters; each local broadcaster serves four communicating components: SearchPath CC, SelectTarget CC, ComputeDA CC, and Steering CC.]

Communicating Component = Working Unit for parallelism

Open challenges

Customized Objects vs. Data / Services model

Interruptibility

Multi-platform

Scheduling algorithm performance

And many more…

Multiplatform

Too many programming languages!
• C++
• C for OpenCL
• C for CUDA
• C99 for SPURS
• HLSL 5 for DirectX
• …

Which standards will emerge?

Which standards will be chosen in future consoles?
