TRANSCRIPT
INSIDE NVIDIA'S AI INFRASTRUCTURE FOR SELF-DRIVING CARS
Clement Farabet | NVIDIA | GTC Europe
NVIDIA DRIVE: SOFTWARE-DEFINED CAR
Powerful and Efficient AI, CV, AR, HPC | Rich Software Development Platform
Functional Safety | Open Platform | 370+ partners developing on DRIVE
DRIVE AGX XAVIER | DRIVE AGX PEGASUS
[Diagram: the DRIVE software stack, with DRIVE AV, DRIVE IX, and DRIVE AR on DRIVE OS. DRIVE AV: surround perception (camera, RADAR, LIDAR), lanes/signs/lights, localization, egomotion, path perception, path planning. DRIVE IX: eye gaze, distracted driver, drowsy driver, cyclist alert, trunk opening. DRIVE AR: detect, track, CG.]
BUILDING AI FOR SDC IS HARD
Every neural net in our DRIVE software stack needs to handle 1000s of conditions and geolocations.
[Figure: condition matrix: hazards, animals, bicycles, pedestrians, vehicles; day, twilight, night, street lamps, backlit; clear, cloudy, rain, fog, snow.]
[Chart: target robustness per model (miles) vs. test dataset size required (miles) vs. NVIDIA's ongoing data collection (miles).]
DATA AND INFERENCE TO GET THERE?
[Chart: test dataset size growing from 30PB to 60PB, 120PB, and 180PB, with the DRIVE Pegasus nodes needed to run real-time tests over the full dataset in 24h growing from 400 to 1,600 to 3,200.]
WHAT TESTING SCALE ARE WE TALKING ABOUT?
We're on our way to 100s of PB of real test data (= millions of real miles, plus billions of simulated miles) and 1,000s of DRIVE Constellation nodes for offline testing alone. Today: 15PB.
SDC SCALE TODAY AT NVIDIA
- 1PB collected/week: a 12-camera + RADAR + LIDAR RIG mounted on 30 cars
- 4,000 GPUs in cluster = 500 PFLOPs
- 1,500 labelers, 20M objects labeled/mo
- 100 DRIVE Pegasus in cluster (Constellations)
- 15PB active training+test dataset
- 20 unique models, 50 labeling tasks
- 1PB of in-rack object cache per 72 GPUs, 30PB provisioned
Developing AI for industrial applications is bottlenecked by our ability to "develop" the right datasets, run the right experiments, instrument, and ultimately automate. If we solve these problems, then adding more GPUs can mean more rapid progress.
Enter Project MagLev
Project MagLev covers three pillars:
- Dataset Lifecycle Management
- ML Pipeline Automation
- Model Analytics & Insights
[Architecture diagram: the MagLev DRIVE layer (DRIVE Labeling [HumanLoop: data labeling + curation services, dataset export API], DRIVE Data [Hyperion data ingestion, indexing, transcoding], DRIVE Metrics [DRIVE DNN metrics and data coverage], DRIVE DL SDK [DL/Data SDK]) sits on the MagLev Core layer (MagLev Datasets [immutable, cached], MagLev ML Workflows [scheduling, automation], MagLev Experiment Service [experiments, models, metrics], MagLev Hyperopt), all on top of a DC Mgmt Layer.]
DATASET LIFECYCLE MANAGEMENT
Probably our most challenging problem.
Build and compose PB-scale datasets: recorded drives are parsed by DriveWorks and land in the data warehouse; DL apps then consume datasets through training pipelines that chain load/decode, augmentation, preprocessing, ground-truth rasterization, and batching into train.
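What that consumption side can look like in practice: a minimal sketch of such a Python-programmable pipeline built from plain generators. All function names and the fake frame payloads are illustrative, not the actual MagLev client API.

# Illustrative sketch only: the real MagLev client API is not public.
# This mimics the consumption side (load/decode -> augment -> batch
# -> train) as composable, lazy Python generators.
import random
from typing import Dict, Iterator, List

def load_decode(frame_ids: List[str]) -> Iterator[Dict]:
    """Pretend to fetch frames from the object cache and decode them."""
    for fid in frame_ids:
        yield {"id": fid, "image": [[0.0] * 4 for _ in range(4)], "labels": ["car"]}

def augment(frames: Iterator[Dict]) -> Iterator[Dict]:
    """Random horizontal flip, as a stand-in for real augmentation."""
    for f in frames:
        if random.random() < 0.5:
            f["image"] = [row[::-1] for row in f["image"]]
        yield f

def batch(frames: Iterator[Dict], size: int) -> Iterator[List[Dict]]:
    """Group frames into fixed-size training batches."""
    buf: List[Dict] = []
    for f in frames:
        buf.append(f)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

# Compose the pipeline lazily, so PB-scale datasets stream through
# without ever being materialized in memory.
pipeline = batch(augment(load_decode([f"frame-{i}" for i in range(10)])), size=4)
for b in pipeline:
    print(len(b), "frames in batch")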
On the curation side, HumanLoop provides labeling, data+DNN metrics are computed over the warehouse, and an export service feeds active learning loops back into DriveWorks collection. Storage spans cloud storage, meta-storage, and SSD caches (upload/store), and the warehouse is queryable from notebooks, a web-based SQL UI, and other clients via Presto, Spark, and Hive query engines, all behind a pure Python-programmable client.
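A sketch of what querying that warehouse from a notebook could look like, using PyHive's Presto client as a stand-in for whatever client MagLev wraps. Host, schema, table, and column names are invented for illustration.

# Sketch: querying the dataset warehouse through Presto from a notebook.
from pyhive import presto

conn = presto.connect(
    host="presto.example.internal",  # hypothetical warehouse endpoint
    port=8080,
    catalog="hive",
    schema="drive",  # hypothetical schema name
)
cur = conn.cursor()
# e.g., find night-time frames that still lack pedestrian labels,
# a typical query when curating data for a weak condition.
cur.execute("""
    SELECT frame_id, camera, recorded_at
    FROM frames
    WHERE time_of_day = 'night'
      AND NOT contains(label_classes, 'pedestrian')
    LIMIT 100
""")
for frame_id, camera, recorded_at in cur.fetchall():
    print(frame_id, camera, recorded_at)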
ML PIPELINE AUTOMATION
- Hybrid cluster deployment
- 100% Kubernetes-based
- Multi-node training via MPI over k8s
- 20+ microservices
- Underlying k8s arch shared with NGC
Flow between the service cluster (cloud) and the compute cluster (DGX @ NVIDIA):
1. Submit a workflow through the Maglev CLI/UI, behind the Access Management Service.
2. The Dataset Service handles dataset preparation and mounting.
3. The Workflow Service and workflow controller render and launch workflows.
4. The Maglev Operator and MPI Operator launch distributed training jobs via the K8s Controller (sketched after this list).
5. A K8s Informer listens for updates on workflows, jobs, and mpi-jobs, syncing workflow status and updating experiment details (artifacts, model version, hyperopt results, etc.) in the Experiment Service, backed by Postgres/ElasticSearch.
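MagLev's own operators are internal, but the open-source Kubeflow MPI Operator exposes the same pattern of MPI-over-k8s. A rough sketch of step 4, launching a multi-node training job as a kubeflow.org/v1 MPIJob via the Kubernetes Python client; the namespace, image, and job name are assumptions.

# Sketch: multi-node MPI training on Kubernetes via the Kubeflow
# MPI Operator CRD, as a stand-in for MagLev's internal operators.
from kubernetes import client, config

def pod_template(image: str, gpus: int, command=None) -> dict:
    """Minimal pod template for a launcher or worker replica."""
    container = {"name": "trainer", "image": image}
    if command:
        container["command"] = command
    if gpus:
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {"spec": {"containers": [container]}}

IMAGE = "nvcr.io/example/trainer:latest"  # placeholder image name

mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "drive-perception-train"},
    "spec": {
        "slotsPerWorker": 8,  # one MPI slot per GPU on an 8-GPU node
        "mpiReplicaSpecs": {
            # The launcher runs mpirun; the operator wires up host lists.
            "Launcher": {
                "replicas": 1,
                "template": pod_template(IMAGE, 0, ["mpirun", "python", "train.py"]),
            },
            # Four 8-GPU workers = a 32-GPU distributed training job.
            "Worker": {"replicas": 4, "template": pod_template(IMAGE, 8)},
        },
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="default", plural="mpijobs", body=mpijob,
)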
MAGLEV MODEL ANALYTICS & INSIGHTS
Rapidly retrieve experiment results, analyze them in a notebook, find the best models.
[Diagram: apps #1 ... #N live in a code repository; a Git+CI-based workflow launcher runs them on the 4,000-GPU cluster and on the NVIDIA DRIVE car; every run is recorded in a traced asset repository linking models, datasets, metrics, and code version.]
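A notebook-style sketch of "find the best models", assuming the experiment records have already been fetched from the experiment service; every field and run name here is hypothetical.

# Sketch: analyzing traced experiment results in a notebook.
import pandas as pd

# Imagine these records came back from an experiment-service query
# (hypothetical endpoint and fields).
records = [
    {"run": "exp-101", "model": "lanenet-v3", "code": "a1b2c3", "mAP": 0.81},
    {"run": "exp-102", "model": "lanenet-v3", "code": "d4e5f6", "mAP": 0.84},
    {"run": "exp-103", "model": "lanenet-v4", "code": "0718ab", "mAP": 0.79},
]
df = pd.DataFrame(records)

# Because every run is traced back to model, dataset, and code version,
# "find the best model" is a one-liner instead of an archaeology project.
best = df.sort_values("mAP", ascending=False).iloc[0]
print(best["run"], best["model"], best["code"], best["mAP"])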
MAGLEV: AUTOMATION & TRACEABILITY
Empower prod engineers to run or schedule complete workflows & version everything.
- ML Developer: develop applications, run/debug applications, analyze experiments/results, launch workflows manually.
- Production Engineer: optimize app perf, deploy prod applications, publish.
Both work against the cloud: Kubernetes over the 4,000-GPU cluster (= 480 PFLOPs).
HYBRID CLUSTER
"Collect ⇨ Select ⇨ Label ⇨ Train ⇨ Test" as programmatic workflows
Run multi-step workflows, where a workflow = a sequence of map jobs (sketched after this list):
- Collect: ingest 1PB per week into the data lake (15PB today).
- Select: data selection jobs #1, #2 ... #N carve out selected datasets.
- Label: 1,500 labelers work through the labeling UI to produce labeled datasets (20M objects labeled per month).
- Train: training jobs #1, #2 ... #N produce trained models; 20 models are actively developed by a large AI dev team.
- Test: testing jobs #1 ... #N emit metrics & logs, surfaced in the ML/metrics UI.
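The "sequence of map jobs" framing is concrete enough to sketch. Here is a toy version in plain Python; all stage and function names are invented for illustration, not MagLev's actual workflow API.

# Sketch: "Collect -> Select -> Label -> Train -> Test" as a workflow,
# where a workflow is just an ordered list of map jobs.
from typing import Callable, Iterable, List

Stage = Callable[[List[str]], List[str]]

def map_job(name: str, fn: Callable[[str], str]) -> Stage:
    """Wrap a per-item function as a named map job over a dataset."""
    def run(items: List[str]) -> List[str]:
        print(f"[{name}] mapping over {len(items)} items")
        return [fn(x) for x in items]
    return run

def run_workflow(stages: Iterable[Stage], items: List[str]) -> List[str]:
    """A workflow = a sequence of map jobs, each feeding the next."""
    for stage in stages:
        items = stage(items)
    return items

workflow = [
    map_job("select", lambda x: x + ":selected"),
    map_job("label",  lambda x: x + ":labeled"),
    map_job("train",  lambda x: x + ":trained"),
    map_job("test",   lambda x: x + ":tested"),
]
print(run_workflow(workflow, [f"clip-{i}" for i in range(3)]))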
MAGLEV DATA CENTER ARCHITECTURE
Each rack (35kW):
- 9 DGX-1 = 72 TESLA V100 GPUs = 9 PFLOPs
- 12 CPU nodes for services & data management
- 1.2PB per rack of cache can front object storage
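The PFLOPs figures can be sanity-checked, assuming the slides count ~125 TFLOPS of FP16 Tensor Core throughput per Tesla V100 and 8 GPUs per DGX-1 (both published specs, though which throughput figure the slides use is an assumption):

# Quick arithmetic check of the rack and cluster PFLOPs claims.
V100_TFLOPS = 125            # assumed: peak FP16 Tensor Core TFLOPS per V100
gpus_per_rack = 9 * 8        # 9 DGX-1 per rack x 8 V100 each = 72 GPUs
print(gpus_per_rack * V100_TFLOPS / 1000, "PFLOPs per rack")   # 9.0
print(4000 * V100_TFLOPS / 1000, "PFLOPs across 4,000 GPUs")   # 500.0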
[Diagram: the MagLev platform runs Kubernetes across many identical 35kW racks (9 DGX-1 + 12 CPU nodes each), fronting both on-premise and cloud-provider object storage; 1PB ingested per week, 15PB today.]
THANK YOU
twitter.com/nvidiaAI | twitter.com/clmt