TRANSCRIPT
INSIDE NVIDIA'S AI INFRASTRUCTURE FOR SELF-DRIVING CARS
Clement Farabet | NVIDIA | GTC Europe
NVIDIA DRIVE: SOFTWARE-DEFINED CAR
Powerful and Efficient AI, CV, AR, HPC | Rich Software Development Platform
Functional Safety | Open Platform | 370+ partners developing on DRIVE
DRIVE AGX XAVIER | DRIVE AGX PEGASUS
[Diagram: the DRIVE software stack, with DRIVE AV, DRIVE IX, and DRIVE AR on DRIVE OS. DRIVE AV: surround perception (camera, RADAR, LIDAR), lanes/signs/lights, localization, egomotion, path perception, path planning. DRIVE IX: eye gaze, distracted driver, drowsy driver, cyclist alert, trunk opening. DRIVE AR: detect, track, CG.]
BUILDING AI FOR SDC IS HARD
Every neural net in our DRIVE software stack needs to handle 1000s of conditions and geolocations.
[Figure: condition matrix: hazards, animals, bicycles, pedestrians, vehicles; day, twilight, night, street lamps, backlit; clear, cloudy, rain, fog, snow.]
[Chart: target robustness per model (miles) vs. test dataset size required (miles) vs. NVIDIA's ongoing data collection (miles).]
DATA AND INFERENCE TO GET THERE?
[Chart: test dataset size growing from 30PB to 60PB, 120PB, and 180PB, with the DRIVE Pegasus nodes needed to run real-time tests over the full dataset in 24h growing from 400 to 1,600 to 3,200.]
WHAT TESTING SCALE ARE WE TALKING ABOUT?
We're on our way to 100s of PB of real test data (= millions of real miles, plus billions of simulated miles) and 1,000s of DRIVE Constellation nodes for offline testing alone. Today: 15PB.
SDC SCALE TODAY AT NVIDIA
- 1PB collected/week: a 12-camera + RADAR + LIDAR RIG mounted on 30 cars
- 4,000 GPUs in cluster = 500 PFLOPs
- 1,500 labelers, 20M objects labeled/mo
- 100 DRIVE Pegasus in cluster (Constellations)
- 15PB active training+test dataset
- 20 unique models, 50 labeling tasks
- 1PB of in-rack object cache per 72 GPUs, 30PB provisioned
Developing AI for industrial applications is bottlenecked by our ability to "develop" the right datasets, run the right experiments, instrument, and ultimately automate. If we solve these problems, then adding more GPUs can mean more rapid progress.
Enter Project MagLev
Project MagLev covers three pillars:
- Dataset Lifecycle Management
- ML Pipeline Automation
- Model Analytics & Insights
[Architecture diagram: the MagLev DRIVE layer (DRIVE Labeling [HumanLoop: data labeling + curation services, dataset export API], DRIVE Data [Hyperion data ingestion, indexing, transcoding], DRIVE Metrics [DRIVE DNN metrics and data coverage], DRIVE DL SDK [DL/Data SDK]) sits on the MagLev Core layer (MagLev Datasets [immutable, cached], MagLev ML Workflows [scheduling, automation], MagLev Experiment Service [experiments, models, metrics], MagLev Hyperopt), all on top of a DC Mgmt Layer.]
DATASET LIFECYCLE MANAGEMENT
Probably our most challenging problem.
Build and compose PB-scale datasets: recorded drives are parsed by DriveWorks and land in the data warehouse; DL apps then consume datasets through training pipelines that chain load/decode, augmentation, preprocessing, ground-truth rasterization, and batching into train.
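What that consumption side can look like in practice: a minimal sketch of such a Python-programmable pipeline built from plain generators. All function names and the fake frame payloads are illustrative, not the actual MagLev client API.

# Illustrative sketch only: the real MagLev client API is not public.
# This mimics the consumption side (load/decode -> augment -> batch
# -> train) as composable, lazy Python generators.
import random
from typing import Dict, Iterator, List

def load_decode(frame_ids: List[str]) -> Iterator[Dict]:
    """Pretend to fetch frames from the object cache and decode them."""
    for fid in frame_ids:
        yield {"id": fid, "image": [[0.0] * 4 for _ in range(4)], "labels": ["car"]}

def augment(frames: Iterator[Dict]) -> Iterator[Dict]:
    """Random horizontal flip, as a stand-in for real augmentation."""
    for f in frames:
        if random.random() < 0.5:
            f["image"] = [row[::-1] for row in f["image"]]
        yield f

def batch(frames: Iterator[Dict], size: int) -> Iterator[List[Dict]]:
    """Group frames into fixed-size training batches."""
    buf: List[Dict] = []
    for f in frames:
        buf.append(f)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

# Compose the pipeline lazily, so PB-scale datasets stream through
# without ever being materialized in memory.
pipeline = batch(augment(load_decode([f"frame-{i}" for i in range(10)])), size=4)
for b in pipeline:
    print(len(b), "frames in batch")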
On the curation side, HumanLoop provides labeling, data+DNN metrics are computed over the warehouse, and an export service feeds active learning loops back into DriveWorks collection. Storage spans cloud storage, meta-storage, and SSD caches (upload/store), and the warehouse is queryable from notebooks, a web-based SQL UI, and other clients via Presto, Spark, and Hive query engines, all behind a pure Python-programmable client.
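A sketch of what querying that warehouse from a notebook could look like, using PyHive's Presto client as a stand-in for whatever client MagLev wraps. Host, schema, table, and column names are invented for illustration.

# Sketch: querying the dataset warehouse through Presto from a notebook.
from pyhive import presto

conn = presto.connect(
    host="presto.example.internal",  # hypothetical warehouse endpoint
    port=8080,
    catalog="hive",
    schema="drive",  # hypothetical schema name
)
cur = conn.cursor()
# e.g., find night-time frames that still lack pedestrian labels,
# a typical query when curating data for a weak condition.
cur.execute("""
    SELECT frame_id, camera, recorded_at
    FROM frames
    WHERE time_of_day = 'night'
      AND NOT contains(label_classes, 'pedestrian')
    LIMIT 100
""")
for frame_id, camera, recorded_at in cur.fetchall():
    print(frame_id, camera, recorded_at)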
ML PIPELINE AUTOMATION
- Hybrid cluster deployment
- 100% Kubernetes-based
- Multi-node training via MPI over k8s
- 20+ microservices
- Underlying k8s arch shared with NGC
Flow between the service cluster (cloud) and the compute cluster (DGX @ NVIDIA):
1. Submit a workflow through the Maglev CLI/UI, behind the Access Management Service.
2. The Dataset Service handles dataset preparation and mounting.
3. The Workflow Service and workflow controller render and launch workflows.
4. The Maglev Operator and MPI Operator launch distributed training jobs via the K8s Controller (sketched after this list).
5. A K8s Informer listens for updates on workflows, jobs, and mpi-jobs, syncing workflow status and updating experiment details (artifacts, model version, hyperopt results, etc.) in the Experiment Service, backed by Postgres/ElasticSearch.
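MagLev's own operators are internal, but the open-source Kubeflow MPI Operator exposes the same pattern of MPI-over-k8s. A rough sketch of step 4, launching a multi-node training job as a kubeflow.org/v1 MPIJob via the Kubernetes Python client; the namespace, image, and job name are assumptions.

# Sketch: multi-node MPI training on Kubernetes via the Kubeflow
# MPI Operator CRD, as a stand-in for MagLev's internal operators.
from kubernetes import client, config

def pod_template(image: str, gpus: int, command=None) -> dict:
    """Minimal pod template for a launcher or worker replica."""
    container = {"name": "trainer", "image": image}
    if command:
        container["command"] = command
    if gpus:
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {"spec": {"containers": [container]}}

IMAGE = "nvcr.io/example/trainer:latest"  # placeholder image name

mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "drive-perception-train"},
    "spec": {
        "slotsPerWorker": 8,  # one MPI slot per GPU on an 8-GPU node
        "mpiReplicaSpecs": {
            # The launcher runs mpirun; the operator wires up host lists.
            "Launcher": {
                "replicas": 1,
                "template": pod_template(IMAGE, 0, ["mpirun", "python", "train.py"]),
            },
            # Four 8-GPU workers = a 32-GPU distributed training job.
            "Worker": {"replicas": 4, "template": pod_template(IMAGE, 8)},
        },
    },
}

config.load_kube_config()  # or load_incluster_config() inside the cluster
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org", version="v1",
    namespace="default", plural="mpijobs", body=mpijob,
)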
MAGLEV MODEL ANALYTICS & INSIGHTS
Rapidly retrieve experiment results, analyze them in a notebook, find the best models.
[Diagram: apps #1 ... #N live in a code repository; a Git+CI-based workflow launcher runs them on the 4,000-GPU cluster and on the NVIDIA DRIVE car; every run is recorded in a traced asset repository linking models, datasets, metrics, and code version.]
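A notebook-style sketch of "find the best models", assuming the experiment records have already been fetched from the experiment service; every field and run name here is hypothetical.

# Sketch: analyzing traced experiment results in a notebook.
import pandas as pd

# Imagine these records came back from an experiment-service query
# (hypothetical endpoint and fields).
records = [
    {"run": "exp-101", "model": "lanenet-v3", "code": "a1b2c3", "mAP": 0.81},
    {"run": "exp-102", "model": "lanenet-v3", "code": "d4e5f6", "mAP": 0.84},
    {"run": "exp-103", "model": "lanenet-v4", "code": "0718ab", "mAP": 0.79},
]
df = pd.DataFrame(records)

# Because every run is traced back to model, dataset, and code version,
# "find the best model" is a one-liner instead of an archaeology project.
best = df.sort_values("mAP", ascending=False).iloc[0]
print(best["run"], best["model"], best["code"], best["mAP"])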
MAGLEV: AUTOMATION & TRACEABILITY
Empower prod engineers to run or schedule complete workflows & version everything.
- ML Developer: develop applications, run/debug applications, analyze experiments/results, launch workflows manually.
- Production Engineer: optimize app perf, deploy prod applications, publish.
Both work against the cloud: Kubernetes over the 4,000-GPU cluster (= 480 PFLOPs).
HYBRID CLUSTER
"Collect ⇨ Select ⇨ Label ⇨ Train ⇨ Test" as programmatic workflows
Run multi-step workflows, where a workflow = a sequence of map jobs (sketched after this list):
- Collect: ingest 1PB per week into the data lake (15PB today).
- Select: data selection jobs #1, #2 ... #N carve out selected datasets.
- Label: 1,500 labelers work through the labeling UI to produce labeled datasets (20M objects labeled per month).
- Train: training jobs #1, #2 ... #N produce trained models; 20 models are actively developed by a large AI dev team.
- Test: testing jobs #1 ... #N emit metrics & logs, surfaced in the ML/metrics UI.
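The "sequence of map jobs" framing is concrete enough to sketch. Here is a toy version in plain Python; all stage and function names are invented for illustration, not MagLev's actual workflow API.

# Sketch: "Collect -> Select -> Label -> Train -> Test" as a workflow,
# where a workflow is just an ordered list of map jobs.
from typing import Callable, Iterable, List

Stage = Callable[[List[str]], List[str]]

def map_job(name: str, fn: Callable[[str], str]) -> Stage:
    """Wrap a per-item function as a named map job over a dataset."""
    def run(items: List[str]) -> List[str]:
        print(f"[{name}] mapping over {len(items)} items")
        return [fn(x) for x in items]
    return run

def run_workflow(stages: Iterable[Stage], items: List[str]) -> List[str]:
    """A workflow = a sequence of map jobs, each feeding the next."""
    for stage in stages:
        items = stage(items)
    return items

workflow = [
    map_job("select", lambda x: x + ":selected"),
    map_job("label",  lambda x: x + ":labeled"),
    map_job("train",  lambda x: x + ":trained"),
    map_job("test",   lambda x: x + ":tested"),
]
print(run_workflow(workflow, [f"clip-{i}" for i in range(3)]))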
MAGLEV DATA CENTER ARCHITECTURE
Each rack (35kW):
- 9 DGX-1 = 72 TESLA V100 GPUs = 9 PFLOPs
- 12 CPU nodes for services & data management
- 1.2PB per rack of cache can front object storage
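The PFLOPs figures can be sanity-checked, assuming the slides count ~125 TFLOPS of FP16 Tensor Core throughput per Tesla V100 and 8 GPUs per DGX-1 (both published specs, though which throughput figure the slides use is an assumption):

# Quick arithmetic check of the rack and cluster PFLOPs claims.
V100_TFLOPS = 125            # assumed: peak FP16 Tensor Core TFLOPS per V100
gpus_per_rack = 9 * 8        # 9 DGX-1 per rack x 8 V100 each = 72 GPUs
print(gpus_per_rack * V100_TFLOPS / 1000, "PFLOPs per rack")   # 9.0
print(4000 * V100_TFLOPS / 1000, "PFLOPs across 4,000 GPUs")   # 500.0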
[Diagram: the MagLev platform runs Kubernetes across many identical 35kW racks (9 DGX-1 + 12 CPU nodes each), fronting both on-premise and cloud-provider object storage; 1PB ingested per week, 15PB today.]
THANK YOU
twitter.com/nvidiaAI | twitter.com/clmt