Missing the Forest for the Trees: End-to-End AI Application Performance in Edge Data Centers
Daniel Richins1,2, Dharmisha Doshi2, Matthew Blackmore2, Aswathy Thulaseedharan Nair2, Neha Pathapati2, Ankit Patel2, Brainard Daguman2, Daniel Dobrijalowski2, Ramesh Illikkal2, Kevin Long2, David Zimmerman2, Vijay Janapa Reddi1,3
1The University of Texas at Austin 2Intel 3Harvard University
International Symposium on High Performance Computer Architecture 25 February 2020
Missing the Forest for the Trees: The AI Tax

The forest is the AI tax.
The AI tax includes all the compute and infrastructure in an AI application that is necessary to enable the AI to execute but that isn't AI itself.

[Figure: excitement over time for artificial intelligence, with the AI itself flanked by pre- and post-processing]
Outline
1. AI Tax - A Case Study
   a. Definition
   b. Video Analytics
   c. Analysis
2. AI Acceleration - Anticipating Future Bottlenecks
   a. Emulation Technique
   b. Results
   c. What's Breaking?
3. Optimization - Better Performance at Lower TCO
   a. Fixing the Bottleneck
   b. Edge Data Centers
   c. Two Designs
4. Conclusion
AI Tax
AI Tax - Definition
Supporting compute, storage, network, software infrastructure, etc. together constitute the AI tax.
AI Tax - Video Analytics
Face Recognition Algorithm

Face Recognition is Google's FaceNet as a data center application.

Pipeline: Video Stream → Ingestion → Frame → Face Detection → Face Thumbnail → Feature Extraction → Vector → Classification → Identity → User Application

The AI compute comprises Face Detection, Feature Extraction, and Classification.
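The pipeline above can be sketched as a chain of stage functions. This is an illustrative stand-in, not the authors' FaceNet deployment; every function body here is a stub, and all names are hypothetical.

```python
# Illustrative sketch of the face recognition pipeline.
# Each stage is a stub that tags its input so the data flow is visible.

def ingest(stream):
    # Ingestion: decode the video stream into frames (stub: already frames).
    yield from stream

def detect_faces(frame):
    # AI compute: find face thumbnails in a frame (stub: one face per frame).
    return [f"thumb({frame})"]

def extract_features(thumb):
    # AI compute: embed a face thumbnail into a feature vector (stub).
    return f"vec({thumb})"

def classify(vector):
    # AI compute: map a feature vector to an identity (stub).
    return f"id({vector})"

def recognize(video_stream):
    # Full chain: Frame -> Thumbnail -> Vector -> Identity.
    for frame in ingest(video_stream):
        for thumb in detect_faces(frame):
            yield classify(extract_features(thumb))

print(list(recognize(["frame0", "frame1"])))
# ['id(vec(thumb(frame0)))', 'id(vec(thumb(frame1)))']
```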
Face Recognition Data Center Deployment

For deployment, Ingestion and Face Detection are combined into an Ingest/Detect stage, and Feature Extraction and Classification into an Identification stage. Ingest/Detect instances act as producers and Identification instances as consumers, with brokers mediating between them; identities flow back to the user application.
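The producer/broker/consumer split can be sketched with a thread-safe queue standing in for the broker tier. This is a minimal stand-in: the real deployment uses dedicated broker servers and a messaging system, and the stage logic here is stubbed.

```python
# Minimal producer/broker/consumer sketch of the deployment above.
import queue
import threading

broker = queue.Queue()  # stands in for the broker servers

def ingest_detect(frames):
    # Producer: ingest frames, detect faces, publish thumbnails.
    for frame in frames:
        broker.put(f"thumb({frame})")
    broker.put(None)  # sentinel: stream finished

def identification(results):
    # Consumer: pull thumbnails, run feature extraction + classification.
    while (thumb := broker.get()) is not None:
        results.append(f"id({thumb})")

results = []
producer = threading.Thread(target=ingest_detect, args=(["f0", "f1"],))
consumer = threading.Thread(target=identification, args=(results,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)  # identities returned to the user application
```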
Experimental Setup: Hardware

• 2x Intel Xeon Platinum 8176 (2x 28 cores, 2.10 GHz, 2x 38.5 MB LLC)
• 384 GB DDR4 SDRAM
• 1x Intel SSD P4510 (2.85 GB/s read, 1.10 GB/s write)
• 100 Gbps Ethernet
Experimental Setup: Face Recognition

We allocate one core per container. Hence, a server runs 56 containers. In total there are 840 producers (Ingest/Detect) and 1680 consumers (Identification). Brokers get their own server, which grants them full network and storage bandwidth; there are 3 brokers.
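As a back-of-envelope check (assuming the stated one core per container, hence 56 containers per two-socket server), the container counts imply 15 producer servers and 30 consumer servers:

```python
# Back-of-envelope check of the setup above: one core per container,
# so a 2x 28-core server runs 56 containers.
cores_per_server = 2 * 28
containers_per_server = cores_per_server

producers, consumers = 840, 1680
producer_servers = producers // containers_per_server
consumer_servers = consumers // containers_per_server
print(producer_servers, consumer_servers)  # 15 30
```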
AI Tax - Analysis
Application Progress Event Logging

    while True:
        frame = queue.get()
        start = time.time()
        faces = detect_faces(frame)
        end = time.time()
        producer.send(faces)
        size = sys.getsizeof(faces)
        log = {'start': start, 'end': end, 'size': size}
        logger.info(log)

Logging is designed to raise the level of abstraction. We view application progress from the data center perspective.
Face Detection Latency

[Figure: end-to-end latency breakdown — Ingestion 5.4% (AI tax), Detection 21.3% (AI compute), Brokers 35.9% (AI tax), Identification 37.4% (AI compute)]
Process Breakdowns

[Figure: per-process split — Ingestion: 100% AI tax; Face Detection: 58% AI, 42% AI tax; Identification: 12% AI, 88% AI tax]

Pre- and post-processing are heavily utilized within stages.
AI Tax

[Figure: excitement over time — the excitement tracks the AI, but each AI stage is flanked by pre- and post-processing]
AI Acceleration
AI Acceleration - Emulation Technique
Acceleration Emulation

[Diagram: the deployment's Ingest/Detect and Identification stages, connected by brokers]

Dial an accelerator speed.
Starting from the instrumented loop, we replace the AI computation with a sleep of the measured average duration, and the real output with random bytes of the measured average size:

    while True:
        frame = queue.get()
        start = time.time()
        time.sleep(avg_time)                 # replaces faces = detect_faces(frame)
        end = time.time()
        producer.send(os.urandom(avg_size))  # replaces producer.send(faces)
        size = avg_size
        log = {'start': start, 'end': end, 'size': size}
        logger.info(log)

To dial in an accelerator speedup, the sleep becomes time.sleep(avg_time / speedup).

With faster processing, we feed frames into the system faster to maximize throughput.
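A self-contained version of the emulation loop looks like the following. The names avg_time, avg_size, and speedup follow the slide; the frame source and output sink are stand-ins for the real ingest queue and message producer.

```python
# Self-contained sketch of the acceleration emulation: replace real AI
# work with a sleep scaled by the dialed speedup, and emit dummy payloads
# of the average output size.
import os
import time

def emulated_stage(frames, avg_time, avg_size, speedup, sink):
    logs = []
    for _ in frames:
        start = time.time()
        time.sleep(avg_time / speedup)      # emulated accelerator latency
        end = time.time()
        sink.append(os.urandom(avg_size))   # dummy output of average size
        logs.append({"start": start, "end": end, "size": avg_size})
    return logs

sink = []
logs = emulated_stage(range(3), avg_time=0.02, avg_size=128, speedup=4, sink=sink)
print(len(sink), logs[0]["size"])  # 3 128
```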
AI Acceleration - Results
Accelerated AI: Reduced Latency and Increased Throughput

[Figure: per-stage latency (Ingest/Detect, Broker, Identify; 0-700 ms) and overall throughput (0-70 thousand frames per second) at 1x, 2x, 4x, 6x, and 8x emulated AI speedup]
At 8x speedup, the average latency goes to infinity. The longer the experiment runs, the greater the latency.
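The unbounded latency is classic queueing behavior: once frames arrive faster than a stage can service them, waiting time grows with the length of the run. A toy FIFO model (rates illustrative, not measured) shows the effect:

```python
# Toy single-server FIFO queue: frame i arrives at i * arrival_interval
# and each frame takes service_time to process.
def avg_wait(arrival_interval, service_time, n_frames):
    free_at = 0.0       # time the server next becomes free
    total_wait = 0.0
    for i in range(n_frames):
        arrive = i * arrival_interval
        begin = max(arrive, free_at)     # wait if the server is busy
        free_at = begin + service_time
        total_wait += free_at - arrive   # queueing delay + service time
    return total_wait / n_frames

# Service keeps up: average latency stays bounded at the service time.
print(avg_wait(1.0, 0.9, 10_000))
# Service cannot keep up: backlog grows, so average latency grows with
# the number of frames processed.
print(avg_wait(1.0, 1.1, 10_000))
```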
AI Acceleration - What's Breaking?
Three Big Systems

Compute, Network, Storage. With the AI compute accelerated, is the network or the storage the next bottleneck?
Explaining the Bottleneck

[Figure: broker network utilization (read and write, 0-7%) stays low across 1x-8x speedup, while broker storage utilization climbs toward 70%]
As storage utilization approaches the limits of the devices, it becomes the limiting factor to performance.
Optimization
Optimization - Fixing the Bottleneck
Fixing the Bottleneck

[Figure: latency (0-200 ms) at 8x, 12x, 16x, 24x, and 32x emulated speedup, with additional drives (1-4) and with additional brokers (3, 4, 6, 8)]
Optimization - Edge Data Centers
Advantages of an Edge Data Center

• Smaller corporations are finding edge data centers more economical than the cloud.
• Edge data centers offer lower latency by serving local users.
• Edge data centers can be built to target a specific application domain.

Sources:
https://www.networkworld.com/article/2926448/7-key-criteria-for-defining-edge-data-centers.html
http://blog.cushwake.com/americas/life-on-the-edge-the-new-normal-for-data-centers.html
https://www.vxchnge.com/blog/what-is-an-edge-data-center
Optimization - Two Designs
Edge Data Center Node Allocation

We need to allocate enough brokers to handle 32x speedup. The required ratio is roughly 30 consumers and 15 producers per 8 brokers; at full scale this is 578 consumers, 289 producers, and 157 brokers.
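A quick check (using the slide's numbers) confirms that the full-scale allocation preserves the small-scale ratio:

```python
# Compare the small-scale ratio (30 consumers : 15 producers : 8 brokers)
# against the full-scale node counts.
small = {"consumers": 30, "producers": 15, "brokers": 8}
total = {"consumers": 578, "producers": 289, "brokers": 157}

ratios = {k: total[k] / small[k] for k in small}
print(ratios)  # each role scales up by roughly 19-20x
```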
Targeted Data Center Design: Optimizing for Total Cost of Ownership

Homogeneous: every node is identical (56 cores, 1 drive, 100 GbE); the design requires 160 switches.
Heterogeneous: dedicated Compute and Broker node types; the design requires only 28+14 switches.
Heterogeneous Edge Data Center

[Diagram: Compute Nodes connect at 10 GbE; Broker Nodes connect at 50 GbE]
Heterogeneous Edge Data Center: Networking

[Diagram: 100 Gbps switches joined by 100 Gbps uplinks and 40 Gbps links; Broker Nodes attach at 50 Gbps, Compute Nodes at 10 Gbps]
Comparing Total Cost of Ownership

We assume a three-year amortization of costs.

[Chart: heterogeneous cost relative to the homogeneous design — Compute: 89% vs. 100%; Networking: 23% vs. 100%; Power: 100% vs. 100%; Overall: 84% vs. 100%]

Homogeneous: $12.9 million. Heterogeneous: $10.8 million.

The targeted, heterogeneous data center incurs 16% lower total cost of ownership. By designing the data center to match the needs of the application, we created a better data center at lower cost.
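The headline saving follows directly from the two totals (slide figures, three-year amortization):

```python
# Overall TCO saving of the heterogeneous design, from the slide's totals.
homogeneous_tco = 12.9e6    # dollars over three years
heterogeneous_tco = 10.8e6

saving = 1 - heterogeneous_tco / homogeneous_tco
print(f"{saving:.0%}")  # 16%
```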
Conclusion
Calls to Action

• To fully understand AI applications, we must consider the overhead of the AI tax in end-to-end performance.
• As we accelerate AI, we must consider new bottlenecks that manifest as AI tax.
• We cannot limit our view of AI to microarchitectural considerations. We need data center-level optimizations to address data center-level bottlenecks.
Thank You