

AI Tax in Mobile SoCs: End-to-end Performance Analysis of Machine Learning in Smartphones

Michael Buch 1, Zahra Azad 2, Ajay Joshi 2, and Vijay Janapa Reddi 1

1 Harvard University, 2 Boston University

{mbuch, vjreddi}@g.harvard.edu, {zazad, joshi}@bu.edu

Abstract—Mobile software is becoming increasingly feature rich, commonly being accessorized with the powerful decision making capabilities of machine learning (ML). To keep up with the consequently higher power and performance demands, system and hardware architects add specialized hardware units onto their system-on-chips (SoCs) coupled with frameworks to delegate compute optimally. While these SoC innovations are rapidly improving ML model performance and power efficiency, auxiliary data processing and supporting infrastructure to enable ML model execution can substantially alter the performance profile of a system. This work posits the existence of an AI tax, the time spent on non-model execution tasks. We characterize the execution pipeline of open source ML benchmarks and Android applications in terms of AI tax and discuss where performance bottlenecks may unexpectedly arise.

Index Terms—mobile systems, Android, system-on-chip, hardware acceleration, machine learning, workload characterization.

I. INTRODUCTION

With the advances in the compute and storage capacity of systems, there has been a surge of artificial intelligence (AI) in mobile applications. Many of today's mobile applications use AI-driven components like super resolution [1], face recognition [2], image segmentation [3], virtual assistants [4], speech transcription [5], sentiment analysis [6], gesture recognition [7], etc. To make an application accessible to the broader audience, AI needs to be routinely deployed on mobile systems, which presents the following three key challenges:

First, AI processing on general-purpose mobile processors is inefficient in terms of energy and power, and so we need to use custom hardware accelerators. Commonly used AI hardware accelerators include traditional GPUs and DSPs as well as custom AI processing engines such as Google's Coral [8], MediaTek's Dimensity [9], Samsung's Neural Processing Unit [10], and Qualcomm's AIP [11]. While these AI engines offer hardware acceleration, depending on how they are integrated into the wider system, they can potentially introduce offloading overheads that are often not accounted for or studied in AI system performance analyses.

Second, to manage the underlying AI hardware, SoC vendors rely on AI software frameworks. Examples include Google's Android Neural Network API (NNAPI) [12], Qualcomm's Snapdragon Neural Processing Engine (SNPE) [13], and MediaTek's NeuroPilot [14]. These frameworks manage ML model execution on heterogeneous hardware and decide on the application run-time scheduling plan. While this reduces development effort, abstracting model preparation and offload can make performance predictability difficult to reason about, as the framework's run-time behavior varies depending on system configuration, hardware support, model details, and the end-to-end application pipeline.

Fig. 1: Taxonomy of overheads for AI processing, which in this paper we expand upon to highlight common pitfalls when assessing AI performance on mobile chipsets.

Third, to enable the execution of a machine learning (ML) model, an application typically requires a host of additional algorithmic processing. Examples of such algorithmic processing include retrieval of data from sensors, pre-processing of the inputs, initialization of ML frameworks, and post-processing of results. While every ML application and its model(s) must go through these necessary processing stages, such overheads are often discounted from performance analysis.

In this paper, we explain the individual stages of execution of open-source ML benchmarks and Android applications on state-of-the-art mobile SoC chipsets [15]–[18] to quantify the performance penalties paid in each stage. We call the combined end-to-end latency of the ML execution pipeline the "AI tax". Figure 1 shows the high-level breakdown of an application's end-to-end performance and how it can be split into AI tax and AI model execution time. The AI tax component has three categories that introduce overheads: algorithms, frameworks, and hardware. Each of these categories has its own sources of overhead. For instance, algorithmic processing involves the runtime overhead associated with capturing the data, pre-processing it for the neural network, and running post-processing code that often generates the end result. Frameworks have driver code that coordinates scheduling and optimizations. Finally, the CPU, GPU, or accelerator itself can suffer from offloading costs, run-to-run variability, and lag due to multi-tenancy (i.e., running concurrent models together).

We use AI tax to point out the potential algorithmic bottlenecks and pitfalls of delegation frameworks that typical inference-only workload analysis on hardware would not convey. For example, a crucial shortcoming of several existing works, including industry-standard benchmarks such as MLPerf [19] and AI Benchmark [20], is that they overemphasize ML inference performance (i.e., the AI Model in Figure 1). In doing so, they ignore the rest of the processing pipeline overheads (i.e., the AI tax portion in Figure 1), thereby "missing the forest for the trees". ML inference-only workload performance analyses commonly do not capture the non-model execution overheads, either because they are interested only in pure inference performance or because the ad-hoc nature of these overheads makes it unclear how to model them in a standardized fashion. However, since the high-level application execution pipeline varies only marginally between ML models and frameworks, many ML mobile applications follow a similar structure; patterns of where bottlenecks form could thus be captured systematically, and that could help us mitigate inefficiencies.

To that end, we demonstrate the importance of end-to-end analysis of the execution pipeline of real ML workloads and quantify the contribution of AI tax to machine learning application time. We examine this from the perspective of algorithmic processing, framework inefficiencies and overheads, and hardware capabilities. Given that data capture and processing times can be a significant portion of end-to-end AI application performance, these overheads are essential to quantify when assessing mobile AI performance. Moreover, it is necessary to consider jointly accelerating these seemingly mundane yet important data processing tasks along with ML execution. We also present an in-depth study of ML offload frameworks and demonstrate how they can be effectively coupled with a measure of AI tax to aid with efficient hardware delegation decisions. Finally, application performance can vary substantially based on the underlying hardware support, but this is not always transparent to the end user. Hence, this is yet another reason that measuring end-to-end AI application performance (i.e., including the software framework) is crucial, instead of just measuring hardware performance in isolation.

In summary, our contributions are as follows:

1) It is important to consider end-to-end performance analysis when quantifying AI performance in mobile chipsets. Specifically, the current state-of-the-art benchmarks overly focus on model execution and miss the system-level implications crucial for AI usability.

2) The data capture and the AI pre- and post-processing stages are just as crucial for AI performance as the core ML computational kernels because they can consume as much as 50% of the actual execution time.

3) When conducting AI performance analysis, it is essential to consider how well the ML software framework is optimized for the underlying chipset. Not all frameworks support all chipsets and models well.

4) AI hardware performance is susceptible to cold start penalties, run-to-run variation, and multi-tenancy overheads, all of which are largely overlooked today.

We believe that this breakdown of end-to-end AI performance is of both interest and significance for users across the stack. End users (i.e., application developers) care because they may want to invest in additional complexity to lower pre-processing overheads (instead of focusing on inference). Framework authors may wish to report more details about pre- and post-processing or data collection to their users, and may also want to consider scheduling pre- and post-processing operations on hardware. AI accelerator designers may want to consider dropping an expensive tensor accelerator in favor of a cheaper DSP that can also do pre-processing.

The remainder of the paper is organized as follows. Section II describes the steps involved in using an ML model from within a mobile application, from data acquisition to interpretation of the results. In Section III, we detail our experimental setup, including the frameworks, models, applications, and hardware that we tested. We define AI tax and quantify it in Section IV. We discuss prior art in Section V and conclude the paper in Section VI.

II. THE MACHINE LEARNING PIPELINE

This section presents the typical flow of events when invoking an ML model on an SoC, which we refer to as the ML pipeline. This sets the stage for how we conduct our end-to-end AI performance analysis. Figure 2 shows each of the stages, and this section elaborates on their individual functions.

A. Data Capture

Fig. 2: Example of the steps taken in a typical end-to-end flow for an image classification machine learning task. A pre-processing step transforms data captured by the camera into the shape that the model in question expects. The ML framework loads the model and plans any subsequent stages; the choice of framework can be platform agnostic (e.g., NNAPI) or provided by a chipset vendor (e.g., Qualcomm's SNPE), and its initialization is typically performed once. Then the framework schedules model execution on hardware at "operation" granularity. Finally, a post-processing step makes sense of the model output, e.g., by selecting the likeliest classes to which the image belongs (i.e., topK).

Acquiring data from sensors can seem trivial on the surface, but it can easily complicate an application's architecture and influence system design decisions. For example, capturing raw images faster than the application can handle puts strain on system memory and I/O, and an incorrect choice of image resolution can cause non-linear performance drops if image processing algorithms in later parts of the ML pipeline do not scale with image size. Some systems collect data from more than a single sensor, in which case additional data processing (such as fusing multiple sources of data into a single metric) or sanitization may be required. This extra processing often runs concurrently with the rest of the application across the same set of cores, adding synchronization overheads and interference into the picture. Alternatively, it may also be offloaded to co-processors, such as the DSP through frameworks such as Qualcomm's FastCV [21], which bring additional performance considerations of their own. In the applications we studied, the supporting code around data capture accounted for a large share of overall application latency.
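To make the data-capture stage concrete, the sketch below shows one way an Android application can bound the capture rate so frames do not accumulate faster than the ML pipeline can drain them. It uses the CameraX ImageAnalysis API as an illustrative assumption (not the setup used in our measurements); the target resolution and the analyzer body are placeholders, and exact API availability depends on the CameraX version.

```kotlin
import android.util.Size
import androidx.camera.core.ImageAnalysis
import java.util.concurrent.Executors

// Hypothetical capture setup: keep only the newest frame so a slow ML pipeline
// cannot cause camera buffers to pile up in memory.
val analysisExecutor = Executors.newSingleThreadExecutor()

val imageAnalysis = ImageAnalysis.Builder()
    .setTargetResolution(Size(640, 480))  // pick a resolution the pre-processing stage can keep up with
    .setBackpressureStrategy(ImageAnalysis.STRATEGY_KEEP_ONLY_LATEST)
    .build()
    .apply {
        setAnalyzer(analysisExecutor) { frame ->
            // hand `frame` to pre-processing and inference here
            frame.close()  // release the buffer so the camera can deliver the next frame
        }
    }
```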

B. Pre-processing

The first step in using sensor data is to shape it into a format that the model in question expects. Although ML frameworks expose some standard pre-processing algorithms, the input requirements of a model depend on the task, dataset, network architecture, and application details, such as how data is acquired and stored. Hence, a system's architecture can affect the choice of pre-processing algorithms and the time spent in that stage. In this paper, we deal with models that operate on image data, and so we direct our attention in the remainder of this section to the common algorithms used for image pre-processing in the applications we benchmarked:

Bitmap formatting: When dealing with image data that we obtain from a camera module, an application needs to convert the raw frames into some standard image representation. A common pattern in TensorFlow-based Android applications is to retrieve a camera frame in the YUV NV21 format using the Android Camera API [22] and convert it to the ARGB8888 format.
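As an illustration of the per-pixel cost this conversion involves, the sketch below turns an NV21 frame into an ARGB_8888 bitmap using the standard YUV-to-RGB formula. It is a simplified stand-in for the conversion helpers shipped with the TFLite example apps, not their exact code.

```kotlin
import android.graphics.Bitmap

// Simplified NV21 (full-resolution Y plane followed by interleaved V/U) to
// ARGB_8888 conversion. Runtime scales linearly with the number of pixels.
fun nv21ToArgbBitmap(nv21: ByteArray, width: Int, height: Int): Bitmap {
    val frameSize = width * height
    val argb = IntArray(frameSize)
    for (row in 0 until height) {
        for (col in 0 until width) {
            val y = nv21[row * width + col].toInt() and 0xFF
            val uvIndex = frameSize + (row shr 1) * width + (col and 0x1.inv())
            val v = (nv21[uvIndex].toInt() and 0xFF) - 128
            val u = (nv21[uvIndex + 1].toInt() and 0xFF) - 128
            val r = (y + 1.370705f * v).toInt().coerceIn(0, 255)
            val g = (y - 0.337633f * u - 0.698001f * v).toInt().coerceIn(0, 255)
            val b = (y + 1.732446f * u).toInt().coerceIn(0, 255)
            argb[row * width + col] = (0xFF shl 24) or (r shl 16) or (g shl 8) or b
        }
    }
    return Bitmap.createBitmap(argb, width, height, Bitmap.Config.ARGB_8888)
}
```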

Scale/Crop: A model is trained on images of fixed dimensions, and the input dimensions determine a network's architecture (i.e., the sizes of individual layers). Hence, somewhere in the pre-processing stage, the image data needs to be resized to the size that the first layer of the network can handle. A widespread technique for scaling images (used, for example, as TensorFlow's default resizing algorithm) is bilinear interpolation. Although this algorithm's run-time scales quadratically with the output image size, vectorized implementations can significantly improve its performance. Some models such as Inception-v3 (center-)crop an image prior to scaling it [23]. This step removes some number of pixels along each dimension of an image; its cost is thus the cost of computing the bounding box around the cropped region and reshaping a tensor. In the majority of models we measured, cropping and scaling were significant contributors to overall execution time.
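A minimal sketch of this step, assuming a square model input such as 224x224: center-crop the frame to a square and then resize it with Android's built-in filtering (Bitmap.createScaledBitmap uses bilinear filtering when the filter flag is enabled).

```kotlin
import android.graphics.Bitmap

// Center-crop the captured frame to a square and bilinearly resize it to the
// model's input resolution. Cost grows with the output size of the resize.
fun centerCropAndScale(src: Bitmap, inputSize: Int): Bitmap {
    val cropSize = minOf(src.width, src.height)
    val cropX = (src.width - cropSize) / 2
    val cropY = (src.height - cropSize) / 2
    val cropped = Bitmap.createBitmap(src, cropX, cropY, cropSize, cropSize)
    return Bitmap.createScaledBitmap(cropped, inputSize, inputSize, /* filter = */ true)
}
```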

Normalize: Almost all networks require normalized inputs, i.e., with zero mean and unit variance. A typical implementation will thus iterate over each pixel, subtract the image mean from it, and scale it by the standard deviation. The run-time of a normalization operation thus scales linearly with the number of pixels in the input image.
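The sketch below shows the corresponding per-pixel loop that fills a float input buffer; the 127.5 mean and standard deviation are placeholder values commonly used by TFLite image models, not constants prescribed by any particular network.

```kotlin
import android.graphics.Bitmap
import java.nio.FloatBuffer

// Normalize an ARGB_8888 bitmap into a packed RGB float buffer.
// Runtime scales linearly with the number of pixels.
fun bitmapToNormalizedBuffer(bitmap: Bitmap, mean: Float = 127.5f, std: Float = 127.5f): FloatBuffer {
    val pixels = IntArray(bitmap.width * bitmap.height)
    bitmap.getPixels(pixels, 0, bitmap.width, 0, 0, bitmap.width, bitmap.height)
    val buffer = FloatBuffer.allocate(pixels.size * 3)
    for (p in pixels) {
        buffer.put((((p shr 16) and 0xFF) - mean) / std)  // R
        buffer.put((((p shr 8) and 0xFF) - mean) / std)   // G
        buffer.put(((p and 0xFF) - mean) / std)           // B
    }
    buffer.rewind()
    return buffer
}
```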

Rotate: Depending on the orientation of the image provided by the sensors, an application might have to perform an additional rotation operation on the input. If a network was trained on augmented data or if the sensor orientation is consistent with the image orientation expected by the model, this step might not be necessary. However, models such as PoseNet [24], which we evaluate in this paper, make extensive use of this operation, and it is worth keeping in mind that image rotation scales quadratically with the image size.
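Where rotation is needed, the usual Android pattern is a Matrix-based transform, as sketched below; the rotation angle is whatever the camera reports for the sensor orientation.

```kotlin
import android.graphics.Bitmap
import android.graphics.Matrix

// Rotate a frame to the orientation the model expects. Like scaling, the cost
// grows with the number of pixels in the frame.
fun rotateToUpright(src: Bitmap, rotationDegrees: Int): Bitmap {
    if (rotationDegrees % 360 == 0) return src
    val matrix = Matrix().apply { postRotate(rotationDegrees.toFloat()) }
    return Bitmap.createBitmap(src, 0, 0, src.width, src.height, matrix, /* filter = */ true)
}
```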

Type conversion: Quantized and reduced-precision models (such as 16-bit floating point) are increasingly popular, especially on mobile systems, because less memory is required to store weights and activations while accuracy remains within acceptable bounds. However, the datatype and bit-width of a model also affect the pre-processing stage because frames from sensors are captured as raw bytes and thus need to be converted to the appropriate types and possibly quantized.
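For a quantized model, the normalized floats (or the raw bytes directly) have to be mapped onto the model's integer input range. A minimal sketch, assuming the per-tensor scale and zero point that TFLite reports for the input tensor:

```kotlin
import kotlin.math.roundToInt

// Quantize normalized float inputs to UINT8 using the input tensor's
// quantization parameters (scale and zero point).
fun quantizeToUint8(values: FloatArray, scale: Float, zeroPoint: Int): ByteArray =
    ByteArray(values.size) { i ->
        ((values[i] / scale) + zeroPoint).roundToInt().coerceIn(0, 255).toByte()
    }
```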

C. Frameworks

Most of the ML pipeline is determined by the framework(s). Pre- and post-processing algorithms are often provided as library calls, such as TensorFlow operations in TFLite. Input data formats, model execution, and computation offload are all managed transparently (and often manually configurable) by a run-time, e.g., Android's NNAPI. Device-dependent frameworks are offered by most SoC vendors, e.g., Qualcomm's SNPE or MediaTek's ANN, but there is an ongoing effort to provide interoperability between them and NNAPI. They can expand the list of accelerators on which we can schedule computation and a model's set of supported operations. Given that with cross-platform frameworks such as NNAPI each vendor is responsible for implementing the corresponding drivers, an application's performance profile can vary significantly across devices.

| Task | Model | Resolution | Pre-processing Task | Post-processing Task | NNAPI-fp32 | NNAPI-int8 | CPU-fp32 | CPU-int8 |
| Classification | MobileNet 1.0 v1 | 224x224 | scale, crop, normalize | topK, dequantization* | Y | Y | Y | Y |
| Classification | NasNet Mobile | 331x331 | scale, crop, normalize | topK, dequantization* | Y | N | Y | N |
| Classification | SqueezeNet | 227x227 | scale, crop, normalize | topK, dequantization* | Y | N | Y | N |
| Classification | EfficientNet-Lite0 | 224x224 | scale, crop, normalize | topK, dequantization* | Y | Y | Y | Y |
| Classification | AlexNet | 256x256 | scale, crop, normalize | topK, dequantization* | N | N | Y | Y |
| Face Recognition | Inception v4 | 299x299 | scale, crop, normalize | topK, dequantization* | Y | Y | Y | Y |
| Face Recognition | Inception v3 | 299x299 | scale, crop, normalize | topK, dequantization* | Y | Y | Y | Y |
| Segmentation | Deeplab-v3 Mobilenet-v2 | 513x513 | scale, normalize | mask flattening | Y | N | Y | N |
| Object Detection | SSD MobileNet v2 | 300x300 | scale, crop, normalize | topK, dequantization* | Y | Y | Y | Y |
| Pose Estimation | PoseNet | 224x224 | scale, crop, normalize, rotate | calculate keypoints | Y | N | Y | N |
| Language Processing | Mobile BERT | - | tokenization | topK, compute logits | Y | N | Y | N |

TABLE I: Comprehensive list of benchmarks. Each entry shows all possible pre- and post-processing tasks that we observed across our experiments. Tasks marked with an "*" are only performed with quantized models.
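The sketch below illustrates how the backend choice is typically expressed in a TFLite-based Android app: the same model file can be bound to an NNAPI delegate (which in turn dispatches to vendor drivers) or run on a fixed number of CPU threads. Class names follow the public TFLite Java API; exact behavior depends on the TFLite version and the drivers present on the device.

```kotlin
import java.nio.MappedByteBuffer
import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.nnapi.NnApiDelegate

// Bind the same model to different backends; much of the cross-device
// variability discussed in this paper originates in this choice.
fun buildInterpreter(model: MappedByteBuffer, useNnapi: Boolean): Interpreter {
    val options = Interpreter.Options()
    if (useNnapi) {
        options.addDelegate(NnApiDelegate())  // let NNAPI and its vendor drivers pick CPU/GPU/DSP
    } else {
        options.setNumThreads(4)              // plain CPU execution across four threads
    }
    return Interpreter(model, options)
}
```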

D. Model Execution

This stage encompasses the invocation of a model and the retrieval of the results back to the calling application. In frameworks such as TFLite, these steps are handled by a combination of calls to OS APIs such as Android's NNAPI and device-specific driver implementations. These drivers can be open source, such as the TFLite Hexagon Delegate, or bundled into vendor-specific SDKs, such as Samsung's Edge delegate.

In NNAPI, the model execution details are determined during a step called model compilation, which usually needs to be performed only once when loading a new model into an application. Based on the application's execution preference (low power consumption, maximum throughput, fast execution, etc.), the framework will determine (and remember for successive executions) on which processors and co-processors to run a model. Given the abundance of accelerators on modern SoCs, NNAPI tries to distribute a model's execution across several devices, a step referred to as model partitioning.
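A hedged sketch of how the execution preference is expressed through TFLite's NNAPI delegate wrapper; the option and constant names below follow the public NnApiDelegate.Options API, though they may differ across TFLite versions.

```kotlin
import org.tensorflow.lite.nnapi.NnApiDelegate

// The preference passed to NNAPI at compilation time steers device assignment
// and is remembered for successive executions of the same model.
val nnapiDelegate = NnApiDelegate(
    NnApiDelegate.Options()
        .setExecutionPreference(NnApiDelegate.Options.EXECUTION_PREFERENCE_FAST_SINGLE_ANSWER)
        // .setAcceleratorName("qti-dsp")  // optionally pin a specific accelerator by name
)
```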

Offloading is the process of performing a task (ideally more efficiently) on an attached accelerator instead of the CPU. There are two common models for integrating an accelerator into an SoC. In the tightly coupled model, an accelerator is integrated with the CPU core and its cache hierarchy. In the loosely coupled model, the accelerator is a separate hardware block with its own memory subsystem that communicates with the CPU over memory-mapped I/O [25]. The accelerators in the mobile devices that we consider (namely the Snapdragon 8xx chipsets) are loosely coupled to the CPU. Thus, any communication with the DSP requires a round trip through the kernel device driver interface.

E. Post-processing

Post-processing refers to the remaining computations on the model's outputs before presenting them to the user. As with pre-processing algorithms, the details are task-dependent. For the image classification tasks that we considered, all that is left to do on the output tensor is pick the most fitting predictions (referred to as topK). The outputs of a model are sorted by the likelihood of labels, and so choosing the topK elements is simply an array slice operation. Other tasks like pose estimation or image segmentation require more intensive data processing on the model output. For example, an application using PoseNet must map the detected key points back onto the image.
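For completeness, a minimal sketch of a generic topK over a classification output: when the output vector is already ordered by likelihood this reduces to a slice, and even in the general case it is only a small sort over the class scores.

```kotlin
// Pick the k most likely labels from a classification output vector.
fun topK(scores: FloatArray, labels: List<String>, k: Int = 3): List<Pair<String, Float>> =
    scores.withIndex()
        .sortedByDescending { it.value }
        .take(k)
        .map { labels[it.index] to it.value }
```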

III. EXPERIMENTAL SETUP

In this section, we summarize our experiments' infrastructure, configuration, and measurement setup and provide the necessary details to reproduce our results.

A. Machine Learning Models

We measure individual stages of the ML pipeline across a representative set of tasks, input data sizes, and models. Table I summarizes the models we used during our experiments. All models are hosted by TFLite [26]. The implementations of the pre- and post-processing tasks come from the TensorFlow Lite Android support libraries, which are well optimized for production use cases [27].

We picked a set of commonly used models on mobile systems that cover a range of input image resolutions (224x224 to 513x513) and perform pre-/post-processing tasks of varying complexity. We also picked models that are designed for mobile devices (e.g., MobileNet) as well as more general-purpose models (e.g., Inception). We experiment with two widespread numerical formats for mobile ML: (1) 8-bit quantized integers (INT8) and (2) 32-bit floating point (FP32). Performance and accuracy are often tightly correlated. For our work, accuracy does not matter because we do not compare FP32 against INT8 models; any comparisons we make are between the same model and bit-width. Hence, we do not discuss model accuracy in the paper.

B. Software Frameworks

We use the open-source TFLite command-line benchmark utility (and its Android wrapper) to measure pre-processing, post-processing, and inference latency. The former two required adding timing code around the corresponding parts of the source. By default, the utility generates random tensors as input data. Each model invocation is performed 500 times. Depending on the exact experiment, we schedule the model on the open-source GPU Delegate or Hexagon Delegate, via automatic device assignment by NNAPI, on the CPU (across four threads), or a combination of all of the above. By default, the benchmarks use NNAPI's default execution preference setting, FAST_SINGLE_ANSWER.

| System | SoC | Accelerators |
| Open-Q 835 µSOM | Snapdragon 835 | Adreno 540 GPU, Hexagon 682 DSP |
| Google Pixel 3 | Snapdragon 845 | Adreno 630 GPU, Hexagon 685 DSP |
| Snapdragon 855 HDK | Snapdragon 855 | Adreno 640 GPU, Hexagon 690 DSP |
| Snapdragon 865 HDK | Snapdragon 865 | Adreno 650 GPU, Hexagon 698 DSP |

TABLE II: Platforms we used to conduct our study. We use a userdebug build of AOSP in order to get the right instrumentation hooks for our purposes while mimicking the final product build as closely as possible.

To evaluate the effect of embedding ML models into mobile applications, we profile a set of open-source applications. For image classification, segmentation, and pose estimation tasks, we utilize TFLite's open-source example Android applications [28]. We have seen similar implementations of the data capture, pre-processing, post-processing, and inference stages across other Android apps.

C. Hardware Platforms

Table II describes the hardware platforms we study in our characterization evaluation. We focus on the Qualcomm Snapdragon series ranging from the SD835 to the SD865, as these chipsets are widely deployed in the ecosystem. We only present results on the Google Pixel 3 (SD845), although our experimental results indicate that the trends are representative across the other, older and newer, chipsets. The other chipsets have the same set of hardware accelerators; all contain a CPU, GPU, and DSP for AI processing tasks. Moreover, these chipsets were used by Qualcomm to generate the MLPerf Inference [19], [29] and recent MLPerf Mobile results [30].

We ran a similar set of experiments through a non-NNAPI framework, Qualcomm's SNPE, using open-source applications found on Qualcomm's Developer Network [31] and the SNPE benchmark tool [32] instead of TFLite's. We came to similar conclusions about end-to-end latency trends, but due to a lack of model variety and available Android apps targeting these systems, we decided to focus our discussion solely on NNAPI. We expect to see similar results across non-Qualcomm SoCs because many of the performance trends we discuss are consequences of how frameworks and applications use hardware rather than of the underlying SoC hardware itself.

D. Probe Effect

Fig. 3: Comparison of inference latency between the TFLite command-line benchmark utility, the TFLite Android benchmark app, and example Android applications.

In our measurements we keep the performance degradation due to our instrumentation infrastructure to a minimum. To decrease variance in our results, we report the arithmetic mean of 500 runs. Since mobile SoCs are particularly susceptible to thermal throttling, we make sure to run benchmarks only once the CPU has cooled to its idle temperature of around 33 °C.
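A minimal sketch of the measurement loop underlying these numbers: collect per-invocation latencies and aggregate them afterwards (here as the arithmetic mean, mirroring how we report the 500-run results). Interpreter usage follows the TFLite Java API; the input and output objects are whatever buffers the model in question expects.

```kotlin
import android.os.SystemClock
import org.tensorflow.lite.Interpreter

// Time each model invocation individually so the samples can also be used for
// distribution-level reporting later (see Section IV-C).
fun measureLatenciesMs(interpreter: Interpreter, input: Any, output: Any, runs: Int = 500): DoubleArray =
    DoubleArray(runs) {
        val start = SystemClock.elapsedRealtimeNanos()
        interpreter.run(input, output)
        (SystemClock.elapsedRealtimeNanos() - start) / 1e6
    }

// Example aggregation: val meanMs = measureLatenciesMs(interpreter, input, output).average()
```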

When enabling our driver instrumentation, we observe a 4-7% increase in inference time with hardware acceleration enabled (via SNPE or NNAPI), and no effect on pre-processing or on inference performed on the CPU. These increased numbers are not the ones we use in our evaluation, since the instrumentation only serves to reveal code paths and measure time spent in drivers. The only results we show for which instrumentation was enabled and influenced latency on the order of milliseconds are in the offload section (see Section IV-C).

IV. AI TAX FOR MOBILE

AI tax is the time a system spends on tasks that enable the execution of a machine learning model; this is the combined latency of all non-inference ML pipeline stages (defined in Section II). We claim that viewing benchmark results and model performance in the context of AI tax can steer the mobile systems community towards fruitful research areas and narrow in on the parts of a system that are sources of performance bottlenecks and need optimization.

To understand and quantify the real-world effects of end-to-end AI processing, we first compare the performance of ML models when they are run as (1) pure benchmarks from the command line; (2) benchmark apps packaged with a user interface (the TFLite Android benchmark utility [33]); and (3) part of a real application. An example of how benchmarks or proxy applications can deviate from real-world application performance is shown in Figure 3, where we measured the end-to-end latency of various models running on the CPU. Since TFLite's benchmark utility focuses on testing inference performance, we observe that negligible pre-processing is required to run a model. Both benchmark utilities (including the one that aims to be more representative of Android applications) mask the end-to-end performance penalties that arise from data capture and pre-processing. For example, in Inception V3-fp32, the app latency is nearly 350 ms while the benchmark latency is 100 ms less, at 250 ms. The trend holds true across all the machine learning models.

Fig. 4: Comparison of the time spent on pre-processing and data capture compared to inference in the TFLite benchmark utility as opposed to Android applications. In (a) we report absolute numbers for all three components, while in (b) we report the data capture and pre-processing latency relative to the inference latency. Compared to benchmarks, the same model encapsulated inside a real application spends a significant amount of time waiting for data capture and pre-processing before it is invoked for inference.

Consequently, this section aims to convey the importance of AI tax and to demonstrate how to identify it in mobile applications. AI tax manifests itself through three broad categories:

1) Algorithms: how an application pre-/post-processes data, and the complexity and input shape of a model.

2) Frameworks: the choice of machine learning inference runtime used to drive the execution of a model.

3) Hardware: the availability of hardware accelerators and the mechanisms by which the CPU offloads compute (including any OS-level abstractions).

The following sections present our findings across these categories. We find that the time spent in OS and vendor drivers amortizes quickly across hardware delegation frameworks; together with post-processing, it constitutes only a small fraction of the overall execution time. However, the time spent on capturing and then pre-processing data for ML models' consumption can add up to a majority of the execution time.

A. AI Tax: Algorithms

This section presents our pre-/post-processing and inference latency measurements and shows how the performance patterns differ between benchmarks and Android applications. We measure the time spent on acquiring and pre-processing input data to highlight the differing performance profiles between benchmarks and applications. Although most of our results suggest that post-processing latency is negligible (sub-millisecond per inference), more complex tasks such as image segmentation and object detection show that applications require significant additional work on the model output (e.g., mask flattening and bounding box tracking, respectively).

Data Capture & Pre-Processing. Figure 4 contrasts the data capture and pre-processing latency measured in inference benchmarks versus Android applications using the same models, with NNAPI as the underlying software framework. Figure 4a shows the data capture, pre-processing, and inference latencies, while Figure 4b shows the processing overheads relative to the inference latency to narrow in on details. These show that a significant portion of a model's time is spent on data capture and pre-processing instead of actual inference. That said, the amount of time depends on both the model in question and whether we consider benchmarks or applications.

Models such as quantized MobileNet v1 and SSD MobileNet v1 spent up to twice as much time acquiring and processing data as performing inference, in both benchmarks and applications. When embedding floating-point models such as PoseNet, EfficientNet, and DeepLab v3 into applications, we similarly see the majority of time spent outside of inference. In benchmarks, on the other hand, inference latency dominates. Generally, for floating-point data, the "data capture" (which in the case of the inference benchmarks is random data generation) is negligible. In quantized models, it seems to approximate real applications to some extent. However, this is one of the fallacies of this approach to data acquisition: the standard C++ library that this benchmark happened to be compiled against (libc++) generates random real numbers significantly faster than integers, and with a different standard library (libstdc++) we observed the exact opposite behavior. The only model where inference latency dominates is Inception-v3, both quantized and floating-point. Not only do the Inception models have significantly more parameters and operations than the other, more mobile-friendly models we test, but they can also only be partially offloaded by NNAPI and run around half of their inference on the CPU. For image classification and object detection, pre-processing and data acquisition contribute approximately equal amounts to overall run-time, while for PoseNet and Deeplab-v3 pre-processing contributes about 10% and 1%, respectively.

Post-Processing. Many machine learning tasks, particularly image classification, require relatively little post-processing. However, applications in domains such as object detection commonly employ CPU-intensive output transformations after every inference to make the output data user-friendly. For example, a typical application using PoseNet maps the detected key points onto the image, and dashcams compute and visualize bounding boxes from a model's output. As with pre-processing overheads, a system designer should keep in mind that post-processing can quickly become an application's main bottleneck depending on the task; these patterns, again, will likely be absent from a benchmark report.

Takeaway from AI Tax: Algorithms

When assessing or benchmarking ML system performance, it is crucial to quantify the end-to-end ML system performance, including the overheads from capturing data along with pre- and post-processing, as these can be as much as 50% of the total execution time. Therefore, obsessing over ML-only performance can lead us to miss the forest for the trees.

B. AI Tax: Software Frameworks

The next set of contributors to ML application overhead is specific to the choice of ML framework. Prominent examples include open-source projects such as TFLite and vendor-specific development kits such as Qualcomm's SNPE or MediaTek's NeuroPilot [14]. These frameworks abstract away most of the execution pipeline of Figure 2, including management of model lifetime, communication with accelerators, and the implementation of pre- and post-processing primitives. In this section, we use NNAPI to illustrate the pitfalls in evaluating AI performance if we do not benchmark end-to-end performance.

Drivers & Scheduling. Since NNAPI is in large part an interface that mobile vendors are responsible for implementing, performance bottlenecks commonly arise from inefficiencies in the driver implementations, e.g., in how models are executed or how specific NNAPI operations are implemented. Figure 5 shows a situation in which this effect is pronounced. We ran the quantized EfficientNet-Lite0 model on four different device targets through TFLite: (1) the open-source Hexagon backend, (2) four CPU threads, (3) a single CPU thread, and (4) NNAPI.

Relying on NNAPI’s automatic device assignment degradesperformance by 7× compared to scheduling the model ona single-threaded CPU. Interestingly this does not occur inthe floating-point model, pointing to several subtle traps. This

Fig. 5: Performance degradation of TFLite’s quantizedEfficientNet-Lite0 when using NNAPI (with CPU fallback).

is because of NNAPI driver support, which is lagging forthe INT8 operators the model implementation used. Futureiterations may likely fix this performance “bug.” However, thetakeaway is that not all frameworks are created and treatedequal.

To root-cause the overhead, we analyze the execution profile of EfficientNet-Lite0 running through an image processing app on our Pixel 3. The measured profile is shown in Figure 6. Execution on four CPU threads behaves as expected; cores 4-7 are at 100% utilization for the benchmark (see annotation 1). Similarly, execution through Hexagon shows 100% utilization of the cDSP and increased AXI traffic (annotation 2). However, NNAPI's automatic device selection does not decide on an efficient execution. Initially, we see an attempt to communicate with the DSP through a spike in the compute DSP (cDSP) utilization at the start of the benchmark (5th row from the bottom). Then, the rest of the execution uses a single thread, indicated by the sporadic 100% CPU utilization across cores 4-7 (annotation 3). Moreover, we witness frequent CPU migrations, characterized by more frequent context switches and the core utilization pattern (annotation 4). This failure to offload to a co-processor and the resulting reliance on the OS CPU scheduler ruin inference latency.

While Figure 6 shows a specific case study of a single neural network, we observe similar behavior on other networks. Though the DSP, as an accelerator, is supposed to run models fast, there is no guarantee that a model will always run faster on it.

We extended our analysis (not shown) to include the performance of various models running on the CPU (using the native TFLite delegate) versus the DSP (using the NNAPI driver backend). In all cases except for Inception V4, we observed that the NNAPI-DSP code path is slower. When introspecting the execution using the execution profile, we see behavior similar to Figure 6: falling back on the CPU results in poor end-to-end system performance.

Fig. 6: Output from the Snapdragon Profiler, which shows measurements while running the EfficientNet-Lite0 model on the CPU, the TFLite Hexagon delegate, and NNAPI. The benchmark was run using TFLite's command-line benchmark utility.

When we switch the framework to the vendor-optimized Qualcomm SNPE, the DSP's performance is significantly better. The models running on the DSP outperform the CPU (as one would expect). This tells us that the choice of software framework can greatly impact the performance achieved on the chipset. The SoC vendor-specific software is highly tuned for the SoC and provides optimized support for the neural network operators.

An unfortunate side effect of vendor-optimized software backends is that they create silos, which can make it difficult for application developers to know which framework is the right one for wide-scale deployment. Ahead of time, it is unclear which framework(s) provide the best (and most flexible) support until the developers (1) download the framework, (2) select a set of ML models to run, and (3) profile those models on the chosen frameworks for their target SoCs.

Takeaway from AI Tax: Software Frameworks

Not all frameworks are created equal. Frameworks that poorly support a model fall back (at least partially) on the CPU, resulting in worse performance than using the CPU from the start instead of attempting to leverage the promised hardware accelerator performance. Hence, there is a need for greater transparency about the frameworks used during performance analysis.

C. AI Tax: Hardware Performance

Modern mobile SoCs can include several accelerators for different problem domains, e.g., audio, sensing, image processing, and broadband [34]. The behavior of these accelerators is strongly tied to how their software counterparts invoke them. We assess how invocation frequency affects AI performance.

Cold start. We use the Qualcomm Snapdragon 8xx chipsets, which include a Hexagon DSP that has been optimized for machine learning applications over the years and is commonly referred to as one of Qualcomm's Neural Processing Units (NPUs). It is reminiscent of a VLIW vector processing engine.

Qualcomm developed the Fast Remote Procedure Call (FastRPC) framework to enable fast communication between the CPU and DSP. We measure the FastRPC calls themselves, i.e., the overhead of offloading machine learning inference to a DSP. The framework requires two trips through the OS kernel, with the FastRPC drivers signaling the other side upon receipt/transmission, as shown in Figure 7. We were interested in how much this contributes to the hardware AI tax. The figure shows the various system-level call boundaries that have to be crossed when offloading to a DSP. Note the cache flush that has to occur to maintain coherency.

Figure 8 shows the measured inference and offloading time for MobileNet v1 when using the NNAPI Hexagon delegate to execute the inference task. As we can see, for a small number of inferences, the offloading cost dominates the inference latency. However, as the number of inference operations increases, the proportion of offloading time to inference time (per inference) decreases. The initial DSP setup (mapping the DSP to the application process) is done once, and we can perform multiple inferences using the same setup.
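One way to surface this effect in measurements, sketched below under the same TFLite Interpreter assumptions as before, is to record the first ("cold") invocation separately from the steady-state ("warm") average instead of discarding it as warm-up.

```kotlin
import android.os.SystemClock
import org.tensorflow.lite.Interpreter

// Separate the first inference, which pays delegate setup and offload costs,
// from subsequent inferences over which those costs amortize.
fun coldVsWarmMs(interpreter: Interpreter, input: Any, output: Any, warmRuns: Int = 100): Pair<Double, Double> {
    val coldStart = SystemClock.elapsedRealtimeNanos()
    interpreter.run(input, output)
    val coldMs = (SystemClock.elapsedRealtimeNanos() - coldStart) / 1e6
    val warmMs = DoubleArray(warmRuns) {
        val t = SystemClock.elapsedRealtimeNanos()
        interpreter.run(input, output)
        (SystemClock.elapsedRealtimeNanos() - t) / 1e6
    }.average()
    return coldMs to warmMs
}
```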

Fig. 7: FastRPC call flow for the Qualcomm DSP.

Based on these results, the reason for concern is that current benchmarks and performance analyses often allow for warm-up time that is not necessarily representative of a real-world application. End-user experience, especially in smartphone deployments, involves a cold start penalty that should also be measured in ML benchmarks and workload performance analysis. The TFLite benchmark tool breaks down model initialization time, which is useful to measure if an application switches between models or frequently reloads them.

Multi-tenancy. An emerging use case in real-world applications is the growing need to support multiple models running concurrently. Example application use cases are hand tracking, depth tracking, gesture recognition, etc., in AR/VR. Yet, most hardware today supports the execution of only one model at a time. To this end, we study AI hardware performance as more tasks are scheduled onto the CPU and AI processors.

Figure 9 shows the latency breakdown of our image processing Android application when we schedule increasingly many inferences through the NNAPI Hexagon Delegate in the background (using the TFLite benchmark utility). We observe a linear increase in the latency per inference because the inference is stalled on resource availability: the main inference is running on the DSP, but there is only one DSP available for ML model acceleration on this particular SoC. Meanwhile, pre-processing and data capture remain approximately constant because activity on the CPU is not affected by the concurrent inferences.

In contrast, Figure 10 breaks down the application latency with the background inferences all running on the CPU (while the image processing app still offloads to the DSP). As expected, the time spent on pre-processing and data capture increases because more tasks are running on the CPU concurrently. Inference latency is now constant because multiple processes are not contending for the DSP.

Fig. 8: Overhead amortization over consecutive inferences.

This simple experiment demonstrates a potential pitfall. Suppose we, as system designers or application writers, were to focus only on one part of the execution pipeline (e.g., pre-processing in Figure 9 or inference latency in Figure 10) in isolation. In that case, we could wrongly conclude from the results that the device selection and schedule are optimal. However, this need not be the case, since we may have other similar applications running concurrently that contend for the same hardware resources, and we may need to rethink resource allocation, perhaps running one model on the CPU while another uses the DSP, or offloading other execution stages in certain circumstances.

While NN accelerators can speed up execution, benchmarks that focus solely on inference latency could mislead SoC designers into reserving silicon area for hardware that speeds up a specific model while eliminating other potentially useful components. For example, DSPs are frequently used for everyday image processing tasks (mainly through computer vision frameworks such as FastCV), while a specially tailored NPU typically cannot support such activities. In such cases, it is potentially more feasible to trade off a more powerful NPU for a smaller one paired with a DSP for pre-processing, freeing up the CPU to work on other parts of the application.

Run-to-run Variability. Performance variability is a major concern for mobile developers because it affects end-user experience, yet it is often not accounted for when assessing device performance. This is specifically an area that is overlooked by today's mobile AI benchmarks. Performance on a real device while running applications varies from run to run due to background interference.

Fig. 11: Latency distribution for image classification using MobileNet v1 on the CPU. We contrast the distribution in applications with that of the TFLite benchmark utility.

Figure 11 shows this performance variation to quantify the gap between benchmarks and mobile AI performance when the same AI models are packaged into a real application form factor. The results show that repeated benchmark executions have a very tight distribution in performance; the deviation from the mean latency is minimal. However, for the models running inside an app, we see a very pronounced distribution in latency, which can vary by as much as 30% from the median of the benchmarks.

Such large performance variability stems from the non-ML portion of the application code. Data capture, pre-processing, and post-processing all involve multiple hardware units and processing subsystems. As a result, the Android operating system's scheduling decisions, delays in interrupt handling from sensor input streams, etc., lead to such fluctuations.

Fig. 9: Latency breakdown of the image classification app when scheduling an increasing number of inference benchmarks through NNAPI in the background (using the same model).

Fig. 10: Same experimental setup as in Figure 9, except background inferences are scheduled on the CPU.

Due to such pronounced variability, workload performance analysis needs to report statistical distributions of performance. Instead, today's standard practice is to report a single ML performance number, even in well-established standard benchmarks like MLPerf [29], [35], [36]. In prior work, Facebook reported a similar observation in their analysis of mobile inference systems in the wild [37]. There is a greater need for us to model performance variability in workload analysis, which was also recommended in the aforementioned prior art.
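A minimal sketch of the kind of distribution-level reporting we argue for: given the per-run samples collected earlier, report the median and tail percentiles rather than a single mean.

```kotlin
import kotlin.math.roundToInt

// Report latency percentiles (e.g., p50/p90/p99) over a set of per-run samples.
fun latencyPercentiles(samplesMs: DoubleArray, percentiles: List<Double> = listOf(50.0, 90.0, 99.0)): Map<Double, Double> {
    require(samplesMs.isNotEmpty()) { "need at least one latency sample" }
    val sorted = samplesMs.sorted()
    return percentiles.associateWith { p ->
        sorted[((p / 100.0) * (sorted.size - 1)).roundToInt()]
    }
}
```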

Takeaway from AI Tax: Hardware Performance

The cold start penalty is common on mobile devices and needs to be considered, yet it is often ignored in ML benchmarks. Moreover, today's benchmarks focus almost exclusively on a single benchmark running in pristine conditions that are not representative of the end-user's operating conditions, which can lead to misguided conclusions about overall system optimization. Finally, run-to-run variability is important to faithfully assess mobile AI performance.

V. RELATED WORK

AI/ML Mobile Benchmarks. There have been many efforts in recent years to benchmark mobile AI performance. These include Xiaomi's Mobile AI Benchmark [38], AI Benchmark [39], AI Mark [40], AITUTU [41], the Procyon AI Inference Benchmark [42], and NeuralScope [43]. Nearly all of these benchmarks provide a configurable means to select the software backend, allowing users to use multiple ML hardware delegation frameworks. Many of these benchmarks run a set of pre-selected models of various bit widths (INT8, FP16, or FP32) on the CPU and on open-source or vendor-proprietary TFLite delegates. Some of them even enable heterogeneous IP support. Regardless of all these functionalities, they miss the crucial issue of measuring end-to-end performance, which is the ultimate focus of our work. Our objective is to raise awareness of AI tax so that future benchmarks can be extended to be more representative of real-world AI use cases.

End-to-end AI Performance Analysis. Kang et al. [44] have investigated the end-to-end performance pipeline of DNNs in server-scale systems and come to a similar conclusion: pre-processing is the main bottleneck in hardware-accelerated models. The authors propose optimizations to common pre-processing steps, including re-ordering, fusion, and datatype changes within the network compilation graph. Their work features an in-depth analysis of ResNet and proposes a model that is aware of pre-processing overheads to reduce end-to-end latency. We study a more comprehensive range of models and applications on mobile devices. Also, we analyze the results from a systems perspective and provide analyses of ML algorithms, frameworks, and hardware.

Similarly, Richins et al. [45] conduct a study that is specific to datacenter workloads and architectures. It identifies network and storage as the major bottlenecks in end-to-end datacenter AI-based applications, which commonly include multiple inference stages (e.g., video analytics applications). The authors show that even though their application under consideration is built around state-of-the-art AI and ML algorithms, it relies heavily on pre- and post-processing code, which consumes as much as 40% of the total time of a general-purpose CPU. Our work focuses on mobile SoCs and identifies the non-AI sources of overhead typical of handheld AI devices in the consumer market, which differ from PC and datacenter hardware in the type and number of accelerators available on a given SoC and, as of now, target a very different software infrastructure.

VI. CONCLUSION

There is a need for end-to-end performance analysis of AI systems. On mobile systems, due to the heterogeneous hardware and software ecosystem, it is essential to have more transparency into the actual end-to-end execution. The AI tax overheads we explicate across algorithms, software frameworks, and hardware highlight new opportunities for existing ML performance assessment efforts to improve their measurement methodologies. Our work motivates characterization studies across more ML frameworks and SoCs with a balanced focus on the entire system and the end-to-end pipeline instead of focusing on ML application kernels in isolation. Such studies will help us develop a more comprehensive performance model that can aid programmers and SoC architects in optimizing bottlenecks in ML applications and in accelerating the different stages of the ML pipeline.

ACKNOWLEDGEMENTS

Supported by the Semiconductor Research Corporation.

REFERENCES

[1] W. Yang, X. Zhang, Y. Tian, W. Wang, J. Xue, and Q. Liao, "Deep learning for single image super-resolution: A brief review," IEEE Transactions on Multimedia, vol. 21, no. 12, pp. 3106–3121, 2019.

[2] G. Guo, S. Z. Li, and K. Chan, "Face recognition by support vector machines," in Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition (Cat. No. PR00580). IEEE, 2000.

[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.

[4] G. Campagna, R. Ramesh, S. Xu, M. Fischer, and M. S. Lam, "Almond: The architecture of an open, crowdsourced, privacy-preserving, programmable virtual assistant," in Proceedings of the 26th International Conference on World Wide Web, ser. WWW '17. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2017, pp. 341–350. [Online]. Available: https://doi.org/10.1145/3038912.3052562

[5] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Twelfth Annual Conference of the International Speech Communication Association, 2011.

[6] L. Zhang, S. Wang, and B. Liu, "Deep learning for sentiment analysis: A survey," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 8, no. 4, p. e1253, 2018.

[7] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3D convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015.

[8] Google Coral. [Online]. Available: https://coral.ai/

[9] MediaTek Dimensity. [Online]. Available: https://i.mediatek.com/mediatek-5g

[10] Samsung Neural Processing Unit. [Online]. Available: https://news.samsung.com/global/samsung-electronics-introduces-a-high-speed-low-power-npu-solution-for-ai-deep-learning

[11] Qualcomm Neural Processing Engine. [Online]. Available: https://developer.qualcomm.com/docs/snpe/overview.html

[12] NNAPI. [Online]. Available: https://developer.android.com/ndk/guides/neuralnetworks

[13] Snapdragon Neural Processing Engine SDK. [Online]. Available: https://developer.qualcomm.com/docs/snpe/overview.html

[14] MediaTek NeuroPilot. [Online]. Available: https://www.mediatek.com/innovations/artificial-intelligence

[15] Snapdragon 835 Mobile Platform. [Online]. Available: https://www.qualcomm.com/products/snapdragon-835-mobile-platform

[16] Snapdragon 845 Mobile Platform. [Online]. Available: https://www.qualcomm.com/products/snapdragon-845-mobile-platform

[17] Snapdragon 855 Mobile Platform. [Online]. Available: https://www.qualcomm.com/products/snapdragon-855-mobile-platform

[18] Snapdragon 865+ 5G Mobile Platform. [Online]. Available: https://www.qualcomm.com/products/snapdragon-865-plus-5g-mobile-platform

[19] P. Mattson, C. Cheng, G. Diamos, C. Coleman, P. Micikevicius, D. Patterson, H. Tang, G.-Y. Wei, P. Bailis, V. Bittorf et al., "MLPerf training benchmark," Proceedings of Machine Learning and Systems, vol. 2, pp. 336–349, 2020.

[20] A. Ignatov, R. Timofte, W. Chou, K. Wang, M. Wu, T. Hartley, and L. Van Gool, "AI Benchmark: Running deep neural networks on Android smartphones," in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.

[21] Qualcomm FastCV. [Online]. Available: https://developer.qualcomm.com/software/fastcv-sdk

[22] [Online]. Available: https://developer.android.com/guide/topics/media/camera

[23] [Online]. Available: https://cloud.google.com/tpu/docs/inception-v3-advanced

[24] A. Kendall, M. Grimes, and R. Cipolla, "PoseNet: A convolutional network for real-time 6-DOF camera relocalization," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.

[25] Y. S. Shao and D. Brooks, "Research infrastructures for hardware accelerators," Synthesis Lectures on Computer Architecture, vol. 10, no. 4, pp. 1–99, 2015.

[26] [Online]. Available: https://www.tensorflow.org/lite/guide/hosted-models

[27] [Online]. Available: https://github.com/tensorflow/tflite-support/

[28] [Online]. Available: https://github.com/tensorflow/examples/

[29] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou et al., "MLPerf inference benchmark," in ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020.

[30] MLPerf Mobile. [Online]. Available: https://github.com/mlperf/mobile_app

[31] [Online]. Available: https://developer.qualcomm.com/

[32] [Online]. Available: https://developer.qualcomm.com/docs/snpe/benchmarking.html

[33] TensorFlow Lite benchmark tool. [Online]. Available: https://www.tensorflow.org/lite

[34] Qualcomm Technologies, Inc., Qualcomm Hexagon DSP User Guide.

[35] V. J. Reddi, C. Cheng, D. Kanter, P. Mattson, G. Schmuelling, C.-J. Wu, B. Anderson, M. Breughe, M. Charlebois, W. Chou et al., "MLPerf inference benchmark," arXiv preprint arXiv:1911.02549, 2019.

[36] V. J. Reddi, D. Kanter, P. Mattson, J. Duke, T. Nguyen, R. Chukka, K. Shiring, K.-S. Tan, M. Charlebois, W. Chou et al., "MLPerf Mobile inference benchmark: Why mobile AI benchmarking is hard and what to do about it," arXiv preprint arXiv:2012.02328, 2020.

[37] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia et al., "Machine learning at Facebook: Understanding inference at the edge," in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2019, pp. 331–344.

[38] Xiaomi, Mobile AI Bench. [Online]. Available: https://github.com/XiaoMi/mobile-ai-bench

[39] A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool, "AI Benchmark: All about deep learning on smartphones in 2019," 2019.

[40] AIMark. [Online]. Available: https://play.google.com/store/apps/details?id=com.ludashi.aibench&hl=en_US

[41] ANTUTU AI Benchmark. [Online]. Available: https://www.antutu.com/en/index.htm

[42] UL Procyon AI Inference Benchmark. [Online]. Available: https://benchmarks.ul.com/procyon/ai-inference-benchmark

[43] NeuralScope Mobile AI Benchmark Suite. [Online]. Available: https://play.google.com/store/apps/details?id=org.aibench.neuralscope

[44] D. Kang, A. Mathur, T. Veeramacheneni, P. Bailis, and M. Zaharia, "Jointly optimizing preprocessing and inference for DNN-based visual analytics," arXiv preprint arXiv:2007.13005, 2020.

[45] D. Richins, D. Doshi, M. Blackmore, A. T. Nair, N. Pathapati, A. Patel, B. Daguman, D. Dobrijalowski, R. Illikkal, K. Long et al., "Missing the forest for the trees: End-to-end AI application performance in edge data centers," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 515–528.
