accerelate framwork

165
These are confidential sessions—please refrain from streaming, blogging, or taking pictures Fast and energy efficient computation Session 713 Accelerate Framework Geoff Belter Engineer, Vector and Numerics Group

Upload: arsalan-malik

Post on 11-Dec-2015

25 views

Category:

Documents


4 download

DESCRIPTION

Accerelate Framwork

TRANSCRIPT

Page 1: Accerelate Framwork

These are confidential sessions—please refrain from streaming, blogging, or taking pictures

Fast and energy efficient computation

Session 713

Accelerate Framework

Geoff BelterEngineer, Vector and Numerics Group

Page 2: Accerelate Framwork

What is it?Accelerate Framework

• Easy access to a lot of functionality•Accurate• Fast with low energy usage•Works on both OS X and iOS•Optimized for all of generations of hardware

Page 3: Accerelate Framwork

What operations are available?Accelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 4: Accerelate Framwork

Session goalsAccelerate Framework

•How Accelerate helps you•Areas of your code likely to benefit from Accelerate•How you use Accelerate

Page 5: Accerelate Framwork

Why is it fast?Accelerate Framework

Page 6: Accerelate Framwork

Why is it fast?Accelerate Framework

• SIMD instructions■ SSE, AVX and NEON

Page 7: Accerelate Framwork

Why is it fast?Accelerate Framework

• SIMD instructions■ SSE, AVX and NEON

•Match the micro-architecture■ Instruction selection and scheduling■ Software pipelining■ Loop unrolling

Page 8: Accerelate Framwork

Why is it fast?Accelerate Framework

• SIMD instructions■ SSE, AVX and NEON

•Match the micro-architecture■ Instruction selection and scheduling■ Software pipelining■ Loop unrolling

•Multi-threaded using GCD

Page 9: Accerelate Framwork

Tips for Successful Use of Accelerate

Page 10: Accerelate Framwork

Tips for Successful Use of Accelerate

• Prepare your data■ Contiguous■ 16-byte aligned

Page 11: Accerelate Framwork

Tips for Successful Use of Accelerate

• Prepare your data■ Contiguous■ 16-byte aligned

•Understand problem size

Page 12: Accerelate Framwork

Tips for Successful Use of Accelerate

• Prepare your data■ Contiguous■ 16-byte aligned

•Understand problem size•Do setup once/destroy at the end

Page 13: Accerelate Framwork

Using Accelerate FrameworkXcode

Page 14: Accerelate Framwork

Using Accelerate FrameworkXcode

Page 15: Accerelate Framwork

Using Accelerate FrameworkXcode

Page 16: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 17: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 18: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 19: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 20: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 21: Accerelate Framwork

XcodeUsing Accelerate Framework

Page 22: Accerelate Framwork

Command L=lineUsing Accelerate Framework

Page 23: Accerelate Framwork

cc -framework Accelerate main.c

Command L=lineUsing Accelerate Framework

Page 24: Accerelate Framwork

What operations are available?Accelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 25: Accerelate Framwork

Vectorized image processing libraryvImage

Page 26: Accerelate Framwork

vImageWhat’s available?

Page 27: Accerelate Framwork

vImageWhat’s available?

Page 28: Accerelate Framwork

Additions and Improvements

Page 29: Accerelate Framwork

Additions and Improvements

• Improved conversion support

Page 30: Accerelate Framwork

Additions and Improvements

• Improved conversion support• vImage Buffer creation utilities

Page 31: Accerelate Framwork

Additions and Improvements

• Improved conversion support• vImage Buffer creation utilities• Resampling of 16-bit images

Page 32: Accerelate Framwork

Additions and Improvements

• Improved conversion support• vImage Buffer creation utilities• Resampling of 16-bit images• Streamlined Core Graphics interoperability

Page 33: Accerelate Framwork

Core Graphics Interoperability

Page 34: Accerelate Framwork

Core Graphics Interoperability

•How do I use vImage with my CGImageRef?

Page 35: Accerelate Framwork

Core Graphics Interoperability

•How do I use vImage with my CGImageRef?•New utility functions• vImageBuffer_InitWithCGImage• vImageCreateCGImageFromBuffer

Page 36: Accerelate Framwork

Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRefCGImageRef inImg;

// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};

// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Page 37: Accerelate Framwork

Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRefCGImageRef inImg;

// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};

// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Page 38: Accerelate Framwork

Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRefCGImageRef inImg;

// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};

// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Page 39: Accerelate Framwork

Core Graphics InteroperabilityFrom CGImageRef to vImageBuffer

#include <Accelerate/Accelerate.h>

// Create and prepare CGImageRefCGImageRef inImg;

// Specify FormatvImage_CGImageFormat format = { .bitsPerComponent = 8, .bitsPerPixel = 32, .colorSpace = NULL, .bitmapInfo = kCGImageAlphaFirst, .version = 0, .decode = NULL, .renderingIntent = kCGRenderingIntentDefault,};

// Create vImageBuffervImage_Buffer inBuffer;vImageBuffer_InitWithCGImage(&inBuffer, &format, NULL, inImg, kvImageNoFlags);

Page 40: Accerelate Framwork

Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef

#include <Accelerate/Accelerate.h>

// The output buffervImage_Buffer outBuffer;

// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);

Page 41: Accerelate Framwork

Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef

#include <Accelerate/Accelerate.h>

// The output buffervImage_Buffer outBuffer;

// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);

Page 42: Accerelate Framwork

Core Graphics InteroperabilityFrom vImageBuffer to CGImageRef

#include <Accelerate/Accelerate.h>

// The output buffervImage_Buffer outBuffer;

// Create CGImageRefvImage_Error error;CGImageRef outImg = vImageCreateCGImageFromBuffer(&outBuffer, &format, NULL, !! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! NULL, kvImageNoFlags, &error);

Page 43: Accerelate Framwork

Core Graphics Interoperability

• Convert between vImage_CGImageFormat types• Tips for use

■ Create converter once■ Use many times

vImageConvert_AnyToAny

Page 44: Accerelate Framwork

0

10

20

30

ARGB

ARGB

pre

mul

BGRA

pre

mul

RGBA

RGBA

pre

mul

RGBA

(flo

at)

RGBA

(flo

at) p

rem

ul

RGBA

(uin

t16)

RGBA

(uin

t16)

pre

mul

MPi

xel/s

Software JPEG Encode Performance

Bigger is Better

iPhone 5

Old methodAnyToAny

Page 45: Accerelate Framwork

0

10

20

30

ARGB

ARGB

pre

mul

BGRA

pre

mul

RGBA

RGBA

pre

mul

RGBA

(flo

at)

RGBA

(flo

at) p

rem

ul

RGBA

(uin

t16)

RGBA

(uin

t16)

pre

mul

MPi

xel/s

Software JPEG Encode Performance

Bigger is Better

iPhone 5

Old methodAnyToAny

Page 46: Accerelate Framwork

0

10

20

30

ARGB

ARGB

pre

mul

BGRA

pre

mul

RGBA

RGBA

pre

mul

RGBA

(flo

at)

RGBA

(flo

at) p

rem

ul

RGBA

(uin

t16)

RGBA

(uin

t16)

pre

mul

MPi

xel/s

Software JPEG Encode Performance

Bigger is Better

iPhone 5

Old methodAnyToAny

Page 47: Accerelate Framwork

0

10

20

30

ARGB

ARGB

pre

mul

BGRA

pre

mul

RGBA

RGBA

pre

mul

RGBA

(flo

at)

RGBA

(flo

at) p

rem

ul

RGBA

(uin

t16)

RGBA

(uin

t16)

pre

mul

MPi

xel/s

Software JPEG Encode Performance

Bigger is Better

iPhone 5

Old methodAnyToAny

Page 48: Accerelate Framwork

Scaling with a premultiplied PlanarF imageConversion Example

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);

Page 49: Accerelate Framwork

Scaling with a premultiplied PlanarF imageConversion Example

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);

Page 50: Accerelate Framwork

Scaling with a premultiplied PlanarF imageConversion Example

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);

Page 51: Accerelate Framwork

Scaling with a premultiplied PlanarF imageConversion Example

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);

Page 52: Accerelate Framwork

Scaling with a premultiplied PlanarF imageConversion Example

#include <Accelerate/Accelerate.h>

vImage_Buffer src, dst, alpha;...// Premultiplied data -> Non-premultiplied data, works in-placevImageUnpremultiplyData_PlanarF(&src, &alpha, &src, kvImageNoFlags);

// Resize the imagevImageScale_PlanarF(&src, &dst, NULL, kvImageNoFlags);

// Non-premultiplied data -> Premultiplied data, works in-placevImagePremultiplyData_PlanarF(&dst, &alpha, &dst, kvImageNoFlags);

Page 53: Accerelate Framwork

Conversions in ContextScaling on a premultiplied PlanarF image

0 25 50 75 100

Unpremultiply

Scale

Premultiply

Percentage of time taken

Page 54: Accerelate Framwork

Conversions in ContextScaling on a premultiplied PlanarF image

0 25 50 75 100

Unpremultiply

Scale

Premultiply 0.64%

98.26%

1.10%

Percentage of time taken

Page 55: Accerelate Framwork
Page 56: Accerelate Framwork

vImage vs. OpenCV

Page 57: Accerelate Framwork

The competitionOpenCV

•Open source computer vision library• Image processing module

Page 58: Accerelate Framwork

Points of Comparison

• Execution time• Energy consumed

Page 59: Accerelate Framwork

vImage Speedup over OpenCViPhone 5

0 5 10 15 20 25

Histogram

Max

Box Convolve

Affine Warp

Speedup over OpenCV

Above one is better

(lanczos)

Page 60: Accerelate Framwork

vImage Speedup over OpenCViPhone 5

0 5 10 15 20 25

Histogram

Max

Box Convolve

Affine Warp 6.41

7.88

23.19

1.62

Speedup over OpenCV

Above one is better

(lanczos)

Page 61: Accerelate Framwork

Energy Consumption and Battery Life

• Fast code tends to■ Decrease energy consumption■ Increase battery life

Page 62: Accerelate Framwork

Typical Energy Consumption Profile

Pow

er

Idle power

Instantaneous power

Energy

t1t0

p2

p1

p0

Time

Page 63: Accerelate Framwork

Typical Energy Consumption Profile

Pow

er

Idle power

Instantaneous power

Time

Unoptimized

Opt

imiz

ed

Page 64: Accerelate Framwork

vImage Energy Savings over OpenCViPhone 5

0 2 4 6 8

Histogram

Max

Box Convolve

Affine Warp

Times less energy than OpenCV

(lanczos)

Above one is better

Page 65: Accerelate Framwork

vImage Energy Savings over OpenCViPhone 5

0 2 4 6 8

Histogram

Max

Box Convolve

Affine Warp 6.96

4.05

6.71

0.75

Times less energy than OpenCV

(lanczos)

Above one is better

Page 66: Accerelate Framwork
Page 67: Accerelate Framwork

“Using vImage from the Accelerate framework to dynamically pre-render my sprites. It’s the only way to make it fast. ;-)”

–Twitter User

Page 68: Accerelate Framwork

What operations are available?Accelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 69: Accerelate Framwork

Vectorized digital signal processing libraryvDSP

Page 70: Accerelate Framwork

• Basic operations on arrays■ Add, subtract, multiply, conversion, accumulation, etc.

•Discrete Fourier Transform• Convolution and correlation

vDSPWhat’s available?

Page 71: Accerelate Framwork

New and Improved in vDSP

Page 72: Accerelate Framwork

New and Improved in vDSP

•Multi-channel IIR filter

Page 73: Accerelate Framwork

New and Improved in vDSP

•Multi-channel IIR filter• Improved power of 2 support

■ Discrete Fourier Transform (DFT)■ Discrete Cosine Transform (DCT)

Page 74: Accerelate Framwork

Discrete Fourier Transform (DFT)

• Same operation, two entries based on number of points

Page 75: Accerelate Framwork

Discrete Fourier Transform (DFT)

• Same operation, two entries based on number of points

256

FFT

384

DFT

Before

Page 76: Accerelate Framwork

Discrete Fourier Transform (DFT)

• Same operation, two entries based on number of points

256

FFT

384

DFT

256 384

DFT

Before OS X 10.9, iOS 7

Page 77: Accerelate Framwork

#include <Accelerate/Accelerate.h>

// Create and prepare data:float *Ir,*Ii,*Or,*Oi;

// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);

DFT Example

Page 78: Accerelate Framwork

#include <Accelerate/Accelerate.h>

// Create and prepare data:float *Ir,*Ii,*Or,*Oi;

// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);

DFT Example

Page 79: Accerelate Framwork

#include <Accelerate/Accelerate.h>

// Create and prepare data:float *Ir,*Ii,*Or,*Oi;

// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);

DFT Example

Page 80: Accerelate Framwork

#include <Accelerate/Accelerate.h>

// Create and prepare data:float *Ir,*Ii,*Or,*Oi;

// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);

DFT Example

Page 81: Accerelate Framwork

#include <Accelerate/Accelerate.h>

// Create and prepare data:float *Ir,*Ii,*Or,*Oi;

// Once at start:vDSP_DFT_Setup setup = vDSP_DFT_zop_CreateSetup(0, 1024, vDSP_DFT_FORWARD);... vDSP_DFT_Execute(setup, Ir, Ii, Or, Oi);...// Once at end:vDSP_DFT_DestroySetup(setup);

DFT Example

Page 82: Accerelate Framwork
Page 83: Accerelate Framwork

vDSP vs. FFTW

Page 84: Accerelate Framwork

Fastest Fourier Transform in the westFFTW

•One and multi-dimensional transforms• Real and complex data• Parallel

Page 85: Accerelate Framwork

vDSP Speedup over FFTWDFT on iPhone 5

Page 86: Accerelate Framwork

vDSP Speedup over FFTW

240 256 320 384 480 512 640 768 9600

1

2

3

Spee

dup

Number of Points

DFT on iPhone 5

Above one is better

Page 87: Accerelate Framwork

vDSP Speedup over FFTW

240 256 320 384 480 512 640 768 9600

1

2

3

Spee

dup

Number of Points

DFT on iPhone 5

Above one is better

Page 88: Accerelate Framwork

•Used in FaceTime•DFT one of many DSP routines• Percentage of time spent in DFT

DFT In UseAAC Enhanced Low Delay

Page 89: Accerelate Framwork

DFT In UseAAC Enhanced Low Delay

Page 90: Accerelate Framwork

DFT In UseAAC Enhanced Low Delay

47%54%

DFTEverything Else

FFTW

Percent time spent in DFT

Page 91: Accerelate Framwork

DFT In UseAAC Enhanced Low Delay

47%54%

DFTEverything Else

FFTW

70%

30%

vDSP

Percent time spent in DFT

Page 92: Accerelate Framwork

Data Types in vDSP

• Single and double precision• Real and complex• Support for strided data access

Page 93: Accerelate Framwork

“Wanna do FFT on iOS?Use the Accelerate.framework.

Highly recommended. #ios”

–Twitter User

Page 94: Accerelate Framwork

What operations are available?Accelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 95: Accerelate Framwork

Fast Math

Luke ChangEngineer, Vector and Numerics Group

Page 96: Accerelate Framwork

Math for Every Data Length

• Libm for scalar data

float

Page 97: Accerelate Framwork

Math for Every Data Length

• Libm for scalar data• vMathLib for SIMD vectors

float vFloat

Page 98: Accerelate Framwork

Math for Every Data Length

• Libm for scalar data• vMathLib for SIMD vectors• vForce for array data

float vFloat

...float []

Page 99: Accerelate Framwork

Standard C math libraryLibm

Page 100: Accerelate Framwork

Libm

• Standard math library in C• Collection of transcendental functions•Operates on scalar data

■ exp[f ]■ log[f ]■ sin[f ]■ cos[f ]■ pow[f ]■ Etc…

Page 101: Accerelate Framwork

What’s New in Libm?

• Extensions to C11, prefixed with “__”•Added in both iOS 7 and OS X 10.9• __exp10[f ]• __sinpi[f ], __cospi[f ], __tanpi[f ]• __sincos[f ], __sincospi[f ]

Page 102: Accerelate Framwork

__exp10[f]Power of 10

• Commonly used for decibel calculations• Faster than pow(10.0, x)•More accurate than exp(log(10) * x)

■ exp(log(10) * 5.0) =100000.0000000002

Page 103: Accerelate Framwork

__sinpi[f ], __cospi[f ], __tanpi[f ]Trigonometry in Terms of PI

• cospi(x) means cos( *x)• Faster because argument reduction is simpler•More accurate when working with degrees

■ cos(M_PI*0.5) returns 6.123233995736766e-17■ __cospi(0.5) returns exactly 0.0

Page 104: Accerelate Framwork

__sincos[f ], __sincospi[f ]Sine-Cosine Pairs

• Compute sine and cosine simultaneously• Faster because argument reduction is only done once• Clang will call __sincos[f ] when possible

Page 105: Accerelate Framwork

C11 Features

• Some complex values can’t be specified as literals■ (0.0 + INFINITY * I)

• C11 adds CMPLX macro for this purpose■ CMPLX(0.0, INFINITY)

• CMPLXF and CMPLXL are also available

Page 106: Accerelate Framwork

SIMD vector math libraryvMathLib

Page 107: Accerelate Framwork

vMathLib

• Collection of transcendental functions for SIMD vectors•Operates on SIMD vectors

■ vexp[f ]■ vlog[f ]■ vsin[f ]■ vcos[f ]■ vpow[f ]■ Etc…

Page 108: Accerelate Framwork

Writing your own vector algorithmWhen to Use vMathLib?

•Need transcendental functions in your vector code

Page 109: Accerelate Framwork

Taking sine of a vectorvMathLib Example

•Using Libm

#include <math.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };vFloat vy;...float *px = (float *)&vx, *py = (float *)&vy;for( i = 0; i < sizeof(vx)/sizeof(px[0]); ++i ) {

py[i] = sinf(px[i]);}...

Page 110: Accerelate Framwork

Taking sine of a vectorvMathLib Example

•Using vMathLib

#include <Accelerate/Accelerate.h>

vFloat vx = { 1.f, 2.f, 3.f, 4.f };vFloat vy;...vy = vsinf(vx);...

Page 111: Accerelate Framwork

Vectorized math libraryvForce

Page 112: Accerelate Framwork

vForce

• Collection of transcendental functions for arrays•Operates on array data

■ vvexp[f ]■ vvlog[f ]■ vvsin[f ]■ vvcos[f ]■ vvpow[f ]■ Etc…

Page 113: Accerelate Framwork

vForce Example

• Filling a buffer with sine wave using a for loop

#include <math.h>

float buffer[length];float indices[length];

...

for (int i = 0; i < length; i++){ buffer[i] = sinf(indices[i]);}

Page 114: Accerelate Framwork

vForce Example

• Filling a buffer with sine wave using vForce

#include <Accelerate/Accelerate.h>

float buffer[length];float indices[length];

...

vvsinf(buffer, indices, &length);

Page 115: Accerelate Framwork

Better PerformanceMeasured on iPhone 5

Page 116: Accerelate Framwork

vForce

For loop

0 1 2 3 4 5 6

Sines Computed per µs

0 10 20 30 40 50 60

Better PerformanceMeasured on iPhone 5

Bigger is better

56

24

Page 117: Accerelate Framwork

vForce

For loop

0 1 2 3 4 5 6

nJ consumed per sine

0 4 8 12 16 20 24

Less EnergyMeasured on iPhone 5

Smaller is better

Page 118: Accerelate Framwork

vForce

For loop

0 1 2 3 4 5 6

nJ consumed per sine

0 4 8 12 16 20 24

Less EnergyMeasured on iPhone 5

Smaller is better

14.16

23.37

Page 119: Accerelate Framwork

Measured on iPhone 5vForce Performance

0 40 80 120 160 200

truncf

logf

expf

powf

sinf

sincosf

Results computed per µs vForceFor loop

Page 120: Accerelate Framwork

Measured on iPhone 5vForce Performance

0 40 80 120 160 200

truncf

logf

expf

powf

sinf

sincosf 10.6

24

9.2

29

22

161

28.8

56

26.6

120

116

0

Results computed per µs vForceFor loop

826

Page 121: Accerelate Framwork

vForce in Detail

• Supports both float and double•Handles edge cases correctly• Requires minimal data alignment• Supports in-place operation• Improves performance even with small arrays

■ Consider using vForce when more than 16 elements

Page 122: Accerelate Framwork

What operations are available?Accelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 123: Accerelate Framwork

Linear Algebra PACKage andBasic Linear Algebra Subprograms

LAPACK and BLAS

Geoff BelterEngineer, Vector and Numerics Group

Page 124: Accerelate Framwork

LAPACK Operations

•High-level linear algebra• Solve linear systems•Matrix factorizations• Eigenvalues and eigenvectors

Page 125: Accerelate Framwork

LINPACK

•How fast can you solve a system of linear equations?

•1000 x 1000 matrices

Page 126: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A”

LINPACK benchmark performance in Mflops

Bigger is better

Mflops

Page 127: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A” 40

LINPACK benchmark performance in Mflops

Bigger is better

Mflops

Page 128: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A”

LINPACK benchmark performance in Mflops

Bigger is better

Mflops

Page 129: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A”

LINPACK benchmark performance in Mflops

Mflops

Bigger is better

Page 130: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A” 788

LINPACK benchmark performance in Mflops

Mflops

Bigger is better

Page 131: Accerelate Framwork

0 500 1000 1500

LAPACK

Accelerate

“Brand A”

Bigger is better

788

LINPACK benchmark performance in Mflops

Mflops

Page 132: Accerelate Framwork

0 500 1000 1500

LAPACK

Accelerate

“Brand A”

Bigger is better

788

1202

LINPACK benchmark performance in Mflops

Mflops

Page 133: Accerelate Framwork

0 500 1000 1500

LAPACK

Accelerate on iPhone 4S

“Brand A”

Bigger is better

788

1202

LINPACK benchmark performance in Mflops

Mflops

Page 134: Accerelate Framwork

0 500 1000 1500

LAPACK

“Brand A”

Bigger is better

788

1202

LINPACK benchmark performance in Mflops

Mflops

Page 135: Accerelate Framwork

0 500 1000 1500 2000

LAPACK

“Brand A”

Bigger is better

Accelerate on iPhone 5

788

LINPACK benchmark performance in Mflops

Mflops

Page 136: Accerelate Framwork

0

Accelerate on iPhone 4S

LAPACK

“Brand A”

500

Bigger is better

1000 1500 2000 2500 3000 3500 4000

788

LINPACK benchmark performance in Mflops

Mflops

Accelerate on iPhone 5

Page 137: Accerelate Framwork

0

Accelerate on iPhone 4S

LAPACK

“Brand A”

3446

500

Bigger is better

1000 1500 2000 2500 3000 3500 4000

788

LINPACK benchmark performance in Mflops

Mflops

Accelerate on iPhone 5

Page 138: Accerelate Framwork
Page 139: Accerelate Framwork

VS.iPadwith Retina Display G5

Power Mac

Page 140: Accerelate Framwork

iPad with Retina display vs. Power Mac G5LAPACK

• Triumphant return•After 10 years•All fans blazing

Page 141: Accerelate Framwork

LAPACK

0

Accelerate on iPhone 4S

PowerMac G5

iPad with Retina Display

1000 2000 3000 4000

Bigger is better

LINPACK benchmark performance in Mflops

Mflops

Page 142: Accerelate Framwork

LAPACK

0

Accelerate on iPhone 4S

PowerMac G5

iPad with Retina Display

1000 2000 3000 4000

Bigger is better

3643

LINPACK benchmark performance in Mflops

Mflops

Page 143: Accerelate Framwork

LAPACK

0

Accelerate on iPhone 4S

PowerMac G5

iPad with Retina Display

1000 2000 3000 4000

Bigger is better

3643

LINPACK benchmark performance in Mflops

Mflops

Page 144: Accerelate Framwork

LAPACK

0

Accelerate on iPhone 4S

PowerMac G5

iPad with Retina Display

1000 2000 3000 4000

3686

Bigger is better

3643

LINPACK benchmark performance in Mflops

Mflops

Page 145: Accerelate Framwork

LAPACK ExampleSolve linear system

#include <Accelerate/Accelerate.h>

// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;

// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);

Page 146: Accerelate Framwork

LAPACK ExampleSolve linear system

#include <Accelerate/Accelerate.h>

// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;

// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);

Page 147: Accerelate Framwork

LAPACK ExampleSolve linear system

#include <Accelerate/Accelerate.h>

// Create and prepare input and output datadouble *A, *B;__CLPK_integer *ipiv;

// solvedgesv_(&n, &nrhs, A, &n, ipiv, B, &n, &info);

Page 148: Accerelate Framwork

BLAS Operations

• Low-level linear algebra• Vector

■ Dot product, scalar product, vector sum

•Matrix-vector■ Matrix-vector product, outer product

•Matrix-matrix■ Matrix multiply

Page 149: Accerelate Framwork

BLAS ExampleMatrix Multiply

#include <Accelerate/Accelerate.h>

// Create and prepare datadouble *A, *B, *C;

// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);

Page 150: Accerelate Framwork

BLAS ExampleMatrix Multiply

#include <Accelerate/Accelerate.h>

// Create and prepare datadouble *A, *B, *C;

// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);

Page 151: Accerelate Framwork

BLAS ExampleMatrix Multiply

#include <Accelerate/Accelerate.h>

// Create and prepare datadouble *A, *B, *C;

// C <-- A * Bcblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 100, 100, 100, 1.0, A, 100, B, 100, 0.0, C, 100);

Page 152: Accerelate Framwork

Data Types

• Single and double precision• Real and complex•Multiple data layouts

■ Dense, banded, triangular, etc.■ Transpose, conjugate transpose■ Row and column major

Page 153: Accerelate Framwork

“Playing with the Accelerate.framework today. Having a BLASt.”

–Twitter User

Page 154: Accerelate Framwork

Summary

Page 155: Accerelate Framwork

Lots of functionalityAccelerate Framework

• Image processing (vImage)•Digital signal processing (vDSP)• Transcendental math functions (vForce, vMathLib)• Linear algebra (LAPACK, BLAS)

Page 156: Accerelate Framwork

Accelerate Framework

• Easy access to a lot of functionality•Accurate• Fast with low energy usage•Works on both OS X and iOS•Optimized for all of generations of hardware

Features and benefits

Page 157: Accerelate Framwork

• Prepare your data■ Contiguous■ 16-byte aligned

•Understand problem size•Do setup once/destroy at the end

Accelerate FrameworkTo be successful

Page 158: Accerelate Framwork
Page 159: Accerelate Framwork

If You Need a Feature

Page 160: Accerelate Framwork

If You Need a FeatureRequest It

Page 161: Accerelate Framwork

“Discrete Cosine Transform was my feature request that made it into the

Accelerate Framework. I feel so special!”

–Twitter User

Page 162: Accerelate Framwork

“Thanks Apple for making the Accelerate Framework.”

–Twitter User

Page 163: Accerelate Framwork

Paul DanboldCore OS Technologies [email protected]

George WarnerDTS Sr. Support [email protected]

DocumentationvImage Programming Guidehttp://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vImage/Introduction/Introduction.html

vDSP Programming Guidehttp://developer.apple.com/library/mac/#documentation/Performance/Conceptual/vDSP_Programming_Guide/Introduction/Introduction.html

Apple Developer Forumshttp://devforums.apple.com

More Information

Page 164: Accerelate Framwork

Labs

Accelerate Lab Core OS Lab AThursday 4:30PM

Page 165: Accerelate Framwork