HC-4020, Enhancing OpenCL Performance in AfterShot Pro with HSA, by Michael Wootton


Upload: amd-developer-central

Post on 13-May-2015


DESCRIPTION

Presentation HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael Wootton at the AMD Developer Summit (APU13), November 11-13, 2013.

TRANSCRIPT

Page 1: HC-4020, Enhancing OpenCL performance in AfterShot Pro with HSA, by Michael Wootton

ENHANCING OPENCL PERFORMANCE IN COREL AFTERSHOT™ PRO WITH HSA


2 | Enhancing OpenCL Performance in Corel AfterShot™ Pro with HSA | NOVEMBER 19, 2013

COREL AFTERSHOT™ PRO

What is Corel AfterShot™ Pro?

Corel AfterShot™ Pro is photo workflow software

Non-destructive photo editing of JPEG, TIFF, and Raw formats from hundreds of cameras

Photo Management

Batch Processing of modified files


AfterShot Pro

Basics


INSIDE AFTERSHOT

Architectural Features:

‒ Task Scheduling

‒ Tile Processing


AFTERSHOT TASK MANAGEMENT

Work is broken down into Tasks. Tasks typically:

‒ Contain execution logic (code)

‒ May store resultant data

‒ Track whether they are complete

The Task Scheduler:

‒ Allocates a worker thread per CPU core

‒ Runs Tasks based on priority

‒ Allows Tasks to block on each other

[Figure: A Simple Task Dependency Graph. Node labels: Disk, File Reader, JPEG Decoder, Thumbnail, Photo. Legend: Data, Task Dependency]
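The scheduling rules on this slide can be illustrated with a small sketch. It is a single-threaded simplification and not AfterShot's implementation: the real scheduler uses one worker thread per CPU core, whereas this version only shows priority ordering combined with dependency blocking. The `Task` fields, the "lower number runs first" convention, and the edges between the example tasks are all assumptions made for illustration.

```python
import heapq

class Task:
    """A unit of work: holds logic, stores its result, tracks completion."""
    def __init__(self, name, priority, run, deps=()):
        self.name = name
        self.priority = priority   # assumption: lower number runs first
        self.run = run             # callable that receives the dependency results
        self.deps = list(deps)
        self.result = None
        self.done = False

def run_all(tasks):
    """Run every task once, by priority, never before its dependencies."""
    waiting = {t.name: len(t.deps) for t in tasks}
    dependents = {t.name: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            dependents[d.name].append(t)
    # Names are unique, so the heap never has to compare Task objects.
    ready = [(t.priority, t.name, t) for t in tasks if not t.deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, _, task = heapq.heappop(ready)
        task.result = task.run(*[d.result for d in task.deps])
        task.done = True
        order.append(task.name)
        for child in dependents[task.name]:
            waiting[child.name] -= 1
            if waiting[child.name] == 0:
                heapq.heappush(ready, (child.priority, child.name, child))
    return order

# Hypothetical graph loosely following the figure labels.
reader = Task("file_reader", 1, lambda: b"jpeg-bytes")
decoder = Task("jpeg_decoder", 1, lambda data: ("pixels", len(data)), [reader])
thumb = Task("thumbnail", 0, lambda img: "thumb of %d bytes" % img[1], [decoder])
photo = Task("photo", 2, lambda img: "photo of %d bytes" % img[1], [decoder])
order = run_all([photo, thumb, decoder, reader])
```

Even though the thumbnail has the best priority, it cannot start until the decoder it blocks on has finished, which is the behavior the slide describes.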


PROCESSING WITH TILES

The standard, simpler approach is to process large monolithic images

In AfterShot, images are instead broken down into tiles for processing

Tiling provides faster screen updates: only the visible parts of the image are computed

Tiling allows more effective memory management
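To make the "only compute the visible parts" point concrete, here is a minimal sketch that maps a viewport rectangle to the tiles it overlaps. The function name, the pixel-coordinate convention, and the default 512-pixel tile size are assumptions for illustration (the deck mentions 512 x 512 later, but does not say it is the tile size).

```python
def visible_tiles(view_x, view_y, view_w, view_h, tile_size=512):
    """Return (column, row) indices of tiles overlapping the viewport.

    Assumes non-negative pixel coordinates and a non-empty viewport.
    """
    first_col = view_x // tile_size
    first_row = view_y // tile_size
    last_col = (view_x + view_w - 1) // tile_size
    last_row = (view_y + view_h - 1) // tile_size
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]

# A 1000 x 500 viewport whose top-left corner is at pixel (600, 100).
tiles = visible_tiles(view_x=600, view_y=100, view_w=1000, view_h=500)
```

Only the six tiles returned here would need processing for a screen update, regardless of how large the full image is.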


PROCESSING WITH TILES CONTINUED

The Image Processing Pipeline is made up of several discrete steps [or filters]

To process a single tile:

‒ Load the input data (e.g. raw or JPEG data)

‒ Apply each Filter step in turn

Generally, we only need the output of the last step, the top Tile in the Stack

[Figure: the Tile Stack, from Raw Data at the bottom to the Final Image at the top]
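The per-tile pipeline described above can be sketched as a loop over filter steps. This is an illustrative model only: the function names, the flat-list tile representation, and the two example filters are hypothetical, not AfterShot's actual filters.

```python
def process_tile(raw_tile, filters, keep_intermediates=False):
    """Apply each filter step in turn and return the resulting tile stack."""
    stack = [raw_tile]
    for step in filters:
        result = step(stack[-1])
        if keep_intermediates:
            stack.append(result)   # keep every intermediate tile
        else:
            stack[-1] = result     # only the top of the stack survives
    return stack

# Hypothetical filter steps operating on a flat list of 8-bit pixel values.
def exposure(tile):
    return [min(255, p * 2) for p in tile]

def invert(tile):
    return [255 - p for p in tile]

final = process_tile([10, 100, 200], [exposure, invert])[-1]
full_stack = process_tile([10, 100, 200], [exposure, invert],
                          keep_intermediates=True)
```

In the common case only `final` is needed, so intermediates can be discarded as the stack is built. The `keep_intermediates` flag hints at the situation where they have to be retained instead.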


ADVANCED TILE PROCESSING

Some Image Filters require a radius of pixels as input

Partially processed neighbor Tiles must complete before the main Tile can continue

Intermediate Tiles must be stored in memory so they do not rerun

Example Filters:

‒ Sharpening

‒ Lens Correction

‒ Noise Reduction

‒ Cropping

[Figure caption: Requires multiple source tiles]
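A rough way to see why a radius of input pixels pulls in neighbor Tiles is to compute which tiles a padded tile rectangle touches. The sketch below assumes square tiles on a regular grid and a radius measured in pixels; the function and parameter names are hypothetical.

```python
def source_tiles(col, row, radius, tile_size, cols, rows):
    """Tiles whose pixels fall within `radius` pixels of tile (col, row)."""
    # Pad the tile rectangle by the radius, clamped to the image bounds.
    x0 = max(0, col * tile_size - radius)
    y0 = max(0, row * tile_size - radius)
    x1 = min(cols * tile_size, (col + 1) * tile_size + radius) - 1
    y1 = min(rows * tile_size, (row + 1) * tile_size + radius) - 1
    return [(c, r)
            for r in range(y0 // tile_size, y1 // tile_size + 1)
            for c in range(x0 // tile_size, x1 // tile_size + 1)]

no_radius = source_tiles(1, 1, radius=0, tile_size=512, cols=4, rows=4)
small_radius = source_tiles(1, 1, radius=8, tile_size=512, cols=4, rows=4)
large_radius = source_tiles(1, 1, radius=600, tile_size=512, cols=4, rows=4)
```

With no radius the tile depends only on itself. An 8-pixel radius already needs all eight neighbors, and a radius larger than one tile reaches across the whole 4 x 4 grid in this example.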


OpenCL™ in AfterShot Pro


ACCELERATING AFTERSHOT WITH OPENCL™

Goals for the AfterShot Pro OpenCL port

Offload image processing from Tiles

Work within the existing System:

‒ Contain changes to a few critical modules

‒ Maintain full CPU utilization

‒ Integrate OpenCL Events into the Task System


GETTING WORK TO OPENCL

Identify the longest running image Filter functions and replace them with OpenCL kernels

Do not block CPU threads; use OpenCL event callbacks instead.

Processing becomes Asynchronous

Limit total work in flight to conserve memory

Marshal data automatically
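The combination of "never block, get a callback on completion" and "limit total work in flight" can be sketched with standard-library Python as an analogy. This is not OpenCL code: a thread pool stands in for the device queue, and a future's done-callback stands in for an OpenCL event callback. The `AsyncDispatcher` class and all of its names are hypothetical.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor

class AsyncDispatcher:
    """Runs jobs in the background with at most `max_in_flight` outstanding."""
    def __init__(self, executor, max_in_flight):
        self._executor = executor
        self._max = max_in_flight
        self._lock = threading.Lock()
        self._pending = deque()
        self._in_flight = 0
        self.peak_in_flight = 0

    def submit(self, kernel, tile, on_done):
        """Queue work and return immediately; on_done(result) fires later."""
        with self._lock:
            self._pending.append((kernel, tile, on_done))
        self._launch_ready()

    def _launch_ready(self):
        while True:
            with self._lock:
                if not self._pending or self._in_flight >= self._max:
                    return
                kernel, tile, on_done = self._pending.popleft()
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            future = self._executor.submit(kernel, tile)
            future.add_done_callback(
                lambda f, cb=on_done: self._finished(f, cb))

    def _finished(self, future, on_done):
        with self._lock:
            self._in_flight -= 1      # free the slot first
        self._launch_ready()          # start queued work, if any
        on_done(future.result())      # then report the result

results = []
results_lock = threading.Lock()
all_done = threading.Event()

def collect(value):
    with results_lock:
        results.append(value)
        if len(results) == 8:
            all_done.set()

with ThreadPoolExecutor(max_workers=4) as pool:
    dispatcher = AsyncDispatcher(pool, max_in_flight=2)
    for tile in range(8):
        dispatcher.submit(lambda t: t * t, tile, collect)
    all_done.wait(timeout=5)
```

The submitting thread never waits for a result. Queued work is launched from the completion callback, and the in-flight cap bounds how many buffers would need to be alive at once.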


CAVEATS OF ASYNCHRONOUS OPENCL PROCESSING

High Buffer Usage:

‒ Each kernel that runs needs input, output, and possibly scratch buffers.

‒ Buffers must “stick around” until the kernels complete

‒ Multiple chains of kernels are needed to keep the GPU busy

[Figure: a chain of Kernels 1 through 5, linked by the Buffers each kernel reads and writes]

Processing one 512 x 512 image requires multiple 3 MB buffers resident in device memory (VRAM)
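The 3 MB figure can be reproduced with simple arithmetic under one plausible pixel layout. The slide does not state the format, so three channels of 4-byte floats is an assumption here, as are the chain length and the number of chains in flight used in the footprint estimate.

```python
def buffer_bytes(width, height, channels=3, bytes_per_channel=4):
    """Size of one image buffer; the 3 x float32 layout is an assumption."""
    return width * height * channels * bytes_per_channel

tile_mib = buffer_bytes(512, 512) / (1024 * 1024)

def chain_footprint_mib(kernels_per_chain, chains_in_flight):
    """Rough device-memory estimate, ignoring scratch buffers.

    Assumes one buffer between each pair of kernels plus the chain's
    input and output, i.e. kernels_per_chain + 1 buffers per chain.
    """
    buffers_per_chain = kernels_per_chain + 1
    return buffers_per_chain * chains_in_flight * tile_mib

# Hypothetical load: four chains of five kernels each in flight at once.
estimate = chain_footprint_mib(kernels_per_chain=5, chains_in_flight=4)
```

Under these assumptions one buffer is exactly 3 MiB, and a modest amount of concurrent work already occupies tens of megabytes of device memory before any scratch space is counted.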


CAVEATS OF ASYNCHRONOUS OPENCL PROCESSING – CONTINUED

Dependencies Must Be Resolved in Advance:

‒ For best performance, all kernels in a chain should be enqueued together

‒ The state of all dependencies must be known before the first kernel is queued

‒ Difficult to track

‒ Compromise: only use OpenCL for Filters with simple linear dependencies

Kernel chaining and asynchronous execution provide excellent GPU utilization.


OpenCL Challenges


LARGE RADIUS IMAGE FILTERS

Several image processing operations require neighbor pixels. In AfterShot, image Filters fall into one of two categories:

Normal: only requires the local Tile

Large Radius: requires multiple Tiles


LARGE RADIUS IMAGE FILTERS ARE DIFFICULT

Large Radius AfterShot Filters are particularly difficult to implement in OpenCL

Large Radius filters will “break” kernel chaining

An extra layer of Intermediate Tiles must be resident, which will:

‒ Exhaust Device Memory, or

‒ Cause excessive bus transfers, hurting performance
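One way to picture how a Large Radius filter "breaks" kernel chaining is to split a filter pipeline into runs that could be enqueued together. The sketch below is an illustration under assumptions, not AfterShot's logic: the filter names and their Normal or Large Radius classification are hypothetical, apart from noise reduction and sharpening, which the deck lists as filters that need a radius of pixels.

```python
def split_into_chains(filters):
    """Group consecutive Normal filters; each Large Radius filter stands alone.

    `filters` is a list of (name, is_large_radius) pairs in pipeline order.
    """
    chains, current = [], []
    for name, large_radius in filters:
        if large_radius:
            if current:
                chains.append(current)   # close the chain before the break
                current = []
            chains.append([name])        # needs neighbor tiles first
        else:
            current.append(name)
    if current:
        chains.append(current)
    return chains

pipeline = [("white_balance", False), ("exposure", False),
            ("noise_reduction", True), ("curves", False),
            ("sharpening", True)]
chains = split_into_chains(pipeline)
```

Every boundary between chains is a point where intermediate Tiles for the neighbors must exist before work can continue, which is where the memory and transfer costs listed above come from.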

And the solution is…


LARGE RADIUS FILTERS - NO

Don’t do it.

Large Radius filters are possible but at great development cost

Performance would ultimately depend on tricky optimizations

Large radius filters were left to run on the CPU


AFTERSHOT OPENCL RESULTS

Approximately 70% of image processing work was moved off of the CPU cores*

Batch processing speed improved by 3.5x*

Maintains 100% utilization on 8 CPU cores*

Only a mid-level GPU is required

Supported on Windows, Linux, and OS X

AfterShot Pro with OpenCL was a success

*measured on developer’s system


OpenCL 2.0 SVM


OPENCL 2.0 SHARED VIRTUAL MEMORY

OpenCL 2.0 introduces Shared Virtual Memory (SVM)

Basic [Coarse Grain] SVM

‒ Host and kernels can share pointers

Advanced [Fine Grain] SVM is available on some hardware

‒ Host and kernels can operate concurrently on the same memory

Fine Grain System SVM

‒ Kernels can access the entire host process’ address space. Kernels can read or write malloc buffers

‒ System SVM can greatly simplify buffer management in an OpenCL application


AfterShot Redux


RECONSIDERING LARGE RADIUS FILTERS

Large Radius OpenCL filters were dropped as an AfterShot feature. The reasons were both technical and resource related

Can System SVM make Large Radius AfterShot filters feasible? Signs point to yes:

‒ No Device Memory required for Intermediate buffers

‒ Input streams from SVM, no buffer transfers

‒ Behavior more in line with Software [non-OpenCL] filters

‒ Dependencies could be resolved just as they would for a Software filter


LOCAL CONTRAST – A LARGE RADIUS AFTERSHOT FILTER

The next version of AfterShot Pro will contain a new Local Contrast filter.

‒ GPU accelerated on systems with OpenCL and SVM.

‒ Increases image contrast in detailed areas while leaving large constant areas unchanged

‒ The effect is achieved through a large radius Unsharp Mask (10-20% of the overall image width)
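The Unsharp Mask idea behind Local Contrast can be sketched in a few lines: blur the image with a large radius, then push each pixel away from its blurred value. This is a simplified CPU-side illustration, not the shipped filter. It works on a single row of pixels, uses a box blur, and takes a hypothetical `amount` parameter and a 15% radius, none of which come from the slides beyond the stated 10-20% range.

```python
def box_blur(row, radius):
    """Mean of each pixel's neighborhood, with the window clamped at the ends."""
    n = len(row)
    out = []
    for i in range(n):
        window = row[max(0, i - radius):min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def local_contrast(row, radius_fraction=0.15, amount=0.5):
    """Unsharp mask: out = pixel + amount * (pixel - blurred pixel)."""
    radius = max(1, int(len(row) * radius_fraction))
    blurred = box_blur(row, radius)
    return [p + amount * (p - b) for p, b in zip(row, blurred)]

flat = local_contrast([100.0] * 20)                 # a constant area
edge = local_contrast([50.0] * 10 + [150.0] * 10)   # a step between two areas
```

In the constant row the blur equals the input, so nothing changes. Around the step, the dark side gets darker and the bright side gets brighter, while pixels far from the step keep their original values.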


SETTING UP A KERNEL TO USE SVM MEMORY


LOADING SVM MEMORY FROM INSIDE THE KERNEL


LOCAL CONTRAST RESULTS

System SVM simplified Local Contrast:

‒ No complicated buffer management

‒ No clever optimizations were required to hide Device memory transfers

‒ Additional memory pressure is similar to a software filter

Performance is good. The OpenCL code runs in ¼ the time of the optimized software filter*

*measured on developer’s system


THANK YOU

Questions


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos. Other names are for informational purposes only and may be trademarks of their respective owners.