
How to leverage Multicore Architecture for Compute Intensive Applications – FTF-SDS-F0598

Huang Yun

Wind River Confidential – NDA Disclosure.

Agenda

Freescale hardware

– QorIQ

– i.MX6

– SMP/AMP

Multi-core Software Architectures

– MCAPI/MRAPI

– OpenMP

– OpenCL

– Cilk/Cilk++

– Proprietary


QorIQ


QorIQ T4240 CPU architecture

12 CPU cores – e6500, 64-bit

– 1.8 GHz at 1 V

– Dual-threaded, for 24 hardware threads in total

– Hardware virtualization

L1 cache shared between the two threads of each core

L2 cache shared by each cluster of 4 cores

i.MX6 Quad

4 CPU cores – ARM Cortex-A9, 32-bit

– 1.2 GHz

L1 cache: 32 KB instruction & 32 KB data per core

L2 cache: 1 MB, shared

Hardware graphics accelerator

– OpenGL & OpenCL capable

Maximizing Multi-core Benefits

Multi-core platforms can deliver higher performance at lower power, but that outcome is not guaranteed.

Successfully mapping a single-core application onto a multi-core architecture is a journey challenged by what you don’t know…

Not all operating environments are created equal when it comes to configuration options that yield maximum performance for your specific applications on your chosen multi-core platform.

Single to Multi-Core

[Diagram: three single-core systems, each running its own OS and application (App 1/Sub 1, App 2/Sub 2, App 3/Sub 3), are consolidated onto one multi-core system in which a single OS on cores 1–8 hosts App 1 through App 8.]

Multi-Core Architecture: SMP

Symmetric multiprocessing (SMP)

• Many computing resources for OS and applications to share

• Single RTOS and scheduler

• Priority-based assumptions might cause timing issues

[Diagram: a single SMP OS instance schedules App 1 through App 8 across cores 1–8.]

Best suited for

• Heavy processing tasks such as data manipulation and image processing

Not as suitable for

• Hard real-time response requirements (a core-affinity mitigation is sketched below)
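One common mitigation for such timing issues is to pin a latency-sensitive thread to a dedicated core so the SMP scheduler cannot migrate it. A minimal Linux-specific sketch (pthread_setaffinity_np is a GNU extension; an RTOS such as VxWorks exposes its own affinity API):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core so the SMP scheduler cannot migrate it. */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    if (pin_to_core(3) != 0)            /* core 3 chosen arbitrarily for illustration */
        fprintf(stderr, "could not set affinity\n");
    else
        printf("Latency-sensitive work now runs only on core 3\n");
    /* ... latency-sensitive processing loop would go here ... */
    return 0;
}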

Multi-Core Architecture: uAMP

Unsupervised asymmetric multiprocessing (uAMP)

• Same or different copies of an RTOS are running on all cores in an unsupervised AMP environment

• OS and applications do not share computing resources

[Diagram: each of cores 1–8 runs its own unsupervised AMP OS instance with its own application (App 1 … App 8).]

Best suited for

• Small independent deterministic tasks

Not as suitable for

• Heavy processing tasks

Multi-Core Architecture: Mixed

SMP and uAMP

• An SMP operating system controls the first couple of cores, while the rest of the cores run unsupervised AMP images

• AMP OS instances do not have to be the same

[Diagram: an SMP OS spans cores 1 and 2 and runs App 1, while cores 3–8 each run an independent AMP OS instance with its own application (App 2 … App 8).]

Best suited for

• Consolidation that brings a mix of tasks onto one platform

Provisioning the system

System Resources (an illustrative partition map follows this list)

• CPUs – which CPUs belong to which OS domain

• Memory – where to map memory, both RAM and flash

• Interrupts – which interrupts are handled by which cores

• Devices – which devices provide connectivity to each OS
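To make those provisioning decisions concrete, here is a purely illustrative sketch in C (not a real Wind River or Freescale configuration format) of a partition map that records, per OS domain, its CPUs, memory window, interrupts, and devices:

#include <stdint.h>
#include <stdio.h>

/* Purely illustrative: one record per OS domain. All names, addresses and
   assignments are hypothetical examples, not a vendor configuration format. */
struct os_domain {
    const char *name;        /* e.g. "SMP Linux", "RTOS AMP #1"            */
    uint32_t    cpu_mask;    /* bit n set => CPU n belongs to this domain   */
    uint64_t    ram_base;    /* start of this domain's private RAM window   */
    uint64_t    ram_size;
    const char *irqs;        /* interrupt lines routed to this domain       */
    const char *devices;     /* devices owned by this domain                */
};

static const struct os_domain plan[] = {
    { "OS 0 (SMP)", 0x3, 0x00000000, 0x40000000, "timer, eth0", "eth0, uart0" },
    { "OS 1 (AMP)", 0x4, 0x40000000, 0x20000000, "timer, can0", "can0"        },
    { "OS 2 (AMP)", 0x8, 0x60000000, 0x20000000, "timer",       "uart1"       },
};

int main(void)
{
    /* Print the partition plan so it can be reviewed before bring-up */
    for (unsigned i = 0; i < sizeof plan / sizeof plan[0]; i++)
        printf("%-12s cpus=0x%x ram=0x%llx+0x%llx irqs=[%s] devs=[%s]\n",
               plan[i].name, (unsigned)plan[i].cpu_mask,
               (unsigned long long)plan[i].ram_base,
               (unsigned long long)plan[i].ram_size,
               plan[i].irqs, plan[i].devices);
    return 0;
}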

Take Full Advantage of Multicore with Multi-OS

[Diagram: CPUs 0 and 1 are assigned to OS 0 with Memory 0 and its own interrupts and devices; CPU 2 runs OS 1 with Memory 1, interrupts, and devices; CPU 3 runs OS 2 with Memory 2, interrupts, and devices. The OS domains are linked by inter-process communication.]

Inter-Process Communication

[Diagram: a GPOS and an RTOS exchange commands and data through a shared-memory pool; each OS also has private memory and its own send/receive path.]

IPC options

• Proprietary

• Roll your own (a minimal shared-memory mailbox sketch follows this list)

• Use MCAPI / MRAPI

System resources involved

• Interrupts

• Shared memory
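For the “roll your own” option, a minimal mailbox sketch is shown below. The base address, record layout, and polling protocol are hypothetical; a real design would add cache management, memory barriers, and a doorbell interrupt instead of polling:

#include <stdint.h>

/* Hypothetical shared RAM window visible to both OS instances */
#define SHMEM_BASE  0x70000000UL

struct mailbox {
    volatile uint32_t owner;     /* 0 = empty, 1 = written by GPOS           */
    volatile uint32_t command;
    volatile uint32_t length;
    volatile uint8_t  data[240];
};

static struct mailbox *const mbox = (struct mailbox *)SHMEM_BASE;

/* Sender side: wait until the slot is free, fill it, then publish. */
void mbox_send(uint32_t cmd, const uint8_t *buf, uint32_t len)
{
    while (mbox->owner != 0)
        ;                        /* spin until the other side has consumed it */
    for (uint32_t i = 0; i < len; i++)
        mbox->data[i] = buf[i];
    mbox->command = cmd;
    mbox->length  = len;
    mbox->owner   = 1;           /* publish last (needs a barrier on real HW)  */
}

/* Receiver side: poll for a message, copy it out, then release the slot. */
uint32_t mbox_receive(uint8_t *buf, uint32_t max)
{
    while (mbox->owner != 1)
        ;
    uint32_t len = mbox->length < max ? mbox->length : max;
    for (uint32_t i = 0; i < len; i++)
        buf[i] = mbox->data[i];
    mbox->owner = 0;             /* hand the slot back */
    return len;
}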


MCAPI / MRAPI


MCAPI (Multicore Communications API)

• Node: a CPU, OS, or process/thread instance

• Endpoint: connected or connectionless

• Channel: scalar or datagram

Sequence of events (a minimal sender-side sketch follows this list)

1. Define topology

- Nodes

- Endpoints

2. Create channels

- Connected

- Connectionless

3. Send/receive data
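The sequence above maps onto a handful of MCAPI calls. The sketch below assumes an MCAPI 2.0-style implementation, and the domain, node, and port numbers are arbitrary examples; exact argument lists vary between MCAPI versions and vendors, so check your implementation’s mcapi.h:

#include <mcapi.h>        /* supplied by the MCAPI implementation */

/* Arbitrary example topology values */
#define MY_DOMAIN   0
#define MY_NODE     1     /* this OS instance            */
#define PEER_NODE   2     /* the OS instance we talk to  */
#define MY_PORT     10
#define PEER_PORT   20

void send_hello(void)
{
    mcapi_status_t   status;
    mcapi_info_t     info;
    mcapi_endpoint_t local, remote;
    char             msg[] = "hello from node 1";

    /* 1. Define topology: join the MCAPI domain as a node */
    mcapi_initialize(MY_DOMAIN, MY_NODE, NULL, NULL, &info, &status);

    /* 2. Create / look up endpoints (connectionless in this sketch) */
    local  = mcapi_endpoint_create(MY_PORT, &status);
    remote = mcapi_endpoint_get(MY_DOMAIN, PEER_NODE, PEER_PORT,
                                MCAPI_TIMEOUT_INFINITE, &status);

    /* 3. Send data as a datagram-style message */
    mcapi_msg_send(local, remote, msg, sizeof msg, 0 /* priority */, &status);

    mcapi_finalize(&status);
}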

MRAPI (Multicore Resource API)

• Shared memory

• Shared semaphores

• Interrupts


OpenMP


• Shared memory between compute threads

• Can use Pthreads underneath

PI formula in C – single-threaded (the “Hello World” of parallel programming)

#include <stdio.h>

static long num_steps = 100000;
double step;

int main()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf(" PI is %f\n", pi);
    return 0;
}

PI formula in C - OpenMP

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;
double step;

#define PAD 8                 /* pad each partial sum to its own cache line (8 doubles = 64 bytes) */
static int num_threads = 4;
static long thrd_step;

int main()
{
    double pi = 0.0;
    double sum = 0.0;
    double my_sum[num_threads][PAD];
    int j;
    double start_time;
    double end_time;

    step = 1.0/(double) num_steps;
    thrd_step = num_steps / num_threads;
    omp_set_num_threads(num_threads);

    for (j = 0; j < num_threads; j++)
        my_sum[j][0] = 0.0;

    start_time = omp_get_wtime();
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int i;
        double x;
        int startat = ID * thrd_step;
        for (i = startat; i < startat + thrd_step; i++)
        {
            x = (i+0.5)*step;
            my_sum[ID][0] += 4.0/(1.0+x*x);
        }
    } // end of parallel region

    for (j = 0; j < num_threads; j++)
        sum += my_sum[j][0];
    pi = step * sum;
    end_time = omp_get_wtime();
    printf(" PI is %f (computed in %f s)\n", pi, end_time - start_time);
    return 0;
}
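For comparison, the same computation can be written with OpenMP’s built-in reduction clause, which removes the need for the padded per-thread array above; a minimal sketch:

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main()
{
    double step = 1.0/(double) num_steps;
    double sum = 0.0;
    long i;

    /* OpenMP gives each thread a private copy of sum and combines them at the end */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++)
    {
        double x = (i + 0.5) * step;
        sum += 4.0/(1.0 + x*x);
    }

    printf(" PI is %f\n", step * sum);
    return 0;
}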

OpenCL

OpenCL – Open Computing Language

• Khronos standard, originally started by Apple

• Used with symmetric cores

• Used with GPGPUs (general-purpose GPUs)

• Needs OpenCL drivers for the GPU

PI formula in OpenCL - C

int main(void)
{
    char *kernelsource = getKernelSource("../pi_ocl.cl"); // Kernel source
    cl_int err;
    cl_device_id device_id;       // compute device id
    cl_context context;           // compute context
    cl_command_queue commands;    // compute command queue
    cl_program program;           // compute program
    cl_kernel kernel_pi;          // compute kernel

    // Set up OpenCL context, queue, kernel, etc.
    cl_uint numPlatforms;         // Find number of platforms
    err = clGetPlatformIDs(0, NULL, &numPlatforms);

    // Get all platforms
    cl_platform_id Platform[numPlatforms];
    err = clGetPlatformIDs(numPlatforms, Platform, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c

PI formula in OpenCL - C

    // Secure a device (DEVICE is a device-type macro defined elsewhere in the exercise sources)
    for (int i = 0; i < numPlatforms; i++)
    {
        err = clGetDeviceIDs(Platform[i], DEVICE, 1, &device_id, NULL);
    }

    // Output information
    err = output_device_info(device_id);

    // Create a compute context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    commands = clCreateCommandQueue(context, device_id, 0, &err);

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1, (const char **) &kernelsource, NULL, &err);

    // Build the program
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c

PI formula in OpenCL - C

    // Create the compute kernel from the program
    kernel_pi = clCreateKernel(program, "pi", &err);

    // Find kernel work-group size
    err = clGetKernelWorkGroupInfo(kernel_pi, device_id,
            CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &work_group_size, NULL);

    // Now that we know the size of the work-groups, we can set the number of
    // work-groups, the actual number of steps, and the step size
    nwork_groups = in_nsteps/(work_group_size*niters);
    if (nwork_groups < 1)
    {
        err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(size_t), &nwork_groups, NULL);
        work_group_size = in_nsteps / (nwork_groups * niters);
    }
    nsteps = work_group_size * niters * nwork_groups;

    d_partial_sums = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) *
            nwork_groups, NULL, &err);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c

PI formula in OpenCL - C

    // Set kernel arguments
    err = clSetKernelArg(kernel_pi, 0, sizeof(int), &niters);
    err |= clSetKernelArg(kernel_pi, 1, sizeof(float), &step_size);
    err |= clSetKernelArg(kernel_pi, 2, sizeof(float) * work_group_size, NULL);

    // Execute the kernel over the entire range of our 1D input data set
    // using the maximum number of work items for this device
    size_t global = nwork_groups * work_group_size;
    size_t local = work_group_size;
    double rtime = wtime();
    err = clEnqueueNDRangeKernel(commands, kernel_pi, 1, NULL,
            &global, &local, 0, NULL, NULL);
    if (err != CL_SUCCESS)
        ...

    err = clEnqueueReadBuffer(commands, d_partial_sums, CL_TRUE,
            0, sizeof(float) * nwork_groups, h_psum, 0, NULL, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c

PI formula in OpenCL - kernel

__kernel void pi(const int niters, const float step_size,
                 __local float* local_sums, __global float* partial_sums)
{
    int num_wrk_items = get_local_size(0);
    int local_id = get_local_id(0);
    int group_id = get_group_id(0);
    float x, accum = 0.0f;
    int i, istart, iend;

    istart = (group_id * num_wrk_items + local_id) * niters;
    iend = istart + niters;
    for (i = istart; i < iend; i++)
    {
        x = (i+0.5f)*step_size;
        accum += 4.0f/(1.0f+x*x);
    }
    local_sums[local_id] = accum;
    barrier(CLK_LOCAL_MEM_FENCE);
    reduce(local_sums, partial_sums);
}
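The kernel calls a reduce() helper that lives in the same pi_ocl.cl file but did not fit on the slide. A sketch of such a work-group reduction, consistent with the linked source, in which the first work-item of each group sums that group’s local results:

// Sketch of the reduce() helper: one work-item per work-group serially sums
// the group's local results into the global partial_sums buffer.
void reduce(__local float* local_sums, __global float* partial_sums)
{
    int num_wrk_items = get_local_size(0);
    int local_id      = get_local_id(0);
    int group_id      = get_group_id(0);

    if (local_id == 0)
    {
        float sum = 0.0f;
        for (int i = 0; i < num_wrk_items; i++)
            sum += local_sums[i];
        partial_sums[group_id] = sum;   // one partial sum per work-group
    }
}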

OpenCL Communication

[Diagram: the RTOS (on the CPU) and the GPU exchange data through a shared-memory pool, with separate send and receive paths.]

• Data is presented in shared memory

• The compute kernel is loaded onto the GPU by the CPU

• The GPU has its own local memory as well as access to global shared memory

High Performance Computer

https://community.freescale.com/docs/DOC-94464

Mini-HPC

• System

– 4 × i.MX6 Quad at 1.2 GHz

– Uses the CPU + GPU

• Hardware (per node)

– 4 × 1.2 GHz Cortex-A9

– 1 Vivante GC2000 GPU

– 1 GB RAM

– 8 GB SD

– 100 Mbit Ethernet via USB

• Software

– Ubuntu 11.10, Linaro Linux

– OpenCL driver: Vivante GC2000

– GCC 4.6.1

– MPI parallel compute

• Results

– 100 GFLOPS

– 15 watts

Cilk/Cilk++

• Shared memory between compute threads

• Needs compiler support (a minimal sketch follows below)
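As a flavour of the model, a minimal sketch assuming a Cilk Plus-capable compiler (for example GCC built with -fcilkplus, or the Intel compiler); cilk_spawn marks a call that the work-stealing runtime may run on another core, and cilk_sync waits for spawned work:

#include <stdio.h>
#include <cilk/cilk.h>

/* Naive Fibonacci: each spawned call may be stolen by another worker thread */
long fib(int n)
{
    if (n < 2)
        return n;
    long a = cilk_spawn fib(n - 1);   /* may run in parallel          */
    long b = fib(n - 2);              /* runs in the parent strand    */
    cilk_sync;                        /* wait for the spawned call    */
    return a + b;
}

int main(void)
{
    printf("fib(30) = %ld\n", fib(30));
    return 0;
}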

Conclusion

Compute-intensive applications are here

• SMP / AMP are very different approaches

• A hybrid may help to optimize system performance

• MCAPI/MRAPI – good for AMP communication between OS instances

• Proprietary – similar to MCAPI but dependent on the provider

• OpenMP – easiest to implement; good for SMP

• OpenCL – high performance; needs tuning

• Cilk/Cilk++ – early days for PowerPC/ARM; stay tuned

Contact Us

To learn more, visit Wind River at: http://www.windriver.com

Email: inquiries-ap-china@windriver.com

Wind River on Sina Weibo: @Wind River, http://weibo.com/windriverchina

Beijing Office Tel:010-84777100

Shanghai Office Tel:021-63585586/87/89/90

Shenzhen Office Tel:0755-25333408/3418/4508/4518

Xi’an Office Tel:029-87607208

Chengdu Office Tel:028-65318000
