it4innovations, vŠb – technical university of ostrava ... · engine with support for interactive...

44
IT4Innovations, VŠB – Technical University of Ostrava, Ostrava, Czech Republic Milan Jaroš, Petr Strakoš, Lubomír Říha

Upload: others

Post on 25-Jan-2021

4 views

Category:

Documents


0 download

TRANSCRIPT

  • IT4Innovations, VŠB – Technical University of Ostrava, Ostrava, Czech Republic

    Milan Jaroš, Petr Strakoš, Lubomír Říha

  • 2

    Outline

    Part I.

    Blender Cycles introduction

    Acceleration Structures for Path-tracing

    CyclesPhi

    Part II.

    Intel® AVX-512

    Embree integration

    Benchmarks

  • 3

    Blender is an open source 3D creation suite. It has five render engines in the new version 2.8: Blender Internal, Blender Game, Clay, Eevee and Cycles.

    Cycles is a path-tracing based render engine with support for interactive rendering, shading node system, and texture workflow.

    The Daily Dweebs

    Blender & Cycles Render

  • 4

    Cycles is internal plugin (Extending Python w. C++)

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI

    Starting the rendering plugin

  • 5

    Cycles is internal plugin (Extending Python w. C++)

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI

    1.) Write a C++ function for renderingStarting the rendering plugin

  • //blender/intern/cycles/blender/blender_python.cppstatic PyObject *render_func(PyObject * /*self*/, PyObject*value) {BlenderSession *session =

    (BlenderSession*)PyLong_AsVoidPtr(value);//...session->render();//...

    6

    Cycles is internal plugin (Extending Python w. C++)

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI

    2.) Write a Python-callable function

    1.) Write a C++ function for renderingStarting the rendering plugin

  • //blender/intern/cycles/blender/blender_python.cppstatic PyObject *render_func(PyObject * /*self*/, PyObject*value) {BlenderSession *session =

    (BlenderSession*)PyLong_AsVoidPtr(value);//...session->render();//...

    //intern/cycles/blender/blender_python.cppstatic struct PyModuleDef module = {PyModuleDef_HEAD_INIT,"_cycles","Blender cycles render integration", -1,methods,NULL, NULL, NULL, NULL

    };

    7

    Cycles is internal plugin (Extending Python w. C++)

    //intern/cycles/blender/blender_python.cppstatic PyMethodDef methods[] = {

    //...{ "render", render_func, METH_O, "" },{ "bake", bake_func, METH_VARARGS, "" },{ "draw", draw_func, METH_VARARGS, "" },//...{ NULL, NULL, 0, NULL },

    };

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI

    2.) Write a Python-callable function

    3.) Register this function within a module’s symbol table

    1.) Write a C++ function for renderingStarting the rendering plugin

  • //blender/intern/cycles/blender/blender_python.cppstatic PyObject *render_func(PyObject * /*self*/, PyObject*value) {BlenderSession *session =

    (BlenderSession*)PyLong_AsVoidPtr(value);//...session->render();//...

    //intern/cycles/blender/blender_python.cppstatic struct PyModuleDef module = {PyModuleDef_HEAD_INIT,"_cycles","Blender cycles render integration", -1,methods,NULL, NULL, NULL, NULL

    };

    8

    Cycles is internal plugin (Extending Python w. C++)

    //intern/cycles/blender/blender_python.cppstatic PyMethodDef methods[] = {

    //...{ "render", render_func, METH_O, "" },{ "bake", bake_func, METH_VARARGS, "" },{ "draw", draw_func, METH_VARARGS, "" },//...{ NULL, NULL, 0, NULL },

    };

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI//intern/cycles/blender/blender_python.cppvoid *CCL_initPython() {PyObject *mod =

    PyModule_Create(&ccl::module);//...return (void*)mod;

    2.) Write a Python-callable function

    3.) Register this function within a module’s symbol table

    1.) Write a C++ function for renderingStarting the rendering plugin

    4.) Write an init function for the module

  • //blender/intern/cycles/blender/blender_python.cppstatic PyObject *render_func(PyObject * /*self*/, PyObject*value) {BlenderSession *session =

    (BlenderSession*)PyLong_AsVoidPtr(value);//...session->render();//...

    //intern/cycles/blender/blender_python.cppstatic struct PyModuleDef module = {PyModuleDef_HEAD_INIT,"_cycles","Blender cycles render integration", -1,methods,NULL, NULL, NULL, NULL

    };

    9

    Cycles is internal plugin (Extending Python w. C++)

    //intern/cycles/blender/blender_python.cppstatic PyMethodDef methods[] = {

    //...{ "render", render_func, METH_O, "" },{ "bake", bake_func, METH_VARARGS, "" },{ "draw", draw_func, METH_VARARGS, "" },//...{ NULL, NULL, 0, NULL },

    };

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI//intern/cycles/blender/blender_python.cppvoid *CCL_initPython() {PyObject *mod =

    PyModule_Create(&ccl::module);//...return (void*)mod;

    2.) Write a Python-callable function

    3.) Register this function within a module’s symbol table

    1.) Write a C++ function for renderingStarting the rendering plugin

    4.) Write an init function for the module

  • //blender/intern/cycles/blender/blender_python.cppstatic PyObject *render_func(PyObject * /*self*/, PyObject*value) {BlenderSession *session =

    (BlenderSession*)PyLong_AsVoidPtr(value);//...session->render();//...

    //intern/cycles/blender/blender_python.cppstatic struct PyModuleDef module = {PyModuleDef_HEAD_INIT,"_cycles","Blender cycles render integration", -1,methods,NULL, NULL, NULL, NULL

    };

    10

    Cycles is internal plugin (Extending Python w. C++)

    //intern/cycles/blender/blender_python.cppstatic PyMethodDef methods[] = {

    //...{ "render", render_func, METH_O, "" },{ "bake", bake_func, METH_VARARGS, "" },{ "draw", draw_func, METH_VARARGS, "" },//...{ NULL, NULL, 0, NULL },

    };

    //blender/intern/cycles/blender/blender_session.cppvoid BlenderSession::render() {//...BL::RenderSettings r = b_scene.render();//...

    More at: http://intermediate-and-advanced-software-carpentry.readthedocs.io/en/latest/c++-wrapping.html

    //intern/cycles/blender/addon/engine.pydef render(engine) :

    import _cyclesif hasattr(engine, "session") :_cycles.render(engine.session)

    //source/blender/python/intern/bpy_interface.cstatic struct _inittab bpy_internal_modules[] = {{ "mathutils", PyInit_mathutils },//...{ "_cycles", CCL_initPython },//...{ NULL, NULL }};

    Blender start – register C++ modules into Python

    Blender GUI//intern/cycles/blender/blender_python.cppvoid *CCL_initPython() {PyObject *mod =

    PyModule_Create(&ccl::module);//...return (void*)mod;

    2.) Write a Python-callable function

    3.) Register this function within a module’s symbol table

    1.) Write a C++ function for renderingStarting the rendering plugin

    4.) Write an init function for the module

  • Path-tracing (unbiased vs. biased)Acceleration methods: Adaptive rendering (unbiased->biased) Filtering (unbiased->biased) Acceleration structures

    (BVH, kd-tree, …) Distributed rendering

    11

    0

    1

    light

    object

    Samples

    1.1

    0.76

    0.51

    0.16

    0.84

    0.620.665

    Pixel ResultSamples

    The Cycles Encyclopedia

  • 12

    Acceleration Structures

    Space partitioning octree kd-tree (binary tree)

    Bounding Volume Hierarchy (BVH) BVH2 BVH4 BVH8

    https://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/

  • 13

    Bounding Volume Hierarchy (BVH)The raytracing acceleration structure used is a bounding volume hierarchy. The original code is based on an implementation from Nvidia (BVH2) and Embree(Hair BVH builder - oriented SAH, BVH4 traversal, BVH nodes intersection, Triangles intersection). BVH2 ( wo, SSE ) BVH4 ( SSE2 ) BVH8 ( AVX(2) ) BVH16 (AVX-512) – in progress EMBREE – BVH4 ( SSE ) EMBREE – BVH8 ( AVX(2), AVX-512 )

  • 14

    Dynamic BVH (interactive) allows you to move objects around in the scene with quick viewport updates in rendered mode. But the image will take longer to clear up.

    Static BVH (offline) will create a completely new BVH tree whenever an object is moved around but the noise will clear up faster.

    Bounding Volume Hierarchy (BVH)

    The Daily Dweebs

    The Daily Dweebs

  • 15

    IT4Innovations

    Salomon 1008 nodes 2x Intel Xeon E5-2680v3, 2.5 GHz, 12 cores

    (Hasswell) 128GB InfiniBand FDR56 / 7D Enhanced hypercube 576 nodes w/o accelerator 432 nodes with 2x Intel Xeon Phi 7120P, 61

    cores, 16 GB RAM

    Anselm 209 nodes 2x Intel Xeon E5-2665, 2.4 GHz, 8 cores

    (Sandy Bridge) 64GB InfiniBand FDR56 / 7D Enhanced hypercube 180 nodes w/o accelerator 4 nodes with MIC accelerator (1x Intel Xeon

    Phi 5110P) 23 nodes with NVIDIA Kepler K20

  • 16

    CyclesPhi

    We have modified the kernel of the Blender Cycles rendering engine and then extended its capabilities to support the HPC environment. We call this version the CyclesPhi and it supports following technologies:

    OpenMP

    MPI

    Sockets

    Intel® Xeon Phi™ with Symmetric mode (KNC, KNL)

    Intel® Xeon and Intel® Xeon Phi™ with Intel® AVX-512

  • 17

    MPI, Hybrid and Native mode

    Blender

    MPI

    MPI mode (KNC, KNL, Skylake)

    CPU

    CPU

    Blender client

    OpenMP

    MIC

    Blender client

    OpenMP

    Native mode (KNL, Skylake)

    BlenderBlender

    MPI

    Hybrid mode (KNC, KNL, Skylake)

    CPU

    MIC/

    CPU

    Blender client

    MICBlender

    client

    OpenMP

    Rank0

    Rank1 Rank2

    CPUBlender

    client

    OpenMPRank0

    Rank1

    Rank2

    Sockets (TurboJPEG)

    Server

    Remote Client

    Cluster Cluster

    Distributed rendering Single node rendering

    Easier debugging

    Easier profiling

    Sockets over SSH

  • 18

    KernelDatacam, background,

    integrator (emission,bounces, sampler),

    ...

    const_copy_to

    buffer, rng_state

    mem_allocmem_freemem_copy_to

    tex_alloctex_free

    KernelTexturesbvh, objects, triangles,

    lights, particles,sobol_directions, texture_images,

    ...

    7D Enhanced hypercube Infiniband network

    Sync Tag Update

    Device Update

    Blender Thread

    Session Thread Device Threads

    BlenderScene Scene

    Socket Device

    BcastGathervScatterv

    Blender

    CommunicationLoop (1 thread)

    NODE

    Render loop

    CommunicationLoop (1 thread)

    MPI rank1

    OpenMP

    1 23

    4

    7

    8NODE

    Communication

    Socket serverMPI rank0

    Socket client

    Send KernelData (ex. camera properties) to Socket server and from there to all nodes with Bcast.Send KernelTextures (ex. triangles) to Socket server and from there to all nodes with Bcast.Send the information about the size of buffer (rendered pixels) and size of rng_state (random number generator state) to Socket server and from there to all nodes with Bcast.Send the initial values of rng_state to Socket server and from there to all nodes with Bcast.Start rendering in Socket client and distribute the command via Socket server by Bcast message.

    1

    2

    3

    4

    5

    6

    7

    8

    Read the current results from node and send the buffer to root with Gatherv. Compress the results by JPEG and send it to Socket client.Send new jobs to Socket server and from there to clients with Scatterv.View results in Blender.

    NODE

    Render loop

    CommunicationLoop (1 thread)

    MPI rank2

    OpenMP

    Sockets

    6 6

    5123

    4

    6

    7

    5 6

    Rendering using Hybrid mode

  • 19

    Outline

    Part I.

    Blender Cycles introduction

    Acceleration Structures for Path-tracing

    CyclesPhi

    Part II.

    Intel® AVX-512

    Embree integration

    Benchmarks

  • 20

    AVX-512 Vectorization and Intrinsics Instructions

    AVX-512 = Intel® Advanced Vector Extensions 512

    KNL and Skylake support AVX-512

    KNC supports many instructions from AVX-512

    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

  • 21

    AVX-512 Vectorization and Intrinsics Instructions

    AVX-512 = Intel® Advanced Vector Extensions 512

    KNL and Skylake support AVX-512

    KNC supports many instructions from AVX-512

    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

    float dst[16], a[16], b[16];// The intialization of a, b// ...for (int j = 0; j < 16; j++) {

    dst[j] = a[j] * b[j];}

  • 22

    AVX-512 Vectorization and Intrinsics Instructions

    AVX-512 = Intel® Advanced Vector Extensions 512

    KNL and Skylake support AVX-512

    KNC supports many instructions from AVX-512

    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

    float dst[16], a[16], b[16];// The intialization of a, b// ...for (int j = 0; j < 16; j++) {

    dst[j] = a[j] * b[j];}

    __m512 dst, a, b;// The intialization of a, b// ...dst = _mm512_mul_ps(a, b);

  • 23

    AVX-512 Vectorization and Intrinsics Instructions

    AVX-512 = Intel® Advanced Vector Extensions 512

    KNL and Skylake support AVX-512

    KNC supports many instructions from AVX-512

    https://software.intel.com/sites/landingpage/IntrinsicsGuide/#

    float dst[16], a[16], b[16];// The intialization of a, b// ...for (int j = 0; j < 16; j++) {

    dst[j] = a[j] * b[j];}

    __m512 dst, a, b;// The intialization of a, b// ...dst = _mm512_mul_ps(a, b);

    struct avx512f {union {

    __m512 v;float f[16];

    };__forceinline avx512f operator*(

    const avx512f& a, const avx512f& b) {return _mm512_mul_ps(a, b);

    }}

    avx512f dst, a, b;// The intialization of a, b// ...dst = a * b;

  • 24

    API from Embree: avx512f, avx512i, avx512b

    /* 16-wide AVX-512 bool type */struct avx512b{enum { size = 16 }; // number of SIMD elements__mmask16 v; // data

    }

    /* 16-wide AVX-512 float type */struct avx512f{enum { size = 16 }; // number of SIMD elementsunion { // data__m512 v;float f[16];int i[16];

    }; }

    /* 16-wide AVX-512 integer type */struct avx512i{enum { size = 16 }; // number of SIMD elementsunion { // data__m512i v;int i[16];

    }; }

    • Constructors, Assignment & Cast Operators• Loads and Stores• Constants• Array Access• Unary Operators• Binary Operators• Ternary Operators• Operators with rounding• Assignment Operators• Comparison Operators + Select• Rounding Functions• Movement/Shifting/Shuffling Functions• Reductions• Memory load and store operations

  • 25

    Example of AVX-512 Vectorization: Ray-Triangle Intersection/* Calculate vertices relative to ray origin. */const float3 v0 = tri_c - ray_P;const float3 v1 = tri_a - ray_P;const float3 v2 = tri_b - ray_P;

    /* Calculate triangle edges. */const float3 e0 = v2 - v0;const float3 e1 = v0 - v1;const float3 e2 = v1 - v2;

    /* Perform edge tests. */const float U = dot(cross(v2 + v0, e0), ray_dir);const float V = dot(cross(v0 + v1, e1), ray_dir);const float W = dot(cross(v1 + v2, e2), ray_dir);

    const float minUVW = min(U, min(V, W));const float maxUVW = max(U, max(V, W));

    if(minUVW < 0.0f && maxUVW > 0.0f) return false;

    /* ... */*isect_u = U * inv_den;*isect_v = V * inv_den;*isect_t = T * inv_den;

    avx512f P_512(ray_P.x, ray_P.y, ray_P.z, ray_P.w);avx512f dir_512(ray_dir.x, ray_dir.y, ray_dir.z, ray_dir.w);

    avx512f tri_abc_512 = loadu(float_verts);avx512f tri_abc_512_shuff = shuffle4(tri_abc_512);

    /* Calculate vertices relative to ray origin. */avx512f v012_512 = tri_abc_512_shuff - P_512;avx512f v012_512_shuff = shuffle4(v012_512);

    /* Calculate triangle edges. */avx512f e012_512_min = v012_512_shuff - v012_512;avx512f e012_512_plus = v012_512_shuff + v012_512;

    /* Perform edge tests. */avx512f cross_512 = lcross_xyz(e012_512_plus, e012_512_min);const avx512f UVW = ldot3_xyz(cross_512, dir_512);const float minUVW = reduce_min(UVW);const float maxUVW = reduce_max(UVW);

    if (minUVW < 0.0f && maxUVW > 0.0f) return false;

    /* ... */

  • 26

    Blender & Embree – Data Preparation

    Our modifications is based on code from Stefan Werner: https://github.com/tangent-animation/blender278/tree/master_cycles-embree

    // bvh_embree.hclass BVHEmbree : public BVH{

    RTCScene scene;RTCDevice device;//...BVHEmbree(/*...*/) : BVH(/*..*/) { device = rtcNewDevice("verbose=1"); }virtual ~BVHEmbree() { rtcDeleteScene(scene); rtcDeleteDevice(device); }

    } };

    void BVHEmbree::add_triangles(Mesh *mesh, int i, int &geom_id, int &geom_id_device) {//...geom_id = rtcNewTriangleMesh2(scene, RTC_GEOMETRY_DEFORMABLE,

    num_triangles, num_verts, num_motion_steps, i*2);

    void* raw_buffer = rtcMapBuffer(scene, geom_id, RTC_INDEX_BUFFER);unsigned *rtc_indices = (unsigned*) raw_buffer;for (size_t j = 0; j < num_triangles; j++) {

    Mesh::Triangle t = mesh->get_triangle(j);rtc_indices[j * 3] = t.v[0];//...

    }rtcUnmapBuffer(scene, geom_id, RTC_INDEX_BUFFER);update_tri_vertex_buffer(geom_id, geom_id_device, mesh);//...

    }

    // bvh_embree.cppvoid BVHEmbree::build(/*...*/) {

    // ...RTCSceneFlags flags = RTC_SCENE_HIGH_QUALITY | /*...*/ ;scene = rtcDeviceNewScene(device, flags, RTC_INTERSECT1);foreach(Object *ob, objects) { add_object(ob, i); ++i; // ... }rtcCommit(scene);

    }

    // bvh_embree.cppvoid BVHEmbree::add_object(Object *ob, int i) {

    Mesh *mesh = ob->mesh;int geom_id = 0;int geom_id_device = 0;//...

    add_triangles(mesh, i, geom_id, geom_id_device);rtcSetUserData(scene, geom_id, (void*) prim_offset);rtcSetOcclusionFilterFunction(scene, geom_id, rtc_filter_func);rtcSetMask(scene, geom_id, ob->visibility);

    }

    void BVHEmbree::add_instance(Object *ob, int i) {BVHEmbree *instance_bvh = (BVHEmbree*) (ob->mesh->bvh);int geom_id = rtcNewInstance3(scene, instance_bvh->scene,

    num_motion_steps, i*2);int geom_id_device = 0;//...rtcSetTransform2(scene, geom_id, …);rtcSetUserData(scene, geom_id, (void*) instance_bvh->scene);rtcSetMask(scene, geom_id, ob->visibility);

    }

  • 27

    Blender & Embree & MPI – Data Preparation// bvh_embree.hclass BVHEmbree : public BVH{

    RTCScene scene;RTCDevice device;//...BVHEmbree(/*...*/) : BVH(/*..*/) { device = rtcNewDevice("verbose=1"); }virtual ~BVHEmbree() { rtcDeleteScene(scene); rtcDeleteDevice(device); }

    } };

    void BVHEmbree::add_triangles(Mesh *mesh, int i, int &geom_id, int &geom_id_device) {//...geom_id = rtcNewTriangleMesh2(scene, RTC_GEOMETRY_DEFORMABLE,

    num_triangles, num_verts, num_motion_steps, i*2);

    void* raw_buffer = rtcMapBuffer(scene, geom_id, RTC_INDEX_BUFFER);unsigned *rtc_indices = (unsigned*) raw_buffer;for (size_t j = 0; j < num_triangles; j++) {

    Mesh::Triangle t = mesh->get_triangle(j);rtc_indices[j * 3] = t.v[0];//... }

    rtcUnmapBuffer(scene, geom_id, RTC_INDEX_BUFFER);update_tri_vertex_buffer(geom_id, geom_id_device, mesh);//...

    }

    // bvh_embree.cppvoid BVHEmbree::build(/*...*/) {

    // ...RTCSceneFlags flags = RTC_SCENE_HIGH_QUALITY | /*...*/ ;scene = rtcDeviceNewScene(device, flags, RTC_INTERSECT1);foreach(Object *ob, objects) { add_object(ob, i); ++i; // ... }rtcCommit(scene);

    }

    // bvh_embree.cppvoid BVHEmbree::add_object(Object *ob, int i) {

    Mesh *mesh = ob->mesh;int geom_id = 0;int geom_id_device = 0;//...

    add_triangles(mesh, i, geom_id, geom_id_device);rtcSetUserData(scene, geom_id, (void*) prim_offset);rtcSetOcclusionFilterFunction(scene, geom_id, rtc_filter_func);rtcSetMask(scene, geom_id, ob->visibility);

    }

    void BVHEmbree::add_instance(Object *ob, int i) {BVHEmbree *instance_bvh = (BVHEmbree*) (ob->mesh->bvh);int geom_id = rtcNewInstance3(scene, instance_bvh->scene,

    num_motion_steps, i*2);int geom_id_device = 0;//...rtcSetTransform2(scene, geom_id, …);rtcSetUserData(scene, geom_id, (void*) instance_bvh->scene);rtcSetMask(scene, geom_id, ob->visibility);

    }

    MPIMPI MPI

    MPI

    MPI

    MPI

    MPI

    MPI

    MPI

    Our modifications is based on code from Stefan Werner: https://github.com/tangent-animation/blender278/tree/master_cycles-embree

    Each architecture can have a different type of BVH

  • 28

    Blender & Embree - Rendering// bvh.hbool scene_intersect(KernelGlobals *kg, const Ray ray, const uint visibility, Intersection *isect, uint *lcg_state, float difl, float extmax) {

    if(kernel_data.bvh.scene) {isect->t = ray.t;CCLRay rtc_ray(ray, kg, visibility, CCLRay::RAY_REGULAR);rtcIntersect(kernel_data.bvh.scene, rtc_ray);if(rtc_ray.geomID != RTC_INVALID_GEOMETRY_ID && rtc_ray.primID !=

    RTC_INVALID_GEOMETRY_ID) {rtc_ray.isect_to_ccl(isect);return true;

    }return false; } // … }

    // bvh_embree_traversal.hCCLRay::CCLRay(const ccl::Ray& ray, ccl::KernelGlobals *kg_, const ccl::uint visibility, RayType type_) {

    org[0] = ray.P.x; org[1] = ray.P.y; org[2] = ray.P.z;dir[0] = ray.D.x; dir[1] = ray.D.y; dir[2] = ray.D.z;tnear = 0.0f; tfar = ray.t; time = ray.time; mask = visibility;geomID = primID = instID = RTC_INVALID_GEOMETRY_ID;// …

    }

    void CCLRay::isect_to_ccl(ccl::Intersection *isect) {const bool is_hair = geomID & 1;isect->u = 1.0f - v - u;isect->v = u;isect->t = tfar;if(instID != RTC_INVALID_GEOMETRY_ID) {

    RTCScene inst_scene = (RTCScene) rtcGetUserData(kernel_data.bvh.scene, instID);isect->prim = primID + (intptr_t) rtcGetUserData(inst_scene, geomID) +

    kernel_tex_fetch(__object_node, instID/2);isect->object = instID/2;

    } else {isect->prim = primID + (intptr_t) rtcGetUserData(kernel_data.bvh.scene, geomID);isect->object = OBJECT_NONE;

    }isect->type = kernel_tex_fetch(__prim_type, isect->prim);

    } };

    // bvh_embree_traversal.hstruct CCLRay : public RTCRay {

    typedef enum {RAY_REGULAR = 0, RAY_SHADOW_ALL = 1, RAY_SSS = 2,RAY_VOLUME_ALL = 3,

    } RayType;// cycles extensions:ccl::KernelGlobals *kg;RayType type;// for shadow raysccl::Intersection *isect_s;int max_hits;int num_hits;

    // … }

    Our modifications is based on code from Stefan Werner: https://github.com/tangent-animation/blender278/tree/master_cycles-embree

  • 29

    Embree & KNC Embree versions: Embree v3.0.0 (beta), Embree v2.17.2 (release), Embree v2.9.0 (KNC)

    SSE emulator

  • 30

    Embree & KNC Embree versions: Embree v3.0.0 (beta), Embree v2.17.2 (release), Embree v2.9.0 (KNC)

    SSE emulator

  • 31

    Embree & KNC Embree versions: Embree v3.0.0 (beta), Embree v2.17.2 (release), Embree v2.9.0 (KNC)

    SSE emulator #define _MM_mul_ps _simd_mul_ps//Multiply packed single-precision (32-bit) …INTRINSICS_EMUL_INLINE__simd128 _simd_mul_ps (__simd128 a, __simd128 b){

    INTRINSICS_EMUL_ASSUME_ALIGNED(a);INTRINSICS_EMUL_ASSUME_ALIGNED(b);

    __simd128 dst;for (int j = 0; j < 4; j++) {

    dst.f[j] = a.f[j] * b.f[j];}

    #ifdef SIMD_DEBUG__m128 __a;memcpy(&__a, a, 16);__m128 __b;memcpy(&__b, b, 16);__m128 __dst = _mm_mul_ps(__a, __b);if (memcmp(&dst, &__dst, 16) != 0)

    printf("_mm_mul_ps:%d\n",(memcmp(&dst, &__dst, 16) != 0) ? 0 : 1);

    #endifreturn dst;

    }

  • 32

    Embree & KNC Embree versions: Embree v3.0.0 (beta), Embree v2.17.2 (release), Embree v2.9.0 (KNC)

    SSE emulator #define _MM_mul_ps _simd_mul_ps//Multiply packed single-precision (32-bit) …INTRINSICS_EMUL_INLINE__simd128 _simd_mul_ps (__simd128 a, __simd128 b){

    INTRINSICS_EMUL_ASSUME_ALIGNED(a);INTRINSICS_EMUL_ASSUME_ALIGNED(b);

    __simd128 dst;for (int j = 0; j < 4; j++) {

    dst.f[j] = a.f[j] * b.f[j];}

    #ifdef SIMD_DEBUG__m128 __a;memcpy(&__a, a, 16);__m128 __b;memcpy(&__b, b, 16);__m128 __dst = _mm_mul_ps(__a, __b);if (memcmp(&dst, &__dst, 16) != 0)

    printf("_mm_mul_ps:%d\n",(memcmp(&dst, &__dst, 16) != 0) ? 0 : 1);

    #endifreturn dst;

    }

    #ifdef SIMD_AVX512return _mm512_mul_ps(a.m, b.m);

    #endif

    struct __simd128 {union {

    __m512 m;int8_t c[64];int16_t s[32];float f[16];int32_t i[16];double d[8];int64_t l[8];

    };__simd128() {};__simd128(const __m512 &m_):

    m(m_) {}} __attribute__((aligned(64)));

  • Benchmarks

    33

    Salomon IT4Innovations Intel® Xeon™ E5-2680v3 (Haswell) Intel® Xeon Phi™ 7120P (KNC) NVIDIA® GeForce® GTX TITAN X

    Marconi CINECA Intel® Xeon Phi™ 7250 (KNL) Intel® Xeon 8160 (SKL)

    Classroom by Christophe Seux

    Fishy Cat by Manu Jarvinen

    Pabellon Barcelona by Claudio AndresThe Daily Dweebsby Blender Foundation

    Crown by Martin Lubich

  • Performance comparison - Rendering

    34

    453

    798 769

    0

    1127

    688

    1138

    2784

    639

    322

    645

    1240

    322141

    329

    581

    340160

    330

    660

    0

    500

    1000

    1500

    2000

    2500

    3000

    Classroom Fishy Cat Pabellon Barcelona Dweebs MB

    Tim

    e in

    sec

    onds TITAN X (CUDA)

    KNC

    Haswell

    KNL (Marconi)

    SKL (Marconi)

    OpenMP threads: KNC 244x, Haswell 12x, KNL 272x, SKX 24x. KNC uses BVH8. EMBREE-BVH8 is used for Haswell, KNL and Skylake.

  • Performance comparison of BVH - KNC

    35

    390

    952

    1586

    4022

    206

    745

    1242

    3151

    200

    721

    1236

    2886

    185

    689

    1138

    2784

    0

    500

    1000

    1500

    2000

    2500

    3000

    3500

    4000

    4500

    Imperial Crown Fishy Cat Pabellon Barcelona Dweebs MB

    Tim

    e in

    sec

    onds

    BVH2

    BVH2 ( AVX-512 )

    BVH4 ( AVX-512 )

    BVH8 ( AVX-512 )

    BVH2, BVH4 – ssef was replaced by avx512f. BVH8 – avxf was replaced by avx512f. BVH2, BVH4, BVH8 contain Ray Triangle Intersection with avx512f.

  • Performance comparison of BVH – KNL (Marconi)

    36

    161

    313

    606

    1352

    84

    214

    493

    1116

    93175

    440

    931

    87173

    423

    885

    77160

    425

    697

    66141

    329

    581

    0

    200

    400

    600

    800

    1000

    1200

    1400

    1600

    Imperial Crown Fishy Cat Pabellon Barcelona Dweebs MB

    Tim

    e in

    sec

    onds BVH2

    BVH2 ( SSE2 )

    BVH4 ( SSE2 )

    BVH8 ( AVX2 )

    EMBREE-BVH4

    EMBREE-BVH8

  • Performance comparison of BVH – SKL (Marconi)

    37

    150

    308

    630

    1534

    76

    221

    450

    1096

    65176

    381

    907

    70183

    399

    927

    72158

    326

    660

    76160

    330

    662

    0

    200

    400

    600

    800

    1000

    1200

    1400

    1600

    1800

    Imperial Crown Fishy Cat Pabellon Barcelona Dweebs MB

    Tim

    e in

    sec

    onds BVH2

    BVH2 ( SSE2 )

    BVH4 ( SSE2 )

    BVH8 ( AVX2 )

    EMBREE-BVH4

    EMBREE-BVH8

  • Benchmark Worm: Strong Scalability MPI Test (offline)

    38

    The benchmark was run on 64 computing nodes of the Salomon supercomputer equipped with two Intel® Xeon™ E5-2680v3 CPUs and two Intel® Xeon Phi™ 7120P.

    Worm scene has 13.2 million triangles.

    Resolution: 4096x2048, Samples: 1024

    Cosmos Laundromat - First Cycle

  • Benchmark Worm: Strong Scalability MPI Test (offline)

    39

    162258436

    43392210

    1115562

    285

    85944297

    21491074

    537269

    134

    1

    10

    100

    1000

    10000

    100000

    1 2 4 8 16 32 64

    Tim

    e in

    sec

    onds

    Number of nodes

    OMP24 Offload Symmetric linear

  • Benchmark Tatra T87: Strong Scalability MPI Test (interactive)

    40

    The benchmark was run on 64 computing nodes of the Salomon supercomputer equipped with two Intel® Xeon™ E5-2680v3 CPUs

    Tatra T87 has 1.2 million triangles and uses the HDRI lighting.

    Resolution: 1920x1080, Samples: 1

    Tatra T87 by David Cloete

  • Benchmark Tatra T87: Strong Scalability MPI Test (interactive)

    41

    520~

    1.9 fps280~

    3.6 fps 160~

    6.3 fps 88~

    11.3 fps 50~20 fps

    950510

    310

    15080

    51 ms2 samples

    52 ms4 samples19.2 fps

    8080 ms

    2 samples80 ms

    4 samples 12,5 fps

    1

    10

    100

    1000

    1 2 4 8 16 32 64

    Tim

    e in

    mill

    isec

    onds

    Number of nodes

    Offload OMP24Offload - weak scaling OMP24 - weak scalinglinear optimal weak

    real-time rendering achievedwe can increase samples per

    transfer

  • The Daily Dweebs: Performance comparison of w/o Motion Blur on 2x Intel® Xeon™ E5-2680v3 (Haswell)

    42

    0

    10

    20

    30

    40

    50

    60

    70

    1 101 201 301 401 501 601 701 801 901 1001 1101

    Tim

    e in

    min

    utes

    Frames

    BVH4EMBREEBHV4 with motion blurEMBREE with motion blur

  • 43https://cloud.blender.org/blog/the-dweebs-a-mini-open-movie-project

  • 44

    References

    Jaros, M., Riha, L., Strakos, P. , et al.: Rendering in blender cycles using MPI and intel® xeon phi™. Paper presented at the ACM International Conference Proceeding Series, Part F130523 doi:10.1145/3110224.3110236, 2017

    Jaros, M., Riha, L., Strakos, P. , et al.: Acceleration of Blender Cycles Path-Tracing Engine Using Intel Many Integrated Core Architecture, CISIM 2015, Warsaw, Poland, p. 86-97, 2015

    http://blender.it4i.cz

    Acknowledgement

    This work was supported by The Ministry of Education, Youth and Sports from the Large Infrastructures for Research, Experimental Development and Innovations project „IT4Innovations National Supercomputing Center – LM2015070“.

    http://blender.it4i.cz/

    Rendering in Blender Cycles Using AVX-512 VectorizationOutlineBlender & Cycles RenderCycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Cycles is internal plugin (Extending Python w. C++)Path-tracing (unbiased vs. biased)Acceleration Structures�Bounding Volume Hierarchy (BVH)Bounding Volume Hierarchy (BVH)IT4InnovationsCyclesPhiMPI, Hybrid and Native modeSnímek číslo 18OutlineAVX-512 Vectorization and Intrinsics InstructionsAVX-512 Vectorization and Intrinsics InstructionsAVX-512 Vectorization and Intrinsics InstructionsAVX-512 Vectorization and Intrinsics InstructionsAPI from Embree: avx512f, avx512i, avx512b�Example of AVX-512 Vectorization: Ray-Triangle IntersectionBlender & Embree – Data PreparationBlender & Embree & MPI – Data PreparationBlender & Embree - RenderingEmbree & KNCEmbree & KNCEmbree & KNCEmbree & KNCBenchmarksPerformance comparison - RenderingPerformance comparison of BVH - KNCPerformance comparison of BVH – KNL (Marconi)Performance comparison of BVH – SKL (Marconi)Benchmark Worm: Strong Scalability MPI Test (offline)Benchmark Worm: Strong Scalability MPI Test (offline)Benchmark Tatra T87: Strong Scalability MPI Test (interactive)Benchmark Tatra T87: Strong Scalability MPI Test (interactive)The Daily Dweebs: Performance comparison of w/o Motion Blur on 2x Intel® Xeon™ E5-2680v3 (Haswell)�Snímek číslo 43References