
Getting Started with Parallel STL


Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the working draft N4659 for the next version of the C++ standard, commonly called C++17. The implementation also supports the unsequenced execution policy specified in the ISO* C++ working group paper P0076R3.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms for Intel® processors. For sequential execution, it relies on an available implementation of the C++ standard library.

Parallel STL is available as a part of Intel® Parallel Studio XE and Intel® System Studio.

 

Prerequisites

To use Parallel STL, you must have the following software installed:

  • C++ compiler with:
    • Support for C++11
    • Support for OpenMP* 4.0 SIMD constructs
  • Intel® Threading Building Blocks (Intel® TBB) 2018

The latest version of the Intel® C++ Compiler is recommended for better performance of Parallel STL algorithms compared to previous compiler versions.

To build an application that uses Parallel STL on the command line, you need to set the environment variables for compilation and linkage. You can do this by calling suite-level environment scripts such as compilervars.{sh|csh|bat}, or you can set just the Parallel STL environment variables by running pstlvars.{sh|csh|bat} in <install_dir>/{linux|mac|windows}/pstl/bin.

<install_dir> is the installation directory; by default, it is:

For Linux* and macOS*:

  • For super-users:      /opt/intel/compilers_and_libraries_<version>
  • For ordinary users:  $HOME/intel/compilers_and_libraries_<version>

For Windows*:

  • <Program Files>\IntelSWTools\compilers_and_libraries_<version>

 

Using Parallel STL

Follow these steps to add Parallel STL to your application:

  1. Add the <install_dir>/pstl/include folder to the compiler include paths. You can do this by calling the pstlvars script.

  2. Add #include "pstl/execution" to your code. Then add one or more of the following lines, depending on the algorithms you intend to use:

    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"
  3. When using algorithms and execution policies, specify the namespace std::execution if there is no vendor implementation of the C++17 standard library, or pstl::execution otherwise. See the 'Examples' section below and the combined sketch after this list.
  4. For any of the implemented algorithms, pass one of the values seq, unseq, par or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meaning:

     

    • seq: Sequential execution.
    • unseq: Try to use SIMD. This policy requires that all functions provided are SIMD-safe.
    • par: Use multithreading.
    • par_unseq: Combined effect of unseq and par.

  5. Compile the code as C++11 (or later), using compiler options for vectorization:

    • For the Intel® C++ Compiler:
      • For Linux* and macOS*: -qopenmp-simd or -qopenmp
      • For Windows*: /Qopenmp-simd or /Qopenmp
    • For other compilers, find a switch that enables OpenMP* 4.0 SIMD constructs.

    To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are:

    • For Linux* and macOS*: -xHOST, -xSSE4.1, -xCORE-AVX2, -xMIC-AVX512.
    • For Windows*: /QxHOST, /QxSSE4.1, /QxCORE-AVX2, /QxMIC-AVX512.
    If using a different compiler, see its documentation.

     

  6. Link with the Intel TBB dynamic library for parallelism. For the Intel C++ Compiler, use the options:

    • For Linux* and macOS*: -tbb
    • For Windows*: /Qtbb (optional, this should be handled by #pragma comment(lib, <libname>))
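
For instance, a minimal sketch combining steps 2 through 4, using the pstl::execution namespace on the assumption that no vendor C++17 standard library implementation is present:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

// Sorts a vector using the parallel execution policy (par).
void sort_values(std::vector<float>& values) {
    std::sort(pstl::execution::par, values.begin(), values.end());
}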

Version Macros

The macros related to versioning are described below. You should not redefine these macros.

PSTL_VERSION

Current Parallel STL version. The value is a decimal numeral of the form xyy where x is the major version number and yy is the minor version number.

PSTL_VERSION_MAJOR

PSTL_VERSION/100; that is, the major version number.

PSTL_VERSION_MINOR

PSTL_VERSION - PSTL_VERSION_MAJOR * 100; that is, the minor version number.
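
For illustration, a minimal sketch that prints the version at run time, assuming the macros are defined by any Parallel STL header (here "pstl/execution"):

#include "pstl/execution"   // assumed to define the PSTL_VERSION* macros
#include <cstdio>

// Prints the Parallel STL version; with the xyy encoding described above,
// a hypothetical PSTL_VERSION of 203 would print "2.3".
int main() {
    std::printf("Parallel STL %d.%d (PSTL_VERSION=%d)\n",
                PSTL_VERSION_MAJOR, PSTL_VERSION_MINOR, PSTL_VERSION);
    return 0;
}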

Macros

PSTL_USE_PARALLEL_POLICIES

This macro controls the use of parallel policies.

When set to 0, it disables the par and par_unseq policies, making their use a compilation error. This is recommended for code that uses only vectorization with the unseq policy, to avoid a dependency on the Intel® TBB runtime library.

When the macro is not defined (default) or evaluates to a non-zero value, all execution policies are enabled.

PSTL_USE_NONTEMPORAL_STORES

This macro enables the use of #pragma vector nontemporal in the algorithms std::copy, std::copy_n, std::fill, std::fill_n, std::generate, std::generate_n with the unseq policy. For further details about the pragma, see the User and Reference Guide for the Intel® C++ Compiler at https://software.intel.com/en-us/node/524559.

If the macro evaluates to a non-zero value, the use of #pragma vector nontemporal is enabled.

When the macro is not defined (default) or set to 0, the pragma is not used.
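
As an illustrative sketch of how these macros might be set, assuming they take effect when defined before any Parallel STL header is included (for example, via -D options on the compiler command line):

// Sketch: vectorization-only configuration, with nontemporal stores enabled.
// Assumption: the macros must be visible before the Parallel STL headers are included.
#define PSTL_USE_PARALLEL_POLICIES 0    // par and par_unseq become compilation errors; no Intel TBB dependency
#define PSTL_USE_NONTEMPORAL_STORES 1   // allow #pragma vector nontemporal in copy/fill/generate with unseq
#include "pstl/execution"
#include "pstl/algorithm"

void reset(float* a, int n) {
    // unseq remains available for vectorized execution.
    std::fill_n(pstl::execution::unseq, a, n, 0.0f);
}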

 

Examples

Example 1

The following code calls vectorized copy:

#include "pstl/execution"
#include "pstl/algorithm"
void foo(float* a, float* b, int n) {
    std::copy(pstl::execution::unseq, a, a+n, b);
}

Example 2

This example calls the parallelized version of fill_n:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

int main()
{
    std::vector<int> data(10000000);
    std::fill_n(pstl::execution::par_unseq, data.begin(), data.size(), -1);  // Fill the vector with -1

    return 0;
}
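
Example 3

As a further sketch, the following computes a dot product using transform_reduce with the par policy; it assumes both arrays hold at least n elements:

#include "pstl/execution"
#include "pstl/numeric"

// Parallel dot product: element-wise multiply, then reduce.
float dot(const float* a, const float* b, int n) {
    return std::transform_reduce(pstl::execution::par, a, a + n, b, 0.0f);
}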

Implemented Algorithms

Parallel STL supports all of the aforementioned execution policies only for the algorithms listed below. Adding a policy argument to any other C++ standard library algorithm will result in sequential execution.

 

Each algorithm is listed with its page at cppreference.com:

  • adjacent_find: http://en.cppreference.com/w/cpp/algorithm/adjacent_find
  • all_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • any_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • copy: http://en.cppreference.com/w/cpp/algorithm/copy
  • copy_if: http://en.cppreference.com/w/cpp/algorithm/copy
  • copy_n: http://en.cppreference.com/w/cpp/algorithm/copy_n
  • count: http://en.cppreference.com/w/cpp/algorithm/count
  • count_if: http://en.cppreference.com/w/cpp/algorithm/count
  • destroy: http://en.cppreference.com/w/cpp/memory/destroy
  • destroy_n: http://en.cppreference.com/w/cpp/memory/destroy_n
  • equal: http://en.cppreference.com/w/cpp/algorithm/equal
  • exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/exclusive_scan
  • fill: http://en.cppreference.com/w/cpp/algorithm/fill
  • fill_n: http://en.cppreference.com/w/cpp/algorithm/fill_n
  • find: http://en.cppreference.com/w/cpp/algorithm/find
  • find_end: http://en.cppreference.com/w/cpp/algorithm/find_end
  • find_first_of: http://en.cppreference.com/w/cpp/algorithm/find_first_of
  • find_if: http://en.cppreference.com/w/cpp/algorithm/find
  • find_if_not: http://en.cppreference.com/w/cpp/algorithm/find
  • for_each: http://en.cppreference.com/w/cpp/algorithm/for_each
  • for_each_n: http://en.cppreference.com/w/cpp/algorithm/for_each_n
  • generate: http://en.cppreference.com/w/cpp/algorithm/generate
  • generate_n: http://en.cppreference.com/w/cpp/algorithm/generate_n
  • inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/inclusive_scan
  • is_heap: http://en.cppreference.com/w/cpp/algorithm/is_heap
  • is_heap_until: http://en.cppreference.com/w/cpp/algorithm/is_heap_until
  • is_partitioned: http://en.cppreference.com/w/cpp/algorithm/is_partitioned
  • is_sorted: http://en.cppreference.com/w/cpp/algorithm/is_sorted
  • is_sorted_until: http://en.cppreference.com/w/cpp/algorithm/is_sorted_until
  • lexicographical_compare: http://en.cppreference.com/w/cpp/algorithm/lexicographical_compare
  • max_element: http://en.cppreference.com/w/cpp/algorithm/max_element
  • merge: http://en.cppreference.com/w/cpp/algorithm/merge
  • min_element: http://en.cppreference.com/w/cpp/algorithm/min_element
  • minmax_element: http://en.cppreference.com/w/cpp/algorithm/minmax_element
  • mismatch: http://en.cppreference.com/w/cpp/algorithm/mismatch
  • move: http://en.cppreference.com/w/cpp/algorithm/move
  • none_of: http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
  • partial_sort: http://en.cppreference.com/w/cpp/algorithm/partial_sort
  • partition_copy: http://en.cppreference.com/w/cpp/algorithm/partition_copy
  • reduce: http://en.cppreference.com/w/cpp/algorithm/reduce
  • remove_copy: http://en.cppreference.com/w/cpp/algorithm/remove_copy
  • remove_copy_if: http://en.cppreference.com/w/cpp/algorithm/remove_copy
  • replace: http://en.cppreference.com/w/cpp/algorithm/replace
  • replace_copy: http://en.cppreference.com/w/cpp/algorithm/replace_copy
  • replace_copy_if: http://en.cppreference.com/w/cpp/algorithm/replace_copy
  • replace_if: http://en.cppreference.com/w/cpp/algorithm/replace
  • search: http://en.cppreference.com/w/cpp/algorithm/search
  • search_n: http://en.cppreference.com/w/cpp/algorithm/search_n
  • sort: http://en.cppreference.com/w/cpp/algorithm/sort
  • stable_sort: http://en.cppreference.com/w/cpp/algorithm/stable_sort
  • swap_ranges: http://en.cppreference.com/w/cpp/algorithm/swap_ranges
  • transform: http://en.cppreference.com/w/cpp/algorithm/transform
  • transform_exclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_exclusive_scan
  • transform_inclusive_scan: http://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
  • transform_reduce: http://en.cppreference.com/w/cpp/algorithm/transform_reduce
  • uninitialized_copy: http://en.cppreference.com/w/cpp/memory/uninitialized_copy
  • uninitialized_copy_n: http://en.cppreference.com/w/cpp/memory/uninitialized_copy_n
  • uninitialized_default_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct
  • uninitialized_default_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct_n
  • uninitialized_fill: http://en.cppreference.com/w/cpp/memory/uninitialized_fill
  • uninitialized_fill_n: http://en.cppreference.com/w/cpp/memory/uninitialized_fill_n
  • uninitialized_move: http://en.cppreference.com/w/cpp/memory/uninitialized_move
  • uninitialized_move_n: http://en.cppreference.com/w/cpp/memory/uninitialized_move_n
  • uninitialized_value_construct: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct
  • uninitialized_value_construct_n: http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct_n
  • unique_copy: http://en.cppreference.com/w/cpp/algorithm/unique_copy

Known limitations

Parallel and vector execution is supported for a subset of the aforementioned algorithms, and only if random access iterators are provided; for the rest, execution remains serial.
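
For example, a sketch of the expected behavior with different iterator categories; the std::list case is assumed to fall back to serial execution:

#include <list>
#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

void increment_all(std::vector<int>& v, std::list<int>& l) {
    // Random access iterators: eligible for parallel execution.
    std::for_each(pstl::execution::par, v.begin(), v.end(), [](int& x) { ++x; });

    // Bidirectional (non-random-access) iterators: the policy is accepted,
    // but execution is expected to remain serial.
    std::for_each(pstl::execution::par, l.begin(), l.end(), [](int& x) { ++x; });
}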

Legal Information

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© Intel Corporation


Code Sample: Custom Audio Editor Tool with Unreal Engine* for Sound Spatialization in VR


License: Intel Sample Source Code License Agreement

Optimized for...

  • OS: Microsoft Windows® 10 (64 bit)
  • Hardware: GPU required, HTC Vive*
  • Software (Programming Language, tool, IDE, Framework): Microsoft Visual Studio* 2017, C#; Unreal Engine* 4.18.1 or greater

Prerequisites: Familiarity with Visual Studio, Unreal Engine API, 3D graphics, parallel processing.

Summary

This code sample shows you, step by step, how to build a useful tool for VR developers using Unreal Engine that leverages the power of Intel® CPUs. Unreal Engine has a powerful virtual reality editor option, but one thing it does not include is the ability to edit and place sounds while inside VR. It can be troublesome to constantly restart the editor after adjusting a sound in order to test what it sounds like in VR. So we decided to create a sound editor that allows game devs and sound designers alike to quickly place, edit, and test spatialized sound inside VR! This saves the user from having to constantly enter and exit the editor.

What You Will Learn

  • Motion controller interaction
  • How to create a custom C++ class
  • VR UI
  • Saving editor changes
  • Sound spatialization parameters

Below, we walk you through, step by step, how we made this custom audio editor tool for Unreal Engine from start to finish.

Instructions

Before we begin, you need to do a couple of things. Download and unzip the project folder. You also need to make sure you have at least version 4.18.1 of Unreal Engine* installed.

When you have downloaded and unzipped the folder, right-click on Intel_VR_Audio_Tools.uproject and select "Generate Visual Studio project files." After that completes, open the project. A popup that says "Missing Intel_VR_Audio_Tools Modules" will appear. Click "Yes" to start the rebuild; this should take less than 20 seconds. This step is needed because of how the tool dynamically finds .wav files that have been added to the project, which is explained in the Custom C++ Class section.

Follow the tutorial for the step-by-step to build the custom tool.

Update Log

Created April 25, 2018

Characterizing DPDK-Enabled Open vSwitch* Using TRex on a Dual-Socket System


Introduction

Measuring the performance of an Open vSwitch* (OvS) or any vSwitch can be difficult without resorting to expensive commercial traffic generators, or at least some extra network nodes on which to run a software traffic generator.

However, on dual-socket or multi-socket systems there is another option. The software traffic generator can be run on one socket while OvS as the System Under Test (SUT) can reside on the other socket. Being on separate sockets, the two will work well without interfering with each other. Each will use memory from its own socket’s RAM and will use its own PCIe bus. The only extra hardware requirement is a second network interface card (NIC) to drive the generated traffic without interfering with the NIC and PCI-bus of the SUT.

This article shows you how to set up this configuration using the TRex Realistic Traffic Generator.

The overall system looks like this:

Overall system configuration
Figure 1. Overall system configuration.

Setting up the Hardware

It is important to choose the correct PCIe* slots for your NICs. For instance, to be able to handle the incoming traffic from even one 10 Gbps NIC you need a PCIe Gen3 NIC on an 8-lane bus. See Understanding Performance of PCI Express Systems.

For this article, I used a system based on an Intel® Server Board S2600WT.

sudo dmidecode -t system
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 2.7 present.

Handle 0x0001, DMI type 1, 27 bytes
System Information
        Manufacturer: Intel Corporation
        Product Name: S2600WT2
…

I can use this information to look up the technical specification for the board and see exactly which PCIe slots will give sufficient bandwidth while placing the two NICs on different sockets.

The system will be functional even if the PCIe placement is wrong; however, the direct memory access (DMA) will have to cross the socket interconnect, which will become a bottleneck and the performance observed by TRex will not be accurate.

Host Configuration

You should also be aware of the implications of other host configuration options that affect Data Plane Development Kit (DPDK) applications: hyper-threading and core ID to socket mapping, core isolation, and so on. These are covered in Open vSwitch with DPDK (Ovs-DPDK).

Care has to be taken to ensure that the cores assigned to TRex and the cores assigned to OvS-DPDK are on the correct sockets (i.e., the same socket as their assigned NICs). Be aware that if hyper-threading is enabled, the core (strictly speaking, the "logical CPU") to socket mapping is more complex. If you are using a program such as htop to verify which cores are running poll-mode drivers (PMDs), also be aware that TRex uses virtually no CPU resources to receive packets (as they actually terminate on the NIC) and only uses a noticeable amount of CPU to transmit packets. So, if TRex is not actively transmitting, htop will not report its CPUs as being under load.

Setting up TRex

The TRex manual has a good sanity-check tutorial to ensure TRex is installed and working correctly in the “First time Running” section. There is also a list of other TRex documentation that should be scanned to get an idea of its other functionality.

Once TRex is running successfully in loop-back mode, connect the OvS and TRex NICs directly.

This article assumes you are already comfortable setting up Open vSwitch with DPDK (Ovs-DPDK). If not, the process is documented within OvS at Open vSwitch with DPDK. Only changes to the standard OvS-DPDK configuration are described below.

Hugepages

TRex uses DPDK, which requires hugepages. If hugepages are not configured on your system TRex will do this configuration for you auto-magically. Unfortunately, it does not account for other users of hugepages such as OvS-DPDK, so we need to set up and mount the hugepages manually beforehand in order to prevent TRex from reconfiguring everything when it starts up.

Create 4 GB of 2 MB hugepages (i.e., 2048 x 2 MB hugepages) on each socket:

root# echo 2048 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
root# echo 2048 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages

If not already mounted, mount the hugepages:

root# mkdir /dev/hugepages
root# mount -t hugetlbfs nodev /dev/hugepages

As both TRex and OvS-DPDK create hugepage-backed files via DPDK, each needs to use its own file prefix rather than the default.

For OvS, when ovsdb is running but before vswitchd is started:

ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-extra="--file-prefix=ovs"

For TRex, edit trex_cfg.yaml:

- Port_limit    : 4
  Prefix        : trex

Without this step you will get errors such as "EAL: Can only reserve 1857 pages from 4096 requested. Current CONFIG_RTE_MAX_MEMSEG=256 is not enough", depending on whether you start OvS or TRex first.

You will be able to see the different hugepage-backed memory-mapped files that DPDK created for each application in /dev/hugepages as ovsmap_NNN and trexmap_NNN.

Other TRex Configurations

Although marked as optional in the TRex documentation, when TRex is used with OvS-DPDK you will need to limit its memory use.

limit_memory : 1024 #MB.

When using more than one core per interface to generate traffic I found:

  • Configuration item “c” must be configured.
  • The master and latency thread IDs must be set to ensure they run on the “TRex” socket and not the “OvS-DPDK” socket. I have used socket#1 as the TRex socket.
  • The length of the “threads” lists must be equal to the value of c.
  • Multiple cores are assigned to an interface by using multiple threads lists. The configuration below assigns cores 16 and 20 to the first interface, 17 and 21 to the second, and so on.

Therefore, you will need a configuration somewhat like:

c : 4 
  …
  platform :
         master_thread_id : 14
         latency_thread_id : 15
         dual_if :
         - socket : 1
           threads : [16, 17, 18, 19]
         - socket : 1
           threads : [20, 21, 22, 23]

For the wiring and OvS forwarding configuration listed above I also used the port_bandwidth_gb setting and explicitly set the src and dest MACs of the ports, so that the outgoing src MAC address for each port matched the incoming dest MAC. These items may or may not be required:

port_bandwidth_gb : 10

  port_info       :  
          - dest_mac   : 00:03:47:00:01:02
            src_mac    : 00:03:47:00:01:01
          - dest_mac   : 00:03:47:00:01:01
            src_mac    : 00:03:47:00:01:02
          - dest_mac   : 00:03:47:00:00:02
            src_mac    : 00:03:47:00:00:01
          - dest_mac   : 00:03:47:00:00:01
            src_mac    : 00:03:47:00:00:02

Running TRex

TRex can now be run in the usual ways, such as:

sudo ./t-rex-64 -f cap2/dns.yaml -m 250kpps

Or the TRex Python* bindings can be used to exert much more fine-grained control and create your own application-specific traffic generator.

Creating Your Own Traffic Generator

By using the Python bindings, it becomes very simple to write your own traffic generator command-line interface (CLI) that is tailored to your own use case. This can make tasks that are slow via a traditional GUI easily repeatable and modifiable. For instance, when I needed to change the total load offered to OvS but at the same time maintain a ratio of offered load across several ports it was straightforward to write a CLI on top of TRex that looked like this:

$ ./mytest.py
(Cmd) dist 1 2 3 4         <<< per port load ratio 1:2:3:4
Dist ratio is [1, 2, 3, 4]
     
(Cmd) start 1000               <<< total load 1000 kpps
-f stl/traffic.py -m 100kpps --port 0 --force
-f stl/traffic.py -m 200kpps --port 1 --force
-f stl/traffic.py -m 300kpps --port 2 --force
-f stl/traffic.py -m 400kpps --port 3 --force

(Cmd) stats               <<< check loss & latency
Stats from last 4.5s
0->1 offered 98 dropped 0 rxd 98 (kpps) => 0% loss
1->0 offered 196 dropped 0 rxd 196 (kpps) => 0% loss
2->3 offered 295 dropped 0 rxd 295 (kpps) => 0% loss
3->2 offered 393 dropped 0 rxd 393 (kpps) => 0% loss
0->1 average 8 jitter 0 total_max 13 (us)
1->0 average 13 jitter 1 total_max 20 (us)
2->3 average 15 jitter 3 total_max 46 (us)
3->2 average 12 jitter 3 total_max 88 (us)
18 3 25 29 7 48 931 4 984 948 5 985   <<< excel pastable format!

(Cmd) start 10000  <<< increase total load while keeping distribution.
-f stl/traffic.py -m 1000kpps --port 0 --force
-f stl/traffic.py -m 2000kpps --port 1 --force
-f stl/traffic.py -m 3000kpps --port 2 --force
-f stl/traffic.py -m 4000kpps --port 3 --force

(Cmd) stats
Stats from last 2.3s
0->1 offered 987 dropped 0 rxd 988 (kpps) => 0% loss
1->0 offered 1964 dropped 0 rxd 1964 (kpps) => 0% loss
2->3 offered 2927 dropped 886 rxd 2040 (kpps) => 30% loss
3->2 offered 3879 dropped 1847 rxd 2031 (kpps) => 47% loss
0->1 average 18 jitter 3 total_max 25 (us)
1->0 average 29 jitter 7 total_max 48 (us)
2->3 average 931 jitter 4 total_max 984 (us)
3->2 average 948 jitter 5 total_max 985 (us)

Something like this was very slow and error-prone to do on a commercial traffic generator, but simple using TRex and a small amount of Python.

An elided version of the script with additional comments explaining the use of the TRex API and the Python cmd module follows:

from trex_stl_lib.api import *
import cmd       # See https://docs.python.org/2/library/cmd.html
import pprint

def main():
    # connect to the server
    c = STLClient(server = '127.0.0.1')
    c.connect()
    # enter the cli loop
    MyCmd(c).cmdloop()
    ...

class MyCmd(cmd.Cmd):
    def __init__(self, client):
        # standard cmd boilerplate
        cmd.Cmd.__init__(self)
        self.client = client

    def do_start(self, line):     # cmd invokes this when you type 'start ...' on the cli
        """start <total offered rate kpps>
        Start traffic. Total kpps across all ports. e.g. 'start 10'
        """
        # This docstring is also the help string for the 'start' command!
        (argc, argv) = self._parse(line)
        if (argc != 1):
            self.do_help('start')
            return 0              # 1 halts the cmd loop; 0 gets another command

        ...   # port to rate mapping set up here

        for port, rate in enumerate(rates):
            # traffic.py is based on the TRex sample stl/cap.py
            start_line = "-f stl/traffic.py \
                -m %skpps --port %d --force" % (rate, port)
            # tell the TRex server to start a stream on port at rate
            rc = self.client.start_line(start_line)


    def do_dist(self, line):      # cmd invokes this when you type 'dist ...' on the cli
        """Set traffic dist across 4 ports e.g. 'dist 2 1 1 1'"""    # help string for the command

        ...   # rate distribution is parsed and stored here

    def do_stop(self, line):
        """Stop all traffic"""
        rc = self.client.stop(rx_delay_ms=100)  # returns None

    def do_stats(self, line):
        """Display pertinent stats"""

        pp = pprint.PrettyPrinter(indent=4)
        stats = self.client.get_stats()
        pp.pprint(stats)   # using pretty print is a fast
                           # way to see the stats' complicated internals!

        ...   # Skip the gory details of parsing stats


    def do_quit(self, line):
        return 1    # returning 1 tells the cmd base class to exit

Summary

A software-based traffic generator such as TRex can be used to test vSwitches with very little extra hardware – just a DPDK-compatible NIC. Also, by using the TRex Python bindings it is simple to write a traffic generator CLI that is tailored to your own use case. This can make test scenarios that are slow and error-prone to do via a traffic generator GUI easily repeatable and modifiable, enabling fast exploratory performance testing.

About the Author

Billy O’Mahony is a network software engineer with Intel. He has worked on the Open Platform for Network Functions Virtualization (OPNFV) project and accelerated software switching solutions in the user space running on Intel® architecture. His contributions to Open vSwitch with DPDK include Ingress Scheduling and RXQ/PMD Assignment.

References

Understanding Performance of PCI Express Systems

Intel® Server Board S2600WT - Technical Product Specification

Open vSwitch with DPDK

TRex Manual

TRex Documentation

Getting Started with the New Unity* Entity Component, C# Job System, and Burst Compiler

By Cristiano Ferreira and Mike Geig

Figure from a video game

Low, medium, and high. Standard fare for GPU settings, but why not CPU settings, too? Today the potential power of the CPU on your end users' machines can vary wildly. Typically, developers will define a CPU min-spec, implement the simulation and gameplay systems using that performance target, and then call it a day. This leaves the many potentially available cores and features built into modern mainstream CPUs sitting idle on the sideline. The new C# job system and entity component system from Unity* don't just allow you to easily leverage previously unused CPU resources, they will also help run all your game code more efficiently in general. Then you can use those extra CPU resources to add more scene dynamism and immersion. In this article, you'll see how to quickly get started learning these new features.

Unity is attacking two important performance problems for computing in game engines. The first problem under assault is inefficient data layout. Unity's Entity Component System (ECS) improves management of data storage for high-performance operations on those structures. The second problem is the lack of a high-performance job language and SIMD vectorization that can operate on that well-organized data. Unity's new C# job system, entity component system and Burst compiler technology leave those shortcomings in the dust.

The Unity entity component system and C# job system are two different things, but they go hand-in-hand. To get to know them, let's look at the current Unity workflow for creating an object in your scene, and then differentiate from there.

In the current Unity workflow, you:

  • Create a GameObject.
  • Add components to your game object that give your object desired properties:
    • Rendering
    • Collision
    • Rigidbody physics
  • Create and add MonoBehaviour scripts to your object to command and alter the states of these components at runtime.

Let's call this the Classic Unity workflow. There are some inherent drawbacks and performance considerations for this way of doing things. For one, data and processing are tightly coupled. This means that code reuse can happen less frequently as processing is tied to a very specific set of data. On top of this, the classic system is very dependent on reference types.

In the Classic GameObject and Components example shown below, the Bullet GameObject is dependent on the Transform, Renderer, Rigidbody, and Collider references. Objects being referenced in these performance-critical scripts exist scattered in heap memory. As a result of this, data is not transformed into a form that can be operated on by the faster SIMD vector units.

Classic gameobject and components lists
Figure 1. Classic gameobject and components lists.

Gaining Speed with Cache Prefetching

Accessing data from system memory is far slower than pulling data from a nearby cache. That is where prefetching comes in. Cache prefetching is when computer hardware predicts what data will be accessed next, and then preemptively pulls it from the original, slower memory into faster memory so that it is warmed and ready when it's needed. Using this, hardware gets a nice performance boost on predictive computations. If you are iterating over an array, the hardware prefetch unit can learn to pull swaths of data from system memory into the cache. When it comes time for the processor to operate on the next part of the array, the necessary data is sitting close by in the cache and ready to go. For tightly packed contiguous data, like you'd have in an array, it's easy for the hardware prefetcher to predict and get the right objects. When many different game objects are sparsely allocated in heap memory, it becomes impossible for the prefetcher to do its thing, forcing it to fetch useless data.

Scattered memory references between gameobjects
Figure 2. Scattered memory references between gameobjects, their behaviors, and their components.

The illustration above shows the random, sporadic nature of this data storage method. In the scenario shown above, every single reference (the arrows), even if cached as a member variable, could potentially pull all the way from system memory. The Classic Unity GameObject scenario can get your game prototyped and running in a very short timeline, but it's hardly ideal for performance-critical simulations and games. Compounding the issue, each of those reference types contains a lot of extra data that might not need to be accessed. These unused members also take up valuable space in processor caches. If only a select few member variables of an existing component are needed, the rest can be considered wasted space, as shown in the Wasted Space illustration below:

The few items used for the movement operation
Figure 3. The items in bold indicate the members that are actually used for the movement operation; the rest is wasted space.

To move your GameObject, the script needs to access the position and rotation data members from the Transform component. When your hardware is fetching data from memory, the cache line is filled with much potentially useless data. Wouldn't it be nice if you could simply have an array of only position and rotation members for all of the GameObjects that are supposed to move? This will enable you to perform the generic operation in a fraction of the time.

Enter the Entity Component System

Unity's new entity component system helps eliminate inefficient object referencing. Instead of GameObjects with their own collection of components, let's consider an entity that only contains the data it needs to exist.

In the Entity Component System with Jobs Diagram below, notice that the Bullet entity has no Transform or Rigidbody component attached to it. The Bullet entity is just the raw data needed explicitly for your update routine to operate on. With this new system, you can decouple the processing completely from individual object types.

Entity component system with jobs diagram
Figure 4. Entity component system with jobs diagram.

Of course, it's not just movement systems that benefit from this. Another common component in many games are more complex health systems set up across a wide variety of enemies and allies. These systems typically have little to no variation between object types, so they are another great candidate to leverage the new system. An entity is a handle used to index a collection of different data types that represent it (archetypes for ComponentDataGroups). Systems can filter and operate on all components with the required data without any help from the programmer; more on this later. The data is all efficiently organized in tightly packed contiguous arrays and filtered behind the scenes without the need to explicitly couple systems with entity types. The benefits of this system are immense. Not only does it improve access times with cache efficiency; it also allows advanced technologies (auto-vectorization / SIMD) available in modern CPUs that require this kind of data alignment to be used. This gives you performance by default with your games. You can do much more every frame or do the same thing in a much shorter amount of time. You'll also get a huge performance gain from the upcoming Burst compiler feature for free.

Wasted space generated by the classic system
Figure 5. Note the fragmentation in cache line storage and wasted space generated by the classic system. See image below for data comparison.

Comparison between Transform and DataNeededToMove
Figure 6. Compare the memory footprint associated with a single move operation with both accomplishing the same goal.

The Burst Compiler

The Burst compiler is the behind-the-scenes performance gain that results from the entity component system having organized your data more efficiently. Essentially, the Burst compiler will optimize operations on code depending on the processor capabilities on your player's machine. For instance, instead of doing just 1 float operation at a time, maybe you can do 16, 32, or 64 by filling unused registers. The new compiler technology is employed on Unity's new math namespace and code within the C# job system (described below), relying on the fact that the system knows data has been set up the proper way with the entity component system.

The current version for Intel CPUs supports Intel® Streaming SIMD Extensions 4 (Intel® SSE4), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) for float and integer. The system also supports different accuracy per method, applied transitively. For example, if you are using a cosine function inside your top-level method with a low accuracy, the whole method will use a low accuracy version of cosine as well. The system also provides for AOT (Ahead-of-Time) compilation with dynamic selection of the proper optimized function based on the feature support of the processor currently running the game.

Another benefit of this method of compilation is the future-proofing of your game. If a brand-new processor line comes to market with some amazing new features to be leveraged, Unity can do all of the hard work for you behind the scenes. All it takes is an upgrade to the compiler to reap the benefits. The compiler is package-based and can be upgraded without requiring a Unity editor update. Since the Burst package will be updated at its own cadence, you will be able to take advantage of the latest hardware architectural improvements and features without having to wait for the code to be rolled into the next editor release.

The C# Job System

Most people who have worked with multi-threaded code and generic tasking systems know that writing thread-safe code is difficult. Race conditions can rear their ugly heads in extremely rare cases. If the programmer hasn't thought of them, the result can be potentially critical bugs. On top of that, context-switching is expensive, so learning how to balance workloads to function as efficiently as possible across cores is difficult. Finally, writing SIMD optimized code or SIMD intrinsics is an esoteric skill, sometimes best left to a compiler. The new Unity C# job system takes care of all of these hard problems for you so that you can use all of the available cores and SIMD vectorization in modern CPUs without the headache.

C# job system diagram
Figure 7. C# job system diagram.

Let's look at a simple bullet movement system, for example. Most game programmers have written a manager for some type of GameObject as shown above in the Bullet Manager. Typically, these managers pool a list of GameObjects and update the positions of all active bullets in the scene every frame. This is a good use for the C# job system. Because movement can be treated in isolation, it is well suited to be parallelized. With the C# job system, you can easily pull this functionality out and operate on different chunks of data on different cores in parallel. As the developer, you don't have to worry about managing this work distribution; you only need to focus entirely on your game-specific code. You'll see how to easily do this in a bit.

Combining These Two New Systems

Combining the entity component system and the C# job system gives you a force more powerful than the sum of its parts. Since the entity component system sets up your data in an efficient, tightly packed manner, the job system can split up the data arrays so that they can be efficiently operated on in parallel. Also, you get some major performance benefits from cache locality and coherency. The thin as-needed allocation and arrangement of data increases the chance that the data your job will need will be in shared memory before it's needed. The layout and job system combination beget predictable access patterns that give the hardware cues to make smart decisions behind the scenes, giving you great performance.

"OK!" You are saying, "This is absolutely amazing, but how do I use this new system?"

To help get your feet wet, let's compare and contrast the code involved in a very simple game that uses the following programming systems:

  1. Classic System
  2. Classic System Using Jobs
  3. Entity Component System Using Jobs

Here's how the game works:

  • The player hits the space bar and spawns a certain amount of ships in that frame.
  • Each generated ship is set to a random X coordinate within the bounds of the screen.
  • Each generated ship has a movement function that sends it toward the bottom of the screen.
  • Each generated ship resets its position once the bottom bound is crossed.

Test Configuration:

  • In this article, we will reference the Unity Profiler, a very powerful tool for isolating bottlenecks and viewing work distribution. See the Unity docs to learn more!
    • Screen captures and data were taken using the Intel® Core™ i7-8700K processor and an NVIDIA GeForce* GTX 1080 graphics card.

1. Classic System

The Classic system checks each frame for spacebar input and triggers the AddShips() method. This method finds a random X/Z position between the left and right sides of the screen, sets the rotation of the ship to point downward, and spawns a ship prefab at that location.

void Update()
{
    if (Input.GetKeyDown("space"))
        AddShips(enemyShipIncremement);
}

void AddShips(int amount)
{
    for (int i = 0; i < amount; i++)
    {
        float xVal = Random.Range(leftBound, rightBound);
        float zVal = Random.Range(0f, 10f);

        Vector3 pos = new Vector3(xVal, 0f, zVal + topBound);
        Quaternion rot = Quaternion.Euler(0f, 180f, 0f);

        var obj = Instantiate(enemyShipPrefab, pos, rot) as GameObject;
    }
}

<Classic/ClassicSpawning_Update_AddShips.cs>

Code sample showing how to add ships using the classic system

Classic battleship
Figure 8. Classic ship prefab. (source: Unity.com Asset store battleships package).

The ship object spawned, along with each of its components, are created in heap memory. The movement script attached accesses the transform component every frame and updates the position, making sure to stay between the bottom and top bounds of the screen. Super simple!

using UnityEngine;

namespace Shooter.Classic
{
    public class Movement : MonoBehaviour
    {
        void Update()
        {
            Vector3 pos = transform.position;
            pos += transform.forward * GameManager.GM.enemySpeed * Time.deltaTime;

            if (pos.z < GameManager.GM.bottomBound)
                pos.z = GameManager.GM.topBound;

            transform.position = pos;
        }
    }
}

<Classic/Classic_Movement.cs>

Code sample showing move behavior

The graphic below shows the profiler tracking 16,500 objects on the screen at once. Not bad, but we can do better! Keep on reading.

The profiler tracking 16,500 objects on 30 FPS screen
Figure 9. After some initializations, the profiler is already tracking 16,500 objects on the screen at 30 FPS.

Classic performance visualization
Figure 10. Classic performance visualization.

Looking at the BehaviorUpdate() method, you can see that it takes 8.67 milliseconds to complete the behavior update for all ships. Also note that this is all happening on the main thread.

In the C# job system, that work is split among all available cores.

2. Classic System Using Jobs

using Unity.Jobs;
using UnityEngine;
using UnityEngine.Jobs;

namespace Shooter.JobSystem
{
    [ComputeJobOptimization]
    public struct MovementJob : IJobParallelForTransform
    {
        public float moveSpeed;
        public float topBound;
        public float bottomBound;
        public float deltaTime;

        public void Execute(int index, TransformAccess transform)
        {
            Vector3 pos = transform.position;
            pos += moveSpeed * deltaTime * (transform.rotation * new Vector3(0f, 0f, 1f));
            
            if (pos.z < bottomBound)
                pos.z = topBound;
                
            transform.position = pos;
        }
    }
}

<Jobs/Jobs_Movement_Job.cs>

Sample code showing job movement implementation using the C# Job System

Our new MovementJob script is a struct that implements one of the IJob interface variants. This self-contained structure defines a task, or "job", and the data needed to complete that task. It is this structure that we will schedule with the Job System. For each ship's movement and bounds-checking calculations, you know you need the movement speed, the top bound, bottom bound, and the delta time values. The job has no concept of delta time, so that data must be provided explicitly. The calculation logic itself for the new position is the same as the classic system, although assigning that data back to the original transform must be updated via the TransformAccess parameter since reference types (such as Transform) don't work here. The basic requirements to create a job involve implementing one of the IJob interface variants, such as IJobParallelForTransform in the above example, and implementing the Execute method specific to your job. Once created, this job struct can simply be passed into the Job Scheduler. From there, all of the execution and resulting processing will be completed for you.

To learn more about how this job is structured, let's break down the interface it is using: IJob | ParallelFor | Transform. IJob is the basic interface that all IJob variants inherit from. A Parallel For Loop is a parallel pattern that essentially takes a typical single threaded for loop and splits the body of work into chunks based on index ranges to be operated on within different cores. Last but not least, the Transform keyword indicates that the Execute function to implement will contain the TransformAccess parameter to supply movement data to external Transform references. To conceptualize all of these, think of an array of 800 elements that you iterate over in a regular for loop. What if you had an 8-core system and each core could do the work for 100 entities automagically? A-ha! That's exactly what the system will do.

Using Jobs speeds up the iteration task
Figure 11. Using Jobs speeds up the iteration task significantly.

The Transform keyword on the end of the interface name simply gives us the TransformAccess parameter for our Execute method. For now, just know each ship's individual transform data is passed in for each Execute invocation. Now let's look at the AddShips() and Update() method in our game manager to see how this data is set every frame.

using UnityEngine;
using UnityEngine.Jobs;

namespace Shooter.JobSystem
{
    public class GameManager : MonoBehaviour
    {

        // ...
        // GameManager classic members
        // ...

        TransformAccessArray transforms;
        MovementJob moveJob;
        JobHandle moveHandle;

        
        // ...
        // GameManager code
        // ...
    }
}

<Job/Job_GameManagerVars.cs>

Code sample showing required variables to set up and track jobs

Right away, you notice that you have some new variables that you need to keep track of:

  • TransformAccessArray is the data container that will hold a modified reference to each ship's Transform (job-ready TransformAccess). The normal Transform data type isn't thread-safe, so this is a convenient helper type to set movement related data for your GameObjects.
  • MovementJob is an instance of the job struct we just created. This is what we will be using to configure our job in the job system.
  • JobHandle is the unique identifier for your job that you use to reference your job for various operations, such as verifying completion. You'll receive a handle to your job when you schedule it.
void Update()
{
    moveHandle.Complete();

    if (Input.GetKeyDown("space"))
        AddShips(enemyShipIncremement);

    moveJob = new MovementJob()
    {
        moveSpeed = enemySpeed,
        topBound = topBound,
        bottomBound = bottomBound,
        deltaTime = Time.deltaTime
    };

    moveHandle = moveJob.Schedule(transforms);

    JobHandle.ScheduleBatchedJobs();
}

void AddShips(int amount)
{
    moveHandle.Complete();

    transforms.capacity = transforms.length + amount;

    for (int i = 0; i < amount; i++)
    {
        float xVal = Random.Range(leftBound, rightBound);
        float zVal = Random.Range(0f, 10f);

        Vector3 pos = new Vector3(xVal, 0f, zVal + topBound);
        Quaternion rot = Quaternion.Euler(0f, 180f, 0f);

        var obj = Instantiate(enemyShipPrefab, pos, rot) as GameObject;

        transforms.Add(obj.transform);
    }
}

<Jobs/Jobs_GameManager_Update_addShips.cs>

Code sample showing C# Job System + Classic Update() and AddShips() implementations

Now you need to keep track of our job and make sure that it completes and reschedules with fresh data each frame. The moveHandle.Complete() line above guarantees that the main thread doesn't continue execution until the scheduled job is complete. Using this job handle, the job can be prepared and dispatched again. Once moveHandle.Complete() returns, you can proceed to update our MovementJob with fresh data for the current frame and then schedule the job to run again. While this is a blocking operation, it prevents a job from being scheduled while the old one is still being performed. Also, it prevents us from adding new ships while the ships collection is still being iterated on. In a system with many jobs we may not want to use the Complete() method for that reason.

When you schedule MovementJob at the end of Update(), you also pass it the list of all the transforms to be updated from ships, accessed through the TransformAccessArray. When all jobs have completed setup and schedule, you can dispatch all jobs using the JobHandle.ScheduleBatchedJobs() method.

The AddShips() method is similar to the previous implementation with a few small exceptions. It double-checks that the job has completed in the event the method is called from somewhere else. That shouldn't happen, but better safe than sorry! Also, it saves off a reference to the newly spawned transforms in the TransformAccessArray member. Let's see how the work distribution and performance look.

With C# the number of figures is double
Figure 12. Using the C# Job System, we can nearly double the number of objects on the screen from the classic system in the same frame time (~33 ms).

C# job system + classic Profiler View
Figure 13. C# job system + classic Profiler View.

Now you can see that the Movement and UpdateBoundingVolumes jobs are taking about 4 ms per frame. Much better! Also note that there are nearly double the number of ships on the screen as the classic system!

We can still do better, however. This current method is still limited by a few things:

  • GameObject instantiation is a lengthy process involving system calls for memory allocation.
  • Transforms are still allocated in a random location in the heap.
  • Transforms still contain unused data, polluting cache lines and making memory access less efficient.

3. Entity Component System Using Jobs

This is where things get just a little bit more complex, but once you understand it you'll know it forever. Let's tackle this by looking at our new enemy ship prefab first:

Entity Component System ship prefab.
Figure 14. C# job system + Entity Component System ship prefab.

You'll probably notice a few new things. One, there are no built-in Unity components attached, aside from the Transform component (which isn't used). This prefab now represents a template that we will use to generate entities rather than a GameObject with components. The idea of a prefab doesn't exactly apply to the new system in the same way you are used to. You can look at it as a convenient container of data for your entity. This could all be done purely in script as well. You also now have a GameObjectEntity.cs script attached to the prefab. This required component signifies that this GameObject will be treated like an entity and use the new entity component system. You see that the object now also contains a RotationComponent, a PositionComponent, and a MoveSpeedComponent. Standard components such as position and rotation are built-in and don't need to be explicitly created, but MoveSpeed does. On top of that, we have a MeshInstanceRendererComponent, which exposes a public member for a material reference; the material must support GPU instancing, which is required for the new entity component system. Let's see how these tie into the new system.

using System;
using Unity.Entities;

namespace Shooter.ECS
{
    [Serializable]
    public struct MoveSpeed : IComponentData
    {
        public float Value;
    }

    public class MoveSpeedComponent : ComponentDataWrapper<MoveSpeed> { }
}

<ECS/ECS_MoveSpeed.cs>

Code sample showing how to set up MoveSpeed data (IComponentData) for the Entity Component System

When you open one of these data scripts, you see that each structure inherits from IComponentData. This flags the data as a type to be used and tracked by the entity component system and allows the data to be allocated and packed in a smart way behind the scenes while you get to focus purely on your gameplay code. The ComponentDataWrapper class allows you to expose this data to the inspector window of the prefab it's attached to. You can see that the data you've associated with this prefab represents only the parts of the Transform component required for basic movement (position and rotation) and the movement speed. This is a clue that you won't be using Transform components in this new workflow.

Let's now look at the new version of the GameplayManager script:

using Unity.Collections;
using Unity.Entities;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class GameManager : MonoBehaviour
    {
        EntityManager manager;
        

        void Start()
        {
            manager = World.Active.GetOrCreateManager<EntityManager>();
            AddShips(enemyShipCount);
        }

        void Update()
        {
            if (Input.GetKeyDown("space"))
                AddShips(enemyShipIncremement);
        }

        void AddShips(int amount)
        {
            NativeArray<Entity> entities = new NativeArray<Entity>(amount, Allocator.Temp);
            manager.Instantiate(enemyShipPrefab, entities);

            for (int i = 0; i < amount; i++)
            {
                float xVal = Random.Range(leftBound, rightBound);
                float zVal = Random.Range(0f, 10f);
                manager.SetComponentData(entities[i], new Position { Value = new float3(xVal, 0f, topBound + zVal) });
                manager.SetComponentData(entities[i], new Rotation { Value = new quaternion(0, 1, 0, 0) });
                manager.SetComponentData(entities[i], new MoveSpeed { Value = enemySpeed });
            }
            entities.Dispose();
        }
    }
}

<ECS/ECS_GameManager.cs>

Code sample showing C# Job System + Entity Component System Update() and AddShips() implementations

We've made a few changes to enable the entity component system to use the script. Notice you now have an EntityManager variable. You can think of this as your conduit for creating, updating, or destroying entities. You'll also notice the NativeArray<Entity> type constructed with the amount of ships to spawn. The manager's instantiate method takes a GameObject parameter and the NativeArray<Entity> setup that specifies how many entities to instantiate. The GameObject passed in must contain the previously mentioned GameObjectEntity script along with any needed component data. The EntityManager creates entities based off of the data components on the prefab while never actually creating or using any GameObjects.

After you create entities, iterate through all of them and set each new instance's starting data. This example sets the starting position, rotation, and movement speed. Once that's done, the new data containers, while secure and powerful, must be freed to prevent memory leaks. The movement system can now take over the show.

using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class MovementSystem : JobComponentSystem 
	{
        [ComputeJobOptimization]
        struct MovementJob : IJobProcessComponentData<Position, Rotation, MoveSpeed>
        {
            public float topBound;
            public float bottomBound;
            public float deltaTime;

            public void Execute(ref Position position, [ReadOnly] ref Rotation rotation, [ReadOnly] ref MoveSpeed speed)
            {
                float3 value = position.Value;

                value += deltaTime * speed.Value * math.forward(rotation.Value);

                if (value.z < bottomBound)
                    value.z = topBound;

                position.Value = value;
            }
        }

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            MovementJob moveJob = new MovementJob
            {
                topBound = GameManager.GM.topBound,
                bottomBound = GameManager.GM.bottomBound,
                deltaTime = Time.deltaTime
            };

            JobHandle moveHandle = moveJob.Schedule(this, 64, inputDeps);

            return moveHandle;
        }
    }
}

<ECS/ECS_MovementSystem.cs>

Code sample showing C# Job System + Entity Component MovementSystem implementation

Here's the meat and potatoes of the demo. Once entities are set up, you can isolate all relevant movement work to your new MovementSystem. Let's cover each new concept from the top of the sample code to the bottom. 

The MovementSystem class inherits from JobComponentSystem. This base class gives you the callbacks you need to implement, such as OnUpdate(), to keep all of the system-related code self-contained. Instead of having an uber-GameplayManager.cs, you can perform system-specific updates in this neat package. The idea of JobComponentSystem is to keep all data and lifecycle management contained in one place.


The MovementJob structure encapsulates all information needed for your job: the per-instance data, fed in via parameters of the Execute function, and the per-job data, held in member variables that are refreshed in OnUpdate(). Notice that all per-instance data is marked with the [ReadOnly] attribute except the position parameter. That is because in this example we are only updating the position each frame; the rotation and movement speed of each ship entity are fixed for its lifetime. The actual Execute function contains the code that operates on all of the required data.

You may be wondering how all of the position, rotation, and movement speed data is being fed into the Execute function invocations. This happens automatically for you behind the scenes. The entity component system is smart enough to automatically filter and inject the data for all entities that contain the IComponentData types specified as template parameters to IJobProcessComponentData.

using Unity.Collections;
using Unity.Entities;
using Unity.Jobs;
using Unity.Mathematics;
using Unity.Transforms;
using UnityEngine;

namespace Shooter.ECS
{
    public class MovementSystem : JobComponentSystem 
	{

        // ...
        // Movement Job
        // ...

        protected override JobHandle OnUpdate(JobHandle inputDeps)
        {
            MovementJob moveJob = new MovementJob
            {
                topBound = GameManager.GM.topBound,
                bottomBound = GameManager.GM.bottomBound,
                deltaTime = Time.deltaTime
            };

            JobHandle moveHandle = moveJob.Schedule(this, 64, inputDeps);

            return moveHandle;
        }
    }
}

<ECS/ECS_MovementOnUpdate.cs>

 Code sample showing C# Job System OnUpdate() method implementation

The OnUpdate() method below MovementJob is also new. This is a virtual function provided by JobComponentSystem so that you can more easily organize per-frame setup and scheduling within the same script. All it's doing here is:

  • Setting up the MovementJob data to use freshly injected ComponentDataArrays (per-entity-instance data)
  • Setting up per-frame data (time and bounds)
  • Scheduling the job

Voila! Our job is set up and completely self-contained. The OnUpdate() function will not be called until you first instantiate entities containing this specific group of data components. If you decided to add some asteroids with the same movement behavior, all you would need to do is add those same three Component scripts containing the data types to the representative GameObject that you instantiate. The important thing to know here is that the MovementSystem doesn't care what entity it's operating on; it only cares whether the entity contains the types of data it needs. There are also mechanisms available to help control entity life cycle.

Figure 15. Running at the same frame time of ~33 ms, we can now have 91,000 objects on screen at once using the entity component system.

The available CPU time tracks more objects
Figure 16. With no dependencies on classic systems, the entity component system can use the available CPU time to track and update more objects.

As you can see in the profiler window above, you've now lost the transform update method that was taking up quite a bit of time on the main thread in the C# job system and Classic combo shown above. This is because we are completely bypassing the TransformArrayAccess conduit we had previously, directly updating position and rotation information in MovementJob, and then explicitly constructing our own matrix for rendering. This means there is no need to write back to a traditional Transform component. Oh yeah, and we've forgotten about one tiny detail: the Burst compiler.

Burst Compiler

Now we'll take exactly the same scene and do absolutely nothing to the code beyond keeping the [ComputeJobOptimization] attribute above our job structure, which allows the Burst compiler to pick up the job, and we'll get all of the following benefits. Just make sure that the Use Burst Jobs setting is selected in the Jobs dropdown window shown below.

The dropdown allows the use of Burst Jobs
Figure 17. The dropdown allowing the use of Burst Jobs.

 Burst Jobs allows 150,000 objects at once
Figure 18. By simply allowing Burst Jobs to optimize jobs with the [ComputeJobOptimization] attribute, we go from 91,000 objects on screen at once up to 150,000 with much higher potential.

Time to complete MovementJob from 25 to 2 ms
Figure 19. In this simple example, the total time to complete all MovementJob and UpdateRotTransTransform tasks went from 25 ms down to only 2 ms completion time. We can now see that the bottleneck has shifted from the CPU to the GPU, as the cost of rendering all of these tiny ships on the GPU now outweighs the cost of tracking, updating, and render command generation / dispatch on the CPU side.

As you can see from the screenshot, we've got 59,000 more entities on screen at the same exact frame rate. For FREE. That's because the Burst compiler was able to perform some arcane magic on the code in the Execute() function, leveraging new tightly packed data layout and the latest architecture enhancements available on modern CPUs behind the scenes. As mentioned above, this arcane magic actually takes the form of auto-vectorization, optimized scheduling, and better use of instruction-level parallelism to reduce data dependencies and stalls in the pipeline.

Conclusion

Take a few days to soak in all of these new concepts and they'll pay dividends on subsequent projects. The performance gains reaped from these new systems are a currency that can be spent or saved.

Table 1. Optimizations resulted in significant improvements, such as the number of objects supported on the screen and update costs.

Classic:
  • Total Frame Time: ~ 33 ms / frame
  • # Objects on Screen: 16,500
  • MovementJob Time Cost: ~ 2.5 ms / frame
  • CPU Rendering Time Cost To Draw All Ships: 10 ms / frame
  • Time GPU bound: ~ 0 ms / frame

C# Job System + Classic:
  • Total Frame Time: ~ 33 ms / frame
  • # Objects on Screen: 28,000
  • MovementJob Time Cost: ~ 4 ms / frame
  • CPU Rendering Time Cost To Draw All Ships: 18.76 ms / frame
  • Time GPU bound: ~ 0 ms / frame

C# Job System + Entity Component System (Burst Off):
  • Total Frame Time: ~ 33 ms / frame
  • # Objects on Screen: 91,000
  • MovementJob Time Cost: ~ 4 ms / frame
  • CPU Rendering Time Cost To Draw All Ships: 18.92 ms (job to calculate rendering matrices) + 3 ms (rendering commands) = 21.92 ms / frame
  • Time GPU bound: ~ 0 ms / frame

C# Job System + Entity Component System (Burst On):
  • Total Frame Time: ~ 33 ms / frame
  • # Objects on Screen: 150,000+
  • MovementJob Time Cost: ~ < 0.5 ms / frame
  • CPU Rendering Time Cost To Draw All Ships: ~ 4.5 ms (job to calculate rendering matrices) + 4.27 ms (rendering commands) = 8.77 ms / frame
  • Time GPU bound: ~ 15.3 ms / frame

If you're targeting a mobile platform and want to significantly reduce battery consumption as a factor in player retention, just take the gains and save them. If you're making a high-end desktop experience catering to the PC master race, use those gains to do something special with cutting-edge simulation or destruction tech to make your environments more dynamic, interactive, and immersive. Stand on the shoulders of giants and leverage this revolutionary new tech to do something previously claimed impossible in real time, then put it on the asset store so I can use it.

Thanks for reading. Stay tuned for more samples from Unity—watch this space!

Resources

Unity

Unity Entity Component System Documentation

Unity Entity Component System Samples

Unity Entity Component System Forums

Unity Documentation

Learning about efficient memory layout

Ship Asset Used in Demo

Convert SPIR-V to Intel® SPMD Program Compiler (Intel® SPC)


There is a growing trend within the games industry to move compute work to the graphics processing unit (GPU), resulting in engines and/or studios developing large portfolios of GPU compute shaders for many different compute tasks. However, there are times when it would be convenient to run those compute shaders on the CPU without having to re-invest in developing C/C++ variants of them. There are many reasons for doing this, including simple experimentation and debugging, utilizing spare CPU cycles and encouraging CPU-based content scaling, CPU-based interaction with other CPU-side game assets, deterministic consistency of results, and so on.

To help address this opportunity while also utilizing the single instruction, multiple data (SIMD) vector units built into modern CPU cores, we have started developing a prototype translator, based on the open source Khronos* SPIRV-Cross project, that will take Standard Portable Intermediate Representation (SPIR-V1) as input and produce Intel® SPMD Program Compiler (Intel® SPC) kernels as output. ISPC takes C-style kernels and generates highly vectorized CPU object files targeting multiple ISAs such as Intel® Streaming SIMD Extensions (Intel® SSE), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions 512 (Intel® AVX-512).

The project should be considered as a starting point for conversion to Intel® SPC rather than a fully featured and performant solution. The project currently supports a subset of the standard SPIR-V intrinsics, built-ins and types, but it was designed to utilize a core performance feature of Intel® SPC, which is the notion of uniform (scalar) or varying (vector) variables. This allows optimizations such as avoiding expensive and divergent vector branches if the test condition is scalar.
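As a rough illustration of that distinction, consider the following hand-written ISPC kernel sketch (this is not output of the translator; the kernel name and parameters are made up for illustration):

// 'uniform' values are scalars shared by all SIMD lanes; plain (varying) values differ per lane.
export void scale_if(uniform float data[], uniform int count, uniform bool enabled)
{
    if (enabled)                      // uniform condition: evaluated once, no divergent vector branch
    {
        foreach (i = 0 ... count)     // 'i' is varying: each SIMD lane processes a different element
        {
            data[i] = data[i] * 2.0;  // vectorized across the gang
        }
    }
}

Because the if test is uniform, the compiler does not have to generate masked, divergent code for the branch, which is exactly the kind of optimization the translator tries to preserve.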

The code has been tested on a handful of shaders, such as the compute examples in Sascha Willems’ Vulkan* repository and the particle system compute shaders in Microsoft DirectX* 12 MiniEngine sample.

The code can be downloaded from our GitHub* repository and has currently only been tested on Windows* systems. The GitHub Readme contains more detailed documentation about the implementation and supported features.

Usage

Compile the GLSL compute shader to SPIR-V:

glslangValidator.exe -H -V -o test.spv test.comp

Translate the SPIR-V module to an Intel SPC (ISPC) kernel:

spirv-cross.exe --ispc --output test.ispc test.spv

Compile the ISPC kernel to a vectorized object file and header (here targeting Intel AVX2):

ispc.exe -O2 test.ispc -o test.ispc.obj -h test.ispc.h --target=avx2 --opt=fast-math

Example API Usage

// Query the workgroup size baked into the generated ISPC kernel.
ispc::raytracing_get_workgroup_size(workgroupSize[0], workgroupSize[1], workgroupSize[2]);

// Number of workgroups needed to cover the output texture.
int32_t dispatch[3] = { textureComputeTarget.width / workgroupSize[0], textureComputeTarget.height / workgroupSize[1], 1 };
int32_t dispatch_count = dispatch[0] * dispatch[1];

// Run one workgroup per task, in parallel across CPU cores.
concurrency::parallel_for<uint32_t>(0, dispatch_count, [&](uint32_t dispatchID)
{
    // Convert the linear dispatch index into a 2D workgroup ID (x = column, y = row).
    int32_t workgroupID[3] = { dispatchID % dispatch[0], dispatchID / dispatch[0], 0 };
    ispc::raytracing_dispatch_single(workgroupID, dispatch, planeCount, *pPlanes, sphereCount, *pSpheres, *pUBO, resultImage);
});

This project is open sourced under the original SPIRV-Cross Apache 2.0 license and we welcome any comments and contributions.

Further information on using ISPC in games can be found in the article, Use the Intel® SPMD Program Compiler for CPU Vectorization in Games.

Footnote

1. SPIR-V is the default shader language for Vulkan and can be generated from OpenGL* Shading Language (GLSL) and High-Level Shading Language (HLSL) shaders by tools such as the glslangValidator and shaderc compilers.

 

Code Sample: New Unity* Entity Component, C# Job System, and Burst Compiler


File(s): Download
License: Intel Sample Source Code License Agreement
Optimized for...
OS: Windows® 10 (64 bit)
Hardware: N/A
Software (Programming Language, tool, IDE, Framework): C#, Microsoft Visual Studio* 2015, Unity
Prerequisites: Familiarity with Microsoft Visual Studio, 3D graphics, parallel processing.
Tutorial: Getting Started with the New Unity* Entity Component, C# Job System, and Burst Compiler

Introduction

Low, medium, and high. Standard fare for GPU settings, but why not CPU settings, too? Today the potential power of the CPU on your end users’ machines can vary wildly. Typically, developers will define a CPU min-spec, implement the simulation and gameplay systems using that performance target, and then call it a day. This leaves the many potentially available cores and features built into modern mainstream CPUs sitting idle on the sideline. The new C# job system and entity component system from Unity* don’t just allow you to easily leverage previously unused CPU resources, they will also help run all your game code more efficiently in general. Then you can use those extra CPU resources to add more scene dynamism and immersion. In this article, you’ll see how to quickly get started learning these new features.

Unity is attacking two important performance problems for computing in game engines. The first problem under assault is inefficient data layout. Unity’s Entity Component System (ECS) improves management of data storage for high-performance operations on those structures. The second problem is the lack of a high-performance job language and SIMD vectorization that can operate on that well-organized data. Unity’s new C# job system, entity component system and Burst compiler technology leave those shortcomings in the dust.

The Unity entity component system and C# job system are two different things, but they go hand-in-hand. To get to know them, let’s look at the current Unity workflow for creating an object in your scene, and then differentiate from there.

Get Started

Refer to the tutorial linked above.

References

Unity
Unity Entity Component System Documentation
Unity Entity Component System Samples
Unity Entity Component System Forums
Unity Documentation
Learning about efficient memory layout
Ship Asset Used in Demo

Update Log

Created May 17, 2018

Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs


Introduction

The purpose of this document is to help developers speed up the execution of programs that use popular deep learning frameworks under the hood. We have observed situations where deep learning code, with its default settings, does not take advantage of the full compute capability of the underlying machine, and this is especially common when the code runs on Intel® Xeon® processors.

Optimization

The primary goal of the performance optimization tips given in this section is to make use of all the cores available in the machine. Intel® DevCloud consists of Intel® Xeon® Gold 6128 processors.

Assume that the number of cores per socket in the machine is denoted by NUM_PARALLEL_EXEC_UNITS. On the Intel DevCloud, set NUM_PARALLEL_EXEC_UNITS to 6.
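If you prefer to derive the value programmatically, a minimal sketch (assuming a hyper-threaded Intel Xeon system, where the physical core count is roughly half of the logical CPU count Python reports) is:

import multiprocessing

# On the Intel DevCloud (Intel Xeon Gold 6128, 6 cores per socket), simply use 6.
# Elsewhere, approximate the physical core count from the logical CPU count.
NUM_PARALLEL_EXEC_UNITS = max(1, multiprocessing.cpu_count() // 2)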

TensorFlow

To get the best performance from a machine, change the parallelism threads and OpenMP* settings as below:

import os
import tensorflow as tf

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

Keras with TensorFlow Backend

To get the best performance from a machine, change the parallelism threads and OpenMP settings as below:

import os
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

K.set_session(session)

os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

Caffe

To get the best performance from the underlying machine, change the OpenMP settings as below:

export OMP_NUM_THREADS=6  # NUM_PARALLEL_EXEC_UNITS (6 on the Intel DevCloud)

export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

In general:

export OMP_NUM_THREADS=<number of threads to use>

export KMP_AFFINITY=<your affinity settings of choice>

For example:

KMP_AFFINITY=granularity=fine,balanced

KMP_AFFINITY=granularity=fine,compact

Conclusion

Even though we have observed a speedup in most cases, please note that performance is largely code-dependent and there can be multiple other factors that affect it. A good profiling tool like Intel® VTune™ Amplifier can help you dig deeper and analyze performance problems.

Author

Anju Paul is a Technical Solutions Engineer working on behalf of the Intel® AI Academia Program.

 

 

The 5G Network Transformation


Overview

Internet traffic has undergone tremendous growth over the years and shows no signs of slowing down. For example, in Staying Connected in 2017: Our Predictions, AT&T* reports that the traffic on their network has grown 250,000 percent since 2007. People are adding new devices to their homes, and new data-hungry applications are being developed for work, connectivity, entertainment, gaming and more. In addition to the amount of data required, many applications are also latency-sensitive. This means networks have to handle large volumes of data faster than ever, and without added cost to the end user or subscriber.

Enter 5G

With 5G, a user will be able to download a high-definition video in under a second (a task that could take up to 10 minutes on 4G LTE). 5G networks will boost the development of other new technologies, such as autonomous vehicles, virtual reality, smart agriculture, remote emergency and medical services, and the Internet of Things (IoT).

conceptual representation of 5g and data economy interaction
Figure 1. 5G is critical to a new data economy.

In addition to being a dramatically better mobile broadband system, 5G is an innovation platform for services, applications, and connected devices.

Connecting the World

According to Introducing OpenCellular: An open source wireless access platform, at the end of 2015 approximately half the world's population did not have internet access. The OpenCellular Project, founded by Facebook*, is designed to support a range of communication options, from a network in a box to an access point supporting everything from 2G to LTE. Facebook plans to open-source the hardware design, along with necessary firmware and control software, to enable telecom operators, entrepreneurs, OEMs, and researchers to locally build, implement, deploy, and operate wireless infrastructure based on this platform.

This project empowers the developer community to contribute to the goal of getting to 100 percent connectivity in 5G. To be successful, new 5G technology must be designed to connect with legacy 2G, 3G, 4G LTE, Wi-Fi*, and wired networks. Implementing the new generation of networks in this way also means operational efficiency for the whole network, and it will help operators bring down costs even for existing users, allowing them to remain competitive and grow.

The Pillars of 5G Wireless Technology

The following summary is based on Everything You Need to Know About 5G, by Amy Nordrum, Kristen Clark, and IEEE Spectrum* Staff.

The five pillars below are the foundation of 5G technology.

Millimeter wave

Current networks use the 3 kHz to 6 GHz spectrum, which is getting crowded due to the explosion of data from smart phones and other connected devices. Next-generation technologies will use the 30-300 GHz spectrum, known as millimeter wave or mmWave, available for mobile broadband communications for the first time. The associated leap in performance can deliver fiber-like speeds, without the wires.

Small cells

Millimeter waves cannot travel through buildings and they can be absorbed by plants and rain. This is why the current technology of big base stations broadcasting their signals over long distances will not work in 5G. To solve this problem, 5G will use thousands of low-powered mini base stations.

Massive MIMO

Advanced multiple input multiple output (MIMO) antenna technology, including adaptive analog beamforming and beam tracking/steering techniques, can increase data rate, coverage, and capacity at base stations and within devices.

Beamforming

It's like a traffic signaling system for cellular signals, allowing data to be sent by the base station to a specific user, instead of broadcasting in all directions, hoping the user will receive it. This precision prevents interference and is much more efficient than the current technology, enabling base stations to handle a larger number of incoming and outgoing data streams simultaneously. The base station uses the direction of the source stream to calculate where the user device is located and determine where to send the stream.

Full duplex

Today's base station transceivers can't simultaneously send and receive signals on the same frequency. 5G transceivers support full duplex transmission, which enables sending and receiving signals on the same frequency at the same time. An alternate solution is to time-division the signal, meaning the incoming and outgoing data are sent at the same frequency, interleaved in a known pattern so the receiving base station can differentiate between the two. Researchers have designed high-speed silicon transistor switches to halt the backward roll of these signals so both can be sent at the same time, improving the spectral efficiency of the signals.

5G and Network Transformation

The previous sections illustrate how 5G is a fundamentally different technology in terms of how the physical layer of the technology works, how it interacts with legacy technologies, and the use cases themselves. Now let us talk about the requirements and challenges of 5G and how they can be addressed by various network elements.

Challenge 1

Network hardware elements (user equipment, modems, antennas, etc.) operating at the physical layer need to work at a much higher speed, support greater bandwidth, and be backward-compatible.

Solution 1

The 5G standard has been defined such that, other than the physical layer, it is backward-compatible. Though the actual bits and bytes travel on different frequency bands and need different modems, the upper-layer protocols do not change much. For this reason, most of the 5G software stack remains unchanged; however, it does need to support higher bandwidth and speeds, a challenge that will be met by a combination of software and hardware architecture designs.

Challenge 2

Network elements must be scalable and agile so that services can be brought up and down fast, as required by user-generated demand.

Solution 2

Software-Defined Networking (SDN) and Network Function Virtualization (NFV) will play a key role, since these technologies will enable network functions to be modularized and to run commodity-based servers. These servers, sitting in service providers' data centers and at the edge, can spin network services up and down on-the-go to meet user demands.

  • Virtual network functions (VNFs) hosted in containers and micro services will help reduce the footprint of network services. Compared to VM-based virtual network functions, they cost less to operate and are faster to bring up and down. Security concerns related to containerization can be addressed by using secure Kata containers as required by the network.
  • Orchestration frameworks will play a critical role, since automation and interworking of network functions across industries and vendors will be the key to scale a 5G network. Various open source projects like OpenStack*, OPNFV*, ONAP*, and Open Baton* are working in this direction.

Challenge 3

Networks need to be flexible without compromising on throughput and bandwidth requirements. Different use cases need to be optimized for different network service level agreements (SLAs). For example, use cases like automated driving and remote medicine are extremely latency-sensitive, while applications like gaming, augmented reality (AR), and virtual reality (VR) demand high bandwidth and low latency. A smart agricultural application has massive bandwidth demand due to scale but is not latency-sensitive. If these different scenarios are to be serviced by the same network, the network must classify the packets as belonging to a particular group, or network slice, and process them according to a set of rules. This requires extensive traffic classification and scheduling to be implemented in all the network nodes through which the traffic passes. This implementation needs to be flexible enough so traffic classes and slices can be defined dynamically and not be tied to what has been preprogrammed in the hardware.

Service assurance: 5G's stringent requirements leave no room for error in terms of how network service classes are handled. If a service class (see Figure 2 for the different service classes) is guaranteed to meet an SLA that requires sub-microsecond latency, the underlying network infrastructure has to reserve resources to make that happen. It is vital to monitor systems for utilization and malfunctions, in order to prevent service disruptions or to facilitate the prompt resumption of normal service. Today, monitoring and management activities throughout the network are supported by discrete systems in fixed service chains with tightly integrated hardware and software products as well as established management frameworks and assurance tools. In a virtualized environment based on NFV, these activities are more challenging as a result of the disaggregation of hardware and software and the ability to deploy services dynamically.

infographic depicts cloud services matched to different needs
Figure 2. Matching cloud services to diverse delivery needs.

Solution 3

Using a well-designed software stack for the core and edge – supported by hardware designed to have the flexibility needed to enable traffic to be classified, sliced, and monitored – is the best way to transform networks and make them ready for 5G. One of the biggest challenges in the networking industry for moving to a more software-based solution has been the fact that these networks have to support legacy devices and be backward-compatible. With 5G, new networks are being deployed, creating the opportunity for a flexible and agile greenfield deployment.

Intel's 5G Solutions

Intel develops leading network technology and building blocks such as silicon, software, connectivity, memory, and integrated solutions to address the demands of next-generation networks. These solutions provide both the flexibility and scalability needed to build, utilize, and optimize tomorrow's network.

It all starts with Intel® Architecture (IA), which provides the performance and scalability necessary to keep up with today's demanding network requirements. Our roadmap of processors starts with Intel Atom® processors and scales to our leading Intel® Xeon® processor, which is purpose-built for the cloud and offers the most advanced foundation for software-defined infrastructure. With a tool chain that allows seamless migration across the IA roadmap, developers and network administrators can run their software on a single architecture – IA.

SDN/NFV

The network of tomorrow will be deployed using SDN and NFV. Instead of running a separate router, VPN, and firewall on three different pieces of hardware, you can run all of them on the same IA-based infrastructure. SDN provides an intelligent network and orchestration software that enables you to swap out hardware without requiring software reconfiguration. This will provide tremendous value to service providers in terms of driving network scale and agility while reducing capital expenditure (CAPEX) and operating expenditure (OPEX).

Visual cloud

Visual computing is exploding. Studies show that video will account for more than 79 percent of traffic traversing the network by 2020. Use cases include AR and VR, video on demand, live streaming, video surveillance, autonomous driving, medical imaging, 3D modeling, and computer/robotic vision. Intel is democratizing the creation and delivery of these compelling visual experiences by incorporating visual compute IP into our Intel Xeon processors with Intel® Graphics Technology. Intel® Quick Sync Video uses the dedicated media processing capabilities of Intel Graphics Technology to decode and encode quickly, while also enabling the processor to perform other tasks for maximum performance.

Infrastructure

Although industry standards are in the planning stage and actual deployments will not occur until 2019 or 2020, there is a lot of buzz about 5G. Telecom operators and equipment manufacturers are becoming 5G ready now. There will be incremental steps to get there, including the continual expansion of LTE and LTE-A. Intel has an end-to-end story (see Figure 3) for both consumers and businesses from devices to access to the core and cloud.

infographic depicts wide range of use for 5G
Figure 3: 5G end to end solution from smart user devices to core and cloud

Building blocks

For 5G infrastructure, building blocks include FlexRAN, which is a vRAN software reference platform, and Multi-access Edge Computing (MEC), with products that can be deployed today to provide lower latency and more connectivity. These components will ultimately provide a best-in-class user experience.

FlexRAN

Wireless base stations, like most network nodes, have traditionally been vertically integrated boxes. FlexRAN (see Figure 4) is a reference architecture developed by Intel to implement software-based radio base stations that can sit on any part of the wireless network, from edge to core.

colorful design for a software base station
Figure 4. FlexRAN: A reference design for a software base station

Multi Access Edge Computing (MEC)

MEC implements software services close to the user to meet the low-latency and high-bandwidth requirements of newer networks (4G, 4G LTE, 5G). It opens the door to new types of applications that can use information such as real-time access to radio network information and location awareness. MEC unlocks the network to a new ecosystem of services at the network edge.

Intel offers the Network Edge Virtualization (NEV) SDK platform with standard APIs and interfaces for developers and content providers. The NEV SDK is part of the open source Akraino project.

infographic depicts Multi Access Edge Computing
Figure 5. Multi Access Edge Computing (MEC)

What's in it for Developers?

Analysts predict a USD 12 trillion opportunity for 5G-related goods and services available globally in 2035. From creating innovative applications and services at the edge, to building a new SDN/NFV infrastructure for the datacenter and cloud, or connecting the world through initiatives like Facebook's OpenCellular Project, developers will be at the heart of the 5G transformation. We at Intel look forward to making the journey with you.

About the Author

Sujata Tibrewala is an Intel community development manager and technology evangelist who defines programs and training events to ensure that the network developer ecosystem works together toward a common goal: to drive SDN/NFV adoption in the industry using open source ingredients such as DPDK, FD.io, Tungsten Fabric, Open vSwitch, OpenStack, ONAP, and more. Sujata has worked at several companies, including Cisco, Agere, Ericsson, Avaya, and Brocade, leading all phases of diverse software technology projects such as an SDN OpenFlow implementation, TCP/IP/Ethernet/VLAN forwarding software development on Cisco switches and network processors, and cloud deployments using virtualization technologies. She has a Masters from IISc Bangalore and a Bachelors from IIT Kharagpur, and has completed an Executive Women Leadership Program at Stanford.

 


How to Build a Custom Audio Editor with Unreal Engine* for Sound Spatialization in VR


woman in a virtual environment

Overview

Unreal Engine* from Epic Games has a powerful virtual reality (VR) editor option, but one thing it does not include is the ability to edit and place sounds while inside VR. It can be troublesome to have to constantly restart the editor after adjusting a sound just to test what it sounds like in VR. So we decided to create a sound editor that allows game developers and sound designers to quickly place, edit, and test spatialized sound inside VR, without having to constantly enter and exit the editor.

woman in a virtual environment

System requirements

  • Unreal Engine 4.18.1 or greater
  • Visual Studio* 2017
  • HTC Vive*

What you'll learn

  • Motion controller interaction
  • How to create a custom C++ class
  • VR UI
  • Saving editor changes
  • Sound spatialization parameters

Below, we will walk you through step-by-step to demonstrate how we made this custom audio editor tool for Unreal Engine from start to finish:

Project link download

Before we begin, you need to do a couple of things. Download and unzip the project folder. You also need to make sure you have at least version 4.18.1 of Unreal Engine installed.

When you have downloaded and unzipped the folder, right-click on Intel_VR_Audio_Tools.uproject and select "Generate Visual Studio project files." After that completes, open the project. A popup that says "Missing Intel_VR_Audio_Tools Modules" will appear. Click "Yes" to start the rebuild; this should take less than 20 seconds. This is needed because of how you are dynamically finding .wav files that have been added to the project, which will be explained in the Custom C++ Class section.

Setting Up the VR Player

We start with Unreal's VR template and choose the MotionControllerPawn as our pawn, which has motion control set up and allows movement by teleporting.

Motion Controller Interaction

Before the motion controller can interact with 3D widgets, a WidgetInteraction component needs to be added to BP_MotionController, which is located in the VirtualRealityBP folder. We also need to add a scene component for the sound selector widget, called soundScene.

widget, motion controller options screenshot

The Press and Release Pointer Key nodes are attached to the event called when the right trigger is pulled. We need to add this to the MotionControllerPawn, which is also located in the VirtualRealityBP folder.

screenshot of widget interaction with right controller

Custom C++ Class

The reason for rebuilding the project is that, while making this tutorial, the problem of knowing the names and locations of the sounds and dynamically updating a widget to match all those files appeared daunting. Luckily, Unreal Engine has some tools to help us out.

The IntelSoundComponent is a C++ class that can be added to any blueprint to dynamically locate and load a .wav file into a USoundWave, which is how Unreal loads a sound file.

First, we right-click in the content browser and create a new C++ class named IntelSoundComponent. This action creates an IntelSoundComponent.cpp file and an IntelSoundComponent.h file.

Next, we add some includes which are needed to locate and manage files.

The includes added in IntelSoundComponent.cpp are Paths.h, FileManager.h, and Runtime/Engine/Classes/Sound/SoundWave.h (which, for some reason, needs the full path in front of SoundWave.h).

#include "IntelSoundComponent.h"
#include "Paths.h"
#include "FileManager.h"
#include "Runtime/Engine/Classes/Sound/SoundWave.h"

bool exists;
FString dir, soundDir;
TArray<FString> soundFiles;

// Sets default values for this component's properties
UIntelSoundComponent::UIntelSoundComponent()
{
	// Set this component to be initialized when the game starts, and to be ticked every frame.  You can turn these features
	// off to improve performance if you don't need them.
	PrimaryComponentTick.bCanEverTick = true;
	
	//Empty soundFiles TArray. Easiest way if new wave files are added.
	soundFiles.Empty();
	
	//the way Unreal Engine calls the project's root directory
	dir = FPaths::ProjectDir();

	//Combining Root with the folder location for the sounds. 
	//This could probably be an external folder if needed with the help of ( IPlatformFile& PlatformFile = FPlatformFileManager::Get().GetPlatformFile(); )
	soundDir = dir + "Content/Sounds";

	//UE4 returns a bool if the directory exists or not.
	exists = FPaths::DirectoryExists(soundDir);
	
}

Code block. IntelSoundComponent.cpp

Now we create a bool named exists; two FString variables named dir and soundDir; and a TArray of FStrings named soundFiles. Since soundFiles is a TArray, we are able to call soundFiles.Empty();, which empties the TArray. We believe it's also the easiest approach if new wave files are added. Then we set FString dir to FPaths::ProjectDir(); (which gives the root location of the project). Next, we set FString soundDir to dir + "Content/Sounds", because that is the folder we put our .wav files into. FPaths has another method that can check whether a directory exists, so we set our bool with exists = FPaths::DirectoryExists(soundDir);.

// Called when the game starts
void UIntelSoundComponent::BeginPlay()
{
	Super::BeginPlay();
	
	//UE4 way of managing files
	IFileManager &fileManager = IFileManager::Get();
	
	//UE_LOG(LogTemp, Warning, TEXT("%s"), &fileManager);

	
	if (exists == true){

		//Extensions to sound files. Was using .wav, but .uasset seems to work when there is and isn't an editor.
		FString ext = "/*.wav";
		FString ext2 = "/*.uasset";
		
		//path = FPaths::ProjectDir() + Content/Sounds + /*.uasset
		FString path = soundDir + ext2;

		//This finds file in the given array, with the given path 
		//the true bool is saying to look for files while false bool is saying to not look for directories
		fileManager.FindFiles(soundFiles, *path, true, false);

Code block. IntelSoundComponent.cpp

In BeginPlay() we start by getting the file manager with IFileManager &fileManager = IFileManager::Get();. We use it to debug and test whether the sound files are being found with fileManager.FindFiles, which now searches for .uasset files instead of the .wav files we were using before, because .uassets are more reliable when sharing projects.

//Setting soundFileArray to soundFiles to pass into blueprint.
void UIntelSoundComponent::soundArray(TArray<FString> &soundFileArray) {
	
	soundFileArray = soundFiles;
	
}

//loading a wav file as a USoundWave so Unreal can set the sound chosen with LoadObject<USoundWave> for blueprint
USoundWave* UIntelSoundComponent::setWavToSoundWave(const FString &fileName) {
	
	USoundWave* swRef;
	FString name = fileName;

	swRef = LoadObject<USoundWave>(nullptr, *name);

	return swRef;

}

Code block. IntelSoundComponent.cpp

Lastly, in the .cpp, we create two functions that will be exposed as blueprint nodes: soundArray (which passes the soundFiles TArray into blueprints) and setWavToSoundWave (which took a while to figure out, because we had to find a way to dynamically reference a .wav file in a form Unreal can understand, which is a USoundWave). For this problem we discovered LoadObject. This function loads an object at runtime as any type we specify, if possible. For us, it was LoadObject<USoundWave>(nullptr, *name);, with *name being the sound chosen by the VR player.

In the IntelSoundComponent.h we create two UFUNCTIONS as a way to make the two functions in the .cpp blueprint callable.

#include "CoreMinimal.h"
#include "Components/SceneComponent.h"
#include "Runtime/Engine/Classes/Sound/SoundWave.h"
#include "IntelSoundComponent.generated.h"



UCLASS( ClassGroup=(Custom), meta=(BlueprintSpawnableComponent) )
class INTEL_VR_AUDIO_TOOLS_API UIntelSoundComponent : public USceneComponent
{
	GENERATED_BODY()


public:	
	// Sets default values for this component's properties
	UIntelSoundComponent();
	
	//Blueprint function to expose soundFiles into blueprint.
	UFUNCTION(BlueprintCallable, Category = IntelAudio)
		void soundArray(TArray<FString> &soundFileArray);

	//Blueprint function passing a wav converted in USoundwave into blueprint.
	UFUNCTION(Category = IntelAudio, BlueprintCallable)
		USoundWave* setWavToSoundWave(const FString &fileName);

Code block. IntelSoundComponent.h

Blueprint function to expose sound files into blueprint.

screenshot of sound array widget

Blueprint function passing a .wav file converted in USoundWave into blueprint.

screenshot of sound widget

Setting Up the UI

We need to set up three UMG widgets.

screenshot of multiple widget set up

We create the blueprints needed to manage those UMG widgets.

screenshot of multiple blue prints to manage umg widgets

We have a few widgets for this project. AudioParamsSliderWidget is the widget that pops up when a sound is selected. soundButtonWidgetBP is just a button widget for the sounds in the Content/Sounds folder. We put a widget called soundSelectorWidgetBP in the level by having an actor blueprint we created, IntelSoundWidgetBP, get the sounds from the soundArray C++ node and populate soundSelectorWidgetBP with soundButtonWidgetBP buttons. (We could do this dynamically, but then we would have to get a reference to the newly spawned actor every time we began play.) All this happens in the IntelSoundManagerBP, which we also placed in the level from the start.

screenshot of sound widget
IntelSoundManagerBP

In the image above, we get the soundFiles TArray of FStrings and split each file name at the period in (name of sound).wav. We send those strings into an array of strings in IntelSoundWidgetBP to name the buttons being dynamically populated.

screenshot of sound widget
IntelSoundWidgetBP

In the IntelSoundWidgetBP we spawn the soundUI,

screenshot of sound widget

add sounds,

screenshot of sound widget

and if we didn't use the Set Widget node, the widget would spawn but not be visible in game.

screenshot of sound widget spawned in VR environment

Sound Parameters

Once the player selects a sound from the widget, an IntelSoundAudioActorBP actor will spawn. In this actor we see the AudioParamsSliderWidgetBP, and if Spatialize? is clicked, three attenuation settings are exposed that can be changed through the widget.

screenshot of sound widget - volume settings

screenshot of sound widget - volume settings

Sound attenuation is essentially the ability of a sound to lower in volume as the player moves away from it.
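As a rough sketch of the linear falloff case (assuming a spherical attenuation shape; the names below are illustrative, not Unreal API symbols), the volume scale applied at distance d from the sound is approximately:

volume(d) = 1 - clamp((d - shapeRadius) / falloffDistance, 0, 1)

so the sound plays at full volume inside the shape and fades to silence over falloffDistance beyond it.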


The three settings exposed are Attenuation Function, Attenuation Shape, and the Falloff Distance.

There are plenty more settings that could be exposed with more time. Here are images of the Attenuation Setting struct in Unreal.

Unreal* attenuation settings

We believe the three settings we chose are the most basic and fundamentally needed. Showing debug lines while changing settings is something we are working on. We are looking for a way to use, in the game, the attenuation debug lines Unreal draws in the editor, but we have not found an answer yet. So we might instead get the shape extents of the chosen attenuation shape and function, and use Unreal's built-in draw debug line nodes.

Saving On Exit

When we exit the game and have spawned sounds, moved them around, and played with the audio parameters, we save all the variables that are important using IntelSaveGameBP through IntelSoundAudioActorBP.

variables save UI
IntelSaveGameBP

screenshot of sound manager blueprint
IntelSoundManagerBP

screenshot of audio actor blueprint
IntelSoundAudioActorBP

Now if everything worked correctly, we should be able to edit any sounds in the folder inside VR.

Intel® Parallel Computing Center at LAMDA Group, Nanjing University


nanjing university logo

Principal Investigators:

Prof. Zhi-Hua Zhou portrait
Prof. Zhi-Hua Zhou is currently Professor and Standing Deputy Director of the National Key Laboratory for Novel Software Technology; he is also the Founding Director of the LAMDA group. His research interests are mainly in artificial intelligence, machine learning, and data mining. He has authored the books Ensemble Methods: Foundations and Algorithms and Machine Learning (in Chinese), and published more than 150 papers in top-tier international journals and conference proceedings.

He has received various awards/honors including the National Natural Science Award of China, the PAKDD Distinguished Contribution Award, the IEEE ICDM Outstanding Service Award, the Microsoft Professorship Award, etc. He also holds 22 patents.

He is an Executive Editor-in-Chief of the Frontiers of Computer Science, Associate Editor-in-Chief of the Science China Information Sciences, Action or Associate Editor of Machine Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, ACM Transactions on Knowledge Discovery from Data, etc. He served as Associate Editor-in-Chief for Chinese Science Bulletin (2008-2014), Associate Editor for IEEE Transactions on Knowledge and Data Engineering (2008-2012), IEEE Transactions on Neural Networks and Learning Systems (2014-2017), ACM Transactions on Intelligent Systems and Technology (2009-2017), Neural Networks (2014-2016),  Knowledge and Information Systems (2003-2008), etc.

He founded ACML (Asian Conference on Machine Learning), served as Advisory Committee member for IJCAI (2015-2016), Steering Committee member for ICDM, PAKDD and PRICAI, and Chair of various conferences such as General co-chair of PAKDD 2014 and ICDM 2016, Program co-chair of SDM 2013 and IJCAI 2015 Machine Learning Track, and Area chair of NIPS, ICML, AAAI, IJCAI, KDD, etc. He is/was the Chair of the IEEE CIS Data Mining Technical Committee (2015-2016), the Chair of the CCF-AI(2012- ), and the Chair of the Machine Learning Technical Committee of CAAI (2006-2015). He is a foreign member of the Academy of Europe, and a Fellow of the ACM, AAAI, AAAS, IEEE, IAPR, IET/IEE, CCF, and CAAI.

Description:

The major goal of this Intel® Parallel Computing Center (Intel® PCC) is to implement a deep forest framework as an alternative to neural networks on KNL and all IA architectures. The deep forest model uses non-differentiable units (i.e., trees and tree ensembles) instead of neural units to construct a multi-layered structure with highly competitive performance compared with current deep models, without the need for GPUs. Due to the properties of tree-ensemble units, such approaches are naturally suited to IA architectures rather than GPU architectures, and they can handle discrete or tabular data better than perceptron-based neural networks. There is significant potential for optimization on IA, especially by utilizing many-core devices such as Intel® Xeon® and Intel® Xeon Phi™ processors. By doing so, we believe a CPU-centered deep learning system can be achieved using decision trees as building blocks instead of neurons.
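To make the layered tree-ensemble idea concrete, here is a minimal, illustrative sketch of a cascade-forest layer in the spirit of deep forest (this is not the Intel PCC code; it assumes scikit-learn is available, uses arbitrary parameter choices, and omits the cross-validation the full method uses to generate class vectors):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

def cascade_layer_fit_transform(X, y, n_estimators=100, random_state=0):
    """Train one cascade layer and return the features augmented with class vectors."""
    forests = [
        RandomForestClassifier(n_estimators=n_estimators, random_state=random_state),
        ExtraTreesClassifier(n_estimators=n_estimators, random_state=random_state),
    ]
    class_vectors = []
    for forest in forests:
        forest.fit(X, y)
        # Class-probability vector produced by this tree ensemble for every sample.
        class_vectors.append(forest.predict_proba(X))
    # The next layer sees the original features plus each ensemble's class vector.
    return np.hstack([X] + class_vectors), forests

# Example: stack two layers on synthetic data.
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)
X1, layer1 = cascade_layer_fit_transform(X, y)
X2, layer2 = cascade_layer_fit_transform(X1, y)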

In other words, after profiling the performance of the current deep forest code, optimizations and modifications to the implementation will be carried out on Intel Xeon devices accordingly. Other variations of deep forest models for specific tasks will also be designed and implemented, with the help of Intel® Many Integrated Core Architecture (Intel® MIC Architecture) and the Intel® AI platform.
 
This Intel Parallel Computing Center will also give students hands-on experience in applying AI technology to solve real-world problems with the help of Intel's AI platforms, including both hardware and software. First, hardware-oriented AI training: the success of AI applications depends on designing efficient platforms, and knowledge of the hardware is a critical step, so students will have access to the latest models for learning and development purposes. Second, software-oriented AI training: writing efficient implementations of AI programs also requires experience with well-maintained IA libraries, such as implementing AI systems with Intel's AI tools, including Intel® Parallel Studio, Intel® Data Analytics Acceleration Library (Intel® DAAL), Intel® Math Kernel Library (Intel® MKL), etc.

Publications:

Zhi-Hua Zhou's Publications

Related Website:

http://lamda.nju.edu.cn

Intel® Parallel Computing Center at First Institute of Oceanography, SOA, China


first institute of oceanography logo

Principal Investigators:

Project Lead, Prof. Fangli Qiao

Professor Fangli Qiao
Fangli Qiao has been working on the development of new-generation ocean and climate models. Working with his research team, he established the surface wave-induced mixing theory, which dramatically improves the performance of different ocean circulation models and climate models. He revealed the key role of sea spray in air-sea heat flux and greatly reduced the decades-standing systematic error in the forecast of typhoon/hurricane intensity. He led a team to design a high-performance parallel scheme and test it with more than 10 million CPU cores. He has served as an editorial board member of Ocean Modelling, Journal of Marine Systems, and other journals.

Co-Project Lead, Prof. Zhenya Song

Zhenya Song obtained a Ph.D. in Physical Oceanography from the Ocean University of China in 2011. Currently, Dr. Song is a professor at FIO and has been working on ocean and climate simulation, HPC, and the effects of surface waves in climate systems since 2004. He was the first to incorporate a surface wave model into global CGCMs, and he then developed a new-generation coupled model named FIO-ESM.

Co-Project Lead, Associate Prof. Xiaomeng Huang

Xiaomeng Huang obtained a Ph.D. in Computer Science from Tsinghua University in 2007. Currently, he is an associate professor in the Department of Earth System Science at Tsinghua University. He focuses on the interdisciplinary field combining ocean modelling and HPC. His research interests include ocean models, parallel computing, and big data.

Description:

Ocean surface waves are crucially important to navigation safety and climate change. A high-resolution global wave model plays a key role in accurate surface wave forecasting and simulation. The MASNUM wave model, developed by FIO, is one of three state-of-the-art wave models in the world and is now widely used by several research groups and operational ocean forecasting systems. This work will focus on implementing the code on new Intel technologies like AEP/HBW memory/CLX-AP and new algorithms like deep learning, to improve the computing performance and simulation ability of the MASNUM wave model. Moreover, it will deliver an open source, high-resolution, large-scale new-generation wave model and development experience to the worldwide ocean community, which will expand both FIO's and Intel's influence on HPC and ocean scientific research.

Related Websites:

http://www.fio.org.cn
http://www.qnlm.ac/ronum/index
http://www.fio.org.cn/team/bodao-detail-1122.htm
http://www.fio.org.cn/team/shuodao-detail-1897.htm

Getting Started with Ubuntu* Core on an UP Squared* Board


Introduction

This article demonstrates to new users how to install Ubuntu* Core on an UP Squared* Grove* IoT Development Kit. The UP Squared board is a low-power, high-performance platform ideal for Internet of Things (IoT) applications. The UP Squared board is based on either the Intel® Celeron® processor (N3350) or the Intel® Pentium® processor (N4200). For more information, visit http://www.up-board.org/upsquared. Ubuntu* Core is a lightweight, transactional version of Ubuntu* designed for IoT and cloud usage. Snaps are universal Linux packages that can be installed on Ubuntu* Core to work on IoT devices and more. For more information on Ubuntu* Core, go to https://www.ubuntu.com/core.

Hardware Requirements

The hardware components used in this project are listed below:

  • UP Squared board
  • 2 USB 2.0 or 3.0 flash drives with at least 2GB free space available
  • A monitor with an HDMI interface
  • USB keyboard and mouse
  • A VGA or HDMI cable
  • A network connection with Internet access or Wi-Fi kit for UP Squared
  • An existing Linux* system is required to generate the RSA key (see Figure 1 below) and to log in with SSH to Ubuntu Core (see Figure 14 below).

Software Requirements

The software requirements used in this project are listed below:

Steps

Download Images

  • Download Ubuntu Core Image 16.04.4

Set Up the Ubuntu SSO Account

  • Create an Ubuntu Account
  • Generate Keys
  • Import Key

Write the USB Drives

  • Download Ubuntu Core Image 16.04.4

Installation

  • Install Live Flash
  • Install Ubuntu Core

Generate a Host SSH Key

The first step is to create an Ubuntu SSO account at https://login.ubuntu.com.
The account is required to create the first user on an Ubuntu Core installation. Click on Personal details to fill out your information.
Next, use an existing Linux system to generate the RSA key by running ssh-keygen -t rsa in the Linux shell as follows:
mkdir ~/.ssh
chmod 700 ~/.ssh
ssh-keygen -t rsa

Figure 1: Generate an SSH key on the Linux shell

Your public key is now available as .ssh/id_rsa.pub in your home folder /home/Ubuntu/.ssh/id_rsa.pub.

  • Click on the SSH keys and insert the contents of your public key /home/Ubuntu/.ssh/id_rsa.pub, then click on Import SSH key.

Figure 2: Submitted the SSH keys successfully

Create a Live USB Ubuntu* Flash Drive

Booting from the Live USB Flash Drive

  • Connect the USB hub, keyboard, mouse and the monitor to the Up Squared.

Figure 3: Up Squared board

  • Insert the Live USB Ubuntu Desktop flash drive you created earlier in to the Up Squared board.
  • Select "Try Ubuntu without installing".

Figure 4: Try Ubuntu without installing

Install Ubuntu* Core Image on the Up Squared

  • Insert the second USB flash drive containing the Ubuntu Core image file into the Up Squared board.
  • Check for directories mounted on the internal eMMC storage and unmount any that are found. Open a terminal and type:
mount | grep mmcblk
umount /media/ubuntu/writable
  • Check for the name of the drives of the Up Squared:
sudo fdisk -l
  • Assuming /dev/sda is the second USB flash drive containing the Ubuntu Core image, mount it at /media/usb1:
sudo mkdir /media/usb1
sudo mount /dev/sda /media/usb1
  • Decompress the Ubuntu Core image and flash it into the Up Squared internal memory:
xzcat /media/usb1/ubuntu-core-16-amd64.img.xz | sudo dd of=/dev/mmcblk0 bs=32M status=progress; sync

Figure 5: Flash Ubuntu Core

  • Remove the Live USB Ubuntu Desktop flash drive and reboot the Up Squared board. The Up Squared will reboot from the internal memory where the Ubuntu Core has been flashed.

Configure the UP Squared* Board

After the Up Squared has rebooted, you will see a prompt below.

Figure 6: Ubuntu Core Configuration
  • Select Start to configure your network.

Figure 7: Configure wlan0
  • Select wlan0, then select Configure WIFI settings.

Figure 8: Configure WIFI
  • Enter Network name and Password, then select Done.

Figure 9: Network configuration
  • Highlight Done and press Enter
  • Highlight Done and press Enter again.

Figure 10: Network connections configuration complete

  • Now, DHCPv4 is enabled for wlan0, select Done.
  • Enter the Ubuntu One email address that was set up earlier, select Done then Enter.

Figure 11: Profile setup

  • Once the configuration is complete, the Ubuntu SSO username and the Up Squared localhost IP address will be displayed on the screen. Use this IP address to log in from a different Ubuntu machine later (see Figure 14).
Figure 12: Configuration complete
  • Select Finish, then press Enter. The Ubuntu Core login prompt will appear as follows:
Figure 13: Ubuntu Core login from Up Squared board

First User login on a Different Ubuntu* Machine

  • First, add RSA identities to the authentication agent by running ssh-add on the shell.
  • Next, log in with SSH to the Ubuntu Core from a different Ubuntu machine on the same network. The user name is your Ubuntu SSO username; a password is not required.
ssh <your Ubuntu SSO username>@<Ubuntu Core IP address>

Figure 14: ssh into Ubuntu Core from a different Ubuntu machine

  • Set a password in case you want to log in from the local console on the Up Squared board. On the other Ubuntu machine's console, type:
sudo passwd <your Ubuntu SSO username>

Figure 15: Set localhost password
Now, use your Ubuntu SSO username and the password just set in Figure 15 to log in to the Up Squared board from its local console:
Figure 16:localhost login

Run Hello World Snap on localhost

Now the Up Squared is ready for snaps. Snaps are self-contained application bundles that contain most of the libraries and runtimes they need; a snap is a SquashFS filesystem containing your app code and a snap.yaml file. The commands behind the figures below are summarized after Figure 23.
  • Sign in to a Snap store using an Ubuntu SSO account:
Figure 17: Sign in to a snap store from localhost
  • Install the Hello Snap package using the snap name:

Figure 18: Install hello snap

  • Run the Hello Snap:

Figure 19: Run hello snap
  • List all snaps:

Figure 20: List all snaps

Refresh the Hello snap:

Figure 21: Refresh hello snap
Refresh all snaps:

Figure 22: Refresh all snaps
Remove the Hello snap:

Figure 23: Remove Hello snap
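For reference, the commands used in Figures 17 through 23 correspond roughly to the following standard snap CLI calls (the snap here is simply named hello, and <your Ubuntu SSO email> is a placeholder):

sudo snap login <your Ubuntu SSO email>   # sign in to the snap store
sudo snap install hello                   # install the Hello snap
hello                                     # run the Hello snap
snap list                                 # list all installed snaps
sudo snap refresh hello                   # refresh the Hello snap
sudo snap refresh                         # refresh all snaps
sudo snap remove hello                    # remove the Hello snap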

Summary

We have described how to install Ubuntu* Core on the Up Squared board and run Hello snap on Ubuntu Core. Visit https://uappexplorer.com/snaps for the list of other available snaps.

Key References

About the Author

Nancy Le is a software engineer at Intel Corporation in the Core & Visual Computing Group working on Intel Atom® processor enabling for Intel® IoT projects.

 

 

Custom Layers Support in Inference Engine

$
0
0

Deep Learning Inference Engine Workflow

The Deep Learning Inference Engine is a part of Intel® Deep Learning Deployment Toolkit (Intel® DL Deployment Toolkit) and OpenVINO™ toolkit. It facilitates deployment of deep learning solutions by delivering a unified, device-agnostic inference API.

For more information, refer to the Inference Engine Developer Guide.

The Deep Learning Inference Engine workflow involves the creation of custom kernels and either custom or existing layers.

A layer is defined as a convolutional neural network (CNN) building block implemented in the training framework (for example, Convolution in Caffe*). A kernel is defined as the corresponding implementation in the Inference Engine. This tutorial is aimed at advanced users of the Inference Engine. It allows users to provide their own kernels for existing or completely new layers.

Network training is typically done in high-end data centers, using popular training frameworks like Caffe or TensorFlow*. Scoring (or inference), on the other hand, can take place on embedded, low-power platforms. The limitations of these target platforms make deployment of the final solution very challenging, both with respect to data types and API support. The Model Optimizer tool enables automatic and seamless transition from the training environment to the deployment environment.

Below is an example Caffe workflow (the TensorFlow steps are the same). The Model Optimizer converts the original Caffe proprietary format to an Intermediate Representation (IR) file that describes the topology, accompanied by a binary file with weights. These files are consumed by the Inference Engine and used for scoring.
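
As a rough illustration of this conversion step, running the Model Optimizer on a Caffe model typically looks like the following sketch; the model.caffemodel and deploy.prototxt file names are placeholders, and the exact script name and flags may differ between releases:

cd <MODEL_OPTIMIZER_DIR>
# convert the trained Caffe model into the IR (an .xml topology plus a .bin weights file)
python3 mo.py --input_model model.caffemodel --input_proto deploy.prototxt --output_dir ir_output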

Example Caffe* Workflow

Note: To work with Caffe, the Model Optimizer requires Caffe recompilation with the special interface wrappers (see the Model Optimizer Developer Guide for details).

The process of conversion from the supported frameworks to the Inference Engine formats is automatic for topologies with the standard layers that are known to the Model Optimizer tool (see Using the Model Optimizer to Convert TensorFlow* Models or Using the Model Optimizer to Convert Caffe* Models).

This tutorial explains the flow and provides examples for the non-standard (or custom) layers.

Inference Engine and the Model Optimizer are provided as parts of the Intel DL Deployment Toolkit and OpenVINO toolkit. The components are the same in both toolkits, but the paths are slightly different:

  • In the Intel DL Deployment Toolkit:
    • <DL_TOOLKIT_ROOT_DIR>/deployment_tools/model_optimizer
    • <DL_TOOLKIT_ROOT_DIR>/deployment_tools/inference_engine
  • In the OpenVINO toolkit:
    • <OPENVINO_ROOT_DIR>/model_optimizer
    • <OPENVINO_ROOT_DIR>/inference_engine

Custom Layers Workflow

The Inference Engine has a notion of plugins (device-specific libraries to perform hardware-assisted inference acceleration). Before creating any custom layer with the Inference Engine, you need to consider the target device. The Inference Engine supports only CPU and GPU custom kernels. It is usually easier to begin with the CPU extension, debug with the CPU path, and then switch to the GPU.

For performance implications and estimations, see Performance Implications and Estimating Performance Without Implementing or Debugging a Kernel.

When creating new kernels in the Inference Engine, you must connect custom layers to these kernels as follows:

  1. Register the custom layer for the Model Optimizer tool. This step, which is device agnostic, is required to generate a correct Intermediate Representation (IR) file with the custom layers.
  2. Implement the kernel in OpenCL™ (if you target GPU acceleration) or C++ (for the general CPU codepath) so that it can be plugged into the Inference Engine.
  3. Register the kernel in the Inference Engine, so that each time it encounters a layer of the specific type in the IR, it inserts a call to the custom kernel into the internal graph of computations. The registration process also defines the connection between the Caffe parameters of the layer and the kernel inputs.

The rest of this document explains these steps in detail.

Note: The Inference Engine moved to the concept of core primitives implemented by the plugins and extensions that come in the source code. For the CPU device, it allows re-compilation for the target platform with SSE, AVX2, and similar codepaths. The CPU extensions, which you can modify or use as a starting point, are located in the <INSTALL_DIR>/deployment_tools/samples/extension directory.

Registering Custom Layers for the Model Optimizer

The main purpose of registering a custom layer within the Model Optimizer is to define the shape inference (how the output shape size is calculated from the input size). Once the shape inference is defined, the Model Optimizer does not need to call the specific training framework again.

For information on registering custom layers, see Custom Layers in the Model Optimizer.

Note: For Caffe, there is a legacy option to use the training framework as a fallback for shape inference. Custom layers can be registered in <MODEL_OPTIMIZER_DIR>/bin/CustomLayersMapping.xml, and the tool will call Caffe directly to get information on the output shapes.

Although the legacy option is much simpler than the primary registration process, it has a limitation related to shapes that dynamically depend on the input data. Therefore, we strongly encourage you to use the general custom layer registration mechanism via Python* classes for the Model Optimizer.

Performance Considerations for Custom Kernels and Custom Layers

Creating custom kernels and custom layers can create performance issues in certain conditions, so it is important to keep in mind the implications of specific development decisions and to estimate how these development decisions might affect performance.

Performance Implications

  • Overriding Existing Layers.

    Custom kernels are used to quickly implement missing layers for cutting-edge topologies. For that reason, it is not advised to override layers on the critical path (for example, Convolutions). Also, overriding existing layers can disable some existing performance optimizations such as fusing.

  • Post-processing Custom Layers.

    When the custom layers are at the very end of your pipeline, it is easier to implement them as regular post-processing in your application without wrapping them as kernels. This is particularly true for kernels that do not fit the GPU well, for example, (output) bounding box sorting. In many cases, you can do such post-processing on the CPU.

  • Blocked Layout.

    If the performance of the CPU extension is of concern, consider an implementation that supports the blocked layout (which the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) uses internally), which would eliminate (automatic) Reorders before and after your kernel. For an example of blocked layout support, refer to the PReLU extension example in <INSTALL_DIR>/deployment_tools/samples/extension/ext_prelu.cpp.

Estimating Performance without Implementing or Debugging a Kernel

In most cases, before actually implementing a full-blown code for the kernel, you can estimate the performance by creating a stub kernel that does nothing and is “infinitely” fast to let the topology execute end-to-end. The estimation is valid only if the kernel output does not affect the performance (for example, if its output is not driving any branches or loops).

CPU Kernels

Interface

Since the primary vehicle for the performance of the CPU codepath in the Inference Engine is Intel MKL-DNN, new CPU kernels extend the Inference Engine plugin for Intel MKL-DNN. Implementing InferenceEngine::ILayerImplFactory defines a general “CPU-side” extension, so there are no Intel MKL-DNN specifics in the way you implement a kernel.

Let’s consider a simple MyCustomLayerFactory class that registers an example kernel that multiplies its input data by two and does not change the dimensions:

  1. Create a constructor, a virtual destructor, and a data member to keep the layer info:
    // my_custom_layer.h
    class MyCustomLayerFactory: public InferenceEngine::ILayerImplFactory {
    public:
        explicit MyCustomLayerFactory(const CNNLayer *layer): cnnLayer(*layer) {}
    private:
        CNNLayer cnnLayer;
    };
    
  2. Overload and implement the abstract methods (getShapes, getImplementations) of the InferenceEngine::ILayerImplFactory class:
    StatusCode MyCustomLayerFactory::getShapes(const std::vector<TensorDesc>& inShapes, std::vector<TensorDesc>& outShapes, ResponseDesc *resp) noexcept {
        // cnnLayer is stored by value, so no null check is needed here
        outShapes.clear();
        // the kernel accepts a single tensor only
        if (inShapes.size() != 1)
            return GENERAL_ERROR;
        // the output tensor's shape is the same (the kernel doesn't change it)
        outShapes.emplace_back(inShapes[0]);
        return OK;
    }
    StatusCode MyCustomLayerFactory::getImplementations(std::vector<ILayerImpl::Ptr>& impls, ResponseDesc *resp) noexcept {
        // the method reports a single (CPU) implementation of the kernel;
        // in theory, you can report multiple implementations here
        // (e.g. depending on the layer parameters, available via the cnnLayer instance)
        impls.push_back(ILayerImpl::Ptr(new MyCustomLayerImpl(&cnnLayer)));
        return OK;
    }
    
  3. Introduce an actual kernel as MyCustomLayerImpl class, inherited from the abstract InferenceEngine::ILayerExecImpl class:
    // my_custom_layer.h
    class MyCustomLayerImpl: public ILayerExecImpl {
    public:
        explicit MyCustomLayerImpl(const CNNLayer *layer): cnnLayer(*layer) {}
        StatusCode getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept override;
        StatusCode init(LayerConfig& config, ResponseDesc *resp) noexcept override;
        StatusCode execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept override;
    private:
        CNNLayer cnnLayer;
    };
    
  4. Implement the virtual methods for your kernel class. First of all, implement getSupportedConfigurations, which returns all supported formats (input/output tensor layouts) for your implementation:
    // my_custom_layer.cpp
    StatusCode MyCustomLayerImpl::getSupportedConfigurations(std::vector<LayerConfig>& conf, ResponseDesc *resp) noexcept {
        try {
            if (cnnLayer.insData.size() != 1 || cnnLayer.outData.empty())
                THROW_IE_EXCEPTION << "Incorrect number of input/output edges!";
            DataPtr dataPtr = cnnLayer.insData[0].lock();
            if (!dataPtr)
                THROW_IE_EXCEPTION << "Cannot get input data!";
            DataConfig dataConfig;
            // this layer can process data in-place, but it is not constant
            dataConfig.inPlace = -1;
            dataConfig.constant = false;
            SizeVector order;
            // the order of dimensions is default (unlike Permute and similar kernels)
            for (size_t i = 0; i < dataPtr->getTensorDesc().getDims().size(); i++) {
                order.push_back(i);
            }
            // combine the info into the TensorDesc
            dataConfig.desc = TensorDesc(
                dataPtr->getTensorDesc().getPrecision(),
                dataPtr->getTensorDesc().getDims(),
                {dataPtr->getTensorDesc().getDims(), order} /*BlockingDesc*/
            );
            // NCHW is default, so this call can be omitted, but see the comment at the end of the section
            dataConfig.desc.SetLayout(MemoryFormat::NCHW);
            LayerConfig config;
            // finally, add the expected input config to the kernel config
            config.inConfs.push_back(dataConfig);
            // pretty much the same for the (single) output that the kernel will produce
            DataConfig outConfig;
            outConfig.constant = false;
            outConfig.inPlace = 0;
            order.clear();
            for (size_t i = 0; i < cnnLayer.outData[0]->getTensorDesc().getDims().size(); i++) {
                order.push_back(i);
            }
            // NCHW is default, so we use the TensorDesc constructor that omits the layout
            outConfig.desc = TensorDesc(
                cnnLayer.outData[0]->getTensorDesc().getPrecision(),
                cnnLayer.outData[0]->getDims(),
                {cnnLayer.outData[0]->getDims(), order}
            );
            // add the output config to the layer/kernel config
            config.outConfs.push_back(outConfig);
            // no dynamic batch support
            config.dynBatchSupport = 0;
            // finally, "publish" the single configuration that we are going to support
            conf.push_back(config);
            return OK;
        } catch (InferenceEngine::details::InferenceEngineException& ex) {
            std::string errorMsg = ex.what();
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
    }

    Note: Carefully select the formats to support, as the framework might insert potentially costly reorders - special calls to reshape the data to meet the kernel's requirements. Many streaming kernels (for example, those that apply some arithmetic to every element of the input, like ReLU) are actually agnostic to the layout, so you should specify InferenceEngine::MKLDNNPlugin::MemoryFormat::any for them. Similarly, kernels that do not follow the traditional tensor semantics (of batches or features), but simply store values in tensors, can also use MemoryFormat::any.

    Finally, if performance is of concern, consider an implementation that supports the blocked layout (which the Intel MKL-DNN uses internally), which would eliminate reorders before and after your kernel. For an example of blocked layout support, see the PReLU extension example in the following directory: <INSTALL_DIR>/deployment_tools/samples/extension/ext_prelu.cpp.

  5. Implement the init method to get a runtime-selected configuration from the vector populated in the previous step and check the parameters:
    // my_custom_layer.cpp
    StatusCode MyCustomLayerImpl::init(LayerConfig& config, ResponseDesc *resp) noexcept {
        StatusCode rc = OK;
        if (config.dynBatchSupport) {
            config.dynBatchSupport = 0;
            rc = NOT_IMPLEMENTED;
        }
        for (auto& input : config.inConfs) {
            if (input.inPlace >= 0) {
                input.inPlace = -1;
                rc = NOT_IMPLEMENTED;
            }
            for (auto& offset : input.desc.getBlockingDesc().getOffsetPaddingToData()) {
                if (offset) // our simplified impl doesn't support data offsets
                    return GENERAL_ERROR;
            }
            if (input.desc.getBlockingDesc().getOffsetPadding())
                return GENERAL_ERROR; // our simplified impl doesn't support padding

            for (size_t i = 0; i < input.desc.getBlockingDesc().getOrder().size(); i++) {
                if (input.desc.getBlockingDesc().getOrder()[i] != i) {
                    // only the regular dims order is accepted
                    // (plus the channel-blocked 5th dimension)
                    if (i != 4 || input.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;
                }
            }
        }

        // pretty much the same checks for the output
        for (auto& output : config.outConfs) {
            if (output.inPlace < 0)
                // the output is expected to be configured as in-place
                return GENERAL_ERROR;
            for (auto& offset : output.desc.getBlockingDesc().getOffsetPaddingToData()) {
                if (offset)
                    return GENERAL_ERROR;
            }
            if (output.desc.getBlockingDesc().getOffsetPadding())
                return GENERAL_ERROR;

            for (size_t i = 0; i < output.desc.getBlockingDesc().getOrder().size(); i++) {
                if (output.desc.getBlockingDesc().getOrder()[i] != i) {
                    if (i != 4 || output.desc.getBlockingDesc().getOrder()[i] != 1)
                        return GENERAL_ERROR;
                }
            }
        }
        return rc;
    }
    
  6. Implement the execute method, which accepts and processes the actual tensors as input/output blobs:
    // my_custom_layer.cpp
    StatusCode MyCustomLayerImpl::execute(std::vector<Blob::Ptr>& inputs, std::vector<Blob::Ptr>& outputs, ResponseDesc *resp) noexcept {
        if (inputs.size() != 1 || outputs.empty()) {
            std::string errorMsg = "Incorrect number of input or output edges!";
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return GENERAL_ERROR;
        }
        const float* src_data = inputs[0]->buffer();
        float* dst_data = outputs[0]->buffer();
        // iterate over the elements of the (single) output blob
        for (size_t o = 0; o < outputs[0]->size(); o++) {
            dst_data[o] = src_data[o] * 2; // the kernel just multiplies the input by two
        }
        return OK;
    }
    

Packing the Kernels into a Shared Library

All custom kernels are packed into a shared library. The library should internally implement the InferenceEngine::IExtension interface, which defines the functions that you need to implement:

// my_custom_extension.h
class MyCustomExtentionLib: public InferenceEngine::IExtension {
public:
    // cleanup resources, in our case does nothing
    void Unload() noexcept override {
    }
    // releases the extension object itself
    void Release() noexcept override {
        delete this;
    }
    // logging, in our case does nothing
    void SetLogCallback(InferenceEngine::IErrorListener &listener) noexcept override {}
    // returns version info
    void GetVersion(const InferenceEngine::Version *& versionInfo) const noexcept override {
        static InferenceEngine::Version ExtensionDescription = {
            {1, 0},                  // extension API version
            "1.0",
            "MyCustomExtentionLib"   // extension description message
        };
        versionInfo = &ExtensionDescription;
    }
    // returns the list of supported kernels/layers
    StatusCode getPrimitiveTypes(char**& types, unsigned int& size, ResponseDesc* resp) noexcept override {
        std::string type_name = "MyCustomLayer";
        types = new char *[1];
        size = 1;
        types[0] = new char[type_name.size() + 1];
        std::copy(type_name.begin(), type_name.end(), types[0]);
        types[0][type_name.size()] = '\0';
        return OK;
    }
    // main function: returns the factory for the custom layer
    StatusCode getFactoryFor(ILayerImplFactory *&factory, const CNNLayer *cnnLayer, ResponseDesc *resp) noexcept override {
        if (cnnLayer->type != "MyCustomLayer") {
            std::string errorMsg = std::string("Factory for ") + cnnLayer->type + " wasn't found!";
            errorMsg.copy(resp->msg, sizeof(resp->msg) - 1);
            return NOT_FOUND;
        }
        factory = new MyCustomLayerFactory(cnnLayer);
        return OK;
    }
};

Loading the Shared Library

Before loading a network with the plugin, you must load the library with kernels to avoid errors on the unknown layer types:

// Load the Intel MKL-DNN plugin, refer to the samples for more examples
InferenceEngine::InferenceEnginePluginPtr plugin_ptr(selectPlugin(…, "CPU"));
InferencePlugin plugin(plugin_ptr);
// Load the CPU (MKL-DNN) extension as a shared library
auto extension_ptr =
    make_so_pointer<InferenceEngine::IExtension>("<shared lib path>");
// Add the extension to the plugin list
plugin.AddExtension(extension_ptr);

For code examples, see the Inference Engine Samples. All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option to load custom kernels. Use the following command line to execute a sample with custom CPU kernels:

$ ./classification_sample -i <path_to_image>/inputImage.bmp -m <path_to_model>/CustomAlexNet.xml -d CPU 
-l <absolute_path_to_library>/libmy_sample_extension.so 

GPU Kernels

General Workflow

Unlike CPU custom kernels, the GPU codepath abstracts many details of OpenCL. You do not need to use host-side OpenCL APIs. You only need to provide a configuration file and one or more kernel source files. See Example Configuration for examples of configuration and kernel files.

There are two options for using a custom layer configuration file within the Inference Engine:

  • Include a section with your kernels in the global, automatically loaded cldnn_global_custom_kernels/cldnn_global_custom_kernels.xml file (hosted in the <INSTALL_DIR>/deployment_tools/inference_engine/bin/intel64/{Debug/Release} folder)
  • Call the IInferencePlugin::SetConfig() method from the user application with the PluginConfigParams::KEY_CONFIG_FILE key and the configuration file name as the value before loading the network that uses custom layers to the plugin:
    // Load the GPU plugin, refer to the samples for more examples
    InferenceEngine::InferenceEnginePluginPtr plugin_ptr(selectPlugin(…, "GPU"));
    InferencePlugin plugin(plugin_ptr);
    // Load GPU Extensions
    plugin.SetConfig({{PluginConfigParams::KEY_CONFIG_FILE, "<path to the xml file>"}});
    …
    

All Inference Engine samples (except the trivial hello_classification) feature a dedicated command-line option, -c, to load custom kernels, as follows:

$ ./classification_sample -m ./models/alexnet/bvlc_alexnet_fp16.xml -i ./validation_set/daily/227x227/apron.bmp -d GPU
 -c <absolute_path_to_config>/custom_layer_example.xml

Configuration File Format

The configuration file is expected to follow the .xml file structure with a node of type CustomLayer for every custom layer provided by the user.

The following definitions will use the notations:

  • (0/1) Can have 0 or 1 instances of this node/attribute
  • (1) Must have 1 instance of this node/attribute
  • (0+) Can have any number of instances of this node/attribute
  • (1+) Can have 1 or more instances of this node/attribute
CustomLayer Node and Sub-node Structure

CustomLayer node contains the entire configuration for a single custom layer.

Attribute Name    #        Description
name              (1)      The name of the layer type to be used. This name should be identical to the type used in the IR.
type              (1)      Must be SimpleGPU
version           (1)      Must be 1

Sub-nodes: Kernel (1), Buffers (1), CompilerOptions (0+), WorkSizes (0/1)

Kernel Node and Sub-node Structure

The Kernel node contains all of the kernel source code configuration.

Sub-nodes: Source (1+), Define (0+)

Source Node and Sub-node Structure

Source node points to a single OpenCL source file.

Attribute Name    #        Description
filename          (1)      Name of the file containing OpenCL source code. Notice that path is relative to your executable.

Multiple Source nodes will have their sources concatenated in order.

Sub-nodes: None

Define Node and Sub-node Structure

Define node configures a single #define instruction to be added to the sources during compilation (JIT).

Attribute Name    #        Description
name              (1)      The name of the defined JIT. For static constants, this can include the value as well (taken as a string).
param             (0/1)    Name of one of the layer parameters in the IR. This parameter value will be used as the value of this JIT definition.
type              (0/1)    The parameter type. Accepted values: int, float, and int[], float[] for arrays.
default           (0/1)    The default value to be used if the specified parameter is missing from the layer in the IR.

Sub-nodes: None

The resulting JIT will be of the form: #define [name] [type] [value/default].

Buffers Node and Sub-node Structure

The Buffers node configures all input/output buffers for the OpenCL entry function; it has no attributes of its own.

Sub-nodes: Data (0+), Tensor (1+)

Data Node and Sub-node Structure

Data node configures a single input with static data (for example, weight or biases).

Attribute Name    #        Description
name              (1)      Name of a blob attached to this layer in the IR
arg-index         (1)      0-based index in the entry function arguments to be bound to

Sub-nodes: None

Tensor Node and Sub-node Structure

Tensor node configures a single input or output tensor.

Attribute Name    #        Description
arg-index         (1)      0-based index in the entry function arguments to be bound to
type              (1)      input or output
port-index        (1)      0-based index in the layer's input/output ports in the IR
format            (0/1)    Data layout declaration for the tensor. Accepted values: BFYX, BYXF, YXFB, FYXB (also in all lowercase). Default value: BFYX

CompilerOptions Node and Sub-node Structure

CompilerOptions node configures the compilation flags for the OpenCL sources.

Attribute Name    #        Description
options           (1)      Options string to be passed to the OpenCL compiler

Sub-nodes: None

WorkSizes Node and Sub-node Structure

WorkSizes node configures the global/local work sizes to be used when queuing the OpenCL program for execution.

Attribute Name    #        Description
global            (0/1)    An array of up to 3 integers (or formulas) for defining the OpenCL global work-sizes to be used during execution. The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,*,% (all evaluated in integer arithmetic). Default value: global="B*F*Y*X" local=""
local             (0/1)    An array of up to 3 integers (or formulas) for defining the OpenCL local work-sizes, in the same format as global.

Sub-nodes: None

Example Configuration file

The following code sample provides an example configuration file (in .xml format). For information on configuration file structure, see Configuration File Format.

<!-- the config file introduces a custom "ReLU" layer-->
<CustomLayer name="ReLU" type="SimpleGPU" version="1">
  <!-- the corresponding custom kernel is "example_relu_kernel" from the specified .cl file-->
  <Kernel entry="example_relu_kernel">
    <Source filename="custom_layer_kernel.cl"/>
    <!-- the only ReLU specific parameter (for "leaky" one)-->
    <Define name="neg_slope" type="float" param="negative_slope" default="0.0"/>
  </Kernel>
  <!-- inputs and outputs of the kernel-->
  <Buffers>
    <Tensor arg-index="0" type="input" port-index="0" format="BFYX"/>
    <Tensor arg-index="1" type="output" port-index="0" format="BFYX"/>
  </Buffers>
  <!-- OpenCL compiler options-->
  <CompilerOptions options="-cl-mad-enable"/>
  <!-- define the global worksize. The formulas can use the values of the B,F,Y,X dimensions and contain the operators: +,-,/,*,% (all evaluated in integer arithmetic)
Default value: global="B*F*Y*X,1,1"-->
  <WorkSizes global="X,Y,B*F"/>
</CustomLayer>

Built-In Defines for Custom Layers

The following table includes definitions that will be attached before the user sources, where <TENSOR> is the actual input or output (for example, INPUT0 or OUTPUT0).

For an example, see Example Kernel.

Name                              Value
NUM_INPUTS                        Number of the input tensors bound to this kernel
GLOBAL_WORKSIZE                   An array of global work sizes used to execute this kernel
GLOBAL_WORKSIZE_SIZE              The size of the GLOBAL_WORKSIZE array
LOCAL_WORKSIZE                    An array of local work sizes used to execute this kernel
LOCAL_WORKSIZE_SIZE               The size of the LOCAL_WORKSIZE array
<TENSOR>_DIMS                     An array of the tensor dimension sizes. Always ordered as BFYX
<TENSOR>_DIMS_SIZE                The size of the <TENSOR>_DIMS array
<TENSOR>_TYPE                     The data-type of the tensor: float, half or char
<TENSOR>_FORMAT_                  The format of the tensor: BFYX, BYXF, YXFB, FYXB or ANY. The format will be concatenated to the defined name. You can use the tensor format to define codepaths in your code with #ifdef/#endif
<TENSOR>_LOWER_PADDING            An array of padding elements used for the tensor dimensions before they start. Always ordered as BFYX
<TENSOR>_LOWER_PADDING_SIZE       The size of the <TENSOR>_LOWER_PADDING array
<TENSOR>_UPPER_PADDING            An array of padding elements used for the tensor dimensions after they end. Always ordered as BFYX
<TENSOR>_UPPER_PADDING_SIZE       The size of the <TENSOR>_UPPER_PADDING array
<TENSOR>_PITCHES                  The number of elements between adjacent elements in each dimension. Always ordered as BFYX
<TENSOR>_PITCHES_SIZE             The size of the <TENSOR>_PITCHES array
<TENSOR>_OFFSET                   The number of elements from the start of the tensor to the first valid element (bypassing the lower padding)

All <TENSOR> values will be automatically defined for every tensor bound to this layer (INPUT0, INPUT1, OUTPUT0, and so on), as shown in the following example:

#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, } 

Refer to the Appendix A: Resulting OpenCL™ Kernel for more examples.

Example Kernel

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    const uint idx  = get_global_id(0);
    const uint idy  = get_global_id(1);
    const uint idbf = get_global_id(2);//batches*features, as OpenCL supports 3D nd-ranges only
    const uint feature = idbf%OUTPUT0_DIMS[1];
    const uint batch   = idbf/OUTPUT0_DIMS[1];
    //notice that pitches are in elements, not in bytes!
    const uint in_id  = batch*INPUT0_PITCHES[0] + feature*INPUT0_PITCHES[1]   + idy*INPUT0_PITCHES[2]  + idx*INPUT0_PITCHES[3]  + INPUT0_OFFSET;
    const uint out_id = batch*OUTPUT0_PITCHES[0] + feature*OUTPUT0_PITCHES[1]  + idy*OUTPUT0_PITCHES[2]  + idx*OUTPUT0_PITCHES[3]  + OUTPUT0_OFFSET;
   
    INPUT0_TYPE value = input0[in_id];
    //neg_slope (which is non-zero for leaky ReLU) is put automatically as #define, refer to the config xml
    output[out_id] = value < 0 ? value * neg_slope : value;
}

Note: As described in the previous section, all of the definitions like INPUT0_TYPE are actually passed as OpenCL (pre-)compiler inputs by the Inference Engine for efficiency reasons. See Debugging Tips for information on debugging the results.

Debugging Tips

Dumping the Resulting Kernels

Compared to the CPU-targeted code, debugging the GPU kernels is less straightforward.

First of all, it is recommended to get a dump of the kernel with all of the values set by the Inference Engine (all of the tensor sizes and the floating-point and integer kernel parameters). To get the dump, add the following line to your code to configure the GPU plugin to output the custom kernels:

plugin.SetConfig({{ PluginConfigParams::KEY_DUMP_KERNELS, PluginConfigParams::YES }});

When the Inference Engine compiles the kernels for the specific network, it also outputs the resulting code for the custom kernels. In the directory of your executable, you will find files like clDNN_program0.cl and clDNN_program1.cl. There are as many files as there are distinct sets of parameters for your custom kernel (different input tensor sizes and kernel parameters). See Appendix A: Resulting OpenCL™ Kernel for an example of a dumped kernel.
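
For example, from the executable's working directory you can list and open the dumped sources (the exact file names depend on your network):

ls clDNN_program*.cl
less clDNN_program0.cl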

Using printf in the OpenCL™ Kernels

To debug specific values, you can use printf in your kernels. However, be careful not to output too much data. Since the printf output is buffered, your output can be truncated to fit the buffer, and you actually get the entire buffer of output only when the execution ends.

For more information, refer to printf Function.

Appendix A: Resulting OpenCL™ Kernel

This is an example of the code produced by the Inference Engine that Compute Library for Deep Neural Networks (clDNN) generates internally for the specific layer, when all the params (like neg_slope value for the ReLU) and tensor dimensions are known.

Essentially, this is the original user code (see Example Kernel) plus a set of #define values added by clDNN. Notice that the layer name is also reported (relu1):

// Custom Layer Built-ins
#define NUM_INPUTS 1
#define GLOBAL_WORKSIZE_SIZE 3
#define GLOBAL_WORKSIZE (size_t []){ 55,55,96, } 
#define LOCAL_WORKSIZE_SIZE 0
#define LOCAL_WORKSIZE (size_t []){  } 
#define INPUT0_DIMS_SIZE 4
#define INPUT0_DIMS (int []){ 1,96,55,55, } 
#define INPUT0_TYPE float
#define INPUT0_FORMAT_BFYX 
#define INPUT0_LOWER_PADDING_SIZE 4
#define INPUT0_LOWER_PADDING (int []){ 0,0,0,0, } 
#define INPUT0_UPPER_PADDING_SIZE 4
#define INPUT0_UPPER_PADDING (int []){ 0,0,0,0, } 
#define INPUT0_PITCHES_SIZE 4
#define INPUT0_PITCHES (int []){ 290400,3025,55,1, } 
#define INPUT0_OFFSET 0
#define OUTPUT0_DIMS_SIZE 4
#define OUTPUT0_DIMS (int []){ 1,96,55,55, } 
#define OUTPUT0_TYPE float
#define OUTPUT0_FORMAT_BFYX 
#define OUTPUT0_LOWER_PADDING_SIZE 4
#define OUTPUT0_LOWER_PADDING (int []){ 0,0,0,0, } 
#define OUTPUT0_UPPER_PADDING_SIZE 4
#define OUTPUT0_UPPER_PADDING (int []){ 0,0,0,0, } 
#define OUTPUT0_PITCHES_SIZE 4
#define OUTPUT0_PITCHES (int []){ 290400,3025,55,1, } 
#define OUTPUT0_OFFSET 0

// Layer relu1 using Custom Layer ReLU
// Custom Layer User Defines
#define neg_slope 0.0
// Custom Layer Kernel custom_layer_kernel.cl
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void example_relu_kernel(
    const __global INPUT0_TYPE*  input0,
          __global OUTPUT0_TYPE* output)
{
    const uint idx  = get_global_id(0);
    const uint idy  = get_global_id(1);
    const uint idbf = get_global_id(2);//batches*features, as OpenCL supports 3D nd-ranges only
    const uint feature = idbf%OUTPUT0_DIMS[1];
    const uint batch   = idbf/OUTPUT0_DIMS[1];

    //notice that pitches are in elements, not in bytes!
    const uint in_id  = batch*INPUT0_PITCHES[0]   + feature*INPUT0_PITCHES[1]   + idy*INPUT0_PITCHES[2]  + idx*INPUT0_PITCHES[3]  + INPUT0_OFFSET;
    const uint out_id = batch*OUTPUT0_PITCHES[0]  + feature*OUTPUT0_PITCHES[1]  + idy*OUTPUT0_PITCHES[2]  + idx*OUTPUT0_PITCHES[3]  + OUTPUT0_OFFSET;
   
    INPUT0_TYPE value = input0[in_id];
    //neg_slope (which is non-zero for leaky ReLU) is put automatically as #define by the clDNN, refer to the xml
    output[out_id] = value < 0 ? value * neg_slope : value;
}

The Tools for Production IoT Development

$
0
0

Tools for Production for IoT

There’s no way around it: IoT development requires a broad set of skills and expertise to be successful. You need working knowledge of hardware, software, application development, analytics, use cases, vertical markets and, as much as anything, the right tools. The core set of tools you select can mean the difference between a quick, relatively smooth development process that leads to successful commercial production and one that consumes valuable development time because the setup is too challenging.

 

Optimizing Hardware Selection 

Ideally, tool selection begins with finding the right architecture. That’s the basis for gathering all the necessary components required to successfully prototype and deploy a commercial IoT solution, evolving your solution’s capabilities, and futureproofing your solution with new enhancements.

Consider the importance of your product’s lifespan. Tools must be capable of taking designs and making them extensible—so today’s IoT Product 1.0 can evolve to accommodate next-generation AI, security, connectivity, scalability, and other future attributes. Also, tools must be purpose-built and use-case-driven in order to speed development and minimize tinkering and trial and error. So, for example, if you want to develop an application for traffic management that counts vehicles and analyzes license plates, a 6th Generation Intel® Core™ processor, such as what is found in the iEi* Tank AIoT development kit, is an ideal choice. This kit scales to support complex and parallel video streams through CPU and GPU hardware acceleration. 

Included with the iEi* Tank AIoT development kit is the Intel® OpenVINO™ toolkit, an SDK designed to help developers build high-performance computer vision applications and integrate deep learning inference. For flexibility, developers can go to production and optimize performance using the Intel® System Studio tool suite, or prototype with the cloud-based Arduino* Create IDE. This software is included on a pre-installed custom Ubuntu* Desktop OS configured to allow for computer vision development out of the box. Lastly, the Intel® Media SDK is also part of the package; it exposes the media acceleration capabilities of Intel platforms for decoding, encoding, video and photo processing, and capturing screen content. The Intel® Media SDK's single API enables hardware acceleration for fast video transcoding, image processing, and media workflows.

All in all, an Intel® Core™ processor is a great choice for applications requiring parallel workloads. A solid choice for early prototyping is the UP* Squared Grove IoT development kit from Aaeon*, which is ideal for easy setup and quick concept turnaround, especially when used with the cloud-based Arduino Create* tool designed to support Intel-based platforms. The Arduino Create* tool integrates libraries and SDKs, making them available at your fingertips.

IoT Developer Solutions

Intel offers the tools and technology that provide a clear path to commercial production of innovative IoT solutions.

Get Started, Develop Efficiently, Then Scale

The business challenges commercial IoT developers face will always vary, which is why the development tools at their disposal must be flexible and wide-ranging. And that’s why Intel offers end-to-end, prototype-to-production solutions with variable options at the SDK level, IDE level, processor level, and cloud connection options to support this need. Again, it’s all about offering the right options to make the right choices with precision. No matter which tools and technology you select from the vast set of Intel tools, you can be assured you will get the essential building blocks you need to gain a clear path to the finished product you envision. 

Enter the Intel Developer Zone for IoT

Intel is committed to helping commercial IoT developers simplify and accelerate the development process every step of the way. That commitment is evident in the breadth of what we have to offer, the size and diversity of our edge-to-cloud ecosystem, and the production IoT solutions that developers are bringing to market using Intel tools and technology. 

With the Intel® Developer Zone for IoT as your source for tools, code samples, training, and ongoing support, you can access all the necessary resources—from seeking IoT solution inspiration to applying tangible developer insights to help you build and optimize your solution for commercialization. And you gain the flexibility to match the tools you choose with your current skills, while scaling up as you see fit. 
Discover all of the advantages of working with Intel at the Intel® Developer Zone for IoT.

Clone of Installing the OpenVINO™ Toolkit for Windows*

$
0
0

NOTE: The OpenVINO™ toolkit was formerly known as the Intel® Computer Vision SDK.
These steps apply to Windows* 10. For Linux* instructions, see the Linux installation guide.

Introduction

The OpenVINO™ toolkit quickly deploys applications and solutions that emulate human vision. Based on Convolutional Neural Networks (CNN), the toolkit extends computer vision (CV) workloads across Intel® hardware, maximizing performance. The OpenVINO™ toolkit includes the Intel® Deep Learning Deployment Toolkit. For more information, see the OpenVINO overview information on the Web site.

The OpenVINO™ toolkit for Windows*:

  • Enables CNN-based deep learning inference on the edge
  • Supports heterogeneous execution across a CPU, Intel® Integrated Graphics, and Intel® Movidius™ Neural Compute Stick
  • Speeds time-to-market via an easy-to-use library of computer vision functions and pre-optimized kernels
  • Includes optimized calls for computer vision standards including OpenCV*, OpenCL™, and OpenVX*

Included with Installation

Component                                        Description
Deep Learning Model Optimizer                    Model import tool. Imports trained models and converts to IR format for use by the Deep Learning Inference Engine. This is part of the Intel® Deep Learning Deployment Toolkit.
Deep Learning Inference Engine                   Unified API to integrate the inference with application logic. This is part of the Intel® Deep Learning Deployment Toolkit.
Drivers and runtimes for OpenCL™ version 2.1     Enables OpenCL 1.2 on the GPU/CPU for Intel® processors
Intel® Media SDK                                 Offers access to hardware accelerated video codecs and frame processing
OpenCV version 3.4.1                             OpenCV community version compiled for Intel® hardware. Includes PVL libraries for computer vision.
OpenVX* version 1.1                              Intel's implementation of OpenVX* 1.1 optimized for running on Intel® hardware (CPU, GPU, IPU).
Sample Applications                              A set of simple console applications demonstrating how to use Intel's Deep Learning Inference Engine in your applications. Additional information about building and running the samples can be found in the Inference Engine Developer Guide.

System Requirements

This guide includes only information related to Microsoft Windows* 10 64-bit. See the Linux installation guide for Linux information and instructions.

NOTE: Only the CPU and Intel® Integrated Graphics processor options are available. Linux is required for using the FPGA or Intel® Movidius™ Myriad™ 2 VPU options.

Development and Target Platforms

The development and target platforms have the same requirements, but you can select different components during the installation, based on your intended use.

Processor

  • 6th-8th Generation Intel® Core™
  • Intel® Xeon® v5 family, Intel® Xeon® v6 family

Processor Notes:

  • Processor graphics are not included in all processors. See https://ark.intel.com/ for information about your processor.
  • A chipset that supports processor graphics is required for Intel® Xeon® processors.

Operating System

Microsoft Windows* 10 64-bit

Installation

The steps in this guide assume you have already downloaded a copy of the OpenVINO™ toolkit for Windows*. If you do not have a copy of the package, you can download the latest version here, then return to this guide to proceed with installation.

Install External Software Dependencies

Install Core Components

  1. Download the OpenVINO toolkit. By default, the file is saved to Downloads as w_openvino_toolkit_p_2018.1.<version>.exe
  2. Go to the Downloads folder.
  3. Double-click w_openvino_toolkit_p_2018.1.<version>.exe. A screen displays with options to choose your installation directory and components:
  4. Click Next.
  5. The next screen warns you about any missing components and the effect the missing component has on installing or using the OpenVINO toolkit:
  6. If you are missing a critical component, click Cancel, resolve the issue, and then restart the installation.
  7. When the installation completes, click Finish to close the wizard and open the Getting Started Guide in a browser.
  8. Make sure the installation directory is populated with sub-folders. The default installation location is C:\Intel\computer_vision_sdk_2018.1.<versions>. 

Next Steps

IMPORTANT: Before using the Model Optimizer to work with your trained model, make sure your Caffe*, TensorFlow*, or MXNet* framework is prepared for any custom layers you have in place. The following information will put you on the way to doing this.

Learn About the OpenVINO™ Toolkit

Before using the OpenVINO™ toolkit, read through the product overview information on the Web site to gain a better understanding of how the product works.

Compile the Extensions Library

Some topology-specific layers, like DetectionOutput used in the SSD*, are delivered in source code that assumes the extensions library is compiled and loaded. The extensions are required for pre-trained models inference. While you can build the library manually, the best way to compile the extensions library is to execute the demo scripts.

Run the Demonstration Applications

To verify the installation, run the demo apps in <INSTALL_FOLDER>\deployment_tools\demo. For demo app documentation, see the README.txt in <INSTALL_FOLDER>\deployment_tools\demo.

The demo apps and their functions are described below; a sketch of the commands used to launch them follows the list.

  • demo_squeezenet_download_convert_run.bat. This demo illustrates the basic steps used to convert a model and run it, enabling the Intel® Deep Learning Deployment Toolkit to perform a classification task with the SqueezeNet model. This demo:
    • Downloads a public SqueezeNet model.
    • Installs all prerequisites to run the Model Optimizer.
    • Converts the model to an Intermediate Representation.
    • Builds the Inference Engine Image Classification Sample from the <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\classification_sample
    • Runs the sample using cars.png from the demo folder.
    • Shows the label and confidence for the top-10 categories.
  • demo_security_barrier_camera_sample.bat. This demo shows an inference pipeline using three of the pre-trained models included with OpenVINO. The region found by one model becomes the input to the next. Vehicle regions found by object recognition in the first phase become the input to the vehicle attributes model, which locates the license plate. The region identified in this step becomes the input to a license plate character recognition model. This demo:
    • Builds the Inference Engine Security Barrier Camera Sample from the <INSTALL_FOLDER>\deployment_tools\inference_engine\samples\security_barrier_camera_sample.
    • Runs the sample using car_1.bmp from the demo folder.
    • Displays the resulting frame with detections rendered as bounding boxes and text.
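
A minimal sketch of launching these demos from a Command Prompt, assuming the default installation path used earlier in this guide:

cd <INSTALL_FOLDER>\deployment_tools\demo
demo_squeezenet_download_convert_run.bat
demo_security_barrier_camera_sample.bat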

 

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Arria, Core, Movidius, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos

*Other names and brands may be claimed as the property of others.

Copyright © 2018, Intel Corporation. All rights reserved.


Computer Vision Glossary of Vocabulary and Concepts

$
0
0

Computer Vision is a rapidly evolving area. This guide is to provide a starting point to understanding some of the terminology used in computer vision and the OpenVINO™ project.

I hope to make this a useful reference for people that are learning to develop, sell, train or otherwise understand the concepts and vocabulary in this fascinating area of research.

This first article is for people who are in sales, marketing, project management or who would otherwise like to be knowledgeable about OpenVINO™, but are not specialists, researchers or engineers. 

If you have words, abbreviations or other concepts that you think should be included in this cheat sheet, then feel free to email them to me at daniel.w.holmlund@intel.com.

Glossary

Introductory Terms for Non-Developers

This section contains foundational key terms that any non-expert should know to speak knowledgeably on the topic of OpenVINO™.

  • Caffe*
  • Computer Vision
  • Convolutional Neural Network (CNN) - Convolutional Neural Networks are Neural Networks that make the explicit assumption that the inputs are 1d, 2d or multi-dimensional arrays. This assumption allows us to simplify the neural network architecture and make it more efficient for applications in computer vision that use images or video.
  • CPU
    • A central processing unit (CPU) is the electronic circuitry within a computer that carries out the instructions of a computer program by performing the basic arithmetic, logical, control and input/output (I/O) operations specified by the instructions.
    • https://en.wikipedia.org/wiki/Central_processing_unit 
  • Deep Learning Inference Engine
    • An inference engine is a component of a system that applies logical rules to a set of inputs to deduce new information. 
    • The Intel® deep learning inference engine is a piece of software that runs trained neural network models. It receives input, runs it through the trained neural network, and delivers the output.
  • FPGA
    • A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by a customer or a designer after manufacturing. It can be specialized for accelerating the highly parallelled processing tasks required in computer vision.
    • https://en.wikipedia.org/wiki/Field-programmable_gate_array 
  • FPGA Inference Accelerator
    • A FPGA that is specialized for running the Intel(R) deep learning inference engine at high speeds.
  • GPU
    • A graphics processing unit (GPU) is a specialized electronic circuit designed to rapidly manipulate data associated with computer vision and computer graphics. 
  • Hardware Heterogeneity
    • Hardware Heterogeneity refers to the idea that computer software should be able to identify and run on a combination of different hardware. For example, if a computer vision program has access to a CPU, GPU and an FPGA then it should be able to use all three in an optimal manner. 
  • Intel® Arria® 10 FPGA GX
  • Intel® Media SDK
  • Intel® Movidius™ brand
    • The trademark name given to products developed by a computer vision company named Movidius™ that was acquired by Intel in September 2016.
  • Intel® Movidius™ Neural Compute Stick
    • The Intel® Movidius™ Neural Compute Stick (NCS) is a tiny, fanless, deep learning device that you can use to learn AI programming at the edge. NCS is powered by the same low power high performance Intel® Movidius™ Vision Processing Unit (VPU) that can be found in millions of smart security cameras, gesture controlled drones, industrial machine vision equipment, and more.
  • Model
    • Model is a trained neural network that specializes in a particular activity. More formally, it is a neural network that has been trained to approximate a particular function. 
  • Model Optimizer
    • The Model Optimizer is a cross-platform command-line tool that takes pre-trained deep learning models from Caffe*, TensorFlow* and MXNet* and converts them to an intermediate representation for use with the inference engine. It performs static model analysis and adjusts deep learning models for optimal execution on end-point target devices.
  • MxNet
  • Neural Network Topology
    • The total number of neurons and all of their connections and weights  are referred to as the Neural Network Topology. 
  • Neural Network 
  • Object Detection
  • OpenCL™
    • OpenCL™ (Open Computing Language) is a framework for writing programs that execute across heterogeneous  platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors or hardware accelerators. - Wikipedia 
  • OpenCV
    • OpenCV (Open Source Computer Vision Library) is a software library that specializes in real-time computer vision algorithms. Originally, started by Intel, OpenCV is now open source and the most widely used computer vision library in the world. https://opencv.org/  
  • OpenVINO™
    • Open Visual Inference & Neural Network Optimization (OpenVINO™) toolkit provides computer vision libraries and deep neural network and convolutional neural networks (CNN) libraries, the toolkit extends workloads across Intel® hardware and maximizes performance. - https://software.intel.com/en-us/openvino-toolkit 
  • OpenVX*
    • OpenVX* is an open, royalty-free standard for cross platform acceleration of computer vision applications. OpenVX enables performance and power-optimized computer vision processing, especially important in embedded and real-time use cases such as face, body and gesture tracking, smart video surveillance, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, robotics and more. - https://www.khronos.org/openvx/
  • TensorFlow*
    • TensorFlow* is an open source software library for high performance numerical computation. It comes with strong support for machine learning and deep learning, and its flexible numerical computation core is used across many other scientific domains.
  • The Edge (of the Network) 
    • Networks located on the periphery of a centralized network. Devices attached at the edge are often user facing.
  • VPU
    • A Visual Processing Unit is dedicated silicon that is designed for processing computer vision media including images and video. It’s often used in conjunction with Intel® Movidius™ technology.


Accessing Remote Persistent Memory with Block Semantics Using SPDK and PMDK

$
0
0

Introduction

Persistent memory enables persistence at cache line granularities, compared to block granularity for traditional block storage. But in some cases, legacy software may need to access remote persistent memory using block semantics. This is not a primary use case for persistent memory but may be useful to present a small portion of a larger persistent memory pool over a network fabric. This article describes how the open source Software Performance Development Kit (SPDK) integrates with the Persistent Memory Development Kit (PMDK) to enable low-latency remote access to persistent memory with traditional block semantics using NVM Express* (NVMe*) over Fabrics (NVMe-oF).

Persistent Memory Development Kit Support for Block Storage

A key aspect of block semantics is guaranteeing write atomicity. When a block is written and a power failure occurs, we want to ensure that either all of the data or none of the data is written. This is critical for writing correct storage software such as write-ahead logs. The PMDK libpmemblk library provides such a guarantee for implementing block storage on top of persistent memory. Libpmemblk utilizes a block translation table (BTT), which behaves similarly to a flash translation layer (FTL) found in modern solid-state drives (SSDs). The BTT acts as an indirection table, enabling a separation of copying user data to a block-sized region of persistent memory from the mapping of that region to a logical block address.
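
As a quick illustration of libpmemblk outside of SPDK, the pmempool utility that ships with PMDK can create and inspect a pmemblk pool. This is only a sketch: the path and sizes below are placeholders, and the exact options may vary with your PMDK version (see the pmempool-create man page).

# create a 1 GiB libpmemblk pool with a 4096-byte block size on a pmem-aware file system
pmempool create -s 1G blk 4096 /mnt/pmem/example.blkpool
# print the pool metadata, including the BTT layout
pmempool info /mnt/pmem/example.blkpool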

Storage Performance Development Kit

Next, how do we present this block storage over a network fabric? NVMe-oF is a popular answer. NVMe-oF is designed for modern multicore architectures, enabling multiple queues for parallel access, and using remote direct memory access (RDMA) protocols to reduce software overhead and minimize latency.

Enter the SPDK. It provides a set of tools, libraries, and applications for creating high performance, scalable, user-mode storage applications. One of SPDK's applications is a poll-mode NVMe-oF target. SPDK provides a block device layer called bdev, which provides a generic interface to a heterogenous set of block devices that are created by bdev modules. Examples of SPDK bdev modules include:

  • NVMe—for accessing either local or remote attached storage using the NVMe protocol
  • Malloc—for accessing DRAM as a RAM disk
  • Ceph RBD—for accessing Ceph* RADOS block devices
  • PMDK—for accessing PMDK libpmemblk pools

SPDK and PMDK Integration

The SPDK PMEM bdev driver uses the pmemblk pool as the target for block input/output (I/O) operations.

Schematic of block device abstraction

Here we see the block device abstraction, which presents the PMDK pool as a block device that can be served as an NVMe-oF namespace over the network fabric. The client system can then access this storage remotely using the NVMe-oF protocol with any NVMe-oF compliant driver.

Configuration

Let's walk through how to configure an SPDK NVMe-oF target with libpmemblk-based namespaces.

First, we need to configure the target system. These instructions assume that you are already familiar with PMDK and have installed PMDK on the target system. If this is not the case, instructions for installing PMDK can be found on the Intel® Developer Zone at Getting Started with Persistent Memory.  We also assume that you are familiar with RDMA and have RDMA interfaces configured on both the target and client systems.

Start by building SPDK on the target system. See the instructions for downloading the source code, installing prerequisite packages, and building SPDK. To enable PMDK with SPDK, you must pass --with-pmdk to the SPDK configure script.
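For reference, the build step typically looks like the following (a sketch of the usual SPDK build flow; consult the SPDK documentation for the prerequisites of your specific release):

cd <spdk root directory>
./configure --with-pmdk
make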

Now we can start the NVMe-oF target:

cd <spdk root directory>
app/nvmf_tgt/nvmf_tgt

The NVMe-oF target should now be running. In a separate terminal window, we will use the SPDK RPC interface to configure a libpmemblk SPDK block device (bdev). Later, we will attach this bdev to an NVMe-oF namespace.

This example creates the backing storage in /tmp, but this can be changed to any directory in a persistent memory-enabled file system. The bdev will be 8192 blocks, with a block size of 4096, for a total of 32 MiB (mebibytes). The name of the bdev will be pmdk0.

cd <spdk root directory>
scripts/rpc.py create_pmem_pool /tmp/spdk_pool 8192 4096
scripts/rpc.py construct_pmem_bdev /tmp/spdk_pool -n pmdk0

Now we can create the NVMe-oF subsystem and attach the pmdk0 bdev as a namespace:

scripts/rpc.py construct_nvmf_subsystem \
nqn.2016-06.io.spdk:cnode1 '' '' -a -s SPDK0001
scripts/rpc.py nvmf_subsystem_add_listener \
nqn.2016-06.io.spdk:cnode1 -t RDMA \
-a 192.168.10.1 -s 4420
scripts/rpc.py nvmf_subsystem_add_ns \
nqn.2016-06.io.spdk:cnode1 pmdk0

The client system should now be able to connect:

nvme connect -t rdma -n nqn.2016-06.io.spdk:cnode1 \
-a 192.168.10.1 -s 4420

Summary

Using block protocols over a network fabric can quickly enable legacy applications to take advantage of persistent memory. NVMe-oF is the ideal protocol for this type of persistent memory access. The SPDK NVMe-oF target provides an easy-to-use PMDK plugin to enable persistent memory access over NVMe-oF, and its user space poll mode architecture optimizes access latency, compared to traditional interrupt-driven target applications.

About the Author

Jim Harris is a principal software engineer in Intel's Data Center Group. His primary responsibility is architecture and design of the Storage Performance Development Kit (SPDK). Jim has been at Intel since 2001, serving in a wide variety of storage software related roles. Jim has a B.S. and M.S. in Computer Science from Case Western Reserve University.

Code Sample: Panaconda - A Persistent Memory Version of the Game Snake

$
0
0
File(s): Download
License: 3-Clause BSD License
Optimized for...
OS: Linux* kernel version 4.3 or higher
Hardware: Emulated: See How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
Software (Programming Language, tool, IDE, Framework): C++ compiler and Persistent Memory Development Kit (PMDK)
Prerequisites: Familiarity with C++


Introduction

Panaconda Game

Snake is a beloved game from childhood where you use arrow keys to navigate the board, pick up food to grow your snake, and avoid hitting walls or your own tail. Panaconda is a game of Snake designed to demonstrate persistent memory pools, pointers, and transactions. All objects are stored in persistent memory, which means that in the case of a power failure or application crash, the state of the game will be retained and you can continue playing from the point you were at before the failure. In this example, we demonstrate the details of what makes this game persistent, discuss how you can make your applications persistent by using similar methods, and wrap up with how to play the Panaconda game.

This article assumes you have basic knowledge of persistent memory and the concepts used in the Persistent Memory Development Kit (PMDK) libraries. Our article, Introduction to Programming with Persistent Memory from Intel, provides a great introduction to what persistent memory is and why it is revolutionary. For setting up your development environment, refer to our Getting Started guide. Pmem.io has a great tutorial series describing use of the libpmemobj library for persistent memory programming. It is highly recommended to at least read part 1, which introduces the basic concepts demonstrated in this article.


Game Design

In Panaconda, everything happens within a while loop in main. Until the snake is stopped for any reason, it will loop taking steps, setting food, and checking for collisions.

while (!snake_game->is_stopped()) {
	…
}

Data structures

data structure flowchart
Figure 1. Data structure for Panaconda.

The game has three main classes: game, game_board, and snake. In the above diagram, you can see the additional classes and how they interact. The game class is the root object. This object is what anchors all the other objects created in the persistent memory pool. Through the game class, all other objects in the pool can be reached. This happens in the init() function of game, as shown:

persistent_ptr<game_state> r = state.get_root();

Game

In addition to being the root object, the game class checks whether the specified game file already exists. In the snippet below, the pool checks the LAYOUT_NAME stored in game_state to see if it matches the game file passed in; this is how the code determines whether the pool already exists. Whether the pool is being created, or already exists and is being opened, it is assigned to the pool object pointer (pop) variable.

if (pool<game_state>::check(params->name, LAYOUT_NAME) == 1)	
    pop = pool<game_state>::open(params->name, LAYOUT_NAME);
else
	pop = pool<game_state>::create(params->name, LAYOUT_NAME,
				           PMEMOBJ_MIN_POOL * 10, 0666);

In game::init we see our first transaction. This transaction wraps the maze setup process. If a power failure happens, the data structure does not get corrupted because all changes are rolled back. More details about transactions and how they are implemented can be found in the C++ bindings for libpmemobj (part 6) – transactions on pmem.io.

transaction::exec_tx(state, [&r, &ret, this]() {
	r->init();
	if (params->use_maze)
		ret = parse_conf_create_dynamic_layout();
	else
		ret = r->get_board()->creat_static_layout();

	r->get_board()->create_new_food();
});

use_maze is set to true if the game was started with the "-m" flag. If a custom maze is passed in, the game creates a dynamic layout; otherwise it creates a static, predefined layout.

In this implementation of snake, the game_player class stores score and play_state. The state can be: STATE_NEW, STATE_PLAY, or STATE_GAMEOVER.
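For orientation, a minimal sketch of such a state type is shown below; the exact declaration in the game source may differ.

// Hypothetical sketch only: the three play states named above, expressed as an enum.
enum play_state { STATE_NEW, STATE_PLAY, STATE_GAMEOVER };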

Game_board

Game_board creates persistent pointers to food, layout (the map), and snake. This is also where you would change the game board size if you were to create your own map.

game_board::game_board()
{
	persistent_ptr<element_shape> shape =
		make_persistent<element_shape>(FOOD);
	food = make_persistent<board_element>(0, 0, shape,
					      direction::UNDEFINED);
	layout = make_persistent<list<board_element>>();
	anaconda = make_persistent<snake>();
	size_row = 20;
	size_col = 20;
}

In the above snippet, the following objects are returned as persistent object pointers:

food—a persistent pointer to a board element of shape FOOD, with no point nor direction defined

layout—a persistent pointer to a list of board elements

anaconda—a persistent pointer to a snake object, which ultimately is a list of board elements

All of these allocations are part of a transaction, so if the game aborts, the allocations are rolled back, reverting any memory allocation back to its original state. More information about the make_persistent function can be found by reading C++ bindings for libpmemobj (part 5) – make_persistent.

Keep in mind, in the game_board destructor, these pointers are deleted using the following syntax:

game_board::~game_board()
{
	layout->clear();
	delete_persistent<list<board_element>>(layout);
	delete_persistent<snake>(anaconda);
	delete_persistent<board_element>(food);
}

Another function of the game_board class is to keep track of collisions. If the snake's head hits food, the snake gets longer and the game gets harder. Collisions happen between the snake and food, between the snake and a wall, and between the snake and its own body.

bool is_snake_collision(point point);
bool is_wall_collision(point point);
bool is_snake_head_food_hit(void);

Snake

In the snake class, snake_segments is a persistent pointer to a list of board_element objects. Initially, the snake is populated with five segments. More segments are added as the snake hits food.

snake_segments = make_persistent<list<board_element>>();

The move function in snake uses persistent pointers and a for loop to iterate through each element of snake_segments. The loop iterates backwards to assign the previous snake_segments point and location to the next segment. This gives the look of the snake moving. When the loop reaches the first element of snake_segments, it calculates the next position and sets the direction based on the direction that was passed into the function.

void snake::move(const direction dir)
{
	int snake_size = snake_segments->size();
	persistent_ptr<point> new_position_point;

	last_seg_position = *(snake_segments->get(snake_size - 1)->get_position().get());
	last_seg_dir = snake_segments->get(snake_size - 1)->get_direction();

	for (int i = (snake_size - 1); i >= 0; --i) {
		if (i == 0) {
			new_position_point =
				snake_segments->get(i)->calc_new_position(dir);
			snake_segments->get(i)->set_direction(dir);
		} else {
			new_position_point =
				snake_segments->get(i)->calc_new_position(
					snake_segments->get(i - 1)->get_direction());
			snake_segments->get(i)->set_direction(
				snake_segments->get(i - 1)->get_direction());
		}
		snake_segments->get(i)->set_position(new_position_point);
	}
}

As you can see in the image below, the snake is basically a moving array of snake_segments. Each element of snake_segments contains the x, y point where it is located and the direction it is going. When snake::move(const direction dir) is called, each element takes the position of the one in front of it. The first element moves based on the direction that was passed into the function.

Panaconda Game
Figure 2. Image of snake_segments before and after move function.


To Play

First, download and build Panaconda. Installation assistance, including dependencies, can be found in the PMDK readme.

Launch the game

$./panaconda /path/game/session/gameFile

The gameFile is where the game session is stored. This is either created the first time you play, or you can open a game file where you previously played. If this is your first time, make up a name for your file and start the game like this:

$./panaconda myFirstGameFile

Additionally, you can create your own game maze or use a friend's. "-m" specifies that you want to use a custom maze.

$./panaconda /path/game/session/gameFile -m /path/myMapCfg

panaconda/conf.cfg contains an example of a predefined maze. The maze is defined using a bitmap, where "1" is a wall and "0" is an open space. Currently, the map size is limited to 80 x 40 bits, but that is configurable in the code. Try creating your own maze and see if you can beat it; then share your maze with a friend!
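For illustration only, a tiny maze in that bitmap style might look like the following; this is a made-up example, so check panaconda/conf.cfg for the exact format the game expects.

111111
100001
100001
111111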

Controls

Panaconda uses the arrow keys to move. "q" quits the game and "n" creates a new game.

To simulate killing your game, press "ctrl+c" or "q", or execute "kill `pgrep panaconda`" in another terminal. This returns you to the command line and exits the game. To resume, simply launch the game again using the same game file you previously specified. Because of the game's persistent aspects, it resumes exactly where you left off.

Objective

The goal of the game is to stay alive as long as possible while growing your snake longer and longer. The snake grows longer when it runs into a food block. But be careful to avoid hitting any obstacles, walls, or other parts of the snake itself.


Summary

In this code sample, we saw examples of transactions and persistent pointers. We examined game, the root object that anchors all the other objects in the persistent memory pool. This is just one example of how persistent memory can be used. Although simple and fun, this sample demonstrates fundamental persistent memory programming concepts. If you're interested in learning more, I encourage you to dig deeper into Panaconda, or explore our other code samples on software.intel.com or in our GitHub* repo.

The PMDK is available on GitHub and on the Persistent Memory Programming home page.


About the Author

Kelly Lyon is a Developer Advocate at Intel Corporation with three years of previous experience as a Software Engineer. Most recently, Kelly was a maintainer of Snap, Intel’s open source telemetry framework. Kelly is dedicated to advocating for users and looks forward to bringing clarity to complex ideas by researching and providing simple, easy to understand guides and tutorials. Follow her journey on Twitter @a_lyons_tale.




Keras Implementation of Siamese-like Networks


Abstract

Deep learning has revolutionized the field of machine learning. Convolutional Neural Networks (CNNs) have become very popular for solving problems related to image recognition, image reconstruction, and various other computer vision problems. Libraries such as TensorFlow* and Keras make the programmer’s job easier, but they do not directly provide support for complex networks and uncommonly used layers. This guide will help you write complex neural networks such as Siamese networks in Keras. It also explains the procedure for writing your own custom layers in Keras.


Introduction

Person re-identification is the task of identifying whether the same person appears in a given pair of images. Some of the challenges in tackling this problem come from pictures being taken from different viewpoints and from variations in light intensity, which can make pictures of different people look similar and thus create false positives. The Normalized X-Corr model1 is used to solve the problem of person re-identification. This guide demonstrates a step-by-step implementation of a Normalized X-Corr model, which is a modification of a Siamese network2, using Keras.

Normalized X-Corr model
Figure 1. Architectural overview of a Normalized X-Corr model.


Overview of the Normalized X-Corr Model

Arulkumar Subramaniam and his colleagues1 propose a deep neural network that treats the problem as binary classification. Figure 1 gives an overview of the Normalized X-Corr (normxcorr) model. First, the images are passed through conv-pool-conv-pool layers to extract features from the two images. The idea behind these layers is to extract features from the images, so the weights of the conv layers are shared (i.e., both images are passed through the same layers). After extracting the features, establishing a similarity between the features is necessary. This is done by the normalized correlation layer, a custom layer that will be discussed later in this guide. This layer takes a small 5×5 patch from one feature map, convolves it around the other feature map, and calculates the normalized correlation given by:

Equation: normalized cross-correlation of patches X and Y; each patch is mean-subtracted, divided by its standard deviation, and the dot product of the results is taken (see the custom layer implementation below).

We denote the feature maps of the two images as X and Y. Considering the sizes in Figure 1, we take a patch from X centered at (x, y) at a given depth, and normxcorr is calculated with Y(a, b), where 1 <= a <= 12 and y - 2 <= b <= y + 2. Thus, for every X(x, y), 5×12 = 60 values are generated and stored along the depth of the output feature map. This is done at all depths; therefore, the output dimensions are 12×37×1500 (i.e., 60×25).

In Figure 2, the size of the image is assumed to be 8×8 for the purpose of demonstration. If we consider the 5×5 patch centered at the block marked by the red square in image 1, we calculate the Normalized-X-Corr of this patch with the patches marked by the green squares in image 2 (i.e., across the entire width of the image), with height within [3 - 2, 3 + 2], which is [1, 5]. Thus, the total number of values generated by a single patch in image 1 is the allowed width × height (i.e., 8×5 = 40). These values are stored along the depth of the output feature map. Thus, for one patch, we generate an output of 1×1×40. Considering the entire image, we would have a feature map of size 8×8×40. But if the input has more than one channel, the calculated feature maps are stacked one behind the other. The height and width of the output feature map therefore remain the same, but the depth gets multiplied by the depth of the input images. Hence, an input image of 8×8×5 would generate an output feature map of 8×8×(40×5) (i.e., 8×8×200). For the patch centered at the block marked by the blue color, we see that to satisfy these criteria we need to add padding; in such cases, the image is padded with zeros.

After the Normalized-X-Corr layer, two conv layers and pooling have been added to concisely incorporate greater context information. On top of it, two fully connected layers are added and a softmax activation function is applied.

More information about the architecture is available in the paper “Deep Neural Networks with Inexact Matching for Person Re-Identification.”

Demonstrating normalization
Figure 2. Demonstrating normalization correlation layers operation.


Diving into the Code

The code below was tested on Intel® AI DevCloud. The following libraries and frameworks were also used: Python* 3 (February 2018 version), Keras (version 2.1.2), Intel® Optimization for TensorFlow* (version 1.3.0), NumPy (version 1.14.0).

import keras 
import sys 
from keras import backend as K 
from keras.layers import Conv2D, MaxPooling2D, Dense,Input, Flatten 
from keras.models import Model, Sequential 
from keras.engine import InputSpec, Layer 
from keras import regularizers 
from keras.optimizers import SGD, Adam 
from keras.utils.conv_utils import conv_output_length 
from keras import activations 
import numpy as np

These are the imports from Keras and other libraries that we need to implement this model.

a = Input((160,60,3)) 
b = Input((160,60,3))

These create placeholders for the input images.

model = Sequential() 
model.add(Conv2D(kernel_size = (5,5), filters = 20,input_shape = (160,60,3), activation = 'relu')) 
model.add(MaxPooling2D((2,2))) 
model.add(Conv2D(kernel_size = (5,5), filters = 25, activation = 'relu')) 
model.add(MaxPooling2D((2,2)))

These are the layers that need to be shared between the images. Therefore, we create a model of these layers.

feat_map1 = model(b) 
feat_map2 = model(a)

model(a) passes the input it receives through the model and returns the output layer. This is done for both inputs so that they share the same model and output two feature maps, feat_map1 and feat_map2.

normalized_layer = Normalized_Correlation_Layer(stride = (1,1), patch_size = (5, 5))([feat_map1, feat_map2])

This is the custom layer that establishes a similarity between the feature maps extracted from the images. We pass the feature maps as a list input. Its implementation is mentioned later in this guide.

final_layer = Conv2D(kernel_size=(1,1), filters=25, activation='relu')(normalized_layer) 
final_layer = Conv2D(kernel_size=(3,3), filters=25, activation = None)(final_layer) 
final_layer = MaxPooling2D((2,2))(final_layer) 
final_layer = Dense(500)(final_layer) 
final_layer = Dense(2, activation = "softmax")(final_layer)

These are layers that are added on top of the normalized correlation layer.

x_corr_mod = Model(inputs=[a,b], outputs = final_layer)

Finally, a new model is created with inputs as the images to be passed as a list, which gives a binary output.
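As a usage note, the model can then be compiled and trained like any other Keras model. The following is a minimal sketch (not part of the original article) that assumes the custom layer defined later in this guide is available; img_a, img_b, and labels are hypothetical arrays, with the image arrays of shape (num_pairs, 160, 60, 3).

# Hypothetical training sketch; img_a, img_b, and labels are placeholders.
x_corr_mod.compile(optimizer=SGD(lr=0.001, momentum=0.9),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
# x_corr_mod.fit([img_a, img_b], labels, batch_size=32, epochs=10)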

Visualizations of the layers of this model are available in the paper “Supplementary Material for the Paper: Deep Neural Networks with Inexact Matching for Person Re-Identification.”


Normalized Correlation Layer

This is not a layer provided by Keras, so we have to write our own layer with the support provided by the Keras backend.

class Normalized_Correlation_Layer(Layer):

Create a class that inherits from keras.engine.Layer.

	def __init__(self, patch_size=(5,5), 
          dim_ordering='tf', 
          border_mode='same', 
          stride=(1, 1), 
          activation=None, 
          **kwargs): 

       if border_mode != 'same': 
          raise ValueError('Invalid border mode for Correlation Layer ''(only "same" is supported as of now):', border_mode) 
       self.kernel_size = patch_size 
       self.subsample = stride 
       self.dim_ordering = dim_ordering 
       self.border_mode = border_mode 
       self.activation = activations.get(activation) 
       super(Normalized_Correlation_Layer, self).__init__(**kwargs)

This constructor simply stores the values passed as parameters in class variables and initializes its parent class by calling the parent constructor.

def compute_output_shape(self, input_shape):
      return(input_shape[0][0], input_shape[0][1], input_shape[0][2], self.kernel_size[0] * input_shape[0][2]*input_shape[0][-1])

This returns the shape of the feature map produced by this layer as a tuple. The first element is the number of images, the second is the number of rows, the third is the number of columns, and the last one is the depth, which is (allowance to move in height) × (allowance to move in width) × depth. In our case, it's 5×12×25.

def get_config(self): 
   config = {'patch_size': self.kernel_size, 
          'activation': self.activation.__name__, 
          'border_mode': self.border_mode, 
          'stride': self.subsample, 
          'dim_ordering': self.dim_ordering} 
   base_config = super(Normalized_Correlation_Layer, self).get_config() 
   return dict(list(base_config.items()) + list(config.items()))

This adds the configuration passed as arguments to constructor, appends it to those of the parent class, and returns it. This function is called by Keras to get the configurations.
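get_config matters when a saved model containing this layer is loaded back: Keras rebuilds the layer from this dictionary, but it must be told which class to use. A minimal reload sketch follows (not from the original article; the file name is made up).

# Hypothetical reload example: pass the custom class via custom_objects.
from keras.models import load_model
model = load_model('normxcorr_model.h5',
                   custom_objects={'Normalized_Correlation_Layer': Normalized_Correlation_Layer})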

def call(self, x, mask=None):

This function is called on every forward pass. It takes the feature maps produced by the model as its input.

input_1, input_2 = x 
     stride_row, stride_col = self.subsample 
     inp_shape = input_1._keras_shape

Separate the inputs from the list and copy some values into local variables to make them easier to refer to later on.

output_shape = self.compute_output_shape([inp_shape, inp_shape])

This uses the function written earlier to get the desired output shape and store it in the variable.

   padding_row = (int(self.kernel_size[0] / 2),int(self.kernel_size[0] / 2)) 
    padding_col = (int(self.kernel_size[1] / 2),int(self.kernel_size[1] / 2)) 
    input_1 = K.spatial_2d_padding(input_1, padding =(padding_row,padding_col)) 
    input_2 = K.spatial_2d_padding(input_2, padding = ((padding_row[0]*2, padding_row[1]*2),padding_col)) 

This block of code adds padding to the feature maps. This is required because we also take patches centered at (0,0) and along the other edges; therefore, in our case we need to add a padding of 2. But for the feature map of the second input, we need to take patches with an offset of up to 2 from the center of the corresponding patch of the first feature map. Thus, for the patch at (0,0) we need to consider the patches of the second feature map centered at (0,0), (0,1), (0,2), (0,-1), (0,-2) for the same patch location in X. Thus, we need to add a padding of 4 to that dimension.

output_row = output_shape[1] 
output_col = output_shape[2] 

and store them into the variables.

output = [] 
for k in range(inp_shape[-1]):

Loop for all the depths.

   xc_1 = [] 
   xc_2 = [] 
   for i in range(padding_row[0]): 
      for j in range(output_col): 
         xc_2.append(K.reshape(input_2[:, i:i+self.kernel_size[0], j:j+self.kernel_size[1], k], 
                     (-1, 1,self.kernel_size[0]*self.kernel_size[1])))

This is done for the patches of feature map 2 where we have added the extra padding (i.e., the patches that are not centered on the feature map and which are at the first rows).

for i in range(output_row): 
       slice_row = slice(i, i + self.kernel_size[0]) 
       slice_row2 = slice(i + padding_row[0], i +self.kernel_size[0] + padding_row[0]) 
       for j in range(output_col): 
          slice_col = slice(j, j + self.kernel_size[1]) 
          xc_2.append(K.reshape(input_2[:, slice_row2, slice_col, k], 
                      (-1, 1,self.kernel_size[0]*self.kernel_size[1]))) 
          xc_1.append(K.reshape(input_1[:, slice_row, slice_col, k], 
                        (-1, 1,self.kernel_size[0]*self.kernel_size[1])))

Extract patches of size 5×5 from both feature maps and store them in xc_1 and xc_2, respectively. In this case, these patches are flattened and reshaped in form (-1,1,25).

for i in range(output_row, output_row+padding_row[1]): 
       for j in range(output_col): 
           xc_2.append(K.reshape(input_2[:, i:i+ self.kernel_size[0], j:j+self.kernel_size[1], k], 
                       (-1, 1,self.kernel_size[0]*self.kernel_size[1]))) 

This extracts the patches of feature map 2 that are centered beyond the bottom edge of the feature map.

xc_1_aggregate = K.concatenate(xc_1, axis=1)

These patches are joined along axis=1 so that they would be of the shape (-1, 60, 25) for any given depth.

xc_1_mean = K.mean(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_std = K.std(xc_1_aggregate, axis=-1, keepdims=True) 
xc_1_aggregate = (xc_1_aggregate - xc_1_mean) / xc_1_std

This is just the implementation of normalization of the features of the first feature map.

xc_2_aggregate = K.concatenate(xc_2, axis=1)
xc_2_mean = K.mean(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_std = K.std(xc_2_aggregate, axis=-1, keepdims=True) 
xc_2_aggregate = (xc_2_aggregate - xc_2_mean) / xc_2_std

Similarly, for the feature maps of image 2.

xc_1_aggregate = K.permute_dimensions(xc_1_aggregate, (0, 2, 1)) 
     block = [] 
     len_xc_1 = len(xc_1) 
     for i in range(len_xc_1):
         # This loop computes the product of a given patch of feature map 1
         # with the patches of feature map 2 it is supposed to be correlated with.
         sl1 = slice(int(i/inp_shape[2])*inp_shape[2], 
                     int(i/inp_shape[2])*inp_shape[2] + inp_shape[2]*self.kernel_size[0]) 
         # sl1 selects which patches of feature map 2 are to be considered
         # for the given patch of the first feature map.

         block.append(K.reshape(K.batch_dot(xc_2_aggregate[:, sl1, :], 
                      xc_1_aggregate[:, :, i]), (-1, 1, 1, inp_shape[2]*self.kernel_size[0]))) 

Calculate the dot product (i.e., the normalized correlation) and store it in "block".

block = K.concatenate(block, axis=1) 
     block= K.reshape(block,(-1,output_row,output_col,inp_shape[2] *self.kernel_size[0])) 
     output.append(block)

Join the calculated normalized correlation values, reshape them (they are calculated sequentially so that reshaping would be easier), and append it to “output.”

output = K.concatenate(output, axis=-1)

Join the output feature map calculated at each depth, along the depth of “output.”

output = self.activation(output) 
return output

Apply activation if sent as an argument and return the output generated.


Applications

Such a network has various applications, such as matching a person’s identity across crime-scene images. The network can also be generalized to find the similarity between two images (for example, to determine whether the same fruit appears in both images).


Further Scope

The code runs sequentially and is devoid of parallelism. The matrix multiplication of the patches could be parallelized across multiple cores using libraries such as multiprocessing, which would help speed up training. The accuracy of the model could be increased by finding a more suitable similarity measure between the image patches.


Acknowledgement

I would like to thank the Intel® Student Ambassador Program for AI, which provided me with the necessary training resources on the Intel® AI DevCloud and the technical support that helped me to use DevCloud.


References

  1. A. Subramaniam, M. Chatterjee, and A. Mittal. “Deep Neural Networks with Inexact Matching for Person Re-Identification.” In NIPS, 2016.
  2. Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li. “Deep Metric Learning for Person Re-Identification.” In ICPR, 2014.
  3. Code on GitHub*

Performance, Methods, and Practices of DirectX* 11 Multithreaded Rendering


Abstract

Rendering is usually the main performance bottleneck of PC games on the CPU; multithreaded rendering is an effective way to eliminate the bottleneck. This article investigates the performance scalability of DirectX* 11 multithreaded rendering, discusses two basic methods for multithreaded rendering, and introduces the case of traditional multithreading deferred shading pipelines in a large-scale online game, Conqueror's Blade*.

Background

Over the past 10 years, CPUs in the PC market have shown great improvements. According to a software and hardware survey by Steam*2, 4-core processors (usually 8 logical cores) have become mainstream in the current PC game market. The 6-core processor (usually 12 logical cores) is already on its way to becoming the mainstream next-generation CPU; for example, the Intel® Core™ i7-8700K processor, with 6 physical cores and 12 logical cores, has been available since late 2017. We expect this trend to continue, and in the next few years 6-core and 8-core CPUs will become the most popular processors for gamers.

In many PC games, rendering is usually single-threaded and easily becomes the biggest performance bottleneck. This makes it difficult for games to use the extra idle cores of a multicore processor to improve game performance or enrich game content. Although DirectX* 12 has been around for a few years, most of the games currently under development, especially the most popular online games, still use DirectX 11. DirectX 11 was designed from the beginning to support multithreading1. Therefore, investigating the performance scalability of DirectX 11 multithreaded rendering on current mainstream multicore platforms, and studying how to make full use of this feature, has important reference value for the development and optimization of the majority of games.

DirectX* 11 Multithreaded Rendering Model

First, let's briefly review the DirectX 11 multithreaded rendering model (see Figure 1). DirectX 11 supports two types of rendering, immediate and deferred, based on two kinds of Direct3D* 11 device contexts: the immediate context and the deferred context. Immediate rendering calls draw APIs through the immediate context, and the generated commands are immediately sent to the graphics processing unit (GPU). Deferred rendering calls draw APIs through a deferred context, but only records the draw commands in a command list that is submitted to the GPU by the immediate context at a later point. DirectX 11 supports the simultaneous use of multiple deferred contexts in different threads. This allows the rendering of complex scenes to be divided into multiple concurrent tasks; that is, multithreaded rendering.

DirectX 11 multithreaded rendering model
Figure 1. DirectX* 11 multithreaded rendering model.
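To make the model concrete, the following is a minimal sketch (not from the article) of recording commands on a deferred context and executing the resulting command list on the immediate context; error handling and the actual draw calls are omitted, and the device and immediate context are assumed to already exist.

#include <d3d11.h>

void RecordAndSubmit(ID3D11Device* device, ID3D11DeviceContext* immediateContext)
{
    // Worker thread side: create a deferred context and record commands into it
    // (state setup, Draw/DrawIndexed calls, ...) instead of sending them to the GPU.
    ID3D11DeviceContext* deferredContext = nullptr;
    device->CreateDeferredContext(0, &deferredContext);

    // deferredContext->Draw(...);  // scene draw calls would be recorded here

    ID3D11CommandList* commandList = nullptr;
    deferredContext->FinishCommandList(FALSE, &commandList);

    // Main thread side: submit the recorded commands to the GPU through the
    // immediate context.
    immediateContext->ExecuteCommandList(commandList, FALSE);

    commandList->Release();
    deferredContext->Release();
}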

Evaluate DirectX 11 Multithreading Performance Scalability

Based on the hardware and software configuration of Table 1, we evaluate the performance scalability of DirectX 11 multithreaded rendering on multicore CPUs.

Table 1. Hardware and software configurations for performance scalability evaluation.

CPU: Intel® Core™ i7-6950X processor @ 3.00 GHz (10 cores)
Memory: 2 x 16 GB RAM
GPU: NVIDIA GeForce* GTX 1080 / AMD Radeon* RX Vega 64
Driver Version: 22.21.13.8494 (NVIDIA) / 22.19.677.257 (AMD)
Operating System: Windows® 10 Professional 64-bit
Test Program: Microsoft DirectX* SDK (June 2010) sample: MultithreadedRendering11.exe

The evaluation uses the Intel Core i7-6950X processor (10 physical cores; that is, 20 logical cores) to simulate CPUs with different numbers of cores. To ensure that the GPU does not become a performance bottleneck for the test program, the test uses two high-performance discrete GPUs: NVIDIA GeForce* GTX 1080 and AMD Radeon* RX Vega 64. The test program is the MultithreadedRendering11 sample from the Microsoft DirectX SDK*4, chosen mainly for the following reasons. First, the program's performance is CPU-bound, and it was developed to demonstrate the DirectX 11 multithreaded rendering feature, which helps maximize the potential for performance scalability. Second, the program's main function is rendering (each frame contains more than 4,000 draw calls), with no impact from animation, physics load, and so on, so the measured scalability comes as much as possible from DirectX 11 multithreaded rendering. In addition, the program's scene complexity and rendering technology are fairly common in games, so the test results are representative. Last, but not least, the program's source code is open, making it easy to analyze and understand the DirectX 11 multithreaded rendering methods and their impact on performance scalability.

Test program
Figure 2. Test program

When running the test program, we chose the MT Def/Chunk mode, because the scalability in this mode is not limited by the number of game rendering passes (or scenes), but only by the number of CPU cores. The workload of each thread is relatively balanced, which can make full use of the computing power of the multicore CPU. During the test, we adjusted the CPU's active core number through the BIOS and tested the program's frame rate at each of these different core numbers. In order to compare the effects of different GPUs on DirectX 11 multithreaded rendering scalability, we divided the multithreaded frame rate on the same GPU by the single-threaded frame rate (immediate mode) under the same configuration, to obtain a normalized relative performance metric. The test results are shown in Figure 3.

Scalability of DirectX 11 multithreaded rendering
Figure 3. Multicore performance scalability of DirectX* 11 multithreaded rendering.

As we can see from Figure 3, with two CPU cores, no matter which GPU we use, multithreaded rendering (MT Def/Chunk mode) performance is lower than single-threaded rendering (immediate mode). What leads to this result? According to the source code of the test program, the number of working threads is the number of CPU physical cores minus one. In other words, on a two-core CPU in multithreaded rendering mode, only one working thread processes all scene draw calls based on deferred rendering, while the main thread does not handle any scene draw calls. In single-threaded rendering mode, all draw calls are processed by the main thread based on immediate rendering. This means that the overhead of deferred rendering is slightly larger than that of immediate rendering when handling an equal number of draw calls.

However, when the number of CPU cores is greater than two, DirectX 11 multithreaded rendering performance is significantly better than single-threaded rendering, regardless of which GPU is used, and performance increases as the number of cores increases. When paired with the NVIDIA GeForce GTX 1080, multicore performance scales very well; the performance increase is almost linear from 2 to 6 cores, and even from 6 to 10 cores the increase is significant. When paired with the AMD Radeon RX Vega 64, the scalability is worse; in particular, when the number of CPU cores exceeds 4, the performance increase is almost negligible.

Why does the test program show such a large difference in multicore performance scalability on different GPUs? We used Microsoft GPUView* to capture the multithreaded activity of the test program (see Figure 4) and find that the bottleneck is on the CPU with either the NVIDIA GeForce GTX 1080 or the AMD Radeon RX Vega 64 GPU. However, multithreaded concurrency is better with the NVIDIA GPU, and the time the main thread blocks the working threads is significantly longer with the AMD graphics card.

Rendering parallelism with different G P U s
Figure 4. DirectX* 11 multithreaded rendering parallelism with different GPUs.

From the source code, we know that each working thread has a deferred context, and all draw calls for scene rendering are issued through deferred contexts. The main thread owns the immediate context, which is responsible for submitting the command lists generated in the deferred contexts to the GPU. Using Windows* Performance Analyzer to further analyze the modules called by the working threads, we find that, on the NVIDIA GPU, all the working threads call the graphics driver module (see Figure 5). This means that the deferred-context operations share some of the driver load and leave less driver load for the immediate-context operations, thereby shortening the periods during which the main thread blocks the working threads. On the AMD GPU, the graphics driver module does not appear in the working threads but is concentrated on the main thread (see Figure 6), which means that the single immediate context bears a large amount of driver load, thus increasing the time the working threads spend waiting for the main thread.

Some of the N V I D I A driver load
Figure 5. Working thread (deferred context) represents some of the NVIDIA driver load.

A large amount of the driver load
Figure 6. The main thread (immediate context) represents a large amount of the driver load.

By checking GPU driver support for the DirectX 11 multithreaded rendering features3 through the DirectX Caps Viewer (see Figure 7), we learn that the NVIDIA GPU driver supports driver command lists, while the AMD GPU driver does not. This explains why the driver module appears in different contexts on different GPUs. When paired with the NVIDIA GPU, working threads can build driver commands in parallel in deferred contexts; when paired with the AMD GPU, the driver commands are all built serially in the immediate context of the main thread.

 Rendering by different G P U drivers
Figure 7. Support for DirectX* 11 multithreaded rendering by different GPU drivers.
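Applications can also query this capability at run time. Below is a minimal sketch (not from the article) using ID3D11Device::CheckFeatureSupport; the device pointer is assumed to already exist.

#include <d3d11.h>

bool DriverSupportsCommandLists(ID3D11Device* device)
{
    D3D11_FEATURE_DATA_THREADING threadingCaps = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D11_FEATURE_THREADING,
                                              &threadingCaps,
                                              sizeof(threadingCaps))))
    {
        // DriverCommandLists: the driver builds command lists in deferred contexts.
        // DriverConcurrentCreates: the driver allows concurrent resource creation.
        return threadingCaps.DriverCommandLists == TRUE;
    }
    return false;
}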

Based on the above tests and analysis, we can draw the following conclusions:

  • Although the overhead of deferred rendering is larger than that of immediate rendering, the performance of DirectX 11 multithreaded rendering can be significantly higher than that of single-threaded rendering, especially on current mainstream CPUs with four or more cores, when using an appropriate rendering task division method: evenly distributing draw calls across more than two Direct3D* device contexts.
  • The performance scalability of DirectX 11 multithreaded rendering is GPU-related. When the GPU driver supports driver command lists, DirectX 11 multithreaded rendering can achieve good performance scalability; otherwise, scalability is easily constrained by a driver bottleneck. Fortunately, NVIDIA GPUs2, which have the largest share of the current game market, support driver command lists.

Multithreaded Rendering Method

The performance scalability evaluation above shows that, on current mainstream multicore CPUs and GPUs, DirectX 11 games may achieve significant performance improvements from multithreaded rendering. So how do you effectively use this performance potential? The MultithreadedRendering11 sample demonstrates two basic methods for dividing a rendering task among multiple threads:

1) Assign each thread a rendering Pass.

2) Assign each thread an equal number of Chunks.

It should be noted that the multithreaded rendering method described here is not only suitable for DirectX 11 but also for DirectX 12. In fact, we can take the DirectX 11 deferred context as a DirectX 12 command list, and the DirectX 11 immediate context as a combination of the DirectX 12 command list and the command queue.

Figure 8 shows a multithreaded rendering method that divides the rendering task by Pass. A Pass is a relatively independent rendering task; typical Passes include the generation of pre-Z buffers, shadow maps, reflection maps, G-buffers, the UI, and the main Pass that generates the final frame buffer. With this method, each Pass is assigned a working thread, and the command list for that Pass is built in that working thread. The main thread is responsible for distributing the Passes and submitting, in order, the command lists completed by the working threads. In the MultithreadedRendering11 sample, the main thread submits only after all the working threads have completed their command lists. Figure 8 shows a better way: when a command list is completed, it should be submitted to the GPU immediately, as long as the rendering order permits. Since command list submission is serial and carries some overhead, the earlier the submission, the more of that serial time can be hidden, which also lets the GPU start processing earlier.

Divide the rendering task by Pass
Figure 8. Divide the rendering task by Pass.

Dividing rendering tasks by Pass is easy to apply to the multi-pass rendering techniques commonly used in modern games. As long as each Pass contains a relatively large rendering load (number of draw calls), using this method is usually effective in improving performance. The shortcoming is that the performance scalability is limited by the number of Passes, and it is not easy to achieve load balance between Passes.

Figure 9 shows a multithreaded rendering method that divides rendering tasks by Chunk. A Chunk is a rendering task with a smaller granularity than a Pass. A typical Chunk can be a set of draw calls, a mesh, or a larger rendering unit such as a separate rendering object containing multiple meshes. In this method, each Pass is divided into Chunks, which are evenly distributed by the main thread to multiple working threads. Each working thread is responsible for building a command list, and after each command list is completed, the main thread submits them in order. The number of working threads is determined by the number of physical cores, rather than the number of logical cores, to avoid the excessive overhead of too many command list submissions. Using the Pass as the unit for submitting command lists helps unify the render state within a command list, lets the GPU start processing earlier, and allows command lists to be reused between Passes.

Dividing rendering tasks by Chunk
Figure 9. Dividing rendering tasks by Chunk.

The multithreaded rendering method that divides rendering tasks by Chunk can achieve a significant performance improvement; the performance is not affected by the number of Passes and increases as the number of CPU cores increases. The shortcoming is that for situations that require ordered rendering (such as rendering semi-transparent objects), the options for distributing Chunks are limited, and it is easy to lose load balance among the threads, which affects performance scalability.
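For illustration, here is a minimal sketch (not from the article or the game) of the Chunk approach under these assumptions: draw calls are pre-partitioned into per-thread chunks, each worker records its share into its own deferred context, and the main thread executes the resulting command lists in order.

#include <d3d11.h>
#include <thread>
#include <vector>

// Hypothetical helper: issues one worker's chunk of draw calls on the given context.
void RecordChunk(ID3D11DeviceContext* ctx /*, chunk data */)
{
    // ctx->IASetVertexBuffers(...); ctx->DrawIndexed(...); ...
}

void RenderPassByChunks(ID3D11Device* device,
                        ID3D11DeviceContext* immediateContext,
                        unsigned workerCount)
{
    std::vector<ID3D11DeviceContext*> deferred(workerCount, nullptr);
    std::vector<ID3D11CommandList*> lists(workerCount, nullptr);
    std::vector<std::thread> workers;

    for (unsigned i = 0; i < workerCount; ++i)
        device->CreateDeferredContext(0, &deferred[i]);

    // Each worker records its share of the Pass's chunks into its own deferred context.
    for (unsigned i = 0; i < workerCount; ++i)
        workers.emplace_back([&, i] {
            RecordChunk(deferred[i]);
            deferred[i]->FinishCommandList(FALSE, &lists[i]);
        });
    for (auto& t : workers)
        t.join();

    // The main thread submits the command lists in order on the immediate context.
    for (unsigned i = 0; i < workerCount; ++i) {
        immediateContext->ExecuteCommandList(lists[i], FALSE);
        lists[i]->Release();
        deferred[i]->Release();
    }
}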

No matter which of the above multithreaded rendering methods is used, the following points should be noted:

  • Since command list submission is serial and carries a certain amount of overhead, a command list should be submitted as soon as it is completed and the rendering order allows, rather than waiting for the other command lists. This helps hide the serial submission time and relieves the GPU from burst load pressure.
  • To hide the overhead of using deferred contexts, each deferred context, whether for a Pass or a Chunk, should contain enough draw calls. If the number of draw calls processed by a deferred context is too small, consider handling those draw calls in the immediate context or combining them.
  • Try to balance the load between different contexts to maximize the advantages of multithreaded rendering.

Case Study

Here we introduce the multithreaded rendering methods and the effects achieved in a real DirectX 11 game. Conqueror's Blade5 is a large-scale online game developed by NetEase*. The game has large-scale outdoor battle scenes, a large number of characters on the same screen, and rich visual effects. These characteristics make the game demand more CPU resources. To give players on low-end CPU platforms a smooth gaming experience, the developers continue to apply multithreading optimizations to the game engine, improving performance by fully utilizing CPU resources or using the headroom to enrich game details.

Single-threaded rendering causes C P U performance bottlenecks
Figure 10. Single-threaded rendering causes CPU performance bottlenecks.

Before this round of optimization, the engine had already achieved a certain amount of multithreading: some CPU-intensive tasks, such as game logic, physics, particles, and animation, use separate threads. The rendering thread is mainly responsible for visibility detection and running the entire rendering pipeline. Nevertheless, the rendering thread was still a performance bottleneck for the game (see Figure 10). A typical combat scene with more than 5,000 draw calls per frame also results in considerable Direct3D runtime and driver overhead. The game uses the DirectX 11 API and a typical deferred shading pipeline. The task pipeline of the rendering thread is shown in Figure 11.

The game’s task pipeline of rendering thread
Figure 11. The game's task pipeline of rendering thread.

Based on considerations such as the constraints of the game's legacy code and the implementation time available, the game chose a multithreading optimization method that divides rendering tasks by Pass, which was the easier choice to implement in a limited time. The specific implementation scheme is shown in Figure 12.

In the optimization scheme, visibility detection is removed from the rendering thread and divided into two jobs: eye visibility and light visibility. The GBuffer generation, shadow map generation, and the forward and transparent Passes, which have (or may dynamically have) a large number of draw calls, are also moved out of the rendering thread and encapsulated as jobs dispatched to working threads. GBuffer generation is further divided into three jobs (GBuffer Terrain, GBuffer Static, and GBuffer Dynamic) because it contains too many draw calls. The rendering thread retains only the Scaleform UI, Deferred Shading, and Post Process Passes, which must use immediate-context rendering or have only a few draw calls.

Multithreaded rendering flowchart after game optimization
Figure 12. Multithreaded rendering flowchart after game optimization.

During operation, the working threads first process the two visibility-check jobs in parallel; these two jobs then spawn the six rendering-pass jobs, and the working threads build the DirectX 11 command list for each Pass in a deferred context. The rendering thread executes, in order, the Passes left in the rendering thread and the command lists completed by the working threads, using the immediate context.

After the multithreading optimization was complete, the bottleneck of the rendering thread was eliminated, multicore utilization improved significantly (see Figure 13), and the frame rate increased to an average of 1.7 times that before the optimization, achieving the performance target that had been set.

Eliminate bottlenecks after multithreaded rendering
Figure 13. Eliminate bottlenecks after multithreaded rendering to improve multicore utilization.

Although the current solution significantly improves performance, there is still room for improvement in CPU utilization because of the uneven load between Passes. Therefore, to use the idle CPU time to further improve performance or game detail, the developers will rebuild the engine's rendering code and try multithreading optimization that divides rendering tasks by Chunk.

Summary

On the multicore CPUs and GPUs with the largest share of the current game market, DirectX 11 games whose CPU performance bottleneck is rendering may achieve significant performance improvements from multithreaded rendering. Although the multicore performance scalability of DirectX 11 multithreaded rendering is limited on some GPUs with limited driver support, with a reasonable implementation multithreaded rendering will still outperform single-threaded rendering. The key to gaining this advantage is the division and scheduling of rendering tasks; for this reason, this article introduced methods based on Pass and on Chunk. These two methods apply not only to DirectX 11 but also to DirectX 12, so multithreaded rendering optimizations of DirectX 11 games can be easily ported to future DirectX 12 titles. In the game Conqueror's Blade, a Pass-based multithreaded method was successfully applied to a traditional deferred shading pipeline, proving the effectiveness of DirectX 11 multithreaded rendering.

Footnotes

1. Introduction to Multithreading in Direct3D 11

2. Steam Hardware and Software Survey

3. How To: Check for Driver Support

4. Microsoft DirectX SDK (June 2010)

5. Conqueror's Blade official website
