How to use the Intel® Advisor Python API


Introduction

You can now access the Intel® Advisor database using our new Python API. We have provided several reference examples that show how to use this new functionality. The API provides a flexible way to report on useful program metrics (over 500 metric elements can be displayed), and this article describes how to get started with it.

Getting started

To get started, you first need to set up the Intel Advisor environment:

> source advixe-vars.sh

Next, to populate the Intel Advisor database, you need to run some collections. Some of the program metrics require additional analysis, such as trip counts, memory access patterns, and dependencies.

> advixe-cl --collect survey --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect tripcounts -flops-and-masks -callstack-flops --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect map -mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>

> advixe-cl --collect dependencies -mark-up-list=1,2,3,4 --project-dir ./your_project -- <your-executable-with-parameters>

Finally, you will need to copy the Intel Advisor reference examples to a test area:

cp -r /opt/intel/advisor_2018/python_api/examples .

Using the Intel Advisor Python API

The reference examples we have provided are just a small sample of the reporting that is possible with this flexible way to access your program data. The file columns.txt provides a list of the metrics we currently support. Here are some examples showing the Python API in action (a minimal custom-report sketch follows the list):

  1. Generate a combined report showing all data collected:
         python joined.py ./your_project >& report.txt
  2. Generate an HTML report:
         python to_html.py ./your_project
  3. Generate a Roofline HTML chart:
         python to_html.py ./your_project
  4. Generate cache simulation statistics. Before doing your data collection, set the following environment variable: export ADVIXE_EXPERIMENTAL=cachesim
     1. Run a memory access pattern collection to collect cache statistics:
            advixe-cl --collect map -mark-up-list=4 --project-dir ./your_project -- <your-executable-with-parameters>
     2. Set up cache collection in the project properties.
     3. python cache.py ./your_project
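
Below is a minimal sketch of a custom report script written in the same spirit as the bundled examples. The install path, the location of the advisor module, and the two column names used ('loop_name' and 'self_time') are assumptions; verify them against the examples directory and columns.txt for your Advisor version.

import sys

# Intel Advisor ships its Python module alongside the examples; adjust the
# path below to match your installation (assumed 2018 default install path).
sys.path.append('/opt/intel/advisor_2018/python_api')
import advisor

# Usage: python my_report.py ./your_project
project = advisor.open_project(sys.argv[1])
survey = project.load(advisor.SURVEY)

# Each bottom-up row exposes the metrics listed in columns.txt; the two
# column names printed here are illustrative and may differ in your version.
for row in survey.bottomup:
    print(row['loop_name'], row['self_time'])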

Conclusion/Summary

The new Intel Advisor Python API provides a powerful way to generate meaningful program statistics and reports. The provided examples give a framework of scripts showing the power of this new interface.

 

 


Machine Learning and Knowledge Reasoning Probing with Intel® Architecture


Introduction

Intelligence varies in kind and degree, and it occurs in humans, many animals, and some machines. For machines, Artificial Intelligence (AI) is said to be the set of methods and procedures that give machines the ability to achieve goals in the world. It is present in many areas of study such as deep learning, computer vision, reinforcement learning, natural language processing, semantics, learning theory, case-based reasoning, robotics, and so on. During the 1990s, attention was on logic-based AI, mainly concerned with knowledge reasoning (KR), whereas the focus nowadays lies on machine learning (ML). This shift contributed to the field in a way knowledge reasoning never did. However, a new shift is coming. Knowledge reasoning is resurging in response to a demand for inference methods, while machine learning keeps building on its achievements with statistical approaches. This new change occurs as knowledge reasoning and machine learning begin to cooperate with each other, a scenario for which the computing landscape is not yet defined.

Intelligent computing is pervasive, demands are monotonically increasing, and the time allowed for results is shortening. But while consumer products operate under those conditions, the process of building the complex mathematical models that support such applications relies on a computational infrastructure that demands large amounts of energy, time, and processing power. There is a race to develop specialized hardware to make modern AI methods significantly faster and cheaper.

The strategy of packing such specialized hardware with elaborate software components into a successful architecture is a wise plan of action. Intel® added top machine learning technology to its expertise when it acquired the hardware and software startup Nervana™ (now Intel® Nervana™). Moreover, the well-known Altera®, which makes FPGA chips that can be reconfigured to speed up specific algorithms, was also integrated into the company. Therefore, the power and energy efficiency of Intel® processors and architecture can help companies, software houses, cloud providers, and end-user devices upgrade their capability to use AI. The relevance of such chips for developing and training new AI algorithms should not be underestimated.

AI systems are usually perceived only as software, since this is the layer nearest to ordinary developers and final users. However, they also require substantial hardware capability to support their calculations. This is why choosing the Deep Neural Network (DNN) performance primitives within the Intel® Math Kernel Library (Intel® MKL) and the Intel® Data Analytics Acceleration Library (Intel® DAAL) is a clever decision, since such libraries allow better usage of Intel processors and support AI development through hardware. Intelligent applications need to rely on CPUs that perform specific types of mathematical calculations such as vector algebra, linear equations, eigenvalues and eigenvectors, statistics, matrix decomposition, and linear regression, and that handle large quantities of basic computations in parallel, to mention some. Concerning machine learning, many neural network solutions are implemented close to the hardware, and deep learning requires a huge amount of matrix multiplication. Considering knowledge representation, forward and backward chaining1 demand many vector algebra computations, while the resolution principle2 requires singular value decomposition. Therefore, AI benefits from specialized processors with speedy connections between parallel onboard computing cores, fast access to ample memory for storing complex models and data, and mathematical operations optimized for speed.

There are many research and development reports describing the use of Intel® architecture to support machine learning applications. However, the same platform can also be used for the symbolic AI approach, a market segment that has been overlooked by programmers and software architects. This paper aims to promote the use of Intel® architecture to speed up not only machine learning, but also knowledge reasoning applications.

Performance Test

In order to illustrate that knowledge reasoning applications can also benefit from using Intel architecture, this test considers two tasks from real artificial intelligence problems: one as a baseline for comparison and the other as a knowledge reasoning sample. The first task (ML) represents the machine learning approach by using the Complement Naive Bayes3 classifier (github.com/oiwah/classifier) to identify the encryption algorithm used to encode plain text messages4. The classification model is constructed by training over 600 text samples, with more than 140,000 characters each, ciphered with DES, Blowfish, ARC4, RSA, Rijndael, Serpent, and Twofish. The second task (KR) represents the knowledge reasoning approach by using the Resolution Principle2 from an inference machine called Mentor*5 (not publicly available) to detect fraud in car insurance claims. The sample is composed of 1,000 claims, and the inference machine is loaded with 78 first-order logic rules.

Performance is measured by how many RDTSC clock cycles (Read Time Stamp Counter)6 it takes to run the tests. RDTSC was used to track performance rather than wall-clock time because it counts clock ticks and is thus invariant even if the processor core changes frequency. This is not the case with wall-clock time, so RDTSC is a more precise measuring method. Note, however, that performance is traditionally measured with wall-clock time, since it provides acceptable precision.

Tests were performed on a system equipped with an Intel® Core™ i7-4500U processor @ 1.8 GHz, 64-bit, Ubuntu 16.04 LTS operating system, 8 GB RAM, Hyper-Threading turned on with 2 threads per core (you can check this by typing sudo dmidecode -t processor | grep -E '(Core Count|Thread Count)'), and with system power management disabled.

First, the C++ source code was compiled with the gcc 5.4.0 compiler and the test was performed. Then, the same source code was recompiled with Intel® C++ Compiler XE 17.0.4 and Intel® MKL 2017 (-mkl=parallel) and a new test was performed. Note that many things happen within the operating system that are invisible to the application programmer and affect the cycle count, so measurement variations are expected. Hence, each test ran 300 times in a loop, and any result that was much higher than the others was discarded.
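
The exact outlier rule is not specified here, so the short sketch below is a hypothetical post-processing step written only for illustration: it discards runs whose cycle count is far above the median before averaging the remaining samples.

import statistics

def robust_average(cycle_counts, cutoff=1.5):
    # Discard runs whose RDTSC cycle count is far above the median, then average.
    median = statistics.median(cycle_counts)
    kept = [c for c in cycle_counts if c <= cutoff * median]
    return sum(kept) / len(kept)

# Example with made-up numbers: 298 "normal" runs plus two outliers.
samples = [1_000_000 + (i % 7) * 1_000 for i in range(298)] + [5_000_000, 4_800_000]
print("robust average: %.0f cycles" % robust_average(samples))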

Figure 1 shows the average clock cycles spent to build the Complement Naive Bayes classification model for the proposed theme. The model training uses statistical and math routines. The combination of Intel® C++ Compiler XE and Intel® MKL demands fewer clock cycles than the commonly used configuration for compiling C++ programs, so the tuning platform did a much better job. Notice that this evaluation compares source code that was not changed at all. Therefore, although a 1.66x speedup was obtained, higher values are expected once parallelism and specialized methods are explored by developers.


Figure 1: Test of machine learning approach using Complement Naive Bayes classifier.

Figure 2 shows the average clock cycles spent to produce the deductions using the Resolution Principle as the core engine of an inference machine. It uses several math routines and a lot of singular value decomposition to compute the first-order predicates. Here, the Intel® C++ Compiler XE and Intel® MKL (-mkl=parallel) combination outperformed the traditional compiling configuration, so it also beat the ordinary development environment. The speedup obtained was 2.95x, even though parallelism was not explored and no specialized methods were called.


Figure 2: Test of knowledge reasoning approach using resolution principle to perform inference.

The former test shows a machine learning method being enhanced by a tuning environment. That result by itself is not surprising, since it was already expected. Its relevance lies in serving as a reference for the latter test, in which the same environment was used. The inference machine, under the same conditions, also obtained a good speedup. This is evidence that applications based on this approach, such as expert systems, deduction machines, and theorem provers, can also be enhanced by Intel® architecture.

Conclusion

This article presented a performance test of a tuning platform composed of an Intel® processor, Intel® C++ Compiler XE, and Intel® MKL applied to typical AI problems. The two existing approaches to artificial intelligence were probed. Machine learning was represented by an automatic classification method, and knowledge reasoning was characterized by a computational inference method. The results suggest that employing such a tuning platform can accelerate these AI computations compared to the traditional software development environment. These approaches are necessary to supply intelligent behavior to machines. The libraries and the processor helped improve the performance of those functions by taking advantage of special features in Intel® products, speeding up execution. Note that it was not necessary to modify the source code to take advantage of these features.

AI applications can run faster and consume less power when paired with processors designed to handle the set of mathematical operations these systems require. Intel® architecture provides specialized instruction sets in its processors, with fast bus connections to parallel onboard computing cores and computationally cheaper access to memory. The environment composed of an Intel® processor, Intel® C++ Compiler XE, and Intel® MKL empowers developers to construct tomorrow's intelligent machines.

References

1. Merritt, Dennis. Building Expert Systems in Prolog, Springer-Verlag, 1989.

2. Russell, Stuart; Norvig, Peter. Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, Pearson Education Inc., 2nd edition, 2003.

3. Rennie, Jason D.; Shih, Lawrence; Teevan, Jaime; Karger, David R. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In: International Conference on Machine Learning, 616-623, 2003.

4. Mello, Flávio L.; Xexéo, José A. M. Cryptographic Algorithm Identification Using Machine Learning and Massive Processing. IEEE Transactions Latin America, v.14, p.4585 - 4590, 2016. doi: 10.1109/TLA.2016.7795833

5. Metadox Group, Mentor, 2017. http://www.metadox.com.br/mentor.html  Accessed on June 12th, 2017.

6. Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 2B: Instruction Set Reference, M-U, Order Number: 253667-060US, September, 2016. http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-2b-manual.pdf  Accessed on May 30th, 2017.

How to find the host ID for floating licenses


The floating and named-user licenses for the Intel® Parallel Studio XE and Intel® System Studio products require that you provide host name and host ID information for the host computer that you install the associated license file on. To enable you to obtain the required license file, these unique values must be available when you register your product. Refer to the information below for help identifying the host name and host ID (i.e. Physical Address) for supported platforms.

Before registering your product and generating the license file, you should be familiar with the different license types and how they are used with the Intel® Software Development Products. License types supported are:

  • Floating (counted)
  • Named-user (uncounted)

Only counted licenses require use of the Intel® Software License Manager software on the host computer.

In this context the host computer is known as the “license server”. The “host ID” in this context, depending on terminology used for your host operating system, is a 12-character Physical (Ethernet adapter) Address or hardware (Ethernet) address.

The host name and host ID (i.e. Physical Address) are system-level identifiers on supported platforms used to generate the license file with specific host information for use only on the specified floating license server.

When entering the physical or hardware address as prompted by the Intel® Registration Center (IRC), enter the 12-character value only and exclude all hyphens ("-") and colons (":"). For example, a host ID value of 30-8D-99-12-E4-87 should be entered as: 308D9912E487
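
As a convenience, the small helper below (a hypothetical illustration, not part of any Intel tool) strips the separators from an address written in either notation:

def to_host_id(physical_address):
    # Remove the hyphen/colon separators so only the 12 characters remain.
    return physical_address.replace('-', '').replace(':', '').strip()

print(to_host_id('30-8D-99-12-E4-87'))   # 308D9912E487
print(to_host_id('00:1E:67:34:EF:18'))   # 001E6734EF18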

This article pertains specifically to floating licenses. Please refer to How to find the Host ID for the Named-user license for information about the named-user license.

Refer to the Intel® Software License Manager FAQ for additional details including information about downloading the software license server software.

Identifying the hostname and host ID

Microsoft Windows*
---------------------------

1. Launch a Command Prompt.
   (Tip: Multiple methods exist for starting a Command Prompt; a few include:
         Windows 7*:  Open the Start Menu and go to All Programs -> Accessories.
                      Locate and use the Command Prompt shortcut.
         Windows 8*:  Open the Start screen. Click or tap on All apps and scroll right to locate the
                      Windows System folder. Locate and use the Command Prompt shortcut.
         Windows 10*: Open the Start Menu and go to All apps -> Windows System.
                      Locate and use the Command Prompt shortcut.)

On all systems you may use the hostname/getmac commands as demonstrated below:

2. Use hostname at the command prompt to display the host name.
3. Use getmac /v at the command prompt to display the host ID.

In the resulting output below, the hostname is my-computer and the host ID is 30-8D-99-12-E4-87
(i.e. the value corresponding to the Physical Address for the Ethernet Network Adapter)

C:\> hostname
my-computer

C:\> getmac /v

Connection Name Network Adapter Physical Address    Transport Name
=============== =============== =================== ========================
Ethernet        Intel(R) Ethern 30-8D-99-12-E4-87   \Device\Tcpip_{1B304A28-
Wi-Fi           Intel(R) Dual B 34-02-86-7E-16-61   Media disconnected


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use getmac /v to identify the host ID (i.e. Physical Address) associated with the Ethernet Network Adapter.

In the resulting output below, the hostname is my-computer and the host ID is 308d9912e487
(i.e. the value corresponding to the Physical Address for the Ethernet Network Adapter)

C:\> cd "C:\Program Files (x86)\Common Files\Intel\LicenseServer\"
C:\> .\lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-computer"

C:\> .\lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""3402867e1661 308d9912e487""
Only use ONE from the list of hostids.


Linux*
--------

On all systems you may use the hostname/ifconfig commands as demonstrated below:

1. Use the command hostname to display the host name.
2. Use the command /sbin/ifconfig eth0 to display the HWaddr (i.e. hardware address) for the Ethernet adapter. On some systems it may be necessary to use: /sbin/ifconfig | grep eth

In the (partial) resulting output below, the hostname is my-othercomputer and the host ID is 00:1E:67:34:EF:18
(i.e. the value corresponding to the hardware (Ethernet) address)

$ hostname
my-othercomputer

$ /sbin/ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 00:1E:67:34:EF:18
          inet addr:10.25.234.110  Bcast:10.25.234.255  Mask:255.255.255.0


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use the ifconfig command to identify the HWaddr (i.e. hardware address) for the Ethernet adapter.

In the resulting output below, the hostname is my-othercomputer and the host ID is 001e6734ef18
(i.e. the value corresponding to the hardware (Ethernet) address)

$  lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-othercomputer"

$  lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""001e6734ef18 001e6734ef19""
Only use ONE from the list of hostids.


OS X*
-------

On all systems you may use the hostname/ifconfig commands as demonstrated below:

1. Use the command hostname  to display the host name.
2. Run the command /sbin/ifconfig en0 to display the ether (i.e. hardware address) for the Ethernet adapter.

In the (partial) resulting output below, the hostname is my-macmini and the host ID is 40:6c:8f:1f:b8:57
(i.e. the value corresponding to the hardware (Ethernet) address)

$ hostname
my-macmini

$ /sbin/ifconfig en0
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=10b<RXCSUM,TXCSUM,VLAN_HWTAGGING,AV>
        ether 40:6c:8f:1f:b8:57


On a system where the Intel® Software License Manager software is installed, you may elect to use lmhostid to obtain the hostname and host ID information for the system. For systems that report multiple host IDs, it may be necessary to use the ifconfig command to identify the ether (i.e. hardware address) for the Ethernet adapter.

In the resulting output below, the hostname is my-macmini and the host ID is 406c8f1fb857
(i.e. the value corresponding to the hardware (Ethernet) address)

$  lmhostid -hostname
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is "HOSTNAME= my-macmini"

$  lmhostid
lmhostid - Copyright (c) 1989-2017 Flexera Software LLC. All Rights Reserved.
The FlexNet host ID of this machine is ""7073cbc3edd9 406c8f1fb857""
Only use ONE from the list of hostids.


Refer to the Software EULA for additional details on the Floating license.

For CoFluent products: Please refer to product documentation for instructions on how to find your composite host ID for node-locked and floating licenses.

Object Classification Using CNN Across Intel® Architecture


Abstract

In this work, we present the computational performance and classification accuracy of object classification using the VGG16 network on Intel® Xeon® processors and Intel® Xeon Phi™ processors. The results can be used as criteria for selecting the number of training iterations in different experimental setups on these processors, including multinode architectures. With the objective of evaluating accuracy for real-time logo detection from video, the results are reported on a logo image dataset suitable for measuring logo classification accuracy.

1. Introduction

Deep learning (DL), which refers to a class of neural network models with deep architectures, forms an important and expressive family of machine learning (ML) models. Modern deep learning models, such as convolutional neural networks (CNNs), have achieved notable successes in a wide spectrum of machine learning tasks including speech recognition1, visual recognition2, and language understanding3. The explosive success and rapid adoption of CNNs by the research community is largely attributable to high-performance computing hardware such as the Intel® Xeon® processor, Intel® Xeon Phi™ processor, and graphics processing units (GPUs), as well as a wide range of easy-to-use open source frameworks including Caffe*, TensorFlow*, the cognitive toolkit (CNTK*), Torch*, and so on.

2. Setting up a Multinode Cluster

The Intel® Distribution for Caffe* is designed for both single node and multinode operation. There are two general approaches to parallelization (data parallelism and model parallelism), and Intel uses data parallelism.

Data parallelism is when you use the same model for every thread, but feed it with different data. It means that the total batch size in a single iteration is equal to the sum of individual batch sizes of all nodes. For example, a network is trained on three nodes. All of them have a batch size of 64. The (total) batch size in a single iteration of the stochastic gradient descent algorithm is 3*64=192. Model parallelism means using the same data across all nodes, but each node is responsible for estimating different parameters. The nodes then exchange their estimates with each other to come up with the right estimate for all parameters.
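
The toy sketch below (illustrative only; it is not Intel MLSL or Caffe code) shows why data parallelism works: averaging the gradients that three "nodes" compute on equally sized local batches of 64 samples gives the same result as one gradient computed over the combined batch of 192 samples.

import numpy as np

rng = np.random.default_rng(0)
w_true, w = 3.0, 0.0          # target weight and current model weight
nodes, local_batch = 3, 64

def gradient(w, x, y):
    # d/dw of the mean squared error 0.5 * mean((w*x - y)^2)
    return np.mean((w * x - y) * x)

# Each node draws its own local batch (same model, different data).
batches = []
for _ in range(nodes):
    x = rng.normal(size=local_batch)
    y = w_true * x + rng.normal(scale=0.1, size=local_batch)
    batches.append((x, y))

avg_grad = np.mean([gradient(w, x, y) for x, y in batches])   # what an all-reduce produces

# Reference: one gradient over the combined batch of 3 * 64 = 192 samples.
x_all = np.concatenate([x for x, _ in batches])
y_all = np.concatenate([y for _, y in batches])
print(avg_grad, gradient(w, x_all, y_all))   # the two values match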

To set up a multinode cluster, download and install the Intel® Machine Learning Scaling Library (Intel® MLSL) 2017 package from https://github.com/01org/MLSL/releases/tag/v2017-Preview, source mlslvars.sh, and then rebuild Caffe with USE_MLSL := 1 in Makefile.config. When the build completes successfully, start the Caffe training using the message passing interface (MPI) command as follows:

mpirun -n 3 -ppn 1 -machinefile ~/mpd.hosts ./build/tools/caffe train \
  --solver=models/bvlc_googlenet/solver_client.prototxt --engine=MKL2017

where n defines the number of nodes and ppn represents the number of processes per node. The nodes will be configured in the ~/mpd.hosts with their respective IP addresses as follows:

192.161.32.1
192.161.32.2
192.161.32.3
192.161.32.4

Ansible* scripts are used to copy the binaries or files across the nodes.

Clustering communication employs Intel® Omni-Path Architecture (Intel® OPA)4.

Validation of the cluster setup is performed by running the command ‘opainfo’ on all machines; the port state must always be ‘Active’.

Figure 1: Intel® Omni-Path Architecture (Intel® OPA) cluster information.

3. Experiments

The current experiment focuses on measuring the performance of the VGG16 network on the Flickr* logo dataset, which has 32 different classes of logo. Intel® Optimized Technical Preview for Multinode Caffe* is used for experiments on the single node and with Intel® MLSL enabled for multinode experiments. The input images were all converted to lightning memory-mapped database (LMDB) format for better efficiency. All of the experiments are set to run for 10K iterations, and the observations are noted below. We conducted our experiments in the following machine configurations. Due to lack of time we had to limit our experiments to a single execution per architecture.

Intel Xeon Phi processor

  • Model Name: Intel® Xeon Phi™ processor 7250 @1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS* 7.3

Intel Xeon processor

  • Model Name: Intel® Xeon® processor E5-2699 v4 @ 2.20GHz
  • Core(s) Per Socket: 22
  • RAM (free): 123 GB
  • OS: Ubuntu* 16.1

The multinode cluster setup is configured as follows:

KNL 01 (Master)

  • Model Name: Intel® Xeon Phi™ processor 7250 @1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

KNL 03 (Slave node)

  • Model Name: Intel Xeon Phi processor 7250 @1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

KNL 04 (Slave node)

  • Model Name: Intel Xeon Phi processor 7250 @1.40GHz
  • Core(s) Per Socket: 68
  • RAM (free): 70 GB
  • OS: CentOS 7.3

3.1. Training Data

The training and test image datasets were obtained from Datasets: FlickrLogos32 / FlickrLogos47, which is maintained by the Multimedia Computing and Computer Vision Lab, Augsburg University. There are 32 logo classes or brands in the dataset, which are downloaded from Flickr, as illustrated in the following figure:

Figure 2: Flickr logo image dataset with 32 classes.

The 32 classes are as follows: Adidas*, Aldi*, Apple*, Becks*, BMW*, Carlsberg*, Chimay*, Coca-Cola*, Corona*, DHL*, Erdinger*, Esso*, Fedex*, Ferrari*, Ford*, Foster's*, Google*, Guinness*, Heineken*, HP*, Milka*, Nvidia*, Paulaner*, Pepsi*, Ritter Sport*, Shell, Singha*, Starbucks*, Stella Artois*, Texaco*, Tsingtao*, and UPS*.

The training set consists of 8240 images; 6000 images are no_logo images, and 70 images per class for 32 classes comprise the remaining 2240 images, thereby making the dataset highly skewed. Also, the training and test dataset is split in a ratio of 90:10 from the full 8240 samples.

3.2. Model Building and Network Topology

The VGG16 network topology was used for our experiments. VGG16 has 16 weight layers (13 convolutional and 3 fully connected (FC) layers) and uses very small (3 x 3) convolution filters. It showed a significant improvement in network performance and detection accuracy over prior art (winning the first and second prizes in the ImageNet* challenge in 2014) and has since been widely used as a reference topology.
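
As a quick illustration (this is not the prototxt used in the experiments), the standard VGG16 configuration can be written down as a layer list and the weight-layer count checked programmatically; the final FC size of 33 outputs, covering 32 logo classes plus no_logo, is an assumption for this dataset.

# 'M' marks a 2x2 max-pooling layer; integers are output channels of 3x3 convolutions.
VGG16_CONV_CFG = [64, 64, 'M',
                  128, 128, 'M',
                  256, 256, 256, 'M',
                  512, 512, 512, 'M',
                  512, 512, 512, 'M']
FC_LAYERS = [4096, 4096, 33]   # assumed output size: 32 logo classes + no_logo

conv_layers = [c for c in VGG16_CONV_CFG if c != 'M']
print("convolutional layers:", len(conv_layers))                   # 13
print("fully connected layers:", len(FC_LAYERS))                   # 3
print("total weight layers:", len(conv_layers) + len(FC_LAYERS))   # 16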

4. Results

4.1 Observations on Intel® Xeon® Processor

The Intel Xeon processors are running under the following software configurations:

Caffe Version: 1.0.0-rc3

MKL Version: _2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 5.4.0

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.

Figure 3: Training loss variation with iterations (batch size 32, LR policy as POLY).

Figure 4: Accuracy variation with iterations (batch size 32, LR policy as POLY).

The following observations were noted while training for 10K iterations with a batch size of 64 and learning rate policy as POLY.

Figure 5: Training loss variation with iterations (batch size 64, LR policy as POLY).

Figure 6: Accuracy variation with iterations (batch size 64, LR policy as POLY).

The real-time training and test observations using different batch sizes on the Intel Xeon processor are shown in Table 1. Table 2 shows how accuracy varies with batch size.

Table 1: Real-time training results for Intel® Xeon® processor.

Batch Size  LR Policy  Start Time  End Time  Duration  Loss     Accuracy at Top 1  Accuracy at Top 5
32          POLY       18:20       23:46     5:26      0.00016  0.62               0.84
64          POLY       16:20       9:57      17:37     0.00003  0.64               0.86
64          STEP       16:41       6:37      13:56     0.0005   0.65               0.85

Table 2: Batch size versus accuracy details on the Intel® Xeon® processor.

            32 Batch Size                               64 Batch Size
Iterations  Accuracy@Top1  Accuracy@Top5    Iterations  Accuracy@Top1  Accuracy@Top5
0           0              0                0           0              0
1000        0.165937       0.49125          1000        0.30375        0.6375
2000        0.374375       0.754062         2000        0.419844       0.785156
3000        0.446875       0.74125          3000        0.513906       0.803437
4000        0.50375        0.78625          4000        0.522812       0.838437
5000        0.484062       0.783437         5000        0.580781       0.848594
6000        0.549062       0.819062         6000        0.584531       0.843594
7000        0.553125       0.826563         7000        0.632969       0.847969
8000        0.615625       0.807187         8000        0.64375        0.84875
9000        0.607813       0.83             9000        0.624844       0.856406
10000       0.614567       0.83616          10000       0.641234       0.859877

4.2 Observations on Intel® Xeon Phi™ Processor

The Intel Xeon Phi processors are running under the following software configurations:

Caffe Version: 1.0.0-rc3

MKL Version: _2017.0.2.20170110

MKL_DNN: SUPPORTED

GCC Version: 6.2

The following observations were noted while training for 10K iterations with a batch size of 32 and learning rate policy as POLY.

Figure 7: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).

Figure 8: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 32, LR policy as POLY).

Figure 9: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).

Figure 10: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 64, LR policy as POLY).

Figure 11: Training loss variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).

Figure 12: Accuracy variation with iterations on Intel® Xeon Phi™ processor (batch size 128, LR policy as POLY).

Table 3: Batch size versus accuracy details for the Intel® Xeon Phi™ processor.

            32 Batch Size                               64 Batch Size
Iterations  Accuracy@Top1  Accuracy@Top5    Iterations  Accuracy@Top1  Accuracy@Top5
0           0              0                0           0              0
1000        0.138125       0.427812         1000        0.200469       0.54875
2000        0.24           0.589688         2000        0.330781       0.678594
3000        0.295625       0.621875         3000        0.362188       0.68375
4000        0.295312       0.660312         4000        0.40625        0.708906
5000        0.337813       0.67             5000        0.437813       0.74625
6000        0.374687       0.71             6000        0.40625        0.723594
7000        0.335          0.6875           7000        0.432187       0.749219
8000        0.38375        0.692187         8000        0.455312       0.745781
9000        0.39625        0.70875          9000        0.455469       0.722969
10000       0.40131        0.713456         10000       0.469871       0.748901

            128 Batch Size
Iterations  Accuracy@Top1  Accuracy@Top5
0           0              0
1000        0.272266       0.665156
2000        0.397422       0.696328
3000        0.432813       0.750234
4000        0.46           0.723437
5000        0.446328       0.776641
6000        0.432969       0.74125
7000        0.473203       0.75
8000        0.419688       0.700938
9000        0.455312       0.763281
10000       0.478901       0.798771

Table 4: Real-time training results for the Intel® Xeon Phi™ processor.

Batch Size  LR Policy  Start Time  End Time  Duration  Loss     Accuracy at Top 1  Accuracy at Top 5
32          POLY       17:53       20:36     2:43      0.005    0.4                0.71
64          POLY       10:59       16:07     6:08      0.00007  0.47               0.75
128         POLY       18:00       4:19      10:19     0.00075  0.48               0.8

5. Conclusion and Future Work

We observed from Table 1 that the batch size of 32 was the optimal configuration in terms of speed and accuracy. Though there is a slight increase in accuracy with batch size 64, the gain seems quite low compared to the increase in training time. It was also observed that the learning rate policies have a significant impact on training time and less impact on accuracy. Perhaps the recalculation of the learning rate on every iteration slowed down this training. There is a minor gain in Top 5 accuracy with the LR policy set to POLY, and this might be due to the optimal calculation of the learning rate. The gain might vary significantly on a larger dataset.

We observed from Table 3 that the Intel Xeon Phi processor's efficiency increases as the batch size is increased, and the loss also decreases faster as the batch size is increased. Table 4 shows that higher batch sizes also run faster on Intel Xeon Phi processors.

The observations in the above tables indicate that training on Intel Xeon Phi machines is faster than the same training on Intel Xeon machines, thanks to the bootable host processor that delivers massive parallelism and vectorization. However, the accuracy reached by Intel Xeon Phi processors is much lower than that reached by Intel Xeon processors for the same number of iterations, so a few more iterations must be run on Intel Xeon Phi processors to reach the same accuracy levels.

List of Abbreviations

Abbreviation  Expanded Form
MLSL          machine learning scaling library
CNN           convolutional neural network
GPU           graphics processing unit
ML            machine learning
CNTK          cognitive toolkit
DL            deep learning
LMDB          lightning memory-mapped database

References and Links

1. Deng, L., Li, J., Huang, J.-T., Yao, K., Yu, D., Seide, F., Seltzer, M. L., Zweig, G., He, X., Williams, J., Gong, Y., and Acero, A. Recent Advances in Deep Learning for Speech Research at Microsoft. In ICASSP (2013).

2. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS (2012).

3. Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient Estimation of Word Representations in Vector Space. In ICLRW (2013).

4. Cherlopalle, Deepthi; Weage, Joshua. Dell HPC Omni-Path Fabric: Supported Architecture and Application Study, June 2016.

More details on Intel Xeon Phi processor: Intel Xeon Phi Processor

Intel® Distribution for Caffe*: Manage Deep Learning Networks with Intel Distribution for Caffe

Multinode Guide: Guide to multi-node training with Intel® Distribution of Caffe*

Intel Omni Path Architecture Cluster Setup: Dell HPC Omni-Path Fabric: Supported Architecture and Application Study

Intel MLSL Package: Intel® MLSL 2017 Beta https://github.com/01org/MLSL/releases/tag/v2017-Beta

Building and Probing Prolog* with Intel® Architecture


Introduction

A lot of buzz on the Internet suggests that machine learning and Artificial Intelligence (AI) are basically the same thing, but this is a misunderstanding. Both machine learning and knowledge reasoning share the same concern: the construction of intelligent software. However, while machine learning is an approach to AI based on algorithms whose performance improves as they are exposed to more data over time, knowledge reasoning is a sibling approach based on symbolic logic.

Knowledge Reasoning’s strategy is usually developed by using functional and logic based programming languages such as Lisp*, Prolog*, and ML* due to their ability to perform symbolic manipulation. This kind of manipulation is often associated with expert systems, where high level rules are often provided by humans and used to simulate knowledge, avoiding low-level language details. This focus is called Mind Centered. Commonly, some kind of (backward or forward) logical inference is needed.

Machine learning, in turn, is associated with low-level mathematical representations of systems and a set of training data that lead the system toward performance improvement. Since there is no high-level modeling, this approach is called Brain Centered. Any language that facilitates writing vector algebra and numeric calculus over an imperative paradigm works just fine. For instance, several machine learning systems are written in Python* simply because the mathematical support is available as libraries for that programming language.

This article aims to explore what happens when Intel solutions support the functional and logic programming languages that are regularly used for AI. Despite the success of machine learning systems over the last two decades, the place for traditional AI has neither disappeared nor diminished, especially in systems where it is necessary to explain why a computer program behaves the way it does. Hence, it is not reasonable to believe that the next generations of learning systems will be developed without high-level descriptions, and thus it is expected that some problems will demand symbolic solutions. Prolog and similar programming languages are valuable tools for solving such problems.

As will be detailed below, this article proposes recompiling a Prolog interpreter with the Intel® C++ Compiler and libraries in order to evaluate their contribution to logic-based AI. The two main products used are Intel® Parallel Studio XE Cluster Edition and the SWI-Prolog interpreter. An experiment with a classical AI problem is also presented.

Building Prolog for Intel® Architecture

1. The following description uses a system equipped with an Intel® Core™ i7-4500U processor @ 1.8 GHz, 64-bit, Ubuntu 16.04 LTS operating system, 8 GB RAM, and Hyper-Threading turned on with 2 threads per core (you can check this by typing sudo dmidecode -t processor | grep -E '(Core Count|Thread Count)'). Different operating systems may require minor changes.

2. Preparing the environment.

Optimizing performance on hardware is an iterative process. Figure 1 shows a flow chart describing how the various Intel tools help you in several stages of such optimization task.


Figure 1: Optimizing performance flowchart and libraries. Extracted from Intel® Parallel Studio documentation1.

The most convenient way to install the Intel tools is to download and install Intel® Parallel Studio XE 2017. After extracting the .tgz file, you will obtain a folder called parallel_studio_xe_2017update4_cluster_edition_online (or a similar version). Open the terminal and then choose the graphical installation:

<user>@<host>:~% cd parallel_studio_xe_2017update4_cluster_edition_online
<user>@<host>:~/parallel_studio_xe_2017update4_cluster_edition_online% ./install_GUI.sh

Although you may prefer to perform a full install, this article will choose a custom installation with components that are frequently useful for many developers. It is recommended that these components also be installed to allow further use of such performance libraries in subsequent projects.

  • Intel® Trace Analyzer and Collector
  • Intel® Advisor
  • Intel® C++ Compiler
  • Intel® Math Kernel Library (Intel® MKL) for C/C++
  • Intel® Threading Building Blocks (Intel® TBB)
  • Intel® Data Analytics Acceleration Library (Intel® DAAL)
  • Intel® MPI Library

The installation is very straightforward and does not require much comment. After finishing this task, test the availability of the Intel® C++ Compiler by typing in your terminal:

<user>@<host>:~% cd ..
<user>@<host>:~% icc --version
icc (ICC) 17.0.4 20170411

If the icc command was not found, it is because the environment variables for running the compiler environment were not set. Set them by running a predefined script with an argument that specifies the target architecture:

<user>@<host>:~% source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh -arch intel64 -platform linux

If you wish, you may save disk space by doing:

<user>@<host>:~% rm -r parallel_studio_xe_2017update4_cluster_edition_online

3. Building Prolog.

This article uses the SWI-Prolog interpreter2, which is covered by the Simplified BSD license. SWI-Prolog offers a comprehensive free Prolog environment. It is widely used in research and education as well as commercial applications. You must download the sources in .tar.gz format. At the time this article was written, the available version is 7.4.2. First, decompress the download file:

<user>@<host>:~% tar zxvf swipl-<version>.tar.gz

Then, create a folder where the Prolog interpreter will be installed:

<user>@<host>:~% mkdir swipl_intel

After that, get ready to edit the building variables:

<user>@<host>:~% cd swipl-<version>
<user>@<host>:~/swipl-<version>% cp -p build.templ build
<user>@<host>:~/swipl-<version>% <edit> build

At the build file, look for the PREFIX variable, which indicates the place where SWI-Prolog will be installed. You must set it to:

PREFIX=$HOME/swipl_intel

Then, it is necessary to set some compilation variables. The CC variable must be changed to indicate that the Intel® C++ Compiler will be used instead of other compilers. COFLAGS enables optimizations for speed. Compiler vectorization is enabled at -O2. You may choose a higher level (-O3), but the suggested flag is the generally recommended optimization level. With this option, the compiler performs some basic loop optimizations, inlining of intrinsics, intra-file interprocedural optimization, and the most common compiler optimization technologies. The -mkl=parallel option allows access to a set of math functions that are optimized and threaded to exploit all the features of the latest Intel® Core™ processors. It must be used with a certain Intel® MKL threading layer, depending on the threading option provided. In this article, Intel® TBB is that option, and it is selected with the -tbb flag. Finally, CMFLAGS indicates that the compilation will create a 64-bit executable.

export CC="icc"
export COFLAGS="-O2 -mkl=parallel -tbb"
export CMFLAGS="-m64"

Save your build file and close it.

Note that when this article was written, SWI-Prolog was not Message Passing Interface (MPI) ready3. Besides, when checking its source code, no OpenMP* macros (OMP) were found, and thus it is possible that SWI-Prolog is not OpenMP ready either.

If you already have an SWI-Prolog instance installed on your computer, you might get confused about which interpreter version was compiled with the Intel libraries and which was not. Therefore, it is useful to indicate that you are using the Intel version by displaying it in the welcome message when you start the SWI-Prolog interpreter. The following instruction provides a customized welcome message when running the interpreter:

<user>@<host>:~/swipl-<version>% cd boot
<user>@<host>:~/swipl-<version>/boot% <edit> messages.pl

Search for:

prolog_message(welcome) -->
    [ 'Welcome to SWI-Prolog (' ],
    prolog_message(threads),
    prolog_message(address_bits),
    ['version ' ],
    prolog_message(version),
    [ ')', nl ],
    prolog_message(copyright),
    [ nl ],
    prolog_message(user_versions),
    [ nl ],
    prolog_message(documentaton),
    [ nl, nl ].

and add @ Intel® architecture by changing it to:

prolog_message(welcome) -->
    [ 'Welcome to SWI-Prolog (' ],
    prolog_message(threads),
    prolog_message(address_bits),
    ['version ' ],
    prolog_message(version),
    [ ') @ Intel® architecture', nl ],
    prolog_message(copyright),
    [ nl ],
    prolog_message(user_versions),
    [ nl ],
    prolog_message(documentaton),
    [ nl, nl ].

Save your messages.pl file and close it. Start building.

<user>@<host>:~/swipl-<version>/boot% cd ..
<user>@<host>:~/swipl-<version>% ./build

The compilation performs several checks and takes some time. Don’t worry; it is very verbose. Finally, you will get something like this:

make[1]: Leaving directory '~/swipl-<version>/src'
Warning: Found 9 issues.
No errors during package build

Now you may run SWI-Prolog interpreter by typing:

<user>@<host>:~/swipl-<version>% cd ~/swipl_intel/lib/swipl-7.4.2/bin/x86_64-linux
<user>@<host>:~/swipl_intel/lib/swipl-<version>/bin/x86_64-linux% ./swipl

Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2) @ Intel® architecture
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

1 ?-

To exit the interpreter, type halt. (including the period). Now you are ready to use Prolog, powered by Intel® architecture.

You may also save disk space by doing:

<user>@<host>:~/swipl_intel/lib/swipl-<version>/bin/x86_64-linux% cd ~
<user>@<host>:~% rm -r swipl-<version>

Probing Experiment

At this point, there is an Intel-compiled version of SWI-Prolog on your computer. Since this experiment intends to compare that combination with another environment, an SWI-Prolog interpreter built with a different compiler, such as gcc 5.4.0, is needed. The procedure for building this alternative version is quite similar to the one described in this article.

The Tower of Hanoi puzzle4 is a classical AI problem, and it was used to probe the Prolog interpreters. The following code is a compact recursive implementation:

move(1,X,Y,_) :-
     write('Move top disk from '),
     write(X),
     write(' to '),
     write(Y),
     nl.

move(N,X,Y,Z) :-
    N>1,
    M is N-1,
    move(M,X,Z,Y),
    move(1,X,Y,_),
    move(M,Z,Y,X).

It moves the disks between pegs and logs their movements. When loading this implementation and running a 3-disk problem instance (move(3,left,right,center)), the following output is obtained after 48 inferences:

Move top disk from left to right
Move top disk from left to center
Move top disk from right to center
Move top disk from left to right
Move top disk from center to left
Move top disk from center to right
Move top disk from left to right
true .

This test intends to compare the performance of the Intel SWI-Prolog version against the gcc-compiled version. Note that printing to the terminal is a slow operation, so it is not recommended in benchmarking tests, since it masks the results. Therefore, the program was changed to provide a better probe, with a dummy sum of two integers:

move(1,X,Y,_) :-
     S is 1 + 2.

move(N,X,Y,Z) :-
    N>1,
    M is N-1,
    move(M,X,Z,Y),
    move(1,X,Y,_),
    move(M,Z,Y,X).

Recall that the SWI-Prolog source code did not seem to be OpenMP ready. However, most loops can be threaded by inserting the macro #pragma omp parallel for right before the loop. Thus, time-consuming loops in the SWI-Prolog proof procedure were located and the OpenMP macro was attached to those loops. The source code was compiled with the -openmp option, a third build of the Prolog interpreter was created, and 8 threads were used. If the reader wishes to build this parallelized version of Prolog, the following must be done.

At ~/swipl-<version>/src/pl-main.c, add #include <omp.h> to the header section of pl-main.c; if you choose, you can add omp_set_num_threads(8) inside the main method to specify 8 OpenMP threads. Recall that this experiment environment provides 4 cores and hyper-threading turned on with 2 threads per core, thus 8 threads are used; otherwise, leave it out and OpenMP will automatically allocate the maximum number of threads it can.

int main(int argc, char **argv){
  omp_set_num_threads(8);
  #if O_CTRLC
    main_thread_id = GetCurrentThreadId();
    SetConsoleCtrlHandler((PHANDLER_ROUTINE)consoleHandlerRoutine, TRUE);
  #endif

  #if O_ANSI_COLORS
    PL_w32_wrap_ansi_console();	/* decode ANSI color sequences (ESC[...m) */
  #endif

    if ( !PL_initialise(argc, argv) )
      PL_halt(1);

    for(;;)
    { int status = PL_toplevel() ? 0 : 1;

      PL_halt(status);
    }

    return 0;
  }

At ~/swipl-<version>/src/pl-prof.c, add #include <omp.h> to the header section of pl-prof.c; add #pragma omp parallel for right before the for-loop in the methods activateProfiler, add_parent_ref, profResumeParent, freeProfileNode, and freeProfileData.

int activateProfiler(prof_status active ARG_LD){
  .......... < non-relevant source code omitted > ..........

  LD->profile.active = active;
  #pragma omp parallel for
  for(i=0; i<MAX_PROF_TYPES; i++)
  { if ( types[i] && types[i]->activate )
      (*types[i]->activate)(active);
  }
  .......... < non-relevant source code omitted > ..........

  return TRUE;
}



static void add_parent_ref(node_sum *sum,
	       call_node *self,
	       void *handle, PL_prof_type_t *type,
	       int cycle)
{ prof_ref *r;

  sum->calls += self->calls;
  sum->redos += self->redos;

  #pragma omp parallel for
  for(r=sum->callers; r; r=r->next)
  { if ( r->handle == handle && r->cycle == cycle )
    { r->calls += self->calls;
      r->redos += self->redos;
      r->ticks += self->ticks;
      r->sibling_ticks += self->sibling_ticks;

      return;
    }
  }

  r = allocHeapOrHalt(sizeof(*r));
  r->calls = self->calls;
  r->redos = self->redos;
  r->ticks = self->ticks;
  r->sibling_ticks = self->sibling_ticks;
  r->handle = handle;
  r->type = type;
  r->cycle = cycle;
  r->next = sum->callers;
  sum->callers = r;
}



void profResumeParent(struct call_node *node ARG_LD)
{ call_node *n;

  if ( node && node->magic != PROFNODE_MAGIC )
    return;

  LD->profile.accounting = TRUE;
  #pragma omp parallel for
  for(n=LD->profile.current; n && n != node; n=n->parent)
  { n->exits++;
  }
  LD->profile.accounting = FALSE;

  LD->profile.current = node;
}



static void freeProfileNode(call_node *node ARG_LD)
{ call_node *n, *next;

  assert(node->magic == PROFNODE_MAGIC);

  #pragma omp parallel for
  for(n=node->siblings; n; n=next)
  { next = n->next;

    freeProfileNode(n PASS_LD);
  }

  node->magic = 0;
  freeHeap(node, sizeof(*node));
  LD->profile.nodes--;
}



static void freeProfileData(void)
{ GET_LD
  call_node *n, *next;

  n = LD->profile.roots;
  LD->profile.roots = NULL;
  LD->profile.current = NULL;

  #pragma omp parallel for
  for(; n; n=next)
  { next = n->next;
    freeProfileNode(n PASS_LD);
  }

  assert(LD->profile.nodes == 0);
}

The test employs a 20-disk problem instance, which is accomplished after 3,145,724 inferences. The time was measured using the built-in Prolog predicate time. Each test ran 300 times in a loop, and any result much higher than the others was discarded. Figure 2 presents the CPU time consumed by all three configurations.


Figure 2: CPU time consumed by SWI-Prolog compiled with gcc, Intel tools, Intel tools+OpenMP.

Considering the gcc-compiled Prolog as the baseline, the speedup obtained with the Intel tools was 1.35x. This is a good result, since the source code was not changed at all, parallelism was not explored by the developer, and no specialized methods were called; that is, all of the work was delegated to the Intel® C++ Compiler and libraries. When the Intel implementation of OpenMP 4.0 was used, the speedup increased to 4.60x.

Conclusion

This article deliberately paid attention to logic-based AI. It shows that the benefits of using Intel development tools for AI problems are not restricted to machine learning. A common distribution of Prolog was compiled with the Intel® C++ Compiler, Intel® MKL, and the Intel implementation of OpenMP 4.0. A significant acceleration was obtained, even though the algorithm of the Prolog inference mechanism is not easily optimized. Therefore, any solution for a symbolic logic problem implemented in this Prolog interpreter will be powered by an enhanced engine.

References

1. Intel. Getting Started with Intel® Parallel Studio XE 2017 Cluster Edition for Linux*, Intel® Parallel Studio 2017 Documentation, 2017.

2. SWI-Prolog, 2017. http://www.swi-prolog.org/, access on June 18th, 2017.

3. Swiprolog - Summary and Version Information, High Performance Computing, Division of Information Technology, University of Maryland, 2017. http://hpcc.umd.edu/hpcc/help/software/swiprolog.html, access on June 20th, 2017.

4. A. Beck, M. N. Bleicher, D. W. Crowe, Excursions into Mathematics, A K Peters, 2000.

5. Russell, Stuart; Norvig, Peter. Artificial Intelligence: A Modern Approach, Prentice Hall Series in Artificial Intelligence, Pearson Education Inc., 2nd edition, 2003.

Machine learning on Intel® FPGAs


Introduction

Artificial intelligence (AI) originated in classical philosophy and has been loitering in computing circles for decades. Twenty years ago, AI surged in popularity, but interest waned as technology lagged. Today, technology is catching up, and AI’s resurgence far exceeds its past glimpses of popularity. This time, the compute, data sets, and technology can deliver, and Intel leads the AI pack in innovation.

Among Intel’s many technologies contributing to AI’s advancements, field-programmable gate arrays (FPGAs) provide unique and significant value propositions across the spectrum. Understanding the current and future capabilities of Intel® FPGAs requires a solid grasp on how AI is transforming industries in general.

AI Is Transforming Industries

Industries in all sectors benefit from AI. Three key factors contribute to today’s successful resurgence of AI applications:

  • Large data sets
  • Recent AI research
  • Hardware performance and capabilities

The combination of massive data collections, improved algorithms, and powerful processors enables today’s ongoing, rapid advancements in machine learning, deep learning, and artificial intelligence overall. AI applications now touch the entire data spectrum from data centers to edge devices (including cars, phones, cameras, home and work electronics, and more), and infiltrate every segment of technology, including:

  • Consumer devices
  • Enterprise efficiency systems
  • Healthcare, energy, retail, transportation, and others

Some of AI’s largest impacts are found in self-driving vehicles, financial analytics, surveillance, smart cities, and cyber security. Figure 1 illustrates AI’s sizable impact on just a few areas.


Figure 1. Examples of how AI is transforming industries.

To support AI’s growth today and well into the future, Intel provides a range of AI products in its AI ecosystem. Intel® FPGAs are a key component in this ecosystem.

Intel’s AI Ecosystem and Portfolio

As a technology leader, Intel offers a complete AI ecosystem that concentrates far beyond today’s AI—Intel is committed to fueling the AI revolution deep into the future. It’s a top priority for Intel, as demonstrated through in-house research, development, and key acquisitions. FPGAs play an important role in this commitment.

Intel’s comprehensive, flexible, and performance-optimized AI portfolio of products for machine and deep learning covers the entire spectrum from hardware platforms to end user applications, as shown in Figure 2, including:

  • Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN)
  • Compute Library for Deep Neural Networks
  • Deep Learning Accelerator Library for FPGAs
  • Frameworks such as Caffe* and TensorFlow*
  • Tools like the Deep Learning Deployment Toolkit from Intel


Figure 2. Intel’s AI Portfolio of products for machine and deep learning.

Overall, Intel provides a unified front end for a broad variety of backend hardware platforms, enabling users to develop a system with one device today and seamlessly switch to a newer, different hardware platform tomorrow. The comprehensive nature of Intel’s AI ecosystem and portfolio means Intel is uniquely situated to help developers at all levels access the full capacity of Intel hardware platforms, both current and future. This approach empowers hardware and software developers to take advantage of FPGAs’ capabilities with machine learning, leading to increased productivity and shorter design cycles.

The Intel® FPGA Effect

Intel® FPGAs offer unique value propositions, and they are now enabled for Intel’s AI ecosystem. Intel® FPGAs provide excellent system acceleration with deterministic low latency, power efficiency, and future proofing, as illustrated in Figure 3.


Figure 3. Intel® FPGAs offer unique value propositions for AI.

System Acceleration

Today, people are looking for ways to leverage CPU and GPU architectures to get more total operations out of them, which helps with compute performance. FPGAs, by contrast, are concerned with system performance. Intel® FPGAs accelerate and aid the compute and connectivity required to collect and process the massive quantities of information around us by controlling the data path. In addition to being used for compute offload, FPGAs can also receive data directly and process it inline without going through the host system. This frees the processor to manage other system events and provides higher real-time system performance.

Real time is key. AI often relies on real-time processing to draw instantaneous conclusions and respond accurately. Imagine a self-driving car waiting for feedback after another car brakes hard or a deer leaps from the bushes. Immediacy has been a challenge given the amount of data involved, and lag can mean the difference between responding to an event and missing it entirely.

FPGAs’ flexibility enables them to deliver deterministic low latency (the guaranteed upper limit on the amount of time between a message sent and received under all system conditions) and high bandwidth. This flexibility supports the creation of custom hardware for individual solutions in an optimal way. Regardless of the custom or standard data interface, topology, or precision requirement, an FPGA can implement the exact architecture defined, which allows for unique solutions and fixed data paths. This also equates to excellent power efficiency and future proofing.

Power Efficiency

FPGAs’ ability to create custom solutions means they can create power-efficient solutions. They enable the creation of solutions that address specific problems, in the way each problem needs to be solved, by removing individual bottlenecks in the computation, not by pushing solutions through fixed architectures.

Intel® FPGAs have over 8 TB/s of on-die memory bandwidth. Therefore, solutions tend to keep the data on the device, tightly coupled with the next compute. This minimizes the need to access external memory, which allows designs to run at significantly lower frequencies. These lower frequencies and efficient compute implementations result in highly power-efficient solutions. For example, FPGAs show up to an 80% power reduction when running AlexNet* (a convolutional neural network) compared to CPUs.

Future Proofing

In addition to system acceleration and power efficiency, Intel® FPGAs help future proof systems. With a technology as dynamic as machine learning, which is evolving and changing constantly, Intel® FPGAs provide flexibility unavailable in fixed devices. As precisions drop from 32-bit to 8-bit and even binary/ternary networks, an FPGA has the flexibility to support those changes instantly. As next-generation architectures and methodologies are developed, FPGAs will be there to implement them. By reprogramming an FPGA's image, its functionality can be changed completely. Dedicated ASICs can carry a higher total cost of ownership (TCO) in the long run, and with such a dynamic technology, the threshold to warrant building them keeps rising, especially if FPGAs can meet a system's needs.

Some markets demand longevity and high reliability from hardware, with systems being deployed for 5, 10, 15, or more years in harsh environments. For example, imagine putting smart cameras on the street or compute systems in automobiles and requiring the same 18-month refresh cycle that CPUs and GPUs expect. The FPGA's flexibility enables users to update the hardware capabilities of a system without requiring a hardware refresh. This results in longer lifespans for deployed products. FPGAs have a history of long production cycles, with devices being built for well over 15 to 20 years. They have been used in space, military, and extremely high-reliability environments for decades.

For these reasons and more, developers at all levels need to understand how Intel's AI ecosystem and portfolio employs Intel® FPGAs. This knowledge will enable developers to use Intel® FPGAs to accelerate and extend the life and efficiency of AI applications.

Increased Productivity and Shortened Design Cycles

Most developers know FPGAs are flexible and robust devices providing a wide variety of uses:

  • FPGAs can become any digital circuit as long as the unit has enough logic blocks to implement that circuit.
  • Their flexible platform enables custom system architectures that other devices simply cannot efficiently support.
  • FPGAs can perform inline data processing, such as machine learning, on a video camera or Ethernet stream, for example, and then pass the results to a storage device or to the processor for further processing. FPGAs can do this while simultaneously performing compute offload in parallel.

But not all developers know how to access Intel® FPGAs’ potential or that they can do so with shorter-than-ever design cycles (illustrated in Figure 4).


Figure 4. Intel’s AI ecosystem is now enabled for FPGA.

To help developers bring FPGAs to market running machine learning workloads, Intel has shortened the design time for developers by creating a set of API layers. Developers can interface with the API layers based on their level of expertise, as outlined in Figure 5.


Figure 5. Four Entry Points for Developers

Typical users can start at the SDK or framework level. More advanced users, who want to build their own software stack, can enter at the Software API layer. The Software API layer abstracts away the lower-level OpenCL™ runtime and is the same API the libraries use; customers who want to build their own software stack can also enter at the C++ Embedded API layer. Advanced platform developers who want to add more than machine learning to their FPGA—such as support for asynchronous parallel compute offload functions or modified source code—can enter at the OpenCL™ Host Runtime API level, or at the Intel Deep Learning Architecture Library level if they want to customize the machine learning library.

Several design entry methods are available for power users looking to modify source code and customize the topology by adding custom primitives. Developers can customize their solutions by using traditional RTL (Verilog or VHDL), which is common for FPGA developers, or the higher level compute languages, such as C/C++ or OpenCL™. By offering these various entry points for developers, Intel makes implementing FPGAs accessible for various skillsets in a timely manner.

Conclusion

Intel is uniquely positioned for AI development—Intel's AI ecosystem offers solutions for all aspects of AI by providing a unified front end for a variety of backend technologies, from hardware to edge devices. In addition, Intel's ecosystem is now fully enabled for FPGA. Intel® FPGAs provide numerous benefits, including system acceleration opportunities, power efficiency, and future proofing, thanks to FPGAs' long lifespans, flexibility, and re-configurability. Finally, to help propel AI today and into the future, Intel AI solutions allow a variety of language-agnostic entry points for developers at all skillset levels.

Optimizing Computer Applications for Latency: Part 1: Configuring the Hardware


For most applications, we think about performance in terms of throughput. What matters is how much work an application can do in a certain amount of time. That’s why hardware is usually designed with throughput in mind, and popular software optimization techniques aim to increase it.

However, there are some applications where latency is more important, such as High Frequency Trading (HFT), search engines and telecommunications. Latency is the time it takes to perform a single operation, such as delivering a single packet. Latency and throughput are closely related, but the distinction is important. You can sometimes increase throughput by adding more compute capacity; for example: double the number of servers to do twice the work in the same amount of time. But you can’t deliver a particular message any quicker without optimizing the way the messages are handled within each server.

Some optimizations improve both latency and throughput, but there is usually a trade-off. Throughput solutions tend to store packets in a buffer and process them in batches, but low latency solutions require every packet to be processed immediately.

Consistency is also important. In HFT, huge profits and losses can be made on global events. When news breaks around elections or other significant events, there can be bursts of trading activity with significant price moves. Having an outlier (a relatively high latency trade) at this busy time could result in significant losses.

Latency tuning is a complex topic requiring a wide and deep understanding of networking, kernel organization, CPU and platform performance, and thread synchronization. In this paper, I’ll outline some of the most useful techniques, based on my work with companies in telecommunications and HFT.

Understanding the Challenge of Latency Optimization

Here’s an analogy to illustrate the challenge of latency optimization. Imagine a group of people working in an office, who communicate by passing paper messages. Each message contains the data of a sender, recipient and an action request. Messages are stored on tables in the office. Some people receive messages from the outside world and store them on the table. Others take messages from the table and deliver them to one of the decision makers. Each decision maker only cares about certain types of messages.

The decision makers read the messages and decide whether the action request is to be fulfilled, postponed or cancelled. The requests that will be fulfilled are stored on another table. Messengers take these requests and deliver them to the people who will carry out the actions. That might involve sending the messages to the outside world, and sending confirmations to the original message sender.

To complicate things even more, there is a certain topology of message-passing routes. For example, the office building might have a complicated layout of rooms and corridors and people may need access to some of the rooms. Under normal conditions the system may function reasonably well in handling, let’s say, two hundred messages a day with an average message turnaround of five minutes.

Now, the goal is to dramatically reduce the turnaround time. At the same time, you want to make sure the turnaround time for a message is never more than twice the average. In other words, you want to be able to handle the bursts in activity without causing any latency outliers.

So, how can you improve office efficiency? You could hire more people to move messages around (increasing throughput), and hire faster people (reducing latency). I can imagine you might reduce latency from five minutes to two minutes (maybe even slightly less if you manage to hire Usain Bolt). But you will eventually hit a wall. There is no one faster than Bolt, right? Comparing this approach to computer systems, the people represent processes and this is about executing more threads or processes (to increase throughput) and buying faster computers (to cut latency).

Perhaps the office layout is not the best for the job. It’s important that everyone has enough space to do their job efficiently. Are corridors too narrow so people get stuck there? Make them wider. Are rooms tiny, so people have to queue to get in? Make them bigger. This is like buying a computer with more cores, larger caches and higher memory and I/O bandwidth.

Next, you could use express delivery services, rather than the normal postal service, for messages coming into and out of the office. In a computer system, this is about the choice of network equipment (adapters and switches) and their tuning. As in the office, the fastest delivery option might be the right choice, but is also probably the most expensive.

So now the latency is down to one minute. You can also instruct people and train them to communicate and execute more quickly. This is like tuning software to execute faster. I’ll take 15 percent off the latency for that. We are at 51 seconds.

The next step is to avoid people bumping into each other, or getting in each other’s way. We would like to enable all the people taking messages from the table and putting messages on it to do so at the same time, with no delay. We may want to keep messages sorted in some way (in separate boxes on the table) to streamline the process. There may also be messages of different priorities. In software, this is about improving thread synchronization. Threads should have as parallel and as quick access to the message queue as possible. Fixing bottlenecks increases throughput dramatically, and should also have some effect on latency. Now we can handle bursts of activity, although we do still have the risk of outliers.

People might stop for a chat sometimes or a door may stick in a locked position. There are a lot of little things that could cause delay. The highest priority is to ensure the following: that there are never more people than could fit into a particular space, there are no restrictions on people’s speed, there are no activities unrelated to the current job, and there is no interference from other people. For a computer application, this means we need to ensure that it never runs out of CPU cores, power states are set to maximum performance, and kernel (operating system) or middleware activities are isolated so they do not evict application thread activities.

Now let’s consider whether the office environment is conducive to our goal. Can people open doors easily? Are the floors slippery, so people have to walk with greater care and less speed? The office environment is like the kernel of an operating system. If the office environment can’t be made good enough, perhaps we can avoid part of it. Instead of going through the door, the most dexterous could pass a message through a window. It might be inconvenient, but it’s fast. This is like using kernel bypass solutions for networking.

Instead of relying on the kernel network stack, kernel bypass solutions implement user-space networking. This avoids unnecessary memory copies (from kernel space to user space) and avoids the scheduler delay incurred when placing the receiver thread back on a core for execution. In kernel bypass, the receiver thread typically uses busy-waiting: rather than waiting on a lock, it continuously checks the lock variable until it flags "Go!"
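
To make that concrete, here is a minimal sketch of such a busy-waiting receiver loop, assuming C11 atomics; the packet_ready flag and handle_packet() function are hypothetical names used for illustration, not part of any real kernel bypass API.

#include <stdatomic.h>
#include <immintrin.h>              /* _mm_pause() spin-loop hint */

extern void handle_packet(void);    /* hypothetical user-space packet handler */

static _Atomic int packet_ready;    /* set by the producer when data arrives */

void receiver_loop(void)
{
    for (;;) {
        /* Busy-wait: keep checking the flag instead of sleeping in the kernel. */
        while (!atomic_load_explicit(&packet_ready, memory_order_acquire))
            _mm_pause();            /* hint to the CPU that this is a spin loop */

        handle_packet();            /* process the packet entirely in user space */
        atomic_store_explicit(&packet_ready, 0, memory_order_release);
    }
}

The _mm_pause() hint reduces power and pipeline pressure while spinning, without giving the core back to the scheduler.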

On top of that there may be different methods of exchanging messages through windows. You would likely start with delivering hand to hand. This sounds reliable, but it’s not the fastest. That’s how the Transmission Control Protocol (TCP) protocol works. Moving to User Datagram Protocol (UDP) would mean just throwing messages into the receiver’s window. You don’t need to wait for a person’s readiness to get a message from your hand. Looking for further improvement? How about throwing messages through the window so they land right on the table in the message queue? In a networking world, such an approach is called remote direct memory access (RDMA). I believe the latency has been cut to about 35 seconds now.

What about an office purpose-built, according to your design? You can make sure the messengers are able to move freely and their paths are optimized. That could get the latency down to 30 seconds, perhaps. Redesigning the office is like using a field programmable gate array (FPGA). An FPGA is a compute device that can be programmed specifically for a particular application. CPUs are hardcoded, which means they can only execute a particular instruction set with a data flow design that is also fixed. Unlike CPUs, FPGAs are not hardcoded for any particular instruction set so programming them makes them able to run a particular task and only that task. Data flow is also programmed for a particular application. As with a custom-designed office, it’s not easy to create an FPGA or to modify it later. It might deliver the lowest latency, but if anything changes in the workflow, it might not be suitable any more. An FPGA is also a type of office where thousands of people can stroll around (lots of parallelism), but there’s no running allowed (low frequency). I’d recommend using an FPGA only after considering the other options above.

To go further, you’ll need to use performance analysis tools. In part two of this article, I’ll show you how Intel® VTuneTM Amplifier and Intel® Processor Trace technology can be used to identify optimization opportunities.

Making the Right Hardware Choices

Before we look at tuning the hardware, we should consider the different hardware options available.

Processor choice

One of the most important decisions is whether to use a standard CPU or an FPGA.

The most extreme low latency solutions are developed and deployed on FPGAs. Despite the fact that FPGAs are not particularly fast in terms of frequency, they are nearly unlimited in terms of parallelism, because the device can be designed to satisfy the demands of the task at hand. This only makes a difference if the algorithm is highly parallel. There are two ways that parallelism helps. First, it can handle a huge number of packets simultaneously, so it handles bursts very well with a stable latency. As soon as there are more packets than cores in a CPU, there will be a delay. This has more of an impact on throughput than on latency. The second way that parallelism helps is at the instruction level. A CPU can only issue a handful of instructions per cycle (typically around four). An FPGA can carry out a nearly unlimited number of instructions simultaneously. For example, it can parse all the fields of an incoming packet concurrently. This is why it delivers lower latency despite its lower frequency.

In low latency applications, the FPGA usually receives a network signal through a PHY chip and does a full parsing of the network packets. It takes roughly half the time, compared to parsing and delivering packets from a network adapter to a CPU (even using the best kernel bypass solutions). In HFT, Ethernet is typically used because exchanges provide Ethernet connectivity. FPGA vendors provide Ethernet building blocks for various needs.

Some low latency solutions are designed to work across CPUs and FPGAs. Currently a typical connection is by PCI-e, but Intel has announced a development module using Intel® Xeon® processors together with FPGAs, where connectivity is by Intel® QuickPath Interconnect (Intel® QPI) link. This reduces connection latency significantly and also increases throughput.

When using CPU-based solutions, the CPU frequency is obviously the most important parameter for most low latency applications. The typical hardware choice is a trade-off between frequency and the number of cores. For particularly critical workloads, it’s not uncommon for server CPUs and other components to be overclocked. Overclocking memory usually has less impact. For a typical trading platform, memory accounts for about 10 percent of latency, though your mileage may vary, so the gains from overclocking are limited. In most cases, it isn’t worth trying it. Be aware that having more DIMMs may cause a drop in memory speed.

Single-socket servers operating independently are generally better suited for latency because they eliminate some of the complications and delay associated with ensuring consistent inter-socket communication.

Networking

The lowest latencies and the highest throughputs are achieved by high-performance computing (HPC) specialized interconnect solutions, which are widely used in HPC clusters. For Infiniband* interconnect, the half-roundtrip latency could be as low as 700 nanoseconds. (The half-roundtrip latency is measured from the moment a packet arrives at the network port of a server, until the moment the response has been sent from the server’s network port).

In HFT and telco, long-range networking is usually based on Ethernet. To ensure the lowest possible latency when using Ethernet, two critical components must be used: a low latency network adapter and kernel bypass software. The fastest half-roundtrip latency you can get with kernel bypass is about 1.1 microseconds for UDP and slightly slower for TCP. Kernel bypass software implements the network stack in user space and eliminates bottlenecks in the kernel (superfluous data copies and context switches).

DPDK

Another high throughput and low latency option for Ethernet networking is the Data Plane Development Kit (DPDK). DPDK dedicates certain CPU cores to be the packet receiver threads and uses a permanent polling mode in the driver to ensure the quickest possible response to arriving packets. For more information, see http://dpdk.org/

Storage

When we consider low latency applications, storage is rarely on a low latency path. When we do consider it, the best solution is a solid state drive (SSD). With access latencies of dozens of microseconds, SSDs are much faster than hard drives. There are PCI-e-based NVMe SSDs that provide the lowest latencies and the highest bandwidths.

Intel has announced the 3D XPoint™ technology, and released the first SSD based on it. These disks bring latency down to several microseconds. This makes the 3D XPoint technology ideal for high performance SSD storage, delivering up to ten times the performance of NAND across a PCIe NVMe interface. An even better alternative in the future will be non-volatile memory based on 3D XPoint technology.

Tuning the Hardware for Low Latency

The default hardware settings are usually optimized for the highest throughput and reasonably low power consumption. When we’re chasing latency, that’s not what we are looking for. This section provides a checklist for tuning the hardware for latency.

In addition to these suggestions, check for any specific guidance from OEMs on latency tuning for their servers.

BIOS settings
  • Ensure that Turbo is on.

  • Disable lower CPU power states. Settings vary among different vendors, so after turning C-states off, you should check whether there are extra settings like C1E, and memory and PCI-e power saving states, which should also be disabled.

  • Check for other settings that might influence performance. This varies greatly by OEM, but should include anything power related, such as fan speed settings.

  • Disable hyper-threading to reduce variations in latency (jitter).

  • Disable any virtualization options.

  • Disable any monitoring options.

  • Disable Hardware Power Management, introduced in the Intel® Xeon® processor E5-2600 v4 product family. It provides more control over power management, but it can cause jitter and so is not recommended for latency-sensitive applications.

Networking
  • Ensure that the network adapter is inserted into the correct PCI-e slot, where the receiver thread is running. That shaves off inter-socket communication latency and allows Intel® Data Direct I/O Technology to place data directly into the last level cache (LLC) of the same socket.

  • Bind network interrupts to a core running on the same socket as a receiver thread. Check entry N in /proc/interrupts (where N is the interrupt queue number), and then write a hexadecimal CPU mask selecting that core to the interrupt's affinity file:
    echo <cpu_mask> > /proc/irq/N/smp_affinity

  • Disable interrupt coalescing. Usually the default mode is adaptive which is much better than any fixed setting, but it is still several microseconds slower than disabling it. The recommended setting is:
    ethtool -C <interface> rx-usecs 0 rx-frames 0 tx-usecs 0 tx-frames 0 pkt-rate-low 0 pkt-rate-high 0

Kernel bypass
  • Kernel bypass solutions usually come tuned for latency, but there still may be some useful options to try out such as polling settings.

Kernel tuning
  • Set the correct power mode. Edit /boot/grub/grub.conf and add:
    nosoftlockup intel_idle.max_cstate=0 processor.max_cstate=0 mce=ignore_ce idle=poll
     to the kernel line. For more information, see www.kernel.org/doc/Documentation/x86/x86_64/boot-options.txt

  • Turn off the cpuspeed service.

  • Disable unnecessary kernel services to avoid jitter.

  • Turn off the IRQ Balance service if interrupt affinity has been set.

  • Try tuning IPv4 parameters. Although this is more important for throughput, it can help to handle bursts of network activity.

  • Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0

  • Disable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=0

  • Increase the maximum length of processor input queues:
    sysctl -w net.core.netdev_max_backlog=250000

  • Increase the TCP maximum and default buffer sizes using setsockopt():
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.core.rmem_default=16777216
    sysctl -w net.core.wmem_default=16777216
    sysctl -w net.core.optmem_max=16777216

  • Increase memory thresholds to prevent packet dropping:
    sysctl -w net.ipv4.tcp_mem="16777216 16777216 16777216"

  • Increase the Linux* auto-tuning of TCP buffer limits. The minimum, default, and maximum number of bytes to use are shown below (in the order minimum, default, and maximum):
    sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
    sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"

  • Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1

  • For tuning the network stack, a good alternative is:
    tuned-adm profile network-latency

  • Disable iptables.

  • Set the scaling governor to “performance” mode for each core used by a process:
    for ((i=0; i<num_of_cores; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

  • Configure the kernel as preemptive to help reduce the number of outliers.

  • Use a tickless kernel to help eliminate any regular timer interrupts causing outliers.

  • Finally, use the isolcpus parameter to isolate the cores allocated to an application from OS processes.

Conclusion

This article provides an introduction to the challenge of latency tuning, the hardware choices available, and a checklist for configuring it for low latency. In the second article in this series, we look at application tuning, including a working example.

Optimizing Computer Applications for Latency: Part 2: Tuning Applications


For applications such as high frequency trading (HFT), search engines and telecommunications, it is essential that latency can be minimized. My previous article, Optimizing Computer Applications for Latency: Part 1, looked at the architecture choices that support a low latency application. This article builds on that to show how latency can be measured and tuned within the application software.

Using Intel® VTune™ Amplifier

Intel® VTune™ Amplifier XE can collect and display a lot of useful data about an application’s performance. You can run a number of pre-defined collections (such as parallelism and memory analysis) and see thread synchronization on a timeline. You can break down activity by process, thread, module, function, or core, and break it down by bottleneck too (memory bandwidth, cache misses, and front-end stalls).

Intel VTune can be used to identify many important performance issues, but it struggles with analyzing intervals measured in microseconds. Intel VTune uses periodic interrupts to collect data and save it. The frequency of those interrupts is limited to roughly one collection point per 100 microseconds per core. While you can filter the data to observe some of the outliers, the data on any single outlier will be limited, and some might be missed by the sampling frequency.

You can download a free trial of Intel VTune Amplifier XE. Read about the Intel VTune Amplifier XE capabilities.

Figure 1. Intel® VTune™ Amplifier XE 2017, showing hotspots (above) and concurrency (below) analyses

Using Intel® Processor Trace Technology

The introduction of Intel® Processor Trace (Intel® PT) technology in the Broadwell architecture, for example in the Intel® Xeon® processor E5-2600 v4, makes it possible to analyze outliers in low latency applications. Intel® PT is a hardware feature that logs information about software execution with minimal impact on system execution. It supports control flow tracing, so decoder software can be used to determine the exact flow of the software execution, including branches taken and not taken, based on the trace log. Intel PT can store both cycle count and timestamp information for deep performance analysis. If you can time stamp other measurements, traces, and screenshots you can synchronize the Intel PT data with them. The granularity of a capture is a basic block. Intel PT is supported by the “perf” performance analysis tool in Linux*.

Typical Low Latency Application Issues

Low latency applications can suffer from the same bottlenecks as any kind of application, including:

  • Using excessive system library calls (such as inefficient memory allocations or string operations)

  • Using outdated instruction sets, because of obsolete compilers or compiler options

  • Memory and other runtime issues leading to execution stalls

On top of those, latency-sensitive applications have their own specific issues. Unlike in high-performance computing (HPC) applications, where loop bodies are usually small, the loop body in a low latency application usually covers a packet processing instruction path. In most cases, this leads to heavy front-end stalls because the decoded instructions for the entire packet processing path do not fit into the instruction (uop) cache. That means instructions have to be decoded on the fly for each loop iteration. Between 40 and 50 per cent of CPU cycles can stall due to the lack of instructions to execute.

Another specific problem is due to inefficient thread synchronization. The impact of this usually increases with a higher packet/transaction rate. Higher latency may lead to a limited throughput as well, making the application less able to handle bursts of activity. One example I’ve seen in customer code is guarding a single-threaded queue with a lock to use it in a multithreaded environment. That’s hugely inefficient. Using a good multithreaded queue, we’ve been able to improve throughput from 4,000 to 130,000 messages per second. Another common issue is using thread synchronization primitives that go to kernel sleep mode immediately or too soon. Every wake-up from kernel sleep takes at least 1.2 microseconds.
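
For reference, the sketch below shows the general shape of a lockless single-producer/single-consumer ring buffer built on C11 atomics. The capacity, names, and void* payload are my own illustrative choices; this is not the spsc.c sample used in the exercise later in this article.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 1024                          /* capacity; must be a power of two */

typedef struct {
    void *slots[QCAP];
    _Atomic size_t head;                   /* written only by the producer */
    _Atomic size_t tail;                   /* written only by the consumer */
} spsc_queue_t;

/* Producer side: returns false when the queue is full. */
static bool spsc_push(spsc_queue_t *q, void *item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == QCAP)
        return false;                      /* full */
    q->slots[head & (QCAP - 1)] = item;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty. */
static bool spsc_pop(spsc_queue_t *q, void **item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail == head)
        return false;                      /* empty */
    *item = q->slots[tail & (QCAP - 1)];
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

Because each index is written by exactly one thread, no locks are needed and neither side makes a system call on the fast path.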

One of the goals of a low latency application is to reduce the quantity and extent of outliers. Typical reasons for jitter (in descending order) are:

  • Thread oversubscriptions, accounting for a few milliseconds

  • Runtime garbage collector activities, accounting for a few milliseconds

  • Kernel activities, accounting for up to 100s of microseconds

  • Power-saving states:

    • CPU C-states, accounting for 10s to 100s of microseconds

    • Memory states

    • PCI-e states

  • Turbo mode frequency switches, accounting for 7 microseconds

  • Interrupts, IO, timers: responsible for up to a few microseconds

Tuning the Application

Application tuning should begin by tackling any issues found by Intel VTune. Start with the top hotspots and, where possible, eliminate or reduce excessive activities and CPU stalls. This has been widely covered by others before, so I won’t repeat their work here. If you’re new to Intel VTune, there’s a Getting Started guide.

In this article, we will focus on the specifics for low latency applications. The biggest issue arises from front-end stalls in the instruction decoding pipeline. This issue is difficult to address, and results from the loop body being too big for the uop cache. One approach that might help is to split the packet processing loop and process it by a number of threads passing execution from one another. There will be a synchronization overhead, but if the instruction sequence fits into a few uop caches (each thread bound to different cores, one cache per thread), it may well be worth the exercise.

Thread synchronization issues are somewhat difficult to monitor. Intel VTune Amplifier has a collection that captures all standard thread sync events (Windows* Thread API, pthreads* API, Intel® Threading Building Blocks, and OpenMP*). It helps you understand what is going on in the application, but deeper analysis is required to see whether a thread sync model introduces any limitations. This is a non-trivial exercise requiring considerable expertise. The best advice is to use a highly performant threading solution.

An interesting topic is thread affinities. For complex systems with multiple thread synchronization patterns along the workflow, setting the best affinities may bring some benefit. A synchronization object is a variable or data structure, plus its associated lock/release functionality. Threads synchronized on a particular object should be pinned to a core of the same socket, but they don’t need to be on the same core. Generally the goal of this exercise is to keep thread synchronization on a particular object local to one of the sockets, because cross-socket thread sync is much costlier.
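
A minimal sketch of pinning the calling thread from inside the application is shown below, using the GNU-specific pthread_setaffinity_np() call; the helper name and the choice of core are illustrative. Each thread sharing a synchronization object would be pinned to a core on the same socket.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core; returns 0 on success. */
int pin_current_thread_to_core(int core_id)
{
    cpu_set_t cpuset;

    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);   /* choose a core on the socket holding the shared data */
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

Alternatively, affinities can be set from outside the process with a tool such as taskset, as in the exercise below.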

Tackling Outliers in Virtual Machines

If the application runs in a Java* or .NET* virtual machine, the virtual machine needs to be tuned. The garbage collector settings are particularly important. For example, try tuning the tenuring threshold to avoid unnecessary moves of long-lived objects. This often helps to reduce latency and cut outliers down.

One useful technology introduced in the Intel® Xeon® processor E5-2600 v4 product family is Cache Allocation Technology. It allows a certain amount of last level cache (LLC) to be dedicated to a particular core, process, or thread, or to a group of them. For example, a low latency application might get exclusive use of part of the cache so anything else running on the system won’t be able to evict its data.

Another interesting technique is to lock the hottest data in the LLC “indefinitely”. This is a particularly useful technique for outlier reduction. The hottest data is usually considered to be the data that’s accessed most often, but for low latency applications it can instead be the data that is on a critical latency path. A cache miss costs roughly 50 to 100 nanoseconds, so a few cache misses can cause an outlier. By ensuring that critical data is locked in the cache, we can reduce the number and intensity of outliers.

For more information on Cache Allocation Technology, see Using Hardware Features in Intel® Architecture to Achieve High Performance in NFV.           

Exercise

Let’s play with a code sample implementing a lockless single-producer single-consumer queue (spsc.c), available from the download link.

To start, grab the source code for the test case from the download link above. Build it like this:

gcc spsc.c -lpthread -lm -o spsc

Or

icc spsc.c -lpthread -lm -o spsc

Here’s how you run the spsc test case:

./spsc 100000 10 100000

The parameters are: numofPacketsToSend bufferSize numofPacketsPerSecond. You can experiment with different numbers.

Let’s check how the latency is affected by CPU power-saving settings. Set everything in the BIOS to maximum performance, as described in Part 1 of this series. Specifically, CPU C-states must be set to off and the correct power mode should be used, as described in the Kernel Tuning section. Also, ensure that cpuspeed is off.

Next, set the CPU scaling governor to powersave. In this command, the index i should range over all the cores in the system:

for ((i=0; i<23; i++)); do echo powersave > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

Then set all threads to stay on the same NUMA node using taskset, and run the test case:

taskset -c 0,1 ./spsc 1000000 10 1000000

On a server based on the Intel® Xeon® Processor E5-2697 v2, running at 2.70GHz, we see the following results for average latency with and without outliers, the highest and lowest latency, the number of outliers and the standard deviation (with and without outliers). All measurements are in microseconds:

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.274690, Avg lat w/o outliers = 0.234502, lowest lat = 0.133645, highest lat = 852.247954, outliers = 4023

Stdev = 0.001214, stdev w/o outliers = 0.001015

Now set the performance mode (overwriting the powersave mode) and run the test again:

for ((i=0; i<23; i++)); do echo performance > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor;  done

taskset -c 0,1 ./spsc 1000000 10 1000000

Avg lat = 0.067001, Avg lat w/o outliers = 0.051926, lowest lat = 0.045660, highest lat = 422.293023, outliers = 1461

Stdev = 0.000455, stdev w/o outliers = 0.000560

As you can see, all the performance metrics improved significantly when we enabled performance mode: the average, lowest, and highest latencies, and the number of outliers. (Table 1 summarizes all the results from this exercise for easy comparison.)

Let’s compare how the latency is affected by a NUMA locality. I’m assuming you have a machine with more than one processor. We’ve already run the test case bound to a single NUMA node.

Let’s run the test case over two nodes:

taskset -c 8,16 ./spsc 1000000 10 1000000

Avg lat = 0.248679, Avg lat w/o outliers = 0.233011, lowest lat = 0.069047, highest lat = 415.176207, outliers = 1926

Stdev = 0.000901, stdev w/o outliers = 0.001103

All of the metrics, except for the highest latency, are better on a single NUMA node. This results from the cost of communicating with another node, because data needs to be transferred over the Intel® QuickPath Interconnect (Intel® QPI) link and through the cache coherency mechanism.

Don’t be surprised that the highest latency is lower on two nodes. You can run the test multiple times and verify that the highest latency outliers are roughly the same for both one node and two nodes. The lower value shown here for two nodes is most likely a coincidence. The outliers are two to three orders of magnitude higher than the average latency, which shows that NUMA locality doesn’t matter for the highest latency. The outliers are caused by kernel activities that are not related to NUMA.

Test                  | Avg Lat  | Avg Lat w/o Outliers | Lowest Lat | Highest Lat | Outliers | Stdev    | Stdev w/o Outliers
Powersave             | 0.274690 | 0.234502             | 0.133645   | 852.247954  | 4023     | 0.001214 | 0.001015
Performance, 1 node   | 0.067001 | 0.051926             | 0.045660   | 422.293023  | 1461     | 0.000455 | 0.000560
Performance, 2 nodes  | 0.248679 | 0.233011             | 0.069047   | 415.176207  | 1926     | 0.000901 | 0.001103

Table 1: The results of the latency tests conducted under different conditions, measured in microseconds.

I also recommend playing with Linux perf to monitor outliers; Intel PT support starts with kernel 4.1. You need to add timestamps (start and stop) for all latency intervals, identify a particular outlier, and then drill down into the perf data to see what was going on during the interval of that outlier.

For more information, see https://github.com/torvalds/linux/blob/master/tools/perf/Documentation/intel-pt.txt.
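
As a starting point for that timestamping, here is a minimal sketch using clock_gettime(CLOCK_MONOTONIC); the outlier threshold and the process_one_message() call are hypothetical. Logging when each outlier occurred lets you line it up with the corresponding region of the perf/Intel PT trace.

#include <stdio.h>
#include <time.h>

extern void process_one_message(void);   /* hypothetical critical-path work */

#define OUTLIER_NS 10000LL                /* flag anything slower than 10 microseconds */

static long long elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

void timed_iteration(void)
{
    struct timespec start, stop;

    clock_gettime(CLOCK_MONOTONIC, &start);
    process_one_message();
    clock_gettime(CLOCK_MONOTONIC, &stop);

    long long ns = elapsed_ns(start, stop);
    if (ns > OUTLIER_NS)
        /* Record when the outlier happened so it can be matched against the trace. */
        fprintf(stderr, "outlier: %lld ns at %lld.%09ld\n",
                ns, (long long)stop.tv_sec, stop.tv_nsec);
}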

Conclusion

This two-part article has summarized some of the approaches you can take, and tools you can use, when tuning applications and hardware for low latency. Using the worked example here, you can quickly see the impact of NUMA locality and powersave mode, and you can use the test case to experiment with other settings, and quickly see the impact they can have on latency.


Intel® Software Development tools integration to Microsoft* Visual Studio 2017 issue


Issue: Installation of Intel® Parallel Studio XE with Microsoft* Visual Studio 2017 integration hangs and fails on some systems. The problem is intermittent and not reproducible on every system. Any attempt to repair it fails with the message "Incomplete installation of Microsoft Visual Studio* 2017 is detected". Note that in some cases the installation may complete successfully with no errors or crashes; however, the integration into VS2017 is not installed. The issue may be observed with Intel® Parallel Studio XE 2017 Update 4, Intel® Parallel Studio XE 2018 Beta and later versions, as well as with Intel® System Studio installations.

Environment: Microsoft* Windows, Visual Studio 2017

Root Cause: A root cause was identified and reported to Microsoft*. Note that there may be different reasons for integration failures. We are documenting all cases and providing them to Microsoft for further root-cause analysis.

Workaround:

Note that with Intel Parallel Studio XE 2017 Update 4 there is no automated workaround for this integration problem. An automated workaround is expected in Intel Parallel Studio XE 2017 Update 5, and it is already implemented in Intel Parallel Studio XE 2018 Beta Update 1.

Integrate the Intel Parallel Studio XE components manually. You need to run all the files from the corresponding folders:

  • C++/Fortran Compiler IDE: <installdir>/ide_support_2018/VS15/*.vsix
  • Amplifier: <installdir>/VTune Amplifier 2018/amplxe_vs2017-integration.vsix
  • Advisor: <installdir>/Advisor 2018/advi_vs2017-integration.vsix
  • Inspector: <installdir>/Inspector 2018/insp_vs2017-integration.vsix
  • Debugger: <InstallDir>/ide_support_2018/MIC/*.vsix
                      <InstallDir>/ide_support_2018/CPUSideRDM/*.vsix

If this workaround doesn't work and the installation still fails, please report the problem to Intel through the Intel® Developer Zone Forums or the Online Service Center. You will need to supply the installation log file and the error message from the Microsoft installer.

Announcing the Intel Modern Code Developer Challenge from CERN openlab


It is always an exciting time when I get to announce a Modern Code Developer Challenge from my friends at Intel, but it is even more special when I get to announce a collaboration with the brilliant minds at CERN. Beginning this month (July 2017), and running for nine weeks, five exceptional students participating in the CERN openlab Summer Student Programme are working to research and develop solutions for five modern-code-centered challenges. These are no ordinary challenges, as you might have already guessed—here is a brief summary of what they are tackling:

  1. Smash-simulation software: Teaching algorithms to be faster at simulating particle-collision events.
  2. Connecting the dots: Using machine learning to better identify the particles produced by collision events.
  3. Cells in the cloud: Running biological simulations more efficiently with cloud computing.
  4. Disaster relief: Helping computers to get better at recognizing objects in satellite maps created by a UN agency.
  5. IoT at the LHC: Integrating Internet of Things devices into the control systems for the Large Hadron Collider.

After the nine weeks of interactive support from an open community of developers, scientists, fellow students, and other people passionate about science, one of the five students will be selected to showcase their winning project at a number of leading industry events. The winner will be announced at the upcoming Intel® HPC Developer Conference on November 11, 2017, and the winning project will also be shown at the SC17 supercomputing conference in Denver, Colorado.

Follow the Intel Developer Zone on Facebook for more announcements and information, including those about this exciting new challenge that will surely teach us a thing or two about modern coding.

I will add comments to this blog as I learn more about the opportunities to review/comment/vote on the on-going work of these five CERN openlab Summer Student Programme students working to make the world a better place!

 

Rendering Researchers: Hugues Labbe


Hugues has been passionate about graphics programming since an early age, starting with the Commodore Amiga demo scene in the mid-80s.

He earned his master's degree in computer graphics from IRISA University (France) in 1995. After relocating to California and helping grow a couple of San Francisco Bay Area startups in the early 2000s, he joined Intel in 2005. There he worked on optimizations of the geometry pipe for Intel's graphics driver stack, followed by shader compiler architecture and end-to-end optimizations of the Larrabee graphics pipeline, and more recently on an end-to-end virtual reality compositor (design, architecture, implementation, and optimizations) for Intel's Project Alloy VR headset.

Working on competitive graphics innovation for future Intel platforms, he currently focuses his research on advancing the state of the art in real-time rendering, ray-tracing GPU acceleration, and GPU compiler and hardware architecture.

System Analyzer Utility for Linux


Overview

This article describes a utility to help diagnose system and installation issues for Intel(R) Computer Vision SDK, Intel(R) SDK for OpenCL(TM) Applications and Intel(R) Media Server Studio.  It is a simple Python script with full source code available.

It is intended as a reference for the kinds of checks to consider from the command line and possibly from within applications.  However, this implementation should be considered a prototype/proof of concept -- not a productized tool.

Features

When executed, the tool reports back:

  • Platform readiness: check if processor has necessary GPU components
  • OS readiness: check if OS can see GPU, and if it has required glibc/gcc level
  • Install checks for Intel(R) Media Server Studio/Intel(R) SDK for OpenCL Applications components
  • Results from runs of small smoke test programs for Media SDK and OpenCL

System Requirements

The tool is based on Python 2.7.  It should run on a variety of systems, with or without the components necessary to run GPU applications.  However, it is still a work in progress, so it may not always exit cleanly when software components are missing.

 

Using the Software

The display should look like the output below for a successful installation:

$ python sys_analyzer_linux.py -v
--------------------------
Hardware readiness checks:
--------------------------
 [ OK ] Processor name: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
 [ INFO ] Intel Processor
 [ INFO ] Processor brand: Core
 [ INFO ] Processor arch: Skylake
--------------------------
OS readiness checks:
--------------------------
 [ INFO ] GPU PCI id     : 1916
 [ INFO ] GPU description: SKL ULT GT2
 [ OK ] GPU visible to OS
 [ INFO ] no nomodeset in GRUB cmdline (good)
 [ INFO ] Linux distro   : Ubuntu 14.04
 [ INFO ] Linux kernel   : 4.4.0
 [ INFO ] glibc version  : 2.19
 [ INFO ] gcc version    : 4.8.4 (>=4.8.2 suggested)
 [ INFO ] /dev/dri/card0 : YES
 [ INFO ] /dev/dri/renderD128 : YES
--------------------------
Intel(R) Media Server Studio Install:
--------------------------
 [ OK ] user in video group
 [ OK ] libva.so.1 found
 [ INFO ] Intel iHD used by libva
 [ OK ] vainfo reports valid codec entry points
 [ INFO ] i915 driver in use by Intel video adapter
 [ OK ] /dev/dri/renderD128 connects to Intel i915

--------------------------
Media SDK Plugins available:
(for more info see /opt/intel/mediasdk/plugins/plugins.cfg)
--------------------------
    H264LA Encoder 	= 588f1185d47b42968dea377bb5d0dcb4
    VP8 Decoder 	= f622394d8d87452f878c51f2fc9b4131
    HEVC Decoder 	= 33a61c0b4c27454ca8d85dde757c6f8e
    HEVC Encoder 	= 6fadc791a0c2eb479ab6dcd5ea9da347
--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] Media SDK HW API level:1.19
 [ OK ] Media SDK SW API level:1.19
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

 

 

The Modern Code Developer Challenge


As part of its ongoing support of the world-wide student developer community and advancement of science, Intel® Software has partnered with CERN through CERN openlab to sponsor the Intel® Modern Code Developer Challenge. The goal for Intel is to give budding developers the opportunity to use modern programming methods to improve code that helps move science forward. Take a look at the winners from the previous challenge here!

The Challenge will take place from July through October 2017, with the winners announced in November 2017 at the Intel® HPC Developer Conference.

Check back on this site soon for more information!

 

1) Smash-simulation software: Teaching algorithms to be faster at simulating particle-collision events

Physicists widely use a software toolkit called GEANT4 to simulate what will happen when a particular kind of particle hits a particular kind of material in a particle detector. In fact, this toolkit is so popular that it is also used by researchers in other fields who want to predict how particles will interact with other matter: it’s used to assess radiation hazards in space, for commercial air travel, in medical imaging, and even to optimise scanning systems for cargo security.

An international team, led by researchers at CERN, is now working to develop a new version of this simulation toolkit, called GeantV. This work is supported by a CERN openlab project with Intel on code modernisation. GeantV will improve physics accuracy and boost performance on modern computing architectures.

The team behind GeantV is currently implementing a ‘deep-learning' tool that will be used to make simulation faster. The goal of this project is to write a flexible mini-application that can be used to support the efforts to train the deep neural network on distributed computing systems.

 

2) Connecting the dots: Using machine learning to better identify the particles produced by collision events

The particle detectors at CERN are like cathedral-sized 3D digital cameras, capable of recording hundreds of millions of collision events per second. The detectors consist of multiple ‘layers’ of detecting equipment, designed to recognise different types of charged particles produced by the collisions at the heart of the detector. As the charged particles fly outwards through the various layers of the detector, they leave traces, or ‘hits’.

Tracking is the art of connecting the hits to recreate trajectories, thus helping researchers to understand more about and identify the particles. The algorithms used to reconstruct the collision events by identifying which dots belong to which charged particles can be very computationally expensive. And, with the rate of particle collisions in the LHC set to be further increased over the coming decade, it’s important to be able to identify particle tracks as efficiently as possible.

Many track-finding algorithms start by building ‘track seeds’: groups of two or three hits that are potentially compatible with one another. Compatibility between hits can also be inferred from what are known as ‘hit shapes’. These are akin to footprints; the shape of a hit depends on the energy released in the layer, the crossing angle of the hit at the detector, and on the type of particle.

This project investigates the use of machine-learning techniques to help recognise these hit shapes more efficiently. The project will explore the use of state-of-the-art many-core architectures, such as the Intel Xeon Phi processor, for this work.

 

3) Cells in the cloud: Running biological simulations more efficiently with cloud computing

BioDynaMo is one of CERN openlab’s knowledge-sharing projects. It is part of CERN openlab’s collaboration with Intel on code modernisation, working on methods to ensure that scientific software makes full use of the computing potential offered by today’s cutting-edge hardware technologies.

It is a joint effort between CERN, Newcastle University, Innopolis University, and Kazan Federal University to design and build a scalable and flexible platform for rapid simulation of biological tissue development.

The project focuses initially on the area of brain tissue simulation, drawing inspiration from existing, but low-performance software frameworks. By using the code to simulate the development of the normal and diseased brain, neuroscientists hope to be able to learn more about the causes of — and identify potential treatments for — disorders such as epilepsy and schizophrenia.

Late 2015 and early 2016 saw algorithms already written in Java code ported to C++. Once porting was completed, work was carried out to optimise the code for modern computer processors and co-processors. In order to be able to address ambitious research questions, however, more computational power will be needed. Work will, therefore, be undertaken to adapt the code for running using high-performance computing resources over the cloud. This project focuses on adding network support for the single-node simulator and prototyping the computation management across many nodes.

 

4) Disaster relief: Helping computers to get better at recognising objects in satellite maps created by a UN agency

UNOSAT is part of the United Nations Institute for Training and Research (UNITAR). It provides a rapid front-line service to turn satellite imagery into information that can aid disaster-response teams. By delivering imagery analysis and satellite solutions to relief and development organizations — both within and outside the UN system — UNOSAT helps to make a difference in critical areas such as humanitarian relief, human security, and development planning.

Since 2001, UNOSAT has been based at CERN and is supported by CERN's IT Department in the work it does. This partnership means UNOSAT can benefit from CERN's IT infrastructure whenever the situation requires, enabling the UN to be at the forefront of satellite-analysis technology. Specialists in geographic information systems and in the analysis of satellite data, supported by IT engineers and policy experts, ensure a dedicated service to the international humanitarian and development communities 24 hours a day, seven days a week.

CERN openlab and UNOSAT are currently exploring new approaches to image analysis and automated feature recognition to ease the task of identifying different classes of objects from satellite maps. This project evaluates available machine-learning-based feature-extraction algorithms. It also investigates the potential for optimising these algorithms for running on state-of-the-art many-core architectures, such as the Intel Xeon Phi processor.

 

5) IoT at the LHC: Integrating ‘internet-of-things’ devices into the control systems for the Large Hadron Collider

The Large Hadron Collider (LHC) accelerates particles to over 99.9999% of the speed of light. It is the most complex machine ever built, relying on a wide range of industrial control systems for proper functioning.

This project will focus on integrating modern ‘systems-on-a-chip’ devices into the LHC control systems. The new, embedded ‘systems-on-a-chip’ available on the market are sufficiently powerful to run fully-fledged operating systems and complex algorithms. Such devices can also be easily enriched with a wide range of different sensors and communication controllers.

The ‘systems-on-a-chip’ devices will be integrated into the LHC control systems in line with the ‘internet of things’ (IoT) paradigm, meaning they will be able to communicate via an overlaying cloud-computing service. It should also be possible to perform simple analyses on the devices themselves, such as filtering, pre-processing, conditioning, monitoring, etc. By exploiting the IoT devices’ processing power in this manner, the goal is to reduce the network load within the entire control infrastructure and ensure that applications are not disrupted in case of limited or intermittent network connectivity.

 

Pirate Cove as an Example: How to Bring Steam*VR into Unity*


View PDF [1065 kb]

Who is the Target Audience of This Article?

This article is aimed at an existing Unity* developer who would like to incorporate Steam*VR into their scene. I am making the assumption that the reader already has an HTC Vive* that is set up on their computer and running correctly. If not, follow the instructions at the SteamVR site.

Why Did I Create the My Pirate Cove VR Scene?

My focus at work changed and I needed to ramp up on virtual reality (VR) and working in the Unity environment. I wanted to figure out how to bring SteamVR into Unity, layout a scene, and enable teleporting so I could move around the scene.

This article is intended to talk to some points that I learned along the way as well as show you how I got VR working with my scene. I will not be talking much about laying out the scene and how I used Unity to get everything up and running; rather, the main focus of this article is to help someone get VR incorporated into their scene.

What Was I Trying to Create?

I was trying to create a virtual reality visual experience. Not a game per se, even though I was using Unity to create my experience. I created a visual experience that would simulate what a small, pirate-themed tropical island could look like. I chose something that was pleasing to me; after all, I live in the rain-infested Pacific Northwest. I wanted to experience something tropical.

What Tools Did I Use?

I used Unity* 5.6. From there I purchased a few assets from the Unity Asset Store. The assets I chose were themed around an old, tropical, pirate setting:

  • Pirates Island
  • SteamVR
  • Hessburg Navel Cutter
  • Aquas

Along with a few other free odds and ends.

What Did I Know Going Into This Project?

I had some experience with Unity while working with our Intel® RealSense™ technology. Our SDK had an Intel RealSense Unity plugin, and I had written about the plugin as well as created a few training examples on it.

Up to this point I had never really tried to lay out a first-person type level in Unity, never worried about performance, frames per second (FPS), or anything like that. I had done some degree of scripting while using Intel RealSense and other simple ramp up tools. However, I’d never had to tackle a large project or scene and any issues that could come with that.

What Was the End Goal of This Project?

What I hoped for was to walk away from this exercise with a better understanding of Unity and of incorporating VR into a scene. I wanted to see how difficult it was to get VR up and running. What could I expect once I got it working? Would performance be acceptable? Would I be able to feel steady, not woozy, due to potential lowered frame rates?

And have fun. I wanted to have fun learning all this, which is also why I chose to create a tropical island, pirate-themed scene. I personally have a fascination with the old Caribbean pirate days.

What Misconceptions Did I Have?

As mentioned, I did have some experience with Unity, but not a whole lot.

The first misconception I had was what gets rendered and when. What do I mean? For some reason I had assumed that if I have a terrain including, for example, a 3D model such as a huge cliff, and I placed the cliff such that only 50 percent of the cliff was above the terrain, Unity would not try to render what was not visible. I somehow thought that there was some kind of rendering algorithm that would prevent Unity from rendering anything under a terrain object. Apparently that is not the case. Unity still renders the entire cliff 3D model.

This same concept applied to two different 3D cliff models. For example, if I had two cliff game objects, I assumed that if I pushed one cliff into the other to give the illusion of one big cliff, any geometry or texture information that was no longer visible would not get rendered. Again, not the case.

Apparently, if it has geometry and textures, no matter if it’s hidden inside something else, it will get rendered by Unity. This is something to take into consideration. I can’t say this had a big impact on my work or that it caused me to go find a fix; rather, just in the normal process of ramping up on Unity, I discovered this.

Performance

This might be where I learned the most. Laying out a scene using Unity’s terrain tools is pretty straightforward. Adding assets is also pretty straightforward. Before I get called out, I didn’t say it was straightforward to do a GOOD job; I’m just saying that you can easily figure out how to lay things out. While I think my Pirates Cove scene is a thing of beauty, others would scoff, and rightfully so. But, it was my first time and I was not trying to create a first-person shooter level. This is a faux visual experience.

FPS: Having talked with people about VR, I learned that the target FPS for VR is 90. I had initially tried to use the Unity Stats window. After talking with others on the Unity forum, I found out that this is not the best tool for measuring true FPS performance. I was referred instead to an FPS Display script on the forum. Apparently it's more accurate.
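
For reference, a minimal frame-rate readout only takes a few lines. The sketch below is my own illustration (the SimpleFpsDisplay name is made up), not the forum script mentioned above; it just smooths Time.unscaledDeltaTime and draws the result with OnGUI:

using UnityEngine;

/// <summary>
/// Minimal FPS readout: smooths the frame time and draws the result on screen.
/// A rough sketch for illustration only.
/// </summary>
public class SimpleFpsDisplay : MonoBehaviour
{
    private float _smoothedDelta;

    void Update( )
    {
        // Exponentially smooth the unscaled frame time so the readout doesn't flicker.
        _smoothedDelta += ( Time.unscaledDeltaTime - _smoothedDelta ) * 0.1f;
    }

    void OnGUI( )
    {
        if( _smoothedDelta <= 0f )
            return;

        float fps = 1.0f / _smoothedDelta;
        GUI.Label( new Rect( 10, 10, 150, 25 ), Mathf.RoundToInt( fps ) + " FPS" );
    }
}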

Occlusion culling: This was an interesting situation. I was trying to figure out a completely non-related issue and a co-worker came over to help me out. We got to talking about FPS and things you can do to help rendering. He introduced me to occlusion culling. He was showing me a manual way to do it, where you define the sizes and shapes of the boxes. I must confess, I simply brought up Unity’s Occlusion Culling window and allowed it to figure out the occlusions on its own. This seemed to help with performance.

Vegetation: I didn’t realize that adding grass to the terrain was going to have such a heavy impact. I had observed other scenes that seemed to have a lot of grass swaying in the wind. Thus, I went hog wild; dropped FPS to almost 0 and brought Unity to its knees. Rather than deal with this, I simply removed the grass and used a clover-looking texture that still made my scene look nice, yet without all the draw calls.

How I Got Vive* Working in My Scene

As mentioned at the top of the article, I’m making the assumption that the reader already has Vive set up and running on their workstation. This area is a condensed version of an existing article that I found, the HTC Vive Tutorial for Unity. I’m not planning on going into any detail on grabbing items with the controllers; for this article I will stick to teleporting. I did modify my version of teleporting, not so much because I think mine is better, but rather, by playing with it and discovering things on my own.

Before you can do any HTC Vive* development, you must download and import the SteamVR plugin.

screenshot of Steam*VR plugin logo

Once the plugin has been imported you will see the following file structure in your project:

screenshot of file structure in unity

In my project, I created a root folder in the hierarchy called VR. In there, I copied the SteamVR and CameraRig prefabs. You don’t have to create a VR folder; I just like to keep my project semi-organized.

screenshot of folder organization

I did not need to do anything with the SteamVR plugin other than add it to my project hierarchy; instead, we will be looking at the CameraRig.

I placed the CameraRig in my scene where I wanted it to initially start. 

screenshot of game environment

After placing the SteamVR CameraRig prefab, I had to delete the main camera; this is to avoid conflicts. At this point, I was able to start my scene and look around. I was not able to move, but from a stationary point I could look around and see the controllers in my hand. You can’t go anywhere, but at least you can look around.

screenshot of game environment

Getting Tracked Objects

Tracked objects are both the hand controllers as well as the headset itself. For this code sample, I didn’t worry about the headset; instead, I needed to get input from the hand controllers. This is necessary for tracking things like button clicks, and so on.

First, we must get an instance of the tracked object that the script is on. In this case it will be the controller; this is done in the Awake function.

void Awake( )
{
    _trackedController = GetComponent<SteamVR_TrackedObject>( );
}

Then, when you want to test for input from one of the two hand controllers, you can select the specific controller by using the following Get function. It uses the tracked object (the hand controller) that this script is attached to:

private SteamVR_Controller.Device HandController
{
    get
    {
        return SteamVR_Controller.Input( ( int )_trackedController.index );
    }
}

Creating a Teleport System

Now we want to add the ability to move around in the scene. To do this, I had to create a script that knew how to read the input from the hand controllers. I created a new Unity C# script and named it TeleportSystem.cs.

Not only do we need a script but we need a laser pointer, and in this specific case, a reticle. A reticle is not mandatory by any means but does add a little flair to the scene because the reticle can be used as a visual feedback tool for the user. I will just create a very simple circle with a skull image on it.

Create the Laser

The laser was created by throwing a cube into the scene; high enough in the scene so that it didn’t interfere with any of the other assets in that scene. From there I scaled it to x = 0.005, y = 0.005, and z = 1. This gives it a long, thin shape.

screenshot of model in the unity envrionment

After the laser was created, I saved it as a prefab and removed the original cube from the scene because the cube was no longer needed.

Create the Reticle

I wanted a customized reticle at the end of the laser; not required, but cool nonetheless. I created a prefab that is a circle mesh with a decal on it.

screenshot of pirate flag decal

screenshot of the unity inspector panel

Setting Up the Layers

This is an important step. You have to tell your laser/reticle what is and what is not teleportable. For example, you may not want to give the user the ability to teleport onto water, or you may not want to allow them to teleport onto the side of a cliff. You can restrict them to specific areas in your scene by using layers. I created two layers—Teleportable and NotTeleportable.

screenshot of game layers in unity

Things that are teleportable, like the terrain itself, the grass huts, and the stairs, I put on the Teleportable layer. Things like the cliffs, or other items in the scene that I don’t want a user to teleport to, I put on the NotTeleportable layer.

When defining my variables, I defined two layer masks. One mask just had all layers in it. The other, a non-teleportable mask, indicates which layers are not supposed to be teleportable.

// Layer mask to filter the areas on which teleports are allowed
public LayerMask _teleportMask;

// Layer mask specifies which layers we can NOT teleport to.
public LayerMask _nonTeleportMask;

Because the layer masks are public, they show up on the script component in the Inspector as drop-down lists that let you pick and choose which layers you do not want someone teleporting to.

screenshot of unity CSharp script
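
If you prefer to assign the masks in code rather than in the Inspector, Unity's LayerMask.GetMask can build a mask from layer names. This is only a sketch of an alternative, using the NotTeleportable layer name created above; the lines could live in Awake( ), next to the controller lookup:

// Sketch only: build the masks in code instead of assigning them in the Inspector.
_teleportMask    = Physics.AllLayers;                       // raycast against every layer
_nonTeleportMask = LayerMask.GetMask( "NotTeleportable" );  // layers we refuse to teleport to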

Setting up the layers works in conjunction with the LayerMatchTest function.

/// <summary>
/// Checks to see if a GameObject is on a layer in a LayerMask.
/// </summary>
/// <param name="layers">Layers we don't want to teleport to</param>
/// <param name="objInQuestion">Object that the raytrace hit</param>
/// <returns> true if the provided GameObject's Layer matches one of the Layers in the provided LayerMask.</returns>
private static bool LayerMatchTest( LayerMask layers, GameObject objInQuestion )
{
    return( ( 1 << objInQuestion.layer ) & layers ) != 0;
}

When LayerMatchTest() is called, I’m sending the layer mask that has the list of layers I don’t want people teleporting to, and the game object that the HitTest detected. This test will see if that object is or is not in the non-teleportable layer list.
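
To make the bit arithmetic concrete, here is a small, hypothetical example (the layer number is made up for illustration). If NotTeleportable happened to be layer 9 and the ray hit a cliff on that layer, the test would evaluate to true:

// Hypothetical values for illustration only.
int nonTeleportMask = 1 << 9;   // mask containing just layer 9 (NotTeleportable)
int cliffLayer      = 9;        // layer index of the object the ray hit

// ( 1 << 9 ) & ( 1 << 9 ) is non-zero, so the cliff matches the "do not teleport" mask.
bool onForbiddenLayer = ( ( 1 << cliffLayer ) & nonTeleportMask ) != 0;   // true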

Updating Each Frame

void Update( )
{
    // If the touchpad is held down
    if ( HandController.GetPress( SteamVR_Controller.ButtonMask.Touchpad ) )
    {
        _doTeleport = false;

        // Shoot a ray from controller. If it hits something make it store the point where it hit and show
        // the laser. Takes into account the layers which can be teleported onto
        if ( Physics.Raycast( _trackedController.transform.position, transform.forward, out _hitPoint, 100, _teleportMask ) )
        {
            // Determine if we are pointing at something which is on an approved teleport layer.
            // Notice that we are sending in layers we DON'T want to teleport to.
            _doTeleport = !LayerMatchTest( _nonTeleportMask, _hitPoint.collider.gameObject );

            if( _doTeleport )
            {
                PointLaser( );
            }
            else
            {
                DisplayLaser( false );
            }
        }
    }
    else
    {
        // Hide _laser when player releases touchpad
        DisplayLaser( false );
    }
    if( HandController.GetPressUp( SteamVR_Controller.ButtonMask.Touchpad ) && _doTeleport )
    {
        TeleportToNewPosition();
        DisplayLaser( false );
    }
}

On each update, the code will test to see if the controller’s touchpad button was pressed. If so, I’m getting a Raycast hit. Notice that I’m sending my teleport mask that has everything in it. I then do a layer match test on the hit point. By calling the LayerMatchTest function we determine whether it hit something that is or is not teleportable. Notice that I’m sending the list of layers that I do NOT want to teleport to. This returns a Boolean value that is then used to determine whether or not we can teleport.

If we can teleport, I then display the laser using the PointLaser function. In this function, I’m telling the laser prefab to look in the direction of the HitTest. Next, we stretch (scale) the laser prefab from the controller to the HitTest location. At the same time, I reposition the reticle at the end of the laser.

private void PointLaser( )
{
    DisplayLaser( true );

    // Position laser between controller and point where raycast hits. Use Lerp because you can
    // give it two positions and the % it should travel. If you pass it .5f, which is 50%
    // you get the precise middle point.
    _laser.transform.position = Vector3.Lerp( _trackedController.transform.position, _hitPoint.point, .5f );

    // Point the laser at position where raycast hits.
    _laser.transform.LookAt( _hitPoint.point );

    // Scale the laser so it fits perfectly between the two positions
    _laser.transform.localScale = new Vector3( _laser.transform.localScale.x,
                                            _laser.transform.localScale.y,
                                            _hitPoint.distance );

    _reticle.transform.position = _hitPoint.point + _VRReticleOffset;
}

If the HitTest is pointing to a non-teleportable layer, I ensure that the laser pointer is turned off via the DisplayLaser function.
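
DisplayLaser itself is just a small helper that toggles the instantiated laser and reticle together; it also appears in the full script at the end of the article:

private void DisplayLaser( bool showIt )
{
    // Show or hide the laser and reticle instances together
    _laser.SetActive( showIt );
    _reticle.SetActive( showIt );
}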

At the end of the Update function, if the touchpad has just been released AND the _doTeleport variable is true, I call the TeleportToNewPosition function to teleport the user to the new location.

private void TeleportToNewPosition( )
{
    // Calculate the difference between the positions of the camera rig's center and the player's head.
    Vector3 difference = _VRCameraTransform.position - _VRHeadTransform.position;

    // Reset the y-position for the above difference to 0, because the calculation doesn’t consider the
    // vertical position of the player’s head
    difference.y = 0;

    _VRCameraTransform.position =  _hitPoint.point + difference;
}

In Closing

This is pretty much how I got my scene up and running. It involved a lot of discovering things on the Internet, reading other people’s posts, and a lot of trial and error. I hope that you have found this article useful, and I invite you to contact me.

For completeness, here is the full script:

using System.Collections;
using System.Collections.Generic;
using UnityEngine;


/// <summary>
/// Used to teleport the player's location in the scene. Attach to SteamVR's CameraRig/Controller (left) and Controller (right)
/// </summary>
public class TeleportSystem : MonoBehaviour
{
    // The controller itself
    private SteamVR_TrackedObject _trackedController;

    // SteamVR CameraRig transform
    public Transform _VRCameraTransform;

    // Reference to the laser prefab set in the Inspector
    public GameObject _VRLaserPrefab;

    // Ref to teleport reticle prefab set in the Inspector
    public GameObject _VRReticlePrefab;

    // Stores a reference to an instance of the laser
    private GameObject _laser;

    // Ref to instance of reticle
    private GameObject _reticle;

    // Ref to players head (the camera)
    public Transform _VRHeadTransform;

    // Reticle offset from the ground
    public Vector3 _VRReticleOffset;

    // Layer mask to filter the areas on which teleports are allowed
    public LayerMask _teleportMask;

    // Layer mask specifies which layers we can NOT teleport to.
    public LayerMask _nonTeleportMask;

    // True when a valid teleport location is found
    private bool _doTeleport;

    // Location where the user is pointing the hand held controller and releases the button
    private RaycastHit _hitPoint;




    /// <summary>
    /// Gets the tracked object. A tracked object can be either a controller or the headset,
    /// but because this script is attached to a hand controller, this will only ever
    /// return the hand controller.
    /// </summary>
    void Awake( )
    {
        _trackedController = GetComponent<SteamVR_TrackedObject>( );
    }


    /// <summary>
    /// Initialize the two prefabs
    /// </summary>
    void Start( )
    {
        // Spawn prefabs and initialize the class's _hitPoint

        _laser      = Instantiate( _VRLaserPrefab );
        _reticle    = Instantiate( _VRReticlePrefab );

        _hitPoint   = new RaycastHit( );

    }


    /// <summary>
    /// Checks to see if the player is holding down the touchpad button and, if so, whether they are trying to teleport to a valid location
    /// </summary>
    void Update( )
    {
        // If the touchpad is held down
        if ( HandController.GetPress( SteamVR_Controller.ButtonMask.Touchpad ) )
        {
            _doTeleport = false;

            // Shoot a ray from controller. If it hits something make it store the point where it hit and show
            // the laser. Takes into account the layers which can be teleported onto
            if ( Physics.Raycast( _trackedController.transform.position, transform.forward, out _hitPoint, 100, _teleportMask ) )
            {
                // Determine if we are pointing at something which is on an approved teleport layer.
                // Notice that we are sending in layers we DON'T want to teleport to.
                _doTeleport = !LayerMatchTest( _nonTeleportMask, _hitPoint.collider.gameObject );

                if( _doTeleport )
                    PointLaser( );
                else
                    DisplayLaser( false );
            }
        }
        else
        {
            // Hide _laser when player releases touchpad
            DisplayLaser( false );
        }
        if( HandController.GetPressUp( SteamVR_Controller.ButtonMask.Touchpad ) && _doTeleport )
        {
            TeleportToNewPosition( );
            DisplayLaser( false );
        }
    }


    /// <summary>
    /// Gets the specific hand controller this script is attached to (left or right)
    /// </summary>
    private SteamVR_Controller.Device HandController
    {
        get
        {
            return SteamVR_Controller.Input( ( int )_trackedController.index );
        }
    }


    /// <summary>
    /// Checks to see if a GameObject is on a layer in a LayerMask.
    /// </summary>
    /// <param name="layers">Layers we don't want to teleport to</param>
    /// <param name="objInQuestion">Object that the raytrace hit</param>
    /// <returns>true if the provided GameObject's Layer matches one of the Layers in the provided LayerMask.</returns>
    private static bool LayerMatchTest( LayerMask layers, GameObject objInQuestion )
    {
        return( ( 1 << objInQuestion.layer ) & layers ) != 0;
    }


    /// <summary>
    /// Displays the laser and reticle
    /// </summary>
    /// <param name="showIt">True to show the laser and reticle, false to hide them</param>
    private void DisplayLaser( bool showIt )
    {
        // Show _laser and reticle
        _laser.SetActive( showIt );
        _reticle.SetActive( showIt );
    }



    /// <summary>
    /// Displays the laser prefab and stretches it out to where the raycast hit
    /// </summary>
    private void PointLaser( )
    {
        DisplayLaser( true );

        // Position laser between controller and point where raycast hits. Use Lerp because you can
        // give it two positions and the % it should travel. If you pass it .5f, which is 50%
        // you get the precise middle point.
        _laser.transform.position = Vector3.Lerp( _trackedController.transform.position, _hitPoint.point, .5f );

        // Point the laser at position where raycast hits.
        _laser.transform.LookAt( _hitPoint.point );

        // Scale the laser so it fits perfectly between the two positions
        _laser.transform.localScale = new Vector3( _laser.transform.localScale.x,
                                                    _laser.transform.localScale.y,
                                                    _hitPoint.distance );

        _reticle.transform.position = _hitPoint.point + _VRReticleOffset;
    }



    /// <summary>
    /// Calculates the difference between the cameraRig and head position. This ensures that
    /// the head ends up at the teleport spot, not just the cameraRig.
    /// </summary>
    /// <returns></returns>
    private void TeleportToNewPosition( )
    {
        Vector3 difference = _VRCameraTransform.position - _VRHeadTransform.position;
        difference.y = 0;
        _VRCameraTransform.position =  _hitPoint.point + difference;
    }
}

About the Author

Rick Blacker works in the Intel® Software and Services Group. His main focus is on virtual reality application development.

Introducing: Movidius™ Neural Compute Stick


The Movidius™ Neural Compute Stick, a new device for developing and deploying deep learning algorithms at the edge, was recently announced and is now available. The Intel® Movidius team created the Neural Compute Stick (NCS) to make deep learning application development on specialized hardware even more widely available.

The NCS is powered by the same low-power Movidius Vision Processing Unit (VPU) that can be found in millions of smart security cameras, gesture-controlled autonomous drones, and industrial machine vision equipment, for example. The convenient USB stick form factor makes it easier for developers to create, optimize and deploy advanced computer vision intelligence across a range of devices at the edge.

The USB form factor easily attaches to existing hosts and prototyping platforms, while the VPU inside provides machine learning on a low-power deep learning inference engine. You start using the NCS with a trained Caffe* framework-based feed-forward Convolutional Neural Network (CNN), or you can choose one of our example pre-trained networks. Then, using our Toolkit, you can profile the neural network and compile a tuned version ready for embedded deployment using our Neural Compute Platform API.

Here are some of its key features:

  • Supports CNN profiling, prototyping, and tuning workflow

  • All data and power provided over a single USB Type A port

  • Real-time, on device inference – cloud connectivity not required

  • Run multiple devices on the same platform to scale performance

  • Quickly deploy existing CNN models or uniquely trained networks

 

The Intel Movidius team is inspired by the incredible sophistication of the human brain’s visual system, and we would like to think we’re getting a little closer to matching its capabilities with our new Neural Compute Stick.

To get started, you can visit developer.movidius.com for more information.


Getting Started in Linux with Intel® SDK for OpenCL™ Applications


This article is a step-by-step guide to quickly get started developing with Intel® SDK for OpenCL™ Applications and the Linux SRB5 driver package.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

For SRB4.1 instructions, please see https://software.intel.com/en-us/articles/sdk-for-opencl-gsg-srb41.

Step 1: Install the driver

 

This script covers the steps needed to install the SRB5 driver package in Ubuntu 14.04, Ubuntu 16.04, CentOS 7.2, and CentOS 7.3.

 

To use

$ mv install_OCL_driver.sh_.txt install_OCL_driver.sh
$ chmod 755 install_OCL_driver.sh
$ sudo su
$ ./install_OCL_driver.sh install

This script automates downloading the driver package, installing prerequisites and user-mode components, patching the 4.7 kernel, and building it. 

You can check your progress with the System Analyzer Utility. If successful, you should see smoke test results looking like this at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

 

Experimental installation without kernel patch or rebuild:

If you are using Ubuntu 16.04 with the default 4.8 kernel, you may be able to skip the kernel patch and rebuild steps. This configuration works fairly well, but several features (e.g. OpenCL 2.x device-side enqueue and shared virtual memory, VTune GPU support) require the patches. Installation without the patches has been "smoke test" validated to confirm it is viable, but it is suggested for experimental use only and is not fully supported or certified.

 

Step 2: Install the SDK

This script will set up all prerequisites for a successful SDK install on Ubuntu.

$ mv install_SDK_prereq_ubuntu.sh_.txt install_SDK_prereq_ubuntu.sh
$ sudo su
$ ./install_SDK_prereq_ubuntu.sh

After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command line compiler ioc64 is installed with

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!

 

Step 3: Set up Eclipse

Intel SDK for OpenCL applications works with Eclipse Mars and Neon.

After installing, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

After teaching a tutorial, I’m going to go see the high-fidelity motion blurring at SIGGRAPH from the Intel Embree/OSPRay engineers


When I’m at SIGGRAPH, I’m planning to visit some friends in the Intel booth to see their new motion blurring technology. I’m sure people more knowledgeable in the field will find even more interesting developments from these engineers, who are helping drive software defined visualization work, in particular with their Embree open source ray tracing library and the OSPRay open source rendering engine for high-fidelity visualization (built on top of Embree). Key developers will be in the Intel booth at SIGGRAPH, and they have a couple of papers in the High-Performance Graphics (HPG) conference that is co-located with SIGGRAPH.

The Embree/OSPRay engineers have two interesting papers they will present at HPG, both on Saturday, July 29, in “Papers Session 3: Acceleration Structures for Ray Tracing”: one on partial re-braiding and one titled “A Spatial-Temporal BVH for Efficient Multi-Segment Motion Blur,” which I discuss below.

Then there is the SDVis machine (I refer to it as a “dream machine for software visualization”, but they simply call it the “Software Defined Visualization Appliance”); one such machine will be in the Intel booth at SIGGRAPH. I did not go to see it at ISC in Germany, where they showed off HPC-related visualization work with the hot topic being “in situ visualization.” At SIGGRAPH, they will have demos around high-fidelity (realistic) visualization, specifically demonstrating Embree's novel approach to handling multi-segment motion blur and OSPRay's photorealistic renderer interactively rendering a scene. These demos relate to their HPG papers.

Showing the motion blur approach: original (left) and with blurring (right)
original image (CC) caminandes.com

I’m sure the partial re-braiding is amazing, but it’s the blurring that caught my attention.

First of all, theoretically blurring is not needed. With a super high framerate and amazing resolution, the scene would just appear to us like real life. At least, I think that’s right.

However, with realistic framerates and resolutions we detect a scene as being unrealistic when blurring is not there.  In fact, in some cases, spokes on wheels appear to go backwards.

The solution? Blurring. But, like many topics, what seems simple enough is not. A simple algorithm might be to take adjacent frames and create a blur based on changes. Perhaps do this on a higher framerate visualization as you sample it down to the target framerate for your final production. Unfortunately, this approach is not efficient because the geometry has to be processed multiple times per frame, and adaptive noise reduction on parts of the image is not possible.

That’s where “A Spatial-Temporal BVH for Efficient Multi-Segment Motion Blur” kicks in. These engineers took a different approach in which they pay attention to the actual motion of objects. Imagine a scene with a helicopter blade turning around-and-around while a bird flies through the scene in something much closer to a straight line. Their method comprehends the actual motion and creates blurring based on that. Of course, doing this with both high performance and high fidelity is what really makes their work valuable. In the example images above, the train blurring varies in a realistic fashion.

If you want to read a better description of their work, and their comparisons with previous work, you should read their paper and/or visit them at HPG, or at SIGGRAPH in the Intel booth.

I hope to see some of you at SIGGRAPH.  I’m co-teaching a tutorial “Multithreading for Visual Effects” on Monday at 2pm. After that, I’m running over to see the Embree/OSPRay folks in the Intel booth.

 

 

 

PIN Errors in 2017 Update 3 and 4 Analysis Tools


Problem:

As of July 28th, 2017, we have been receiving many reports of people who have been having problems with the analysis tools (Intel® VTune™ Amplifier, Advisor, and Inspector) as a result of a problem with PIN, the tool they use to instrument software.

PIN problems can produce several types of error. One of the more common ones is

__bionic_open_tzdata_path: PIN_CRT_TZDATA not set!

The PIN executable is located in the bin64 and/or bin32 folders in the installation directories of the analysis tools. You can test whether PIN is the source of your problems by running it on any executable. For example:

pin -- Notepad.exe

Solution:

On Windows*, certain virus checkers have been known to interfere with PIN. Excluding pin.exe from the virus checker may resolve the issue.

On Linux*, a recent security patch (CVE-2017-1000364) is causing problems with PIN. Intel® VTune™ Amplifier 2017 Update 4, available on the Registration Center, uses a new version of PIN which should fix these problems.

Intel® Advisor and Inspector have not yet received a patch. We apologize for the inconvenience, and we assure you we're working on getting it fixed as soon as possible. If PIN problems are causing a significant blockage of your work with Intel® Inspector or Advisor, please submit a ticket to the Online Service Center to let us know.

Optimizing Edge-Based Intelligence for Intel® HD Graphics


Background on AI and the Move to the Edge

Our daily lives intersect with artificial intelligence (AI)-based algorithms. With fits and starts, AI has been a domain of research over the last 60 years. Machine learning, including the many layers of deep learning, is propelling AI into all parts of modern life. Its applied usages are varied, from computer vision for identification and classification, to natural language processing, to forecasting. These base-level tasks then lead to higher-level tasks such as decision making.

What we call deep learning is typically associated with servers, the cloud, or high-performance computing. While AI usage in the cloud continues to grow, there is a trend toward running AI inference engines on the edge (i.e. PCs, IoT devices, etc.). Having devices perform machine learning locally, versus relying solely on the cloud, is a trend driven by the need to lower latency, ensure persistent availability, reduce costs (for example, the cost of running inferencing algorithms on servers), and address privacy concerns. Figure 1 shows the phases of deep learning.

Image of a map
Figure 1. Deep learning phases

Moving AI to consumers– Personify* is an expert system performing real-time image segmentation. Personify enabled real-time segmentation within the Intel® RealSense™ camera in 2015. In 2016, Personify launched ChromaCam*, an application that can remove/replace/blur the user's background in all major video chat apps like Microsoft Skype*, Cisco WebEx*, and Zoom* as well as streaming apps like OBS and XSplit*. ChromaCam uses deep learning to do dynamic background replacement in real-time and works with any standard 2D webcam commonly found on laptops.

Image of a person
Figure 2. Personify segmentation

One of Personify's requirements is to run the inference process for their deep learning algorithm on the edge as fast as possible. To get good segmentation quality, Personify needs to run the inference algorithm on the edge to avoid the unacceptable latencies of the cloud. The Personify software stack runs on CPUs and graphics processing units (GPUs), and was originally optimized for discrete graphics. However, running an optimized deep learning inference engine that requires a discrete GPU limits the application to a relatively small segment of PCs. Further, the segmentation effect should ideally be very efficient, since it will usually be used along with other intense applications such as gaming with segmentation, and most laptops are constrained in terms of total compute and system-on-chip (SOC) package thermal budget. Intel and Personify started to explore optimizing an inference engine on Intel HD Graphics with the goal of meeting these challenges and bringing this technology to the mainstream laptop.

We used Intel® VTune™ Amplifier XE to profile and optimize GPU performance for deep learning inference usage on Intel® 6th Generation Core™ i7 Processors with Intel HD Graphics 530.

Figure 3 shows the baseline execution profile for running the inference workload on a client PC. Even though the application is using the GPU for the deep learning algorithm, performance falls short of the requirements. The total compute time to run a non-optimized inference algorithm on Intel HD Graphics is about three seconds, and the GPU is stalled 70 percent of the time.

Image of a spreadsheet
Figure 3. Intel® VTune™ analyzer GPU profile for segmentation (baseline)

The two most important columns that stand out are total time in general matrix-to-matrix multiplication (GEMM) and execution unit (EU) stalls. Without optimization, the deep learning inference engine is very slow for image segmentation in a video conferencing scenario on Intel HD Graphics. Our task was to hit maximum performance from an Intel® GPU on a mainstream consumer laptop.

Optimization: Optimizing the matrix-to-matrix multiplication kernel and increasing EU active time were the top priorities.

Figure 4 shows the default pipeline of a convolutional neural network.

Image of map
Figure 4. Default deep learning pipeline

We identified several items to address for deep learning inferencing on Intel HD Graphics (Figure 5).

  • CPU Copies - The algorithm uses the CPU to copy data from the CPU to the GPU for processing at every deep learning layer.
  • GEMM Convolution Algorithm - Based on OpenCV* and OpenCL™.
  • Convert to Columns - Uses extra steps and needs extra memory.

Image of a map
Figure 5. Remove extra copies (using spatial convolution)

We replaced GEMM convolution with spatial convolution, which helped to avoid additional memory copies and produced code that was optimized for speed. We overcame dependencies on reading individual kernels in this architecture with an auto-tuning mechanism (see Figure 6).

Image of a map
Figure 6. New simple architecture

Result: Our testing on Intel® 6th Generation Core™ i7 Processors with Intel HD Graphics 530 shows 13x better performance in total time (2.8 seconds vs. 0.209 seconds) with about 69.6 percent GPU utilization, and an ~8.6x performance gain (1.495 seconds vs. 0.173 seconds) in the GEMM kernels (Figure 7 vs. Figure 3), thus improving real-time segmentation quality as the frame rate increases.

Image of properties
Figure 7. Intel® VTune™ analyzer GPU profile for segmentation (after optimization)

To support mobile usage, battery life is another important metric for a laptop. Bringing high-performance deep learning algorithms to a client at the cost of higher power consumption impacts the user experience. We analyzed estimated battery power consumption using the Intel® Power Gadget tool during a 120-second video conference workload.

Power Utilization for Personify Application

  • GPU Power Utilization 5.5W
  • Package SOC Power 11.28W

Summary: We are witnessing a reshuffling of compute partitioning between cloud data centers and clients, in favor of moving deep learning model deployment to the client. Local model deployment has the advantages of reduced latency overhead and of keeping personal data local to the device, thus protecting privacy. Intel processor-based platforms provide high-end CPUs and GPUs that enable inference engine applications on clients across a large consumer-based ecosystem.

Intel® SDK for OpenCL™ Applications - Release Notes


This page provides the current Release Notes for Intel® SDK for OpenCL™ Applications. All files are in PDF format - Adobe Reader* (or compatible) required.

For questions or technical support, visit the OpenCL SDK support forum.

Intel® SDK for OpenCL™ Applications

What's New? | Release Notes

  • What's new? Intel® SDK for OpenCL™ Applications 2016, R3 | Intel® SDK for OpenCL™ Applications 2016, R3 Release Notes (English)
  • What's new? Intel® SDK for OpenCL™ Applications 2016, R2 | Intel® SDK for OpenCL™ Applications 2016, R2 Release Notes (English)
  • What's new? Intel® SDK for OpenCL™ Applications 2016 | Intel® SDK for OpenCL™ Applications 2016 Release Notes (English)
  • What's new? Intel® SDK for OpenCL™ Applications 2015 R3 | Intel® SDK for OpenCL™ Applications 2015 Release Notes (English)
  • What's new? OpenCL™ Code Builder 2015 R2 | OpenCL™ Code Builder for Intel® Integrated Native Developers Experience (Intel® INDE) Release Notes (English)
  • What's new? OpenCL™ Code Builder 2015 R1 | OpenCL™ Code Builder for Intel® Integrated Native Developers Experience (Intel® INDE) Release Notes (English)
  • What's new? Beta | Beta Release Notes (English)
