
Intel® Distribution for Python* versus Non-Optimized Python: Breast Cancer Classification


Abstract

This case study compares the performance of the Intel® Distribution for Python* to that of non-optimized Python on a breast cancer classification task. The comparison was done using machine learning algorithms from the scikit-learn* package in Python.

Introduction

Cancer refers to cells that grow out of control and invade other tissues. This process can also result in a tumor, where there is more cell growth than cell death. There are various types of cancer including bladder cancer, kidney cancer, lung cancer, and breast cancer. Currently, breast cancer is one of the most prevalent types of cancer, especially in women. It occurs when the cells in the breast divide and grow uncontrollably. Early detection of breast cancer can save lives. Causes of cancer include inherited genes, hormones, and an individual’s lifestyle.

This article provides a comparative study between the performance of non-optimized Python* and the Intel® Distribution for Python using breast cancer classification as an example. The classifiers used for breast cancer classification were taken from the scikit-learn* package in Python. The time and accuracy of each classifier for each distribution was calculated and compared.

Dataset Description

The dataset for this study can be accessed from the Breast Cancer Wisconsin (Diagnostic) Data Set. The features of this dataset were computed from digitized images of fine needle aspirates of breast masses, provided in CSV format, and describe the characteristics of the cell nuclei present in each image. These values were used as the features for classification. Using these features, a breast mass can be classified into two classes: benign and malignant. Benign refers to a tumor that is not cancerous, whereas a malignant tumor is cancerous. Observing the class distribution, there were 357 benign and 212 malignant data rows. The classification is based on the diagnosis field, which has values M or B, where M denotes malignant and B denotes benign. Hence, this is a binary classification problem.

Hardware Configuration

The experiment used Intel® architecture with the following hardware configuration:

Feature: Specification
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 87
Model name: Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
Stepping: 1
CPU MHz: 1375.917
BogoMIPS: 2593.85
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-255

Software Configuration

The following are the software dependencies used to perform this classification:

Software: Version
Python*: 2.7.13
scikit-learn*: 0.18.2
Anaconda*: 4.3.25

Classifier Implementation Pipeline

The goal was to identify the class (M or B) to which the tumor belonged. The following block diagram shows the classification steps, explained in the following section, for both the Intel Distribution for Python and non-optimized Python.

[Block diagram: classification steps for the Intel Distribution for Python and non-optimized Python]

Implementation

The scikit-learn Python library provides a wide variety of machine learning algorithms for classification. Ten classifiers from the package were used for the study: Decision Tree Classifier, Gaussian NB, SGD Classifier, SVC, KNeighbors Classifier, OneVsRest Classifier, Quadratic Discriminant Analysis (QDA), Random Forest Classifier, MLP Classifier, and AdaBoost Classifier.

Create a Python file called classifier_ml.py. The following steps are implemented in this file (a minimal illustrative sketch of the file appears below):

  1. The input data described in the Dataset Description section is given for preprocessing.
  2. As part of preprocessing, the dataset is checked for categorical values (if any), which are converted to numerical data. This is performed using a technique called one-hot encoding. This is important because a few classifiers in scikit-learn work only with numerical values. Here, the diagnosis field containing the values "M" and "B" is converted to 1 and 0, respectively. Columns such as "id" are irrelevant for classification and hence can be dropped.
  3. After preprocessing, all the columns except the diagnosis field are considered the features, and the diagnosis column is taken as the target.
  4. 70 percent of the data is used for training and 30 percent is used for testing. The split is done using the StratifiedShuffleSplit function from scikit-learn [1].
  5. Keeping the default environment intact, the accuracy of each classifier is recorded using the scikit-learn package of Python [2].
  6. The file classifier_ml.py is now executed. The time taken (t_nop) is measured as a 10-run average for better accuracy as follows:

time(cmd="python classifier_ml.py"; for i in $(seq 10); do $cmd; done)

Steps 1 through 6 provide the time and accuracy values for non-optimized Python. Repeat these steps for the Intel Distribution for Python. The time (t_idp) and accuracy are calculated.
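For reference, the following is a minimal sketch of what classifier_ml.py might contain. The CSV file name, the exact column names, and the choice of only two of the ten classifiers are illustrative assumptions (the remaining classifiers follow the same fit/predict pattern), and the split uses the model_selection module referenced in [1].

import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the Breast Cancer Wisconsin (Diagnostic) data; "data.csv" is a placeholder path.
data = pd.read_csv("data.csv")
data = data.drop("id", axis=1)                                # drop columns irrelevant to classification
data["diagnosis"] = data["diagnosis"].map({"M": 1, "B": 0})   # encode the target numerically

X = data.drop("diagnosis", axis=1).values                     # features: every column except diagnosis
y = data["diagnosis"].values                                  # target: diagnosis

# 70/30 stratified train/test split, as described in step 4
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Two of the ten classifiers, shown as examples; the others are trained the same way
for clf in (DecisionTreeClassifier(), GaussianNB()):
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print("%s accuracy: %.2f%%" % (type(clf).__name__, 100 * acc))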

To enable the Intel Distribution for Python, follow the steps given in Installing Intel® Distribution for Python* and Intel® Performance Libraries with Anaconda*.

The results are shown in Table 1.

The accuracy values for each classifier are the same for both non-optimized Python and the Intel Distribution for Python. Therefore, the accuracy values listed in Table 1 are common for both distributions.

Performance Gain percentage with respect to time is calculated by the given formula:

Performance Gain % = (t_nop - t_idp) / t_nop * 100

From the formula, it is clear that a positive Performance Gain percentage indicates better performance for the Intel Distribution for Python. The higher the value, the better the performance of the Intel Distribution for Python compared to non-optimized Python.
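For example, with illustrative timings of t_nop = 10.0 seconds for non-optimized Python and t_idp = 6.5 seconds for the Intel Distribution for Python, Performance Gain % = (10.0 - 6.5) / 10.0 * 100 = 35 percent.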

Results

Table 1 shows the performance gain percentage of the Intel Distribution for Python* compared to non-optimized Python, along with the accuracy of each classifier.

Table 1: Gain percentage: non-optimized Python* versus Intel® Distribution for Python*

Classifier | Accuracy (Percent) | Performance Gain Percentage
DecisionTreeClassifier | 90.64 | 34.69
GaussianNB | 94.74 | 35.01
SGDClassifier | 88.89 | 33.04
SVC | 94.74 | 32.29
KNeighborsClassifier | 92.98 | 34.35
OneVsRestClassifier | 92.40 | 33.00
QuadraticDiscriminantAnalysis | 94.15 | 33.65
RandomForestClassifier | 93.57 | 30.36
MLPClassifier | 65.50 | 32.09
AdaBoostClassifier | 94.74 | 27.09

Conclusion

The performance gain clearly shows that the Intel Distribution for Python had better performance in terms of the time taken for execution as compared to non-optimized Python. The accuracy remained the same as expected and did not change whether non-optimized Python or the Intel Distribution for Python was used.

References

  1. Cross Validation - Stratified Shuffle Split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html
  2. An introduction to machine learning with scikit-learn: http://scikit-learn.org/stable/tutorial/basic/tutorial.html

Large Matrix Operations with SciPy* and NumPy*: Tips and Best Practices


Introduction

Large matrix operations are the cornerstones of many important numerical and machine learning applications. In this article, we provide some recommendations for using operations in Scipy or Numpy for large matrices with more than 5,000 elements in each dimension.

General Advice for Setting up Python*

Use the latest version of Intel® Distribution for Python* (version 2018.0.0 at the time of this writing), and preferably Python (version 3.6) for better memory deallocation and timing performance. This means downloading the Miniconda* installation script (that is, Miniconda3-latest-Linux-x86_64.sh) with Python (version 3.6). After downloading the script, the following lines are some example bash commands that you can use to execute the script for the installation of conda*, a Python package manager that we will use throughout this guide:

INSTALLATION_DIR=$HOME/miniconda3
bash <DOWNLOAD_PATH>/Miniconda3-latest-Linux-x86_64.sh -b -p $INSTALLATION_DIR -f
CONDA=${INSTALLATION_DIR}/bin/conda
ACTIVATE=${INSTALLATION_DIR}/bin/activate

For the installation of the Python packages from Intel, you can create a conda environment called idp as follows:

$CONDA create -y -c intel -n idp intelpython3_core python=3 

To activate the conda environment for use, run

source $ACTIVATE idp

Then you will be able to use the installed packages from the Intel Distribution for Python.

Tips for Using the Matrix Operations

These recommendations may help you obtain faster computational performance for large matrix operations on compatible Intel® processors. From our benchmarks, we see great speedups of these large matrix operations when used in parallel on the Intel processors that belong to the server class, such as the Intel® Xeon® processors and Intel® Xeon Phi™ processors. These speedups are a result of the parallel computation at the multithreading layer of the Numpy and Scipy libraries.

We based the five tips described below on the performance observed for the matrix operation benchmarks, which are version controlled. After activating the conda environment for the Intel Distribution for Python, you can install our benchmarks following these steps for the bash command line:

git clone https://github.com/IntelPython/ibench.git
cd ibench
python setup.py install

The example SLURM job script for running the benchmark on an Intel Xeon Phi processor (formerly code-named Knights Landing) processor can be found at the following link, while the example SLURM job script for an Intel Xeon processor can be found at the following link.

Tip 1: Tune the relevant environmental variable settings

To make the best use of the multithreading capabilities of server-class Intel processors, we recommend you tune the threading and memory allocation settings accordingly. Factors that can affect the performance of the matrix operations include:

  • Shape and size of the input matrix
  • Matrix operation used
  • Amount of computational resources on the system
  • Usage pattern of computational resource of the specific code base that runs the matrix operation(s)

For Intel® Xeon® Phi™ processors

Some example (bash environmental) settings are listed below as a baseline for tuning the performance.

The first set of parameters determines how many threads will be used.

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS / 4  ))

The parameters OMP_NUM_THREADS and MKL_NUM_THREADS specify the number of threads for the matrix operations. On an Intel Xeon Phi processor node dedicated for the computation job, we recommend using one thread per available core as a starting point for tuning the performance.

The second set of parameters specifies the behavior of each thread.

export KMP_BLOCKTIME=800
export KMP_AFFINITY=granularity=fine,compact
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t

The parameter KMP_BLOCKTIME specifies how long a thread should stay active after the completion of a compute task. When KMP_BLOCKTIME is longer than the default value of 200 milliseconds, there will be less overhead for waking up the thread for subsequent computation(s).

The KMP_AFFINITY parameter dictates the placement of neighboring OpenMP thread context. For instance, the setting of compact assigns the OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where the <n> OpenMP thread was placed. A detailed guide for KMP_AFFINITY setting can be found at this link.

The KMP_HW_SUBSET setting specifies the number of threads placed onto each processor core.

The last parameter controls the memory allocation behavior.

export HPL_LARGEPAGE=1

The parameter HPL_LARGEPAGE enables large page size for the memory allocation of data objects. Having a huge page size can enable more efficient memory patterns for large matrices due to better translation lookaside buffer behavior.

For Intel® Xeon® processors (processors released with a code-name newer than Haswell)

Some (bash environmental) example settings are:

export NUM_OF_THREADS=$(grep 'model name' /proc/cpuinfo | wc -l)
export OMP_NUM_THREADS=$(( $NUM_OF_THREADS ))
export MKL_NUM_THREADS=$(( $NUM_OF_THREADS ))
export KMP_HW_SUBSET=${OMP_NUM_THREADS}c,1t
export HPL_LARGEPAGE=1
export KMP_BLOCKTIME=800

Tip 2: Use Fortran* memory layout for Numpy arrays (assuming matrix operations are called repeatedly)

Numpy arrays can use either Fortran memory layout (column-major order) or C memory layout (row-major order). Many Numpy and Scipy APIs are implemented with LAPACK and BLAS, which require Fortran memory layout. If a C memory layout Numpy array is passed to a Numpy or Scipy API that uses Fortran order internally, it will perform a costly transpose first. If a Numpy array is used repeatedly, convert it to Fortran order before the first use.

The Python function that can enable this memory layout conversion is numpy.asfortranarray. Here is a short code example:

import numpy as np
matrix_input = np.random.rand(5000, 5000)
matrix_fortran = np.asfortranarray(matrix_input, dtype=matrix_input.dtype)
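To confirm that the conversion produced a Fortran-ordered array, you can inspect the array's flags (an optional sanity check):

print(matrix_input.flags['F_CONTIGUOUS'])     # False: the original array uses C (row-major) order
print(matrix_fortran.flags['F_CONTIGUOUS'])   # True: the converted copy uses Fortran (column-major) order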

Tip 3: Save the result of a matrix operation in the input matrix (kwargs: overwrite_a=True)

It is natural to obtain large outputs from matrix operations that have large matrices as inputs. The creation of additional data structures can add overhead.

Many Scipy matrix linear algebra functions have an optional parameter called overwrite_a, which can be set to True. This option makes the function provide the result by overwriting an input instead of allocating a new Numpy array. Using the LU decomposition from the Scipy library as an example, we have:

import scipy.linalg
scipy.linalg.lu(a=matrix, overwrite_a=True)   # the contents of 'matrix' may be overwritten

Tip 4: Use the TBB or SMP Python modules to avoid the oversubscription of threads

Efficient parallelism is a known recipe for unlocking the performance on an Intel processor with more than one core. The reasoning about parallelism within Python, however, is sometimes less than transparent. Individual Python libraries may implement their own mechanism and level of parallelism. When a combination of Python libraries is used, it can result in an unintended usage pattern of the computational resources. For example, some Python modules (for example, multiprocessing) and libraries (for example, Scikit-Learn*) introduce parallelism into some functions by forking multiple Python processes. Each of these Python processes can spin up a number of threads as specified by the library.

Sometimes the number of threads can be bigger than the available amount of CPU resources in a way that is known as thread oversubscription. Thread oversubscription can slow down a computational job. Within the Intel Distribution for Python, several modules help to manage threads more efficiently by avoiding oversubscription. These are the Static Multi-Processing library (SMP) and the Python module for Intel® Threading Building Blocks (Intel® TBB).

In Table 1, Intel engineers Anton Gorshkov and Anton Malakhov provide recommendations for which module to use for a Python application based on the parallelism characteristics, in order to achieve good thread subscription for parallel Python applications. These recommendations can serve as a starting point for tuning the parallelism.

Table 1. Module recommendations based on parallelism characteristics.

Innermost parallelism level | Outermost: balanced work, low subscription | Outermost: balanced work, high subscription | Outermost: unbalanced work
Low subscription | Python | Python with SMP | Python with Intel® Threading Building Blocks
High subscription | KMP_COMPOSABILITY | KMP_COMPOSABILITY | KMP_COMPOSABILITY

Below we give the exact command line instructions for each entry in the table.

The SMP and Intel TBB modules can be easily installed in an Anaconda* or a Miniconda installation of Python using the conda installer commands:

conda install -c intel smp
conda install -c intel tbb

More specifically, we explain the bash commands for each possible scenario outlined in Table 1. Assuming the Python script we want to run is called PYTHONPROGRAM.py, the syntax for invoking it at a bash command line is:

python PYTHONPROGRAM.py                  # No change 

If we want to use the Python modules with the script, we invoke one of the Python modules by supplying one of them as the argument of the -m flag of the Python executable:

python -m smp PYTHONPROGRAM.py           # Use SMP with Python script
python -m tbb PYTHONPROGRAM.py           # Use TBB with Python script

Additionally, the TBB module has various options. For example, we can supply the -p flag to the TBB module to limit the maximum number of threads:

python -m tbb -p $MAX_NUM_OF_THREADS PYTHONPROGRAM.py

And for a Python script with high thread subscription in the inner parallelism level and unbalanced work, we can set the variable KMP_COMPOSABILITY for a bash shell as follows:

KMP_COMPOSABILITY=mode=exclusive python PYTHONPROGRAM.py

To read about additional resources for Intel TBB and SMP, we recommend the SciPy 2017 tutorial given by Anton Malakhov. Or you can find a tutorial at Unleash the Parallel Performance of Python* Programs.

Tip 5: Turn on transparent hugepages for memory allocation on Cray Linux*

The Linux* OS has a feature called hugepages that can accelerate programs with a large memory footprint by 50 percent or greater. For many Linux systems, the hugepages functionality is transparently provided by the kernel. Enabling hugepages for Python on Cray Linux requires some extra steps. Our team at Intel has verified that the presence of hugepages is needed to achieve the best possible performances for some large matrix operations (> 5000 x 5000 matrix size). These operations include LU, QR, SVD, and Cholesky decompositions.

First, on the relevant Cray system, check that the system hugepages module is loaded. For example, if Tcl* modules are used, you can find and load the system hugepages module via instructions similar to:

module avail craype-hugepages2M
module load craype-hugepages2M

You will also need to run the one-line installation instruction of the hugetlbfs package within a Conda Python environment or installation:

conda install -c intel hugetlbfs

After that, the Python binary within the conda environment will allocate the memory for objects using hugepages.

Finding Missing Kids through Code


Every year, hundreds of thousands of children go missing in the United States. Missing teenagers are at high risk of becoming victims of sex trafficking. Thorn, a nonprofit based in San Francisco, has had great success in developing software to aid law enforcement. Its tools have helped to recover thousands of victims of sex-trafficking and identify over 2,000 perpetrators. Through its Intel Inside®, Safer Children Outside program, Intel is working with Thorn to help improve the facial recognition component of a new Thorn application, the Child Finder Service* (CFS).

This series of articles describe how Intel is contributing to the facial recognition component of CFS. We hope that people from a broad range of software engineering backgrounds will find this material approachable and useful—even if you don’t have a background in machine learning—and that you'll be able to apply what you learn to your own work.

The Child Finder Service (CFS) is an application designed to support analysts as they work to recover missing children. A small number of human analysts must sift through large numbers of photographs of people—tens of thousands—almost all of which will turn out to be irrelevant. Using facial recognition to accelerate their work (sorting the images so that the best matches rise to the top of the list) makes sense, but comes with a lot of challenges. Published work on face recognition tends to use academic benchmarks like Labeled Faces in the Wild* (LFW). However, academic benchmarks tend to be based on photos of celebrities (simply because they are readily available and easy for humans to identify), which are not a perfect match for the images we need to work with.

  • These photos are usually taken by professionals, who use good lighting and proper focus, while our photos are usually taken by amateurs.
  • The subject is usually facing the camera head-on, while ours are often in sexualized poses, with a more oblique view of the face.
  • The camera chip is high-end with very low sensor noise, while most of our input images come from cell phones.
  • The subjects in these public datasets skew male. Sex trafficking affects both genders, but the vast majority of our subjects are female.
  • The ages are older than our target demographic.
  • The ethnicities aren’t a perfect match.

These differences—what a machine learning engineer might call a domain mismatch—mean that existing face recognition models perform much worse on our dataset than they do on the datasets on which they were originally trained. In order for the CFS to succeed, this performance gap needed to be addressed.

In the first article of this series, we describe how face extraction and recognition works. In subsequent articles, we'll talk about how we worked to get the maximum possible accuracy from our models.

Finding Faces

Before we can recognize a face, we have to find it. That is, we have to write code that can sift through the pixels of an image and detect a face, hopefully in a way that can cope with variations in lighting and even with objects that might obscure part of the face, such as glasses or a phone held up to take a selfie. Originally, engineers tried to write code to describe what a face looks like. For example, we would start with code intended to find circles or ellipses, and then look for circles of similar size (eyes) paired above ellipses (mouth), as shown in Figure 1.


Figure 1. Faces look like this!

Approaches like this do not work in the real world. This is because there is a huge variety in facial appearance when we photograph them, and the rules to describe what faces look like are difficult for humans to express as code: we are really good at knowing a face when we see it but terrible at explaining how we know it is a face (see Figure 2). This problem—that a domain expert's understanding of a task is implicit (so cannot readily be codified)—is actually extremely common when we try to develop a computer system to automate a task. If you think about where computers are really dominant, such as summing numbers for accountants, you’ll realize that this is a case where the task (arithmetic) involves explicit knowledge—the rules of arithmetic. Contrast this with a nonmathematical task like ironing a shirt, which is not really a complex task but yet excruciatingly difficult to express as code.


Figure 2. Pose, lighting, hair, expression, glasses. Don’t forget to code for all these!

Thorn's initial choice for face localization (finding a region of a photo that contains a face) was an implementation from the Open Source Computer Vision* (OpenCV*) library. This implementation works using a feature descriptor called Histograms of Oriented Gradients (HOGs): essentially, an image is represented in terms of how many edges point in different directions. Using the direction or orientation of edges allows machine learning methods to ignore trivial things like the direction of lighting, which isn't important for recognizing a face. In Figure 3, notice how HOG removes information about color and brightness but keeps information about edges and shapes.


Figure 3. The direction of edges creates a more abstract image.

Hand-designed features allowed machine learning to replace handwritten rules and made face detection much more robust. However, when we evaluated this method using our own test set of images considered by humans to contain at least one face, OpenCV found faces in just 58 percent of images. Admittedly, our test set is tough, but then so is our problem domain. When doing our error analysis—looking at which faces had been missed—partially obscured faces (bangs, glasses, a phone held up to take a selfie) or more oblique angles seemed to be common failure causes.

Luckily, there is a better way. Just as replacing handwritten rules with machine learning made face detection more robust, modern machine learning methods achieve even better performance by further reducing the involvement of human engineers: instead of requiring hand-designed features (like HOGs), state-of-the-art methods use machine learning end to end. Pixels go in, face locations come out.

By applying machine learning, we can stop focusing on the how of the solution and instead focus on specifying the what. The most popular (and successful) way to get machines to turn images into answers is a type of machine learning called deep learning. To give you a rough sense of the difference versus traditional machine learning, you might picture a traditional model as being a single (very complicated) line of code, while a deep model has many lines of code. Simply put, a longer program allows much more complicated things to be done with whatever data it is that you are processing. Scientists and engineers are finding that making models deeper (adding more lines of code) works much better than just making them wider (imagine making a single line of code very, very long and very, very complicated).

Deep learning methods tend to be state of the art whenever we need to get a computer to make sense of an unstructured input (image understanding, EEGs, audio, text, and so on), and face localization is no exception. “Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Neural Networks” (MTCNN; one of the most cited papers from IEEE Signal Processing Letters, 2016) explains how three different neural networks can work together to identify faces within larger images and output accurate bounding boxes. As a popular paper, MTCNN implementations for the main deep learning frameworks are easy to find.

Using MTCNN, we were able to find faces in 74 percent of the images in our test set, which is a big leap from the 58 percent we found with OpenCV. Of course, that still leaves 26 percent of faces not found. However, the faces that MTCNN misses are in many cases poor candidates for facial recognition: the faces are too oblique, too blurred, or too far away to be reliably recognized.
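As an illustration, here is a minimal sketch of face localization using one open-source MTCNN implementation (the mtcnn package on PyPI, with OpenCV used only to load the image). This is an assumption for demonstration purposes rather than the implementation used in CFS, photo.jpg is a placeholder path, and the exact import path and result format may vary between package versions.

import cv2
from mtcnn import MTCNN   # third-party MTCNN implementation (assumed), not the CFS code

# Load an image and convert from BGR (OpenCV default) to RGB for the detector
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

detector = MTCNN()
for face in detector.detect_faces(image):   # one dict per detected face
    x, y, w, h = face["box"]                # bounding box of the detected face
    print("Face at", (x, y, w, h), "with confidence", face["confidence"])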

The lesson here is that when we human engineers accept our limitations—that we aren’t good at writing complex rules to describe what to look for in messy, high-dimensional inputs like a photograph—and simply step back and confine ourselves to being “managers” of the machine, we can get a solution that is much more robust. In effect, we are going beyond test-driven development to test-defined development: through training data, engineers tell the machine what is expected; that is, “you should find a face here, you figure it out.” Techniques like deep learning give the machine a powerful and expressive way to describe its solution.

Recognizing Faces

Once we have found a region that contains a face, we can feed that region of the image to a recognition model. The job of the model is to convert our input from an image (a big grid of pixels) into an embedding—a vector that we can use for comparison. To see the value of this, think about what would happen if we didn't do this, but instead we tried to compare images of faces pixel by pixel. Let's take the easiest possible case: we have two pictures of the subject, but in one picture, they have moved their head slightly to one side (see Figure 4).


Figure 4. Original, shifted, and difference image.

Notice that although we are looking at the same woman, with the same expression, a great many pixels in the face have changed their value. In the rightmost panel, brighter pixels have had larger changes. Imagine what would have happened if there was a change in pose, expression, or lighting. Although a photograph is a representation of a face, it isn't a very good one for our purposes, because it has a lot of dimensions (each pixel is a separate value, in effect a separate axis or dimension), and it changes a lot based on things we don't care about. We want to ignore variations that aren't important for checking identity, such as hair style, hair color, whether the subject is wearing glasses or makeup, expression, pose, and lighting, while still capturing variations that do matter, such as face shape.

The task of machine learning for facial recognition is going to be to convert a bad representation of a face (a photograph) into a good one: a number, or rather a set of numbers called a vector, which we can use to calculate how similar one face is to another. Turning a “messy,” unstructured input like images, audio, free text, and so on into a vector with properties that make it a more structured, useful input to an application has been one of the biggest contributions of machine learning. These vectors, whose dimensionality is much, much lower than that of the original input, are called “embeddings.” The name comes from the idea that a simpler, cleaner, lower-dimensional representation is concealed—embedded—within the high dimensional space that is the original image.

For example, popular face-recognition models like FaceNet* (https://github.com/davidsandberg/facenet) accept as input an image measuring 224 × 224 pixels (effectively a point in a space with 150,528 dimensions, 224 pixels high × 224 pixels wide × 3 color channels) and reduce it to a vector with just 128 components or dimensions (a point in a sort of abstract “face space”). Why 128? The number is somewhat arbitrary, but if you think about it, there are good reasons to think that 150K dimensions are far too many.

Theoretically, every color channel of every pixel could take values that are completely independent of its neighbors. What would that image look like, though?


Figure 5. Pixels with no correlations—noise.

It would look like this (see Figure 5). I “drew” this picture with a Python* script, creating a matrix of random values. Even if I let the script generate billions of images, would we see anything like a “natural image,” that is, a photograph? Probably not. This should give you the sense that natural images are a tiny subset of the set of possible images, and that we shouldn’t need quite so many degrees of freedom to describe this set, and in fact, compression methods like JPEG are able to find much simpler (smaller) representations of an image. Although we don’t usually think of it that way, the compressed file produced by the JPEG algorithm is actually a model of an image that codifies some of our expectations about natural images. You can think of the set of natural images as forming a continuous surface embedded within a much vaster, far higher-dimensional space of possible images, surrounded by an infinite number of those random mosaics in Figure 5. Of course, pictures of faces are just a tiny subset of our natural image subset, and so we should expect to need still fewer dimensions or parameters to capture their variations.


Figure 6. Unfolding a 2D manifold embedded in a 3D space.

I picture these manifolds as being like the crumpled sheet of paper in this animated GIF. Although I myself exist in a 3D space, the surface of the sheet of paper in my hands is essentially 2D. You can think of the Xs and Os on the paper as being points in image space—photographs of two different people. The job of a face recognition model is essentially to uncrumple the paper, to find that simpler, flatter surface where different identities can easily be separated (like the dashed line dividing Mr. X from Mrs. O). I like to think of each layer of the model as being like one of the movements my hands make as they uncrumple the page (See Figure 6).

The analogy also works another way. Shaped by its training objectives, the model tries to learn to map the input image to a vector that encodes only those attributes of a face that will help distinguish it from other faces. Ideally, every photo of a particular face would map to a small, well-defined region (the areas of Mr. X and Mrs. O), and photos of other faces would arrive at other points in that space, with more different faces landing further away (imagine someone scribbling Mr. Y and Mr. Z onto new areas of the same page).


Figure 7. Other faces should land in their own regions, not overlap the X and O regions.

In the paper analogy, faces of other people would map elsewhere on the page, in regions not overlapping the Xs and Os, and repeating exactly the same flattening, uncrumpling procedure would reveal this structure (see Figure 7). Facial recognition models are good at mapping images to face space, even if the model hasn’t seen the face before. The machine learning “term of art” here is “transfer learning.” The model is able to generalize from the pictures and identities in its training set to new ones.

If you are familiar with using machine learning to train a classifier, you may be wondering how a face recognition model can cope with faces that weren’t in the training set. The output of a classifier is a list of numbers, one for each class the classifier supports. In facial recognition, one begins by asking a model to say who is pictured in a given photograph. You might imagine training a deep neural network-based model using lots of pictures of Mr. X and Mrs. O. Now suppose we want to use our already-trained network to identify Mr. Y. Our network doesn’t even have a way to describe that answer. Do we have to retrain our whole model for each new user?


Figure 8. Schematic of face recognition with neural networks.

Fortunately not. The trick is to use the output of the penultimate layer of the network (see Figure 8). The key insight here is that during training each layer of a neural network tries to learn to make its output as useful as possible to the next layer. In the case of the penultimate (last but one) layer, this means describing the input, such as a particular instance of Mrs O on our crumpled paper, in such a way that the final layer can easily separate the two classes (the dashed line between the Xs and Os). Unlike the final layer, which gives us probabilities for a fixed number of classes, the penultimate layer gives us a vector—a position—in some abstract space shaped by the training process to suit the problem of distinguishing the output classes. In practice, it turns out that this space—this embedding—already works pretty well for faces that the model never trained on. However, to maximize performance, facial recognition models usually undergo a fine-tuning process after the classification training stage.

Curious readers may wonder about those 128 dimensions. What do they represent? It is likely that different axes encode different aspects of a face; there are (probably) axes that encode things like the squareness of the jaw, the width of the nose, the prominence of the brow ridge, and so on. However, since the features are not designed but learned—induced by the problem of learning to recognize people—we don’t really know exactly what each axis is for. But wouldn’t that be an interesting research project?

If we don’t have a classifier layer, how do we do the final step of assigning a name to a face? To do this, we simply need to have a labeled (named) photo of Mr. Y in our database. When a new selfie arrives in our application, we put it through our neural network to get a face vector—to find the right location in face space. We then calculate the distance—the same Euclidean distance that you learned about in high school—between the new face vector and labeled or named face vectors in our database; if the new vector is much closer to a particular known vector than any other, we have a name.
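A minimal sketch of that matching step is shown below; the 128-dimensional shapes and the distance threshold of 1.0 are illustrative assumptions, not values taken from CFS.

import numpy as np

def closest_identity(query_vec, known_vecs, known_names, max_distance=1.0):
    """Return (name, distance) for the closest labeled face vector, or (None, distance) if nothing is close enough."""
    # Euclidean distance from the query embedding (shape (128,)) to every labeled embedding (shape (N, 128))
    distances = np.linalg.norm(known_vecs - query_vec, axis=1)
    best = int(np.argmin(distances))
    if distances[best] < max_distance:
        return known_names[best], distances[best]
    return None, distances[best]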

Let’s look at a more concrete example (code here). We’ll take some pictures of the author, keeping lighting, clothing, hair and background constant:

Here are some more pictures of the author, but this time varying location, lighting, clothing and age (one picture in my teens):

If we were comparing these images using raw pixels as our description of my face, we might expect the images in the first row to be each other’s closest matches. However, if we crop each image using MTCNN, and then use FaceNet to convert each crop to an embedding (face vector) and use Euclidean distance as our measure for comparison, we find that the “most similar” faces are those shown in Figure 9.


Figure 9. Most similar author faces (0.37 units apart).

Clearly, the model has been able to completely ignore the background and lighting similarities of the other images in the “pose” set. What about the most different pair (see Figure 10)?


Figure 10. Most dissimilar author faces (1.03 units apart).

These two faces were 1.03 units apart, but what does that even mean? A distance of zero would mean their face vectors were identical, while a large number would mean they were very different (in theory, there isn’t an upper limit on how large the distance could get). This still leaves the question: “How far is 1 unit?” To get a better handle on how far 1 unit is (one inch? one mile?), we can look at the distance between my face and that of someone else.

Just for fun, I took each of the faces in the 3x3 “facial diversity” grid from Figure 2, calculated their face vectors, and then calculated their distance from a vector obtained by averaging all my own face vectors. The non-Ed faces averaged 1.38 units distance, with a minimum at 1.27 and a maximum at 1.5 units, whereas “teenage Ed” was just 0.73 units from my averaged vector (see Figure 11).


Figure 11. Most and least like the author: but which is which?

There is even a way to look at all the pictures together. Although my boss hasn’t yet agreed to buy me that shiny new 128 spatial-dimensions display I’m after, we can use an exciting non-linear dimensionality reduction technique called t-SNE to project our 128 dimensional vectors down to a more laptop-friendly 2, giving us the scatter plot shown in Figure 12.


Figure 12. "Face space": our face vectors projected down to 2D.

Lots going on here! Notice the male faces from the article clustered quite closely; notice also that the two child faces are so close they overlap. My face is broken into multiple clusters.
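For reference, here is a minimal sketch of that projection step using scikit-learn's t-SNE implementation; the random array stands in for the real 128-dimensional face vectors produced by the recognition model.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(40, 128)   # placeholder: real FaceNet-style face vectors go here
projected = TSNE(n_components=2, random_state=0).fit_transform(embeddings)   # 128-D down to 2-D

plt.scatter(projected[:, 0], projected[:, 1])
plt.title("Face vectors projected to 2D with t-SNE")
plt.show()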

Conclusion

In this introduction to facial recognition, we discussed the following:

  • The CFS application required us to actually do some machine learning ourselves, rather than rely on an off-the-shelf model, because the dataset differs from others in important ways (for example, image quality is lower and demographics aren’t a good match).
  • Deep neural networks, although large and complex if described in terms of parameters and layers, actually simplify the application as a whole, replacing complex difficult-to-review human code with a machine-optimized solution.
  • The output of a neural network is actually pretty easy to use in the rest of the application. It lets us do familiar things like calculate distances and sort matches.
  • The key task of machine learning in this application is to produce an embedding, a useful abstraction of the input: instead of a photograph, a vector that captures something particular about a given face. Embeddings are a powerful concept, and we’ll return to them in future articles.

Don’t fear the model. Check out the sample code. And have fun.

IoT 101 Series #1 - Design IoT devices - Introduction


The innovation process applied to the design and development of IoT devices must, as for any other product, guarantee strong performance across several fundamental aspects, including:

  • Quality, measured in terms of customer satisfaction and monitored continuously in order to remain in the market;
  • Time, in terms of how promptly and quickly the product reaches the market so that opportunities can be seized (time to market);
  • Cost, in terms of maximizing value, that is, the ratio of product performance to the costs incurred to obtain it.

These aspects must be considered in a market where product life cycles keep getting shorter because of continuous technological progress and where, consequently, competition on cost becomes ever more ruthless.

The concept phase is of fundamental importance: careful product design influences customer satisfaction, and therefore the company's competitive advantage, as well as costs and overall efficiency, far more decisively than later factors such as process design, industrialization, or production.

On average, 80% of a product's total life cycle costs are committed before the first unit is manufactured, that is, during the specification and design phases.

In the traditional process for creating a new product or service, we often make the mistake of rushing toward a prototype and devoting too few resources to the early phases of conception and design.

This happens because of the erroneous and dangerous belief that costs and characteristics can be verified and optimized only once the first piece is in hand.

Unfortunately, by the time the industrialization or production phase arrives, it is too late to intervene with changes aimed at reducing costs; keep in mind that most of the costs will already have been committed during the conception and design phases.

This is why, from the very first phases of the product life cycle, we need an innovation and development process that is highly structured and oriented toward optimization, as well as toward reducing and controlling time to market.

The ideal process to draw inspiration from rests on two fundamental practices: target costing and concurrent engineering.

The process develops as follows:

An expressed or unexpressed market need is identified, ideally one with high potential for diffusion and growth; the business opportunity is assessed and a business plan and product strategy are defined; a design solution for the need identified at the outset is developed; the project is refined with the production phases in mind; once the technological components are integrated, the production cycle is optimized and the project can be fine-tuned; this is followed by validation, sampling, the pre-series run, and, finally, production.

From this, it is clear that optimizing the design process requires placing it within a broader macro process composed of a design process, a production process, and a consumption process.

This approach to process management focuses on eliminating the extra costs that typically arise from changes made during the various phases, from quality problems or service failures, from production difficulties, or from dissatisfaction with the finished product.

To reduce development time and costs, it therefore becomes increasingly important to adopt concurrent engineering in the product development program.

As already noted, the diverse, cross-functional skills of the development team should not be underestimated; they can reduce the main product costs during the conception and specification phases.

IoT 101 Series #2 - Design IoT devices - Technologies & Design


One of the problems faced when bringing a new technology or service to market is the possibility that people will not accept or understand the innovation.

Naturally, beyond having a technology capable of performing a specific task or addressing a particular need, perhaps one that is still unexpressed, it is also necessary that the community of people actually accepts and uses it.

The boundary between a failed technology and a successful one often lies in the fact that the successful technology arrived somewhat later, when people had unconsciously become more willing to accept it.

It is therefore evident how important it is to study the market before making a technology available.

A technology is generally adopted more readily when it resembles something that users already know.

For example, the mobile phone was introduced as a telephone independent of a fixed location; today we use the same technology as a portable, Internet-connected object that runs many other technologies and can also be used as a telephone.

Designers must therefore present solutions that do not push too far beyond the boundaries of the familiar, even if the underlying technology is far from ordinary, because changing people's habits is not a trivial matter and, when necessary, should be done gradually.

In some cases, however, one could imagine that design is not such an important aspect.

For example, consider a sensor installed on a factory production line: although the appearance of such a device may matter less than that of an object intended for other purposes, the ways it will interact with the rest of the factory and its control systems must be carefully evaluated and not left to chance.

Design, as you can well imagine, does not concern only the shape and appearance of an object; it is not just about the aesthetics of the product. Product design is a much broader and more complex discipline that should not be underestimated in any way.

Industrial or product design does include the shape and aesthetics of the product being designed, but it also deals with functional aspects such as construction materials and ensuring that the product's functions are easy to understand.

Finally, and this should always be kept in mind, product design considers the perspective of the end user and aims to create the best possible solution for them, making the product as pleasant and efficient to use as possible.

IoT 101 Series #3 - Design IoT devices - Usability & Affordance


Careful usability design is an essential part of the overall design process if you want your product to reach its full potential; it is therefore essential to take all usability principles into account so that the product can adapt to the skills of its users.

Good usability design is in some cases a critical requirement for product reliability: many user errors are caused by the fact that, at the design stage, the real capabilities of users in their environment were not taken into account.

Ignoring usability often means preventing users from accessing certain features of the product, causing them to make mistakes and, ultimately, to regard the product as useless rather than as an aid.

When making design decisions that affect the user, one should consider the physical and mental capabilities of the people who will use the product.

People have limited short-term memory and can hold roughly seven items in mind at once, so if you present too much information to users at the same time, they might not be able to absorb or manage it all.

Everybody makes mistakes, especially when too much information has to be dealt with or when users are under stress; when a system fails and sends notifications or triggers alarms, it causes further stress for users, increasing the chances that they will make further mistakes.

Everyone has different physical abilities: some people see or hear better than others, some are colorblind, some are very good with manual work, and so on. The point is that you should never design around your own abilities and assume that all users will understand how to use the product.

These human factors underlie the design principles listed below:

  • User familiarity: the product should take into account the experience of the people who will use it the most.
  • Consistency: functionality should be designed consistently so that, where possible, functions are activated in similar ways, ideally matching other similar products.
  • Minimal surprise: the user should never be surprised by the behavior of the product.
  • Recoverability: the product should include mechanisms that allow users to recover their state in the event of an error.
  • User guidance: the product should provide meaningful feedback to the user when errors occur and offer context-sensitive help.
  • Diversity of users: the product should provide interaction functionality suitable for the different types of users of the system.

Affordance

The term was introduced in 1979 by the American psychologist James Gibson in his book The Ecological Approach to Visual Perception.

Every object has its affordances, as do surfaces, events, and places; the higher the affordance, the more automatic and intuitive the use of a device or instrument will be.

For example, the appearance of a door handle should make it immediately and intuitively clear how the door is to be opened: pulled, pushed, or slid (a door that opens automatically as you approach has poor affordance, since how it works is not at all intuitive).

Thus, affordance defines an important design parameter: the physical quality of an object that suggests to a user the appropriate actions for using it.

Objects with excellent affordance include, for example, the fork and the spoon, tools that have been refined over millennia to the point of being extremely intuitive and simple to use today.

CIFAR-10 Classification using Intel® Optimization for TensorFlow*


Abstract

This work demonstrates the experiments to train and test the deep learning AlexNet* topology with the Intel® Optimization for TensorFlow* library using CIFAR-10 classification data on Intel® Xeon Phi™ processor powered machines. These experiments were conducted with options set at compile time and run time. From these runs the training accuracy, validation accuracy, and testing accuracy numbers were captured for different compiler switches and environment configurations to identify the optimal configuration. For the optimal combination identified, the top-1 and top-5 accuracies were plotted.

Introduction

Many deep learning frameworks running on different processors have evolved in recent years to solve various complex problems in image classification, detection, and segmentation. Continued research in this space has helped to optimize these frameworks and hardware to improve training, inference accuracy, and speed. Intel has optimized the TensorFlow* library for Intel® Xeon Phi™ processors. The Intel Xeon Phi processor is designed to scale out in a near-linear fashion across cores and nodes to reduce the time needed to train machine learning and deep learning models. During the experiments, various optimization options were tried to train and test the AlexNet* topology with CIFAR-10 images using Intel optimized TensorFlow on Intel Xeon Phi processors. The optimal combination was identified based on the results.

Document Content

Environment Setup

The following hardware and software environments were used to perform the experiments.

Hardware

Architecture: x86_64
CPU op-mode(s): 32 bit, 64 bit
Byte order: Little endian
CPU(s): 256
Online CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 64
Socket(s): 1
Non-uniform memory access (NUMA) node(s): 2
Vendor ID: Genuine Intel
CPU family: 6
Model: 87
Model name: Intel® Xeon Phi™ processor 7210 @ 1.30 GHz
Stepping: 1
CPU MHz: 1153.648
Bogus Million Instructions Per Second (BogoMIPS): 2593.69
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
NUMA node0 CPU(s): 0-255

Software

TensorFlow*: 1.3.0 (Intel® optimized)
Python*: 3.5.3 (Intel distributed)
GNU Compiler Collection* (GCC): 6.2.1
Virtual environment: Conda*

Choosing the Optimal Software Configuration

Trial runs were performed to choose the optimal software configuration. For these runs, an Intel optimized TensorFlow library was built from the sources using Bazel* 0.4.5.

TensorFlow versions 1.2 and 1.3 with Python* version 2.7.5 were tried on an AlexNet benchmark script. It was found that TensorFlow 1.3 showed 16 times faster performance (refer to the Configurations section) compared with TensorFlow 1.2. Further, Python 2.7.5 and 3.5.1 versions were tried with TensorFlow 1.3 on the AlexNet topology, with 10,000 images.

The results of the evaluation are as follows.

TensorFlow* Version | Python* Version | No. of Epochs | Compiler Switches | Accuracy
1.3 | 2.7.5 | 20 | DEIGEN_USE_VML, config=mkl | 16.10%
1.3 | 2.7.5 | 20 | mfma, Intel® AVX2, DEIGEN_USE_VML, config=mkl | 16.40%
1.3 | 3.5.3 (Intel) | 20 | mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl | 51.9%

From the above results, the software configuration listed in the software table was finalized with the compiler switches; namely, mfma, Intel® Advanced Vector Extensions 2 (Intel® AVX2), and the Intel® Math Kernel Library (Intel® MKL) config.

Network Topology and Model Training

This section details the dataset adopted, AlexNet architecture, and training the model in the current work.

Dataset

The CIFAR-10 dataset chosen for these experiments consists of 60,000 32 x 32 color images in 10 classes. Each class has 6,000 images. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

The dataset was taken from Kaggle* [3]. The following figure shows a sample set of images for each classification.

Figure 1: CIFAR-10 sample images.

For the experiments, out of the 60,000 images, 50,000 images were chosen for training and 10,000 images for testing.

AlexNet* Architecture

The AlexNet network is made of five convolution layers, max-pooling layers, dropout layers, and three fully connected layers. The network was designed to be used for classification with 1,000 possible categories.

Figure 2: AlexNet* architecture (credit: MIT [2]).

Model Training

In these experiments, it was decided to train the model from the beginning using the CIFAR-10 dataset. The dataset is split as 50,000 images for training and validation and 10,000 images for testing.

Experimental Runs

The experiment was conducted in two steps.

In Step 1, multiple compiler switches were used and runs were performed for different batch sizes. The epoch counts considered for these runs were 25 and 100. The aim of this step was to observe the accuracy and throughput for each batch size.

In Step 2, the Intel-suggested environment configuration was used on top of the compiler switches set in Step 1. Benchmark scripts were run to observe the throughput and, based on that, AlexNet runs using CIFAR-10 were executed to get the top-1 and top-5 accuracies.

Step 1: With Compiler Switches

The following are the compiler switches that were set during the Bazel build:

mfma, Intel AVX2, DEIGEN_USE_VML, config=mkl

The runs were performed for different batch sizes and the following results were obtained. For 25 epochs:

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 25 | 71.87% | 69.22% | 67.47%
96 | 25 | 68.50% | 66.63% | 67.16%
128 | 25 | 65.80% | 64.55% | 64.82%
256 | 25 | 59.30% | 58.98% | 59.16%

Figure 3: Training with 25 epochs.

It was observed that with larger batch sizes there is a degradation in the quality of the model, because the stochasticity of the gradient descent is reduced. The drop in accuracy is steeper when the batch size increases from 128 to 256. In general, processors perform better when the batch size is a power of 2. Considering this, it was decided to perform runs with a higher epoch count on batch sizes of 64 and 128.

For 100 epochs:

Batch Size | Epochs | Training Accuracy | Validation Accuracy | Testing Accuracy
64 | 100 | 94.98% | 72.62% | 72.29%
128 | 100 | 89.19% | 72.23% | 70.94%

Figure 4: Training with 100 epochs.

As the epoch count increased, the network showed improvement in accuracy, but significant overfitting of the model was observed. At this stage, it became necessary to consider additional options beyond compiler flags to best utilize the Intel Xeon Phi processor's capabilities, improve the performance of the model, and reduce the training time.

Step 2: With Environment Configurations

Retaining the compiler options as-is, in this step different environmental parameters as suggested by Intel1 were set.These parameters are as follows:

"OMP_NUM_THREADS = "136"

"KMP_BLOCKTIME" = "30"

"KMP_SETTINGS" = "1"

"KMP_AFFINITY"= "granularity=fine, verbose, compact, 1, 0"

'inter_op' = 1

'intra_op' = 136
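
A minimal bash sketch of applying these settings before launching training is shown below. The training script name and its thread flags are illustrative assumptions; inter_op and intra_op are TensorFlow session settings and may instead be set inside the training code.

# Sketch only: export the suggested environment settings, then launch training.
export OMP_NUM_THREADS=136
export KMP_BLOCKTIME=30
export KMP_SETTINGS=1
export KMP_AFFINITY="granularity=fine,verbose,compact,1,0"
# inter_op/intra_op are TensorFlow settings; the script name and flag names
# below are illustrative only.
python alexnet_cifar10_train.py --num_inter_threads 1 --num_intra_threads 136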

Using the same TensorFlow setup built with the compiler switches in Step 1, and setting the above environment parameters, the AlexNet topology was run on the CIFAR-10 dataset for 1,000 epochs to capture the top-5 and top-1 accuracies. The results are as follows:

Sr. No | Top-n Accuracy | Training Accuracy | Testing Accuracy
1      | Top-5          | 99.74%            | 96.98%
2      | Top-1          | 93.26%            | 70.94%

The following graph represents the top-1 and top-5 accuracies for training and testing for every 100 epochs:


Figure 5: Training accuracy comparison.

Comparing the top-1 training and testing accuracy, it can be inferred that the network tends to overfit after 500 epochs. The reason could be that the model is training on the same data again.

Conclusion

The experiments on training the AlexNet topology on Intel Xeon Phi processor-powered machines with Intel-optimized TensorFlow, using the CIFAR-10 classification dataset, illustrate that performance gains on Intel Xeon Phi processors can be achieved by setting the compiler switches (mfma, Intel AVX2), the configuration option (Intel® Math Kernel Library), and the environment options (as suggested1).

Further, making the runs numactl-aware on the Intel Xeon Phi processor helps to optimize performance by about 1.2x (refer to the Configurations section). Similar runs can be performed on newly released Intel® Xeon® Gold processor-powered machines to experience enhanced performance.
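
As a reference point, a numactl-aware launch might look like the sketch below; it assumes the Intel Xeon Phi processor is booted in flat memory mode with MCDRAM exposed as NUMA node 1, and the script name is illustrative only.

# Sketch only: prefer MCDRAM (assumed to be NUMA node 1 in flat mode) for the
# training run; the allocation falls back to DDR4 if MCDRAM is exhausted.
numactl --preferred=1 python alexnet_cifar10_train.py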

About the Authors

Rajeswari Ponnuru, Ajit Kumar Pookalangara, and Ravi Keron Nidamarty are part of the Intel-Tata Consultancy Services relationship, working on the AI academia evangelization.

Acronyms and Abbreviations

Term/Acronym | Definition
CIFAR        | Canadian Institute for Advanced Research
CIFAR-10     | Established computer-vision dataset used for object recognition
GCC          | GNU Compiler Collection*

Configurations

For the performance reference under the Choosing Optimal Software Configuration section:

    Hardware: refer to Hardware under Environment Setup

    Software:

        Virtual environment 1: Intel Optimized TensorFlow 1.2, Python* version 2.7.5

        Virtual environment 2: Intel Optimized TensorFlow 1.3, Python* version 2.7.5

    Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For the performance reference under the Conclusion section:

    Hardware: refer to Hardware under Environment Setup

    Software: Intel Optimized TensorFlow 1.3, Python* version 3.5.3

    Test performed: executed the script benchmark_alexnet.py from convnet-benchmarks

For more information go to http://www.intel.com/performance.

References

  1. TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
  2. Alexnet topology diagram: http://vision03.csail.mit.edu/cnn_art/
  3. CIFAR-10 dataset taken from: https://www.kaggle.com/c/cifar-10/data

Related Resources

Alexnet details: http://vision.stanford.edu/teaching/cs231b_spring1415/slides/alexnet_tugce_kyunghee.pdf

About CIFAR-10 data: https://www.cs.toronto.edu/~kriz/cifar.html

Configure Virtual Fabrics in Intel® Omni-Path Architecture

$
0
0

Introduction

Virtual Fabrics (vFabrics)* allow multiple network applications to run on the same fabric at the same time with limited interference. Using vFabrics, a physical fabric is divided into many overlapping virtual fabrics, which keep network applications separate even though they connect to the same physical fabric. Virtual fabric is a feature of the Intel® Omni-Path Architecture (Intel® OPA) Fabric Manager (FM). For a complete overview of the FM, please refer to the document Intel® Omni-Path Fabric (Intel® OP Fabric) Suite Fabric Manager User Guide.

Typical usage models for vFabrics include:

  • Separating a cluster into multiple virtual fabrics so independent applications running in different virtual fabrics can run with minimal or no effect on each other.
  • Separating classes of traffic. Each class of traffic runs in a different virtual fabric so that they don’t interfere with each other.

Each vFabric can be assigned quality of service (QoS) and security policies to control how common physical resources in the fabric are shared among the virtual fabrics. A virtual fabric is defined as a list of applications and a list of device groups, with a set of security and QoS policies along with other identifiers for the virtual fabric. Refer to Section 2.1 of the document Intel® Omni-Path Fabric (Intel® OP Fabric) Suite Fabric Manager User Guide for more information on virtual fabrics.

This document shows how to configure vFabrics and how to use various Intel® OPA command lines to display virtual fabrics, display port counters, and clear them. The Intel® MPI Benchmarks, part of Intel® Parallel Studio XE 2018 Cluster Edition, are used to generate traffic in these virtual fabrics. Finally, you can verify the packets sent and received in the virtual fabrics with the Intel® OPA command lines.

Preparation

The following setup and tests are run on two systems: one equipped with the Intel® Xeon® processor E5-2698 v3 @ 2.30 GHz and one with the Intel® Xeon Phi™ processor 7250 @ 1.40 GHz. The first system has the IP address 10.23.3.28 with hostname lb0; the second system has the IP address 10.23.3.182 with hostname knl-sb2. Both systems are running Red Hat Enterprise Linux* 64-bit 7.2. On each system, an Intel® Omni-Path Host Fabric Interface (Intel® OP HFI) Peripheral Component Interconnect Express (PCIe) x16 adapter is installed and connected directly to the other HFI. The Intel® Omni-Path Fabric Software version 10.4.2.0.7 is installed on each system, and the Intel® Parallel Studio XE 2018 Cluster Edition is installed on both systems to run the Intel® MPI Benchmarks.

In this test, IP over Fabric (IPoFabric) is also configured. The corresponding IP addresses on the Intel® Xeon® processor host and the Intel® Xeon Phi™ processor are 192.168.100.101 and 192.168.100.102, respectively.
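
As an aside, a minimal sketch of assigning these IPoFabric addresses is shown below; the interface name ib0 is an assumption and may differ on your system.

# Sketch only: on the Intel Xeon processor host (lb0); the interface name is assumed.
sudo ip addr add 192.168.100.101/24 dev ib0
sudo ip link set ib0 up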

Creating Virtual Fabrics

The Intel® OPA fabric supports redundant FMs. This is implemented to ensure that management coverage of the fabric continues in the case of a failure on the master FM. When there are redundant FMs, only one is arbitrarily selected as the master FM, and the others are standby FMs. Only the master FM can create vFabrics.

Using the above systems, you will create the vFabrics and generate traffic in these vFabrics. First, you need to determine the master FM. Both the opareport and opafabricinfo commands can show which FM is the master FM. For example, issue the command opafabricinfo on either machine, lb0 or knl-sb2:

# opafabricinfo
Fabric 0:0 Information:
SM: knl-sb2 Guid: 0x00117501017444e0 State: Master
SM: lb0 Guid: 0x0011750101790311 State: Inactive
Number of HFIs: 2
Number of Switches: 0
Number of Links: 1
Number of HFI Links: 1              (Internal: 0   External: 1)
Number of ISLs: 0                   (Internal: 0   External: 0)
Number of Degraded Links: 0         (HFI Links: 0   ISLs: 0)
Number of Omitted Links: 0          (HFI Links: 0   ISLs: 0)
-------------------------------------------------------------------------------

This indicates that the master FM resides in the machine knl-sb2.

Show the current virtual fabrics with the following command. You will see two virtual fabrics were configured by default: one is named Default; the other is Admin:

# opapaquery -o vfList
Getting VF List...
 Number of VFs: 2
 VF 1: Default
 VF 2: Admin
opapaquery completed: OK

In the following exercise, you will add two virtual fabrics named “VirtualFabric1” and “VirtualFabric2”. To create virtual fabrics, you can modify the /etc/opa-fm/opafm.xml configuration file in the host where the master FM resides. Note that existing tools such as opaxmlextract, opaxmlindent, opaxmlfilter, and opaxmlgenerate can help to manipulate the opafm.xml file.

The /etc/opa-fm/opafm.xml configuration file stores information about how the fabric is managed via the master FM. This file uses standard XML syntax. On the node where the master FM resides, knl-sb2 in this case, edit the FM configuration file /etc/opa-fm/opafm.xml and add two virtual fabrics by adding the following lines in the virtual fabrics configuration section:

# vi /etc/opa-fm/opafm.xml

........................

<VirtualFabric>
    <Name>VirtualFabric1</Name>
    <Application>AllOthers</Application>
    <BaseSL>1</BaseSL>
    <Enable>1</Enable>
    <MaxMTU>Unlimited</MaxMTU>
    <MaxRate>Unlimited</MaxRate>
    <Member>All</Member>
    <QOS>1</QOS>
</VirtualFabric>
<VirtualFabric>
    <Name>VirtualFabric2</Name>
    <Application>AllOthers</Application>
    <BaseSL>2</BaseSL>
    <Enable>1</Enable>
    <MaxMTU>Unlimited</MaxMTU>
    <MaxRate>Unlimited</MaxRate>
    <Member>All</Member>
    <QOS>1</QOS>
</VirtualFabric>

........................

A virtual fabric is created by adding an XML element <VirtualFabric>. Inside the XML element, you can define many parameters for this virtual fabric. The following parameters are defined for the virtual fabrics used in this example:

  • Name: a unique name for this virtual fabric.
  • Application: a catchall that can be used to identify all applications.
  • BaseSL (base service level): allows a specified service level (0–15) to be used for the vFabric. In this example, VirtualFabric1 uses service level=1, and VirtualFabric2 uses service level=2.
  • Enable: vFabric is enabled if this field is set to 1. If set to 0, the vFabric is simply ignored (allows user to easily disable the vFabric without deleting it).
  • MaxMTU: the maximum Maximum Transmission Unit (MTU) of the packet. The actual value of the MTU may be further reduced by hardware capability or by request. It can be set to Unlimited.
  • MaxRate: the maximum static rate. It can be set to unlimited.
  • Member: the group name that specifies whether a device can talk to any other members.
  • QoS: 1 means the QoS for this vFabric is enabled.

A complete list of parameters is shown in Section 6.5.12 of the document Intel® OPA Fabric Suite Fabric Manager User Guide.

Log in as root or a user with privileges, and restart the master FM in knl-sb2 to read the new /etc/opa-fm/opafm.xml configuration file:

# systemctl restart opafm

Verify the master FM is up and running:

# systemctl status opafm
● opafm.service - OPA Fabric Manager
   Loaded: loaded (/usr/lib/systemd/system/opafm.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2017-11-14 08:39:58 EST; 18s ago
  Process: 36995 ExecStop=/usr/lib/opa-fm/bin/opafmd halt (code=exited, status=0/SUCCESS)
 …………

Now, display all current virtual fabrics:

# opapaquery -o vfList
Getting VF List...
 Number of VFs: 4
 VF 1: Default
 VF 2: Admin
 VF 3: VirtualFabric1
 VF 4: VirtualFabric2
opapaquery completed: OK

The output shows that the two new virtual fabrics “VirtualFabric1” and “VirtualFabric2” are added. To show the configuration of the recently added virtual fabrics, issue the command “opareport -o vfinfo”:

# opareport -o vfinfo
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records

vFabrics:
Index:0 Name:Default
PKey:0x8001   SL:0  Select:0x0   PktLifeTimeMult:2
MaxMtu:unlimited  MaxRate:unlimited   Options:0x00
QOS: Disabled  PreemptionRank: 0  HoQLife:    8 ms

Index:1 Name:Admin
PKey:0x7fff   SL:0  Select:0x1: PKEY   PktLifeTimeMult:2
MaxMtu:unlimited  MaxRate:unlimited   Options:0x01: Security
QOS: Disabled  PreemptionRank: 0  HoQLife:    8 ms

Index:2 Name:VirtualFabric1
PKey:0x2   SL:1  Select:0x3: PKEY SL   PktLifeTimeMult:2
MaxMtu:unlimited  MaxRate:unlimited   Options:0x03: Security QoS
QOS: Bandwidth:  33%  PreemptionRank: 0  HoQLife:    8 ms

Index:3 Name:VirtualFabric2
PKey:0x3   SL:2  Select:0x3: PKEY SL   PktLifeTimeMult:2
MaxMtu:unlimited  MaxRate:unlimited   Options:0x03: Security QoS
QOS: Bandwidth:  33%  PreemptionRank: 0  HoQLife:    8 ms

4 VFs

This confirms that VirtualFabric1 and VirtualFabric2 have service level (SL) 1 and SL 2, respectively. Get membership information for all vFabrics in every node in the fabric:

# opareport -o vfmember
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Getting All Port VL Tables...
Done Getting All Port VL Tables
VF Membership Report

knl-sb2
LID 1   FI      NodeGUID 0x00117501017444e0
Port 1
    Neighbor Node: lb0
    LID 2       FI      NodeGUID 0x0011750101790311
    Port 1
VF Membership:
    VF Name        VF Index     Base SL Base SC VL
    Default        0            0       0       0
    Admin          1            0       0       0
    VirtualFabric1 2            1       1       1
    VirtualFabric2 3            2       2       2


lb0
LID 2   FI      NodeGUID 0x0011750101790311
Port 1
    Neighbor Node: knl-sb2
    LID 1       FI      NodeGUID 0x00117501017444e0
    Port 1
VF Membership:
    VF Name        VF Index     Base SL Base SC VL
    Default        0            0       0       0
    Admin          1            0       0       0
    VirtualFabric1 2            1       1       1
    VirtualFabric2 3            2       2       2


    2 Reported Port(s)
-------------------------------------------------------------------------------

Virtual lanes (VLs) allow multiple logical flows over a single physical link; up to eight data VLs plus one management VL can be configured. The above output shows that these two new virtual fabrics use VL 1 and VL 2, respectively. Note that a local identifier (LID) is assigned to every port in the fabric.

Get the buffer control tables for all vFabrics on every node in the fabric:

# opareport -o  bfrctrl
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
Done Getting Buffer Control Tables
BufferControlTable Report
    Port 0x00117501017444e0 1 FI knl-sb2 (LID 1)
        Remote Port 0x0011750101790311 1 FI lb0 (LID 2)
        BufferControlTable
            OverallBufferSpace   (AU/B):    2176/  139264
            Tx Buffer Depth     (LTP/B):     128/   16384
            Wire Depth          (LTP/B):      13/    1664
            TxOverallSharedLimit (AU/B):    1374/   87936
                VL | Dedicated  (   Bytes) |  Shared  (   Bytes) |  MTU
                 0 |       224  (   14336) |    1374  (   87936) |  10240
                 1 |       224  (   14336) |    1374  (   87936) |  10240
                 2 |       224  (   14336) |    1374  (   87936) |  10240
                15 |       130  (    8320) |    1374  (   87936) |   2048
    Port 0x0011750101790311 1 FI lb0 (LID 2)
        Remote Port 0x00117501017444e0 1 FI knl-sb2 (LID 1)
        BufferControlTable
            OverallBufferSpace   (AU/B):    2176/  139264
            Tx Buffer Depth     (LTP/B):     128/   16384
            Wire Depth          (LTP/B):      11/    1408
            TxOverallSharedLimit (AU/B):    1374/   87936
                VL | Dedicated  (   Bytes) |  Shared  (   Bytes) |  MTU
                 0 |       224  (   14336) |    1374  (   87936) |  10240
                 1 |       224  (   14336) |    1374  (   87936) |  10240
                 2 |       224  (   14336) |    1374  (   87936) |  10240
                15 |       130  (    8320) |    1374  (   87936) |   2048
2 Reported Port(s)
-------------------------------------------------------------------------------

For a detailed summary of all systems, users can issue the following command (a long report):

# opareport -V -o comps -d 10

Just before running traffic in the added virtual fabrics, show port status on VirtualFabric1 and VirtualFabric2 (-n mask 0x2 indicates port 1 and -w mask 0x6 indicates VL 1 and VL 2):

# opapmaquery -o getportstatus -n 0x2 -w 0x6 | grep -v 0$
…………………………………
        VL Number    1
            Performance: Transmit
                 Xmit Data                                0 MB (0 Flits)
            Performance: Receive
                 Rcv Data                                 0 MB (0 Flits)
            Performance: Congestion
            Performance: Bubbles
            Errors: Other

        VL Number    2
            Performance: Transmit
                 Xmit Data                                0 MB (0 Flits)
            Performance: Receive
                 Rcv Data                                 0 MB (0 Flits)
            Performance: Congestion
            Performance: Bubbles
            Errors: Other

Observe that all port counters in VirtualFabric1 and VirtualFabric2 are zero. Many job schedulers provide an integrated mechanism to launch jobs in the proper virtual fabric. When launching jobs manually, you can use the service level (SL) to identify the virtual fabric.

Find the LIDs assigned to the ports. Two LIDs are found, LID 0x0001 and 0x0002:

# opareport -o lids
Getting All Node Records...
Done Getting All Node Records
Done Getting All Link Records
Done Getting All Cable Info Records
Done Getting All SM Info Records
Done Getting vFabric Records
LID Summary

2 LID(s) in Fabric:
   LID(Range) NodeGUID          Port Type Name
0x0001        0x00117501017444e0   1 FI knl-sb2
0x0002        0x0011750101790311   1 FI lb0
2 Reported LID(s)
-------------------------------------------------------------------------------

Every HFI and switch has port counters to hold error counters and performance counters. Show the port counters for LID 0x0001, port 1:

# opapaquery -o portcounters -l 1 -P 1 | grep -v 0$

In the above command, since the output is a long list and most of the lines show 0 at the end, you may want to display only the lines that do not end in 0, to see anything that is not normal.

Similarly, show the port counters for LID 0x0002, port 1:

# opapaquery -o portcounters -l 2 -P 1 | grep -v 0$
Getting Port Counters...
PM controlled Port Counters (total) for LID 0x0002, port number 1:
Performance: Transmit
    Xmit Data                             3113 MB (389186568 Flits)
    Xmit Pkts                           944353
    MC Xmit Pkts                            15
Performance: Receive
    Rcv Data                              3054 MB (381812586 Flits)
    Rcv Pkts                           1002110

To clear counters on port 1 (-n 0x2) of VL 1 and VL 2 (-w 0x6):

# opapmaquery -o clearportstatus -n 0x2 -w 0x6
 Port Select Mask   0x0000000000000002
 Counter Sel Mask   0xffffffff

Testing and Verifying

In this section, after setting up the virtual fabrics, you will run MPI traffic on both VirtualFabric1 and VirtualFabric2 and verify it using the Intel® OPA command lines. You can use the Intel® MPI Benchmarks to generate traffic between the two nodes. First, you need to set up password-less secure shell:

# ssh-keygen -t rsa
# ssh-copy-id root@192.168.100.101

And disable the firewall for the MPI traffic test:

# systemctl status firewalld
# systemctl stop firewalld
# systemctl status firewalld

Set up proper environment variables before using Intel® MPI Benchmarks:

# source /opt/intel/parallel_studio_xe_2018.0.033/psxevars.sh intel64
Intel(R) Parallel Studio XE 2018 for Linux*
Copyright (C) 2009-2017 Intel Corporation. All rights reserved.

Run the Intel® MPI Benchmarks IMB-MPI1 (benchmarks for MPI-1 functions) on VirtualFabric1 by setting the environment variable HFI_SL=1 (note that the flag -PSM2, Intel® Performance Scaled Messaging 2, was used for the Intel® OPA family of products):

# mpirun -genv HFI_SL=1 -PSM2 -host localhost -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv : -host 192.168.100.101 -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Nov 28 15:05:11 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.22.2.el7.x86_64
# Version               : #1 SMP Thu Jun 9 10:09:10 EDT 2016
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         2.77         2.77         2.77         0.00
            1         1000         2.68         2.68         2.68         0.75
            2         1000         2.62         2.62         2.62         1.53
            4         1000         2.61         2.61         2.61         3.06
            8         1000         2.64         2.64         2.64         6.06
           16         1000         4.47         4.47         4.47         7.16
           32         1000         4.36         4.36         4.36        14.67
           64         1000         4.37         4.37         4.37        29.31
          128         1000         4.41         4.41         4.41        58.07
          256         1000         4.42         4.42         4.42       115.94
          512         1000         4.59         4.59         4.59       223.34
         1024         1000         4.76         4.76         4.76       430.62
         2048         1000         5.02         5.02         5.02       815.80
         4096         1000         5.79         5.79         5.79      1414.85
         8192         1000         7.57         7.57         7.57      2164.05
        16384         1000        12.25        12.25        12.25      2674.27
        32768         1000        16.41        16.41        16.41      3993.64
        65536          640        36.22        36.22        36.22      3618.56
       131072          320        68.82        68.83        68.82      3808.70
       262144          160        80.24        80.31        80.27      6528.68
       524288           80       119.05       119.17       119.11      8798.73
      1048576           40       202.05       202.25       202.15     10369.08
      2097152           20       370.85       371.46       371.15     11291.52
      4194304           10       673.70       674.99       674.34     12427.81


# All processes entering MPI_Finalize

Note that the performance numbers presented in this document are for reference purposes only.

On a separate shell, run the Intel® MPI Benchmarks on VirtualFabric2 by setting the environment variable HFI_SL=2:

# mpirun -genv HFI_SL=2 -PSM2 -host localhost -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv : -host 192.168.100.101 -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Nov 28 15:06:47 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.22.2.el7.x86_64
# Version               : #1 SMP Thu Jun 9 10:09:10 EDT 2016
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#-----------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#-----------------------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         2.70         2.70         2.70         0.00
            1         1000         2.66         2.66         2.66         0.75
            2         1000         2.66         2.66         2.66         1.50
            4         1000         3.55         3.55         3.55         2.26
            8         1000         2.67         2.67         2.67         6.00
           16         1000         4.40         4.40         4.40         7.27
           32         1000         4.37         4.37         4.37        14.63
           64         1000         4.27         4.27         4.27        29.95
          128         1000         4.29         4.29         4.29        59.66
          256         1000         4.39         4.39         4.39       116.63
          512         1000         4.48         4.48         4.48       228.42
         1024         1000         4.63         4.64         4.63       441.85
         2048         1000         5.04         5.04         5.04       812.86
         4096         1000         5.81         5.81         5.81      1409.28
         8192         1000         7.54         7.54         7.54      2172.12
        16384         1000        11.93        11.93        11.93      2747.13
        32768         1000        16.24        16.25        16.24      4034.19
        65536          640        35.99        35.99        35.99      3641.52
       131072          320        67.15        67.16        67.15      3903.48
       262144          160        77.56        77.62        77.59      6754.15
       524288           80       118.23       118.37       118.30      8858.10
      1048576           40       206.30       206.63       206.46     10149.53
      2097152           20       367.50       368.05       367.77     11396.12
      4194304           10       676.80       677.80       677.30     12376.23


# All processes entering MPI_Finalize

Display the port status for port 1, VL 1, and VL 2 (shows result only if it is different than 0):

# opapmaquery -o getportstatus -n 0x2 -w 0x6 | grep -v 0$
Port Number             1
    VL Select Mask      0x00000006
    Performance: Transmit
        Xmit Data                             2289 MB (286211521 Flits)
        Xmit Pkts                           746057
    Performance: Receive
        Rcv Data                              2334 MB (291751815 Flits)
    Errors: Signal Integrity
        Link Qual Indicator                      5 (Excellent)
    Errors: Security
    Errors: Routing and Other Errors
    Performance: Congestion
        Xmit Wait                            10899
    Performance: Bubbles

        VL Number    1
            Performance: Transmit
                 Xmit Data                              736 MB (92044480 Flits)
                 Xmit Pkts                           148719
            Performance: Receive
                 Rcv Data                               736 MB (92037226 Flits)
                 Rcv Pkts                            148583
            Performance: Congestion
                 Xmit Wait                             6057
            Performance: Bubbles
            Errors: Other

        VL Number    2
            Performance: Transmit
                 Xmit Data                              736 MB (92039568 Flits)
                 Xmit Pkts                           147831
            Performance: Receive
                 Rcv Data                               736 MB (92038857 Flits)
                 Rcv Pkts                            148803
            Performance: Congestion
                 Xmit Wait                             4833
            Performance: Bubbles
            Errors: Other

Verify that the port counters at VirtualFabric1 and VirtualFabric2 are incrementing by displaying port counters of fabric port for LID 1, port 1, and VirtualFabric1:

# opapaquery -o vfPortCounters -l 1 -P 1 -V VirtualFabric1 | grep -v 0$
Getting Port Counters...
PM Controlled VF Port Counters (total) for node LID 0x0001, port number 1:
VF name: VirtualFabric1
Performance: Transmit
    Xmit Data                              736 MB (92044480 Flits)
    Xmit Pkts                           148719
Performance: Receive
    Rcv Data                               736 MB (92037226 Flits)
    Rcv Pkts                            148583
Routing and Other Errors:
Congestion:
    Xmit Wait                             6057
Bubbles:
ImageTime: Tue Nov 21 21:45:53 2017
opapaquery completed: OK

Similarly, for VirtualFabric2:

# opapaquery -o vfPortCounters -l 1 -P 1 -V VirtualFabric2 | grep -v 0$
Getting Port Counters...
PM Controlled VF Port Counters (total) for node LID 0x0001, port number 1:
VF name: VirtualFabric2
Performance: Transmit
    Xmit Data                              736 MB (92039568 Flits)
    Xmit Pkts                           147831
Performance: Receive
    Rcv Data                               736 MB (92038857 Flits)
    Rcv Pkts                            148803
Routing and Other Errors:
Congestion:
    Xmit Wait                             4833
Bubbles:
ImageTime: Tue Nov 21 21:50:23 2017
opapaquery completed: OK

View the data counters per virtual lane (shows only if different than 0):

# opapmaquery -o getdatacounters -n 0x2 -w 0x6 | grep -v 0$
 Port Select Mask   0x0000000000000002
 VL Select Mask     0x00000006
    Port Number     1
        Xmit Data                             2226 MB (278332959 Flits)
        Rcv Data                              2236 MB (279538592 Flits)
        Rcv Pkts                            506372
    Signal Integrity Errors:
        Link Qual. Indicator                     5 (Excellent)
    Congestion:
        Xmit Wait                            10899
    Bubbles:
        VL Number     1
             Xmit Data                              736 MB (92044480 Flits)
             Rcv Data                               736 MB (92037226 Flits)
             Xmit Pkts                           148719
             Rcv Pkts                            148583
             Xmit Wait                             6057
        VL Number     2
             Xmit Data                              736 MB (92039568 Flits)
             Rcv Data                               736 MB (92038857 Flits)
             Xmit Pkts                           147831
             Rcv Pkts                            148803
             Xmit Wait                             4833

Finally, view the error counters for 16 virtual lanes; this shows errors if different than 0:

# opapmaquery -o geterrorcounters -n 0x2 -w 0xffff | grep -v 0$
 Port Select Mask   0x0000000000000002
 VL Select Mask     0x0000ffff
    Port Number     1
    Signal Integrity Errors:
    Security Errors:
    Routing and Other Errors:
         VL Number     1
         VL Number     2
         VL Number     3
         VL Number     4
         VL Number     5
         VL Number     6
         VL Number     7
         VL Number     8
         VL Number     9
         VL Number     11
         VL Number     12
         VL Number     13
         VL Number     14
         VL Number     15

Conclusion

Configuring vFabrics in Intel® OPA requires users to manually edit the configuration file. This document shows how to set up two virtual fabrics by editing the opafm.xml file. The necessary command lines are shown to set up and verify the newly created virtual fabrics and to clear counters. The Intel® MPI Benchmarks are used to run traffic in each virtual fabric. Traffic and errors in the virtual fabrics can be verified by looking at the counters.



License manager installer reports invalid license

$
0
0

When installing version 2.6 or higher of the Intel® Software License Manager, it may report that the license file provided is invalid.  This can be caused by a mismatch in the hostname between the license file and OS.  The license file hostname must contain the entire value returned by running the hostname OS command.  

For example:

$ hostname
myhost.mydomain.com
$

The license file must contain myhost.mydomain.com. If it only has myhost, then it will fail the validation check in the installer.

What's New? OpenCL™ Runtime 16.1.2 (CPU only)

$
0
0

The 16.1.2 release update includes:

  •  New optional __attribute__((intel_vec_len_hint(<uint>)))
    • This attribute can be used to provide a hint to the compiler that the kernel will perform best if vectorized to the specified vector length.
    • You can specify one of the following lengths for this attribute:
      uint | Description
      0    | The compiler uses heuristics to decide whether to vectorize the kernel, and if so, which vector length to use. This is the default behavior.
      1    | No vectorization is performed by the compiler. Explicit vector data types in kernels are left intact.
      4    | Disables heuristics and vectorizes to a length of 4.
      8    | Disables heuristics and vectorizes to a length of 8.
  • New OpenCL™ C predefined macro __INTEL_OPENCL_CPU_<CPUSIGN>
    • This macro can be used to fine tune the kernel for a specific CPU device microarchitecture. <CPUSIGN> is the CPU signature of a device.
    • You can specify one of the following values for this macro:
      Macro                         | Intel Microarchitecture
      __INTEL_OPENCL_CPU_SKL__      | Intel® microarchitecture code name Skylake
      __INTEL_OPENCL_CPU_SKX__      | Intel® microarchitecture code name Skylake on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_BDW__      | Intel® microarchitecture code name Broadwell
      __INTEL_OPENCL_CPU_BDW_XEON__ | Intel® microarchitecture code name Broadwell on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_HSW__      | Intel® microarchitecture code name Haswell
      __INTEL_OPENCL_CPU_HSW_XEON__ | Intel® microarchitecture code name Haswell on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_IVB__      | Intel® microarchitecture code name Ivy Bridge
      __INTEL_OPENCL_CPU_IVB_XEON__ | Intel® microarchitecture code name Ivy Bridge on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_SNB__      | Intel® microarchitecture code name Sandy Bridge
      __INTEL_OPENCL_CPU_SNB_XEON__ | Intel® microarchitecture code name Sandy Bridge on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_WST__      | Intel® microarchitecture code name Westmere
      __INTEL_OPENCL_CPU_WST_XEON__ | Intel® microarchitecture code name Westmere on Intel® Xeon® processor family
      __INTEL_OPENCL_CPU_UNKNOWN__  | Unknown microarchitecture
  •  Improved heuristics for choosing local size when ndrange is enqueued to the command queue that was created with the CL_QUEUE_THREAD_LOCAL_EXEC_ENABLE_INTEL property (extension https://www.khronos.org/registry/OpenCL/extensions/intel/cl_intel_thread_local_exec.txt).
  •  A fix for a previous issue where an incorrect library was loaded when running on Intel® microarchitecture code name Skylake.

Check out the release notes for the previous release here.

Video Transcoding on Intel® Xeon® Scalable Processor with FFmpeg*

$
0
0

Abstract

Video streaming is becoming a common practice across many different fields and companies. Software vendors have to deliver video streams efficiently and quickly, while maintaining a high level of content quality. FFmpeg*1 is widely used to meet these requirements for video and audio compressing and decompressing, and provides a way to make use of multiple different libraries in one package. This white paper showcases Linux* performance improvements for an Intel® Xeon® Scalable processor on video transcoding using FFmpeg with x264 and x265 libraries, compared to the previous-generation Intel Xeon processor E5-2699 v4.

FFmpeg* Basics

FFmpeg is a free software framework that is used for transcoding multimedia files including audio and video. It utilizes several other libraries and codecs, and packages them into one software bundle. Users are able to either quickly complete transcoding with minimal options, or get more advanced and provide different optimizations, depending on their needs. In most cases, a raw input video is passed into FFmpeg, which is then converted to a more common file format accessible by a wide range of devices including smartphones and desktops. This is done by breaking down the stream into individual decoded frames to then repackage in the new format. Often FFmpeg is used to do this live on the Internet; when watching a video online, this process is completed to get the video to stream on your device. FFmpeg can also be used to combine video and audio, and add various filters to improve or modify the resulting output video file.

Many large to small software vendors make use of FFmpeg for their multimedia content creation and delivery. FFmpeg’s popularity can largely be attributed to it being free software and supporting a wide array of formats and file types, both legacy and new. It is also highly portable across many different hardware configurations and operating system versions. Using FFmpeg instead of directly using the codecs themselves is also advantageous, as FFmpeg supports multiple outputs with one input.
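
As an illustration of the one-input, multiple-output point, a single FFmpeg run of the following shape could produce both an x264 and an x265 output; the file names and settings here are illustrative only.

# Sketch only: one decode of the input feeds two encoders in a single run.
ffmpeg -i input.y4m \
    -c:v libx264 -preset medium -b:v 5M output_h264.mp4 \
    -c:v libx265 -preset medium -b:v 5M output_h265.mp4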

Figure 1: FFmpeg transcoding high-level flowchart.

x264 and x265

x264 and x265 are implementations of the h.264 and h.265 encoding methods, respectively. The h.264 encoding standard (also known as Advanced Video Coding, or AVC) was developed first. h.265 (known as High Efficiency Video Coding, or HEVC) is in general an extension of h.264. In most cases, h.265 outperforms h.264 by providing extra compression without losing video quality. However, h.265 requires more computing power and is still being adopted by most hardware. Since h.264 has been around longer, it is currently supported by most hardware and can be used on a wider range of devices.

Best Practice

Setup Guide: The following guide was followed for this white paper:

https://trac.ffmpeg.org/wiki/CompilationGuide/Centos

During analysis we installed only Yasm*, libx264, and libx265, and then installed FFmpeg. Note that because not all libraries were used, not all libraries needed to be enabled for FFmpeg. In this particular workload, the General Public License (GPL), libx264, libx265, and non-free options were enabled.
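
A configure line of roughly the following shape enables those options; this is a sketch only, and the full set of paths and flags from the CentOS compilation guide is omitted.

# Sketch only: enable the GPL, non-free, libx264, and libx265 options when
# configuring FFmpeg (additional prefix/path flags from the guide omitted).
./configure --enable-gpl --enable-nonfree --enable-libx264 --enable-libx265
make
make install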

In FFmpeg, the preset determines how fast the encoding process will be, at the expense of compression efficiency. Put differently, if you choose ultrafast, the encoding process is going to run fast, but the file size will be larger when compared to medium. The visual quality will be the same. Valid presets are ultrafast, superfast, veryfast, faster, fast, medium, slow, slower, veryslow, and placebo.

Video bitrate is the rate of video data transmitted over time. The higher the bitrate, the better the quality, but the longer the encode takes. This, together with the preset, influences the encoding speed and video quality, which is important when deciding which parameters to use when streaming to different target devices and connections.

How different preset settings and bitrate sizes affect encoding time:

Figure 2: Image from https://trac.ffmpeg.org/wiki/Encode/H.264.

Intel® Xeon® Platinum 8180 Processor versus Intel® Xeon® Processor E5-2699 v4

Platform Test Configurations

Hardware and software

FFmpeg: http://ffmpeg.org/releases/ffmpeg-snapshot.tar.bz2

x264: https://git.videolan.org/

x265: https://bitbucket.org/multicoreware/x265

Table 1: Test systems hardware and software configurations.

 

                                     | Intel® Xeon® Platinum 8180 Processor-based system | Intel® Xeon® Processor E5-2699 v4-based system
#Sockets, cores per socket           | 2 sockets, 28 cores                               | 2 sockets, 22 cores
#Logical cores (Intel® Hyper-Threading Technology enabled) | 112                         | 88
Processor base frequency             | 2.50 GHz                                          | 2.20 GHz
Max turbo frequency                  | 3.80 GHz                                          | 3.60 GHz
Memory capacity (test systems setup) | 192 GB                                            | 128 GB
Memory frequency                     | DDR4 2666 MHz                                     | DDR4 2400 MHz
Max # of memory channels             | 6                                                 | 4
Operating system                     | CentOS* 7.4                                       | CentOS 7.4
TDP (thermal design power)           | 205 W                                             | 145 W

Memory Latency and Bandwidth

Using Intel® Memory Latency Checker (Intel® MLC) v3.4 default benchmark, the memory latencies and bandwidth of both platforms were measured. The Intel® Xeon® Platinum 8180 processor system shows better latency and more memory bandwidth than the Intel Xeon processor E5-2699 v4 based platform.

Figure 3: Memory latency and bandwidth comparisons.

FFmpeg Workload2

Transcoding profiles and workloads: speed test, capacity test (channel density):

  • Input y4m video source: https://media.xiph.org/video/derf/
  • Y4M file is a recognized FFmpeg format that stores uncompressed images that make up the video frames before compressing into MPEG-2 format.

Workload scripts:

ffmpeg -i in.y4m -codec:v libx264 -preset <preset> -b:v <bitrate> -maxrate <bitrate> -bufsize <2*bitrate> -psnr out.264
 
ffmpeg -i in.y4m -codec:v libx265 -preset <preset> -b:v <bitrate> -maxrate <bitrate> -bufsize <2*bitrate> -psnr out.265

Preset: slow, faster

Bitrate: x264: 1Mbps, 5Mbps, 10Mbps, 15Mbps  |  x265: 5Mbps, 10Mbps, 15Mbps, 20Mbps

2Represents a typical video transcoding workload but other usages will have different FFmpeg configuration and settings.

Performance

FFmpeg speed test

The speed test's performance metrics are measured in FPS (frames per second) while encoding a single y4m input to x264 and x265 files. For the x264 speed test, the results did not vary much with bitrate when comparing the Intel Xeon Platinum 8180 processor against the Intel Xeon processor E5-2699 v4. The scaling is similar for all resolutions and presets except for 4K (3840 x 2160) with the preset set to faster, where scaling is a little lower. In the x265 speed test, the scaling is much higher compared to similar workloads in x264, and the bitrate size makes a bigger difference in performance.

Figure 4: x264 speed test (converting y4m file to x264).

Figure 5: x265 speed test (converting y4m file to x265).

FFmpeg capacity test (channel density) N-to-N

In the capacity tests, the goal is to run as many instances of the speed test until platform %CPU utilization is > 90 percent. We use numactl on both systems combined with the speed test scripts to guarantee higher local memory utilization. In both x264 and x265 workloads, Intel Xeon Platinum 8180 processor scaled well against the previous-generation platform, although performance scaling is higher with x265 workloads.
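
For reference, each instance of the capacity test can be pinned with numactl as in the sketch below; the core range, memory node, and encode settings are illustrative assumptions.

# Sketch only: pin one speed-test instance to socket 0 cores and memory.
numactl --physcpubind=0-27 --membind=0 \
    ffmpeg -i in.y4m -codec:v libx264 -preset slow -b:v 10M \
           -maxrate 10M -bufsize 20M -psnr out.264 &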

Figure 6: Capacity test (run N instances of speed test until %CPU is > 90 percent).

Performance Analysis

So why is the Intel® Xeon® Scalable processor system performing better than the previous-generation Intel Xeon processor E5-2699 v4 based platform in transcoding videos? Several improvements in the new architecture, such as more efficient instructions per cycle, higher turbo frequency, and higher memory bandwidth, provided the performance improvement.

To further understand what's going on with the two platforms, we ran one of the capacity test workloads (launched several instances of FFmpeg encoding a 1920 x 1080 y4m video to x264 until platform %CPU utilization was > 90 percent) and analyzed the CPU activity.

We used numactl --physcpubind [cpus] to control the non-uniform memory access (NUMA) policy for shared memory, ensuring each core will be using local memory as much as possible.

As seen in Figures 7 and 8, the Intel Xeon Platinum 8180 processor not only has a higher CPU operating frequency, but also more efficient (lower) cycles per instruction (CPI) than the Intel Xeon processor E5-2699 v4.

Figure 7: CPU operating frequency.

Figure 8: CPI (lower is better).

Similar to the memory latency chart in Figure 3, the memory bandwidth capacity of the Intel Xeon Platinum 8180 processor system is much higher, having six memory channels, while running one of the capacity test workloads. This means that you have more capacity to run more instances/channels of the transcoding workload simultaneously.

Figure 9: Memory bandwidth while running a workload.

Summary

Video is becoming one of the most popular mediums to deliver or share content, either via live streaming or video on demand. FFmpeg has become a popular framework for video transcoding with codecs such as AVC/x264 and HEVC/x265. The Intel Xeon Scalable processor-based platform provides a performance improvement compared to the previous-generation Intel Xeon processor E5 v4 family. Improvements in instructions per cycle, higher turbo frequency, and higher memory bandwidth support provide a more efficient way to transcode videos.

About the Authors

Meghan Gorman is an Application Engineer in Intel's Software and Services Group, working on application tuning and optimization for Intel® architecture.

Rodolfo De Vega is an Application Engineer in Intel's Software and Services Group, working on application tuning and optimization for Intel architecture.

References

  1. FFmpeg documentation: https://www.ffmpeg.org/documentation.html
  2. Compile FFmpeg on CentOS: https://trac.ffmpeg.org/wiki/CompilationGuide/Centos


Use Intel® QuickAssist Technology Efficiently with NUMA Awareness

$
0
0

Introduction

Intel® QuickAssist Technology (Intel® QAT) delivers high-performance capabilities for commonly used encryption/decryption and compression/decompression operations. With the advent of the Intel® Xeon® processor Scalable family, some servers now ship with Intel QAT onboard as part of the system’s Platform Controller Hub. It’s a great time for data center managers to learn how to optimize their operations with the power of Intel QAT.

Like any other attached device, Intel QAT should be used carefully to obtain optimal performance, especially in light of non-uniform memory access (NUMA) concerns on multisocket servers. When using an Intel QAT device identifier, it is best to be aware of where it sits in the system topology so you can intelligently direct workloads into and out of the Intel QAT device channels.

In this article, we show how to discover your system’s NUMA topology, find your Intel QAT device identifiers, and configure your Intel QAT drivers to ensure that you are using the system as efficiently as possible.

Why Is NUMA Awareness Important?

In general, NUMA awareness is important whenever you are using a server that supports two or more physical processor sockets. Each processor socket has its own connections for accessing the system’s main memory and device buses. In a NUMA-based architecture, there will be physical memory and device buses that are accessed from specific processor sockets. Take, for example, a two-socket server in which there are processors numbered 0 and 1. When accessing the memory that is directly attached to processor 0, processor 1 will have to cross an inter-processor bus. This access is “non-uniform” in the sense that processor 0 will access this memory faster than processor 1 does, due to the longer access distance. Under some (but not all) conditions, accessing memory or devices across NUMA boundaries will result in decreased performance.


Figure 1.  A simple, abstracted two-socket NUMA layout.

It is outside the scope of this article to discuss specific NUMA scenarios, architectures, and solutions, or the details of how memory and I/O buses are accessed in these scenarios. We will instead discuss an example of a current-model, two-socket Intel® architecture server as we discuss Intel QAT utilization. The general ideas described here can be extrapolated to other scenarios such as four- and eight-socket servers.

For more general information about NUMA, read NUMA: An Overview.

Discovering NUMA Topology

Note that in Figure 1, the processor sockets are labeled as NUMA nodes. Usually this is a direct mapping of node to socket, but it doesn’t have to work this way. When determining the NUMA layout, read the hardware documentation carefully to understand the exact definition and boundaries of a NUMA node.

A commonly available package for determining these layouts is the hwloc package, more formally known as the Portable Hardware Locality project. The site includes complete instructions for obtaining and installing hwloc for your OS.
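
On a RHEL or CentOS system, an install along the lines of the sketch below is typically sufficient; the package names are assumptions, and the graphical lstopo front end may live in a separate hwloc-gui package.

# Sketch only: package names are assumed for RHEL/CentOS.
sudo yum install hwloc hwloc-gui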

Once installed, you can use the lstopo command to determine your system’s topology. The tool can output in graphical format (see Figure 2 for an example) or in plain text.


Figure 2.  Graphical output from the lstopo command.

The output shows that the system queried has two NUMA nodes. Here is the full command that was used to produce this output:

$ lstopo --ignore PU --merge --no-caches

These options remove processing units (processor core details) and processor caches from the output, and merge away levels that do not affect the NUMA hierarchy. That information is interesting and useful for other queries, but is of little value in assessing Intel QAT location, at least for now.

Locating Intel QuickAssist Technology (Intel® QAT) Within the NUMA Topology

Locating Intel QAT devices can be a challenge, even with the help of lstopo. We must know which device identifier to look for. To do so, we'll use the lspci command, which is generally in the pciutils package. Most Linux* distributions include it by default.

Here is a command that can help locate the Intel QAT coprocessor devices:

$ lspci -vv | less

This command generates quite a bit of output. To find your Intel QAT devices, type /qat to search for the driver, and then page up or down to view entries. You should find entries that look like this one:

We've highlighted in red two items in the output. The one at the bottom is the one we found by searching the output, indicating that this device is controlled by the Intel QAT kernel driver. The second is the device identifier 37c8. You can see that this identifier is in the lstopo output shown in Figure 2. You can also see it in the text output, shown here:

We've elided some of the output (where “...” is shown) to make things visually clearer. However, either this output’s indentation or the graphical output shown in Figure 2 shows that all three Intel QAT devices represented on this particular system fall into NUMA node 0.

Locating SR-IOV Devices

If your Intel QAT kernel driver was installed with support for hosting Single-Root I/O Virtualization (SR-IOV), you will see many more devices present (see Figure 3).


Figure 3.  SR-IOV devices.

The graphical output has helpfully collapsed the SR-IOV virtual functions (VFs) into arrays of 16 devices per physical function (PF) device. In text format, each device identifier will be shown in the lstopo (and lspci) output individually.

Note that since VFs are derived from their host PFs, they are still going to be installed to the same NUMA nodes as the PFs. Thus, in our example, all 48 VFs are in NUMA node 0.

Using Intel® QAT with NUMA Awareness

The good news is that it is quite likely that your Intel QAT driver installation has already set up the Intel QAT devices correctly for the NUMA topology of your system. To check that it is correct and to fine tune it, however, you must understand the topology, where Intel QAT exists within it, and how to ensure Intel QAT uses processor cores that are within the same NUMA node. We now know how to do the first two of those activities; let's examine the last.

When we worked with lstopo previously, we used the --ignore PU flag so that the output would not be cluttered with extra information. Now we want to see the layout of processor cores, so let’s try it without the flag:


Figure 4.  Four cores per NUMA node.

In this case, we actually cheated a little. Here's the command that generated that output:

$ lstopo --input "n:2 4" --merge --no-caches

We asked lstopo to simulate a four-core-per-node NUMA topology with two nodes. This was for the benefit of seeing the output, since the machine we've been running our tests on has 48 processor cores on board. This would create very long output, in the case of text, or very wide output, in the case of graphics.

The important part is seeing the processor numbering. Here it is evident that processors 0‒3 are installed to NUMA node 0 and processors 4‒7 are installed to NUMA node 1. Now we are ready to configure our Intel QAT drivers appropriately.

Let's revisit the output from lspci above; this time we'll highlight a different value:

The kernel driver reported to be in use for our current Intel QAT installation is c6xx. The configuration files for the devices in use can be found in /etc with this prefix:

$ ls /etc/c6xx*
/etc/c6xx_dev0.conf
/etc/c6xx_dev1.conf
/etc/c6xx_dev2.conf

Note that if you have SR-IOV enabled, you will also see the VF device configuration files in /etc, but it is unnecessary to configure them since they will be slaved to their parent host PF configuration. Their affinity selections will be centered on the (likely) single- and dual-core virtual machines (VMs) that they are used by.

Now we can examine the processor affinity selections within the configuration files. Here's an easy way to get that information:

$ sudo grep Core /etc/c6xx_dev*.conf
/etc/c6xx_dev0.conf:Cy0CoreAffinity = 0
/etc/c6xx_dev0.conf:Dc0CoreAffinity = 0
/etc/c6xx_dev0.conf:Cy0CoreAffinity = 1
/etc/c6xx_dev0.conf:Cy1CoreAffinity = 2
/etc/c6xx_dev0.conf:Cy2CoreAffinity = 3
/etc/c6xx_dev0.conf:Cy3CoreAffinity = 4
/etc/c6xx_dev0.conf:Cy4CoreAffinity = 5
/etc/c6xx_dev0.conf:Cy5CoreAffinity = 6
/etc/c6xx_dev0.conf:Dc0CoreAffinity = 1
/etc/c6xx_dev0.conf:Dc1CoreAffinity = 2
/etc/c6xx_dev1.conf:Cy0CoreAffinity = 0
/etc/c6xx_dev1.conf:Dc0CoreAffinity = 0
/etc/c6xx_dev1.conf:Cy0CoreAffinity = 9
/etc/c6xx_dev1.conf:Cy1CoreAffinity = 10
/etc/c6xx_dev1.conf:Cy2CoreAffinity = 11
/etc/c6xx_dev1.conf:Cy3CoreAffinity = 12
/etc/c6xx_dev1.conf:Cy4CoreAffinity = 13
/etc/c6xx_dev1.conf:Cy5CoreAffinity = 14
/etc/c6xx_dev1.conf:Dc0CoreAffinity = 9
/etc/c6xx_dev1.conf:Dc1CoreAffinity = 10
/etc/c6xx_dev2.conf:Cy0CoreAffinity = 0
/etc/c6xx_dev2.conf:Dc0CoreAffinity = 0
/etc/c6xx_dev2.conf:Cy0CoreAffinity = 17
/etc/c6xx_dev2.conf:Cy1CoreAffinity = 18
/etc/c6xx_dev2.conf:Cy2CoreAffinity = 19
/etc/c6xx_dev2.conf:Cy3CoreAffinity = 20
/etc/c6xx_dev2.conf:Cy4CoreAffinity = 21
/etc/c6xx_dev2.conf:Cy5CoreAffinity = 22
/etc/c6xx_dev2.conf:Dc0CoreAffinity = 17
/etc/c6xx_dev2.conf:Dc1CoreAffinity = 18

The core affinity settings specify which core numbers to use for various Intel QAT functions. As mentioned at the beginning of this section, it is likely that the installation of your Intel QAT drivers has already configured this, but you should check to make sure. Note that in the above output, affinity values are recorded for processors 0‒22. In the system that we've been using for examples, processors 0‒23 are on NUMA node 0 and therefore co-resident with all the Intel QAT devices. This machine is configured without cross-NUMA-node problems.

If we did find core affinity set to cross the NUMA node boundary, we would want to edit these files to specify core numbers within the same node as the device under consideration. After doing that, we would reset the drivers with the following command:

$ sudo adf_ctl restart
Restarting all devices.
Processing /etc/c6xx_dev0.conf
Processing /etc/c6xx_dev1.conf
Processing /etc/c6xx_dev2.conf

A note about virtual machines and VFs

In an SR-IOV configuration, it is wise to also pin your physical CPU usage to cores that are in the same NUMA node as your Intel QAT devices. You can do this in various ways, depending on your hypervisor of choice. See your hypervisor's documentation to determine how to best allocate NUMA resources within it. For example, QEMU*/KVM* allows usage of the -numa flag to specify a particular NUMA node for a VM to run within.
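
For example, with QEMU/KVM a launch along the lines of the sketch below pins the guest's memory and vCPUs to host NUMA node 0 and describes a single guest NUMA node; all sizes, CPU counts, and the disk image name are illustrative assumptions.

# Sketch only: numactl pins host memory/CPU usage to node 0; -numa describes
# the guest-visible topology. All values are illustrative.
numactl -m 0 -N 0 \
    qemu-system-x86_64 -enable-kvm -m 4096 -smp 4 \
        -numa node,nodeid=0,cpus=0-3,mem=4096 \
        -drive file=guest.img,format=qcow2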

Running applications

Finally, for applications that utilize Intel QAT, it may be optimal to ensure that they too operate within the same NUMA node boundaries. This is easily accomplished by launching them in a NUMA-aware fashion with the numactl command. It is available in the numactl package on most operating systems. The easiest way to ensure usage of local resources is as follows:

$ numactl -m 0 -N 0 <command> <arguments>

This tells the system to launch command with arguments (that is, a normal application launch) and ensure that both memory and CPU utilization are isolated to NUMA node 0. If your targeted Intel QAT devices are on node 0, you might try launching this way to ensure optimal NUMA usage.

Summary

We examined how to discover your system's NUMA topology, locate Intel QAT devices within that topology, and adjust the configuration of the Intel QAT driver set and application launch to best take advantage of the system.

In general, a well-installed system with proper Intel QAT drivers will likely not need much adjustment. However, if Intel QAT is configured for cross-node operation, performance may suffer.

About the Author

Jim Chamings is a senior software engineer at Intel Corporation. He works for the Intel Developer Relations Division, in the Data Center Scale Engineering team, specializing in Cloud and SDN/NFV. You can reach him at jim.chamings@intel.com.

Best Known Methods: Firewall Blocks MPI Communication among Nodes


This article shares three methods you can use when dealing with the firewall blocking the Message Passing Interface (MPI) communication among many machines. For example, when running an MPI program between two machines, you might see a communication error like this:

[proxy:0:1@knl-sb0] HYDU_sock_connect (../../utils/sock/sock.c:268): unable to connect from "knl-sb0" to "knc4" (No route to host)
[proxy:0:1@knl-sb0] main (../../pm/pmiserv/pmip.c:461): unable to connect to server knc4 at port 39652 (check for firewalls!)

This symptom suggests the MPI ranks cannot communicate with each other, because the firewall blocks the MPI communication.

Below are three methods to help you solve this problem.

First Method: Stop the firewalld Daemon

The first and simplest method is to stop the firewall on the machine where you run the MPI program. First, check the status of the firewalld daemon on a Red Hat Enterprise Linux* (RHEL*) or CentOS* system.

$ systemctl status firewalld
firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2017-12-05 21:36:10 PST; 12min ago
 Main PID: 47030 (firewalld)
   CGroup: /system.slice/firewalld.service
           47030 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

The output shows that firewalld is running. You can stop it and verify its status with the following command lines:

$ sudo systemctl stop firewalld
$ systemctl status firewalld
firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Tue 2017-12-05 21:51:19 PST; 4s ago
  Process: 48062 ExecStart=/usr/sbin/firewalld --nofork --nopid $FIREWALLD_ARGS (code=exited, status=0/SUCCESS)
 Main PID: 48062 (code=exited, status=0/SUCCESS)

With firewalld now stopped, you should be able to run your MPI program between the two machines (in this example, I use the Intel® MPI Benchmarks IMB-MPI1 as the MPI program).

$ mpirun -host localhost -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv : -host 10.23.3.61 -n 1  /opt/intel/impi/2018.0.128/bin64/IMB-MPI1
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Dec  5 21:51:45 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.el7.x86_64
# Version               : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#---------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#---------------------------------------------------------------------------
    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec

         0         1000        16.57       16.57         16.57          0.00
         1         1000        16.57       16.57         16.57          0.12
         2         1000        16.52       16.53         16.53          0.24
         4         1000        16.58       16.58         16.58          0.48
         8         1000        16.51       16.51         16.51          0.97
        16         1000        16.20       16.20         16.20          1.98
        32         1000        16.32       16.32         16.32          3.92
        64         1000        16.55       16.55         16.55          7.73
       128         1000        16.65       16.65         16.65         15.37
       256         1000        29.07       29.09         29.08         17.60
       512         1000        30.75       30.76         30.76         33.29
      1024         1000        31.13       31.15         31.14         65.75
      2048         1000        33.58       33.58         33.58        121.98
      4096         1000        34.79       34.80         34.80        235.38

However, this method can pose a problem because it leaves the machine vulnerable to security issues, so it may not be suitable in some scenarios. In that case, start the firewalld daemon again, and then try the second method.

$ sudo systemctl start firewalld

Second Method: Use Rich Rule in firewalld

This method uses the Rich Rule feature in firewalld to accept only IPv4 packets from the other machine, whose IP address is 10.23.3.61.

$ sudo firewall-cmd --add-rich-rule='rule family="ipv4" source address="10.23.3.61" accept'
Success

Verify the rule you just added.

$ firewall-cmd --list-rich-rules
rule family="ipv4" source address="10.23.3.61" accept

Run the MPI program.

$ mpirun -host localhost -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv : -host 10.23.3.61 -n 1  /opt/intel/impi/2018.0.128/bin64/IMB-MPI1
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Dec  5 22:01:17 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.el7.x86_64
# Version               : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version           : 3.1
# MPI Thread Environment:


# Calling sequence was:

# /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Sendrecv

#---------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#---------------------------------------------------------------------------
    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
         0         1000        16.88        16.88        16.88         0.00
         1         1000        16.86        16.86        16.86         0.12
         2         1000        16.57        16.57        16.57         0.24
         4         1000        16.55        16.55        16.55         0.48
         8         1000        16.40        16.40        16.40         0.98
        16         1000        16.29        16.29        16.29         1.96
        32         1000        16.63        16.63        16.63         3.85
        64         1000        16.87        16.87        16.87         7.59
       128         1000        17.03        17.04        17.03        15.03
       256         1000        27.58        27.60        27.59        18.55
       512         1000        27.52        27.54        27.53        37.18
      1024         1000        26.87        26.89        26.88        76.16
      2048         1000        28.62        28.64        28.63       143.02
      4096         1000        30.27        30.27        30.27       270.62

^C[mpiexec@knc4] Sending Ctrl-C to processes as requested
[mpiexec@knc4] Press Ctrl-C again to force abort

You can remove a Rich Rule that you defined by entering the following command:

$ sudo firewall-cmd --remove-rich-rule='rule family="ipv4" source address="10.23.3.61" accept'
success

Third Method: Add a Rule in iptables-service to Accept Packets from Other Machines

In addition to firewalld, the iptables service can also be used to manage the firewall on a RHEL or CentOS system. In this method, you add a rule to accept traffic from the other machine.

First, download and install the iptables-services package.

$ sudo yum install iptables-services

Next, start the iptables service.

$ sudo systemctl start iptables
$ systemctl status iptables
iptables.service - IPv4 firewall with iptables
   Loaded: loaded (/usr/lib/systemd/system/iptables.service; disabled; vendor preset: disabled)
   Active: active (exited) since Tue 2017-12-05 21:53:41 PST; 55s ago
  Process: 49042 ExecStart=/usr/libexec/iptables/iptables.init start (code=exited, status=0/SUCCESS)
 Main PID: 49042 (code=exited, status=0/SUCCESS)

Dec 05 21:53:41 knc4-jf-intel-com systemd[1]: Starting IPv4 firewall with iptables...
Dec 05 21:53:41 knc4-jf-intel-com iptables.init[49042]: iptables: Applying firewall rules: [ OK ]
Dec 05 21:53:41 knc4-jf-intel-com systemd[1]: Started IPv4 firewall with iptables.

The firewall rules are defined in the /etc/sysconfig/iptables file.

$ sudo cat /etc/sysconfig/iptables

# sample configuration for iptables service
# you can edit this manually or use system-config-firewall
# please do not ask us to add additional ports/services to this default configuration
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A INPUT -p icmp -j ACCEPT
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m state --state NEW -m tcp --dport 22 -j ACCEPT

Display the currently defined rules; there shouldn't be any yet. To add a rule to accept packets from the other machine, specify its IP address.

$ firewall-cmd --direct --get-all-rules
$ sudo firewall-cmd --direct --add-rule ipv4 filter INPUT 0 -s 10.23.3.61 -j ACCEPT
success
$ firewall-cmd --direct --get-all-rules
ipv4 filter INPUT 0 -s 10.23.3.61 -j ACCEPT

After adding the new rule, the above command line confirms that the new rule has been added. Run the MPI program again to verify it works.

$ mpirun -host localhost -n 1 /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv : -host 10.23.3.61 -n 1  /opt/intel/impi/2018.0.128/bin64/IMB-MPI1
#------------------------------------------------------------
#    Intel (R) MPI Benchmarks 2018, MPI-1 part
#------------------------------------------------------------
# Date                  : Tue Dec  5 21:58:20 2017
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-327.el7.x86_64
# Version               : #1 SMP Thu Nov 19 22:10:57 UTC 2015
# MPI Version           : 3.1
# MPI Thread Environment:

# Calling sequence was:

# /opt/intel/impi/2018.0.128/bin64/IMB-MPI1 Sendrecv

# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:
# Sendrecv

#---------------------------------------------------------------------------
# Benchmarking Sendrecv
# #processes = 2
#---------------------------------------------------------------------------
    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
         0         1000        16.49        16.49        16.49         0.00
         1         1000        16.40        16.40        16.40         0.12
         2         1000        16.40        16.40        16.40         0.24
         4         1000        16.86        16.86        16.86         0.47
         8         1000        16.43        16.43        16.43         0.97
        16         1000        16.32        16.32        16.32         1.96
        32         1000        16.64        16.64        16.64         3.85
        64         1000        16.90        16.90        16.90         7.57
       128         1000        16.86        16.86        16.86        15.18
       256         1000        29.58        29.60        29.59        17.30
       512         1000        27.73        27.74        27.74        36.91
      1024         1000        28.07        28.09        28.08        72.91
      2048         1000        34.95        34.97        34.96       117.15
      4096         1000        36.22        36.23        36.22       226.12

^C[mpiexec@knc4] Sending Ctrl-C to processes as requested
[mpiexec@knc4] Press Ctrl-C again to force abort

To remove this rule, use the following command.

$ sudo firewall-cmd --direct --remove-rule ipv4 filter INPUT 0 -s 10.23.3.61 -j ACCEPT
success
$ firewall-cmd --direct --get-all-rules

Conclusion

Firewalls can block MPI communication among the nodes. This article shared three methods you can use to allow communication among MPI ranks.

Enhancing User Experience: Krita Application Utilizing Multiple Cores on Intel® Architecture Platforms


By Fredrick Odhiambo and Dmitry Kazakov

Krita is a professional, FREE, and open source painting program. It is made by artists that want to see affordable art tools for everyone.

Executive Summary

"Now I can paint with twice and even three times bigger brushes! Which actually means performance is about 5–10 times better." Rad (Igor Wect), Krita* artist.

Computer processing demands have grown more complex and intensive, and software applications demand more compute resources. This leads to performance bottlenecks and unreliable software behavior, which produce a poor user experience and may eventually turn into complete rejection of an application by its users. Multicore refers to a single physical processor package containing two or more processing units that execute tasks simultaneously, resulting in enhanced performance and reduced power consumption. Intel® Hyper-Threading Technology (Intel® HT Technology), available on the latest Intel Core™ vPro™ processors, the Intel Core™ processor family, the Intel Core™ m processor family, and the Intel® Xeon® processor family, allows a single microprocessor core to appear as two separate logical processors to the operating system and application programs. This white paper highlights how the Krita application combines multicore and Intel HT Technology to improve the application's performance and the overall user experience of Krita. Krita software developers used Intel VTune™ Amplifier to analyze use cases that Krita users had previously identified as needing improvement. The VTune Amplifier enabled quick identification of concurrency issues, which were resolved mostly through multithreading and thread-synchronization techniques. An improved version of Krita was built and the aforementioned use cases were re-tested for performance gain. The results show a measurable performance gain of up to 2X on the different system configurations. System configuration information is included in a later section of this paper.

Software Developers Value Proposition

Multicore and Intel HT Technology give software developers the ability to write applications that take advantage of these technologies and competitively position those applications by providing an outstanding user experience. Beyond that, through their applications, software developers can now:

  • Run demanding applications simultaneously while maintaining system responsiveness.
  • Keep systems protected, efficient, and manageable while minimizing impact on productivity.
  • Provide headroom for future business growth and new solution capabilities.
  • Reduce application energy consumption, since multiple cores run on a lower frequency than a single core, which significantly reduces heat dissipation.
  • Increase the number of transactions that can be processed simultaneously.
  • Utilize existing 32-bit application technologies while maintaining 64-bit future readiness.
  • Multimedia applications can create, render, edit, and encode graphically intensive files while running background functions, without compromising overall performance of the application.

Krita* Case Study

Methodology

Krita gathered application usage complaints and issues using Bugzilla at https://bugs.kde.org/. On this online forum, artists and other users report the challenges and pain points they encounter in their real-world use of the Krita application. Among the issues raised, slow brush speeds featured prominently, and Krita asked issue authors follow-up questions (for example, about their system settings) to define and delimit each issue more concretely. From the many issues raised, two use cases were prioritized: brushes and animation rendering.

Using the VTune Amplifier, Concurrency and Hotspot analyses were performed on the affected use cases to establish an in-depth view of the performance issues. These issues were resolved by rewriting code, changing libraries, incorporating Intel® Advanced Vector Extensions (Intel® AVX) fused multiply-add instructions, and multithreading the use cases as much as possible. The result is Krita version 4.0.0, which takes advantage of multicore and Intel HT Technology to improve application performance.

Krita Situation

The VTune Amplifier was used to profile Krita application performance. In-depth Concurrency and Basic Hotspot analyses were reviewed to establish performance issues in the two use cases of interest (brush painting and animation rendering). The following issues were identified.

Rendering Use Case

Each frame was being rendered sequentially, almost entirely in a single thread. After initial profiling with the VTune Amplifier (Concurrency and Waits analyses), two issues, both related to the usage of mutexes, were located:

Issue 1: Tiles hash table. All the image data in Krita is represented in a form of 256 by 256 pixel tiles, which are stored in hash tables. Each pixel layer has a hash table containing all the tiles dedicated to the layer, and each hash table is guarded by its own read-write lock (QReadWriteLock). According to the VTune Amplifier profiling results, up to 40 percent of the time was spent waiting on these locks.

Issue 2: The Krita application's internal scheduler also had a bottleneck. The tasks in the scheduler are represented as QRunnable objects running on a QThreadPool pool. It was established that the QRunnable-based tasks were too small, and there was a high probability of a thread going to sleep just after one task completed and before the next one started, so the tasks were aggregated into batches. Now each QRunnable-based task can pull extra tasks from the scheduler, avoiding the sleeps.

These two fixes significantly improved the Krita rendering engine, and it showed almost linear speed scaling up to six physical cores. When more cores were added, the mutexes started to show up again. To ensure Krita scales well to higher core counts, there were two options: either rewrite the hash table in a lock-free manner or use higher-level methods of multithreading. The second approach was chosen.

Solution

The tiles hash table issue was resolved by using two-phase locking inside the hash table (KisTileHashTableTraits<T>::getTileLazy()). First, an attempt is made to fetch the tile without any lock held (KisTileHashTableTraits<T>::getTileMinefieldWalk()); only if this fails is the lock taken and the lookup repeated. This trick is thread-safe for two reasons: thread-safe shared pointers are used inside the hash table (so in the worst case a thread just reads outdated data), and the high-level scheduler guarantees that two threads will never read and write the same pixel of the image at the same time (which means a tile cannot be deleted while another thread is still accessing it). This two-phase locking significantly reduced contention on the lock.
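The shape of this two-phase lookup can be illustrated with a small, self-contained sketch. It is not Krita's code: it uses std::shared_mutex and std::shared_ptr instead of Krita's Qt types, a read lock stands in for the real lock-free first pass, and TileHashTable and its members are hypothetical names.

#include <cstdint>
#include <memory>
#include <shared_mutex>
#include <unordered_map>

// Hypothetical tile table illustrating the two-phase lookup idea:
// try a cheap fast path first, fall back to the exclusive lock only
// when the tile is missing.
struct Tile { /* 256x256 pixel payload would live here */ };

class TileHashTable {
public:
    std::shared_ptr<Tile> getTileLazy(int col, int row) {
        const std::int64_t key = pack(col, row);

        // Phase 1: optimistic lookup under a shared (read) lock.
        {
            std::shared_lock<std::shared_mutex> readLock(m_lock);
            auto it = m_tiles.find(key);
            if (it != m_tiles.end())
                return it->second;              // common case: no write lock taken
        }

        // Phase 2: slow path takes the exclusive lock, repeats the lookup,
        // and creates the tile only if it is still missing.
        std::unique_lock<std::shared_mutex> writeLock(m_lock);
        auto it = m_tiles.find(key);
        if (it == m_tiles.end())
            it = m_tiles.emplace(key, std::make_shared<Tile>()).first;
        return it->second;
    }

private:
    static std::int64_t pack(int col, int row) {
        return (static_cast<std::int64_t>(col) << 32) | static_cast<std::uint32_t>(row);
    }

    std::shared_mutex m_lock;
    std::unordered_map<std::int64_t, std::shared_ptr<Tile>> m_tiles;
};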

When saving an animation video clip, all frames need to be rendered and saved into separate files. Technically, each frame represents a different version of the same image, but with its own pixel data, so different frames cannot be rendered on the same image at the same time. But copies of the image can be made! If each copy is passed to its own thread, the rendering processes become fully independent.

Support for making shallow copies of the image was implemented. When creating a shallow copy, most of the pixel data becomes shared between the two images; no extra memory is needed except for the temporary "projections" used for rendering. A test was done on a huge animation file from a professional animator, which occupies about 7.2 GB of RAM when fully loaded: making a shallow copy of such an image takes only about 200 MB of extra RAM, and regular users' files demand even less. Even though shallow copies don't occupy much extra memory, they still take a bit of time to create. A trade-off was therefore made: instead of creating a dedicated copy for each core, create a few copies and assign each copy to several cores. For example, in a scenario with 6+6 cores, create 4 shallow copies and assign each copy to 3 threads. As a result, the CPU is utilized at almost 100% by 12 threads running almost without interdependencies.

Caution: when making such memory/efficiency trade-offs, pay close attention to the amount of memory the user has. After a few tests with real users it was established that even though the user can adjust the number of "clones" in the settings, none of the users want to think about it. Therefore an automatic clone limit was implemented: before cloning the image, Krita calculates the actual amount of memory the projections will take and compares it to the amount of memory the user has allowed Krita to use, as sketched below.
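A minimal sketch of that kind of limit calculation might look like the following; the function name and the simple memory model are illustrative assumptions, not Krita's actual code.

#include <algorithm>
#include <cstdint>

// Hypothetical helper: cap the number of image clones so that the extra
// "projection" memory they need stays within the user's memory limit.
int calculateCloneLimit(int idealClones,                       // e.g., one clone per few threads
                        std::int64_t projectionBytesPerClone,
                        std::int64_t memoryLimitBytes,
                        std::int64_t memoryAlreadyUsedBytes)
{
    const std::int64_t available = memoryLimitBytes - memoryAlreadyUsedBytes;
    if (available <= 0 || projectionBytesPerClone <= 0)
        return 1;                                              // always keep at least one image

    const std::int64_t affordable = available / projectionBytesPerClone;
    return static_cast<int>(
        std::max<std::int64_t>(1, std::min<std::int64_t>(idealClones, affordable)));
}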

Painting Use Case

Brush painting was inherently a sequential algorithm. The brush itself is just a set of small, round images (called dabs), which are painted one by one along the stroke line with small offsets. The situation is complicated further because some dabs can have dependencies on each other: two dabs can use the same base dab but apply different post-processing to it.

Solutions

The solution for the problem consists of two parts:

  1. The dabs themselves are prepared in parallel, asynchronously from the main rendering thread. Here is the tricky part: when dabs are dispatched for parallel execution, they might become reordered, and some newer dabs will become ready before the older ones are done. A special queue was implemented (KisDabRenderingQueue) that handles all these cases, gathers the dabs into sequential batches, and passes them further down the pipeline (see the sketch after this list).
  2. Every 20–100 ms the dab rendering process is stopped, and all prepared dabs are taken and rendered onto the canvas. This rendering is also done in parallel: the working area is split into multiple smaller areas that are assigned to different threads.
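A stripped-down illustration of the reordering idea follows. It is not Krita's KisDabRenderingQueue, just a sketch assuming each dab carries a sequence number.

#include <map>
#include <mutex>
#include <vector>

struct Dab { int seqNo; /* rendered dab pixels would live here */ };

// Hypothetical reordering buffer: worker threads deposit finished dabs in any
// order; takeReadyDabs() hands back only the contiguous, in-order prefix, so
// the canvas always receives dabs in stroke order.
class DabReorderBuffer {
public:
    void push(Dab dab) {
        std::lock_guard<std::mutex> guard(m_lock);
        m_pending.emplace(dab.seqNo, std::move(dab));
    }

    std::vector<Dab> takeReadyDabs() {
        std::lock_guard<std::mutex> guard(m_lock);
        std::vector<Dab> ready;
        // Pull dabs only while the sequence stays contiguous.
        while (!m_pending.empty() && m_pending.begin()->first == m_nextSeqNo) {
            ready.push_back(std::move(m_pending.begin()->second));
            m_pending.erase(m_pending.begin());
            ++m_nextSeqNo;
        }
        return ready;
    }

private:
    std::mutex m_lock;
    std::map<int, Dab> m_pending;   // keyed by sequence number
    int m_nextSeqNo = 0;
};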

Together with the rendering core optimizations, this work ensures that the brushes now scale nicely up to six physical cores.

An interesting observation was made on the most responsive brush (Basic_tip_default), which utilizes Intel AVX technology for vectorized calculations. After all the optimizations were done, it became so fast that it almost hit the ceiling of the memory throughput (according to the VTune Amplifier, when painting with a 1000 px brush on a 5k canvas).

Future Krita Optimization Goals

  • It was established that the brush, optimized with Intel AVX technology, renders 4–6 times faster than non-optimized ones. The next step in brushes optimization is to implement vectorized versions of other brushes.
  • Utilizing specific integer instructions introduced in Intel® Advanced Vector Extensions 2 (Intel® AVX2) for composition can also give up to two times better rendering speed.
  • Implementing a lock-free hash table for tiles can also give some benefit for systems with over 6 physical cores.

Results

After implementing the solutions above, results were collected after benchmarking tests were done on two systems with system configurations below.

System 1:

Processor: Intel® Core™ i7-8550U @ 1.80GHz, 4Cores

Physical Memory: 8GB DDR4 2400MHz

Operating System: Windows 10® Pro (64 bit)

System 2:

Processor: Intel® Core™ i7-8550U @ 3.2GHz, 6Cores

Physical Memory: 32GB DDR4 2400MHz

Operating System: Windows 10® Pro (64 bit)

Use Case 1: Animation Rendering

The following data depicts the time taken, in milliseconds, for frame generation (an animation-rendering pre-step) on different core counts. This was achieved by enabling cores on the machine incrementally. Frame generation duration largely determines animation rendering time in Krita.

Note: The animation file used for the test is from Wolthera Van Hovell (A top Krita artist), https://yadi.sk/d/n9a6YRrP3Lc3Ts

Figure 1: Test animation screenshot

Figure 2: Animation results from System 2

NB: Cores in the graph above refers to physical cores + logical cores (i.e., 12 cores = 6 physical cores + 6 logical cores)

Use case 2: Brushes

For reliability, an automated script was developed and executed with different core counts on the two aforementioned systems; the script logged the brush speeds. The table below shows the time spent to render sixteen 5000-px-long strokes with a Gaussian brush of 1000 px diameter. As the core count increases, the time spent on rendering decreases (brush speed increases).

Figure 3: Gaussian Brush speed results from System 2

NB: Cores in the graph above refers to physical cores + logical cores (i.e.  12 cores = 6 physical cores + 6 logical cores)

Figure 4: Gaussian Brush speed results from system 1

NB: Cores in the graph above refers to physical cores + logical cores (i.e.  8 cores = 4 physical cores + 4 logical cores)

Conclusion and Recommendations

From the two use cases above, multiple cores and Intel HT Technology significantly improve the performance of processor-intensive computations. A performance gain of 4.37X is achieved when an animation is rendered using all cores of a 6-core system with Intel HT Technology turned on, compared to rendering the animation on one core. This visible and measurable performance gain translates to reduced waiting times for Krita users, increasing their overall productivity.

Intel provides technologies and hardware platforms that meet the ever-growing computer processing demands by software applications. Independent software developers can take advantage of these capabilities by writing multithreaded software which, in turn, offers an excellent user experience to the application users. Different tools are provided by Intel to software developers that could be used to determine software application performance bottlenecks, hotspots, and concurrency issues. An example of such a tool, and which was used in this study, is the Intel VTune Amplifier.

A Closer Look at Object Detection, Recognition and Tracking


Through machine learning, computer programs learn how to identify people and objects.

Overview

Below, we seek guidance from the dictionary to appropriately define and discern the terms object detection, object recognition and object tracking. We then explore some of the algorithms involved with each process to underpin our definitions.

A Computer Vision Trinity

The computer vision terms object detection and object recognition are often used interchangeably (where the naming of an application many times depends on who wrote the program). Another term, object tracking, can be frequently found in the company of detection and recognition algorithms. The trio can work together to make a more reliable application although it may be unclear how they are distinct and how they relate to one another (is tracking just an extension of detection?). But we can surely make a clear distinction between them by first referencing a standard dictionary and then looking at the algorithms associated with each process.

It may be helpful to think of object discovery (instead of detection) and object comprehension (as a substitute for recognition). Certainly, we know that to comprehend a thing is not the same as making a discovery of that thing. Semantics aside, we want to know about the algorithms involved with each process (to have a basic idea about what the algorithms are designed to do). When we understand the results of applying a particular model or type of algorithm, then knowing what differentiates these terms becomes more than just a matter of words—it becomes a matter of process and outcomes. But the meaning of words remains important because they first influence the way we represent reality in our heads. So let’s begin with a standard definition of detection.

Detection

A detection algorithm asks the question: is something there?

An Act of Discovery

Dictionaries are always helpful to clear up any confusion about what a word does and does not mean (especially if those words aren’t always used in a consistent way by well-meaning programmers and engineers). Merriam Webster* defines detection as:

The act or process of discovering, finding, or noticing something.

That ‘something’ could be anything (a bird, a plane or maybe a balloon). The main idea being to notice that something is there.

The goal of object detection then is to notice or discover the presence of an object (within an image or video frame). To be able to tell an object (a distinct subset of pixels) apart from the static background (a larger set of pixels, the stuff that stays mostly unchanged frame after frame). But how do we even discern an object from the background? Well, the treatment is different for image and video.

Detection for Video and Images

There’s Something Different about the Pixels in this Frame (video)

When something shows up in a frame that wasn’t in the previous frame (some new pixels on the block), we can design an algorithm to notice the difference and register that as a detection. To notice that something is there that wasn’t there before—that counts as detection. And in compliance with our above definition, examples of detection techniques for video can include background subtraction methods (a popular way to create a foreground mask) such as MOG (meaning mixture of Gaussian) and absdiff (absolute difference).

Object Boundaries (image)

Because photos are static images, we can’t use motion to detect the photo’s objects but must rely on other methods to parse out the scene. When presented with a photo of a real-life situation (say a bustling downtown street containing a multitude of different, overlapping objects and surfaces) the busy nature of the scene makes it difficult to interpret (to know the boundaries of the objects). Edge detection methods (for example, Canny Edge detection) can help to determine the objects in such a scene. Edges define object boundaries and can be found by looking at how color changes across an image (abrupt changes in grayscale level). Knowing where the edges are helps to not only detect obvious objects (a blue bike leaning against an off-white wall) but to correctly interpret slightly more complicated situations where objects may overlap (a person sitting in a chair can be seen as two distinct objects and not one large hybrid object).
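For a static image, a minimal Canny edge detection sketch with OpenCV might look like this (the file name, blur kernel, and threshold values are illustrative assumptions):

#include <opencv2/opencv.hpp>

int main() {
    // Load an image and convert it to grayscale, since Canny works on intensity.
    cv::Mat image = cv::imread("street.jpg");          // hypothetical input file
    if (image.empty()) return 1;

    cv::Mat gray, edges;
    cv::cvtColor(image, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.5); // reduce noise before edge detection

    // Canny finds abrupt changes in grayscale level, i.e., object boundaries.
    cv::Canny(gray, edges, 50, 150);

    cv::imwrite("edges.png", edges);
    return 0;
}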

The Idea behind Background Subtraction

Unlike static images, with video we deal with multiple frames and that allows us to implement background subtraction methods. The basic idea behind background subtraction is to generate a foreground mask (Figure 3). We subtract one frame from another—the current frame (Figure 2) minus the previous frame (Figure 1)—to find a difference. And then a threshold is applied to the difference to create a binary image that contains any moving or new objects in a scene. In our example below, the "difference" is the drone that flies into the scene.

Figure 1. Previous frame (background model).

Figure 2. Current frame.

Figure 3. Foreground mask.
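The frame-differencing idea just described can be sketched in a few lines of OpenCV code (a minimal illustration only; the camera index and threshold value are arbitrary assumptions):

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);                 // webcam; any video source works
    if (!cap.isOpened()) return 1;

    cv::Mat previous, current, diff, foregroundMask;

    cap >> previous;
    cv::cvtColor(previous, previous, cv::COLOR_BGR2GRAY);

    while (cap.read(current)) {
        cv::cvtColor(current, current, cv::COLOR_BGR2GRAY);

        // Current frame minus previous frame, then threshold the difference
        // into a binary foreground mask containing anything new or moving.
        cv::absdiff(current, previous, diff);
        cv::threshold(diff, foregroundMask, 25, 255, cv::THRESH_BINARY);

        cv::imshow("foreground mask", foregroundMask);
        if (cv::waitKey(30) == 27) break;    // Esc to quit

        current.copyTo(previous);            // current frame becomes the new background model
    }
    return 0;
}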

MOG, a Background Subtraction Algorithm

Mixture of Gaussians (MOG) is not to be confused with the popular Histogram of Oriented Gradients (HOG) feature descriptor, a technique (often paired with a support vector machine, a supervised machine learning model) that can be used to classify an object as either "person" or "not a person". Unlike HOG, which performs a classification task, the MOG method implements a Gaussian mixture model to subtract the background between frames. With detection techniques, the fact that there is a difference (between frames) matters, but what the difference is (is the object a person? a robot?) does not yet concern us. When we aim to identify or classify an object, that's where recognition techniques come into play.
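As a hedged sketch of the Gaussian-mixture approach, the following uses cv::createBackgroundSubtractorMOG2 from OpenCV's core video module (the MOG and GMG variants listed in Table 1 live in the opencv_contrib bgsegm module); the input file name is an assumption:

#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("drone_clip.mp4");          // hypothetical input video
    if (!cap.isOpened()) return 1;

    // Each pixel's background is modeled as a mixture of Gaussians; pixels that
    // do not fit the model are marked as foreground.
    cv::Ptr<cv::BackgroundSubtractor> mog2 = cv::createBackgroundSubtractorMOG2();

    cv::Mat frame, foregroundMask;
    while (cap.read(frame)) {
        mog2->apply(frame, foregroundMask);          // update the model and get the mask
        cv::imshow("MOG2 foreground mask", foregroundMask);
        if (cv::waitKey(30) == 27) break;            // Esc to quit
    }
    return 0;
}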

How We Get Notified of a Detection

To inform us (provide some sort of visual cue) of the detection of an object, a rectangle or box (often a brightly colored one) is often drawn around the detected thing. When something changes from frame to frame, an algorithm shouts, “Hey! What’s that group of pixels that’s just appeared (or moved) in the frame? Quick! Draw a green box around it to let the human know that we’ve detected something.”

Figure 4 below shows an object being detected by an application (with a live streaming webcam) that uses background subtraction methods. The application doesn’t have any clue what the object is. It simply looks for large regions of pixels that were not in the previous frame—it looks for a difference.

Figure 4. A bunny suit being detected by an application based on background subtraction methods.

Table 1. Detection techniques and functions from the OpenCV* library

Detection techniques: background subtraction methods, edge detection
Examples of functions/classes from the OpenCV* library: absdiff, cv::bgsegm::BackgroundSubtractorMOG, cv::bgsegm::BackgroundSubtractorGMG, Canny

Moving from the General to the Specific

Gas detectors are devices that detect or sense the presence of gas. Depending on the precision of the device, methane, alcohol vapor, propane and many more chemical compounds could sound the alarm. Metal detectors are instruments used to notice the presence of metal (to a metal detector, gold, brass, and cast iron are the same thing). And object detectors notice the presence of objects—where the objects are just regions of pixels in a frame. When we start to move from the general to the specific—gas to methane, metal to gold, object to person—the implication is that we have previous knowledge of the specific. This is what sets apart detection from recognition—knowing what the object is. We can recognize a detected gas as methane. We can identify a detected metal as gold. And we can recognize a detected object as a person. Object recognition techniques enable us to create more precise computer vision applications that can deal with the details of an object (person or primate, male or female, bird or plane). Recognition is like putting a pair of prescription glasses on detection. After putting on our glasses, we can now recognize that the small blurry object in the distance is, in fact, a cat and not a rock.

Recognition

Here, an algorithm gets more curious and asks: what’s there?

Having Previous Knowledge of Something

Merriam Webster defines recognition as:

The act of knowing who or what someone or something is because of previous knowledge or experience.

Based on that we can understand object recognition as a process for identifying or knowing the nature of an object (in an image or video frame). Recognition (applications) can be based on matching, learning, or pattern recognition algorithms with the goal being to label (classify) an object—to ask the question: what is the object?

The figure below comes from an application (using Intel® Movidius™ Neural Compute Sticks) for recognizing and labeling bird species. You can learn more about the sample application at GitHub*. Notice the “1.00” after the label “bald eagle”. There’s a confidence level associated with recognition and here the algorithm knows with 100% certainty that the object is, in fact, a bald eagle. But object recognition doesn’t always perform with such reliable accuracy.

Figure 5. Recognition of a bald eagle with perfect certainty.

Presented with another image (Figure 6), the same application is not entirely confident about any of the objects hovering above the shoreline. While not being able to recognize the object to the far left as anything specific, it’s able to correctly associate it (although not very confidently) with the more general object class of “bird”. For the other objects, it’s on the fence—it can’t decide whether or not the object is an albatross or what appears to be a barn owl.

Figure 6. A recognition application trained to recognize bird species unsure about the objects in the image.

Again, because the terms aren’t always used consistently, some may contest that they (detection and recognition) are the same thing. But by using the definitions above as a guide, surely the detection of an object (noticing that something is even there) cannot be equivalent to recognizing what that object is (being able to correctly identify the object because the algorithm has previous knowledge of it).

The act of recognition (I know that object is a bald eagle) is unlike the act of detection (I notice something there). But how exactly does an algorithm know a bald eagle when it sees it? Can we teach algorithms to know bald eagles from other bird species? You know, write a computer program detailing the nature of bald eagles and other species of birds. As it turns out, these are things we cannot effectively teach our computers (provide instruction to) and so we must devise algorithms that can learn by themselves.

Algorithms that Learn from Experience

We can further probe the nature of an object using recognition techniques (algorithms that are smart enough to know a seagull from a commercial airliner). These are algorithms that can classify an object precisely because they’ve been trained to do so—we call these machine learning algorithms. And the way an algorithm acquires knowledge about something (for example, bird species) is through training data—through exposure to tens of thousands of images of various species of bird, the algorithm can learn to recognize different kinds of birds. Machine learning algorithms work because they can extract visual features from an image. The algorithm then uses those features to associate one image (an unknown image that it’s presented with for the first time) with another (an image it has previously “seen” during its training). If the recognition application we reference above (figure 5) had never been trained with images labeled as “bald eagle”, it would have no ability to label a bird as a bald eagle when presented with one. But it might still be smart enough to know it’s a bird in general (as we’ve seen in figure 6, the object to the far left gets labeled as “bird” but not anything specific).
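As an example of that feature-plus-classifier pairing, the sketch below uses OpenCV's HOGDescriptor with its pretrained people-detecting SVM (the "HOG and Support Vector Machine" entry in Table 2 below); the input file name is an assumption:

#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat image = cv::imread("pedestrians.jpg");   // hypothetical input image
    if (image.empty()) return 1;

    // HOG extracts gradient-orientation features; the linear SVM bundled with
    // OpenCV has been trained to classify those features as person / not a person.
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    std::vector<cv::Rect> people;
    hog.detectMultiScale(image, people);

    for (const cv::Rect& r : people)
        cv::rectangle(image, r, cv::Scalar(0, 255, 0), 2);  // draw a box per recognized person

    cv::imwrite("pedestrians_labeled.jpg", image);
    return 0;
}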

Table 2. Recognition techniques and functions from the OpenCV library.

Recognition techniques: feature extraction and machine learning models, HOG and Support Vector Machine (SVM), deep learning models (convolutional neural networks)
Examples of functions/classes from the OpenCV library: FaceRecognizer, FaceRecognizer::train, createEigenFaceRecognizer, Fisherfaces for Gender Classification

Tracking

A tracking algorithm wants to know where something is headed.

Keeping a Close Eye on Something

Tracking algorithms just can’t let go. These tenacious algorithms will follow you (if you’re the object of interest), follow you wherever you will go. At least, that’s what we want from an ideal tracker.

Merriam Webster defines the verb track as:

To follow or watch the path of something or someone.

The goal of object tracking then is to keep watch on something (the path of an object in successive video frames). Often built upon or in collaboration with object detection and recognition, tracking algorithms are designed to locate (and keep a steady watch on) a moving object (or many moving objects) over time in a video stream.

There's a location history of the object (tracking always handles frames in relationship to one another), which allows us to know how its position has changed over time. That means we have a model of the object's motion (hint: models can be used for prediction). A Kalman filter, a set of mathematical equations, can be used to determine the future location of an object. By using a series of measurements made over time, this algorithm provides a means of estimating past, present, and future states.

Certainly, state estimation is useful for tracking and in the case of our moving object, we’d like to predict the future states—an object’s next move before it even makes it. But why would we want to do that? Well, the object may get obstructed and if our ultimate goal is to maintain the identity of an object across frames, having an idea of the future location of the object helps us to handle cases of occlusion (when things get in the way).
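A minimal cv::KalmanFilter sketch for a constant-velocity 2D tracker is shown below; the noise covariances and the simulated measurements are arbitrary assumptions, since a real tracker would feed in detector output and tune these values:

#include <iostream>
#include <opencv2/opencv.hpp>

int main() {
    // State: [x, y, vx, vy]; measurement: [x, y] (e.g., the centroid of a detected blob).
    cv::KalmanFilter kf(4, 2, 0);

    kf.transitionMatrix = (cv::Mat_<float>(4, 4) <<
        1, 0, 1, 0,
        0, 1, 0, 1,
        0, 0, 1, 0,
        0, 0, 0, 1);                                   // constant-velocity motion model
    cv::setIdentity(kf.measurementMatrix);
    cv::setIdentity(kf.processNoiseCov, cv::Scalar::all(1e-3));
    cv::setIdentity(kf.measurementNoiseCov, cv::Scalar::all(1e-1));
    cv::setIdentity(kf.errorCovPost, cv::Scalar::all(1));

    cv::Mat measurement(2, 1, CV_32F);
    for (int frame = 0; frame < 100; ++frame) {
        // Predict where the object should be in this frame (useful during occlusion).
        cv::Mat prediction = kf.predict();

        // In a real application the measurement comes from a detector; here we fake
        // an object drifting diagonally across the frame.
        measurement.at<float>(0) = 10.0f + frame;
        measurement.at<float>(1) = 20.0f + frame;

        // Fold the measurement back in to refine the state estimate.
        cv::Mat estimate = kf.correct(measurement);
        std::cout << "frame " << frame << " estimate: "
                  << estimate.at<float>(0) << ", " << estimate.at<float>(1) << std::endl;
    }
    return 0;
}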

The Problem of Occlusion

Occlusion can be a problem when you’re trying to keep a close eye on something— that’s where an object gets temporarily blocked. Say we’ve been tracking a particular pedestrian on a busy city street and then they get blocked by a bus. A robust tracking algorithm can handle the temporary obstruction and maintain its lock on the person of interest. And, in fact, that’s the hard part—making sure the algorithm is locked onto the same thing so that tracking doesn’t get lost. Even though the pedestrian is no longer there in the image (the pixels of the bus conceal our pedestrian), the algorithm has an idea of the future path they may traverse. Therefore, we can continue to effectively follow the pedestrian despite the myriad of obstacles that may hide them from our view.

Table 3. Tracking techniques and functions from the OpenCV library.

Tracking techniques: Kalman filtering, CAMShift
Examples of functions/classes from the OpenCV library: cv::KalmanFilter class, TrackerMIL, TrackerTLD, TrackerMedianFlow

Distinct but not Mutually Exclusive Processes 

The process of object detection can notice that something (a subset of pixels that we refer to as an “object”) is even there, object recognition techniques can be used to know what that something is (to label an object as a specific thing such as bird) and object tracking can enable us to follow the path of a particular object.

Accurate definitions help us to see these processes as distinctly separate. Pairing the definitions with an understanding of the algorithms involved allows us to further see how these terms are not interchangeable: detection is not a synonym for recognition, and tracking is not just a mere extension of detection. If we know the outcome of detection (based on the true meaning of the word), we'd know that the goal of a detection algorithm is not to classify or identify a thing but simply to notice its presence. We'd also know that tracking algorithms such as a Kalman filter (which can determine future states of an object) are not mere extensions of something like background subtraction. And recognition is always about having previous knowledge of something, while detection is not.

We now know that what differentiates the terms is not just a matter of words but of process and outcomes (based on the goals and results of the algorithms involved). While distinct, one computer vision process is not better than another, and they can often be found working together to create more advanced or robust (reliable) applications; for example, detection and recognition algorithms paired together, or detection used as a backup for when tracking fails. Each process (object detection, object recognition, and object tracking) serves its own purpose, and they can complement one another. But we first need to know how to tell them apart if we are to eventually put them together (come up with some particularly clever combination of algorithms) to create useful and dependable computer vision applications.

Code Samples

To further explore some of the algorithms and techniques mentioned in this article, check out the code samples at the Intel IoT Developer Kit repository on GitHub*. The Face Access Control code sample makes use of both the FaceDetector and FaceRecognizer classes from the OpenCV library, and the Motion Heatmap sample is based on background subtraction (MOG).


Watch Fall 2017 Intel® VNF Next Virtual Conference Sessions on Intel Developer Zone


 

The Out of the Box Network Developers Meetup group hosted the first Intel® VNF Next Virtual Conference on Nov. 13, 2017. At this event, experts from companies including Intel, Huawei, Red Hat, and Ericsson delivered real-world virtual network function (VNF) keynotes, technical presentations, tutorials, and demos. The live conference is over, but you can still watch the sessions. Visit the VNF Next Virtual Conference site on the BeMyApp* platform.

The conference featured three tracks. If you want to follow a particular track, view the agenda to identify the sessions for that track.

Track 1: Platform Optimizations

Learn how modern-day COTS hardware is optimized to take advantage of the Intel® Xeon® Scalable Processor (code name Skylake-SP) for network function virtualization (NFV). Topics include the Intel Fast Track Kit for NFVi, Network Services, Benchmarking, and Intel Xeon Scalable Processor architecture and VNF optimizations.

Track 2: VNF Containerization, Orchestration, and Management

Here you will learn about deployment and how to overcome challenges that arise when moving network functions from bare metal to virtual network functions (VNFs). Topics cover enabling a performant and operationally efficient container-based network, using VPP and SR-IOV with Clear Containers, migration of legacy apps to cloud native, and more.

Track 3: 5G and IoT Use Case Support

Discover how 5G and IoT use cases can be supported by an agile network enabled by software-defined networking (SDN) and NFV. Sessions cover the impact of wireless networks on autonomous driving, developing software for multi-access edge computing, using the Intel® AVX-512 instruction set on Intel Xeon Scalable processors, an introduction to FlexRAN (a wireless access solution for 4G and 5G), and an overview of the Intel® 5G Mobile Trial Platform.

We think you'll find these sessions interesting and informative. Check them out!

About the Author

Debra Graham worked for many years as a developer of enterprise storage applications. Now she works to help developers optimize their use of Intel® technology through articles and videos on Intel Developer Zone.

Issue Fix: "Illegal Instructions" Error in Intel® IPP Functions


Summary:

Intel® IPP provides optimized code for the Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Advanced Vector Extensions 2 (Intel® AVX2), and Intel® Advanced Vector Extensions (Intel® AVX) instruction sets. To use these instruction sets, explicit operating system support is required to preserve the vector registers for applications.

Intel IPP functions dispatch processor-specific optimized code according to both the CPU type and the operating system support status. A dispatching problem was identified in recent Intel IPP releases and has been fixed in the Intel IPP 2018 Update 1.1 release. The problem only occurs on the following systems:

  • Using Intel IPP functions on an Intel® AVX-512 processor system whose OS does not enable Intel AVX-512 support. To check whether your OS enables Intel AVX-512 support, build and run the example code attached to this article.

  • Using Intel IPP 2018 cryptography functions on an Intel AVX processor system whose OS does not enable Intel AVX/Intel AVX2 support. Enabling Intel AVX/Intel AVX2 support requires:
        Microsoft Windows*: Windows* 7 with Service Pack 1 (SP1) and Windows Server* 2008 R2 with SP1, or a later version
        Linux*: kernel 2.6.30 (June 2009) or later.

Intel® IPP functions may report "illegal instruction" errors at run time on such systems.

Issue Fix:
A new version, Intel® IPP 2018 Update 1.1, is available and includes the fix for this problem. To get the package:

  1. Log in to the Intel® Software Development Products Registration Center as a registered customer.
  2. Select the “Intel® Integrated Performance Primitives” or “Intel® Integrated Performance Primitives Cryptography” product to download the package.

If your applications cannot update to Intel® IPP 2018 Update 1.1 or a later version of Intel IPP, the workaround is to use the following APIs to manually dispatch the optimized code for your target systems (see the sketch after this list):

  • ippSetCpuFeatures: Sets the processor-specific library for the Intel® IPP package.
  • ippcpSetCpuFeatures: Sets the processor-specific library for the Intel IPP cryptography library.
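As a rough sketch only (the feature-flag name and the choice of which bits to clear are assumptions to verify against your Intel IPP headers and documentation):

#include <ipp.h>       /* Intel IPP core: ippGetCpuFeatures, ippSetCpuFeatures */
#include <stdio.h>

int main(void)
{
    /* Query the CPU features IPP detected for this CPU/OS combination. */
    Ipp64u features = 0;
    Ipp32u cpuidRegs[4] = { 0 };
    ippGetCpuFeatures(&features, cpuidRegs);

    /* Workaround sketch: if the OS does not preserve the AVX-512 state, drop
       the AVX-512 foundation bit so an earlier (e.g., AVX2) code path is
       dispatched. Call this instead of ippInit(); the exact mask to clear
       depends on your target system. */
    IppStatus st = ippSetCpuFeatures(features & ~(Ipp64u)ippCPUID_AVX512F);
    printf("ippSetCpuFeatures returned %d\n", (int)st);
    return 0;
}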

If you need further help with this problem, please contact technical support via the Online Service Center website.

Intel® Resource Director Technology Extensions: Introducing the L2 Code and Data Prioritization Feature for Future Intel Atom® Processors


Introduction

This article introduces a new feature on future Intel Atom® processors to enable isolation and prioritization of code and data (individually) within the L2 cache. For specialized applications with high sensitivity to either code or data locality in the cache, the L2 Code and Data Prioritization (CDP) feature can provide performance, latency, or determinism advantages.

Intel® Resource Director Technology (Intel® RDT) Feature Set: Background

The L2 CDP feature is part of the Intel® Resource Director Technology (Intel® RDT) feature set, which provides a number of monitoring and control technologies to help software understand and control the usage of shared resources within the platform, such as last-level cache (LLC) and memory bandwidth. A set of technical articles and other resources on Intel® RDT are linked from the main landing page.

The Intel RDT feature set includes Cache Allocation Technology (CAT), which provides programmatic control over LLC data placement per application, virtual machine (VM), container, or thread, which can be used to isolate or differentiate the performance of key workloads, in particular in complex data center environments. More details on the CAT feature can be found in a series of technical articles, with the first linked here.

An extension of CAT is the CDP feature, which remaps CAT masks to enable separate masks for code and data. In prior processor generations the CDP feature has been available at the L3 cache level in the Intel® Xeon® processor family, starting with the Intel® Xeon® processor E5 v4 family. The new L2 CDP feature brings this capability to the Intel Atom processor line for the first time.

CAT and CDP Background

A number of current Intel Atom processors support CAT architecturally at the L2 cache level, providing software the ability to control usage of the cache, enabling isolation and prioritization across a wide variety of usage models.

The CAT feature relies on mapping each running thread into a Class of Service (CLOS or COS) via a per-thread MSR (the IA32_PQR_ASSOC MSR, or “PQR”), which can be context swapped by an enabled OS or Virtual Machine Manager (VMM) to differentiate between threads dynamically.

Figure 1 shows the flexible mapping possible between a thread and a CLOS.

Hardware thread map
Figure 1: Each hardware thread is assigned a Class of Service (CLOS), which is assigned by the OS/Virtual Machine Manager. The same or different CLOS can be used dynamically as needed to differentiate between threads, applications, virtual machines, or containers.

Each CLOS maps into a block of MSRs (IA32_L3_QOS_MASK_n, where “n” is the corresponding CLOS) to select the associated capacity mask for the currently running thread on a core, as shown in Figure 2.

Configuration of L3 capacity bitmasks
Figure 2: Configuration of the L3 Capacity Bitmasks (CBMs) per logical Class of Service (CLOS), via the IA32_L3_MASK_n block of MSRs, where n corresponds to a CLOS number.

The assigned cache bitmasks can then be used to control the amount of cache available to threads in a given CLOS, where set bits indicate the ability to utilize an assigned region of the cache.

The CDP feature extends CAT by remapping masks to enable separate control over code and data placement in the cache. Each CLOS is remapped to have two masks, one for code and one for data, and the total number of mask MSRs is effectively cut in half.

When CDP is enabled the existing mask space is re-indexed to provide separate control over code and data, as shown in Figure 3.

Code and Data Prioritization (CDP) mask details
Figure 3: Code and Data Prioritization (CDP) mask details: when enabled, one mask is provided for code, and one for data for each CLOS.

As shown in Figure 3, CDP provides separate control over code and data by enabling separate masks for code and data. With traditional Cache Allocation Technology enabled, Classes of Service map 1:1 with capacity bitmasks (CBMs). With CDP enabled the mapping is now 1:2 (each CLOS maps to two CBMs, one for code and one for data).

The new L2 CDP feature provides separate control over code and data at the L2 cache level. Enumeration, configuration, and usage details are provided in the next section.

L2 CDP Interface Details: Enumeration, Configuration, and Usage

L2 CDP is an extension of the L2 CAT feature. Once the presence of the L2 CAT feature has been confirmed (via CPUID.(EAX=07H, ECX=0):EBX.PQE[bit 15] and CPUID.(EAX=010H, ECX=0), see the Intel software developer’s manual for details) the L2 CDP feature can be enumerated.

The presence of the L2 CDP feature is enumerated via extensions to the L2 CAT CPUID sub-leaf, specifically CPUID.(EAX=10H, ECX=2):ECX.CDP[bit 2] (see Figure 4).

The CDP bit in the ECX register enumerates the presence of CDP
Figure 4: CPUID.010H.02H Sub-Leaves for L2 CAT and L2 CDP Details. The CDP bit above in the ECX register enumerates the presence of CDP, and the remainder of the fields are described in detail in the Intel software developer’s manual.

Most of the CPUID.(EAX=10H, ECX=2) sub-leaf data that applies to CAT also applies to CDP. However, CPUID.(EAX=10H, ECX=2):EDX.COS_MAX specifies the maximum COS applicable to CAT-only operation. For CDP operation, COS_MAX_CDP is equal to (CPUID.(EAX=10H, ECX=2):EDX.COS_MAX_CAT >> 1).
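On Linux* with GCC or Clang, the enumeration could be sketched as follows; the bit positions mirror the description above, but always cross-check against the Intel software developer's manual:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=07H, ECX=0):EBX.PQE[bit 15] indicates RDT allocation support. */
    if (!__get_cpuid_count(0x07, 0, &eax, &ebx, &ecx, &edx) || !(ebx & (1u << 15))) {
        printf("No RDT allocation (PQE) support\n");
        return 0;
    }

    /* CPUID.(EAX=10H, ECX=0):EBX bit 2 indicates the presence of the L2 CAT sub-leaf. */
    __get_cpuid_count(0x10, 0, &eax, &ebx, &ecx, &edx);
    if (!(ebx & (1u << 2))) {
        printf("No L2 CAT support\n");
        return 0;
    }

    /* CPUID.(EAX=10H, ECX=2):ECX.CDP[bit 2] enumerates L2 CDP on top of L2 CAT. */
    __get_cpuid_count(0x10, 2, &eax, &ebx, &ecx, &edx);
    unsigned int numClos = (edx & 0xFFFF) + 1;   /* EDX reports COS_MAX (zero-based) */
    printf("L2 CAT: %u CLOS, CDP %s\n",
           numClos, (ecx & (1u << 2)) ? "supported" : "not supported");
    return 0;
}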

If CPUID.(EAX=10H, ECX=2):ECX.CDP[bit 2] =1, the processor supports L2 CDP and provides a new MSR IA32_L2_QOS_CFG at address 0C82H. The layout of IA32_L2_QOS_CFG is shown in Figure 5. The bit field definitions of IA32_L2_QOS_CFG are:

  • Bit 0: L2 CDP Enable. If set, enables CDP, maps CAT mask MSRs into pairs of Data Mask and Code Mask MSRs. The maximum allowed value to write into IA32_PQR_ASSOC.COS is COS_MAX_CDP.
  • Bits 63:1: Reserved. Attempts to write to reserved bits result in a #GP(0).

Layout of IA32_L2_QOS_CFG
Figure 5: Layout of IA32_L2_QOS_CFG

IA32_L2_QOS_CFG default values are all 0s at RESET, and the mask MSRs are all 1s. Hence all logical processors are initialized in COS0 allocated with the entire L2 available and with CDP disabled, until software programs CAT and CDP. The IA32_L2_QOS_CFG MSR is defined at the same scope as the L2 cache, typically at the module level for Intel Atom processors, for instance. In processors with multiple modules present it is recommended to program the IA32_L2_QOS_CFG MSR consistently across all modules for simplicity.

Mapping between L2 CDP Masks and L2 CAT Masks

When CDP is enabled, the existing CAT mask MSR space is remapped to provide a code mask and a data mask per COS. This remapping is shown in Table 1; the same indexing pattern as the L3 CDP feature is used, but applied to the L2 MSR block (IA32_L2_QOS_MASK_n) instead of the L3 MSR block (IA32_L3_QOS_MASK_n).

Table 1. Re-Indexing of COS Numbers and Mapping to CAT/CDP Mask MSRs.

Mask MSR | CAT-only Operation | CDP Operation
IA32_L2_QOS_Mask_0 | COS0 | COS0.Data
IA32_L2_QOS_Mask_1 | COS1 | COS0.Code
IA32_L2_QOS_Mask_2 | COS2 | COS1.Data
IA32_L2_QOS_Mask_3 | COS3 | COS1.Code
IA32_L2_QOS_Mask_4 | COS4 | COS2.Data
IA32_L2_QOS_Mask_5 | COS5 | COS2.Code
.... | .... | ....
IA32_L2_QOS_Mask_’2n’ | COS’2n’ | COS’n’.Data
IA32_L2_QOS_Mask_’2n+1’ | COS’2n+1’ | COS’n’.Code

One can derive the MSR address of the data mask or code mask for a given COS number n as follows (a small sketch of this computation follows the list):

  • data_mask_address(n) = base + (n << 1), where base is the address of IA32_L2_QOS_MASK_0.
  • code_mask_address(n) = base + (n << 1) + 1.
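
The sketch below illustrates this computation, assuming (per the Intel software developer’s manual) that IA32_L2_QOS_MASK_0 sits at address 0D10H; the printed values can be checked against Table 1.

# Data/code mask MSR addresses for a given CDP COS number n.
# 0xD10 is assumed to be IA32_L2_QOS_MASK_0, per the Intel SDM.
L2_MASK_BASE = 0xD10

def data_mask_address(n):
    return L2_MASK_BASE + (n << 1)

def code_mask_address(n):
    return L2_MASK_BASE + (n << 1) + 1

print(hex(data_mask_address(2)))  # 0xd14 -> COS2.Data (matches Table 1)
print(hex(code_mask_address(2)))  # 0xd15 -> COS2.Code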

As with L3 CDP, when L2 CDP is enabled, each COS is mapped 1:2 with mask MSRs, with one mask providing programmatic control over code fill location and the other over data fill location. A variety of overlapped and isolated mask configurations are possible (see the example in Figure 3).

Mask MSR field definitions for L2 CDP remain the same as for L2 CAT. Capacity masks must be formed of contiguous set bits with a length of 1 bit or longer, and should not exceed the maximum mask length specified in CPUID. For example, valid masks on a cache with max bitmask length of 16b (from CPUID) include 0xFFFF, 0xFF00, 0x00FF, 0x00F0, 0x0001, 0x0003 and so on. Maximum valid mask lengths are unchanged whether CDP is enabled or disabled, and writes of invalid mask values may lead to undefined behavior. Writes to reserved bits will generate #GP(0).
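
As an aside, the contiguity rule is easy to check programmatically; the following is a small helper sketch (the 16-bit maximum length is just the example value from the text above, not a fixed property of the hardware).

# Validate a capacity bitmask: non-zero, contiguous set bits, within the max length.
def is_valid_cbm(mask, max_len=16):
    if mask == 0 or mask >= (1 << max_len):
        return False
    lowest_bit_index = (mask & -mask).bit_length() - 1
    m = mask >> lowest_bit_index          # strip trailing zero bits
    return (m & (m + 1)) == 0             # contiguous iff m+1 is a power of two

print(is_valid_cbm(0x00F0))  # True  (contiguous run of bits)
print(is_valid_cbm(0x00F1))  # False (gap between set bits)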

L2 CDP Software Considerations

Before enabling or disabling L2 CDP, software should write all 1s to all of the corresponding CAT/CDP masks to ensure proper behavior (for example, the IA32_L2_QOS_Mask_n set of MSRs for the L2 CAT feature). When enabling CDP, software should also ensure that only COS numbers that are valid in CDP operation are used; otherwise undefined behavior may result. For instance, in a case with 16 CAT COS, since Classes of Service are reduced by half when CDP is enabled, software should ensure that only COS 0‒7 are in use before enabling CDP (along with writing 1s to all mask bits before enabling or disabling CDP).

Software should also account for the fact that mask interpretations change when CDP is enabled or disabled, meaning for instance that a CAT mask for a given COS may become a code mask for a different CLOS when CDP is enabled. In order to simplify this behavior and prevent unintended remapping, software should consider resetting all threads to COS[0] before enabling or disabling CDP.

The L2 mask MSRs are scoped at the same level as the L2 cache (similarly, the L3 mask MSRs are scoped at the same level as the L3 cache).

Configuration and Usage

After verifying the presence of L2 CAT and L2 CDP (via CPUID), the mask MSRs may be set to all 1s, the L2 CDP feature may be enabled (via the IA32_L2_QOS_CFG MSR), and the masks may be further configured into code/data masks. Key threads, applications, VMs, and containers of interest may then be assigned into Classes of Service as needed, with per-thread hardware and software associations maintained by the OS or VMM at context-swap time via the IA32_PQR_ASSOC MSR. A minimal end-to-end sketch of this flow is shown below.
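
The following is a minimal, hypothetical sketch of that flow from Linux user space using the msr driver (/dev/cpu/*/msr; requires root and modprobe msr). The MSR addresses follow the Intel software developer’s manual (IA32_L2_QOS_CFG = 0C82H as above, IA32_L2_QOS_MASK_0 = 0D10H, IA32_PQR_ASSOC = 0C8FH), and the mask values are illustrative only; an OS or VMM would normally manage IA32_PQR_ASSOC at context-swap time.

import os, struct

IA32_L2_QOS_CFG   = 0xC82
IA32_L2_QOS_MASK0 = 0xD10   # assumed base of the L2 mask MSR block (per the SDM)
IA32_PQR_ASSOC    = 0xC8F

def rdmsr(cpu, msr):
    fd = os.open('/dev/cpu/{}/msr'.format(cpu), os.O_RDONLY)
    try:
        return struct.unpack('<Q', os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)

def wrmsr(cpu, msr, value):
    fd = os.open('/dev/cpu/{}/msr'.format(cpu), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack('<Q', value), msr)
    finally:
        os.close(fd)

cpu = 0

# 1. Write all 1s to the mask MSRs before toggling CDP (8-bit masks assumed here).
for i in range(8):
    wrmsr(cpu, IA32_L2_QOS_MASK0 + i, 0xFF)

# 2. Enable L2 CDP: set bit 0 of IA32_L2_QOS_CFG, preserving the other bits.
wrmsr(cpu, IA32_L2_QOS_CFG, rdmsr(cpu, IA32_L2_QOS_CFG) | 0x1)

# 3. Give COS1 separate (and here, isolated) data and code masks.
wrmsr(cpu, IA32_L2_QOS_MASK0 + (1 << 1),     0x0F)  # COS1.Data: ways 0-3
wrmsr(cpu, IA32_L2_QOS_MASK0 + (1 << 1) + 1, 0x30)  # COS1.Code: ways 4-5

# 4. Associate this logical processor with COS1 (COS lives in bits 63:32 of PQR).
pqr = rdmsr(cpu, IA32_PQR_ASSOC) & 0xFFFFFFFF
wrmsr(cpu, IA32_PQR_ASSOC, pqr | (1 << 32))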

Conclusion

The new L2 CDP feature enables programmatic control over code and data placement in the L2 cache for future Intel Atom processors. This new capability builds on the L2 CAT feature on certain Intel Atom processors, enabling new capabilities and advanced platform tuning opportunities for uses including industrial, motion control, communications, networking, digital signage and the Internet of Things.

While this article provides an early overview of the features and technical details on the enumeration and interfaces, software enabling and support are planned through Intel’s Software and Services Group, including the Open Source Technology Center (OTC), and enabling patch links will become available in the near future.

Hands-On AI Part 24: TensorFlow* Serving for AI API and Web App Deployment


A Tutorial Series for Software Developers, Data Scientists, and Data Center Managers

Welcome to the final article in the hands-on AI tutorial series. We have been building a movie-generation demo app with powerful AI capabilities, and now we are about to finish it. You can skim the first overview article in the series as a refresher. To recall the key components of the app, you can check the article about project planning.

This final article is devoted to deployment issues and provides step-by-step instructions on how to deploy:

  • Web API services
    • Emotion recognition (image processing)
    • Music generation
  • A web app server with a slick user interface (UI)

Each set of instructions can be performed independently. All the sources are available here and here (for emotion recognition and image processing) and here (pretrained models for music generation).

Once you have finished, you will have a fully functional app. For illustration purposes, the start page of the app is shown below.

App Overview

The app architecture consists of four parts (see the diagram below):

  • Web app server
  • Web client (UI)
    • Photo uploading view
    • Slideshow view, which will be sharable
  • Remote image processing service
  • Remote music generation service

The web app is based on the lightweight Flask* framework. The majority of the front-end functionality is implemented using JavaScript* (Dropzone.js* for drag-and-drop file upload), and the music (re)play is based on MIDI.js*.

The AI API for emotion recognition is served using a combination of Flask and TensorFlow* Serving on Microsoft Azure*, and the AI API for computer music generation is also a containerized application on Microsoft Azure. We created two independent containers for the image and music parts, following Docker’s one-container-per-process philosophy.

Clone the project git repo as follows (for emotion recognition and music generation, some files are hosted on Dropbox due to the size of the corresponding archives, for example, pre-trained models; we provide the links to them inline in the corresponding sections):

git clone https://github.com/datamonsters/intel-ai-developer-journey.git

We provide complete deployment instructions for the AI components of our app, because the tutorial series focuses on that topic. We briefly cover the web app deployment process, including showing how to start a web app locally (http://localhost:5000) but not in the cloud. The details of web app deployment are straightforward and are already covered elsewhere, including in these tutorials: Deploying a Flask Application to AWS Elastic Beanstalk and How to Deploy a Flask Application on an Ubuntu* VPS.

Remote Image Processing API

In the previous articles on image processing, you learned how to (pre)process an image dataset and how to define and train a CNN model using Keras* and TensorFlow. In this article, you learn how to take a trained Keras model and deploy it in a Microsoft Azure cloud as a simple web service with REST API using TensorFlow Serving and Flask.

All the materials that are used for the emotion recognition deployment process are here.

Cloud instance

The first step of the deployment process is to choose the platform on which we want to deploy. The two main options are cloud instance and in-house machine. The pros and cons of each are discussed in Overview of Computing Infrastructure. Here we use a cloud-based approach and deploy the model in a Microsoft Azure cloud. Microsoft provides new users with a USD 200 free trial subscription, which is enough for a couple of weeks of 24/7 work of a small-size CPU instance.

The following step-by-step instructions show how to launch an appropriate instance in the cloud.

  1. Go to the Azure home page, and then create an account (skip this step if you already have an account). During the registration process you’ll be asked to provide a credit card number, but you won’t be charged for the free trial subscription.
  2. After creating your account, go to the Azure portal, which is the main site for this cloud service. Here you can manage all of your resources, including virtual machines (cloud instances). Let’s create a virtual machine by clicking Virtual machines on the left panel. We want to add a new virtual machine that will run Ubuntu Server; this lightweight Linux* Server distribution is a good choice for deploying with TensorFlow Serving. In particular, we want Ubuntu Server 16.04 LTS. LTS stands for Long-Term Support, which means that this version is stable and will be supported by Canonical for five years.

  3. To start the process of machine configuration, click Create.
  4. On the Create virtual machine screen, make sure Subscription is set to Free Trial. For the other settings, we recommend you use the ones shown in the following screenshot.

  5. Next, you need to choose the size of the machine, which determines the hardware that runs your instance. Choose the small CPU instance DS12_V2 with four Intel® Xeon® processor cores (vCPUs), 28 GB of RAM, and a 56 GB SSD. This should be enough for our small-size deployment. In some cases, you may need to change the location to see the desired machine size available. For example, East US worked for one tutorial test reader, while West US worked for another.


    The Settings and Summary sections are not applicable for us at this time and can be skipped.

    After configuring the machine, the deployment process starts, which might take several minutes. When the process completes, you should see the new instance running when you go to the Azure portal.

    By default no ports are open on the machine, except for port 22 for the SSH connection to the instance. Since we are building a web service, we need to open a few more ports so that the machine is reachable from the Internet.

  6. To edit the network settings, click Emo-nsg (which is Network Security Group) in the All resources panel.
  7. In the Emo-nsg Settings tab, click Inbound security rules.

    Add two inbound rules: the first is for the Jupyter Notebook* server running on port 8888, and the second is for our Flask web service running on port 9000.

  8. The virtual machine is ready. Click the Running Instance icon (see screenshot shown earlier in this section) to see its running state and IP address to connect.

Setting up the Docker* environment

We just launched a cloud machine running pristine Ubuntu Server 16.04 LTS. Installing all the dependencies there from scratch would take a considerable amount of time. Fortunately, the Docker* container technology is available, which lets you wrap all the dependencies into a single image that can be deployed quickly on any machine. See our article on Docker for more details.

To benefit from container technology, we first need to install the Docker engine. But before that, you need to download the archive with the materials to your laptop.

  1. Copy the archive with the materials to the cloud instance over SSH using the scp command.
  2. Connect to the cloud virtual machine via SSH. It should have the archive, which we just copied in the home folder.
  3. Unarchive it. All the scripts and code needed for the deployment are now in the home directory. You might have a .zip archive instead; if so, just use the unzip command. To install unzip, run sudo apt install unzip, then unzip <name_of_archive.zip>.

  4. Install the Docker engine using the install_docker.sh script (run: sudo ./install_docker.sh).

    Under the hood it just repeats the commands from the official tutorial.

  5. Build the Docker image called “emotions” from the Dockerfile, which is a kind of manifest for the system that will run inside the Docker container.

    (run: sudo docker build -f Dockerfile -t emotions .)

    The same can be done with the build_image.sh script (run: sudo ./build_image.sh). The building process might take a while (about 5‒10 minutes). After that you should be able to see the newly built image.

  6. Launch the Docker container from the just-built emotions image. Here we run it with several options. We want to map the instance home folder /home/johndoe into the container’s /root/shared folder, which effectively works like a shared folder. We also want to forward all requests addressed to the instance ports 8888 (Jupyter) and 9000 (Flask) to the same ports inside the container. This allows all the servers and services to run inside the Docker container while remaining accessible from the Internet.

    Using the -d option (which means detach) runs the container in the background.

    sudo docker run -d -v /home/johndoe:/root/shared -p 8888:8888 -p 9000:9000 emotions

Prepare the Keras model

We have our Docker container running on the cloud CPU instance. We also saved the Keras model for emotion recognition from images earlier (see the basic and advanced articles on CNNs). The next step is to convert the Keras model into a format suitable for TensorFlow Serving.

  1. Go inside the Docker container using the exec command, the container ID or Names (in my case it’s 88178b94f61c or fervent_lamarr, which can be seen from the sudo docker ps -a command), and the name of the program to run (/bin/bash in this case, which is the usual shell).

    sudo docker exec -it 88178b94f61c /bin/bash

  2. Go to the deployment folder, which contains the scripts and tools for deployment.
  3. Run the serve_model.py Python* script, which converts the Keras model into a suitable TF Serving format. First we convert the baseline model and assign it to version 1.

    python serve_model.py --model-path ../models/baseline.model --model-version 1

    Next we convert the advanced model and set its version to 2. TensorFlow Serving handles the versioning automatically.

    python serve_model.py --model-path ../models/pretrained_full.model --model-version 2

    The core part of this script loads the Keras model, builds information about the input and output tensors, prepares the signature for the prediction function, and finally compiles these pieces into a meta-graph that is saved and can be fed into TensorFlow Serving. A sketch of this conversion step is shown after this list.
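
The snippet below is a rough sketch of what such a conversion step can look like with the TensorFlow 1.x SavedModel API; it is not the project’s serve_model.py, and the export path, signature names ('images', 'emotions'), and helper name are illustrative assumptions.

# Hypothetical sketch: export a trained Keras model for TensorFlow Serving (TF 1.x API).
import tensorflow as tf
from keras import backend as K
from keras.models import load_model

def export_for_serving(model_path, export_base, version):
    K.set_learning_phase(0)                              # inference mode
    model = load_model(model_path)                       # trained Keras model
    builder = tf.saved_model.builder.SavedModelBuilder(
        '{}/{}'.format(export_base, version))            # TF Serving picks up version folders
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'images': model.input},
        outputs={'emotions': model.output})
    builder.add_meta_graph_and_variables(
        sess=K.get_session(),
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
    builder.save()

export_for_serving('../models/baseline.model', '../models/exported', 1)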

TensorFlow Serving server

Now we’re ready to launch the TensorFlow Serving server. Serving is a set of tools that allows you to easily deploy TensorFlow models into production.

  1. Start the TensorFlow Serving server.

    tensorflow_model_server --port=9001 --enable-batching=true --model_name=emotions --model_base_path=../models &> emotions.log &

    The core parameters to specify are the port on which TensorFlow Serving will listen, the model name, and the model base path. We also want to run the process in the background and store the logs in a separate emotions.log file. The same thing can be done with the serving_server.sh script.

  2. Now the TensorFlow Serving server is running. Let’s test it using a simple client without a web interface. It takes the image from the specified path and sends it to the running TensorFlow Serving server. This functionality is implemented in the serving_client.py Python script.

    python serving_client.py --image-path ../Yoga3.jpg

    It works!

Flask server

The final step is to build a web service on top of TensorFlow* Serving. Here we use Flask as a back-end and build a simple API using the REST protocol.

  1. Run the flask_server.py Python script. It launches the Flask server, which transforms the corresponding POST requests into properly formed requests to TensorFlow Serving. We run this script in the background and store the logs in the flask.log file.

    python flask_server.py --port 9000 &> flask.log &

    The main idea of the code is to define a so-called “route” that redirects all POST queries sent to the /predict endpoint of the Flask server to a predefined predict function; a minimal sketch of such a route is shown after this list.

  2. Let’s test our Flask server now. First, from inside the Docker container:

    curl '127.0.0.1:9000/predict' -X POST -F "data=@../Yoga3.jpg"

    Then from outside the container but on the same cloud instance:

    curl '127.0.0.1:9000/predict' -X POST -F "data=@./Yoga3.jpg"

    And finally from the external network, that is, from our laptop. It works.

    Finally, we have a web service that exposes a REST API and can be accessed through a usual POST request with the appropriate fields. You can take a look at the API description here.
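
The following minimal sketch illustrates such a route. It is not the project’s flask_server.py: the predict_emotion() helper is a placeholder for the call into TensorFlow Serving, and the JSON response shape is an assumption.

# Minimal Flask sketch of the /predict route (illustrative only).
from flask import Flask, request, jsonify

app = Flask(__name__)

def predict_emotion(image_bytes):
    # Placeholder: the real helper would build a prediction request and send it
    # to the TensorFlow Serving instance listening on port 9001 in the container.
    return 'happy'

@app.route('/predict', methods=['POST'])
def predict():
    image_bytes = request.files['data'].read()   # 'data' matches the curl examples above
    return jsonify({'emotion': predict_emotion(image_bytes)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=9000)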

Remote Music Generation API

Prerequisites

Create one more virtual machine using steps 3‒7 in the “Cloud instance” subsection of the “Remote Image Processing API” section, and then connect to it via SSH. All the following steps in this section happen within this VM unless otherwise stated.

Make sure that you have Python 3 set up, following the instructions if necessary. To install additional packages, the pip3 utility is required; pip3 has shipped with Python 3 since version 3.4. The music21 package is needed for the music transformations. To install it, just run the following in the console:

pip3 install music21

and pip3 will do all the work.

Installing and setting up the emotion transformation part

You can find our ideas about emotional-based transformations in music here.

  1. Music-related files are placed in the music subfolder of the cloned repo. If you cloned the repo on your local machine, copy this folder to your target machine using scp -r music user@host_ip:/home/user. No copying is needed if you run git clone directly on the machine where you plan to deploy the music part.

The Music folder contains the following subfolders:

  • base_melodies. Contains the source base melodies
  • base_modulation. Contains the necessary Python scripts for melody emotion modulation
    • emotransform.py. Performs melody transformation
    • web-server.py. Wraps the transformation script in RESTful API and provides the basic http-server
  • transform_examples. Contains examples of already transformed melodies

To run the web server in your system environment, adjustments in the web-server.py file are required: 

  1. Change the HOST variable in the header of the file to the IP address of the machine on which you deploy the musical part of the application. Example (be sure to include the single quotation marks): HOST='192.168.1.45'

    You can set it to ‘127.0.0.1’, but only do this if you intend to run the whole system (all parts of the application) on a single machine. The HOST value will be used as part of the URL for the file transfer procedure, so it must be visible and correct for your network.

    Use your OS tools or the ifconfig utility to retrieve the IP address of your machine.

    You can leave the PORT variable as it is, but if problems occur when starting the server and you see an error like [Address already in use] in the console, try setting it to a different value.

    Example:

    PORT=8082

    NOTE: The IP address and PORT must be in sync with the main app, since the main app sends requests to and receives responses from this IP, PORT pair.

Installing and setting up BachBot*

  1. Assuming that you have Docker already installed (if not, refer to the “Remote Image Processing API” section), pull the image: docker pull fliang/bachbot:aibtb
  2. Launch a new Docker container from the image: docker run -d -v /home/johndoe:/root/shared --name bachbot -it fliang/bachbot:cornell

    You should see something like this:

  3. Check that Torch in the Docker image works well on your system. Type the following commands in the console, which will invoke the Torch interactive shell:

    sudo docker ps
    sudo docker exec -it YOUR_CONTAINERID /bin/bash
    th

    and then:

    torch.rand(5,3)*torch.rand(3,4)

    If it runs without problems, exit the shell with the exit command and go to the next step. Otherwise, please refer to the Troubleshooting section.

  4. To avoid the training process, you can download a pretrained model to your local machine, and then copy it to the VM with the following command: scp pretrained_music_model.tar.gz johndoe@ip:~/

    Place the content of this archive into the /root/bachbot/ folder of the BachBot Docker image, or just use the Docker shared folder as you did with the image part:

    cp ../shared/pretrained_music_model.tar.gz pretrained_music_model.tar.gz

    Finally, unarchive it with

    tar -xvf pretrained_music_model.tar.gz

    This directory should contain six files:

    • checkpoint_5100.json 
    • checkpoint_5100.t7 
    • checkpoint_5200.json 
    • checkpoint_5200.t7 
    • checkpoint_5300.json
    • checkpoint_5300.t7
  5. Exit from the Docker container with the exit command.

For more details related to BachBot, please follow the official GitHub* repo.

Starting the music generation web server

To start the web server for the musical part, type in the console:

cd [working_directory]/intel-ai-developer-journey/music/base_modulation/
sudo python3 web-server.py

Where [working_directory] is the name of your working directory from the “Installing and setting up the emotion transformation part” section.

You should see output like this:

That’s it! The music generation service is up and running. Finally, we have a web service that can be accessed with a usual POST request. You can take a look at the Systems APIs listed below.

Web App

Web server

The tutorial Flask app needs methods for each view plus one method for uploading images. Since the index page provides only the upload form, let's implement the slideshow generation logic right in the show method. The show method calls two remote APIs to extract emotions and generate MIDI files with music (in the code below, the API calls are stubbed with placeholders for clarity of presentation).

import os
from flask import Flask, render_template

app = Flask(__name__)
basedir = os.path.abspath(os.path.dirname(__file__))

@app.route('/show/<path:session>')
def show_page(session):
    # Check whether music has already been generated for each of the five images.
    for x in range(1, 6):
        music_path = os.path.join(basedir, 'upload/' + session + '/' + str(x) + '.mid')
        if not os.path.exists(music_path):
            emotion = get_emotion(os.path.join(basedir, 'upload/' + session + '/' + str(x) + '.jpg'))
            generate_music(emotion, session, x)
    return render_template('show.html', session_name=session)

def generate_music(emotion, session, num):
    # Remote API call to the music generation service (stubbed).
    pass

def get_emotion(file_path):
    # Remote API call to the emotion recognition service (stubbed).
    return 'happy'

You can get the full source code of the web app here in the /slideshow folder.
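
As a hint, one possible way to fill in the get_emotion() stub is to POST the image to the emotion recognition service from the previous section. The address below is a placeholder for your own instance, and the JSON response shape is an assumption that must match your Flask server.

# Illustrative only: call the remote emotion recognition API with requests.
import requests

EMOTION_API = 'http://<image-instance-ip>:9000/predict'   # placeholder address

def get_emotion(file_path):
    with open(file_path, 'rb') as f:
        resp = requests.post(EMOTION_API, files={'data': f})
    return resp.json()['emotion']   # assumed response shape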

Web client

On the client side, you need to implement a slideshow with playing MIDI. Slideshow implementations can be found on CodePen. Here is one working version.

Web app deployment

To launch the server (using Python 2), type:

cd intel-slideshow-music/slideshow
pip install -r requirements.txt
pip install requests
export FLASK_APP=app
flask run

Then in your browser, open http://localhost:5000/slideshow-music. You should see a web app running locally and accessing remote API services (you will have to change the IP addresses of the remote AI APIs for emotion recognition and music generation as defined in the sections above).

Congratulations! All of the parts of the Slideshow Music project are now complete, and we successfully put them together. The app should look just like this example of a live version.

Conclusion

In this article, we covered the deployment and integration aspects of the AI app development process. We used the lightweight Flask framework, MIDI.js, and Dropzone.js for the front end and web app server. For the back end we:

  • Walked through the process of deploying the Keras model in the cloud using Microsoft Azure, Docker, the TensorFlow Serving library, and a Flask web server. The result is a web service that exposes a REST API and can be accessed through a usual POST request with the appropriate fields.
  • Launched the emotion-modulation part with BachBot’s harmonization model based on Torch, Docker, and a simple Python web server.

Everyone has photos that evoke emotions, and those emotions can now be shared by means of music. Enjoy your deep learning app, and don’t forget to stop your virtual machines before the free trial subscription runs out!


System APIs

This section describes the APIs that were implemented for the demo. The system itself is made of three components: emotion recognition (images), music generation, and the user interface. In turn, the music generation component contains two subcomponents: the adjustment of the base song toward the emotion, and the computer-assisted music generation.


Emotion Recognition

emotion_recognition_model train(images, emotions)

Given an annotated collection of images (for each image we know the emotion present in it; only one emotion per image is used in this demo), train an emotion recognition model.

emotion predict(image, emotion_recognition_model)

Given a trained emotion recognition model and a new image, assign probabilities to each of the emotion classes and select the most probable emotion. For example, for an image of a beach, the model could very likely predict the following distribution:

{
  "Anxiety": 0.01,
  "Sadness": 0.01,
  "Awe": 0.2,
  "Determination": 0.05,
  "Joy": 0.3,
  "Tranquility": 0.4
}

And the most probable emotion is

{
  "Tranquility": 0.4
}

Tranquility is then the output of the image processing API.
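
Picking the most probable emotion from such a distribution is a one-liner; a tiny sketch:

# Select the most probable emotion from the returned distribution.
probs = {"Anxiety": 0.01, "Sadness": 0.01, "Awe": 0.2,
         "Determination": 0.05, "Joy": 0.3, "Tranquility": 0.4}
print(max(probs, key=probs.get))  # Tranquility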

Music Generation

base_song_modulated modulate(base_song, emotion)

Given a .MIDI file with a base song (for example, “Happy Birthday to You…”) and an emotion, this method adjusts the scale, tonality, and tempo of the base song to fit the emotion. We call this process emotion-based modulation. For example, if the emotion is “sad,” the music will be in a minor key, quieter, and slower than when the emotion is “joy” or “determination.”
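
As a toy illustration (not the project’s emotransform.py), a modulation step with music21 might transpose the melody and slow the tempo for a sad emotion; the interval and tempo values below are arbitrary assumptions.

# Toy emotion-based modulation sketch using music21.
from music21 import converter, tempo

def modulate(midi_path, emotion, out_path):
    score = converter.parse(midi_path)
    if emotion == 'Sadness':
        score = score.transpose(-3)                       # shift toward a darker register
        score.insert(0, tempo.MetronomeMark(number=70))   # slower tempo
    score.write('midi', fp=out_path)
    return out_path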

music_generation_model train(songs)

Given a collection of songs in .MIDI format, train a sequence model that can predict a .MIDI note for the prefix of .MIDI notes. The model captures transition probabilities between .MIDI notes.

computer_generated_song generate_song(modulated_base_song, music_generation_model)

Given an emotion-modulated base song, which serves as a seed for the computerized music generation process, and a model trained to generate music, produce a sequence of new .MIDI notes. For example, we can seed the generative process with the “Happy Birthday to You…” song modulated with the “Joy” emotion and let the trained generative model complete it into a full song.

User Interface

bool upload_image(image)

Uploads an image and returns a status upon completion (True for success, False for failure). Used as part of the submission form.

int select_base_song(base_songs)

Select a song from a list of base songs to be modulated based on the emotions from uploaded images. Return the index of the selected base song. Used as part of the submission form.

bool play(movie)

Plays the movie made of the uploaded images with the computer-generated song in the background.

Appendix: Troubleshooting

If you experience problems with the Torch setup in the Docker image for music generation (for example, the learning procedure doesn’t start, or th fails to run any computation), update the Torch binaries within the Docker image. One way to do this is to follow the initial setup process and then run some additional commands at the end:

cd /root/torch
./clean.sh
bash install-deps    # this may take a significant amount of time
./install.sh
luarocks install hdf5
luarocks install luautf8

Then, run the th interactive shell and make sure that the following line works without any problems:

th

and then:

torch.rand(5,3)*torch.rand(3,4)

After that you can return to the main storyline.


Intel® Advisor issue with resolving debug symbols


Problem:

An issue was discovered in Intel® Advisor where the tool takes a long time to resolve debug symbols. During the finalization phase, Intel Advisor does not progress past the message:

“Processing information for instruction mix report”.

Solution:

The issue is fixed in Intel® Advisor 2018 Update 1.

To work around the issue in earlier releases:

  1. Edit the file:

    Linux* <advisor-install-dir>/config/collector/include/common.xsl

    Windows* <advisor-install-dir>\config\collector\include\common.xsl

  2. Remove or comment out the XML block containing:

    <transformation name="Instruction Mix" boolean:deferred="true">

  3. Re-run the Survey analysis.