
Case Study: How Yahoo! JAPAN Used Open vSwitch* with DPDK to Accelerate L7 Performance in a Large-Scale Deployment


View PDF [783 KB]

As cloud architects and developers know, it can be incredibly challenging to keep up with the rapidly increasing cloud infrastructure demands of both users and services. Many cloud providers are looking for proven, effective ways to improve network performance. This case study discusses one such collaborative project between Yahoo! JAPAN and Intel, in which Yahoo! JAPAN implemented Open vSwitch* (OvS) with the Data Plane Development Kit (OvS with DPDK) to deliver up to a 2x L7 performance improvement for practical cloud applications while successfully completing a large-scale deployment of more than 500 nodes.

Introduction to Yahoo! JAPAN

Yahoo! JAPAN is a Japanese Internet company that was originally formed as a joint venture between Yahoo! Inc. and Softbank. The Yahoo! JAPAN portal is one of the most frequently visited websites in Japan, and its many services have been running on OpenStack* Private Cloud since 2012. Yahoo! JAPAN receives over 69 billion monthly page views, of which more than 39 billion come from smartphones alone. Yahoo! JAPAN also has over 380 million total app downloads, and it currently runs more than 100 services.

Network Performance Challenges

As a result of rapid cloud expansion, Yahoo! JAPAN began observing network bottlenecks in its environment in 2015. At that time, both cloud resources and users were doubling year over year, causing a rapid increase in virtual machine (VM) density. Yahoo! JAPAN was also seeing huge spikes and bursts in network traffic whenever breaking news, weather updates, or public service announcements (related to an earthquake, for example) occurred. This dynamic put an additional burden on the network environment.

As these network performance challenges arose, Yahoo! JAPAN began experiencing some difficulties meeting service-level agreements (SLAs) for its many services. Engineers from the network infrastructure team at Yahoo! JAPAN noticed that noisy VMs (also known as “noisy neighbors”) were disrupting the network environment.

When that phenomenon occurs, a rogue VM may monopolize bandwidth, disk I/O, CPU, and other resources, which then impacts other VMs and applications in the environment.

Yahoo! JAPAN also noticed that the compute nodes were processing a large volume of short packets and that the network was handling a very heavy load (see Figure 1). Consequently, decreased network performance was affecting the SLAs.

Figure 1. A compute node showing a potential network bottleneck in a virtual switch.

Yahoo! JAPAN determined that its cloud infrastructure required a higher level of network performance in order to meet its application requirements and SLAs. In the course of its research, Yahoo! JAPAN noticed that the Linux* Bridge overrun counter was increasing, which indicated that the cause of its network difficulties lay in the kernel network stack. As a result, the company decided it needed a new solution going forward.

About OvS with DPDK

OvS with DPDK is a potential solution to such network performance issues for cloud environments already running OpenStack, which uses OvS as its virtual switch. Native OvS uses kernel space for packet forwarding, which imposes a performance overhead and can limit network performance. DPDK, however, accelerates packet forwarding by bypassing the kernel.

DPDK integration with OvS offers other beneficial performance enhancements as well. For example, DPDK’s Poll Mode Driver eliminates context switch overhead. DPDK also uses direct user memory access to and from the NIC to eliminate kernel-user memory copy overhead. Both optimizations can greatly boost network performance. Overall, DPDK maintains compatibility with OvS while accelerating packet forwarding performance. Refer to Intel Developer Zone’s article, Open vSwitch with DPDK Overview, for more information.
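To make the polling model concrete, below is a minimal sketch, written in C++ against DPDK's C API, of the receive loop a poll-mode thread runs. The port and queue IDs are placeholders, and exact integer types vary slightly between DPDK releases:

```cpp
#include <rte_ethdev.h>
#include <rte_mbuf.h>

// Minimal sketch of a DPDK poll-mode receive loop (hypothetical port/queue 0).
// Instead of waiting for interrupts, the thread busy-polls the NIC RX ring,
// so no per-packet interrupt or context switch is required.
static void rx_poll_loop(uint16_t port_id) {
    constexpr uint16_t BURST = 32;
    rte_mbuf *bufs[BURST];

    for (;;) {
        // Fetch up to BURST packets directly from the NIC RX ring.
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, 0, bufs, BURST);
        for (uint16_t i = 0; i < nb_rx; ++i) {
            // ... process bufs[i] (parse headers, forward, etc.) ...
            rte_pktmbuf_free(bufs[i]);  // return the buffer to its mempool
        }
    }
}
```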

Collaboration between Intel and Yahoo! JAPAN

As Yahoo! JAPAN was encountering these network performance issues, Intel suggested that the company consider OvS with DPDK, since it had become possible to use the two technologies in combination. Yahoo! JAPAN was already aware that DPDK offered network performance benefits for a variety of telecommunications use cases but, as a web-based company, had assumed it would not be able to take advantage of that particular solution. After discussing the project with Intel and learning how the technologies could work for a cloud service provider, Yahoo! JAPAN decided to try OvS with DPDK in its OpenStack environment.

To deploy OvS with DPDK for optimal performance, Yahoo! JAPAN enabled 1 GB hugepages. This step was important from a performance perspective because it reduced Translation Lookaside Buffer (TLB) misses and prevented page faults. The company also paid special attention to its CPU affinity design, carefully identifying ideal resource settings for each function. Without that step, Yahoo! JAPAN would not have been able to ensure stable network performance.
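As an illustration of those two tuning points, the following hypothetical sketch shows DPDK EAL initialization with hugepage-backed memory per NUMA socket and an explicit core list. OvS passes equivalent options through its own configuration, 1 GB hugepages must themselves be reserved via kernel boot parameters, and the core IDs and sizes below are placeholders, not Yahoo! JAPAN's settings:

```cpp
#include <rte_eal.h>
#include <cstdlib>

int main(int, char **) {
    // Hypothetical EAL arguments: pin EAL threads to specific cores and
    // request hugepage-backed memory (in MB) from each NUMA socket.
    const char *eal_args[] = {
        "ovs-dpdk-sketch",
        "-l", "1,2",                  // core list for poll-mode threads
        "--socket-mem", "1024,1024",  // hugepage memory per socket
    };
    int eal_argc = sizeof(eal_args) / sizeof(eal_args[0]);

    if (rte_eal_init(eal_argc, const_cast<char **>(eal_args)) < 0)
        return EXIT_FAILURE;
    // ... configure ports, launch lcore RX loops ...
    return EXIT_SUCCESS;
}
```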

OpenStack’s Mitaka release offered the features required for Yahoo! JAPAN’s OvS with DPDK implementation, so the company decided to build a Mitaka cluster running with the configurations mentioned above. The first cluster includes over 150 nodes and uses Open Compute Project (OCP) servers.

Benchmark Test Results

Yahoo! JAPAN achieved impressive performance results after implementing OvS with DPDK in its cloud environment. To demonstrate these gains, the engineers measured two benchmarks: the network layer (L2) and the application layer (L7).

Table 1. Benchmark test configuration.

Hardware
  • CPU: Intel® Xeon® processor E5-2683 v3, 2S
  • Memory: 512 GB DDR4-2400 RDIMM
  • NIC: Intel® Ethernet Converged Network Adapter X520-DA2

Software
  • Host OS: CentOS* 7.2
  • Guest OS: CentOS 7.2
  • OpenStack*: Mitaka
  • QEMU*: 2.6.2
  • Open vSwitch: 2.5.90 + TSO patch (a6be657)
  • Data Plane Development Kit: 16.04

Figure 2. L2 network benchmark test.

L2 Network Benchmark Test Results

In the L2 benchmark test, Yahoo! JAPAN used Ixia IxNetwork* as a packet generator. Upon measuring L2 performance (see Figure 2), Yahoo! JAPAN observed a 10x network throughput improvement for its short-packet traffic. The company also found that OvS with DPDK reduced latency to as little as 1/20th of its previous level. With these results, Yahoo! JAPAN confirmed that OvS with DPDK accelerates the L2 path to the VM. The results were roughly in line with what Yahoo! JAPAN expected, as telecommunications companies had achieved similar results in their benchmark tests.

L7 Network Benchmark Test Results

The L7 single-VM benchmark results for the application layer, however, exceeded Yahoo! JAPAN's expectations. In this test, Yahoo! JAPAN instructed one VM to send a query and another VM to return a response. All applications (HTTP, MQ, DNS, RDB) demonstrated significant performance gains in this scenario (see Figure 3). In the MySQL* sysbench result in particular, Yahoo! JAPAN saw simultaneous improvement in two important metrics: 1.5x better throughput (transactions/sec) and latency (response time) reduced to 1/1.5 of its previous value.

Figure 3. Various application benchmark test results.


Why did network performance improve so dramatically? In the case of HTTP, for example, Yahoo! JAPAN saw a 2.0x improvement with OvS with DPDK compared to Linux Bridge. Yahoo! JAPAN determined that this metric improved because OvS with DPDK reduces the number of context switches by 45 percent compared with Linux Bridge.

The benchmark results for RabbitMQ* revealed another promising discovery. When Yahoo! JAPAN ran its first stress test on RabbitMQ under Linux Bridge, it observed degraded performance. When it ran the same stress test under OvS with DPDK, the application environment maintained a much more consistent and satisfactory level of performance (see Figure 4).

Figure 4. RabbitMQ stress test results.


How was this possible? In both tests, noisy conditions created a high degree of context switching. In the Linux Bridge world, roughly 50 percent of CPU time is paid as a tax to the kernel; in the OvS with DPDK world, that tax is only about 10 percent. OvS with DPDK suppresses context switching, which keeps network performance from degrading even under challenging real-world conditions. Yahoo! JAPAN also found that CPU pinning reduces interference between multiple noisy-neighbor VMs and the critical OvS process, which further contributed to the performance improvements observed in this test. Which world would you want to live in: Linux Bridge or OvS with DPDK?

Ultimately, Yahoo! JAPAN found that OvS with DPDK delivers terrific network performance improvements for cloud environments. This finding was key to resolving Yahoo! JAPAN’s network performance issues and meeting the company’s SLA requirements.

Summary

Despite what you might think, deploying OvS with DPDK is actually not so difficult. Yahoo! JAPAN is already successfully using this technology in a production system with over 500 nodes. OvS with DPDK offers powerful performance benefits and provides a stable network environment, which enables Yahoo! JAPAN to meet its SLAs and easily support the demands placed on its cloud infrastructure. The impressive results that Yahoo! JAPAN has achieved through its implementation of OvS with DPDK can be enjoyed by other cloud service providers too.

When assessing whether OvS with DPDK will meet your requirements, it is important to carefully investigate what is causing the bottlenecks in your cloud environment. Once you fully understand the problem, you can identify which solution will best fit your specific needs.

To accomplish this task, Yahoo! JAPAN performed a thorough analysis of its network traffic before deciding how to proceed. The company learned that there was a high volume of short packets traveling throughout its network. This discovery indicated that OvS with DPDK might be a good solution for its problem, since OvS with DPDK is known to improve performance in network environments where a high volume of short packets is present. For this reason, Yahoo! JAPAN concluded that it is necessary to not only benchmark your results but also have a full understanding of your network’s characteristics in order to find the right solution.

Now that you’ve learned about the performance improvements that Yahoo! JAPAN achieved by implementing OvS with DPDK, have you considered deploying OvS with DPDK within your own cloud? To learn more about enabling OvS with DPDK on OpenStack, read these articles: Using Open vSwitch and DPDK with Neutron in DevStack, Using Open vSwitch with DPDK, and DPDK vHost User Ports.

Acknowledgment

Reflecting on this successful collaboration with Intel, Yusuke Tatsumi, network engineer for Yahoo! JAPAN’s infrastructure team, said: “We found out that the OvS and DPDK combination definitely improves application performance for cloud service providers. It strengthened our cloud architecture and made it more robust.” Yahoo! JAPAN is pleased to have demonstrated that OvS with DPDK is a valuable technology that can achieve impressive network performance results and meet the demanding daily traffic requirements of a leading Japanese Internet company.

About the Author

Rose de Fremery is a New York-based writer and technologist. She is the former Managing Editor of The Social Media Monthly, the world's first print magazine devoted to the social media revolution. Rose currently writes about a range of business IT topics including cloud infrastructure, VoIP, UC, CRM, business innovation, and teleworking.

Notices

Testing conducted by Yahoo! JAPAN on Yahoo! JAPAN’s infrastructure.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Copies of documents which have an order number and are referenced in this document may be obtained by calling 1-800-548-4725 or by visiting www.intel.com/design/literature.htm.

Intel, the Intel logo, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

© 2017 Intel Corporation.


What's New? - Intel® VTune™ Amplifier XE 2017 Update 4


Intel® VTune™ Amplifier XE 2017 performance profiler

A performance profiler for serial and parallel performance analysis. Overview | Training | Support.

New for 2017 Update 4! (Optional update unless you need...)

As compared to 2017 Update 3:

  • General Exploration, Memory Access, and HPC Performance Characterization analysis types extended to support the Intel® Xeon® Processor Scalable family
  • Support for Microsoft Windows* 10 Creators Update (RS2) 

Resources

  • Learn (“How to” videos, technical articles, documentation, …)
  • Support (forum, knowledgebase articles, how to contact Intel® Premier Support)
  • Release Notes (pre-requisites, software compatibility, installation instructions, and known issues)

Contents

File: vtune_amplifier_xe_2017_update4.tar.gz

Installer for Intel® VTune™ Amplifier XE 2017 for Linux* Update 4

File: VTune_Amplifier_XE_2017_update4_setup.exe

Installer for Intel® VTune™ Amplifier XE 2017 for Windows* Update 4 

File: vtune_amplifier_xe_2017_update4.dmg

Installer for Intel® VTune™ Amplifier XE 2017 - OS X* host only Update 4 

* Other names and brands may be claimed as the property of others.

Microsoft, Windows, Visual Studio, Visual C++, and the Windows logo are trademarks, or registered trademarks of Microsoft Corporation in the United States and/or other countries.

What is OpenCV?


OpenCV is a software toolkit for real-time image and video processing that also provides analytics and machine learning capabilities.

Development Benefits

Using OpenCV, a BSD-licensed library, developers can access many advanced computer vision algorithms used for image and video processing in 2D and 3D as part of their programs. Such algorithms are otherwise only found in high-end image and video processing software.

Powerful Built-In Video Analytics

Video analytics is much simpler to implement with OpenCV APIs for basic building blocks such as background removal, filters, pattern matching, and classification.

Real-time video analytics capabilities include classifying, recognizing, and tracking objects, animals, and people, as well as specific features such as vehicle number plates, animal species, and facial features (eyes, lips, chin, and so on).
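As a small illustration of these building blocks, here is a minimal C++ sketch of background removal on a live camera feed using OpenCV's MOG2 background subtractor; camera index 0 and the window handling are placeholder choices:

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);   // first available camera
    if (!cap.isOpened()) return 1;

    auto subtractor = cv::createBackgroundSubtractorMOG2();
    cv::Mat frame, foregroundMask;

    while (cap.read(frame)) {
        subtractor->apply(frame, foregroundMask);  // moving pixels -> white
        cv::imshow("foreground", foregroundMask);
        if (cv::waitKey(1) == 27) break;           // Esc to quit
    }
    return 0;
}
```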

Hardware and Software Requirements

OpenCV is written in optimized C/C++, is cross-platform by design, and works on a wide variety of hardware platforms, including the Intel Atom® platform, Intel® Core™ processor family, and Intel® Xeon® processor family.

Developers can program OpenCV using C++, C, Python*, and Java* on Operating Systems such as Windows*, many Linux* distros, Mac OS*, iOS* and Android*.

If a camera has a working driver for the operating system in use, OpenCV will be able to use it, although some cameras work better than others due to the quality of their drivers.

Hardware Optimizations

OpenCV takes advantage of multi-core processing and OpenCL™, so it can also use hardware acceleration when integrated graphics is present.

The OpenCV v3.2.0 release can use the Intel-optimized LAPACK/BLAS included in the Intel® Math Kernel Library (Intel® MKL) for acceleration. It can also use Intel® Threading Building Blocks (Intel® TBB) and Intel® Integrated Performance Primitives (Intel® IPP) for optimized performance on Intel platforms.

OpenCV uses the FFMPEG library and can use Intel® Quick Sync Video technology to accelerate encoding and decoding using hardware.

OpenCV and IoT

OpenCV has a wide range of applications in traditional computer vision applications such as optical character recognition or medical imaging.

For example, OpenCV can detect bone fractures1. OpenCV can also help classify skin lesions and aid in the early detection of skin melanomas2.

However, OpenCV coupled with the right processor and camera can become a powerful new class of computer vision enabled IoT sensor. This type of design can scale from simple sensors to multi-camera video analytics arrays. See Designing Scalable IoT Architectures for more information.3

IoT developers can use OpenCV to build embedded computer vision sensors for detecting IoT application events such as motion detection or people detection.

Designers can also use OpenCV to build even more advanced sensor systems such as face recognition, gesture recognition or even sentiment analysis as part of the IoT application flow.
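As one hedged example of such a sensor, the sketch below uses OpenCV's built-in HOG pedestrian detector to raise a hypothetical "people detected" event; the event hook is a placeholder for whatever the IoT application flow requires:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    // Built-in HOG descriptor pre-trained for pedestrian detection.
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    cv::VideoCapture cap(0);
    cv::Mat frame;
    while (cap.read(frame)) {
        std::vector<cv::Rect> people;
        hog.detectMultiScale(frame, people);   // bounding boxes of people
        if (!people.empty()) {
            // Placeholder: publish a "people detected" event to the
            // IoT back end or gateway here.
        }
    }
    return 0;
}
```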

IoT applications can also deploy OpenCV on Fog nodes at the Edge as an analytics platform for a larger number of camera based sensors.

For example, IoT applications use camera sensors with OpenCV for road traffic analysis, Advanced Driver Assistance Systems (ADAS)4, video surveillance5, and advanced digital signage with analytics in visual retail applications6.

OpenCV Integration

Integrating OpenCV with a neural-network backend unleashes the true power of computer vision. Using this approach, OpenCV works with Convolutional Neural Networks (CNN) and Deep Neural Networks (DNN) to allow developers to build innovative and powerful new vision applications.

To target multiple hardware platforms, these integrations need to be cross-platform by design. Hardware-specific optimization of deep learning algorithms breaks this design goal, so the OpenVX architecture standard proposes resource and execution abstractions.

Hardware vendors can optimize implementations with a strong focus on specific platforms. This allows developers to write code that is portable across multiple vendors and platforms, as well as multiple hardware types.

Intel® Computer Vision SDK (Beta) is an integrated design framework and a powerful toolkit for developers to solve complex problems in computer vision. It includes Intel’s implementation of the OpenVX API as well as custom extensions. It supports OpenCL custom kernels and can integrate CNN or DNN.

The pre-built and included OpenCV binary has hooks for Intel® VTune™ Amplifier for profiling vision applications.

Getting Started:

Try this tutorial on basic people recognition.  Also, see OpenCV 3.2.0 Documentation for more tutorials.

Related Software:

Intel® Computer Vision SDK - Accelerated computer vision solutions based on OpenVX standard, integrating OpenCV and deep learning support using the included Deep Learning (DL) Deployment Toolkit.

Intel® Integrated Performance Primitives (IPP) - Programming toolkit for high-quality, production-ready, low-level building blocks for image processing, signal processing, and data processing (data compression/decompression and cryptography) applications.

Intel® Math Kernel Library (MKL) - Library with accelerated math processing routines to increase application performance.

Intel® Media SDK - A cross-platform API for developing media applications using Intel® Quick Sync Video technology.

Intel® SDK for OpenCL™ Applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Intel® Distribution for Python* - Specially optimized Python distribution for High-Performance Computing (HPC) with accelerated compute-intensive Python computational packages like NumPy, SciPy, and scikit-learn.

Intel® Quick Sync Video - Leverage dedicated media processing capabilities of Intel® Graphics Technology to decode and encode fast, enabling the processor to complete other tasks and improving system responsiveness.

Intel® Threading Building Blocks (TBB) - Library for shared-memory parallel programming and intra-node distributed memory programming.

References:

  1. Bone fracture detection using OpenCV
  2. Mole Investigator: Detecting Cancerous Skin Moles Through Computer Vision
  3. Designing Scalable IoT Architectures
  4. Advanced Driver Assistance Systems (ADAS)
  5. Smarter Security Camera: A Proof of Concept (PoC) Using the Intel® IoT Gateway
  6. Introduction to Developing and Optimizing Display Technology

Intel® Xeon® Processor Scalable Family Technical Overview


Executive Summary

Intel uses a tick-tock model for its processor generations. The newest generation, the Intel® Xeon® processor Scalable family (formerly code-named Skylake-SP), is a “tock” based on 14 nm process technology. Major architecture changes take place on a “tock,” while minor architecture changes and a die shrink occur on a “tick.”

Figure 1. Tick-Tock model.

Intel Xeon processor Scalable family on the Purley platform is a new microarchitecture with many additional features compared to the previous-generation Intel® Xeon® processor E5-2600 v4 product family (formerly Broadwell microarchitecture). These features include increased processor cores, increased memory bandwidth, non-inclusive cache, Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Intel® Memory Protection Extensions (Intel® MPX), Intel® Ultra Path Interconnect (Intel® UPI), and sub-NUMA clusters.

In previous generations, two-socket and four-socket processor families were segregated into different product lines. One of the big changes with the Intel Xeon processor Scalable family is that it includes all the processor models associated with this new generation. Processors in the Intel Xeon processor Scalable family scale from a two-socket configuration up to an eight-socket configuration. They are Intel’s platform of choice for the most scalable and reliable performance, with the greatest variety of features and integrations, designed to meet the needs of the widest variety of workloads.

Figure 2. New branding for processor models.

A two-socket Intel Xeon processor Scalable family configuration can be found at all levels, bronze through platinum, while a four-socket configuration is found only at the gold through platinum levels, and the eight-socket configuration only at the platinum level. The bronze level has the fewest features, and more features are added as you move toward platinum. At the platinum level, all available features are offered across the entire range of processor socket counts (two through eight).

Introduction

This paper discusses the new features and enhancements available in Intel Xeon processor Scalable family and what developers need to do to take advantage of them.

Intel® Xeon® processor Scalable family Microarchitecture Overview

Figure 3. Block Diagram of the Intel® Xeon® processor Scalable family microarchitecture.

The Intel Xeon processor Scalable family on the Purley platform provides up to 28 cores, which bring additional computing power to the table compared to the 22 cores of its predecessor. Additional improvements include a non-inclusive last-level cache, a larger 1 MB L2 cache, faster 2666 MHz DDR4 memory, an increase to six memory channels per CPU, new memory protection features, Intel® Speed Shift Technology, on-die PMAX detection, integrated fabric via Intel® Omni-Path Architecture (Intel® OPA), Internet Wide Area RDMA Protocol (iWARP*), memory bandwidth allocation, Intel® Virtual RAID on CPU (Intel® VROC), and more.

Table 1. Generational comparison of the Intel Xeon processor Scalable family to the Intel® Xeon® processor E5-2600 and E7-4600 product families.


Intel Xeon processor Scalable family feature overview

The rest of this paper discusses the performance improvements, new capabilities, security enhancements, and virtualization enhancements in the Intel Xeon processor Scalable family.

Table 2. New features and technologies of the Intel Xeon processor Scalable family.


Skylake Mesh Architecture

On the previous generations of Intel® Xeon® processor families (formerly Haswell and Broadwell) on the Grantley platform, the cores, last-level cache (LLC), memory controller, IO controller, and inter-socket Intel® QuickPath Interconnect (Intel® QPI) ports are connected using a ring architecture, which had been in place for the last several generations of Intel® multi-core CPUs. As the number of cores on the CPU increased with each generation, access latency increased and available bandwidth per core diminished. This trend was mitigated by dividing the chip into two halves and introducing a second ring to reduce distances and add bandwidth.

Figure 4. Intel® Xeon® processor E5-2600 product family (formerly Broadwell-EP) on Grantley platform ring architecture.

With additional cores per processor and much higher memory and I/O bandwidth in the Intel® Xeon® processor Scalable family, the additional demands on the on-chip interconnect could become a performance limiter with the ring-based architecture. Therefore, the Intel Xeon processor Scalable family introduces a mesh architecture to mitigate the increased latencies and bandwidth constraints associated with previous ring-based architecture. The Intel Xeon processor Scalable family also integrates the caching agent, the home agent, and the IO subsystem on the mesh interconnect in a modular and distributed way to remove bottlenecks in accessing these functions. Each core and LLC slice has a combined Caching and Home Agent (CHA), which provides scalability of resources across the mesh for Intel® Ultra Path Interconnect (Intel® UPI) cache coherency functionality without any hotspots.

The Intel Xeon processor Scalable family mesh architecture encompasses an array of vertical and horizontal communication paths allowing traversal from one core to another through a shortest path (hop on vertical path to correct row, and hop across horizontal path to correct column). The CHA located at each of the LLC slices maps addresses being accessed to specific LLC bank, memory controller, or IO subsystem, and provides the routing information required to reach its destination using the mesh interconnect.

Figure 5. Intel Xeon processor Scalable family mesh architecture.

In addition to the improvements expected in the overall core-to-cache and core-to-memory latency, we also expect to see improvements in latency for IO initiated accesses. In the previous generation of processors, in order to access data in LLC, memory or IO, a core or IO would need to go around the ring and arbitrate through the switch between the rings if the source and targets are not on the same ring. In Intel Xeon processor Scalable family, a core or IO can access the data in LLC, memory, or IO through the shortest path over the mesh.

Intel® Ultra Path Interconnect (Intel® UPI)

The previous generation of Intel® Xeon® processors utilized Intel QPI, which has been replaced on the Intel Xeon processor Scalable family with Intel UPI. Intel UPI is a coherent interconnect for scalable systems containing multiple processors in a single shared address space. Intel Xeon processors that support Intel UPI provide either two or three Intel UPI links for connecting to other Intel Xeon processors over a high-speed, low-latency path. Intel UPI uses a directory-based home snoop coherency protocol with an operational speed of up to 10.4 GT/s. It improves power efficiency through the L0p low-power state, improves data transfer efficiency over the link using a new packetization format, and includes protocol-layer improvements such as the removal of preallocation, which eliminates a scalability limit of Intel QPI.

Figure 6. Typical two-socket configuration.

Figure 7. Typical four-socket ring configuration.

Figure 8. Typical four-socket crossbar configuration.

Figure 9. Typical eight-socket configuration.

Intel® Ultra Path Interconnect Caching and Home Agent

Previous implementations of Intel Xeon processors provided a distributed Intel QPI caching agent located with each core and a centralized Intel QPI home agent located with each memory controller. Intel Xeon processor Scalable family processors implement a combined CHA that is distributed and located with each core and LLC bank, and thus provides resources that scale with the number of cores and LLC banks. The CHA is responsible for tracking requests from the core and responding to snoops from local and remote agents, as well as for resolving coherency across multiple processors.

Intel UPI removes the requirement on preallocation of resources at the home agent, which allows the home agent to be implemented in a distributed manner. The distributed home agents are still logically a single Intel UPI agent that is address-interleaved across different CHAs, so the number of visible Intel UPI nodes is always one, irrespective of the number of cores, memory controllers used, or the sub-NUMA clustering mode. Each CHA implements a slice of the aggregated CHA functionality responsible for a portion of the address space mapped to that slice.

Sub-NUMA Clustering

A sub-NUMA cluster (SNC) is similar to the cluster-on-die (COD) feature introduced with Haswell, though there are some differences between the two. An SNC creates two localization domains within a processor: addresses served by one of the local memory controllers are mapped into the half of the LLC slices closer to that memory controller, while addresses served by the other memory controller are mapped into the LLC slices in the other half. Through this address-mapping mechanism, processes running on cores in one SNC domain that use memory attached to the memory controller in the same domain observe lower LLC and memory latency than they would on accesses mapped to locations outside the domain.

Unlike a COD mechanism where a cache line could have copies in the LLC of each cluster, SNC has a unique location for every address in the LLC, and it is never duplicated within the LLC banks. Also, localization of addresses within the LLC for each SNC domain applies only to addresses mapped to the memory controllers in the same socket. All addresses mapped to memory on remote sockets are uniformly distributed across all LLC banks independent of the SNC mode. Therefore even in the SNC mode, the entire LLC capacity on the socket is available to each core, and the LLC capacity reported through the CPUID is not affected by the SNC mode.

Figure 10 represents a two-cluster configuration consisting of SNC Domains 0 and 1 and their associated cores, LLC, and memory controllers. Each SNC domain contains half of the cores on the socket, half of the LLC banks, and one of the memory controllers with three DDR4 channels. The affinity of cores, LLC, and memory within a domain is expressed to the OS using the usual NUMA affinity parameters, so the OS can take SNC domains into account when scheduling tasks and allocating memory to a process for optimal performance.

SNC requires that memory is not interleaved in a fine-grain manner across memory controllers. In addition, SNC mode has to be enabled by BIOS to expose two SNC domains per socket and set up resource affinity and latency parameters for use with NUMA primitives.
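Since each SNC domain appears to the OS as a NUMA node, ordinary NUMA APIs are enough to keep a thread's working set in its own domain. Below is a minimal sketch using libnuma on Linux (link with -lnuma); the CPU number and allocation size are placeholders:

```cpp
#include <numa.h>    // libnuma
#include <cstdio>

int main() {
    if (numa_available() < 0) return 1;   // no NUMA support on this system

    // With SNC enabled, the NUMA node of a CPU is its SNC domain.
    int node = numa_node_of_cpu(0);

    // Allocate 1 MB local to that domain, so accesses stay within the
    // nearer memory controller and LLC half.
    void *buf = numa_alloc_onnode(1 << 20, node);
    std::printf("allocated on NUMA node %d\n", node);

    numa_free(buf, 1 << 20);
    return 0;
}
```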

Figure 10. Sub-NUMA cluster domains.

Directory-Based Coherency

Unlike the prior generation of Intel Xeon processors that supported four different snoop modes (no-snoop, early snoop, home snoop, and directory), the Intel Xeon processor Scalable family of processors only supports the directory mode. With the change in cache hierarchy to a non-inclusive LLC, the snoop resolution latency can be longer depending on where in the cache hierarchy a cache line is located. Also, with much higher memory bandwidth, the inter-socket Intel UPI bandwidth is a much more precious resource and could become a bottleneck in system performance if unnecessary snoops are sent to remote sockets. As a result, the optimization trade-offs for various snoop modes are different in Intel Xeon processor Scalable family compared to previous Intel Xeon processors, and therefore the complexity of supporting multiple snoop modes is not beneficial.

The Intel Xeon processor Scalable family carries forward some of the coherency optimizations from prior generations and introduces some new ones to reduce the effective memory latency. For example, some of the directory caching optimizations such as IO directory cache and HitME cache are still supported and further enhanced on the Intel Xeon processor Scalable family. The opportunistic broadcast feature is also supported, but it is used only with writes to local memory to avoid memory access due to directory lookup.

For IO directory cache (IODC), the Intel Xeon processor Scalable family provides an eight-entry directory cache per CHA to cache the directory state of IO writes from remote sockets. IO writes usually require multiple transactions to invalidate a cache line from all caching agents, followed by a writeback to put updated data in memory or the home socket’s LLC. With the directory information stored in memory, multiple accesses may be required to retrieve and update the directory state. IODC reduces accesses to memory to complete IO writes by keeping the directory information cached in the IODC structure.

HitME cache is another capability in the CHA that caches directory information for speeding up cache-to-cache transfer. With the distributed home agent architecture of the CHA, the HitME cache resources scale with number of CHAs.

Opportunistic Snoop Broadcast (OSB) is another feature carried over from previous generations into the Intel Xeon processor Scalable family. OSB broadcasts snoops when the Intel UPI link is lightly loaded, thus avoiding a directory lookup from memory and reducing memory bandwidth. In the Intel Xeon processor Scalable family, OSB is used only for local InvItoE (generated due to full-line writes from the core or IO) requests since data read is not required for this operation. Avoiding directory lookup has a direct impact on saving memory bandwidth.

Cache Hierarchy Changes

Figure 11. Generational cache comparison.

In the previous generation, the mid-level cache (MLC) was 256 KB per core and the last-level cache was a shared inclusive cache with 2.5 MB per core. In the Intel Xeon processor Scalable family, the cache hierarchy has changed to provide a larger MLC of 1 MB per core and a smaller shared, non-inclusive 1.375 MB LLC per core. A larger MLC increases the hit rate into the MLC, resulting in lower effective memory latency and lower demand on the mesh interconnect and LLC. The shift to a non-inclusive LLC allows more effective utilization of the overall cache on the chip than an inclusive cache.

If the core on the Intel Xeon processor Scalable family has a miss on all the levels of the cache, it fetches the line from memory and puts it directly into MLC of the requesting core, rather than putting a copy into both the MLC and LLC as was done on the previous generation. When the cache line is evicted from the MLC, it is placed into the LLC if it is expected to be reused.

Due to the non-inclusive nature of LLC, the absence of a cache line in LLC does not indicate that the line is not present in private caches of any of the cores. Therefore, a snoop filter is used to keep track of the location of cache lines in the L1 or MLC of cores when it is not allocated in the LLC. On the previous-generation CPUs, the shared LLC itself took care of this task.

Even with the changed cache hierarchy in the Intel Xeon processor Scalable family, the effective cache available per core is roughly the same as in the previous generation for a usage scenario where different applications run on different cores. Because of the non-inclusive nature of the LLC, the effective cache capacity for an application running on a single core is a combination of the MLC size and a portion of the LLC size. For other usage scenarios, such as multithreaded applications running across multiple cores with some shared code and data, or a scenario where only a subset of the cores on the socket is used, the effective cache capacity seen by applications may differ from that of previous-generation CPUs. In some cases, application developers may need to adapt their code to optimize for the changed cache hierarchy on the Intel Xeon processor Scalable family of processors.

Page Protection Keys

Memory corruption caused by stray writes is an issue with complex multithreaded applications. For example, not every part of the code in a database application needs the same level of privilege. The log writer should have write privileges to the log buffer, but only read privileges on other pages. Similarly, in an application with producer and consumer threads for some critical data structures, producer threads can be given additional rights over consumer threads on specific pages.

The page-based memory protection mechanism can be used to harden applications. However, page table changes are costly for performance since these changes require Translation Lookaside Buffer (TLB) shoot downs and subsequent TLB misses. Protection keys provide a user-level, page-granular way to grant and revoke access permission without changing page tables.

Protection keys provide 16 domains for user pages and use bits 62:59 of the page table leaf nodes (for example, PTE) to identify the protection domain (PKEY). Each protection domain has two permission bits in a new thread-private register called PKRU. On a memory access, the page table lookup is used to determine the protection domain (PKEY) of the access, and the corresponding protection domain-specific permission is determined from PKRU register content to see if access and write permission is granted. An access is allowed only if both protection keys and legacy page permissions allow the access. Protection keys violations are reported as page faults with a new page fault error code bit. Protection keys have no effect on supervisor pages, but supervisor accesses to user pages are subject to the same checks as user accesses.
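As a sketch of how an application might use this mechanism on Linux, via the glibc 2.27+ wrappers around the protection-key system calls (the log-writer scenario mirrors the example above; page size and the single-page scope are placeholder choices):

```cpp
#define _GNU_SOURCE
#include <sys/mman.h>
#include <cstring>

int main() {
    void *page = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    // Allocate a protection domain and tag the page with it once.
    int pkey = pkey_alloc(0, 0);
    pkey_mprotect(page, 4096, PROT_READ | PROT_WRITE, pkey);

    // Later, rights are toggled per thread by writing PKRU; no page-table
    // change and therefore no TLB shootdown is involved.
    pkey_set(pkey, PKEY_DISABLE_WRITE);   // reader mode: read-only
    // *(char *)page = 'x';               // would now fault

    pkey_set(pkey, 0);                    // writer mode: full access
    std::memset(page, 0, 4096);

    pkey_free(pkey);
    munmap(page, 4096);
    return 0;
}
```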

Figure 12. Diagram of memory data access with protection key.

In order to benefit from protection keys, support is required from the virtual machine manager, OS, and compiler. Utilizing this feature does not cause a performance impact because it is an extension of the memory management architecture.

Intel® Memory Protection Extensions (Intel® MPX)

C/C++ pointer arithmetic is a convenient language construct often used to step through an array of data structures. If an iterative write operation does not take into consideration the bounds of the destination, adjacent memory locations may get corrupted. Such unintended modification of adjacent data is referred to as a buffer overflow. Buffer overflows have been known to be exploited, causing denial-of-service (DoS) attacks and system crashes. Similarly, uncontrolled reads could reveal cryptographic keys and passwords. More sinister attacks, which do not immediately draw the attention of the user or system administrator, alter the code execution path, such as modifying the return address in the stack frame to execute malicious code or script.

Intel’s Execute Disable Bit and similar hardware features from other vendors have blocked buffer overflow attacks that redirected the execution to malicious code stored as data. Intel® MPX technology consists of new Intel® architecture instructions and registers that compilers can use to check the bounds of a pointer at runtime before it is used. This new hardware technology is supported by the compiler.

Figure 13. New Intel® Memory Protection Extensions instructions and example of their effect on memory.

For additional information see Intel® Memory Protection Extensions Enabling Guide.

Mode-Based Execute (MBE) Control

MBE provides finer grain control on execute permissions to help protect the integrity of the system code from malicious changes. It provides additional refinement within the Extended Page Tables (EPT) by turning the Execute Enable (X) permission bit into two options:

  • XU for user pages
  • XS for supervisor pages

The CPU selects one or the other based on permission of the guest page and maintains an invariant for every page that does not allow it to be writable and supervisor-executable at the same time. A benefit of this feature is that a hypervisor can more reliably verify and enforce the integrity of kernel-level code. The value of the XU/XS bits is delivered through the hypervisor, so hypervisor support is necessary.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512)

Figure 14. Generational overview of Intel® Advanced Vector Extensions technology.

Intel® Advanced Vector Extensions 512 (Intel® AVX-512) was originally introduced with the Intel® Xeon Phi™ processor product line (formerly Knights Landing). Certain Intel AVX-512 instruction groups (AVX512CD and AVX512F) are common to the Intel® Xeon Phi™ processor product line and the Intel Xeon processor Scalable family. However, the Intel Xeon processor Scalable family introduces new Intel AVX-512 instruction groups (AVX512BW and AVX512DQ) as well as a new capability (AVX512VL) that expands the benefits of the technology. The AVX512DQ instruction group is focused on new additions benefiting high-performance computing (HPC) workloads such as oil and gas, seismic modeling, financial services industry, molecular dynamics, ray tracing, double-precision matrix multiplication, fast Fourier transforms and convolutions, and RSA cryptography. The AVX512BW instruction group supports byte/word operations, which can benefit some enterprise applications, media applications, and HPC. AVX512VL is not an instruction group but a feature associated with vector length orthogonality.

Broadwell, the previous processor generation, has up to two floating-point FMAs (Fused Multiply-Add) per core, and this has not changed with the Intel Xeon processor Scalable family. However, the Intel Xeon processor Scalable family doubles the number of elements that can be processed compared to Broadwell, as its FMAs have been expanded from 256 bits to 512 bits.

Figure 15. Generation feature comparison of Intel® Advanced Vector Extensions technology.
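As a concrete illustration of the wider vector units, here is a minimal sketch of a SAXPY-style loop using 512-bit FMA intrinsics. The compiler flags named are indicative only (for example, -xCORE-AVX512 for the Intel compiler or -mavx512f for GCC):

```cpp
#include <immintrin.h>

// One 512-bit FMA processes 16 single-precision elements per instruction,
// twice the 8 elements of a 256-bit AVX2 FMA.
void saxpy(float a, const float *x, float *y, int n) {
    __m512 va = _mm512_set1_ps(a);
    for (int i = 0; i + 16 <= n; i += 16) {
        __m512 vx = _mm512_loadu_ps(x + i);
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va, vx, vy);   // y = a*x + y, 16 lanes at once
        _mm512_storeu_ps(y + i, vy);
    }
    // (Remainder elements omitted here; the opmask registers described
    // later handle such tails cleanly.)
}
```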

Intel AVX-512 instructions offer the highest degree of support to software developers by including an unprecedented level of richness in the design of the instructions. This includes 512-bit operations on packed floating-point data or packed integer data, embedded rounding controls (override global settings), embedded broadcast, embedded floating-point fault suppression, embedded memory fault suppression, additional gather/scatter support, high-speed math instructions, and compact representation of large displacement value. The following sections cover some of the details of the new features of Intel AVX-512.

AVX512DQ

The doubleword and quadword instructions, indicated by the AVX512DQ CPUID flag, enhance integer and floating-point operations. They consist of additional instructions that operate on 512-bit vectors of 16 32-bit elements or 8 64-bit elements. Some of these instructions provide new functionality, such as the conversion of floating-point numbers to 64-bit integers. Others promote existing instructions, such as vxorps, to use 512-bit registers.

AVX512BW

The byte and word instructions, indicated by the AVX512BW CPUID flag, enhance integer operations, extending write-masking and zero-masking to support smaller element sizes. The original Intel AVX-512 Foundation instructions supported such masking with vector element sizes of 32 or 64 bits, because a 512-bit vector register could hold at most 16 32-bit elements, so a write mask size of 16 bits was sufficient.

An instruction indicated by the AVX512BW CPUID flag requires a write-mask size of up to 64 bits because a 512-bit vector register can hold 64 8-bit elements or 32 16-bit elements. Two new mask types (__mmask32 and __mmask64) along with additional maskable intrinsics have been introduced to support this operation.
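A minimal sketch of byte-granular masking with the new 64-bit mask type (assumes AVX512BW support, e.g., -mavx512bw with GCC; the mask value would come from the surrounding algorithm):

```cpp
#include <immintrin.h>

// Only bytes whose mask bit is set are added; the remaining byte lanes
// of the destination are zeroed (zero-masking).
__m512i add_selected_bytes(__m512i a, __m512i b, __mmask64 which) {
    return _mm512_maskz_add_epi8(which, a, b);  // 64 byte lanes at once
}
```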

AVX512VL

An additional orthogonal capability known as Vector Length Extensions provides for most Intel AVX-512 instructions to operate on 128 or 256 bits, instead of only 512. Vector Length Extensions can currently be applied to most Foundation instructions and the Conflict Detection instructions, as well as the new byte, word, doubleword, and quadword instructions. These Intel AVX-512 Vector Length Extensions are indicated by the AVX512VL CPUID flag. Their use extends most Intel AVX-512 operations to XMM (128-bit, SSE) and YMM (256-bit, AVX) registers, and allows the capabilities of EVEX encodings, including the use of mask registers and access to registers 16–31, to be applied to XMM and YMM registers instead of only ZMM registers.

Mask Registers

In previous generations of Intel® Advanced Vector Extensions and Intel® Advanced Vector Extensions 2, the ability to mask bits was limited to load and store operations. In Intel AVX-512 this capability has been greatly expanded with eight new opmask registers used for conditional execution and efficient merging of destination operands. The width of each opmask register is 64 bits, and they are identified as k0–k7. Seven of the eight opmask registers (k1–k7) can be used in conjunction with EVEX-encoded Intel AVX-512 Foundation instructions to provide conditional processing, such as with vectorized remainders that only partially fill the register; opmask register k0 is typically treated as a “no mask” value when unconditional processing of all data elements is desired. Additionally, the opmask registers are also used as vector flags/element-level vector sources to introduce novel SIMD functionality, as seen in new instructions such as VCOMPRESSPS. Support for the 512-bit SIMD registers and the opmask registers is managed by the operating system using the XSAVE/XRSTOR/XSAVEOPT instructions (see Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, and Volume 3A).

Figure 16. Example of opmask register k1.
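A common use of opmask registers is handling a loop remainder without a scalar tail. The sketch below is one way to do it with Foundation intrinsics; the mask arithmetic assumes n is non-negative:

```cpp
#include <immintrin.h>

void add_arrays(const float *a, const float *b, float *out, int n) {
    int i = 0;
    for (; i + 16 <= n; i += 16)
        _mm512_storeu_ps(out + i,
            _mm512_add_ps(_mm512_loadu_ps(a + i), _mm512_loadu_ps(b + i)));

    int r = n - i;                                // 0..15 leftover elements
    __mmask16 k = (__mmask16)((1u << r) - 1);     // enable only first r lanes
    __m512 va = _mm512_maskz_loadu_ps(k, a + i);  // masked lanes: no fault
    __m512 vb = _mm512_maskz_loadu_ps(k, b + i);
    _mm512_mask_storeu_ps(out + i, k, _mm512_add_ps(va, vb));
}
```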

Embedded Rounding

Embedded Rounding provides additional support for math calculations by allowing the floating-point rounding mode to be explicitly specified for an individual operation, without having to modify the rounding controls in the MXCSR control register. In previous SIMD instruction extensions, rounding control is generally specified in the MXCSR control register, with a handful of instructions providing per-instruction rounding override via encoding fields within the imm8 operand. Intel AVX-512 offers a more flexible encoding attribute to override MXCSR-based rounding control for floating-point instructions with rounding semantics. This rounding attribute, embedded in the EVEX prefix, is called Static (per-instruction) Rounding Mode or Rounding Mode override. Static rounding also implies exception suppression (SAE), as if all floating-point exceptions are disabled and no status flags are set. Static rounding enables better accuracy control in intermediate steps for division and square root operations for extra precision, while the default MXCSR rounding mode is used in the last step. It can also help in cases where precision is needed in the least significant bit, such as in range reduction for trigonometric functions.
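A minimal sketch of a per-instruction rounding override using intrinsics; the rounding and SAE flags shown are standard, but the choice of round-toward-zero is arbitrary for illustration:

```cpp
#include <immintrin.h>

// This addition rounds toward zero regardless of the global MXCSR setting;
// _MM_FROUND_NO_EXC gives the SAE (suppress-all-exceptions) behavior.
__m512d add_truncated(__m512d a, __m512d b) {
    return _mm512_add_round_pd(a, b,
        _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}
```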

Embedded Broadcast

Embedded broadcast provides a bit-field to encode data broadcast for some load-op instructions, that is, instructions that load data from memory and perform a computational or data-movement operation. A source element from memory can be broadcast (repeated) across all elements of the effective source operand without requiring an extra instruction. This is useful when the same scalar operand is reused for all operations in a vector instruction. Embedded broadcast is only enabled on instructions with an element size of 32 or 64 bits, not on byte and word instructions.
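A sketch of the pattern embedded broadcast serves: with AVX-512 enabled, a compiler can typically fold the set1 splat below into the multiply as an embedded broadcast from memory (e.g., a {1to16} memory operand) rather than materializing the splat in a separate register; the exact code generated is compiler-dependent:

```cpp
#include <immintrin.h>

// Multiply every element of v by one scalar loaded from memory.
__m512 scale(__m512 v, const float *s) {
    return _mm512_mul_ps(v, _mm512_set1_ps(*s));
}
```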

Quadword Integer Arithmetic

Quadword integer arithmetic removes the need for expensive software emulation sequences. These instructions include gather/scatter with D/Qword indices and instructions that can partially execute, where a k-register mask is used as a completion mask.
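A minimal sketch of one of the quadword-index forms mentioned above, a gather of eight doubles using 64-bit indices (the scale parameter is in bytes):

```cpp
#include <immintrin.h>

// Gather base[idx[0..7]] into one 512-bit register; idx holds eight
// 64-bit (quadword) element indices, scaled by sizeof(double).
__m512d gather8(const double *base, __m512i qword_indices) {
    return _mm512_i64gather_pd(qword_indices, base, 8);
}
```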

Table 3. Quadword integer arithmetic instructions.


Math Support

Math support is designed to aid math library writers and to benefit financial applications. Available data types include PS, PD, SS, and SD. IEEE division/square root formats, DP transcendental primitives, and new transcendental support instructions are also included.

Table 4. Math support instructions.


New Permutation Primitives

Intel AVX-512 introduces new permutation primitives such as 2-source shuffles with 16/32-entry table lookups with transcendental support, matrix transpose, and a variable VALIGN emulation.

Table 5. 2-Source shuffles instructions.


Figure 17. Example of a 2-source shuffles operation.

Expand and Compress

Expand and compress allow vectorization of conditional loops. Similar to the FORTRAN pack/unpack intrinsics, they provide memory fault suppression and can be faster than using gather/scatter; compress provides the opposite operation to expand. The figure below shows an example of an expand operation.

Figure 18. Expand instruction and diagram.
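Compress, the opposite operation, is sketched below: a conditional loop that packs only the positive elements of an array, assuming for brevity that the element count is a multiple of 16:

```cpp
#include <immintrin.h>

// Keep only the positive elements of src, packed contiguously into dst.
// Returns the number of elements written.
int keep_positive(const float *src, float *dst, int n /* multiple of 16 */) {
    int out = 0;
    for (int i = 0; i < n; i += 16) {
        __m512 v = _mm512_loadu_ps(src + i);
        __mmask16 k = _mm512_cmp_ps_mask(v, _mm512_setzero_ps(), _CMP_GT_OQ);
        _mm512_mask_compressstoreu_ps(dst + out, k, v);  // pack chosen lanes
        out += _mm_popcnt_u32(k);                        // count survivors
    }
    return out;
}
```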

Bit Manipulation

Intel AVX-512 provides support for bit manipulation operations on mask and vector operands, including vector rotate. These operations can be used to manipulate mask registers, and they have some application in cryptography algorithms.

Table 6. Bit manipulation instructions.


Universal Ternary Logical Operation

A universal ternary logical operation is another feature of Intel AVX-512 that provides a way to mimic an FPGA cell. The VPTERNLOGD and VPTERNLOGQ instructions operate on dword and qword elements and take three bit-vectors of the respective input data elements to form a set of 32/64 indices, where each 3-bit value provides an index into an 8-bit lookup table represented by the imm8 byte of the instruction. The 256 possible values of the imm8 byte can be viewed as a 16x16 Boolean logic table, which can be filled with simple or compound Boolean logic expressions.
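A sketch of how the lookup table works in practice: the imm8 value 0xCA encodes the truth table of a bitwise select, (a AND b) OR (NOT a AND c), so the whole expression compiles to a single VPTERNLOG instruction:

```cpp
#include <immintrin.h>

// For each bit position, the 3-bit input (a,b,c) indexes into the 8-bit
// table 0xCA, exactly like an FPGA LUT; 0xCA happens to implement
// "a ? b : c" per bit.
__m512i bitselect(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0xCA);
}
```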

Conflict Detection Instructions

Intel AVX-512 introduces new conflict detection instructions. This includes the VPCONFLICT instruction along with a subset of supporting instructions. The VPCONFLICT instruction allows for detection of elements with previous conflicts in a vector of indexes. It can generate a mask with a subset of elements that are guaranteed to be conflict free. The computation loop can be re-executed with the remaining elements until all the indexes have been operated on.

Table 7. Conflict detection instructions.


VPCONFLICT{D,Q} zmm1{k1}{z}, zmm2/B(mV): for every element in ZMM2, compare it against all other elements and generate a mask identifying the matches, ignoring elements to its left (that is, “newer” elements).

Figure 19. Diagram of mask generation for VPCONFLICT.

In order to benefit from the Conflict Detection Instructions (CDI), use Intel compilers version 16.0 in Intel® C++ Composer XE 2016, which recognize potential run-time conflicts and generate VPCONFLICT loops automatically.
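For readers who prefer intrinsics over autovectorization, here is a minimal sketch of the core idea (assumes AVX512CD, e.g., -mavx512cd with GCC): compute each lane's conflict bitmask and derive the set of lanes that are safe to process in the current pass:

```cpp
#include <immintrin.h>

// _mm512_conflict_epi32 returns, per lane, a bitmask of earlier ("older")
// lanes holding the same value; a zero result means no conflict, so that
// lane can be updated safely in this pass. The remaining lanes are
// re-executed in subsequent passes.
__mmask16 conflict_free_lanes(__m512i indices) {
    __m512i conf = _mm512_conflict_epi32(indices);
    return _mm512_cmpeq_epi32_mask(conf, _mm512_setzero_si512());
}
```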

Transcendental Support

Additional 512-bit instruction extensions have been provided to accelerate certain transcendental mathematic computations and can be found in the instructions VEXP2PD, VEXP2PS, VRCP28xx, and VRSQRT28xx, also known as Intel AVX-512 Exponential and Reciprocal instructions. These can benefit some finance applications.

Compiler Support

Intel AVX-512 optimizations are included in Intel compilers version 16.0 in Intel C++ Composer XE 2016 and the GNU* Compiler Collection (GCC) 5.0 (NASM 2.11.08 and binutils 2.25). Table 8 summarizes compiler arguments for optimization on the Intel Xeon processor Scalable family microarchitecture with Intel AVX-512.

Table 8. Summary of Intel Xeon processor Scalable family compiler optimizations.


For more information, see the Intel® Architecture Instruction Set Extensions Programming Reference.

Time Stamp Counter (TSC) Enhancement for Virtualization

The Intel Xeon processor Scalable family introduces a new TSC scaling feature to assist with migration of a virtual machine across different systems. In previous Intel Xeon processors, the TSC of a VM cannot automatically adjust itself to compensate for a processor frequency difference as it migrates from one platform to another. The Intel Xeon processor Scalable family enhances TSC virtualization support by adding a scaling feature in addition to the offsetting feature available in prior-generation CPUs. For more details on this feature see Intel® 64 SDM (search for “TSC Scaling”, e.g., Vol 3A – Sec 24.6.5, Sec 25.3, Sec 36.5.2.6).

Intel® Resource Director Technology (Intel® RDT)

Intel® Resource Director Technology (Intel® RDT) is a set of technologies designed to help monitor and manage shared resources. See Optimize Resource Utilization with Intel® Resource Director Technology for an animation illustrating the key principles behind Intel RDT. Intel RDT already has several existing features that provide benefit such as Cache Monitoring Technology (CMT), Cache Allocation Technology (CAT), Memory Bandwidth Monitoring (MBM), and Code Data Prioritization (CDP). The Intel Xeon processor Scalable family on the Purley platform introduces a new feature called Memory Bandwidth Allocation (MBA) which has been added to provide a per-thread memory bandwidth control. Through software the amount of memory bandwidth consumption of a thread or core can be limited. This feature can be used in conjunction with MBM to isolate a noisy neighbor. Chapter 17.16 in volume 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM) covers programming details on Intel RDT features. Using this feature requires enabling at the OS or VMM level, and the Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x) feature must be enabled at the BIOS level. For instructions on setting Intel VT-x, refer to your OEM BIOS guide.
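As a hedged illustration of applying MBA from software, the sketch below uses the Linux resctrl interface to Intel RDT (kernel 4.12 or later, with resctrl mounted via "mount -t resctrl resctrl /sys/fs/resctrl"); the group name, throttle percentage, and CPU mask are placeholders:

```cpp
#include <fstream>
#include <sys/stat.h>

int main() {
    // Creating a directory under /sys/fs/resctrl creates a resource group.
    mkdir("/sys/fs/resctrl/noisy", 0755);

    // Cap the group's memory bandwidth at roughly 20% on socket 0
    // (resctrl "MB:" schemata syntax), then assign CPU 0 to the group
    // ("cpus" takes a hex CPU bitmask; 1 == CPU 0).
    std::ofstream("/sys/fs/resctrl/noisy/schemata") << "MB:0=20\n";
    std::ofstream("/sys/fs/resctrl/noisy/cpus")     << "1\n";
    return 0;
}
```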

Figure 20. Conceptual diagram of using Memory Bandwidth Monitoring to identify a noisy neighbor (core 0) and then using Memory Bandwidth Allocation to prioritize memory bandwidth.

Intel® Speed Shift Technology

Broadwell introduced Hardware Power Management (HWPM), an optional processor power management feature in hardware that frees the OS from making decisions about processor frequency. HWPM allows the platform to provide information on all available constraints, letting the hardware choose the optimal operating point. Operating independently, the hardware uses information that is not available to software and can make more optimized decisions about p-states and c-states. The Intel Xeon processor Scalable family on the Purley platform expands on this feature by providing a broader range of states that it can affect, as well as finer granularity and microarchitecture observability via the Package Control Unit (PCU). On Broadwell, HWPM was autonomous, also known as Out-of-Band (OOB) mode, and oblivious to the operating system. The Intel Xeon processor Scalable family still allows this, but it also offers the option of a collaboration between HWPM and the operating system, known as native mode. The operating system can directly control the tuning of the performance and power profile when and where desired, while elsewhere the PCU can take autonomous control in the absence of constraints placed by the operating system. In native mode, the Intel Xeon processor Scalable family can optimize frequency control for legacy operating systems while providing new usage models for modern operating systems. The end user can set these options within the BIOS; see your OEM BIOS guide for more information. Modern operating systems that provide full integration with native mode include Linux* starting with kernel 4.10 and Windows Server* 2016.
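On a Linux system you can check whether the processor and driver expose HWP; a quick sketch using the standard cpuinfo and sysfs locations:

# HWP-related CPU flags (hwp, hwp_notify, hwp_act_window, hwp_epp)
grep -m1 -o 'hwp[a-z_]*' /proc/cpuinfo | sort -u

# intel_pstate is the driver that manages P-states when HWP native mode is in use
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver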

PMax Detection

A processor-implemented detection circuit provides faster detection of and response to PMax-level load events. Previously, PMax detection circuits resided either in the power supply unit (PSU) or on the system board; the new detection circuit on the Intel Xeon processor Scalable family resides primarily on the processor side. In general, this allows faster PMax detection and response time compared to prior-generation PMax detection methods. PMax detection allows the processor to be throttled back when it detects that power limits are being hit. This can help absorb PMax spikes associated with power-virus-like applications while in turbo mode, before the PSU reacts. The faster response to PMax load events can also translate into power cost savings. The end user can set PMax detection within the BIOS; see your OEM BIOS guide for more information.

Intel® Omni-Path Architecture (Intel® OPA)

Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Framework, delivers the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes—and eventually more—at a price competitive with today’s fabrics. The Intel OPA 100 Series product line is an end-to-end solution of PCIe* adapters, silicon, switches, cables, and management software. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon a combination of enhanced IP and Intel® technology.

For software applications, Intel OPA will maintain consistency and compatibility with existing Intel True Scale Fabric and InfiniBand* APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux distribution releases. Intel True Scale Fabric customers will be able to migrate to Intel OPA through an upgrade program.

The Intel Xeon processor Scalable family on the Purley platform supports Intel OPA in one of two forms: through the use of an Intel® Omni-Path Host Fabric Interface 100 Series add-in card or through a specific processor model line (SKL-F) found within the Intel Xeon processor Scalable family that has a Host Fabric Interface integrated into the processor. The fabric integration on the processor has its own dedicated pathways on the motherboard and doesn’t impact the PCIe lanes available for add-in cards. The architecture is able to provide up to 100 Gb/s per processor socket.

Intel is working with the open source community to provide all host components, with changes being pushed upstream in conjunction with Delta Package releases. OSVs are working with Intel to incorporate these changes into future OS distributions. Existing Message Passing Interface (MPI) programs and MPI libraries for Intel True Scale Fabric that use PSM will work as-is with the Intel Omni-Path Host Fabric Interface without recompiling, although recompiling can expose additional benefits.

Software support can be found at the Intel Download Center, and compiler support can be found in Intel® Parallel Studio XE 2017.

Intel QuickAssist Technology

Intel® QuickAssist Technology (Intel® QAT) accelerates and compresses cryptographic workloads by offloading the data to hardware capable of optimizing those functions. This makes it easier for developers to integrate built-in cryptographic accelerators into network, storage, and security applications. On the Purley platform, Intel QAT is integrated into the hardware of the Intel® C620 series chipset (formerly Lewisburg) and offers outstanding capabilities, including 100 Gb/s crypto, 100 Gb/s compression, and 100K operations per second of RSA 2K decryption. Segments that can benefit from the technology include the following:

  • Server: secure browsing, email, search, big-data analytics (Hadoop), secure multi-tenancy, IPsec, SSL/TLS, OpenSSL
  • Networking: firewall, IDS/IPS, VPN, secure routing, Web proxy, WAN optimization (IP Comp), 3G/4G authentication
  • Storage: real-time data compression, static data compression, secure storage.

Supported Algorithms include the following:

  • Cipher Algorithms: Null, ARC4, AES (key sizes 128, 192, 256), DES, 3DES, Kasumi, Snow3G, and ZUC
  • Hash/Authentication Algorithms Supported: MD5, SHA1, SHA-2 (output sizes 224, 256, 384, 512), SHA-3 (output size 256 only), Advanced Encryption Standard (key sizes 128, 192, 256), Kasumi, Snow 3G, and ZUC
  • Authentication Encryption (AEAD) Algorithm: AES (key sizes 128, 192, 256)
  • Public Key Cryptography Algorithms: RSA, DSA, Diffie-Hellman (DH), Large Number Arithmetic, ECDSA, ECDH, EC, SM2 and EC25519

ZUC and SHA-3 are new algorithms that have been included in the third generation of Intel QuickAssist Technology found on the Intel® C620 series chipset.
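As a quick illustration of consuming the accelerator from an application stack, the OpenSSL speed utility can be pointed at a QAT engine once one is installed. This is a sketch only, assuming the Intel QAT engine for OpenSSL is present; the engine identifier (shown here as qatengine) is an assumption and varies by release, with older releases registering it as qat:

# Confirm the engine loads and is available
openssl engine -t qatengine

# Compare offloaded and software RSA 2K performance
openssl speed -engine qatengine rsa2048
openssl speed rsa2048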

Intel® Key Protection Technology (Intel® KPT) is a new supplemental feature of Intel QAT that can be found on the Intel Xeon processor Scalable family on the Purley platform with the Intel® C620 series chipset. Intel KPT has been developed to help secure cryptographic keys from platform level software and hardware attacks when the key is stored and used on the platform. This new feature focuses on protecting keys during runtime usage and is embodied within tools, techniques, and the API framework.

For a more detailed overview see Intel® QuickAssist Technology for Storage, Server, Networking and Cloud-Based Deployments. Programming and optimization guides can be found on the 01 Intel Open Source website.

Internet Wide Area RDMA Protocol (iWARP)

iWARP is a technology that allows network traffic managed by the NIC to bypass the kernel, which reduces the impact on the processor due to the absence of network-related interrupts. This is accomplished by the NICs communicating with each other via queue pairs to deliver traffic directly into the application user space. Large storage blocks and virtual machine migration tend to place more burden on the CPU due to the network traffic; this is where iWARP can be of benefit. Because the queue pairs establish where the data needs to go, the data can be placed directly into the application user space, eliminating the extra data copies between kernel space and user space that would otherwise occur.

For more information, see the video Accelerating Ethernet with iWARP Technology.

Figure 21. iWARP comparison block diagram.

The Purley platform has an Integrated Intel Ethernet Connection X722 with up to 4x10 GbE/1 Gb connections that provide iWARP support. This new feature can benefit various segments including network function virtualization and software-defined infrastructure. It can also be combined with the Data Plane Development Kit to provide additional benefits with packet forwarding.

iWARP endpoints communicate through verbs APIs instead of traditional sockets. On Linux*, the OFA OFED stack provides the verbs APIs, while Windows* uses the Network Direct APIs. Check with your Linux distribution to see whether it supports OFED verbs; on Windows, support is provided starting with Windows Server 2012 R2.
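A minimal libibverbs sketch that enumerates the RDMA-capable devices (such as iWARP NICs) visible through the verbs stack; compile with -libverbs:

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;

    /* Enumerate all RDMA devices registered with the verbs stack */
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("Found %d RDMA device(s)\n", num);
    for (int i = 0; i < num; i++)
        printf("  %s\n", ibv_get_device_name(list[i]));
    ibv_free_device_list(list);
    return 0;
}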

New and Enhanced RAS Features

The Intel Xeon processor Scalable family on the Purley platform provides several new features as well as enhancements of some existing features associated with the RAS (Reliability, Availability, and Serviceability) and Intel® Run Sure Technology. Two levels of support are provided with the Intel Xeon processor Scalable family: Standard RAS and Advanced RAS. Advanced RAS includes all of the Standard RAS features along with additional features.

In previous generations, RAS features could be limited by processor socket count (2–8 sockets). This has changed: all of the RAS features are available on two-socket and larger versions of the platform, depending on the level (bronze through platinum) of the processors. Below is a summary of the new and enhanced RAS features relative to the previous generation.

Table 9. RAS feature summary table.


Intel® Virtual RAID on CPU (Intel® VROC)

Figure 22. Intel® VROC replaces third-party raid cards.

Intel VROC is a software solution that integrates with a new hardware technology called Intel® Volume Management Device (Intel® VMD) to provide a compelling hybrid RAID solution for NVMe* (Non-Volatile Memory Express*) solid-state drives (SSDs). The CPU has onboard capabilities that work more closely with the chipset to provide quick access to the directly attached NVMe SSDs on the PCIe lanes of the platform. The major features that help to make this possible are Intel® Rapid Storage Technology enterprise (Intel® RSTe) version 5.0, Intel VMD, and the Intel provided NVMe driver.

Intel RSTe is a driver and application package that allows administration of the RAID features. It has been updated to version 5.0 on the Purley platform to take advantage of all of the new features. The NVMe driver makes it possible to bypass restrictions that an operating system might otherwise impose, which means features like hot insert can be available even if the OS doesn’t provide them, and the driver can also support third-party (non-Intel) NVMe SSDs.

Intel VMD is a new technology introduced with the Intel Xeon processor Scalable family primarily to improve the management of high-speed SSDs. Previously SSDs were attached to a SATA or other interface types and managing them through software was acceptable. When we move toward directly attaching the SSDs to a PCIe interface in order to improve bandwidth, software management of those SSDs adds more delays. Intel VMD uses hardware to mitigate these management issues rather than completely relying on software.

Major RAID features provided by Intel VROC include a protected write-back cache, storage devices isolated from the OS (for error handling), and protection of RAID 5 data from the RAID write hole issue through software logging, which can eliminate the need for a battery backup unit. Directly attached NVMe RAID volumes are bootable; support hot insert and surprise removal; provide LED management options and 4K-native NVMe SSD support; and offer multiple management options, including remote access from a webpage, interaction at the UEFI level for pre-OS tasks, and a GUI at the OS level.

Boot Guard

Boot Guard adds another level of protection to the Purley platform by performing a cryptographic Root of Trust for Measurement (RTM) of the early firmware into a platform storage device such as a trusted platform module or Intel® Platform Trust Technology (Intel® PTT). It can also cryptographically verify early firmware using OEM-provided policies. Unlike Intel® Trusted Execution Technology (Intel® TXT), Boot Guard has no software requirements; it is enabled at the factory and cannot be disabled. Boot Guard operates independently of Intel TXT but is also compatible with it. Boot Guard reduces the chance of malware exploiting the hardware or software components.

Figure 23. Boot Guard secure boot options.

BIOS Guard 2.0

BIOS Guard is an augmentation of existing chipset-based BIOS flash protection capabilities. The Purley platform adds a fault-tolerant boot block update capability. The BIOS flash is segregated into protected and unprotected regions. To facilitate the fault-tolerant boot block update, Purley bypasses the top-swap feature and the flash range register locks/protections for explicitly enabled signed scripts. This feature protects the BIOS flash from modification without the platform manufacturer’s authorization, including during BIOS updates, and it can also help defend the platform from low-level denial-of-service (DoS) attacks.

For more details see Intel® Hardware-based Security Technologies for Intelligent Retail Devices.

Figure 24. BIOS Guard 2.0 block diagram.

Intel® Processor Trace

Intel® Processor Trace (Intel® PT) is an exciting feature with improved support on the Intel Xeon processor Scalable family that can be enormously helpful in debugging, because it exposes an accurate and detailed trace of activity with triggering and filtering capabilities to help with isolating the tracing that matters.

Intel PT provides the context around all kinds of events. Performance profilers can use Intel PT to discover the root causes of “response-time” issues—performance issues that affect the quality of execution, if not the overall runtime.

Further, the complete tracing provided by Intel PT enables a much deeper view into execution than has previously been commonly available; for example, loop behavior, from entry and exit down to specific back-edges and loop tripcounts, is easy to extract and report.

Debuggers can use Intel PT to reconstruct the code flow that led to the current location, whether this is a crash site, a breakpoint, a watchpoint, or simply the instruction following a function call we just stepped over. They may even allow navigating in the recorded execution history via reverse stepping commands.

Another important use case is debugging stack corruptions. When the call stack has been corrupted, normal frame unwinding usually fails or may not produce reliable results. Intel PT can be used to reconstruct the stack back trace based on actual CALL and RET instructions.

Operating systems could include Intel PT into core files. This would allow debuggers to not only inspect the program state at the time of the crash, but also to reconstruct the control flow that led to the crash. It is also possible to extend this to the whole system to debug kernel panics and other system hangs. Intel PT can trace globally so that when an OS crash occurs, the trace can be saved as part of an OS crash dump mechanism and then used later to reconstruct the failure.

Intel PT can also help to narrow down data races in multi-threaded operating systems and user program code. It can log the execution of all threads with a rough time indication. While it is not precise enough to detect data races automatically, it can give enough information to aid in the analysis.

To utilize Intel PT, you can use Intel® VTune™ Amplifier version 2017.
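Recent Linux* kernels also expose Intel PT through the perf tool; a brief sketch, where ./myapp is a placeholder for the program under test:

# Record a full control-flow trace of the target application
perf record -e intel_pt// -- ./myapp

# Decode the trace; --itrace=i100ns synthesizes an instruction sample
# every 100 ns of traced time
perf script --itrace=i100ns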

For more information, see Debug and fine-grain profiling with Intel processor trace, presented by Beeman Strong, and Processor tracing by James Reinders.

Intel® Node Manager

Intel® Node Manager (Intel® NM) is a core set of power management features that provide a smart way to optimize and manage power, cooling, and compute resources in the data center. This server management technology extends component instrumentation to the platform level and can be used to make the most of every watt consumed in the data center. First, Intel NM reports vital platform information, such as power, temperature, and resource utilization, using standards-based, out-of-band communications. Second, it provides fine-grained controls, such as helping to reduce overall power consumption or maximize rack loading, to limit platform power in compliance with IT policy. This feature can be found across Intel’s product segments, including the Intel Xeon processor Scalable family, providing consistency within the data center.

The Intel Xeon processor Scalable family on the Purley platform includes the fourth generation of Intel NM, which extends control and reporting to a finer level of granularity than on the previous generation. To use this feature you must enable the BMC LAN and the associated BMC user configuration at the BIOS level, which should be available under the server management menu. The Programmer’s Reference Kit is simple to use and requires no additional external libraries to compile or run. All that is needed is a C/C++ compiler and to then run the configuration and compilation scripts.

Table 10. Intel Node Manager fourth-generation features.


The Author: David Mulnix is a software engineer and has been with Intel Corporation for over 20 years. His areas of focus have included software automation, server power, and performance analysis, and he has contributed to the development and support of the Server Efficiency Rating Tool™.

Contributors: Akhilesh Kumar and Elmoustapha Ould-ahmed-vall

Resources

Intel® 64 and IA-32 Architectures Software Developer’s Manual (SDM)

Intel® Architecture Instruction Set Extensions Programming Reference

Intel® Resource Director Technology (Intel® RDT)

Optimize Resource Utilization with Intel® Resource Director Technology

Intel® Memory Protection Extensions Enabling Guide

Intel® Scalable System Framework

Intel® Run Sure Technology

Intel® Hardware-based Security Technologies for Intelligent Retail Devices

Processor tracing by James Reinders

Debug and fine-grain profiling with Intel processor trace given by Beeman Strong, Senior

Intel® Node Manager Website

Intel® Node Manager Programmer’s Reference Kit

Open Source Reference Kit for Intel® Node Manager

How to set up Intel® Node Manager

Intel® Performance Counter Monitor (Intel® PCM): a better way to measure CPU utilization

Intel® Memory Latency Checker (Intel® MLC): a Linux* tool for measuring DRAM latency on your system

Intel® VTune™ Amplifier 2017: rich performance insight into hotspots, threading, locks and waits, OpenCL bandwidth, and more, with a profiler to visualize results

The Intel® Xeon® processor-based server refresh savings estimator

Intel® Sound Analytic Engine and Intel® Smart Home Developer Kits: Use Cases and Applications


Overview

Speech cognitive technology is all around us. From the telephone payment systems at your local utility company, to the digital personal assistant in your phone, and more recently, the smart speaker sitting in your living room, speech cognition is a pervasive and rapidly growing technology.
Over the past five years, adoption of voice-enabled devices has grown exponentially. Driven by market leaders (Amazon, Google, and Microsoft), voice-controlled Smart Home devices are beginning to permeate the domestic environment. While we are most familiar with smart speakers, this is just the first phase in the evolution of the home from a place of shelter and comfort to a valuable tool that makes our lives easier. This presents an exciting opportunity for you as a product developer to add voice capabilities to new and innovative form factors. Whether it’s adding voice to a current design or building a new, voice-first product, Intel® technology provides the building blocks for prototyping and bringing new Smart Home solutions to market. But before determining what type of customer experience you intend to create, let’s look at some of the benefits and user requirements for adoption.

Benefits of Enabling Voice on Smart Home devices

The benefits of adding speech understanding to Smart Home devices for manufacturers can be grouped into three different categories:
  • Simplifying and accelerating access to the internet 
  • Learning and characterization of the user needs
  • Best-in-class hardware with over-the-top applications

Simplifying and Accelerating Access to the Internet 

Devices like smart speakers allow users to access information and services online with the intuitive power of their voice. These devices typically feature Personal Assistant technology and are connected to the internet to provide Natural Language Processing (NLP) in the cloud. This allows manufacturers to deliver cloud-based services seamlessly through voice-first devices. The device providers are also able to identify and learn customer preferences, which allows them to improve over time and grow adoption. 

Learning and Characterization of the User Needs

Speech can also be a valuable tool to understand your customers’ needs without requiring cloud-based services. Simple command and control functionality can be added to many different form factors (think coffee pots, dishwashers, microwaves) as an easy interface for users. In turn, this allows companies to better understand how their products are being used and improve the product lifecycle. 

Best-in-class Hardware with User-friendly Applications

Manufacturers looking to leverage their existing products to build high-fidelity, sound-capable devices gain a clear competitive advantage when adding voice to their platforms. These highly integrated, user-friendly designs often require little to no training for customers to be up and running. This allows companies to focus on continuing to develop best-in-class hardware while supporting a large application development infrastructure to bring a uniform and intuitive experience to users.

User Requirements

The rapid adoption of voice-first technology in the home is due in large part to the ease and instinctiveness of communicating with your natural voice. For that reason, it’s important for developers to focus on ease of use when building voice-enabled Smart Home products. To drive user adoption, voice-enabled Smart Home devices will require low latency, low word error rate (WER), a large vocabulary (local or cloud-based), and the ability to speak and be understood from a reasonable conversation distance. 

Intel® Technology for Smart Home devices

The Intel® Sound Analytic Engine is a dual-DSP and neural network accelerator that provides silicon, algorithms, and a reference-design microphone array. It implements complex far-field signal processing algorithms that use high-dimensional microphone arrays for beamforming, echo cancellation, and noise reduction. This simplifies enabling speech across a range of form factors, allowing developers to add far-field voice, speech recognition, and amazing acoustics to low-power devices. It meets the user requirements by providing a building block for voice enabling that uses a silicon-based, Intel-developed Gaussian Network Accelerator (GNA).
Intel® Sound Analytic Engine provides you with a straightforward path to developing either a cloud-based voice recognition system or a large vocabulary local speech recognition system. It allows you to bring products to market quickly with a pre-established framework for a smart speaker design that can be integrated into many different form factors. 

Intel® Developer Kits for Creating Smart Home Products

Intel is introducing Smart Home Developer Kits to empower hardware and software developers to quickly bring new voice-enabled products to market. The primary technology in these kits is the Intel® Sound Analytic Engine. 
The first developer kit, the Intel® Speech Enabling Developer Kit, will be available for sale in October 2017. This kit contains the Intel® Sound Analytic Engine (a dual DSP with neural network accelerator), mic arrays, a speaker mount, and a Raspberry Pi* connector cable to get you quickly prototyping with Alexa* Voice Services. Future developer kits will enable additional features, including imaging and sensors.

What you can Build for the Smart Home

There are two main categories of Smart Home devices that can utilize Intel® Sound Analytic Engine technology to enable speech understanding: smart appliances and smart speakers.
Transforming traditional appliances into “smart appliances” requires being able to interact with them directly. Rather than adding a keyboard and mouse or a touchscreen, which still require users to physically interact with their devices, truly smart appliances should have voice as their primary interface. This will require far-field understanding, low latency, and low power for always-on capabilities. Low cost and low power are critical for the digital microphones and speakers that power speech interaction. Adding voice to existing form factors will also require flexible designs that can fit into established chassis, like ovens, dishwashers, and tea kettles. When these requirements are satisfied, you will deliver true value for your users. For instance, enabling speech understanding on a coffee pot would allow you to start your coffee with your voice, freeing you to accomplish other morning tasks simultaneously.
Smart speakers enabled with Personal Assistants are a rapidly growing segment of smart home products. Research from Parks Associates suggests that adoption doubled, from 5% to 10-11%, in the U.S. between 2015 and 2016, and total sales of smart speakers with Personal Assistants were estimated at 14 million units in 2016¹. This rapid adoption can be attributed to the intuitive interface and utility of these devices, for everything from playing music to accessing cloud-based services for information. The Intel® Speech Enabling Developer Kit can be leveraged to prototype smart speakers equipped with cloud-based Personal Assistants. The Intel® Sound Analytic Engine delivers a high-quality speech recognition experience and a broad choice of pre- and post-processing capabilities, and its pre-established framework for a smart speaker design can be integrated into many different form factors to enable a ubiquitous speech platform and a fast path to market.

Conclusion

In the last five years, we have experienced an explosion of voice-enabled devices. However, we have just scratched the surface of what is possible when you add voice to the Smart Home. From dishwashers to washing machines, speech understanding can provide a clear competitive advantage by differentiating your product from that of your competitors. Not only will the developer kit help differentiate your product, it will give you a head start on a future in which a personal assistant is the household standard. It also provides a path to better understand your customers, deliver cloud-based services more easily, and maintain differentiators like best-in-class hardware.
The Intel® Sound Analytic Engine platform can enable speech across a wide range of form factors by providing silicon, algorithms, and a reference-design microphone array designed to meet the user requirements for Smart Home devices. It uses Intel’s silicon-hardened Gaussian Network Accelerator (GNA) to improve cloud-based and local speech recognition, acoustic context awareness, and power reduction. This technology is available in Intel® Smart Home Developer Kits to simplify prototyping. For more information about the developer kits, check out the resources below.

Additional information

Links to:
  • How-to video
  • How-to workbook
  • Code samples
  • SmartHome.intel.com 

1 http://www.parksassociates.com/bento/shop/whitepapers/files/Parks%20Assoc%20%20Impact%20of%20Voice%20Whitepaper%202017.pdf

Get Started with IPsec Acceleration in the FD.io VPP Project


Introduction

This article looks at IPsec acceleration improvements in the FD.io VPP project based on the Data Plane Development Kit (DPDK) Cryptodev framework.

It gives a brief introduction to FD.io, VPP, DPDK, and the DPDK Cryptodev library, and shows how they are combined to provide enhanced IPsec performance and functionality.

It also shows how to install, build, configure, and run VPP with Cryptodev, and shows the type of performance gains that can be achieved.

Background

FD.io, the Fast Data Project, is an umbrella project for open source software aimed at providing high-performance networking solutions.

VPP, the Vector Packet Processing library, is one of the core projects in FD.io. VPP is an extensible framework that provides production-quality userspace switch/router functionality running on commodity CPUs.

The default implementation of VPP contains IPsec functionality, which relies on the OpenSSL library, as shown in Figure 1.

Figure 1. Default IPsec implementation in VPP

While this implementation is feature rich, it is not optimal from a performance point of view. In VPP 17.01, the IPsec implementation was extended to include support for DPDK’s Cryptodev API to provide improved performance and features.

DPDK is a userspace library for fast packet processing. It is used by VPP as one of its optional I/O layers.

The DPDK Cryptodev library provides a crypto device framework for management and provisioning of hardware and software crypto poll mode drivers, and defines generic APIs that support a number of different Crypto operations.

The extended DPDK Cryptodev IPsec Implementation architecture in VPP is shown in Figure 2.

Figure 2. DPDK Cryptodev IPsec Implementation in VPP

Enabling DPDK Cryptodev in VPP improves the performance for the most common IPsec encryption and authentication algorithms, including AES-GCM, which was not enabled in the default VPP implementation.

As such, enabling the DPDK Cryptodev feature not only increases performance but also provides access to additional options and flexibility such as:

  • Devices with Intel® QuickAssist Technology (Intel® QAT) for hardware offloading of all supported algorithms, including AES-GCM.
  • Intel® Multi-Buffer Crypto for IPsec Library for a heavily optimized software implementation.
  • Intel® Intelligent Storage Acceleration Library (Intel® ISA-L) for heavily optimized software implementation of the AES-GCM algorithm.

The Intel QuickAssist Technology-based Crypto Poll Mode Driver provides support for a range of hardware accelerator devices. For more information see the DPDK documentation for Intel QuickAssist Technology.

Building VPP with DPDK Cryptodev

To try out VPP with DPDK and Cryptodev support you can download and build VPP as follows:

		$ git clone https://gerrit.fd.io/r/vpp
		$ cd vpp
		$ git checkout v17.04
		$ vpp_uses_dpdk_cryptodev_sw=yes make build-release -j

Note that the build command line enables DPDK Cryptodev support for software-optimized libraries and the specific VPP release. The build process will automatically download and build VPP, DPDK and the required software crypto libraries.

To start VPP with DPDK Cryptodev use the following command:

$ make run-release STARTUP_CONF=/vpp_test/vpp_conf/startup.conf

The “startup.conf” path should be changed to suit the specific location in the end-user’s environment.

Testing VPP with DPDK Cryptodev

Figure 3 represents a typical test configuration.

Figure 3. Test system setup

Pktgen, shown in Figure 3, is a DPDK-based software traffic generator that is used in this configuration to test VPP IPsec.

For this example we used the following hardware configuration:

  • 1 x Intel® Core™ i7-4770K series processor @ 3.5 GHz
  • 2 x Intel® Ethernet Controller 10 Gigabit 82599ES NICs, 10G, 2 ports
  • 1 x Intel® QuickAssist Adapter 8950

The VPP “startup.conf” configuration file used in the test setup is shown below:

		unix {
		  nodaemon
		  interactive
		  exec /vpp_test/vpp_conf/ipsec.cli
		}

		cpu {
		  main-core 0
		  corelist-workers 1-2
		}


		dpdk {
		  socket-mem 1024
		  uio-driver igb_uio

		  dev 0000:06:00.0
		  {
			workers 0
		  }
		  dev 0000:06:00.1
		  {
			workers 1
		  }


		  # Option 1: Leave both options below commented out
		  # to fall back to the default OpenSSL.

		  # Option 2: Multi-Buffer Crypto Library.
		  #enable-cryptodev
		  #vdev cryptodev_aesni_mb_pmd,socket_id=0

		  # Option3: QAT hardware acceleration.
		  #enable-cryptodev
		  #dev 0000:03:01.1
		  #dev 0000:03:01.2
		}
		

The configuration file allows these configuration options:

  1. The default VPP + OpenSSL option
  2. Cryptodev with the optimized Multi-Buffer software library
  3. Cryptodev with Intel QuickAssist Technology-based hardware acceleration

The example “startup.conf” file shown above sets up the system configuration. The devices “0000:06:00.0” and “0000:06:00.1” are the network ports, and “0000:03:01.1” and “0000:03:01.2” are the Intel QuickAssist Technology-based device VFs (Virtual Functions). These should be changed to match the end-user system where the test is replicated.

The VPP “ipsec.cli” file used for testing is shown below:

		set int ip address TenGigabitEthernet6/0/0 192.168.30.30/24
		set int promiscuous on TenGigabitEthernet6/0/0
		set int ip address TenGigabitEthernet6/0/1 192.168.30.31/24
		set int promiscuous on TenGigabitEthernet6/0/1

		ipsec spd add 1
		set interface ipsec spd TenGigabitEthernet6/0/1 1
		ipsec sa add 10 spi 1000 esp tunnel-src 192.168.1.1 tunnel-dst 192.168.1.2 crypto-key 4339314b55523947594d6d3547666b45 crypto-alg aes-cbc-128 integ-key 4339314b55523947594d6d3547666b45 integ-alg sha1-96
		ipsec policy add spd 1 outbound priority 100 action protect sa 10 local-ip-range 192.168.20.0-192.168.20.255 remote-ip-range 192.168.40.0-192.168.40.255
		ipsec policy add spd 1 outbound priority 90 protocol 50 action bypass

		ip route add 192.168.40.40/32 via 192.168.1.2 TenGigabitEthernet6/0/1
		set ip arp TenGigabitEthernet6/0/1 192.168.1.2 90:e2:ba:50:8f:19

		set int state TenGigabitEthernet6/0/0 up
		set int state TenGigabitEthernet6/0/1 up
		

This file sets up the network interfaces and the IPsec configuration.

In the configuration shown above, incoming packets on interface “TenGigabitEthernet6/0/0” with the destination IP “192.168.40.40” will be encrypted using the AES-CBC-128 algorithm and authenticated using the SHA1-96 algorithm. The packets are then sent out on the “TenGigabitEthernet6/0/1” interface. Again, this configuration should be adjusted to reflect the end-user’s environment.

Using the Pktgen traffic generator configured to send packets that match the IPsec configuration above (source 192.168.20.20, destination 192.168.40.40, port 80, and packets of variable size 64, 128, 256, 512, and 1024), performance data can be generated as shown in Figure 4. Note that the data shown is for illustration purposes only; actual values will depend on the hardware configuration and software versions. The “MB” data refers to the Multi-Buffer Crypto Library, an optimized software library.

Figure 4. Example throughput vs. packet size for different configurations

The illustrative data in Figure 4 shows a significant improvement using software-optimized libraries and an even larger improvement using hardware offloading, with the throughput being capped by the line rate rather than processing power.

Conclusion

This article gives an overview of the DPDK Cryptodev framework in VPP and shows how it can be used to improve performance and functionality in the VPP IPsec implementation.

About the Author

Radu Nicolau is a network software engineer with Intel. His work is currently focused on IPsec-related development of data plane functions and libraries. His contributions include enablement of the AES-GCM crypto algorithm in the VPP IPsec stack, IKEv2 initiator support for VPP, and inline IPsec enablement in DPDK.

Solve SVD problem for sparse matrix with Intel Math Kernel Library


Introduction

Intel(R) MKL presents a new parallel solution for large sparse SVD problems. The functionality is based on a subspace iteration method combined with recent techniques for estimating eigenvalue counts [1]. The idea of the underlying algorithm is to split the interval that contains all singular values into subintervals containing close to an equal number of values, and to apply a polynomial filter on each of the subintervals in parallel. This approach makes it possible to solve problems of size up to 10^6 and enables multiple levels of parallelism.

This functionality is now available for non-commercial use as a separate package that must be linked against an installed MKL version. The current version of the package supports two functions:

 - Shared-memory SVD, which supports OpenMP parallelization.

 - Cluster SVD, which supports MPI parallelization of the independent subproblems defined by each subinterval, while matrix operations within each subinterval are OpenMP parallel.

Support for the trial package is currently limited to 64-bit Linux/Windows architectures with a C interface only. These limitations will be removed in the future.

Product overview

Problem statement: compute several of the largest/smallest singular values and the corresponding (right/left) singular vectors.

Unlike the ?gesvd and p?gesvd functionality available in Intel(R) MKL LAPACK and ScaLAPACK [2,3], the new package supports a sparse input matrix and returns only a subset of the largest/smallest singular values, which allows larger problems to be solved more efficiently. The current implementation supports an input matrix in CSR format and dense singular vectors on output. The caller must specify the number of largest/smallest singular values to find, as well as the desired tolerance of the solution. The number of singular values requested can be up to the size of the matrix; however, it is up to the caller to ensure that enough space is available to store the dense output for the singular vectors. On output, each MPI rank stores its part of the singular values and singular vectors.
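Since the package takes its input in CSR format, a small illustration may help. Below is a minimal C sketch of a 4x4 sparse matrix and its three CSR arrays (0-based indexing is assumed here, and the array names are ours):

#include <stdio.h>

/* The 4x4 matrix
 *   | 1 0 2 0 |
 *   | 0 3 0 0 |
 *   | 0 0 4 5 |
 *   | 6 0 0 7 |
 * stored in CSR form: nonzero values, their column indices,
 * and row pointers delimiting each row's entries. */
int main(void)
{
    double values[]   = {1, 2, 3, 4, 5, 6, 7};
    int    columns[]  = {0, 2, 1, 2, 3, 0, 3};
    int    rowIndex[] = {0, 2, 3, 5, 7};  /* rowIndex[i+1]-rowIndex[i] = nonzeros in row i */

    for (int i = 0; i < 4; i++)
        for (int j = rowIndex[i]; j < rowIndex[i + 1]; j++)
            printf("A[%d][%d] = %g\n", i, columns[j], values[j]);
    return 0;
}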

The chart below shows MPI scalability of the Cluster SVD solver on a randomly generated matrix of size 4*10^5 with the 1000 largest singular values requested. The chart shows that nearly linear scalability is achieved over many MPI processes.

The attached C examples demonstrate how to perform these computations.

Please let us know if you would like to obtain this evaluation package.

[1] E. Di Napoli, E. Polizzi, Y. Saad, Efficient Estimation of Eigenvalue Counts in an Interval

[2] https://software.intel.com/en-us/mkl-developer-reference-c-singular-value-decomposition-lapack-driver-routines

[3] https://software.intel.com/en-us/mkl-developer-reference-c-singular-value-decomposition-scalapack-driver-routines

 

OpenStack* Enhanced Platform Awareness: Feature Breakdown and Analysis


1 Introduction

OpenStack* Enhanced Platform Awareness (EPA) contributions from Intel and others enable fine-grained matching of workload requirements to platform capabilities. EPA features provide OpenStack with an improved understanding of the underlying platform hardware (HW), which allows it to accurately assign the workload to the best HW resource.

This paper explains the OpenStack EPA features listed in the table in section 1.3. Each feature is covered in isolation, with a brief description, configuration steps to enable the feature, and a short discussion of its benefits.

1.1 Audience and purpose

This document is intended to help readers understand the performance gains from each EPA feature in isolation. Each section has detailed information on how to configure a system to utilize the EPA feature in question.

This document focuses on EPA features available in the Newton* release. The precursor to this document can be found here.

1.2 EPA Features Covered

Feature Name | First OpenStack* Release | Description | Benefit | Performance Data

Host CPU feature request | Icehouse* | Expose host CPU features to OpenStack-managed guests | Guest can directly use CPU features instead of emulated CPU features | ~20% to ~40% improvement in guest computation

PCI passthrough | Havana* | Provide direct access to a physical or virtual PCI device | Avoid the latencies introduced by hypervisor and virtual switching layers | ~8% improvement in network throughput

HugePages* support | Kilo* | Use memory pages larger than the standard size | Fewer memory translations requiring fewer cycles | ~10% to ~20% improvement in memory access speed

NUMA awareness | Juno* | Ensures virtual CPUs (vCPUs) executing processes and the memory used by these processes are on the same NUMA node | Ensures all memory accesses are local to the node, avoiding the limited cross-node memory bandwidth, which adds latency to memory accesses | ~10% improvement in guest processing

IO based NUMA scheduling | Kilo* | Creates an affinity that associates a VM with the same NUMA nodes as the PCI device passed into the VM | Delivers optimal performance when assigning a PCI device to a guest | ~25% improvement in network throughput for smaller packets

CPU pinning | Kilo* | Supports the pinning of VMs to physical processors | Avoids the scheduling mechanism moving guest virtual CPUs to other host physical CPU cores, improving performance and determinism | ~10% to ~20% improvement in guest processing

CPU threading policies | Mitaka* | Provides control over how guests can use the host hyper-thread siblings | More fine-grained deployment of guests on HT-enabled systems | Up to ~50% improvement in guest processing

OVS-DPDK, neutron | Liberty* | An industry-standard virtual switch accelerated by DPDK | Accelerated virtual switching | ~900% throughput improvement

2 Test Configuration

Following is an overview of the environment that was used for testing the EPA features covered in this document.

2.1 Deployment

Several OpenStack deployment tools are available. Devstack, which is essentially a script used to configure and deploy each OpenStack service, was used to demonstrate the EPA features in this document. Devstack uses a single configuration file to determine the functionality of each node in your OpenStack cluster, and it modifies each OpenStack service configuration file to reflect the user's requirements defined in that file.
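For reference, Devstack reads its settings from a local.conf file; a minimal sketch, where the IP address and password values are placeholders:

[[local|localrc]]
HOST_IP=192.168.0.10
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD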

To avoid dependency on a particular OpenStack deployment tool, this document identifies the OpenStack configuration file that must be modified for each service.

2.2 Topology

Figure 1: Network topology

2.3 Hardware

Item | Description | Notes

Platform | Intel® Server System R1000WT Family |
Form factor | 1U Rack |
Processor(s) | Intel® Xeon® CPU E5-2699 v4 @ 2.20GHz | 55MB cache with Hyper-Threading enabled
Cores | 44 physical cores/CPU | 44 hyper-threaded cores per CPU for 88 total cores
Memory | 132G RAM | DDR4 2133
NICs | 2 x Intel® Ethernet Controller 10 Gigabit 82599 |
BIOS | SE5C610.86B.01.01.0019.101220160604 | Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) and Hyper-Threading enabled

2.4 Software

Item | Description | Notes

Host OS | Ubuntu 16.04.1 LTS | 4.2.0 kernel
Hypervisor | Libvirt 3.1/QEMU 2.5.0 |
Orchestration | OpenStack (Newton release) |
Virtual switch | Open vSwitch 2.5.0 |
Data plane development kit | DPDK 16.07 |
Guest OS | Ubuntu 16.04.1 LTS | 4.2.0 kernel

2.5 Traffic generator

An Ixia XG-12 traffic generator was used to generate the networking workload for some of the tests described in this document. To simulate a worst-case scenario from a networking perspective, 64-byte packets are used.

3 Host CPU Feature Request

This feature allows the user to expose a specific host CPU instruction set to a guest. Instead of the hypervisor emulating the CPU instruction set, the guest can directly access the host's CPU feature. While there are many host CPU features available, the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) instruction set is used in this example.

One sample use case would be a security application requiring a high level of cryptographic performance. This could be instrumented to leverage specific instructions such as Intel® AES-NI.

The following steps detail how to configure the host CPU feature request for this use case.

3.1 Configure the compute node

3.1.1 System configuration

Before a specific CPU feature is requested, the availability of the CPU instruction set should be checked using the cpuid instruction.
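For example, to confirm that Intel AES-NI is available on the host (the flag name follows the Linux cpuinfo convention):

# "aes" in the CPU flags indicates Intel AES-NI support
grep -m1 -o aes /proc/cpuinfo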

3.1.2 Configure libvirt driver

The Nova* libvirt driver takes its configuration information from a section in the main Nova file /etc/nova/nova.conf. This allows for customization of certain Nova libvirt driver functionality.

For example:

[libvirt]
...
cpu_mode = host-model
virt_type = kvm

The cpu_mode option in /etc/nova/nova.conf can take one of the following values: none, host-passthrough, host-model, and custom.

host-model

Libvirt identifies the CPU model in the /usr/share/libvirt/cpu_map.xml file that most closely matches the host, and requests additional CPU flags to complete the match. This configuration provides the maximum functionality and performance, and maintains good reliability and compatibility if the guest is migrated to another host with slightly different host CPUs.

host-passthrough

Libvirt tells KVM to pass through the host CPU with no modifications. The difference between host-passthrough and host-model is that, instead of just matching feature flags, every last detail of the host CPU is matched. This gives the best performance, and can be important to some apps that check low-level CPU details, but it comes at a cost with respect to migration. The guest can only be migrated to a matching host CPU.

custom

You can explicitly specify one of the supported named models using the cpu_model configuration option.
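For example, a minimal sketch of the custom mode in /etc/nova/nova.conf; the model name Haswell is illustrative, and any model from the libvirt cpu_map.xml file can be used:

[libvirt]
cpu_mode = custom
cpu_model = Haswell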

3.2 Configure the Controller node

3.2.1 Enable the compute capabilities filter in Nova*

The Nova scheduler is responsible for deciding which compute node can satisfy the requirements of your guest. It does this using a set of filters; to enable this feature, simply add the compute capability filter.

During the scheduling phase, the ComputeCapabilitiesFilter in Nova compares the CPU features requested by the guest with the compute node's CPU capabilities. This ensures that the guest is scheduled on a compute node that satisfies the guest’s CPU feature request.

Nova filters are configured in /etc/nova/nova.conf

scheduler_default_filters = ...,ComputeCapabilitiesFilter,...

3.2.2 Create a Nova flavor that requests the Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) for a VM

openstack flavor set <FLAVOR> --property hw:capabilities:cpu_info:features=aes <GUEST>

3.2.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST>

3.2.4 Performance benefit

This feature gives the guest direct access to a host CPU feature instead of the guest using an emulated CPU feature. This feature can deliver a double digit performance improvement, depending on the size of data buffer being used.

To demonstrate the benefit of this feature, a crypto workload (openssl speed -evp aes-256-cbc) is executed on guest A, which has not requested a host CPU feature, and on guest B, which has requested the host Intel AES-NI CPU feature. Guest A will use an emulated CPU feature, while guest B will use the host's CPU feature.

Figure 2: CPU feature request comparison

4 Sharing Host PCI Device with a Guest

In most cases the guest will require some form of network connectivity. To do this, OpenStack needs to create and configure a network interface card (NIC) for guest use. There are several methods of doing this. The one you choose depends on your cloud requirements. The table below highlights each option and their respective pros and cons.

 | NIC Emulation | PCI Passthrough (PF) | SR-IOV (VF)

Overview | Hypervisor fully emulates the PCI device | The full PCI device is allocated to the guest | A PCI device VF is allocated to the guest

Guest sharing | Yes | No | Yes

Guest IO performance | Slow | Fast | Fast

Device emulation is performed by the hypervisor, which has an obvious overhead. This overhead is worthwhile as long as the device needs to be shared by multiple guest operating systems. If sharing is not necessary, there are more efficient methods for exposing the device to a guest.

Figure 3: Host to guest communication methods

The PCI passthrough feature in OpenStack gives the guest full access and control of a physical PCI device. This mechanism can be used on any kind of PCI device, NIC, graphics processing unit (GPU), HW crypto accelerator (QAT), or any other device that can be attached to a PCI bus.

An example use case for this feature would be to pass a PCI network interface to a guest, avoiding the latencies introduced by hypervisor and virtual switching layers. Instead, the guest will use the PCI device directly.

When a full PCI device is assigned to a guest, the hypervisor detaches the PCI device from the host OS and assigns it to the guest, which means the PCI device is no longer available to the host OS. A downside of PCI passthrough is that the full physical device is assigned to only one guest and cannot be shared, and guest migration is not currently supported.

4.1 Configure the compute node

4.1.1 System configuration

Enable VT-d in BIOS.

Add “intel_iommu=on” to the kernel boot line to enable IOMMU support in the kernel.

Edit this file: /etc/default/grub

GRUB_CMDLINE_LINUX="intel_iommu=on"
sudo update-grub
sudo reboot

To verify VT-d/IOMMU is enabled on your system:

sudo dmesg | grep IOMMU
[    0.000000] DMAR: IOMMU enabled
[    0.133339] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.133340] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.133341] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1

4.1.2 Configure your PCI whitelist

OpenStack uses a PCI whitelist to define which PCI devices are available to guests. There are several ways to define your PCI whitelist; here is one method.

The Nova PCI whitelist is configured in: /etc/nova/nova.conf

[default]
pci_passthrough_whitelist={"address":"0000:02:00.1","vendor_id":"8086","physical_network":"default"}

4.1.3 Configure the PCI alias

Starting with the Newton release, you also need to configure the PCI alias on the compute node. This enables resizing a guest that has been allocated a PCI device.

Get the vendor and product ID of the PCI device:

sudo ethtool -i ens513f1 | grep bus-info
bus-info: 0000:02:00.1

sudo lspci -n | grep 02:00.1
02:00.1 0200: 8086:10fb (rev 01)

Nova PCI alias tags are configured in: /etc/nova/nova.conf

[default]
pci_alias = {"vendor_id":"8086","product_id":"10fb","device_type":"type-PF", "name":"nic" }

NOTE: To pass through a complete PCI device, you need to explicitly request a physical function in the pci_alias by setting the device_type = type-PF.

4.2 Configure the Controller Node

Nova scheduler is responsible for deciding which compute node can satisfy the requirements of your guest. It does this using a set of filters; to enable this feature add the PCI passthrough filter.

4.2.1 Enable the PCI passthrough filter in Nova

During the scheduling phase, the Nova PciPassthroughFilter filters compute nodes based on PCI devices they expose to the guest. This ensures that the guest is scheduled on a compute node that satisfies the guest’s PCI device request.

Nova filters are configured in: /etc/nova/nova.conf

scheduler_default_filters = ...,ComputeFilter,PciPassthroughFilter,...

NOTE: If you make changes to the nova.conf file on a running system, you will need to restart the Nova scheduler and Nova compute services.

4.2.2 Configure your PCI device alias

To make the requesting of a PCI device easier you can assign an alias to the PCI device. Define the PCI device information with an alias tag and then reference the alias tag in the Nova flavor.

Nova PCI alias tags are configured in: /etc/nova/nova.conf

Use the PCI device vendor and product ID obtained from step 4.1.3:

[default]
pci_alias = {"vendor_id":"8086","product_id":"10fb","device_type":"type-PF", "name":"nic" }

NOTE: To pass through a complete PCI device you must explicitly request a physical function in the pci_alias by setting the device_type = type-PF.

Modify Nova flavor

If you request a PCI passthrough for the guest, you also need to define a non-uniform memory access (NUMA) topology for the guest.

openstack flavor set <FLAVOR> --property  "pci_passthrough:alias"="nic:1"
openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0
openstack flavor set <FLAVOR> --property  hw:numa_mem.0=2048

Here, an existing flavor is modified to define a guest with a single NUMA node, one vCPU and 2G of RAM, and a single PCI physical device. You can create a new flavor if you need one.

4.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST>

4.4 Performance benefit

This feature allows a PCI device to be directly attached to the guest, removing the overhead of the hypervisor and virtual switching layers, delivering a single digit gain in throughput.

To demonstrate the benefit of this feature, the conventional path a packet takes via the hypervisor and virtual switch is compared with the optimal path, bypassing the hypervisor and virtual switch layers.

Using these test scenarios, iperf3 is used to measure the throughput, and ping (ICMP) to measure the latencies for each scenario.

Figure 4: Guest PCI device throughput

Figure 5: Guest PCI device latency

4.5 PCI virtual function passthrough

The preceding section covered the passing of a physical PCI device to the guest. This section covers passing a virtual function to the guest.

Single root input output virtualization (SR-IOV) is a specification that allows a single PCI device to appear as multiple PCI devices. SR-IOV can virtualize a single PCIe Ethernet controller (NIC) to appear as multiple Ethernet controllers. You can directly assign each virtual NIC to a virtual machine (VM), bypassing the hypervisor and virtual switch layer. As a result, users are able to achieve low latency and near-line rate speeds. Of course, the total bandwidth of the physical PCI device will be shared between all allocated virtual functions.

The physical PCI device is referred to as the physical function (PF) and a virtual PCI device is referred to as a virtual function (VF). Virtual functions are lightweight functions that lack configuration resources.

The major benefit of this feature is that it makes it possible to run a large number of virtual machines per PCI device, which reduces the need for hardware and the resultant costs of space and power required by hardware devices.

4.6 Configure the Compute node

4.6.1 System configuration

Enable VT-d in BIOS.

Add “intel_iommu=on” to the kernel boot line to enable IOMMU support in the kernel. Edit this file: /etc/default/grub

GRUB_CMDLINE_LINUX="intel_iommu=on"
sudo update-grub
sudo reboot

To verify that VT-d/IOMMU is enabled on your system, execute the following command:

sudo dmesg | grep IOMMU
[    0.000000] DMAR: IOMMU enabled
[    0.133339] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.133340] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.133341] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1

4.6.2 Enable SR-IOV on a PCI device

There are several ways to enable a SR-IOV on a PCI device. Here is a method to enable a single virtual function on a PCI Ethernet controller (ens803f1):

sudo su -c "echo 1 > /sys/class/net/ens803f1/device/sriov_numvfs"
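You can then verify that the VF was created; for example:

# The new VF appears as an additional PCI function
lspci | grep -i "Virtual Function"

# The VF is also listed under the parent device
ls /sys/class/net/ens803f1/device/virtfn*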

4.6.3 Configure your PCI whitelist

OpenStack uses a PCI whitelist to define which PCI devices are available to guests. There are several ways to define your PCI whitelist; here is one method.

The Nova PCI whitelist is configured in: /etc/nova/nova.conf

[default]
pci_passthrough_whitelist= {"address":"0000:02:10.1","vendor_id":"8086","physical_network":"default"}
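The address and vendor ID values used in the whitelist can be read straight from lspci. A sketch for the device above; the output line is illustrative, not captured from a real system:

lspci -nn -s 0000:02:10.1
02:10.1 Ethernet controller [0200]: Intel Corporation 82599 Ethernet Controller Virtual Function [8086:10ed] (rev 01)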

4.6.4 Configure your PCI device alias

See section 3.2.2 for PCI device alias configuration.

NOTE: To pass through a virtual PCI device you just need to add the vendor and product ID for the device. If you use the PF PCI address, all associated VFs will be exposed to Nova.

4.7 Configure the controller node

4.7.1 Enable the PCI passthrough filter in Nova

Follow the steps described in section 4.2.1.

4.7.2 Configure your PCI device alias

To make requesting a PCI device easier, you can assign an alias to the PCI device: define the PCI device information with an alias tag, and then reference the alias tag in the Nova flavor.

Nova PCI alias tags are configured in: /etc/nova/nova.conf

Use the PCI info obtained in step 4.1.3:

[default]
pci_alias = {"vendor_id":"8086", "product_id":"10ed", "name":"nic"}

NOTE: To pass through a virtual PCI device (VF) you just need to add the vendor and product ID of the VF.

Modify Nova flavor

If you request PCI passthrough for the guest, you also need to define a NUMA topology for the guest.

openstack flavor set <FLAVOR> --property  "pci_passthrough:alias"="nic:1"
openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0
openstack flavor set <FLAVOR> --property  hw:numa_mem.0=2048

Here, an existing flavor is modified to define a guest with a single NUMA node, one vCPU and 2G of RAM, and a single PCI physical device. You can create a new flavor if you need to.

4.7.3 Boot guest with modified flavor

openstack server create --image <IMAGE> --flavor <FLAVOR> <GUEST-NAME>

5 Hugepage Support

5.1 Description

When a process uses memory, the CPU marks that RAM as used by the process. By default this memory is divided into 4KB chunks, called pages. The CPU and operating system must track where in memory these pages are and to which process they belong. When a process uses large amounts of memory, these lookups can take a lot of time; this is where hugepages come in. Depending on the processor, two huge page sizes can be used on the x86_64 architecture: 2MB or 1GB. Using these larger page sizes makes lookups much quicker.
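To see which huge page size is configured and how many pages are currently in the pool on a host, the kernel exposes counters in /proc/meminfo; a quick check, independent of OpenStack:

grep Huge /proc/meminfo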

To show the value of hugepages in OpenStack, the sysbench benchmark suite is used along with two VMs: one with 2MB hugepages and one with regular 4KB pages.

5.2 Configuration

5.2.1 Compute host

First, enable hugepages on the compute host:

sudo mkdir -p /mnt/huge
sudo mount -t hugetlbfs nodev /mnt/huge
sudo sh -c "echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"

If 1GB hugepages are needed it is necessary to configure this at boot time through the GRUB command line. It is also possible to set 2MB hugepages at this stage.

GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=8"

Enable hugepages to work with KVM/QEMU and libvirt. First, edit the line in the qemu-kvm file:

vi /etc/default/qemu-kvm

# Edit the line to match the line below
KVM_HUGEPAGES=1

Now, tell libvirt where the hugepage table is mounted, and edit the security driver for libvirt. Add the hugetlbfs mount point to the cgroup_device_acl list:

vi /etc/libvirt/qemu.conf

security_driver = "none"
hugetlbfs_mount = "/mnt/huge"

cgroup_device_acl = [
  "/dev/null", "/dev/full", "/dev/zero", "/dev/random", "/dev/urandom",
  "/dev/ptmx", "/dev/kvm", "/dev/kqemu", "/dev/rtc", "/dev/hpet",
  "/dev/net/tun", "/dev/vfio/vfio", "/mnt/huge"
]

Now restart libvirt-bin and the Nova compute service:

sudo service libvirt-bin restart
sudo service nova-compute restart

5.2.2 Test

Create a flavor that utilizes hugepages. For this benchmarking work, 2MB pages are utilized.

On the Controller node, run:

openstack flavor create hugepage_flavor --ram 4096 --disk 100 --vcpus 4
openstack flavor set hugepage_flavor --property hw:mem_page_size=2MB

To test hugepages, a VM running with regular 4KB pages and one using 2MB pages are needed. First, create the 2MB hugepage VM:

On the Controller node, run:

openstack server create --image ubu160410G --flavor hugepage_flavor \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 hugepage_vm

Now simply alter the above statement to create a 4KB page VM:

openstack server create --image ubu160410G --flavor smallpages \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 default_vm

Sysbench was used to benchmark the benefit of using hugepages within a VM. Sysbench is a benchmarking tool with multiple modes of operation, including CPU, memory, filesystem, and more. The memory mode is utilized to benchmark these VMs.

Install sysbench on both VMs:

sudo apt install sysbench

Run the command to benchmark memory:

sysbench --test=memory --memory-block-size=<BLOCK_SIZE> \
  --memory-total-size=<TOTAL_SIZE> run

An example using a 100MB block size and a 50GB total transfer size:

sysbench --test=memory --memory-block-size=100M --memory-total-size=50G run

5.3 Performance benefit

The graphs below show that there is a significant increase in performance when using 2MB hugepages instead of the default 4KB memory pages, for this specific benchmark.

Figure 6: Hugepage time comparison

Figure 7: Hugepage operations per second comparison

6 NUMA Awareness

6.1 Description

NUMA, or non-uniform memory access, describes a system with more than one system bus. CPU resources and memory resources are grouped together into NUMA nodes. Communication between a CPU and memory within the same NUMA node is much faster than access to memory on a remote node.

To show the benefits of using NUMA awareness within VMs, sysbench is used.
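Before creating the flavor, it can be useful to inspect the host's NUMA layout so that the flavor's CPU and memory values match a real node. A sketch using standard tools (assuming the numactl package is installed):

lscpu | grep NUMA
numactl --hardware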

6.2 Configuration

First, create a flavor that has the NUMA awareness property.

On the Controller node, run:

openstack flavor create numa_aware_flavor --vcpus 4 --disk 20 --ram 4096

openstack flavor set numa_aware_flavor --property hw:numa_mempolicy=strict \
  --property hw:numa_cpus.0=0,1,2,3 --property hw:numa_mem.0=4096

Create two VMs, one which is NUMA-aware and one which is not NUMA-aware.

On the Controller node, run:

openstack server create --image ubu160410G --flavor numa_aware_flavor \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 numa_aware_vm

openstack server create --image ubu160410G --flavor default \
  --nic net-id=e203cb1e-988f-4bb5-bbd1-54fb34783e02 \
  --availability-zone nova::silpixa00395293 default_vm

The threads mode and the memory mode of sysbench are utilized in order to benchmark these VMs.

To install sysbench, log into the VMs created above, and run the following command:

sudo apt install sysbench

From the VMs run the following commands.

The command to benchmark threads is:

sysbench --test=threads --num-threads=256 --thread-yields=10000 \
  --thread-locks=128 run

The command to benchmark memory is:

sysbench --test=memory --memory-block-size=1K --memory-total-size=50G run

6.3 Performance benefit

The graph below shows that there is an increase in both thread and memory performance when the NUMA awareness property is set, using the benchmarks described above.

Figure 8: NUMA awareness benchmarks

7 I/O Based NUMA Scheduling

The NUMA awareness feature described in section 6 details how to request a guest NUMA topology that matches the host NUMA topology. This ensures that all memory accesses are local to the NUMA node and do not consume the very limited cross-node memory bandwidth, which adds latency to memory accesses.

However, this configuration does not take into consideration the locality of the I/O device providing data to the guest processing cores. For example, if guest vCPU cores are assigned to a particular NUMA node but the NIC transferring the data is local to another NUMA node, the result is reduced application performance.

Figure 9: Guest NUMA placement considerations

The above diagram highlights two guest placement configurations. In the good placement configuration, the guest's assigned PCI device, RAM allocation, and physical CPU (pCPU) cores are all associated with the same NUMA node, ensuring that there is no cross-NUMA-node memory traffic.
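To check which NUMA node a given NIC is local to, query its sysfs entry; a quick check, assuming the interface name ens803f1 used earlier (a value of -1 means the platform reports no locality):

cat /sys/class/net/ens803f1/device/numa_node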

The configuration for this feature is similar to the configuration for PCI passthrough, described in sections 3.1 and 3.2.

NOTE: In section 3 a single NUMA node is requested for the guest, and its vCPU is bound to host NUMA node 0:

openstack flavor set <FLAVOR> --property  hw:numa_nodes=1
openstack flavor set <FLAVOR> --property  hw:numa_cpus.0=0

If the platform has only one PCI device and it is associated with NUMA node 1, the guest will fail to boot.

7.1 Benefit

The advantage of this feature is that the guest PCI device and pCPU cores are all associated with the same NUMA node, avoiding cross-node memory traffic. This can deliver a significant improvement in network throughput, especially for smaller packets.

To demonstrate the benefit of this feature, the network throughput of guest A, which uses a PCI NIC associated with its local NUMA node, is compared with that of guest B, which uses a PCI NIC associated with a remote NUMA node.

Figure 10: NUMA awareness throughput comparison

8 Configure Open vSwitch

8.1 Description

Vanilla Open vSwitch (OVS) is the default virtual switch used when deploying OpenStack.

OVS comes as standard in most, if not all, OpenStack deployment tools such as Mirantis Fuel* and OpenStack Devstack.

8.1.1 Configure the controller node

Devstack deploys OpenStack based on a local.conf file. The details required in the local.conf file will change based on your system, but an example for the Controller node is shown below:

OVS_LOG_DIR=/opt/stack/logs
OVS_BRIDGE_MAPPINGS="default:<bridge-name>"
PUBLIC_BRIDGE=br-ex

8.1.2 Configure the compute node

The parameters required for the Compute node are almost identical. Simply remove the PUBLIC_BRIDGE parameter:

OVS_LOG_DIR=/opt/stack/logs
OVS_BRIDGE_MAPPINGS="default:<bridge-name>"

To test vanilla OVS, create a VM on the Compute Host and use a traffic generator to send traffic to the VM, have it sent back out through the host to the generator, and note the throughput.

The VM requires two networks to be connected to it in order for traffic to be sent up and then come back down. By default, OpenStack creates a single network which is usable by VMs on the host; that is, private.

Create a second network and subnet for the second NIC, and attach the subnet to the preexisting router:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range 11.0.0.0/24
openstack router add subnet router1 subnet2

When that is done create the VM:

openstack server create --image ubu160410G --flavor m1.small \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  vm_name

Now, configure the system to forward packets from the packet generator through the VM and back to the Ixia* traffic generator.

The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks.

8.2 OVS-DPDK

8.2.1 Description

OVS-DPDK is used to see how much of a performance increase can be gained over vanilla OVS. To utilize OVS-DPDK you will need to set it up. In this case, OpenStack Devstack is used; changing from vanilla OVS to OVS-DPDK requires changing some parameters within the local.conf file and restacking the node.

For this test, send traffic from the Ixia traffic generator through the VM hosted on the Compute node and back to Ixia. In this test case, OVS-DPDK only needs to be set up on the Compute node.

8.2.2 Configure the compute node

Within the local.conf file add the specific parameters as shown below:

OVS_DPDK_MODE=compute
OVS_NUM_HUGEPAGES=<num-hugepages>
OVS_DATAPATH_TYPE=netdev
#Create the OVS Openstack management bridge and give it a name
OVS_BRIDGE_MAPPINGS="default:<bridge>"
#Select the interfaces you wish to be handled by DPDK
OVS_DPDK_PORT_MAPPINGS="<interface>:<bridge>","<interface2>:<bridge>"

Now restack the Compute node. Once that is complete, the setup can be benchmarked.

8.2.3 Configure the controller node

You need two networks connected to the VM in order for traffic to be sent up and then come back down. By default, OpenStack creates a single network that is usable by VMs on the host; that is, private. Create a second network and subnet for the second NIC, and attach the subnet to the preexisting router.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range 11.0.0.0/24
openstack router add subnet router1 subnet2

Once that is done, create the VM. To utilize DPDK the VM must use hugepages. Details on how to set up your Compute node to use hugepages are given in the “Hugepage Support” section.

On the Controller node run:

openstack server create --image ubu160410G --flavor hugepage_flavor \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  vm_name

Now configure the system to forward packets from Ixia through the VMs and back to Ixia. The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks.

Once that is complete, run traffic through the host.

8.3 Performance Benefits

The graph below highlights the performance gain when using a DPDK accelerated OVS.

data graphic
Figure 11: Virtual switch throughput comparison

9 CPU Pinning

9.1 Description

CPU pinning allows a VM to be pinned to specific CPUs so that it is not moved around by the kernel scheduler. This increases the performance of the VM while the host is under heavy load: its processes are not moved from CPU to CPU, and instead run on the pinned CPUs.
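Once a pinned guest is running, the vCPU-to-pCPU mapping can be inspected from the compute host with libvirt; a sketch, where the instance name (for example, instance-00000001) comes from the first command:

virsh list
virsh vcpupin <instance_name>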

9.2 Configuration

There are two ways to use this feature in Newton, either by editing the properties of a flavor, or editing the properties of an image file. Both are shown below.

openstack flavor set <FLAVOR_NAME> --property hw:cpu_policy=dedicated
openstack image set <IMAGE_ID> --property hw_cpu_policy=dedicated

For the following test, the Ixia traffic generator is connected to the Compute Host. Two VMs, each with two vNICs, are needed: one with core pinning enabled and one with it disabled. Two separate flavors are created, with the only difference being the cpu_policy.

On the Controller node run:

openstack flavor create un_pinned --ram 4096 --disk 20 --vcpus 4
openstack flavor create pinned --ram 4096 --disk 20 --vcpus 4

There is no need to change the policy for the unpinned flavor as the default cpu_policy is ‘shared’. For the pinned flavor set the cpu_policy to ‘dedicated’.

On the Controller node run:

openstack flavor set pinned --property hw:cpu_policy=dedicated

Create a network and subnet for the second NIC and attach the subnet to the preexisting router.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range 11.0.0.0/24
openstack router add subnet router1 subnet2

Once this is complete, create two VMs: one with core pinning enabled and one without.

On the Controller node, run:

openstack server create --image ubu160410G --flavor pinned \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  pinnedvm

openstack server create --image ubu160410G --flavor un_pinned \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  defaultvm

Now, configure the system to forward packets from Ixia through the VMs and back to Ixia. The setup for this is explained in detail in the section below, called Configure packet forwarding test with two virtual networks. Send traffic through both VMs while the host is idle and also while it is under stress, and graph the results. Use the Linux* ‘stress’ command. To do this, install stress on the Compute Host.

On Ubuntu simply run:

sudo apt-get install stress

The test run command in this benchmark is shown here:

stress --cpu 56 --io 4 --vm 2 --vm-bytes 128M --timeout 60s&

9.3 Performance benefit

The graph below highlights the performance gain when using the CPU pinning feature.

Figure 12: CPU pinning throughput comparison

10 CPU Thread Policies

10.1 Description

CPU thread policies work with CPU pinning to ensure that the performance of your VM is maximized. The isolate thread policy allocates entire physical cores for use by a VM. While CPU pinning alone may allow Intel® Hyper-Threading Technology (Intel® HT Technology) siblings to be used by different processes, the isolate policy ensures that this cannot happen; it also ensures that no more than one process tries to access a physical core at a time. As with the CPU pinning benchmark, start by creating a new OpenStack flavor.
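To see which logical CPUs on the host are Intel® HT Technology siblings (and would therefore be reserved together by the isolate policy), the CPU topology can be inspected with standard tools; a quick check:

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
lscpu --extended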

10.2 Configuration

On the Controller node, run:

openstack flavor create pinned_with_thread --ram 4096 --disk 20 --vcpus 4

Thread policies were created to work with CPU pinning, so add both CPU pinning and thread policies to this flavor.

On the Controller node, run:

openstack flavor set pinned_with_thread --property hw:cpu_policy=dedicated \
  --property hw:cpu_thread_policy=isolate

As is the case in the Pinning benchmark above, a second private network is needed to test this feature.

On the Controller node, run:

openstack network create private2 --availability-zone nova
openstack subnet create subnet2 --network private2 --subnet-range 11.0.0.0/24
openstack router add subnet router1 subnet2

Now create the VM. This VM is benchmarked against the two VMs created in the previous section.

On the Controller node, run:

openstack server create --image ubu160410G --flavor pinned_with_thread \
  --nic net-id=<private_net_id> --nic net-id=<private2_net_id> \
  --security-group default --availability-zone nova::<compute_hostname> \
  pinned_thread_vm

Ensure that the system is configured to forward traffic from Ixia through the VM and back to Ixia; read the section Configure packet forwarding test with two virtual networks. Send traffic through the VM while the host is idle and while it is under stress. Use the Linux ‘stress’ command. To do this, install stress on the Compute Host:

sudo apt-get install stress

On the Compute Host run the following command:

stress --cpu 56 --io 4 --vm 2 --vm-bytes 128M --timeout 60s&

10.3 Performance benefit

The graph below shows that a pinned VM actually performs slightly better than the other VMs while the system is unstressed. However, when the system is stressed there is a large increase in performance for the thread isolated VM over the pinned VM.

Figure 13: CPU thread policy throughput comparison

11 Appendix

This section details some lessons learned while working on this paper.

11.1 Configure packet forwarding test with two virtual networks

This section details the setup for testing throughput in a VM. Here we use standard OVS and L2 forwarding in the VM.

data graphic
Figure 14: Packet forwarding test topology

11.1.2 Host configuration

Standard OVS deployed by OpenStack uses two bridges: a physical bridge (br-ens787f1) to plug the physical NICs into, and an integration bridge (br-int) that the VM VNICs get plugged into.

Plug in physical NICs:

sudo ovs-vsctl add-port br-ens787f1 ens803f1
sudo ovs-vsctl add-port br-ens787f1 ens803f0

Modify the rules on the integration bridge to allow traffic to and from the VM.

First, find out the port numbering in OVS ports on the integration bridge:

sudo ovs-ofctl show br-int
1(int-br-ens787f1): addr:12:36:84:3b:d3:7e
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max
 4(qvobf529352-2e): addr:b6:34:b5:bf:73:40
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 5(qvo70aa7875-b0): addr:92:96:06:8b:fe:b9
     config:     0
     state:      0
     current:    10GB-FD COPPER
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-int): addr:5a:c9:6e:f8:3a:40
     config:     0
     state:      0
     speed: 0 Mbps now, 0 Mbps max

There are, however, issues with the default setup. If you attempt to pass heavy traffic up to the VM and back down to the host through the same connection, OVS may hit an error that can crash your system. To overcome this you will need to add a second connection from the integration bridge to the physical bridge.

Traffic going to the VM:

sudo ovs-ofctl add-flow br-int priority=10,in_port=1,actions=output:4

Traffic coming from the VM:

sudo ovs-ofctl add-flow br-int priority=10,in_port=5,actions=output:1

11.1.3 VM configuration

First, make sure there are two NICs up and running. This can be done manually or persistently by editing this file: /etc/network/interfaces

auto ens3
iface ens3 inet dhcp

auto ens4
iface ens4 inet dhcp

Then restart the network:

/etc/init.d/networking restart

Following this step there should be two running NICs in the VM.

By default, a system's routing table has just one default gateway: whichever NIC came up first. To access both VM networks from the host, remove the default gateway. (It is possible to add a second routing table instead, but removing the gateway is the easiest and quickest way.) A downside of this approach is that you will not be able to communicate with the VM from the host, so use Horizon* for the remaining steps.
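A minimal sketch of removing the default route inside the VM, assuming it currently points out of ens3:

ip route show
sudo ip route del default dev ens3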

Now, forward the traffic coming in on one NIC to the other NIC. L2 bridging is used for this:

ifconfig ens3 0.0.0.0
ifconfig ens4 0.0.0.0

brctl addbr br0
brctl stp br0 on
brctl addif br0 ens3
brctl addif br0 ens4
brctl show

ifconfig ens3 up
ifconfig ens4 up
ifconfig br0 up

The two VM NICs should now be added to br0.

11.1.4 Ixia configuration

There are two ports on the traffic generator. Let’s call them C10P3 and C10P4.

On C10P3, configure the source and destination MAC and IP:

SRC: 00:00:00:00:00:11, DST: 00:00:00:00:00:10
SRC: 11.0.0.100, DST: 10.0.0.100

On C10P4, configure the source and destination MAC and IP:

SRC: 00:00:00:00:00:10, DST: 00:00:00:00:00:11
SRC: 10.0.0.100, DST: 11.0.0.100

As a VLAN network is being used here, VLAN tags must be configured. Set them to the tags OpenStack has assigned; in this case, 1208.

Once these steps are complete you can start sending packets to the host, and you can verify that VM traffic is hitting the rules on the integration bridge by running the command:

watch -d sudo ovs-ofctl dump-flows br-int

You should see the packets received and packets sent increase on their respective flows.

11.2 AppArmor* issue

AppArmor* has many security features that may require additional configuration. One such issue is that if you attempt to allocate hugepages to a VM, AppArmor will cause libvirtd to give a permission-denied message. To get around this, edit the qemu.conf file and change the security driver field as follows:

vi /etc/libvirt/qemu.conf

security_driver = "none"

11.3 Share host ssh public keys with the guest for direct access

openstack keypair create --public-key ~/.ssh/id_rsa.pub mykey
openstack keypair list

11.4 Add rules to default security group for icmp and ssh access to guests

openstack security group rule create --protocol icmp --ingress default
openstack security group rule create --protocol tcp --dst-port 22 \
  --ingress default
openstack security group list
openstack security group show default

11.5 Boot a VM

openstack image list
openstack flavor list
openstack keypair list
openstack server create --image ubuntu1604 --flavor R4D6C4 \
  --security-group default --key-name mykey vm1
openstack server list

11.6 Resize an image filesystem

sudo apt install libguestfs-tools

View your image partitions:

sudo virt-filesystems --long -h --all -a ubuntu1604-5G.qcow2
Name       Type        VFS      Label  MBR  Size  Parent
/dev/sda1  filesystem  ext4     -      -    3.0G  -
/dev/sda2  filesystem  unknown  -      -    1.0K  -
/dev/sda5  filesystem  swap     -      -    2.0G  -
/dev/sda1  partition   -        -      83   3.0G  /dev/sda
/dev/sda2  partition   -        -      05   1.0K  /dev/sda
/dev/sda5  partition   -        -      82   2.0G  /dev/sda
/dev/sda   device      -        -      -    5.0G

Here’s how to expand /dev/sda1:

Create a 10G image template:

sudo truncate -r ubuntu1604-5G.qcow2 ubuntu1604-10G.qcow2

Extend the 5G image by 5G:

sudo truncate -s +5G ubuntu1604-10G.qcow2

Resize the 5G image to the 10G image template:

sudo virt-resize --expand /dev/sda1 /home/tester/ubuntu1604-5G.qcow2 \
  /home/tester/ubuntu1604-10G.qcow2

11.7 Expand the filesystem of a running Ubuntu* image

11.7.1 Delete existing partitions

sudo fdisk /dev/sda

Command (m for help): p

Disk /dev/sda: 268.4 GB, 268435456000 bytes
255 heads, 63 sectors/track, 32635 cylinders, total 524288000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e49fa

   Device Boot  	Start     	End  	Blocks   Id  System
/dev/sda1   *    	2048   192940031	96468992   83  Linux
/dev/sda2   	192942078   209713151 	8385537	5  Extended
/dev/sda5   	192942080   209713151 	8385536   82  Linux swap / Solaris

Command (m for help): d
Partition number (1-5): 1

Command (m for help): d
Partition number (1-5): 2

11.7.2 Create new partitions

Command (m for help): n
Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): p
Partition number (1-4, default 1):
Using default value 1
First sector (2048-524287999, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-524287999, default 524287999): 507516925

Command (m for help): n
Partition type:
   p   primary (1 primary, 0 extended, 3 free)
   e   extended
Select (default p): e
Partition number (1-4, default 2): 2
First sector (507516926-524287999, default 507516926):
Using default value 507516926
Last sector, +sectors or +size{K,M,G} (507516926-524287999, default 524287999):
Using default value 524287999

Command (m for help): n
Partition type:
   p   primary (1 primary, 1 extended, 2 free)
   l   logical (numbered from 5)
Select (default p): l
Adding logical partition 5
First sector (507518974-524287999, default 507518974):
Using default value 507518974
Last sector, +sectors or +size{K,M,G} (507518974-524287999, default 524287999):
Using default value 524287999

11.7.3 Change logical partition to SWAP

Command (m for help): t
Partition number (1-5): 5

Hex code (type L to list codes): 82
Changed system type of partition 5 to 82 (Linux swap / Solaris)

11.7.4 View new partitions

Command (m for help): p

Disk /dev/sda: 268.4 GB, 268435456000 bytes
255 heads, 63 sectors/track, 32635 cylinders, total 524288000 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000e49fa

   Device Boot  	Start     	End  	Blocks   Id  System
/dev/sda1        	2048   507516925   253757439   83  Linux
/dev/sda2   	507516926   524287999 	8385537	5  Extended
/dev/sda5   	507518974   524287999 	8384513   82  Linux swap / Solaris

11.7.5 Write changes

Command (m for help): w
The partition table has been altered!

FYI: Ignore any errors or warnings at this point and reboot the system:

sudo reboot

11.7.6 Increase the filesystem size

sudo resize2fs /dev/sda1

11.7.7 Activate SWAP

sudo mkswap /dev/sda5
sudo swapon --all --verbose
swapon on /dev/sda5

11.8 Patch ports

Patch ports can be used to create links between OVS bridges. They are useful when running benchmarks that require traffic to be sent up to a VM and back out to a traffic generator: OpenStack only creates one link between the bridges, and having traffic going up and down the same link can cause issues.

To create a patch port, ‘patch1’, on the bridge ‘br-eno2’, which has a peer called ‘patch2’, do the following:

sudo ovs-vsctl add-port br-eno2 patch1 -- set Interface patch1 \
  type=patch options:peer=patch2

To create a patch port, ‘patch2’, on the bridge ‘br-int’, which has a peer called ‘patch1’, do the following:

sudo ovs-vsctl add-port br-int patch2 -- set Interface patch2 \
  type=patch options:peer=patch1
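After adding both ends, you can confirm that the peers are wired correctly; each patch interface should list the other as its peer:

sudo ovs-vsctl show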



Overview of Intel® Computer Vision SDK and How it Applies to IoT


What is Intel® Computer Vision SDK?

The Intel® Computer Vision SDK is an Intel-optimized and accelerated computer vision software development kit based on the OpenVX* standard. The SDK integrates pre-built OpenCV with deep learning support using an included Deep Learning (DL) Deployment toolkit.

About OpenVX* and the Khronos Group*

OpenVX* is an open, royalty-free standard for cross-platform acceleration of computer vision applications, defined by the Khronos Group*, an industry consortium.

The Khronos Group is a not-for-profit, member-funded consortium dedicated to the creation of royalty-free open standards for graphics, parallel computing, and vision processing. Intel joined the Khronos Group as a Promoter Member in March 2006.

OpenVX* Benefits

The OpenVX* API standardizes the application interface for computer vision applications. This enables performance- and power-optimized computer vision processing and allows the application layer to transparently use vendor-specific hardware optimization and acceleration when available.

OpenVX* also specifies an API-independent standard file format for exchanging deep learning data between training systems and inference engines, called the Neural Network Exchange Format (NNEF*).

Using an extension of OpenVX*, developers can represent convolutional neural network (CNN) topologies as OpenVX* graphs. This allows developers to mix CNNs with traditional vision functions.

Intel® CV SDK Contents

  • Intel-optimized implementation of the OpenVX* 1.1 API with custom extensions and kernels.
  • Pre-built binaries of OpenCV with Intel® VTune™ Amplifier hooks for profiling.
  • Vision Algorithm Designer (VAD) IDE tool.
  • Deep Learning Model Optimizer tool.
  • Deep Learning Inference Engine.
  • Sample applications.


Hardware and Software Requirements

Developers can program the CV SDK using C/C++ on an Ubuntu* 64-bit development platform, using CMake to manage builds and GCC as the compiler.

The recommended development platform hardware is a 6th Generation Intel® Core™ processor or better with integrated Iris® Pro Graphics or HD Graphics.

Target platforms include next-generation Intel Atom® processors (formerly known as Apollo Lake), Intel® Core™ processors, and Intel® Xeon® processors. The target processors need integrated Iris Pro Graphics or HD Graphics to use the OpenCL™ GPU kernels.

Intel® CV SDK Development Benefits

Intel® Hardware Optimization and Acceleration

The Intel® CV SDK, which is Intel's OpenVX* implementation, offers CPU kernels that are multi-threaded (with Intel® Threading Building Blocks) and vectorized (with Intel® Integrated Performance Primitives).

This optimized Intel® implementation of OpenCL™ supports Intel® GPUs on integrated Iris Pro or HD Graphics platforms.

Using the Intel® CV SDK, developers can access early support for the new dedicated IPU (Image Processing Unit) on next-generation Intel Atom® processors (formerly Apollo Lake).

These new processors feature an integrated four-vector image-processing unit capable of supporting advanced vision functions and up to four concurrent HD IP cameras.

Custom Intel® Extensions

The Intel® CV SDK extends the original OpenVX* standard with specific APIs and many kernel extensions that allow developers to add performance-efficient (for example, tiled) versions of their own algorithms to the processing pipelines.

Heterogeneous Computing Support

The Intel® CV SDK supports both task and data parallelism to maximize the use of all available compute resources, including the CPU, GPU, and the new dedicated IPU (Image Processing Unit).

Profiling Support

The Intel® CV SDK includes a pre-built OpenCV implementation. This OpenCV implementation integrates hooks for Instrumentation and Tracing Technology (ITT), which allows profiling vision applications using Intel® VTune™ Amplifier.

Intel® CV SDK and IoT

One of the most important human senses is sight. As much as 80% of our interaction with our environment is based on vision.

Until now, IoT relied on multiple sensors to perform basic telemetry and automation tasks because computer vision was expensive, complex, and inaccessible to most developers.

However, with the advent of cheap HD cameras, processors with built-in CV accelerators, and robust computer vision software stacks, there is a rising trend in the use of camera-based computer vision as an IoT sensor across multiple verticals.

Integration with machine learning and deep learning systems opens new application use cases for the use of computer vision in IoT and brings the power of embedded CNN and DNN to the edge.

Related Software:

Intel® VTune™ Amplifier – Advanced Intel toolkit for profiling, visualizing, and tuning multi-processor, multi-threaded or vectorized Intel® platforms which support Instrumentation and Tracing technology (ITT).

Intel® Vision Algorithm Designer – An IDE on top of OpenVX for the development of OpenVX algorithms, workloads, and capabilities in an intuitive and visual manner.

Intel® Deep Learning (DL) Deployment toolkit – A cross-platform DL model optimizer which helps integrate DL inference with application logic.

Intel® Deep Learning Inference Engine - supports inference operations on several popular image classification networks and the deployment of deep learning solutions by delivering a unified API to integrate the inference with application logic.

Intel® SDK for OpenCL™ applications - Accelerated and optimized application performance with Intel® Graphics Technology compute offload and high-performance media pipelines.

Getting Started:

Quick Start Guide for Intel® Computer Vision SDK Beta

Intel's Deep Learning Deployment Toolkit Installation Guide

Intel® Enhanced Privacy ID (EPID) Security Technology


Introduction

With the increasing number of connected devices, the importance of security and user privacy has never been more relevant. Protecting information content is critical to prevent exposure of trade secrets for businesses, identity theft for individuals, and countless other harmful scenarios that cost both money and time to repair. Part of protecting data and privacy includes ensuring that the devices touching the data are authentic, have not been hijacked, and have not been replicated into a non-genuine piece of hardware.

In this article we discuss the Intel® Enhanced Privacy ID (EPID) security scheme, which specifically addresses two device-level security issues: anonymity and membership revocation. Billions of existing devices, including most Intel® platforms manufactured since 2008, create signatures that need Intel® EPID verification. Intel provides the Intel® EPID SDK as open source and encourages device manufacturers to adopt it as an industry standard for device ID in IoT.

Security Review – Public Key Encryption and Infrastructure

When exchanging data between two people or systems, it is important to ensure that it arrives securely and is not forged. The recipient should have high confidence that the sender is who they say they are. One of the most widely used methods of ensuring this trusted data transport is employing a digital signature. One method of creating a digital signature is the Public Key Encryption (PKE) security scheme. Using a mathematical hashing algorithm, two binary keys are generated that work together to encrypt and decrypt the data. Data that is encrypted (or, in this use case, signed) using the private key can only be decrypted (verified) using the matching public key. The private key is never shared with anyone, and the public key is available to everyone. This method guarantees that any data decrypted using a public key was indeed encrypted using the matching private key. For the most part, using Public Key Encryption for device authenticity works well; however, it does have a few limitations.
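As a concrete illustration of the sign/verify flow described above, here is a generic OpenSSL* sketch (not Intel® EPID; the key and file names are hypothetical):

# Generate an RSA key pair and extract the public half
openssl genpkey -algorithm RSA -out private.pem
openssl pkey -in private.pem -pubout -out public.pem

# Sign message.txt with the private key; verify with the public key
openssl dgst -sha256 -sign private.pem -out message.sig message.txt
openssl dgst -sha256 -verify public.pem -signature message.sig message.txt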

Problem 1: Certifying Keys

The first limitation involves the validity of the sender’s key.  In order to verify a signature, the public key of the sender is required, however there is no way to guarantee it belongs to the sender, or has not been stolen or tampered with.  An additional step can be taken to ensure the validity of the public key which involves certification from a third party called an issuer.  Using Public Key Infrastructure (PKI), the level of security can be raised by introducing a new element called a digital certificate, which is signed by the issuer’s private key.  The certificate contains the public key of the member, the member’s name, and other optional details.  Using this method guarantees that the public key being used is the actual key issued, and hence is the actual sender of the data.  Think of an issuer as a notary who guarantees that this signature is correct because they witnessed the person writing it.   A digital certificate issued from a certified authority solves the problem of certifying that a public key is authentic.

Problem 2: Shielding Unique Identity

A second limitation with PKI is the inability to remain anonymous while still being granted access.  Because one public certificate contains the key owner’s name and information, the ownership of the secured information is inherently known, and if the same device is verified multiple times, its activity could be tracked.  Usage of PKI for signed and encrypted emails is useful in this scenario where it is desired for the users to be identified.  The recipient installs the public certificate of the sender, and when opening the message has a level of trust that the sender signed these emails using a protected matching private key. 

As devices increasingly play roles in requesting authentication to systems, there is a greater need for both devices and users to be anonymous.  While a valid attestation is required, it is not a requirement that the device be individually identified or that the device provide any additional details other than the minimum amount of information necessary to prove that they are a genuine member of a trusted device family.  Taking this approach allows devices to access a system based on their approved level of access and not any personal information like a MACID or who they are. In other words, in the Intel® EPID scheme, if a hundred authentic signatures are verified, the verifier would not be able to determine whether one hundred devices were authenticated, or if the same device was authenticated one hundred times.

Problem 3: Revoking Access

Yet another limitation of PKE is that there exists no easy mechanism for revoking a private key that has been compromised. If anyone other than the user gets access to the private key, they can masquerade as that user, resulting in a loss of trust for the device. These private keys are often stored in a separate chip called a Trusted Platform Module (TPM), which is also encrypted. While this hardware trusted module approach is much more secure, the existence of the private key on the device still creates the possibility that it can be stolen. Fixing the problem of a stolen key would involve issuance of a new certificate and manual intervention to flash a new private key onto the device. Adding the ability to easily revoke a compromised private key allows a device to be flagged and disabled automatically, preventing any further identity theft.

Roles in a Public Key Infrastructure

CA – A Certified Authority is the entity that issues security certificates.
RA – The Registration Authority accepts requests for new certificates, ensures the authenticity of the requestor, and completes the registration process with the CA on behalf of the requestor.
VA – A Validation Authority is a trusted third party that can validate a certificate on behalf of a Certificate Authority.
Member – The member role can be assumed by an end user or a device. The member is the role that requests attestation of itself during secure exchanges.


Figure 1 - PKI Roles and Process Flow

Authentication vs Identification

Gaining access to a system should not always require user identification. The intent behind requesting access to a system is to obtain access while providing only a minimal, certifiable proof that access has been granted. Each user might require a certain level of anonymity based on specific use cases. Take, as an example, a medical wristband device that monitors the sleep habits of someone experiencing insomnia. For the individual, it is important that the data is provided to the doctors for analysis without allowing anyone else to potentially identify them or their private medical data.

For users accessing services, the authentication process is owned by the access provider, which unfortunately often ties access rights directly to an account identifier, which is usually then linked to additional account details that the user may want to keep private. Unfortunately, most systems today require a user or device to identify themselves in a way that can be traced back to the original user with every transaction. An example in software would be a username; for a device, it might be a MACID or a public key held in secure storage. To prevent this, users must be able to effectively and rightfully use a system without being required to provide any information that can be immediately linked to themselves. One example is a toll booth: a user should be able to pass through the booth because they were previously issued a valid RFID tag, but no personal information should be required, and the user is not directly identified in any transaction or tracked through that device. If the requirement is to trace who is travelling through the toll booth, that right is reserved by the access provider; however, there are instances when authentication should be separated from identification.

Direct Anonymous Attestation (DAA) is a security scheme proposed in 2004 that permits a device in a system to attest membership of a group while preserving the identity of the individual. Drafted by Ernie Brickell (Intel Corp), Jan Camenisch (IBM Research®), and Liqun Chen (HP Laboratories®), DAA is now approved by the Trusted Computing Group (TCG) as the recommended method for attestation of a remote device, and is outlined in ISO/IEC 20008.

What is EPID?

Enhanced Privacy ID (EPID) is an Intel implementation of ISO/IEC 20008 that addresses two of the problems with the PKI security scheme: anonymity and membership revocation. Intel has included EPID keys in many of its processors, starting with the series 5 chipsets in 2008, which includes all Intel® Core™ processor family products. In 2016, Intel, a certified EPID Key Generation Facility, announced that it had distributed over 4.5 billion EPID keys since 2008.

The first improvement over PKI is “Direct Anonymous Attestation”: the ability to authenticate a device for a given level of access while allowing the device to remain anonymous. This is accomplished by introducing a group-level membership authentication scheme. Instead of a 1:1 public-to-private key assignment for an individual user, EPID allows a group of member private keys to be associated together and linked to one group public key. This EPID group public key can be used to verify the signature produced by any EPID member private key in the group. Most importantly, no one, including the issuer, has any way to know the identity of the user. Only the member device has access to the private key, and it will validate only using a properly provisioned EPID signature.

The second security improvement that EPID provides is the ability to revoke an individual device by detecting a compromised signature or key. If the private key used by a device has been compromised or stolen, the EPID ecosystem can recognize this, revoke the device, and prevent any future forgery. During a security exchange, the EPID protocol requires that members perform mathematical proofs to show that they could not have created any of the signatures flagged on a signature revocation list. This built-in revocation feature allows devices, or even entire groups of devices, to be instantly flagged for revocation and denied service. It allows anonymous devices to be revoked on the basis of a signature alone, which means an issuer can ban a device from a group without ever knowing which device was banned.

Intel® EPID Roles

There are three roles in the EPID security ecosystem. First, the Issuer is the authority that assigns or issues EPID group IDs and keys to individual platforms: similar devices that should be grouped together from an access-level perspective. The issuer manages group membership and maintains current versions of all revocation lists. Using a newly generated private key for the group, the issuer generates one group public key and as many EPID member private keys as requested, all of which are paired with the one group public key. The Member role is an end device, and represents one individual member in a group of many members, all sharing the same level of access. Finally, the Verifier role serves as the gatekeeper: checking and verifying EPID signatures generated by platforms, ultimately ensuring they belong to the correct group. Using the EPID group public key, the verifier is able to validate the EPID signature of any member in the group with no knowledge of membership identity.


Figure 2 – EPID Roles

Issuer – Creates, stores, and distributes issuer-signed public certificates for groups. Creates, distributes, and then destroys private keys; private keys are not retained by an issuer and are held private by member devices in trusted storage, such as a TPM 1.2 compliant device. Creates and maintains revocation lists.
Verifier – Challenges member verification requests using the EPID group public key and revocation lists. Identifies any member or group revocations.
Member – An end device for a particular platform. Protects its private EPID key in TPM 1.2 compliant storage. Signs messages when challenged.

Now that the roles in EPID have been discussed, let’s discuss the different security keys used in the EPID security scheme. First, for a given group of devices or platforms, the issuer generates the group public key and group private key simultaneously. The group private key has one purpose: the issuer uses it as the basis to create new member private keys, and for that reason the issuer keeps the group private key secret from all other parties. The EPID group public key is maintained by the issuer and verifier, and the private member keys are distributed to the device platforms before the issuer destroys its local versions.


Figure 3 - Using a unique key allocated for a group, an issuer creates one EPID group public key and as many member EPID private keys as requested.

Security Keys used in EPID

Issuer (private) – CA root issuing authority private ECC key. Used to sign the EPID group public key and parameters; ensures trust all the way to the member.
Issuing CA (public) – Issuing authority public ECC key. Provided to platform members to enable trust with a verifier and issuer.
Issuer (private) – Group private key, one per group. Created by the issuer for a group and used to generate private member keys.
Group (public) – EPID group public key, generated by the issuer. Provided to platform devices during provisioning upon request; used by verifiers to validate EPID member signatures.
Member (private) – EPID member private key; a unique private key for each device that can be fused and must be secured. Generated by the issuer using the group private key; stored securely or embedded/fused into silicon as a golden private key ready for provisioning into the final EPID key; used to create valid EPID signatures that can be decrypted using the paired EPID group public key.

The Intel® EPID scheme works with three types of keys: the group public key, the issuing private key, and the member private key. A group public key corresponds to the unique member private keys that are part of the group. Member private keys are generated from the issuing private key, which always remains secret and known only to the issuer.

To ensure that material generated by the issuer is authentic, another level of security is added: the issuing CA certificate. The CA (Certificate Authority) public key contains the ECDSA public key of the issuing CA. The verifier uses this key to authenticate that information provided by the issuer is genuine.

Intel® EPID Signature

An Intel® EPID signature is created using the following parameters:

  • Member private key
  • Group public key
  • Message to be signed
  • Signature revocation proof list (to prove that it did not create any signatures that were flagged for revocation in the past)

An Intel® EPID signature is verified using the following parameters:

  • Member’s signature
  • CA certificate (to certify authenticity of issuer material before it is used)
  • Group public key
  • Group revocation list
  • Member private key revocation list
  • Signature revocation list

Intel® EPID Process Flows

Embedding

By including the Intel® EPID key in the manufacturing process for a device, a part can be identified as genuine after deployment into the field without any human intervention. This process saves time and improves security by not distributing any private keys or requiring any interaction with the end user. Sequence 1 shows a vendor of a hardware device initiating a request with an Intel® EPID issuer and ultimately deploying the generated Intel® EPID member keys with the device. The process starts with the vendor requesting to join the ecosystem managed by this issuer; it can also be said that this member is choosing this issuer as its Key Generation Facility. When a new member requests to join, the issuer first generates a set of platform keys, which are held private and used to generate one group public key and one or more member private keys. The member private keys are deployed securely with each device and are not known by anyone else, including the issuer, who does not retain any of the private keys. The Intel® EPID group public key is stored with the issuer and distributed to verifiers upon request.


Sequence 1 – Intel® EPID key request and distribution process

For products supporting Intel® EPID, Intel fuses a 512-bit number directly into a submodule of the processor called the Management Engine. This Intel® EPID private member key is encoded with an Intel® EPID group ID that uniquely identifies the device as part of the group. As an issuer and verifier, Intel maintains public certificates for each of the devices encoded with Intel® EPID keys. The private member keys require the same level of protection as a standard PKI private key. Access to the private key can only be achieved using an application that is signed by an RSA security certificate whose root of trust is Intel. Other silicon manufacturers can follow a similar process, allowing only trusted applications of their corporations to access the private key on their products.

Provisioning

After deployment into the field, a device is not ready to use Intel® EPID out of the box. Before it can be brought to life, it must complete a process called provisioning, which allows it to attest its authenticity using a valid Intel® EPID signature in all future transactions. Sequence 2 shows a possible provisioning flow for the first boot of an IoT device that uses Intel® EPID. Once granted access to the Internet, a device can call home to state that it is online and check for software updates.

Before granting access, however, the provider answering the call must ensure that the device is authentic. In a typical onboarding scenario, a verifier sends the member device a request for its provisioning status. If the device is not already provisioned, meaning it has not previously been authenticated, it can complete provisioning by requesting a public Intel® EPID group key from the verifier. The member device then stores both the private and public Intel® EPID keys in secure storage and is able to successfully sign Intel® EPID signatures as well as reply to provisioning status challenges.


SEQUENCE 2 – Intel® EPID Provisioning Flow

Revocation

Because the Intel® EPID security scheme allows for anonymous group membership attestation, it must also provide the ability to reject or decommission members or groups at any time. Intel® EPID supports revocation at the membership level through identification of an Intel® EPID member signature or, if known, the private member key.

In addition, Intel® EPID supports revocation of an entire group, which revokes access for all devices in that group. As shown in Sequence 3, member revocation can be initiated by a verifier or an issuer; however, only the issuer can actually revoke an Intel® EPID member or group. Once a group is revoked, verifiers no longer reference any signature or key based revocations for the group, meaning it will be ignored.

The Intel® EPID protocol exchanged between member, verifier, and issuer contains revocation lists, which can grow in size over time for a platform group that has many compromised members. An increase in revocations comes with a linear performance decrease, meaning it takes longer to validate everyone in the chain over time. One solution an issuer can pursue when this occurs is to create a new group and move the uncompromised members into that new group. The old group can then be revoked.


SEQUENCE 3 – Verifier submits request to Issuer to revoke a member or group

Summary of revocation lists:

  • PRIV-RL – the private member key is known
  • SIG-RL – the private member key is not recovered, but the signature is known
  • GROUP-RL – the entire group should be revoked

While members normally exchange signatures with verifiers, communication also occurs directly with the issuer. The join protocol between a member device and the issuer supports transporting a valid Intel® EPID private key to the device. This can be used to replace a compromised key or for remote provisioning when the key is not available to the member. A secure, trusted transport mechanism for the key is assumed and is outside the scope of the protocol.

Intel® EPID Use Cases

A perfect example use of Intel® EPID is proving that a hardware device is genuine. After deployment from a manufacturer, it is important for a device to have the ability to truthfully identify itself during software updates or when requesting access to a system. Once authorized, the device is said to be genuine and a valid member of a group while still remaining anonymous.

Another example relates to digital streaming content. Digital Rights Management (DRM) currently uses Intel® EPID to ensure that a remote hardware device is secure prior to streaming data to it. This process ensures that the hardware player streaming the content is authentic. Intel® Insider™ technology, which protects digital movie content delivered from service providers, only works on clients that also support Intel® Insider™. This gives content providers a level of trust that their content cannot be copied simply by viewing it on the device. There is no disruption to current services; the only impact is to those trying to pirate digital content that has been protected using Intel® Insider™.

Intel® Insider™
http://blogs.intel.com/technology/2011/01/intel_insider_-_what_is_it_no/

Intel® Identity Protection Technology with One-Time Password (OTP) also uses Intel® EPID keys to implement a two-factor authentication method that enhances security beyond a simple username/password.

One time password
https://www.intel.com/content/www/us/en/architecture-and-technology/identity-protection/one-time-password.html

SGX – Intel® Software Guard Extensions allows applications to run in a trusted, protected area of memory allocated as an ‘enclave,’ preventing any outside access to the application’s memory or execution space.

SGX
https://software.intel.com/en-us/sgx

Silicon providers such as Microchip* and Cypress Semiconductor* are now implementing Intel® EPID into their products as well.

Microchip announces plans for implementing Intel® EPID
http://download.intel.com/newsroom/kits/idf/2015_fall/pdfs/Intel_EPID_Fact_Sheet.pdf

Intel Products offering Intel® EPID

Beginning with the release of the Series 5 chipsets in 2008, Intel® EPID keys have been fused and deployed in all products based on Series 5 and newer chipsets.  For more information on which products are supported, visit the ARK at http://ark.intel.com/#@ConsumerChipsets

Intel® EPID SDK – Member and Verifier APIs

The Intel® EPID SDK is an open source library that supports both member and verifier Intel® EPID tasks.  It does not include any issuer APIs, which means it cannot be used to create Intel® EPID keys.  The SDK comes with documentation and examples for signing and verifying messages using included sample issuer material (a public group Intel® EPID key, a private member Intel® EPID key, and additional information such as the revocation lists), which in a real system would be generated by the issuer.  Verifier APIs do exist for populating a special kind of signature revocation list known as the verifier blacklist; however, that list can only be populated if members opt in to allowing themselves to be tracked, and only the issuer can create revocation lists that apply to the entire group.

First steps with Intel® EPID

To get started, download the latest Intel® EPID SDK from https://01.org/epid-sdk/downloads and begin by reading the documentation included in the doc subfolder of each distribution.

After building the SDK, navigate to the _install\epid-sdk\example folder and try out the included examples for signing and verifying signatures.  The folder contents, shown below, include the sample private key, issuer certificates, and revocation lists required to complete verifications.  The files are clearly named, making it easy to identify their contents.


Figure 4 – Directory listing of the Intel® EPID 4.0.0 SDK

Intel® EPID Member Example

Create a digital signature using the sample Intel® EPID member private key, groupID, and a sample text string of any content.

signmsg.exe --msg="TEST TEXT BLOB"

The signmsg command outputs a signature file (./sig.dat) whose contents can only be verified using the matching Intel® EPID public group key and the original message.  Regardless of what initiates or triggers the verification process, the verifier and member must use the same message parameter for verification to succeed.

Intel® EPID Verifier Example

Creating and validating signatures requires that both ends (member and verifier) use the same message, hashing algorithm, basename, and signature revocation list.  A change to any of these results in a signature verification failure.  During a validation flow, the verifier may send a text message for the member to sign.

Verify a digital signature using the SDK with the same message.

verifysig --msg="TEST TEXT BLOB"


Figure 5 – Console sign and verify success

If not specified, the SDK will use default values for the hashing algorithm.

If a different message or hashing algorithm is used, the verification will fail.


Figure 6 – Console sign and verify failure

The executables included with the Intel® EPID SDK examples are intended only for quick validation or integration tests of signatures, and to demonstrate basic member and verifier capability.  A developer wanting to implement member or verifier functions would start by taking a look at the included documentation, which includes both an API reference and sample walkthroughs for signing and verifying in Intel® EPID.


Figure 7 – Intel® EPID SDK Documentation

The Intel® EPID SDK is constantly improving with each release, aligning with the newest Intel® EPID standards and providing optimizations for hashing algorithms using Intel® Integrated Performance Primitives.

How to implement Intel® EPID

OEMs and ODMs can take advantage of the fact that Intel® EPID keys are available on all Intel® products with Series 5 or newer firmware.  The Intel® EPID SDK can be used to create the platform code that will run on the device; however, that code can only be executed on a platform device in a secured, trusted environment signed by Intel.  Only a signed application running in the ME secure firmware can access the Intel® EPID key for the purpose of provisioning.  An OEM/ODM can work with an Intel representative for guidance on how to enable Intel® EPID on an existing Intel® product that supports it.

Other silicon manufacturers are following suit and adopting Intel® EPID technology.  Both Cypress Semiconductor and Microchip are starting to ship products with embedded Intel® EPID member keys.  This means an Intel® EPID ecosystem can be deployed regardless of whether the silicon comes from Intel; adhering to the rules of the Intel® EPID security scheme is what permits a device to take advantage of Intel® EPID features.

Visit the Intel® EPID SDK site for more documentation and API walkthroughs for signing and verifying messages: https://01.org/epid-sdk/

If you are interested in implementing Intel® EPID into your products, or to join our Zero Touch Onboarding POC, start by emailing iotonboarding@intel.com

If you would like to use Intel’s Key Generation Facility to act as an Intel® EPID issuer for creation of Intel® EPID keys, please start by contacting iotonboarding@intel.com.

Quick Facts

  • Intel has issued over 4 billion Intel® EPID keys since the release of the Series 5 chipset in 2008
  • Devices in an Intel® EPID Ecosystem are allowed to authenticate anonymously using only a Group ID
  • Intel® EPID is Intel’s implementation of Direct Anonymous Attestation
  • Intel® EPID supports revoking devices based on Private Key, Intel® EPID Signature, or an entire Group
  • Silicon providers can create their own Intel® EPID ecosystem
  • OEM/ODMs can use Intel® EPID compliant silicon devices to provide quick and secure provisioning
  • Intel® products include an embedded true random number generator – providing quicker, more secure seed values for hashing algorithms. (The SDK requires a secure random number generator to be used in any implementation of Intel® EPID.)

Summary

In this article, we discussed an Intel® security scheme called Intel® EPID that allows devices to attest membership in a group without being individually identified.  Intel® Enhanced Privacy Identification technology 2.0 enhances direct anonymous attestation by providing a member revocation capability based on member signatures or keys, as well as group revocation.  Choosing Intel products allows OEMs/ODMs and ISVs to take advantage of built-in security keys already available in numerous Intel product families.  Silicon providers can also take advantage of Intel® EPID technology by embedding private keys directly into their hardware and joining their own Intel® EPID ecosystem.  With a predicted 50 to 100 billion connected IoT devices by 2020, security and device authenticity should be imperative for both manufacturers and end users.

A very special thanks to the members of the Intel® EPID SDK team for taking time to answer questions on Intel® EPID and the Intel® EPID SDK.

Terminology

AES-NI – AES New Instructions, a hardware-embedded feature available in most newer Intel® products.
AIK – Attestation Identity Key
AMT – Active Management Technology; supports out-of-band remote access.
Anonymity – A property that allows a device to avoid being uniquely identified or tracked.
Attestation – A process by which a user or device guarantees they are who they say they are.
Certificate – An electronic document issued by a third-party trusted authority (issuer) that verifies the validity of a public key. The contents include a subject and a verifiable signature from the issuer, which adds an additional layer of trust around the contents.
DAA – Direct Anonymous Attestation
DER – Certificate file format; Distinguished Encoding Rules
ECC – Elliptic Curve Cryptography
EPID – Enhanced Privacy Identification
EPID key – A private key held by an individual member and not shared with anyone; used to create a valid Intel® EPID signature that can be verified using the matching Intel® EPID public group key.
iKGF – Intel® Key Generation Facility
Intel SCS – Setup and Configuration Software; used to access AMT capabilities.
ISM – Intel® Standard Manageability
ISO/IEC 20008-2:2013 – ISO standard for anonymous digital signature security mechanisms: https://www.iso.org/obp/ui/#iso:std:iso-iec:20008:-2:ed-1:v1:en
ME – Intel® Management Engine, sometimes also called the Security and Management Engine.
ODM – Original Design Manufacturer
OEM – Original Equipment Manufacturer
PEM – Certificate file format; Privacy Enhanced Mail
PKE – Public Key Encryption
PKI – Public Key Infrastructure
Platform – A piece of hardware or a device.
Private key – A key owned by an individual or device, held private, and never shared with anyone. It is most commonly used to encrypt a message into cipher-text that can only be opened using the matching public key.
Public key – A key provided to the public that will only decrypt a document encrypted using the matching private key.
SBT – Small Business Technology
Secure key – A text string that matches the output of a defined algorithm and allows plain text to be transformed into cipher-text or vice versa.
SIGMA – SIGn and Message Authentication; an Intel protocol for platform-to-verifier two-way authentication.
X.509 – ITU-T standard for certificate format and content.

About the Author

Matt Chandler is a senior software and applications engineer who has been with Intel since 2004.  He is currently working on scale-enabling projects for the Internet of Things, including ISV support for smart buildings, device security, and retail digital signage vertical segments.

References

Intel® EPID White Paper

https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/intel-epid-white-paper.pdf

NIST PEC Workshop, December 2011

http://csrc.nist.gov/groups/ST/PEC2011/presentations2011/brickell.pdf

Wikipedia References on Security

https://en.wikipedia.org/wiki/Direct_Anonymous_Attestation
https://en.wikipedia.org/wiki/Public-key_cryptography
https://en.wikipedia.org/wiki/Public_key_infrastructure
https://en.wikipedia.org/wiki/Public_key_certificate

ACM conference 2004, “Direct Anonymous Attestation”

https://eprint.iacr.org/2004/205.pdf

Platform Embedded Security Technology Revealed

http://www.apress.com/us/book/9781430265719

Wikipedia Image license for PKI process

https://en.wikipedia.org/wiki/Public_key_infrastructure#/media/File:Public-Key-Infrastructure.svg
https://creativecommons.org/licenses/by-sa/3.0/

Face Beautification API for Intel® Graphics Technology


Download sample code [16MB]

Abstract

This paper highlights the C++ API that enables applications to support Face Beautification, one of the features supported by Intel® Graphics Technology. It outlines the list of available effects in the Face Beautification Library version 1.0 and describes the C++ API definitions and methods included in the library.

Face Beautification in Intel® Graphics Technology

The Face Beautification feature supported by Intel Graphics Technology provides the capability to automatically enhance faces based on detected facial landmarks. Using the current implementation of the APIs, you can create an automatic framework and develop face enhancement tools for a better user experience. Because a lot of information can be extracted from an image, the API helps to implement automatic Face Beautification.

There are two methods for enabling Face Beautification in an application. The first version of Face Beautification support was available through Device Driver Interface (DDI) implementations; applications could be enabled by a call into the private DDI. Now there is a second and simpler option: a C++ API that can assist application development. Developers can access the C++ API via the Face Beautification static library. The Face Beautification feature set is detailed in the table below.

FB Features | Category
Face brightening | Global
Face whitening | Global
Skin Foundation | Skin map based
Skin Smoothing | Skin map based
Skin blush | Skin map + Landmark
Eye circles removal | Skin map + Landmark
Eye bags removal | Skin map + Landmark
Eye wrinkles removal | Skin map + Landmark
Red lips | Landmark based
Big eyes | Landmark based
Cute nose | Landmark based
Slim face | Landmark based
Happy face | Landmark based

API Definitions

There are five APIs for Face Beautification. The current infrastructure supports one Face Beautification feature, FBRedLip. Three structures store the input and output properties, the FDFB mode feature, the feature strength, and other parameters. The constructor initializes the class data members based on the information provided in these structures. The first API initializes the device using fDeviceInitialization(), followed by fConfiguration(), which sets the device properties based on the structures passed to the constructor. A separate API, FDFBMode_Initialization(), is provided for Face Detection and Face Beautification (FDFB) mode. After device initialization and configuration, the pipeline is executed for each frame using the ExecutionPipeline() API. The destructor is called automatically to release the memory objects.

FB_API(FDFB_IN_OUT_PARAM init_file_var, FDFB_REDLIP_PARAM FBRedLip, FDFB_MODE_PARAMS FDFB_Mode_Val);
void fDeviceInitialization();
void fConfiguration();
void FDFBMode_Initialization();
void ExecutionPipeline(char* tempBuffer);
int convertFileToFaceList(std::fstream& file, std::vector<VPE_FACE_RECT>& list);
~FB_API();

The details of the structures used by the APIs are provided below:

typedef DXGI_FORMAT FDFB_FORMAT;
typedef struct FDFB_IN_OUT_PARAM
{
 FDFB_FORMAT inputFormat;
 FDFB_FORMAT outputFormat;
 UINT inputWidth;
 UINT inputHeight;
 UINT outputWidth;
 UINT outputHeight;
} FDFB_IN_OUT_PARAM;

typedef struct FDFB_REDLIP_PARAM
{
 UINT FBRedLipStrengthEnable;
 UINT FBRedLipStrength;
} FDFB_REDLIP_PARAM;

typedef struct FDFB_MODE_PARAMS
{
 GUID * pVprepOperationGUID;
 VPE_FDFB_MODE_ENUM FDFBMode;
 VPE_FD_FACE_SELECTION_CRITERIA_ENUM faceSelectionCriteria;
 std::vector<VPE_FACE_RECT> list;
} FDFB_MODE_PARAMS;

Usage and Program Flow

  1. The Face Beautification header file and static library (.lib) are provided. Create the project and include the header file in the additional include directory, add .lib to additional library directories, and add the name of the static library to the additional dependencies on the Input tab.
  2. In the application, provide the input file, the output file, and the face file.
  3. Provide or read properties into variables such as input width, input height, output width, output height, input format, and output file format. If the application wants to enable FDFB mode, specify which feature is enabled and its strength. Set the face selection criteria. If the face selection criteria and FB feature strength are not provided, the driver uses default values. Use convertFileToFaceList() to convert the face file to list format.
  4. Create a class object and pack the information into structures. Call the class constructor to initialize the value of class data members.
  5. Call the device initialization function fDeviceInitialization().
  6. Call the device configuration function fConfiguration().
  7. Call the FDFB mode initialization function FDFBMode_Initialization().
  8. Execute the pipeline by calling ExecutionPipeline(char* tempBuffer). This function loops over all the frames. A new buffer is passed for every new frame; update the buffer accordingly.
  9. Write the output to the output file. A class destructor is called automatically to release the memory objects.

Future Work

The current API implementation supports one FDFB feature, FBRedLip. Upcoming versions will include a larger set of Face Beautification features. If requested, face detection support will be included as well. The DDI implementation supports virtual camera input; this feature can be extended to the C++ API as well.

About the Author

Sonal Sharma is a software application engineer working at Intel in California. Her work responsibility includes performance profiling and analysis, and CPU/GPU code optimization for media applications.

Path of Exile’s storied road to success


The original article is published by Intel Game Dev on VentureBeat*: "Path of Exile's storied road to success." Get more game dev news and related topics from Intel on VentureBeat.


It’s probably fair to say that if you’re a fledgling indie development studio casting around for signs of inspiration, good choices, and role models you’d be hard-pressed to find a better example than Grinding Gear Games, creators of action-RPG, Path of Exile. The studio, founded in late 2006 in Auckland, New Zealand, took its sweet time to release the game, but the patience and constant work paid off as it’s a certifiable hit commercially and among its legion of dedicated fans.

From a starting team of three the studio has ballooned to 100 developers all focused on the single project in front of them. They are still generating new content, and with it, encouraging new players to the fold while enticing lapsed players with smartly considered methods of keeping the experience fresh.

It all started with a love of Diablo* II

“We played a lot of online action RPGs like Titan Quest and Dungeon Siege, but especially Diablo* II,” says Producer and Lead Designer, Chris Wilson, “and as 2006 approached we felt it was strange that no studios were making games like this, specifically online, with good item economies.” Wilson and his early team knew that other studios were making games following some of the action-RPG tropes of Diablo II but whatever their intentions, those games ended up being single-player games.

Above: Path of Exile springs epic visual moments into the action-RPG gameplay.

“We felt that Diablo II players were looking for something newer and we thought, somewhat naively, why don’t we make that game?” says Wilson. There was method in the apparent madness since none of these friends had ever put together a game studio before. “There was a hole in the market, tens of millions of players were looking for something to play, and we felt like we can do that,” adds Wilson. That concept of identifying a hole in the market that evidently had a fan base but wasn’t being satisfactorily served is a core concept that would prove vital in allowing Path of Exile to be a success. That, and a little talent, of course.

Getting started

The early days wouldn’t be easy though. “We pooled our life savings, set up in a garage, and three of us started to make Path of Exile. It’s a survivor story, really, as we had to learn how to make games, scale a studio up to 100 people, but it was successful and it all came from a desire to fill a hole in the market,” says Wilson. “We knew it was the right product to make, we just didn’t know if we were capable of making it,” he adds.

From a design perspective, the team established pillars that would have to be adhered to for this game to be a success. “We knew to be successful it had to be an online game and that items had to be stored on the servers. These games thrive on the fact that they have items that are incredibly hard to obtain online, and players are willing to spend a long time to attain them,” says Wilson.

Above: Building a game with an absorbing item economy was key to Path of Exile's design and success.

The next pillar dealt with requiring random levels and items to help retain players for the long haul. “It’s important for replayability that the levels are procedurally generated so that when you play through, it feels different,” says Wilson. In addition, staples of the genre like visceral combat that was also responsive and, as Wilson described, “punchy,” were the kind of standards that the Diablo crowd would both recognize and feel was core to their enjoyment.

These pieces would lead to an important goal. “We want people to play the game for ten years. And we already have players who are entering their sixth year, so that’s working well,” says Wilson.

As developers everywhere know, building technology while you’re building a game is far from easy. For the Grinding Gear team, there were no shortcuts. “At the time, there were no off-the-shelf online game back ends that you could just purchase. We needed the game to support tens of thousands of players online simultaneously. So, we looked at how other games architected and came up with a hybrid that we would build,” says Wilson.

As a result, every system was custom built for the requirements of this game. Wilson also revealed that it’s only this year that the team has investigated middleware options to help with new features like adding video into the game.

Above: Maps built using the procedural generation system ensure a different experience every time.

Slow growth

Given all this work, it shouldn’t be too much of a surprise that it took until 2013 before the game was officially in full release. Though the alpha period had generated an engaged community, building beyond that was a struggle with a traditional press tour resulting in positive plaudits and feedback, but only 2,000 additional people hitting the forums. “Now 2,000 people is nice, but nowadays you get that by just tweeting something,” says Wilson.

This slow-burn was frustrating, but the belief in the core product remained resolute. “Eventually the inevitability worked and it passed a quality threshold where people were willing to tell their friends,” says Wilson. It is also a commitment to let the game speak for itself that has stopped Grinding Gear from embarking on refer-a-friend programs and similar marketing techniques to attract players. Rather, they would prefer a more organic process whereby “a friend disappears and you wonder where they are. You discover they’ve disappeared because they’ve been playing this awesome game for six hours a day!” adds Wilson.

The randomized system — such a core part of the replayability — has also played well with the community and with YouTubers. “We call them the reddit moments,” says Wilson, “when the game does something interesting enough, the player will say ‘hey, that was cool, I need to post it on reddit.’”

Those moments might be action situations, but the randomized naming of items and monsters can generate its own comedy and, even, naughtiness. “There was a monster where the game generated the name Stink Stink, and so that of course has become a community meme.

“We quickly learned that while it sounded cool to have the word ‘black’ as a prefix so you could have cool names like Black Bane, it was way too quickly able to generate offensive stuff, so that had to be removed early in the beta!” adds Wilson.

Above: The inspiration from Diablo* II is quite apparent in the game style and layout.

The emergence of the streamer and YouTube* community during Path of Exile’s development and release has certainly aided gamer awareness, but hasn’t affected any core design or feature set ideas. That said, Wilson suggested that the team have considered a few ideas to address the people who are helping generate awareness and extending the game’s reach.

“We have considered a game mode for streamers (or any user) where two streamers enter and they’re competing with each other in some way — probably not directly — like who gets through the maze first. Then their viewers have some mechanism for donating or voting that makes it harder for the other streamer. So, two rivals can have fun in friendly competition,” explains Wilson.

The business of free-to-play

Now approaching a full four years of full release, Path of Exile continues its upwards growth curve, buoyed by new content, daily news posts, and unveiling a new server every three months that keeps drawing players back. Wilson accepts that the game is now profitable, powered by its 100-strong development team. That scenario wasn’t always quite so apparent when this new studio started out with its game design dream. “It pretty much is a passion project. It started with ‘players want this game’ and only turned into the business of ‘okay, how is this game going to pay for itself’ a bit later,” says Wilson.

“We had seen games like Maple Quest be very successful in Korea, but nobody had really done a free-to-play model in the West,” says Wilson, “and our big revolution was to see if we could be the first free-to-play in the West. We weren’t because the game took forever to make.”

Wilson and his team did what he describes as “rough back-of-envelope math” to figure out the business model in those early days. “From surveying other games, we figured for every average concurrent user, you make about 50 cents a day. So, with 1000 people on average logged in, you’re making $500 a day. That obviously only pays for a couple of staff members. We looked into the logistics of what it’s going to take to run this online game with a skeleton crew, not making much content, and decided we needed 10,000 players logged in on average in order to pay for the game, so that was our goal — the 10,000 concurrent players mark,” reveals Wilson.

It turned out that Wilson and his team had the math wrong. Quite wrong! “You actually make a lot more money than that and you also require a lot more people. We have a 100-person team and we thought you could run an online game with six!” says Wilson. Fortunately for Grinding Gear, fans enjoying their game experience are willing to pay for the entertainment and, coupled with blowing the 10,000-concurrent number out of the water, the game is able to support that 100-person crew.

Above: The anticipated economics turned out to be quite different than expected once the game began to scale its concurrent players.

Coming next

Wilson is clear that Path of Exile remains the total focus of Grinding Gear Games for the foreseeable future, with no plans to diversify into other games. “We have a lot of stories still to tell with Path of Exile,” he says, adding “We hear game ideas all the time and someone will say ‘we can make a game to beat Dota 2’ and I shake my head and go back into my office!”

Now if the studio were looking for opportunities — and they’re not — it would follow that same philosophy of finding an underrepresented genre with an established fan base. “It would be something like the old-school point-and-click adventures or the Command & Conquer RTS. Those are some of the areas that we would look at,” says Wilson, but also made it abundantly clear that no, the studio is not announcing work on any such projects.

“We’re not making a VR game, we’re not making a survival game like DayZ, we avoided making a Minecraft* game. We’ve avoided jumping on any bandwagon, but would look where areas are being underserved,” Wilson added.

Remaining focused on Path of Exile continues to pay dividends, as does believing that despite reports to the contrary, the PC continues to maintain its viability as a major gaming platform. There are lessons here for every development studio.

Introduction to the DPDK Sample Applications


This article describes the Data Plane Development Kit* (DPDK*) sample applications.

The DPDK sample applications are small standalone applications which demonstrate various features of DPDK. They can be considered a cookbook of DPDK features. A user interested in getting started with DPDK can take the applications, try out the features, and then extend them to fit their needs.

The DPDK Sample Applications

Table 1 shows a list of some of the sample applications that are available in the examples directory of DPDK:

Bonding | Netmap Compatibility
Command Line | Packet Ordering
Distributor | Performance Thread
Ethtool | Precision Time Protocol (PTP) Client
Exception Path | Quality of Service (QoS) Metering
Hello World | QoS Scheduler
Internet Protocol (IP) Fragmentation | Quota and Watermark
IP Pipeline | RX/TX Callbacks
IP Reassembly | Server Node EFD
IPsec Security Gateway | Basic Forwarding/Skeleton App
IPv4 Multicast | Tunnel End Point (TEP) Termination
Kernel NIC Interface | Timer
Network Layer 2 Forwarding + variants | Vhost
Network Layer 3 Forwarding + variants | Vhost Xen
Link Status Interrupt | VMDQ Forwarding
Load Balancer | VMDQ and DCB Forwarding
Multi-process | VM Power Management

Table 1. Some of the DPDK sample applications.

These examples range from simple to reasonably complex but most are designed to demonstrate one particular feature of DPDK. Some of the more interesting examples are highlighted below.

  • Hello World: As with most introductions to a programming framework, a good place to start is the Hello World application. The Hello World example sets up the DPDK Environment Abstraction Layer (EAL) and prints a simple "Hello World" message to each of the DPDK-enabled cores. This application doesn't do any packet processing, but it is a good way to test whether the DPDK environment is compiled and set up properly.
  • Basic Forwarding/Skeleton application: The basic forwarding/skeleton contains the minimum amount of code required to enable basic packet forwarding with DPDK. This will allow the user to test and see if their network interfaces are working with DPDK.
  • Network Layer 2 Forwarding: The Network Layer 2 forwarding, or L2fwd application, does forwarding based on Ethernet MAC addresses like a simple switch.
  • Network Layer 3 Forwarding: The Network Layer 3 forwarding, or L3fwd application, does forwarding based on Internet protocols, IPv4, or IPv6 like a simple router.
  • Packet Distributor: The packet distributor demonstrates how to distribute packets arriving on an Rx port to different cores for processing and transmission.
  • Multi process application: The multi process application shows how two DPDK processes can work together using queues and memory pools to share information.
  • RX/TX Callbacks application: The RX/TX Callbacks sample application is a packet forwarding application that demonstrates the use of user-defined callbacks on received and transmitted packets. The application calculates the latency of the packet between RX (packet arrival) and TX (packet transmission) by adding callbacks to the RX and TX packet processing functions.
  • IPSec Security Gateway: The IPSec security gateway application is a minimal example that is closer to real-world usage. It is also a good example of an application using the DPDK Cryptodev* framework.
  • Precision Time Protocol (PTP) client: The PTP client is another minimal implementation of a real-world application. In this case the application is a PTP client that communicates with a PTP master clock to synchronize time on a network interface card (NIC) using the IEEE1588 protocol.
  • Quality of Service (QoS) Scheduler: The QoS Scheduler application demonstrates the use of DPDK to provide QoS scheduling.

There are many more examples documented online at dpdk.org. The documentation for each sample application shows how to compile, configure, and run it, and explains the main code behind the functionality.

In the next section, we will look at the Network Layer 3 forwarding (L3fwd) sample application in more detail.

The Network Layer 3 Forwarding Sample Application

The Network Layer 3 forwarding, or L3fwd application, demonstrates packet forwarding based on Internet protocol, IPv4, or IPv6 like a simple router. The L3fwd application has two modes of operation, longest prefix match (LPM) and exact match (EM), which demonstrate the use of the DPDK LPM and Hash libraries.

Figure 1 shows a block diagram of the L3fwd application set up to forward packets from a traffic generator using two ports.


Figure 1. The L3fwd application set up to forward packets from a traffic generator.

Longest prefix match (LPM) is a table search method, typically used to find the best route match in IP forwarding applications. The L3fwd application statically configures a set of rules and loads them into an LPM object at initialization time. By default, L3fwd has a statically defined destination LPM table with eight routes, as shown in Table 2.


Table 2. Default LPM routes in L3fwd.

L3fwd uses the IPv4 destination address of the packet to identify its next hop; i.e., the output port ID from the LPM table. It can also route based on IPv6 addresses (from DPDK 17.05).
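To illustrate longest-prefix matching itself, here is a toy Python sketch (illustration only, not the DPDK LPM library; the route entries are hypothetical):

import ipaddress

# prefix -> output port (hypothetical entries)
ROUTES = [
    (ipaddress.ip_network('1.1.1.0/24'), 0),
    (ipaddress.ip_network('2.1.1.0/24'), 1),
    (ipaddress.ip_network('1.1.0.0/16'), 2),
]

def lpm_lookup(dst_ip):
    # among all matching prefixes, the longest one wins
    matches = [(net.prefixlen, port)
               for net, port in ROUTES
               if ipaddress.ip_address(dst_ip) in net]
    return max(matches)[1] if matches else None

# '1.1.1.5' matches both 1.1.0.0/16 and 1.1.1.0/24; the /24 wins -> port 0
assert lpm_lookup('1.1.1.5') == 0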

Exact match (EM) is a hash-based table search method used to find the best route match in IP forwarding applications. In EM lookup, the search key is a five-tuple of source IP address, destination IP address, source port, destination port, and protocol. The set of flows used by the application is statically configured and loaded into the hash object at initialization time. By default, L3fwd has a statically defined destination EM table with four routes, as shown in Table 3.


Table 3. Default EM routes in L3fwd.

The next hop, i.e., the output interface for the packet is identified from the EM table entry. EM-based forwarding supports IPv4 and IPv6.
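Conceptually, EM lookup is a hash-table lookup keyed on that five-tuple; a toy Python sketch (not the DPDK rte_hash API; the flow entries are hypothetical):

# (src ip, dst ip, src port, dst port, protocol) -> output port
FLOW_TABLE = {
    ('198.18.0.1', '198.18.1.1', 9, 9, 17): 0,
    ('198.18.1.1', '198.18.0.1', 9, 9, 17): 1,
}

def em_lookup(src_ip, dst_ip, src_port, dst_port, proto):
    # exact match only: any difference in the key misses the table
    return FLOW_TABLE.get((src_ip, dst_ip, src_port, dst_port, proto))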

Building the Application

The L3fwd application can be built as shown below. The environment variables used are described in the DPDK Getting Started guides.

$ export RTE_SDK=/path/to/rte_sdk
$ export RTE_TARGET=x86_64-native-linuxapp-gcc
$ cd $RTE_SDK/examples/l3fwd
$ make clean
$ make

Running the Application

The command line for the L3fwd application has the following options:

$ ./build/l3fwd [EAL options] --
                -p PORTMASK [-P] [-E] [-L]
                --config (port,queue,lcore)[,(port,queue,lcore)]
                [--eth-dest=X,MM:MM:MM:MM:MM:MM]
                [--enable-jumbo [--max-pkt-len PKTLEN]]
                [--no-numa]
                [--hash-entry-num 0x0n]
                [--ipv6]
                [--parse-ptype]

This comprises the EAL parameters, which are common to all DPDK applications, and application-specific parameters.

The L3fwd app uses the LPM as the default lookup method. The lookup method can be changed with a command-line option at runtime:

-E: selects the Exact Match lookup method.

-L: selects the LPM lookup method.

Here are some examples of running the L3fwd application.

#LPM
$./build/l3fwd -l 1,2 -n 4 -- -p 0x3 -L --config="(0,0,1),(1,0,2)"
#EM
$./build/l3fwd -l 1,2 -n 4 -- -p 0x3 -E --config="(0,0,1),(1,0,2)" \
               --parse-ptype

Conclusion

This article describes an overview of a subset of the DPDK sample applications and then goes into more detail on the L3fwd sample application.

About the Author

Bernard Iremonger is a network software engineer with Intel Corporation. His work is primarily focused on the development of the data plane libraries for DPDK. His contributions include conversion of the DPDK documentation to the Sphinx* documentation tool, enhancements to the poll mode drivers (PMDs) to support port hot plug and live migration of virtual machines (VMs) with single root IO virtualization (SR-IOV) virtual functions (VFs), and API extensions to the ixgbe and i40e PMDs for control of the VFs from the physical function (PF).

Unattended Baggage Detection Using Deep Neural Networks in Intel® Architecture


In a world becoming ever more attuned to potential security threats, the need to deploy sophisticated surveillance systems is increasing. An intellectual system that functions as an intuitive “robotic eye” for accurate, real-time detection of unattended baggage has become a critical need for security personnel at airports, stations, malls, and in other public areas. This article discusses inferencing a Microsoft Common Objects in Context (MS-COCO) detection model for detecting unattended baggage in a train station.

1. Evolution of Object Detection Algorithms

Image classification involves predicting the label of an image among predefined labels. This assumes that there is a single object of interest in the image and that it covers a significant portion of the image. Detection is about not only finding the class of the object but also localizing its extent in the image. The object can be anywhere in the image and can be of any size (scale). So object classification is not helpful when there are multiple objects in an image, when the objects are small, or when the exact location of an object in the image is desired.

Traditional methods of detection used block-wise orientation histogram features (SIFT or HOG), which could not achieve high accuracy on standard data sets such as PASCAL VOC. These methods encode only low-level characteristics of the objects and therefore cannot effectively distinguish among the different labels. Methods based on deep learning (convolutional networks) have become the state of the art in object detection in images. Various network topologies have evolved over time, as shown in Figure 1.

Figure 1: Evolution of detection algorithms [1].

2. Installation

2.1 Building and Installing Caffe* Optimized for Intel® Architecture

Caffe can be installed and used with several combinations of development tools and libraries on a variety of platforms. Here we describe the steps to build and install Caffe* optimized for Intel® architecture with the Intel® Math Kernel Library 2017 on Ubuntu*-based systems. Please refer to the repository at https://github.com/intel/caffe.

1. Clone the Caffe optimized for Intel architecture and pull down all the dependencies.

Navigate to the local caffe directory, then copy Makefile.config.example and rename it Makefile.config.

2. Make sure the following lines are uncommented in Makefile.config.

makefile.config

# CPU-only switch (uncomment to build without GPU support)

CPU_ONLY := 1

3. Install OpenCV*.

For computer vision and image augmentation, install the OpenCV 3.2 version.

sudo apt-get install python-opencv

Remember to enable OPENCV_VERSION := 3 in Makefile.config before running make when using OpenCV 3 or higher.

4. Build the local Caffe. 

Navigate to the local caffe directory.

NUM_THREADS=41
make -j $NUM_THREADS

5. Install and load the Python* modules.

make pycaffe
pip install pandas
pip install scipy

    import sys
    CAFFE_ROOT = 'path/to/caffe'
    sys.path.append(CAFFE_ROOT)
    import caffe
    caffe.set_mode_cpu()

3. Solution Architecture and Design

Our solution aims at identifying unattended baggage in public areas like railway stations and airports and then triggering an alarm. Detections are done on surveillance videos using the business rules defined in section 3.2.

Network Topology

Of the different detection techniques mentioned in Figure 1, we chose the Single Shot MultiBox Detector (SSD), optimized for Intel architecture [2]. Researchers report promising performance on both embedded systems and high-end devices, which makes SSD well suited to real-time detection.

Figure 2. Input image and feature maps.

SSD needs only an input image and ground truth (GT) boxes for each object during training. In a convolutional fashion, a small set (four, in our example) of default boxes of different aspect ratios is evaluated at each location in several feature maps of different scales [8 × 8 and 4 × 4 in (b) and (c)] (see Figure 2). SSD leverages ideas from the Faster R-CNN [3] Region Proposal Network (RPN) [4], directly classifying the object inside each prior box instead of just scoring the object confidence.

For each default box, the network predicts both the shape offsets and the confidences for all object categories (c1, c2, ..., cp). At training time, the default boxes are first matched to the ground truth boxes; for example, one default box may be matched to a cat and another to a dog, and these are treated as positives. The remaining default boxes are treated as negatives. The model loss is a weighted sum of the localization loss (for example, Smooth L1) and the confidence loss (for example, Softmax).
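For reference, the overall training objective in the SSD formulation is exactly this weighted sum. With N matched default boxes, it can be written as

L(x, c, l, g) = (1/N) * (L_conf(x, c) + α * L_loc(x, l, g))

where L_conf is the confidence (Softmax) loss, L_loc is the localization (Smooth L1) loss, α is a weighting term chosen by cross-validation, and the loss is set to 0 when N = 0.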

Since our use case involves baggage detection, either the SSD network needs to be trained on different kinds of baggage, or we can use a pretrained model such as SSD300 trained on an MS-COCO data set. We decided to use the pretrained model, which is available for download at https://github.com/weiliu89/caffe/tree/ssd#models

3.1 Design and Scope 

The scope of this use case is limited to detecting baggage that stays unattended for a period of time. Identifying the exact owner and tracking the baggage are beyond its scope.

Because of the large number of boxes generated during model inference, it is essential to perform non-maximum suppression (NMS) efficiently. Using a confidence threshold of 0.01, most boxes can be filtered out; NMS is then applied with a Jaccard overlap threshold of 0.45 per class, keeping the top 400 detections per image. Figure 3 shows the flow diagram for running detection on a surveillance video.

Figure 3. Detection flow diagram.

The surveillance video is broken down into frames using OpenCV at a configurable frame rate. As frames are generated, they are passed to the detection model, which localizes each object in the form of four coordinates (xmin, xmax, ymin, and ymax) and provides a classification score for the possible object classes. By applying the NMS threshold and a confidence threshold, the number of predictions can be reduced to the most reliable ones. OpenCV is used to draw colored rectangular boxes around the detected baggage and person.
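For illustration, the per-class greedy NMS step described above can be sketched as follows (toy Python, not the actual SSD post-processing code):

def iou(a, b):
    # Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, overlap_thresh=0.45):
    # greedy NMS: keep the highest-scoring box, then drop any box whose
    # Jaccard overlap with an already-kept box exceeds the threshold
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap_thresh for j in keep):
            keep.append(i)
    return keep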

3.2 Defining the Business Rules 

Abandoned luggage in our context is defined as an item of luggage that has been abandoned by its owner. Each item of luggage has one owner, and each person owns at most one item of luggage. Luggage includes all types of baggage that can be carried by hand, for example trunks, bags, rucksacks, backpacks, parcels, and suitcases.

The following rules apply to attended and unattended luggage:

  • An item of luggage is owned and attended to by the person who enters the scene with it, up until the point at which the luggage is no longer in physical contact with that person.
  • From that point, the luggage is attended to by the owner ONLY when the owner is within a distance a of 20 inches (the spatial rule). All distances are measured as Euclidean distances.
  • A luggage item is unattended when the owner is farther than a distance b from the luggage (where b ≥ a). In this case the system applies the spatio-temporal rule to detect whether the item has been abandoned (triggering an alarm event).
  • The spatio-temporal rule determines abandonment: an item of luggage is considered abandoned when it has been left unattended by its owner for t consecutive seconds, during which time the owner has not re-attended to the luggage, nor has the luggage been attended to by a second party (instigated by physical contact, in which case a theft/tampering event may be raised). The image below (Figure 7) shows an item of luggage left unattended for t (=10) seconds, at which point the alarm event is triggered. Here we relate the time t to the frame rate f of the input video: if the video has f frames per second, t seconds corresponds to t*f frames. In short, a bag that has been unattended in t*f consecutive frames triggers the alarm (a minimal sketch of this rule follows this list).
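The abandonment check can be pictured with a small sketch (hypothetical Python helper; it assumes the owner-to-bag Euclidean distance has already been computed per frame, in inches, as in section 3.3):

ATTENDED_DIST = 20        # a: spatial rule threshold, in inches
FPS = 25                  # f: frames per second of the input video (assumed)
T_SECONDS = 10            # t: consecutive unattended seconds before the alarm

unattended_frames = 0

def update_alarm(owner_bag_distance):
    # Call once per frame; returns True when the alarm should fire.
    global unattended_frames
    if owner_bag_distance > ATTENDED_DIST:
        unattended_frames += 1      # owner away from the bag this frame
    else:
        unattended_frames = 0       # owner (or second party) re-attended
    return unattended_frames >= T_SECONDS * FPS   # t*f consecutive frames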

3.3 Inferencing the MS-COCO Model

Implementation or inferencing is done using Python 2.7.6 and OpenCV 3.2. The following steps are performed (code snippets are included for reference):

  1. Read the input video as follows:

    import os
    import cv2
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import caffe
    CAFFE_ROOT = '/home/979648/SSD/caffe/'
    # -> Reading the video file from the input directory
    TEST_VIDEO = cv2.VideoCapture(os.getcwd() + '/InputVideo/SurveillanceVideo.avi')
    MODEL_DEF = 'deploy.prototxt'
  2. Load the network architecture.

    net = caffe.Net(MODEL_DEF, MODEL_WEIGHTS,caffe.TEST)
  3. Read the video by frame and inference each frame against the model to obtain a detection and classification score.

    success, image = TEST_VIDEO.read()
    if (success):
        refObj = None
        imageToNet = cv2.resize(image, (300, 300))
        image_convert = np.swapaxes(np.swapaxes(imageToNet, 1, 2), 0, 1)
        net.blobs['data'].data[...] = image_convert
        # Forward pass.
        detections = net.forward()['detection_out']
    
        # Parse the outputs.
        det_label = detections[0, 0, :, 1]
        det_conf = detections[0, 0, :, 2]
        det_xmin = detections[0, 0, :, 3]
        det_ymin = detections[0, 0, :, 4]
        det_xmax = detections[0, 0, :, 5]
        det_ymax = detections[0, 0, :, 6]
    
        # Get detections with confidence higher than 0.6.
        top_indices = [i for i, conf in enumerate(det_conf) if conf >= CONFIDENCE]
    
        top_conf = det_conf[top_indices]
    
        top_label_indices = det_label[top_indices].tolist()
        top_labels = get_labelname(labelmap, top_label_indices)
        top_xmin = det_xmin[top_indices]
        top_ymin = det_ymin[top_indices]
        top_xmax = det_xmax[top_indices]
        top_ymax = det_ymax[top_indices]
    
        colors = plt.cm.hsv(np.linspace(0, 1, 21)).tolist()
    
        currentAxis = plt.gca()
        # print('Detected Size : ', top_conf.shape[0])
    
        detectionDF = pd.DataFrame()
        if (top_conf.shape[0] != 0):
            for i in xrange(top_conf.shape[0]):
                xmin = int(round(top_xmin[i] * image.shape[1]))
                ymin = int(round(top_ymin[i] * image.shape[0]))
                xmax = int(round(top_xmax[i] * image.shape[1]))
                ymax = int(round(top_ymax[i] * image.shape[0]))
                score = top_conf[i]
                label = int(top_label_indices[i])
                label_name = top_labels[i]
                display_txt = '%s: %.2f' % (label_name, score)
                detectionDF = detectionDF.append(
                        {'label_name': label_name, 'score': score, 'xmin': xmin, 'ymin': ymin, 'xmax': xmax, 'ymax': ymax},
                        ignore_index=True)
    
        detectionDF = detectionDF.sort('score', ascending=False)
  4. To calculate the distance between objects in an image, a reference object has to be used. A reference object has two main properties:

       a) The dimensions of the object in some measurable unit, such as inches or millimeters. In this case we consider the dimensions to be in inches.

       b) We can easily find and identify the reference object in our image.

    Also, an approximate width of the reference object has to be assumed. In this case we assume the width of the suitcase (args['width']) to be 27 inches. (The midpoint() helper used in the snippet below is sketched after this list.)

    $ pip install imutils

    from scipy.spatial import distance as dist

    if refObj is None:
        # unpack the ordered bounding box, then compute the
        # midpoint between the top-left and bottom-left points,
        # followed by the midpoint between the top-right and
        # bottom-right points
        (tl, tr, br, bl) = box
        (tlblX, tlblY) = midpoint(tl, bl)
        (trbrX, trbrY) = midpoint(tr, br)

        # compute the Euclidean distance between the midpoints,
        # then construct the reference object
        D = dist.euclidean((tlblX, tlblY), (trbrX, trbrY))
        refObj = (box, (cX, cY), D / args["width"])
        continue
  5. Once the reference object is obtained, the distance between the reference object and the other objects in the image is calculated. The business rules are applied, and then the appropriate alarm will be triggered. In this case, a red box will be highlighted on the object.

    if refObj != None:
        D = dist.euclidean((objBotX, objBotY), (int(tlblX), int(tlblY))) / refObj[2]
        (mX, mY) = midpoint((objBotX, objBotY), (tlblX, tlblY))

        # apply the spatio-temporal rule, then highlight
        # the object in green, yellow, or red accordingly

  6. Save the processed images, and then append them to the output video.
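The midpoint() helper assumed in steps 4 and 5 is not defined in the snippets above; a minimal sketch of the assumed implementation:

def midpoint(ptA, ptB):
    # midpoint of two (x, y) points, used both when constructing the
    # reference object and when measuring the owner-to-bag distance
    return ((ptA[0] + ptB[0]) * 0.5, (ptA[1] + ptB[1]) * 0.5)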

4. Experimental Results

The following detections (see Figures 4 to 7) were obtained when the inference use case was run on a sample YouTube* video available at https://www.youtube.com/watch?v=fpTG4ELZ3bE

Figure 4: Person enters the scene with the baggage, which is currently safe (highlighted in green).

Figure 5: The owner is moving away from the baggage.

Figure 6: The system raises a warning signal.

Figure 7: Owner is almost out of the frame, and the system raises a video alarm (blinking in red).

5. Conclusion and Future Work

We observed that the system detects baggage accurately in medium- to high-quality images. The system is also capable of detecting more than one bag in the case of multiple owners. However, the system failed to detect the baggage in a low-quality video. The distance calculation does not account for focal length, camera angle, or the image plane, so the current calculation logic has its limitations. The current system is also not capable of tracking the baggage.

The model was inferenced using the Intel® Xeon® processor E5-2699 v4 @ 2.20 GHz with 22 cores and 64 GB of free memory. Future work will include enhancing the current use case by identifying the owner of the baggage and tracking the baggage. Videos with different angles and focal lengths will also be inferenced to judge the effectiveness of the system. The next phase of our work will also consider parallelizing the inference model.

6. References and Links

Power System Infrastructure Monitoring Using Deep Learning on Intel® Architecture


List of Abbreviations

Abbreviation | Expanded Form
DL | deep learning
LSD | line segment detector
UAV | unmanned aerial vehicle
GPU | graphics processing unit

Abstract

The work in this paper evaluates the performance of Intel® Xeon® processor powered machines for running deep learning on the GoogleNet* topology (Inception* v3). The functional problem tackled is the identification of power system components such as pylons, conductors, and insulators from the real-world video footage captured by unmanned aerial vehicles (UAVs) or commercially available drones. By conducting multiple experiments we tried to derive the optimal batch size, iteration count, and learning rate for the model to converge.

Introduction

Recent advances in computer-aided visual object recognition, namely the application of deep learning, have made it possible to solve a wide array of real-world problems that previously were intractable. In this work, we present a novel method for detecting the components of power system infrastructure such as pylons, conductor cables, and insulators.

The original implementation of this algorithm took advantage of the power of the NVIDIA* graphics processing unit (GPU) during training and detection. The current work primarily focuses on implementing the algorithm on TensorFlow* CPU mode and executing it over Intel® Xeon® processors.

During execution, we will record performance metrics across the different CPU configurations.

Environment Setup

Hardware Setup

Table 1. Intel® Xeon® processor configuration.

Intel Xeon processor

Model Name: Intel® Xeon® processor E5-2699 v4 @ 2.20GHz

Core(s) Per Socket: 22
RAM (free): 123 GB

OS: Ubuntu* 16.1

Software Setup

  1. Python* Setup

    The experiment is tested on Python* version 2.7.x. Verify the version as follows:


    Figure 1. Verify Python* version.

  2. TensorFlow* Setup
    1. Install TensorFlow using pip: “$ pip install tensorflow”. By default, this installs the latest wheel for your CPU architecture. Our experiment is built and tested on TensorFlow version 1.0.x.
    2. Verify the installation as shown in Figure 2:


    Figure 2. Verify TensorFlow setup.

  3. Inception* Model

    The experiments detailed in the subsequent sections employ the transfer learning technique to speed up the entire process. For this purpose, we used a pretrained GoogleNet* model, namely Inception* v3. The details of the transfer learning process are explained in the subsequent sections.

    Download the Inception v3 model from the following link: http://download.tensorflow.org/models/image/imagenet/inception-2015-12-05.tgz

  4. TensorBoard*

    We use TensorBoard* in our experiments to visualize the progress and the results of individual experiment runs.

    TensorBoard is installed along with TensorFlow. After installing TensorFlow, enter the following command from the bash script to ensure that TensorBoard is available:

    “ $ tensorboard --help ”

Solution Design

The entire solution is divided into three stages. They are:

  • Data Preprocessing
  • Model Training
  • Inference


Figure 3. High-level solution design.

Data Preprocessing

The images used for training the model are collected through aerial drone missions carried out in the field. The images collected vary in resolution, aspect, and orientation, with respect to the object of interest.

The entire preprocessing pipeline is built using OpenCV* 2 (Python implementation). The high-level objective of preprocessing is to convert the raw, high-resolution drone images into a labeled set of image patches of size 32 x 32, which is used for training the deep learning model.

The various processes involved in the preprocessing pipeline are as follows:

  • Image annotation
  • Generating binary masks
  • Creating labeled image patches

The individual processes involved in the pipeline are detailed in the following steps:

Step 1: Image annotation.

Those experienced in the art of building and training convolutional neural network (CNN) architectures will quickly relate to the image annotation task. It involves manually labeling the objects within your training image set. In our experiment, we relied on the Python tool LabelImg* for annotation. The tool outputs the object coordinates in XML format for further processing.


Figure 4. Image without annotation.


Figure 5. Image with annotation overlay.

The preceding images depict a typical annotation activity carried out on the raw images.

Step 2: Generating binary masks.

Binary masks refer to the mode of image representation where we depict either the presence or absence of an object. Hence, for every raw image, we generate individual binary masks corresponding to each of the labels available. The binary masks so created are used in the steps that follow for actually labeling the image patches. This idea is depicted in the following images. In the current implementation, the mask generation process is developed using Python OpenCV.


Figure 6. Generating binary masks from the raw image.
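A minimal sketch of this step, assuming the bounding boxes for a given label have already been parsed from the LabelImg XML output (the helper name is hypothetical):

import cv2
import numpy as np

# Hypothetical sketch: rasterize the annotated bounding boxes for one label
# into a binary mask (XML parsing is assumed to be done elsewhere).
def make_mask(image_shape, boxes):
    mask = np.zeros(image_shape[:2], dtype=np.uint8)
    for (xmin, ymin, xmax, ymax) in boxes:
        # thickness=-1 fills the rectangle with white (255)
        cv2.rectangle(mask, (xmin, ymin), (xmax, ymax), 255, thickness=-1)
    return mask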

Step 3: Creating labeled image patches.

Once the binary mask is generated, we run a 32 x 32 filter over the raw image and compare the activations (white pixel count) obtained in the various masks for the corresponding filter position.


Figure 7. Creating labeled image patches.

If the activation in a particular mask is found to be above the defined threshold of 5 percent of patch area (0.05*32*32), we label the patch to match the mask’s label. The output of this activity is a set of 32 x 32 image patches partitioned into multiple directories based on their labels. The forthcoming model training phase of the experiment directly accesses this partitioned directory structure for label-specific training images.


Figure 8. Preprocessing output directory structure.

Please note that in the above-described patch generation process, the total number of patches generated varies, depending on other variables such as size of the filter (32 x 32 in this case), resolution of input images, and the activation threshold, while comparing with binary masks.
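For illustration, the labeling loop can be sketched as follows (a toy version assuming a non-overlapping 32 x 32 window; the helper names and the background fallback label are hypothetical):

import cv2

PATCH = 32
THRESHOLD = 0.05 * PATCH * PATCH   # 5 percent of the patch area

def label_patches(image, masks):
    # masks: dict mapping a label (e.g., 'pylon') to its binary mask
    patches = []
    height, width = image.shape[:2]
    for y in range(0, height - PATCH + 1, PATCH):
        for x in range(0, width - PATCH + 1, PATCH):
            best_label, best_count = 'background', THRESHOLD
            for label, mask in masks.items():
                # white-pixel activation of this mask under the window
                count = cv2.countNonZero(mask[y:y + PATCH, x:x + PATCH])
                if count > best_count:
                    best_label, best_count = label, count
            patches.append((image[y:y + PATCH, x:x + PATCH], best_label))
    return patches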

Network Topology and Model Training

Inception v3 Model


Figure 9. Inception V3 topology.

Inception v3 is a revolutionary deep learning architecture in the GoogleNet* family, which achieved state-of-the-art performance in ILSVRC14 (ImageNet* Large Scale Visual Recognition Challenge 2014).

The most striking advantage of Inception over other topologies is the depth of feature learning achieved while keeping memory and CPU cost nearly on a par with those topologies. The architecture improves performance by reducing the effective sparsity of the data structures, converting them into dense matrices through clustering. Architecturally, this sparse-to-dense conversion is achieved with telescoped convolutions (1 x 1 to 3 x 3 to 5 x 5), an approach commonly referred to as network-in-network.

Transfer Learning on Inception

In our experiments we applied transfer learning to a pretrained Inception model (trained on ImageNet data). The transfer learning approach initializes the last fully connected layer with random weights (or zeros), and when the network is trained on the new data (in our case, the power system infrastructure images), these weights are readjusted. The underlying idea is that the early layers of the topology have already learned base features such as edges and curves, and this learning can be reused for the new problem with the new data. The final fully connected layers, however, are tuned to the specific labels they were originally trained on, and therefore need to be retrained on the new data.

This is achieved through the Python API, as follows:

  1. Add a new hidden layer with Rectified Linear Unit (ReLU) activation:

    import tensorflow as tf

    hidden_units_layer_1 = 1024
    # Weights and biases for the new fully connected layer on top of the bottleneck
    layer_weights_fc1 = tf.Variable(
        tf.truncated_normal([BOTTLENECK_TENSOR_SIZE, hidden_units_layer_1], stddev=0.001),
        name='fc1_weights')
    layer_biases_fc1 = tf.Variable(tf.zeros([hidden_units_layer_1]), name='fc1_biases')
    hidden_layer_1 = tf.nn.relu(
        tf.matmul(bottleneck_input, layer_weights_fc1, name='fc1_matmul') + layer_biases_fc1)

  2. Add a new softmax layer:

    # Final fully connected layer mapping the hidden layer to the class scores
    layer_weights_fc2 = tf.Variable(
        tf.truncated_normal([hidden_units_layer_1, class_count], stddev=0.001),
        name='final_weights')
    layer_biases_fc2 = tf.Variable(tf.zeros([class_count]), name='final_biases')

    logits = tf.matmul(hidden_layer_1, layer_weights_fc2,
                       name='final_matmul') + layer_biases_fc2
    final_tensor = tf.nn.softmax(logits, name=final_tensor_name)
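The snippets above only define the new layers; the retraining step itself is not reproduced in this paper. Continuing the same TensorFlow 1.x snippet, it might look roughly as follows, where the ground-truth placeholder and the learning rate are assumptions for illustration:

    ground_truth_input = tf.placeholder(tf.float32, [None, class_count],
                                        name='GroundTruthInput')
    # Cross-entropy between the new softmax output and the patch labels
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=ground_truth_input,
                                                logits=logits))
    # Only the newly added layers are trained; the pretrained Inception weights
    # stay frozen behind the cached bottleneck tensors
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)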

Testing and Inference

Testing is done on a 90:10 train/test split of the entire image set. The test images go through the same patch generation process that was used during the training phase, and the resulting patches are sent to the trained model for detection.


Figure 10. Result of model inference overlaid on raw image.

After detection, the patches are passed through a line segment detector (LSD) for the final localization.


Figure 11. Result of running LSD.
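One way to experiment with this localization step in Python is OpenCV's built-in line segment detector, sketched below. This is only an illustration: the input file name is hypothetical, and it assumes an OpenCV build that includes createLineSegmentDetector (available since OpenCV 3.0).

import cv2

img = cv2.imread('detected_region.png')            # hypothetical input region
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
lsd = cv2.createLineSegmentDetector()
lines, widths, precisions, nfa = lsd.detect(gray)  # segments as (x1, y1, x2, y2)
overlay = lsd.drawSegments(img, lines)             # draw segments on the image
cv2.imwrite('lsd_overlay.png', overlay)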

Results

The different iterations of the experiments involve varying batch sizes and iteration counts.

During the experiments, we modified the preprocessing logic in order to reduce the time consumed in preprocessing; therefore, metrics for the different variants of the preprocessing logic were also captured.

We also observed that in the Inception model, bottleneck tensors are cached during the initial run, so training time on subsequent runs is much lower. The final training results on the Intel® Xeon® processor are as follows:

Table 2. Experiment results.

Note: Inference time is inclusive of the preprocessing (patch) operation along with the time for the actual detection on the trained model.

Conclusion and Future Work

The functional use case tackled in this paper involved the detection and localization of power system components. The use case could be further expanded to identifying power system components that are damaged.

The training and inference times observed could be further improved by using an Intel-optimized version of TensorFlow5.

References and Links

The references and links used to create this paper are as follows:

1. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, Rethinking the Inception Architecture for Computer Vision (2015).

2. Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016).

3. TensorFlow repository, https://github.com/tensorflow/tensorflow.

4. LabelImg – Python tool for image annotation, https://github.com/tzutalin/labelImg.

5. Optimized TensorFlow – TensorFlow Optimizations on Modern Intel Architecture, https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture.


Python mpi4py on Intel® True Scale and Omni-Path Clusters


Python users of the mpi4py package who leverage its capabilities for distributed computing on supercomputers with Intel® True Scale or Intel® Omni-Path interconnects might run into issues with the default configuration of mpi4py.

By default, the mpi4py package uses matching probes (MPI_Mprobe) rather than regular MPI_Recv operations for the receiving function recv(). These matching probes, introduced in the MPI 3.0 standard, are not supported by all fabrics, which may lead to a hang in the receiving function.

Therefore, users are recommended to use the OFI fabric instead of TMI on Omni-Path systems. For Intel® MPI, the configuration could look like the following environment variable setting:

I_MPI_FABRICS=ofi

Users utilizing True Scale or Omni-Path systems via the TMI fabric might alternatively switch off the use of matching probe operations within the mpi4py recv() function.

This can be established via

mpi4py.rc.recv_mprobe = False

right after importing the mpi4py package.
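A minimal sketch of the required ordering is shown below; the point-to-point exchange is only illustrative.

import mpi4py
mpi4py.rc.recv_mprobe = False   # must be set before the MPI submodule is imported
from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    comm.send({'msg': 'hello'}, dest=1, tag=0)
elif comm.Get_rank() == 1:
    data = comm.recv(source=0, tag=0)   # now backed by MPI_Recv rather than MPI_Mprobe
    print(data)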


Twisted Pixel brings Hollywood A-list voices to VR


The original article was published by Intel Game Dev on VentureBeat*: Twisted Pixel brings Hollywood A-list voices to VR. Get more game dev news and related topics from Intel on VentureBeat.

Promo image for the game Wilson's Heart.

Launching a game into the fledgling VR world might sound to some like a risk: do you become a pioneering early adopter, or languish as an also-ran in a field of too few units? For Twisted Pixel Games and Chief Creative Officer Josh Bear, it wasn't an early-technology play, as might be suggested by a resume that includes Gunslinger, an early support showcase for the fateful Kinect on Xbox 360.

“It wasn’t the intention to focus the studio around gaming technology,” says Bear, “but we got to see Project Natal — as Kinect was initially known — and Microsoft was incredibly excited about it, so we got to make Gunslinger.”

The studio, with several games under its belt over its ten-year tenure, had always wanted to make Wilson’s Heart, which is out now in VR. “We had an early prototype that we had called ‘The Hands Game’ working with a gamepad with a first-person narrative. It used your hands to pick up objects like guns, like a first-person shooter…but then when the Oculus guys showed us that technology, with its Touch controller, we were all like ‘oh man, this just makes sense for this game,’” explains Bear.

Changing the core technology was not a straightforward process, however. What had begun life as a PC game with first-person sensibilities now had to adapt its functions to fit a whole new ballgame. “We basically started over when we had VR,” says Bear, “we even had to reassess the black and white graphics aesthetic: Would it be weird in VR? Will it work or feel unreal to gamers?”

“We had to do a lot of work with the controllers to make sure the hands felt good. They are so much a part of the game that we had to make sure they felt, acted, and looked cool,” he adds.

Above: The gameplay blends psychological horror with puzzles, and not many "gotcha" horror jump moments.

Questions and challenges will continue to be answered with work, ideas, and understanding the new paradigms as VR evolves. Fundamentals of the game experience are affected: how long is too long, how short is too short? Nobody will currently build a 100-hour VR game (yet) but length of the total experience is one of the challenges that Bear’s team had to grapple with.

“Some people took six hours, some eight, and we saw ten or 12. But we have to be cognizant of that because you’re going to want to get out of the helmet at some point no matter how much you like this stuff,” says Bear.

Do you feel dizzy?

Of course, one of VR’s major challenges impacting experiences at this stage of the technology’s development is motion sickness. One design change that was required in switching from PC to VR was handling locomotion. “We didn’t want players getting sick, but we wanted to keep people in it as much as possible. So, when you warp in the game and the screen goes black you can still hear Wilson breathing or his footsteps or sound effects to push you along,” says Bear, “so we had to think about everything very differently in VR.”

An early version of the game allowing players to walk around anywhere and go where they want — and, in fact, one of the most requested features emerging from players experiencing the game now — caused problems for people.

“Hopefully we — or someone — will figure out the way to make that happen, but it’s the reason we went with the teleportation system.” The result? “We’ve had no issues with motion sickness at all,” says Bear.

Bear adds, “Oculus has been great about taking this as a top concern. But it’s hard because some people love it so much they just want to walk around, but they could pull the wires out and bust their system. So yeah, it’s a big challenge.”

The game experience itself is set in the 40s post-WWII and is one of psychological horror borne of Bear’s deep passion for the old Universal monster movies. “My favorite was The Wolfman, and Abbott and Costello Meet Frankenstein, Boris Karloff, all that…and throw in some Twilight Zone. We wanted to see if we could do our own homage to those movies.”

Those movies all had defining actors making the roles their own and becoming an iconic part of the lore. Wilson's Heart stars a top-tier, all-star cast of voice actors that is a real standout addition to the game.

Above: The cast, from Peter Weller to Michael B. Jordan to Rosario Dawson, Alfred Molina, and Kurtwood Smith, is outstanding.

Hollywood heavyweights

“I really wanted Peter Weller for Wilson…really love Robocop— who doesn’t — but he was great in Naked Lunch, Buckaroo Banzai, and others. So we flew to New Mexico, showed him a brief demo of the game and he was super-gracious, loved the concept, and the art, and just said he would do it,” said Bear.

For the other roles, he looked at the requirements of the character and identified ideal actors. “We got all our first picks,” he says. That includes Michael B. Jordan, Alfred Molina, Rosario Dawson, and Paul Reubens.

A kick for any Robocop fan is that in addition to headliner Peter Weller, Kurtwood Smith also provides voice talent. “Although we never had them in the same room, to have those two was amazing,” says big Robocop fan, Bear, “and Kurtwood even threw in a couple of Robocop lines all on his own, and I tried to keep calm about it in the VO room! And we did use one in the game.”

How is it working with top-tier talent for a videogame? "They were right on point," says Bear, "they would improve on the lines we'd written on the fly…when you have that kind of talent, it makes things a lot easier."

But come on…what were they actually like?

“Some of the nicest people I’ve ever met!” says Bear.

Garnering positive reception from press and community alike, Bear understands the nature of the current VR beast. “To be fair, you need the rig, you need a powerful PC to run it, and you need the controllers. So, I do think that the game will have a long tail as there is more adoption,” he says.

For the 30-plus developers at Twisted Pixel, it’s not all about VR, however. “As much as we love VR, we love PC and console stuff. It’s more about the concept and what platform fits it best, rather than just trying to cram something onto VR.”

Whatever the next step, Bear gets the nature of the business: "We have the 3- to 5-year outlook, of course," he offers, "but as you know, that often goes to shit!"

Getting Started in Linux with Intel® SDK for OpenCL™ Applications


This article is a step-by-step guide to quickly get started developing with the Intel® SDK for OpenCL™ Applications using the Linux* SRB5 driver package.

  1. Install the driver
  2. Install the SDK
  3. Set up Eclipse

For SRB4.1 instructions, please see https://software.intel.com/en-us/articles/sdk-for-opencl-gsg-srb41.

Step 1: Install the driver

This script covers the steps needed to install the SRB5 driver package on Ubuntu 14.04, Ubuntu 16.04, CentOS 7.2, and CentOS 7.3.

To use it:

$ mv install_OCL_driver.sh_.txt install_OCL_driver.sh
$ sudo su
$ ./install_OCL_driver.sh

This script automates downloading the driver package, installing prerequisites and user-mode components, patching the 4.7 kernel, and building it. 

You can check your progress with the System Analyzer utility. If successful, you should see smoke test results like the following at the bottom of the system analyzer output:

--------------------------
Component Smoke Tests:
--------------------------
 [ OK ] OpenCL check:platform:Intel(R) OpenCL GPU OK CPU OK

Experimental installation without kernel patch or rebuild:

If you are using Ubuntu 16.04 with the default 4.8 kernel, you may be able to skip the kernel patch and rebuild steps. This configuration works fairly well, but several features (for example, OpenCL 2.x device-side enqueue, shared virtual memory, and VTune™ GPU support) require the patches. Installation without patches has been smoke-test validated only to confirm that it is viable for experimental use; it is not fully supported or certified.

Step 2: Install the SDK

This script will set up all prerequisites for a successful SDK install on Ubuntu.

$ mv install_SDK_prereq_ubuntu.sh_.txt install_SDK_prereq_ubuntu.sh
$ sudo su
$ ./install_SDK_prereq_ubuntu.sh

After this, run the SDK installer.

Here is a kernel to test the SDK install:

__kernel void simpleAdd(
                       __global int *pA,
                       __global int *pB,
                       __global int *pC)
{
    const int id = get_global_id(0);
    pC[id] = pA[id] + pB[id];
}                               

Check that the command-line compiler ioc64 is installed with:

$ ioc64 -input=simpleAdd.cl -asm

(expected output)
No command specified, using 'build' as default
OpenCL Intel(R) Graphics device was found!
Device name: Intel(R) HD Graphics
Device version: OpenCL 2.0
Device vendor: Intel(R) Corporation
Device profile: FULL_PROFILE
fcl build 1 succeeded.
bcl build succeeded.

simpleAdd info:
	Maximum work-group size: 256
	Compiler work-group size: (0, 0, 0)
	Local memory size: 0
	Preferred multiple of work-group size: 32
	Minimum amount of private memory: 0

Build succeeded!
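Beyond the offline compiler check, you can optionally smoke-test the runtime from Python with the third-party pyopencl package. This is not part of the SDK workflow; it is only a convenient sketch, assuming pyopencl is installed and simpleAdd.cl is in the working directory.

import numpy as np
import pyopencl as cl

a = np.arange(16, dtype=np.int32)
b = np.arange(16, dtype=np.int32)

ctx = cl.create_some_context()   # choose the Intel(R) OpenCL platform if prompted
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
c_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, open('simpleAdd.cl').read()).build()
prg.simpleAdd(queue, a.shape, None, a_buf, b_buf, c_buf)

c = np.empty_like(a)
cl.enqueue_copy(queue, c, c_buf)
assert (c == a + b).all()        # the kernel ran correctly on the device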


Step 3: Set up Eclipse

Intel® SDK for OpenCL™ Applications works with Eclipse* Mars and Neon.

After installing Eclipse, copy the CodeBuilder*.jar file from the SDK eclipse-plug-in folder to the Eclipse dropins folder.

$ cd eclipse/dropins
$ find /opt/intel -name 'CodeBuilder*.jar' -exec cp {} . \;

Start Eclipse.  Code-Builder options should be available in the main menu.

Introduction to VR: Creating a First-Person Player Game for the Oculus Rift*


View PDF [12,544 KB]

Introduction

This article introduces virtual reality (VR) concepts and discusses how to integrate a Unity* application with the Oculus Rift*, add an Oculus first-person player character to the game, and teleport the player to the scene. This article is aimed at an existing Unity developer who wants to integrate Oculus Rift into the Unity scene. The assumption is that the reader already has the setup to create a VR game for Oculus: an Oculus-ready PC and the Oculus Rift and touch controllers.

Development tools

  • Unity 5.5 or greater
  • Oculus Rift and touch controllers

Creating a Terrain in Unity

Multiple online resources are available that explain how to create a basic terrain in Unity. I followed the Unity manual. Adding lots of trees and grass details to the scene may have a performance impact, causing the frames per second (FPS) to decrease significantly. Make sure you have the optimal number of trees, if required, and set the min height/max height and min width/max width of the grass as low as possible to lessen the impact on the FPS. In order to improve the VR experience in your game, a minimum of 90 FPS is recommended.

Setting up the Oculus Rift

This section explains how to set up the Oculus Rift, place the Oculus first-person character in the scene, and teleport the player from one scene to another.

To set up Oculus, follow the downloadable instructions from the Oculus website.

Once you complete the setup, make sure that Oculus is integrated with your machine, and then do the following:

  1. Download the Oculus utilities for Unity 5.
  2. Import the Unity package into your Unity project.
  3. Remove the Main Camera object from your scene. It’s unnecessary because the Oculus OVRPlayerController prefab already comes with a custom VR camera.
  4. Navigate to the Assets/OVR/Prefabs folder.
  5. Drag and drop the OVRPlayerController prefab into your scene. Alternatively, you can work with the OVRCameraRig prefab. For a description of these prefabs and an explanation of their differences, go to this link. The sample shown below was implemented using OVRPlayerController.

Adjust the headset for the best fit and so you have clear visibility of the scene. Adjust the settings as necessary and according to the instructions provided while setting up the Oculus Rift. You can click the Stats button to observe the scene’s FPS. If the FPS is less than the recommended 90 FPS, decrease the details in your Unity scene or troubleshoot to find out what parts of the scene are consuming more CPU/GPU and why that is impacting the FPS.

Now let’s look at how we can interact with the objects in your scene with the Oculus touch controllers. Let’s add a shotgun model to the scene so that the player can attack the enemies. You can either create your own model or download it from the Unity Asset Store. I downloaded the model from the Unity store.

  1. Make this model the child of the RightHandAnchor of the OVRPlayerController as shown below.
  2. Adjust the size and orientation of the model so that it fits the scene and your requirements.

Now once you move the right touch controller, you are directly interacting with the shotgun in the scene.

Adding the Code to Work with the Oculus Touch Controller

In the code snippet shown below, we check the OVRInput and based on the button pressed in the Oculus touch controller, we are doing one of three things:

  • If the Primary Index Trigger button (that is, the right controller’s trigger button) is pressed, we call the RayCastShoot function with the Teleport option set to false. This condition lets the player object fire at the enemies and any other targets that we set up in the scene. We also make sure that we can only fire once within the specified time interval by checking the condition Time.time > nextfire.
  • If the A button on the controller is pressed, we call the RayCastShoot function, setting the Teleport option to true. This option allows the player to teleport to different points in the terrain. The teleport points can be either predefined points set in the scene, or the player can be teleported directly to the hit point. It is up to the developer to decide, based on the requirements of the game, where in the scene to teleport the player.
  • If the B button of the controller is pressed at any time in the game, the position of the player is reset to its original position.
void Update () {

        if (OVRInput.Get(OVRInput.Button.PrimaryIndexTrigger) && (Time.time > nextfire))
        {
            //If the Primary Index trigger is pressed on the touch controller we fire at the targets
            nextfire = Time.time + fireRate;
            audioSource.Play();

            // Teleporting is set to false here
            RayCastShoot(false);
        }
        else if (OVRInput.Get(OVRInput.RawButton.A) && (Time.time > nextfire))
        {
            // Teleporting is set to true when Button A is pressed on the controller
            nextfire = Time.time + fireRate;

            RayCastShoot(true);
        }
        else if (OVRInput.Get(OVRInput.RawButton.B) && (Time.time > nextfire))
        {
            // If Button B is pressed on the controller the player is reset to the original position
            nextfire = Time.time + fireRate;
            player.transform.position = resetPosition;
        }
    }

In the sample below, I added zombies, downloadable from the Unity Asset Store, as the enemies and also added some targets, like rocks and grenades, to add more particle effects, like explosions, rock impacts, and so on, to the scene. I also created a simple animation for the zombie following this tutorial.

Now let’s look at the RayCastShoot function. Physics.Raycast casts a ray from the gunTransform position in the forward direction against the colliders in the scene. The range is specified in the weaponRange variable. If the ray hits something, information about the hit is stored in the hit variable.

RaycastHit hit;

        if (Physics.Raycast(gunTransform.position, gunTransform.forward, out hit, weaponRange))

The RayCastShoot function takes a Boolean value. If the value is true, the function teleports the player; if the value is false, it checks which object in the scene (a zombie, rock, grenade, and so on) was hit and destroys the enemies and the targets accordingly.

The first thing we do concerns the zombie object. We add a Rigidbody physics component and set its isKinematic value to true. We also add a small script, named Enemy.cs, and attach it to the enemy object. The script shown below keeps track of the enemy’s life. Each call to the enemyhit function (that is, each time we fire at the enemy) reduces the enemy’s life by one. After the enemy is shot five times, it is destroyed.

In the RayCastShoot function we get a handle to the zombie object to determine whether we are actually firing at the zombie.

Enemy enemy = hit.collider.GetComponentInParent<Enemy>();

If the enemy object is not null, we call the enemyhit function to reduce its life by one. We also instantiate the blood effect prefab, as shown below, each time the zombie is hit. We check the enemy’s full life, and if it is less than or equal to zero, we destroy the zombie object.

//Enemy.cs
public class Enemy : MonoBehaviour {

    //public GameObject explosionPrefab;
    public int fullLife = 5;


    public void enemyhit(int life)
    {
        //subtract life  when Damage function is called
        fullLife -= life;

        //Check if full life has fallen below zero
        if (fullLife <= 0)
        {
            //Destroy the enemy if the full life is less than or equal to zero
            Destroy(gameObject);

        }
    }

}

// if the hit object is the enemy
//Raycastexample.cs from where we are calling the enemyhit function
if (enemy != null)

            {
                enemy.enemyhit(1);

                //Checks the health of the enemy and resets to  max again
                //Instantiates the blood effect prefab for each hit

                var bloodEffect = Instantiate(bloodPrefab);
                bloodEffect.transform.position = hit.point;

                if (enemy.fullLife <= 0)
                {

                    enemy.fullLife = 5;
                }
            }

If the object we hit is anything other than the zombie, we can identify it through the tags added to the different objects in the scene. For example, we added a tag called “Mud” for the ground, a tag called “Rock” for rocks, and so on. As shown in the code sample below, we compare the tags of the objects we hit and then instantiate the respective prefab effects for those objects.

//If the hit targets are the targets other than the enemy like the mud, Rocks , Grenades on the terrain
else
            {
                var impactEffect = Instantiate(impactEffectPrefab);
                impactEffect.transform.position = hit.point;
                Destroy(impactEffect, 4);

                // If the Target is the ground
                if ((hit.collider.gameObject.CompareTag("Mud")))
                {

                    var mudeffect = Instantiate(mudPrefab);
                    mudeffect.transform.position = hit.point;


                }

                // If the Target is  Rocks
                else if ((hit.collider.gameObject.CompareTag("Rock")))
                {

                    var rockeffect = Instantiate(rockPrefab);
                    rockeffect.transform.position = hit.point;
                }

                // If the Target is the Grenades

                else if ((hit.collider.gameObject.CompareTag("Grenade")))
                {

                    var grenadeEffect = Instantiate(explosionPrefab);
                    grenadeEffect.transform.position = hit.point;
                    Destroy(grenadeEffect, 4);

                }
            }

        }

Teleporting

Teleporting is an important mechanic in VR games; it is recommended so that the user can avoid nausea when moving around the scene. The example below implements a simple teleporting mechanism in Unity. In the code we can either teleport the player to the hit point or create multiple points in the terrain to which the player can be teleported.

  1. Create an empty game object and name it “Teleport.”

  2. Create a tag called “Teleport.”

  3. Assign the Teleport tag to the teleport object as shown below.

  4. Press CTRL+D to duplicate these points and create more teleport points in the scene. Adjust the positions of the points so that they span the terrain. I set the y position to the same value as my OVR player prefab so that the y value of the points matches my camera position.

As per the code below, if teleport is set to true, we get the array of all such points in the teleportPoints variable and randomly pick one of them to teleport the player to.

var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];

Finally, we set the player’s transform position to the new position.

player.transform.position = newPosition.transform.position;
if (teleport)
            {
                //If the player needs to be teleported to the hit point
                // Vector3 newposition = hit.point;
                //player.transform.position = new Vector3(newposition.x, player.transform.position.y, newposition.z);


                //If the player needs to be teleported to the teleport points that are created in the Unity scene. Below code teleports the player
                // to one of the points randomly

                var teleportPoints = GameObject.FindGameObjectsWithTag("Teleport");
                var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];


                player.transform.position = newPosition.transform.position;

                return;
            }

Building Settings and Deploying the VR Application

After you are finished with the game, deploy your VR application for PCs.

  1. Go to File > Build Settings, and then for Target Platform, select Windows.

  2. Go to Edit > Project Settings > Player, and then click the Inspector tab.
  3. Click Other Settings, and then select the Virtual Reality Supported check box.

  4. Compile and then build to get the final VR application.

Conclusion

Creating a VR game is a lot of fun, but it also requires precision. If you are an existing Unity developer and have a game that is not specific to VR, you can integrate it with the Oculus Rift and port it as a VR game. At the end of this article, we list a number of references that focus on best practices in VR.

Below is the complete script for the sample scene discussed in this article.

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class Raycastexample : MonoBehaviour {


    //Audio clip to play
    public AudioClip clip;
    public AudioSource audioSource;

    //rate of firing at the targets
    public float fireRate = .25f;
    // Range to which Raycast will detect the collision
    public float weaponRange = 300f;

    //Prefab for Impacts at the target
    public GameObject impactEffectPrefab;
    //Prefab for Impacts for grenade explosions
    public GameObject explosionPrefab;

    //Prefab at gun transform position
    public GameObject GunfirePrefab;

    //Prefab if the target is the terrain
    public GameObject mudPrefab;
    // Prefab when hits the Zombie
    public GameObject bloodPrefab;

    // prefabs when hits the rocks
    public GameObject rockPrefab;

    // Player transform that is used in teleporting
    public Transform player;
    private float nextfire;

    //transform at the Gun end to show some muzzle effects when firing
    public Transform gunTransform;
    // Position to reset the player to its original position when "B" is pressed on the touch controller
    private Vector3 resetPosition;

    // Use this for initialization
    void Start () {

        // Play the Audio clip while firing
        audioSource = GetComponent<AudioSource>();
        audioSource.clip = clip;
        // Reset position after teleporting to set the position to his original position
        resetPosition = transform.position;

    }

	// Update is called once per frame
	void Update () {
        //If the Primary Index trigger is pressed on the touch controller we fire at the targets
        if (OVRInput.Get(OVRInput.Button.PrimaryIndexTrigger) && (Time.time > nextfire))
        {
            nextfire = Time.time + fireRate;
            audioSource.Play();
            // Teleporting is set to false here
            RayCastShoot(false);
        }
        else if (OVRInput.Get(OVRInput.RawButton.A) && (Time.time > nextfire))
        {

            // Teleporting is set to true when Button A is pressed on the controller
            nextfire = Time.time + fireRate;

            RayCastShoot(true);

        }

        else if (OVRInput.Get(OVRInput.RawButton.B) && (Time.time > nextfire))
        {
            // If Button B is pressed on the controller player is reset to his original position

            nextfire = Time.time + fireRate;
            player.transform.position = resetPosition;

        }

    }

    private void RayCastShoot(bool teleport)
    {
        RaycastHit hit;
        //Casts a ray against the targets in the scene and returns the "hit" object.
        if (Physics.Raycast(gunTransform.position, gunTransform.forward, out hit, weaponRange))
        {

            if (teleport)
            {
                //If the player needs to be teleported to the hit point
                // Vector3 newposition = hit.point;
                //player.transform.position = new Vector3(newposition.x, player.transform.position.y, newposition.z);


                //If the player needs to be teleported to the teleport points that are created in the Unity scene. Below code teleports the player
                // to one of the points randomly

                var teleportPoints = GameObject.FindGameObjectsWithTag("Teleport");
                var newPosition = teleportPoints[Random.Range(0, teleportPoints.Length)];


                player.transform.position = newPosition.transform.position;



                return;
            }

            //Attach the Enemy script as component to the enemy

            Enemy enemy = hit.collider.GetComponentInParent<Enemy>();

            // Muzzle effect of the gun; position the instantiated effect at the gun transform

            var GunEffect = Instantiate(GunfirePrefab);
            GunEffect.transform.position = gunTransform.position;



            // if the hit object is the enemy

            if (enemy != null)

            {
                enemy.enemyhit(1);

                //Checks the health of the enemy and resets to  max again
                //Instantiates the blood effect prefab for each hit

                var bloodEffect = Instantiate(bloodPrefab);
                bloodEffect.transform.position = hit.point;

                if (enemy.fullLife <= 0)
                {

                    enemy.fullLife = 5;
                }
            }

            //If the hit targets are the targets other than the enemy like the mud, Rocks , Grenades on the terrain

            else
            {
                var impactEffect = Instantiate(impactEffectPrefab);
                impactEffect.transform.position = hit.point;
                Destroy(impactEffect, 4);

                // If the Target is the ground
                if ((hit.collider.gameObject.CompareTag("Mud")))
                {
                    Debug.Log(hit.collider.name + ", " + hit.collider.tag);
                    var mudeffect = Instantiate(mudPrefab);
                    mudeffect.transform.position = hit.point;


                }

                // If the Target is  Rocks
                else if ((hit.collider.gameObject.CompareTag("Rock")))
                {

                    var rockeffect = Instantiate(rockPrefab);
                    rockeffect.transform.position = hit.point;
                }

                // If the Target is the Grenades

                else if ((hit.collider.gameObject.CompareTag("Grenade")))
                {

                    var grenadeEffect = Instantiate(explosionPrefab);
                    grenadeEffect.transform.position = hit.point;
                    Destroy(grenadeEffect, 4);

                }
            }

        }
    }
}
//Enemy.cs
using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class Enemy : MonoBehaviour {

    //public GameObject explosionPrefab;
    public int fullLife = 5;

    // Use this for initialization
    void Start () {

	}


    public void enemyhit(int life)
    {
        //subtract life  when Damage function is called
        fullLife -= life;

        //Check if full life has fallen below zero
        if (fullLife <= 0)
        {
            //Destroy the enemy if the full life is less than or equal to zero
            Destroy(gameObject);

        }
    }


}

References for VR on the Intel® Developer Zone

Virtual Reality User Experience Tips from VRMonkey: https://software.intel.com/en-us/articles/virtual-reality-user-experience-tips-from-vrmonkey

Presence, Reality, and the Art of Astonishment in Arizona Sunshine: https://software.intel.com/en-us/blogs/2016/12/01/presence-reality-and-the-art-of-astonishment-in-arizona-sunshine

Combating VR Sickness with User Experience Design: https://software.intel.com/en-us/articles/combating-vr-sickness-with-user-experience-design

Interview with Well Told Entertainment about their Virtual Reality Escape Room Game: https://software.intel.com/en-us/blogs/2016/11/30/interview-with-well-told-entertainment-about-their-virtual-reality-escape-room-game

What is the Next Leap in VR Experiences?: https://software.intel.com/en-us/videos/what-is-the-next-leap-in-vr-experiences

VR Optimization Tips from Underminer Studios: https://software.intel.com/en-us/articles/vr-optimization-tips-from-underminer-studios

VR Optimizations with Intel® Graphics Performance Analyzers: https://software.intel.com/en-us/videos/vr-optimizations-with-intel-graphics-performance-analyzers

Creating Immersive Virtual Worlds Within Reach of Current-Generation CPUs: https://software.intel.com/en-us/articles/creating-immersive-virtual-worlds-within-reach-of-current-generation-cpus

About the Author

Praveen Kundurthy works in the Intel® Software and Services Group. His main focus is on mobile technologies, Microsoft Windows*, virtual reality, and game development.

New Issue of The Parallel Universe is Here: Tuning Autonomous Driving Using Intel® System Studio


Everything old is new again, and that’s just fine with us.

We hope you’ll agree after you read the latest issue of The Parallel Universe, Intel’s quarterly magazine for developers. In this issue, we take a fresh look at what’s come before (OpenMP, MySQL*, Intel® C++ Compiler, vectorization) while looking ahead (autonomous driving applications, edge-to-cloud data compression, the latest programming languages).

Download it and see what’s possible with leading-edge HPC products and practices.

  • Conducting hotspot analysis for autonomous driving applications
  • What’s so unique about the programming language Julia* and why its use doubles every year
  • How vectorization saves the day for an open source application used for large-scale, 3D simulations
  • Plus lots more

Read the new issue

Browse past issues

Subscribe
