GDC 2018 Tech Sessions
Using the Model Optimizer to Convert MXNet* Models
Introduction
NOTE: The OpenVINO™ toolkit was formerly known as the Intel® Computer Vision SDK.
The Model Optimizer is a cross-platform command-line tool that facilitates the transition between the training and deployment environment, performs static model analysis, and adjusts deep learning models for optimal execution on end-point target devices.
The Model Optimizer process assumes you have a network model trained using a supported framework. The typical workflow for deploying a trained deep learning model is outlined below.
A summary of the steps for optimizing and deploying a model that was trained with the MXNet* framework:
- Configure the Model Optimizer for MXNet* (the framework used to train your model).
- Convert an MXNet* model to produce an optimized Intermediate Representation (IR) of the model based on the trained network topology, weights, and bias values.
- Test the model in Intermediate Representation format using the Inference Engine in the target environment via the provided Inference Engine validation application or sample applications.
- Integrate the Inference Engine in your application to deploy the model in the target environment.
Model Optimizer Workflow
The Model Optimizer process assumes you have a network model that was trained with one of the supported frameworks. The workflow is:
- Configure the Model Optimizer for the MXNet* framework by running the configuration bash script (Linux*) or batch file (Windows*) from the <INSTALL_DIR>/deployment_tools/model_optimizer/install_prerequisites folder:
  - Linux*: install_prerequisites_mxnet.sh
  - Windows*: install_prerequisites_mxnet.bat
For more details on configuring the Model Optimizer, see Configure the Model Optimizer.
- Provide as input a trained model that contains the network topology, described in a .json file, and the adjusted weights and biases, described in a .params file.
- Convert the MXNet* model to an optimized Intermediate Representation.
The Model Optimizer produces as output an Intermediate Representation (IR) of the network that can be read, loaded, and inferred with the Inference Engine. The Inference Engine API offers a unified API across a number of supported Intel® platforms. The Intermediate Representation is a pair of files that describe the whole model:
- .xml: Describes the network topology.
- .bin: Contains the weights and biases binary data.
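As a quick sanity check after conversion, you can list the layers recorded in the generated .xml file. The sketch below uses only the Python standard library; it assumes the IR stores layers as <layer> elements with name and type attributes, and model.xml is a placeholder path for whatever IR your conversion produced.

```python
# Minimal sketch: list the layers recorded in a generated IR .xml file.
# Assumes the IR stores layers as <layer> elements with "name" and "type"
# attributes; "model.xml" is a placeholder path for your converted model.
import xml.etree.ElementTree as ET

def list_ir_layers(xml_path):
    root = ET.parse(xml_path).getroot()
    for layer in root.iter('layer'):
        print(layer.get('name'), '->', layer.get('type'))

if __name__ == '__main__':
    list_ir_layers('model.xml')
```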
Supported Topologies
The table below lists the supported models, with links to their symbol and parameter files.
Model Name | Model Files |
---|---|
VGG-16 | Symbol, Params |
VGG-19 | Symbol, Params |
ResNet-152 v1 | Symbol, Params |
SqueezeNet_v1.1 | Symbol, Params |
Inception BN | Symbol, Params |
CaffeNet | Symbol, Params |
DenseNet-121 | Repo |
DenseNet-161 | Repo |
DenseNet-169 | Symbol, Params |
DenseNet-201 | Repo |
MobileNet | Repo, Params |
SSD-ResNet-50 | Repo, Params |
SSD-VGG-16-300 | Symbol + Params |
SSD-Inception v3 | Symbol + Params |
Fast MRF CNN (Neural Style Transfer) | Repo |
FCN8 (Semantic Segmentation) | Repo |
Converting an MXNet* Model
To convert an MXNet model:
- Go to the <INSTALL_DIR>/deployment_tools/model_optimizer directory.
- To convert an MXNet* model contained in model-file-symbol.json and model-file-0000.params, run the Model Optimizer launch script mo_mxnet.py, specifying a path to the input model file:
python3 mo_mxnet.py --input_model model-file-0000.params
Two groups of parameters are available to convert your model:
- Framework-agnostic parameters: These parameters are used to convert any model trained in any supported framework.
- MXNet-specific parameters: Parameters used to convert only MXNet* models.
Using Framework-Agnostic Conversion Parameters
The following list provides the framework-agnostic parameters. The MXNet-specific list appears later in this document.
```
Optional arguments:
  -h, --help            Shows this help message and exit.
  --framework {tf,caffe,mxnet}
                        Name of the framework used to train the input model.

Framework-agnostic parameters:
  --input_model INPUT_MODEL, -w INPUT_MODEL, -m INPUT_MODEL
                        TensorFlow*: a file with a pre-trained model (binary
                        or text .pb file after freezing). Caffe*: a model
                        proto file with model weights.
  --model_name MODEL_NAME, -n MODEL_NAME
                        Model_name parameter passed to the final create_ir
                        transform. This parameter is used to name a network in
                        a generated IR and output .xml/.bin files.
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
                        Directory that stores the generated IR. By default, it
                        is the directory from where the Model Optimizer is
                        launched.
  --input_shape INPUT_SHAPE
                        Input shape that should be fed to an input node of the
                        model. Shape is defined in the '[N,C,H,W]' or
                        '[N,H,W,C]' format, where the order of dimensions
                        depends on the framework input layout of the model.
                        For example, [N,C,H,W] is used for Caffe* models and
                        [N,H,W,C] for TensorFlow* models. Model Optimizer
                        performs the necessary transforms to convert the shape
                        to the layout acceptable by the Inference Engine. Two
                        types of brackets are allowed to enclose the
                        dimensions: [...] or (...). The shape should not
                        contain undefined dimensions (? or -1) and should fit
                        the dimensions defined in the input operation of the
                        graph.
  --scale SCALE, -s SCALE
                        All input values coming from original network inputs
                        will be divided by this value. When a list of inputs
                        is overridden by the --input parameter, this scale is
                        not applied for any input that does not match with the
                        original input of the model.
  --reverse_input_channels
                        Switches the input channels order from RGB to BGR.
                        Applied to original inputs of the model when and only
                        when the number of channels equals 3.
  --log_level {CRITICAL,ERROR,WARN,WARNING,INFO,DEBUG,NOTSET}
                        Logger level.
  --input INPUT         The name of the input operation of the given model.
                        Usually this is the name of the input placeholder of
                        the model.
  --output OUTPUT       The name of the output operation of the model. For
                        TensorFlow*, do not add :0 to this name.
  --mean_values MEAN_VALUES, -ms MEAN_VALUES
                        Mean values to be used for the input image per
                        channel. Shape is defined in the '(R,G,B)' or
                        '[R,G,B]' format. The shape should not contain
                        undefined dimensions (? or -1). The order of the
                        values is: (value for a RED channel, value for a GREEN
                        channel, value for a BLUE channel).
  --scale_values SCALE_VALUES
                        Scale values to be used for the input image per
                        channel. Shape is defined in the '(R,G,B)' or
                        '[R,G,B]' format. The shape should not contain
                        undefined dimensions (? or -1). The order of the
                        values is: (value for a RED channel, value for a GREEN
                        channel, value for a BLUE channel).
  --data_type {FP16,FP32,half,float}
                        Data type for input tensor values.
  --disable_fusing      Turns off fusing of linear operations to Convolution.
  --disable_gfusing     Turns off fusing of grouped convolutions.
  --extensions EXTENSIONS
                        Directory or list of directories with extensions. To
                        disable all extensions, including those placed at the
                        default location, pass an empty string.
  --batch BATCH, -b BATCH
                        Input batch size.
  --version             Version of Model Optimizer.
```
Note: The Model Optimizer does not revert input channels from RGB to BGR by default, as it did in the 2017 R3 Beta release. Instead, manually specify the command-line parameter to perform the reversion: --reverse_input_channels
Command-Line Interface (CLI) Examples Using Framework-Agnostic Parameters
- Launching the Model Optimizer for model.params with the debug log level. Use this to better understand what is happening internally when a model is converted:
python3 mo_mxnet.py --input_model model.params --log_level DEBUG
- Launching the Model Optimizer for model.params with the output Intermediate Representation named result.xml and result.bin, placed in the specified ../../models/ directory:
python3 mo_mxnet.py --input_model model.params --model_name result --output_dir ../../models/
- Launching the Model Optimizer for model.params and providing scale values for a single input:
python3 mo_mxnet.py --input_model model.params --scale_values [59,59,59]
- Launching the Model Optimizer for model.params with two inputs and two sets of scale values, one for each input. The number of sets of scale/mean values must exactly match the number of inputs of the given model:
python3 mo_mxnet.py --input_model model.params --input data,rois --scale_values [59,59,59],[5,5,5]
- Launching the Model Optimizer for model.params with a specified input layer (data), changing the shape of the input layer to [1,3,224,224], and specifying the name of the output layer:
python3 mo_mxnet.py --input_model model.params --input data --input_shape [1,3,224,224] --output pool5
- Launching the Model Optimizer for model.params with fusing disabled for linear operations with convolution, set by the --disable_fusing flag, and for grouped convolutions, set by the --disable_gfusing flag:
python3 mo_mxnet.py --input_model model.params --disable_fusing --disable_gfusing
- Launching the Model Optimizer for model.params, reversing the channel order between RGB and BGR, specifying mean values for the input, and setting the precision of the Intermediate Representation to FP16:
python3 mo_mxnet.py --input_model model.params --reverse_input_channels --mean_values [255,255,255] --data_type FP16
- Launching the Model Optimizer for model.params with extensions from the specified directories, /home/ and /some/other/path/. In addition, the command shows how to pass a mean file to the Intermediate Representation. The mean file must be in binaryproto format:
python3 mo_mxnet.py --input_model model.params --extensions /home/,/some/other/path/ --mean_file mean_file.binaryproto
Using MXNet*-Specific Conversion Parameters
The following list provides the MXNet*-specific parameters.
```
MXNet-specific parameters:
  --nd_prefix_name ND_PREFIX_NAME
                        Prefix name for args.nd and argx.nd files.
  --pretrained_model_name PRETRAINED_MODEL_NAME
                        Name of the pretrained model, without extension and
                        epoch number, which will be merged with the args.nd
                        and argx.nd files.
```
Custom Layer Definition
Internally, when you run the Model Optimizer, it loads the model, goes through the topology, and tries to find each layer type in a list of known layers. Custom layers are layers that are not included in the list of known layers. If your topology contains any layers that are not in this list of known layers, the Model Optimizer classifies them as custom.
Supported Layers and the Mapping to Intermediate Representation Layers
Number | Layer Name in MXNet | Layer Name in the Intermediate Representation |
---|---|---|
1 | BatchNorm | BatchNormalization |
2 | Crop | Crop |
3 | ScaleShift | ScaleShift |
4 | Pooling | Pooling |
5 | SoftmaxOutput | SoftMax |
6 | SoftmaxActivation | SoftMax |
7 | null | Ignored, does not appear in IR |
8 | Convolution | Convolution |
9 | Deconvolution | Deconvolution |
10 | Activation | ReLU |
11 | ReLU | ReLU |
12 | LeakyReLU | ReLU (negative_slope = 0.25) |
13 | Concat | Concat |
14 | elemwise_add | Eltwise(operation = sum) |
15 | _Plus | Eltwise(operation = sum) |
16 | Flatten | Flatten |
17 | Reshape | Reshape |
18 | FullyConnected | FullyConnected |
19 | UpSampling | Resample |
20 | transpose | Permute |
21 | LRN | Norm |
22 | L2Normalization | Normalize |
23 | Dropout | Ignored, does not appear in IR |
24 | _copy | Ignored, does not appear in IR |
25 | _contrib_MultiBoxPrior | PriorBox |
26 | _contrib_MultiBoxDetection | DetectionOutput |
27 | broadcast_mul | ScaleShift |
The current version of the Model Optimizer for MXNet does not support models that contain custom layers. The general recommendation is to cut the model to remove the custom layer and then reconvert the cut model without it. If the custom layer is the last layer in the topology, its processing logic can instead be implemented in the Inference Engine sample that you use to infer the model.
Frequently Asked Questions (FAQ)
The Model Optimizer provides explanatory messages if it is unable to run to completion due to issues like typographical errors, incorrectly used options, or other issues. The message describes the potential cause of the problem and gives a link to the Model Optimizer FAQ. The FAQ has instructions on how to resolve most issues. The FAQ also includes links to relevant sections in the Model Optimizer Developer Guide to help you understand what went wrong. The FAQ is here: https://software.intel.com/en-us/articles/OpenVINO-ModelOptimizer#FAQ.
Summary
In this document, you learned:
- Basic information about how the Model Optimizer works with MXNet* models
- Which MXNet* models are supported
- How to convert a trained MXNet* model using the Model Optimizer with both framework-agnostic and MXNet-specific command-line options
Legal Information
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.
No computer system can be absolutely secure.
Intel, Arria, Core, Movidius, Pentium, Xeon, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.
*Other names and brands may be claimed as the property of others.
Copyright © 2018, Intel Corporation. All rights reserved.
IoT Reference Implementation: How to Build a Store Traffic Monitor
An application capable of detecting objects on any number of screens.
What it Does
This application is one of a series of IoT reference implementations aimed at instructing users on how to develop a working solution for a particular problem. It demonstrates how to create a smart video IoT solution using Intel® hardware and software tools. This reference implementation monitors the activity of people inside and outside a facility, as well as counting product inventory.
How it Works
The counter uses the Inference Engine included in OpenVINO™. A trained neural network detects objects within a designated area by displaying a green bounding box over them. This reference implementation detects multiple intruding objects entering the frame and identifies their class, count, and time of entry.
Requirements
Hardware
- 6th Generation Intel® Core™ processor with Intel® Iris® Pro graphics and Intel® HD Graphics.
Software
- Ubuntu* 16.04 LTS. Note: You must be running kernel version 4.7+ to use this software; we recommend a 4.14+ kernel. Run the following command to determine your kernel version:
uname -a
- OpenCL™ Runtime Package
- OpenVINO™
IoT Reference Implementation: How to Build an Intruder Detector Solution
An application capable of detecting any number of objects from a video input.
What it Does
This application is one of a series of IoT reference implementations aimed at instructing users on how to develop a working solution for a particular problem. It demonstrates how to create a smart video IoT solution using Intel® hardware and software tools. This solution detects any number of objects in a designated area, providing the number of objects in the frame and a total count.
How it Works
The counter uses the Inference Engine included in OpenVINO™. A trained neural network detects objects within a designated area by displaying a green bounding box over them, and registers them in a logging system.
Requirements
Hardware
- 6th Generation Intel® Core™ processor with Intel® Iris® Pro graphics and Intel® HD Graphics
Software
- Ubuntu* 16.04 LTS. Note: You must be running kernel version 4.7+ to use this software; we recommend a 4.14+ kernel. Run the following command to determine your kernel version:
uname -a
- OpenCL™ Runtime Package
- OpenVINO™
IoT Reference Implementation: How to Build a Face Access Control Solution
Introduction
The Face Access Control application is one of a series of IoT reference implementations aimed at instructing users on how to develop a working solution for a particular problem. The solution uses facial recognition as the basis of a control system for granting physical access. The application detects and registers the image of a person’s face into a database, recognizes known users entering a designated area and grants access if a person’s face matches an image in the database.
From this reference implementation, developers will learn to build and run an application that:
- Detects and registers the image of a person’s face into a database
- Recognizes known users entering a designated area
- Grants access if a person’s face matches an image in the database
How it Works
The Face Access Control system consists of two main subsystems:
cvservice
- cvservice is a C++ application that uses OpenVINO™. It connects to a USB camera (for detecting faces) and then performs facial recognition based on a training data file of authorized users to determine whether a detected person is a known user or previously unknown. Messages are published to an MQTT* broker when users are recognized, and the processed output frames are written to stdout in raw format (to be piped to ffmpeg for compression and streaming). Here, Intel's Photography Vision Library is used for facial detection and recognition.
webservice
- webservice uses the MQTT broker to interact with cvservice. It's an application based on Node.js* for providing visual feedback at the user access station. Users are greeted when recognized as authorized users or given the option to register as a new user. It displays a high-quality, low-latency motion jpeg stream along with the user interface and data analytics.
In the UI, there are three tabs:
- live streaming video
- user registration
- analytics of access history.
This is what the live streaming video tab looks like:
This is what the user registration tab looks like:
This is an example of the analytics tab:
Hardware requirements
- 5th Generation Intel® Core™ processor or newer, or Intel® Xeon® processor v4 or v5, with Intel® Graphics Technology (if enabled by the OEM in BIOS and motherboard) [tested on NUC6i7KYK]
- USB Webcam [tested with Logitech* C922x Pro Stream]
Software requirements
- Ubuntu* 16.04
- OpenVINO™ toolkit
This article continues here on GitHub.
IoT Reference Implementation: People Counter
What it Does
This people counter application is one of a series of IoT reference implementations aimed at instructing users on how to develop a working solution for a particular problem. It demonstrates how to create a smart video IoT solution using Intel® hardware and software tools. This people counter solution detects people in a designated area, providing the number of people in the frame, the average duration that people remain in the frame, and a total count.
How it Works
The counter uses the Inference Engine included in the OpenVINO™ toolkit and the Intel® Deep Learning Deployment Toolkit. A trained neural network detects people within a designated area by displaying a green bounding box over them. It counts the number of people in the current frame, the duration that a person is in the frame (time elapsed between entering and exiting the frame), and the total number of people seen, and then sends the data to a local web server using the Paho* MQTT C client libraries.
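To illustrate the publishing pattern just described, here is a minimal sketch of sending counter statistics to a local MQTT broker. The reference implementation uses the Paho MQTT C client; this sketch uses the Paho Python client instead, and the broker address, topic name, and payload fields are assumptions rather than the project's actual values.

```python
# Illustrative sketch only: publish people-counter stats to a local MQTT broker.
# The reference implementation uses the Paho MQTT C client; the topic name,
# broker address, and payload fields here are hypothetical.
import json
import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883, keepalive=60)  # local broker assumed

stats = {"count": 3, "duration_seconds": 12.5, "total": 42}
client.publish("people_counter/stats", json.dumps(stats))
client.disconnect()
```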
Requirements
Hardware
- 6th Generation Intel® Core™ processor with Intel® Iris® Pro graphics and Intel® HD Graphics.
Software
- Ubuntu* 16.04 LTS. Note: You must be running kernel version 4.7+ to use this software; we recommend a 4.14+ kernel. Run the following command to determine your kernel version:
uname -a
- OpenCL™ Runtime Package
- OpenVINO™
Visit Project on GitHub
How to Install and Use Intel® VTune™ Amplifier Platform Profiler
Try Platform Profiler Today
You are invited to try a free technical preview release. Just follow these simple steps (if you are already registered, skip to step 2):
- Register for the Intel® Parallel Studio XE Beta
- Download and install (the Platform Profiler is a separate download from the Intel Parallel Studio XE Beta)
- Check out the getting started guide, then give Platform Profiler a test drive
- Fill out the online survey
Introduction
Intel® VTune™ Amplifier - Platform Profiler, currently available as a technology preview, is a tool that helps users identify how well an application uses the underlying architecture and how they can optimize the hardware configuration of their system. It displays high-level system configuration such as processor, memory, storage layout, PCIe* and network interfaces (see Figure 1), as well as performance metrics observed on the system such as CPU and memory utilization, CPU frequency, cycles per instruction (CPI), memory and disk input/output (I/O) throughput, power consumption, cache miss rate per instruction, and so on. The performance metrics collected by the tool can be used for deeper analysis and optimization.
There are two primary audiences for Platform Profiler:
- Software developers - Using performance metrics provided by the tool, developers can analyze the behavior of their workload across various platform components such as CPU, memory, disk, and network devices.
- Infrastructure architects - You can monitor your hardware by analyzing long collection runs and finding times when the system exhibits low performance. Moreover, you can optimize the hardware configuration of your system based on the tool's findings. For example, if, after running a mix of workloads, Platform Profiler shows high processor utilization, high memory use, or that I/O is limiting application performance, you can bring in more cores, add more memory, or use more or faster I/O devices.
Figure 1. High-level system configuration view from Platform Profiler
The main difference between Platform Profiler and other VTune Amplifier analyses is that Platform Profiler can profile a platform for long periods of time while incurring very little performance overhead and generating a small amount of data. The current version of Platform Profiler can run for up to 13 hours and generates far less data than VTune Amplifier would in the same time. You can simply start a Platform Profiler collection, keep it running for up to 13 hours while using the system as you normally would, and then stop the collection; Platform Profiler gathers all of the profiled data and displays the system utilization diagrams. VTune Amplifier, on the other hand, cannot run for such a long period, since it generates gigabytes of profiled data in a matter of minutes, so it is more appropriate for fine tuning or analyzing an application rather than a system. "How well is my application using the machine?" and "How well is the machine being used?" are key questions that Platform Profiler can answer; "How do I fix my application to use the machine better?" is the question that VTune Amplifier answers.
Platform Profiler is composed of two main components: a data collector and a server.
- Data Collector - a standalone package that must be installed on the profiled system. It collects system-level hardware and operating system performance counters.
- Platform Profiler Server - post-processes the collected data into a time-series database, correlates it with system topology information, and displays topology diagrams and performance graphs using a web-based interface.
Tool Installation
To use Platform Profiler, you must first install both the server and the data collector components. The installation steps are below.
Installing the Server Component
- Copy the server package to the system on which you want to install the server.
- Extract the archive to a writeable directory.
- Run the setup script and follow the prompts. On Windows*, run the script using the Administrator Command Prompt. On Linux*, use an account with root (“sudo”) privileges.
Linux example: ./setup.sh
Windows example: setup.bat
By default, the server is installed in the following location:
- On Linux: /opt/intel/vpp
- On Windows: C:\Program Files (x86)\IntelSWTools\VTune Amplifier Platform Profiler
Installing the Data Collector Component
- Copy the collector package to the target system on which you want to collect platform performance data.
- Extract the archive to a writeable directory on the target system.
- Run the setup script and follow the prompts. On Windows, run the script using the Administrator Command Prompt. On Linux, use an account with root (“sudo”) privileges.
Linux example: ./setup
Windows example: setup.cmd
By default, the collectors are installed in the following location:
- On Linux: /opt/intel/vpp-collector
- On Windows: C:\Intel\vpp-collector
Tool Usage
Starting and Stopping the Server Component
On Linux:
- Run the following commands to start the server manually after initial installation or a system reboot:
- source ./vpp-server-vars.sh
- vpp-server-start
- Run the following commands to stop the server:
- source ./vpp-server-vars.sh
- vpp-server-stop
On Windows:
- Run the following commands to start the server manually after initial installation or a system reboot:
- vpp-server-vars.cmd
- vpp-server-start
- Run the following command to stop the server:
- vpp-server-vars.cmd
- vpp-server-stop
Collecting System Data
Collecting data using Platform Profiler is straightforward. Below are the steps to collect data with the tool; a small scripting example follows the note after these steps:
- Set up the environment:
- On Linux: source /opt/intel/vpp-collector/vpp-collect-vars.sh
- On Windows: C:\Intel\vpp-collector\vpp-collect-vars.cmd
- Start the data collection: vpp-collect-start [-c “workload description – free text comment”].
- Optionally, you can also add timeline markers to distinguish the time periods between collections: vpp-collect-mark [“an optional label/text/comment”].
- Stop the data collection: vpp-collect-stop. After the collection is stopped, the compressed result file is stored in the current directory.
Note: Inserting timeline markers is useful when you leave Platform Profiler running for a long period of time. For example, you run the Platform Profiler collection for 13 hours straight. During these 13 hours you run various stress tests and would like to find out how each test affects the system. In order to distinguish the time between these tests, you may want to use the timeline markers.
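If you run several tests back to back, the start/mark/stop sequence above can also be scripted. The sketch below is only an illustration of that sequence; it assumes the default Linux collector install path, and the workload command and marker labels are placeholders.

```python
# Illustration only: wrap a workload with Platform Profiler collection commands.
# Assumes the default Linux collector install path; the workload command and
# marker text are placeholders.
import subprocess

VPP_ENV = "source /opt/intel/vpp-collector/vpp-collect-vars.sh"

def vpp(cmd):
    # Each command runs in a shell that first sources the collector environment.
    subprocess.run(f"{VPP_ENV} && {cmd}", shell=True, check=True,
                   executable="/bin/bash")

vpp('vpp-collect-start -c "movie recommendation stress tests"')
vpp('vpp-collect-mark "test 1: 4 threads"')
subprocess.run("sleep 60", shell=True, check=True)  # placeholder for the real workload
vpp('vpp-collect-stop')
```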
View Results
- From the machine on which the server is installed, point your browser (Google Chrome* recommended) to the server home page: http://localhost:6543.
- Click “View Results”.
- Click the Upload button and select the result file to upload.
- Select the result from the list to open the viewer.
- Navigate through the result to identify areas for optimization.
Tool Demonstration
In the rest of the article, I demonstrate how to navigate and analyze the result data that Platform Profiler collects, using a movie recommendation system application as an example. The movie recommendation code is obtained from the Spark* Training GitHub* website. The underlying platform is a two-socket Haswell server (Intel® Xeon® CPU E5-2699 v3) with Intel® Hyper-Threading Technology enabled, 72 logical cores, and 64 GB of memory, running an Ubuntu* 14.04 operating system.
The code is run in Spark on a single node as follows:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat
With the command line above, Spark runs in local mode with four threads, specified with the --master local[4] option. In local mode there is only one driver, which acts as an executor, and the executor spawns the threads to execute tasks. Two arguments can be changed before launching the application: the driver memory (--driver-memory 2g) and the number of threads to run with (local[4]). My goal is to see how much I can stress the system by changing these arguments, and to find out whether I can identify any interesting patterns during execution using Platform Profiler's profiled data.
Here are the four test cases that were run and their corresponding run times:
spark-submit --driver-memory 2g --class MovieLensALS --master local[4] movielens-als_2.10-0.1.jar movies movies/test.dat (16 minutes 11 seconds)
spark-submit --driver-memory 2g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (11 minutes 35 seconds)
spark-submit --driver-memory 8g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (7 minutes 40 seconds)
spark-submit --driver-memory 16g --class MovieLensALS --master local[36] movielens-als_2.10-0.1.jar movies movies/test.dat (8 minutes 14 seconds)
Figures 2 and 3 show observed CPU metrics during the first and second tests, respectively. Figure 2 shows that the CPU is underutilized and the user can add more work, if the rest of the system is similarly underutilized. The CPU frequency slows down often, supporting the thesis that the CPU will not be the limiter of performance. Figure 3 shows that the test utilizes the CPU more due to an increase in number of threads, but there is still significant headroom. Moreover, it is interesting to see that by increasing the number of threads we also decreased the CPI rate, as shown in the CPI chart of Figure 3.
Figure 2. Overview of CPU usage in Test 1.
Figure 3. Overview of CPU usage in Test 2.
Figure 4. Memory read/write throughput on Socket 0 for Test 1.
Comparing Figures 4 and 5 shows that the increase in the number of threads also increased the number of memory accesses. This is expected behavior, and it is verified by the data collected by Platform Profiler.
Figure 5. Memory read/write throughput on Socket 0 for Test 2.
Figures 6 and 7 show L1 and L2 miss rates per instruction for Tests 1 and 2, respectively. Increasing the number of threads in Test 2 drastically decreased the L1 and L2 miss rates, as depicted in Figure 7. We found that the application incurs a lower CPI and lower L1 and L2 miss rates when the code runs with more threads, which means that once data is loaded from memory into the caches, a fair amount of data reuse happens, benefiting overall performance.
Figure 6. L1 and L2 miss rate per instruction for Test 1.
Figure 7. L1 and L2 miss rate per instruction for Test 2.
Figure 8 shows the memory usage chart for Test 3. Similar memory usage patterns are observed for all other tests as well; that is, used memory is between 15-25 percent, whereas cached memory is between 45-60 percent. Spark caches its intermediate results in memory for later processing, hence we see high utilization of cached memory.
Figure 8. Memory utilization overview for Test 3.
Finally, Figures 9-12 show the disk utilization overview for all four test runs. As the amount of work increased across the four test runs, the data shows that a faster disk would improve the performance of the tests. The number of bytes transferred is not large, but the I/O operations (IOPS) spend significant time waiting for completion, as shown by the Queue Depth chart. If you cannot change the disk, adding more threads helps tolerate the disk access latency.
Figure 9. Disk utilization overview for Test 1.
Figure 10. Disk utilization overview for Test 2.
Figure 11. Disk utilization overview for Test 3.
Figure 12. Disk utilization overview for Test 4.
Summary
Using Platform Profiler, I was able to understand the execution behavior of the movie recommendation workload and observe how certain performance metrics change across different numbers of threads and driver memory settings. Moreover, I was surprised to find that a lot of disk write operations happen during the execution of the workload, since Spark applications are designed to run in memory. To investigate the code further, I will proceed with running VTune Amplifier's Disk I/O analysis to find out the details behind the disk I/O performance.
Downpour Interactive* Is Bringing Military Sims to VR with Onward*
The original article is published by Intel Game Dev on VentureBeat*: Downpour Interactive is bringing military sims to VR with Onward. Get more game dev news and related topics from Intel on VentureBeat.
While there's no dearth of VR (virtual reality) shooters out there, military simulation games enthusiasts searching for a VR version of Arma* or Rainbow Six*, where tactics trump flashy firefights, don't have many options. Downpour Interactive* founder Dante Buckley was one such fan, and when he decided to make his first game, Onward*, after a stint in hardware development, he knew it had to be a military simulation.
Onward pits squads of MARSOC (Marine Corps Forces Special Operations Command) marines against their opponents, Volg, in tactical battles across cities and deserts. Like Counter-Strike* or Socom*, players are limited to one life, with no respawns. While players can heal each other, when they're dead, that's it. "It's a style I really loved growing up," Buckley says. "And I wanted to bring that to VR and make sure it doesn't get forgotten in this next generation of games."
Onward goes several steps further, however. There's no screen clutter, for one. The decision to eschew a HUD (head-up display) was an easy one for Buckley to make, and one that made sense for the game as both a VR title and a military simulation.
"You have to look to see who's on your team. There's no identification, no things hovering over their heads, so you have to get used to their voices and play-styles. You have to use the radio to communicate with your team and keep aware of what's going on. It just forces people to play differently than they're used to. So not having a HUD was a design choice made early on."
Playing Dead
Instead of using a mini-map or radar to identify enemies, players have to listen for hints that the enemy might be right around the corner or hiding just behind a wall. When an ally is shot, you can't just look at a health bar, so you need to check their pulse to see whether they're just knocked down — and thus can be healed and brought back into the fight — or dead.
"We wanted to encourage players to interact with each other like they would in the real world," Buckley explains, "rather than just checking their HUD."
Buckley wants the mixture of immersion and a lack of hand-holding to inspire more player creativity. He cites things like players lying on the ground and playing dead. Enemies ignore them and, when their backs are turned, the not-at-all-dead players get up and take them out. You've got to be a little paranoid when you see a body, then.
When Buckley started working on Onward, he was doing it all himself, but last year he was able to hire a team of four. The extra hands mean that a lot of the existing game is being improved upon, from the maps to the guns.
"We're still working on updating some things. We had to use asset stores to create these maps. When I first started, it was just me — I had to balance out programming and map design and sound design, so using asset stores like Unity* and other places was really helpful in fast-tracking the process and getting a vision set early on. But now that I have a team, we're going back and updating things and making them higher quality."
The addition of a weapon artist has been a particular boon, says Buckley. "When you could first play the game in August 2016, the weapons looked like crap. Now an awesome weapon artist has joined us and he's going back through the game updating each weapon. Things are looking really good now."
Beyond making them look good, Buckley also wants to ensure that each weapon handles realistically. He's used feedback from alpha testers, marines, and his own experiences from gun ranges to create the firearms.
Tools of the Trade
"There are different ergonomics for each weapon," he explains. And they've got different grips and reload styles. "That's something to consider that most non-VR games don't. With other games you just click ‘R' to reload. You never really need to think about how you hold it or reload it. That's something cool that we've translated from the real-world."
It's an extra layer of interaction that you can even see in the way that you use something like a pair of night vision goggles. Instead of just hitting a button, you have to actually grab the night vision goggles that are on your head and pull them down over your face. And when you're done with them, you need to lift the goggles up.
All of these tools and weapons are selected in the lobby, before a match. Players can pick their primary weapon, a pistol, gear like grenades or syringes, and finally weapon attachments such as scopes, suppressors, and various foregrips. Everyone has access to the same equipment. There's a limit to how much you can add to your loadout (everyone gets 8 points to distribute as they see fit), but that's the only restriction.
"That's something I wanted to make sure was in the game," says Buckley. "I know a lot of games have progression systems, which are fun, but it can kill the skill-based aspect of some games, where the more you play, the more stuff you get, versus getting better."
The problem right now is that the haptics can't match Buckley's ambitions. Since the controllers aren't in the shape of weapons, and they only vibrate, it's hard to forget that you're just holding some weird-looking controllers. But there are people and companies creating peripherals that the controllers can be slotted into, calling to mind light gun shooters.
Onward has been on Steam* Early Access since August 2016, and Buckley considers it an overwhelmingly positive experience. "I can't say many negative things about it. It's been awesome. I'm super thankful because it gives me and my team an opportunity to show the world that we can make something big, and improve on it. Under any other circumstances, if there wasn't an avenue like this to make games and get them out into the world, a game like Onward might not have existed."
The early release has also allowed Buckley to watch a community grow around the game, which he thinks is helped by VR. "What makes us feel more together is that we can jump into VR and see each other, and see expressive behavior. You can see body language in-game; it's more than you get with a mic and a keyboard. Since people can emote how they're feeling, it's a little bit more real."
The end of this phase is almost in sight, however. Onward will launch by the end of the year, and Buckley promises some big announcements before then. In the meantime, you can play the early access version on Steam*.
CardLife*'s Liberating World is a Game Inside a Cardboard Creation Tool
The original article is published by Intel Game Dev on VentureBeat*: CardLife’s liberating world is a game inside a cardboard creation tool. Get more game dev news and related topics from Intel on VentureBeat.
With Robocraft*, its game of battling robots, Freejam* made its love for user-generated content known. With its latest game, the studio is going back to the ultimate source of user-generated content: the humble sheet of cardboard. Not so humble now, however, when it's used to craft an entire multiplayer survival game filled with mechs, dragons, and magic.
It started as a little prototype with a simple goal conceived by Freejam's CEO Mark Simmons: explore your creativity with cardboard. He and game director Rich Tyrer had chatted about it, thrown around more ideas, and they eventually tossed their creation out into the wild. "We had a basic cardboard world with a cardboard cabin and tools," recalls Tyrer. "We released it online just to see what would happen. The point was to see what people could do with their creativity with this cardboard and these two tools, a saw and a hacksaw."
Even with just a couple of tools, there were a lot of things that could be done with the cardboard — especially since the scale could be manipulated — and its potential was enough for a Freejam team to be set up a month later to properly develop it. "We went from there and molded it into this game. We started thinking about what genre would be good for the aesthetic, how could we let people make what they want? In our minds, the cardboard aesthetic was like being a child and building a castle out of a cardboard box. You fill the blanks. That's what it came down to."
The open-world multiplayer game itself, however, is not the end-point of the CardLife* concept. "The CardLife game is more of an example of what you can do in cardboard. We want to use that to create this cardboard platform where people can create things and do lots of modding. That's why it's UGC (user-generated content) focused and really customizable."
The Only Rule of CardLife
Freejam's goal is to create a platform where people can create whatever they want, both within the CardLife game and by creating mods simply by using Notepad* and Paint*, as long as those things are cardboard. Tyrer uses the example of Gears of War*. If a player wants to totally recreate Epic*'s shooter, they should just use Unreal*. If they want to make it out of cardboard though, then they can absolutely do that. That's the only creative limitation Tyrer wants to impose.
He stresses that, in terms of accessibility, he wants it to be easy enough for kids to use. "The 3D modeling for kids moniker is more about the barrier to entry and how difficult it is for someone to make a mod. We want it to be low, whether they want to make a driving version, a fighting game, or they just want to add more creatures." Everything in the game is essentially just a collection of 2D shapes that can be found as PNG files in the game folder. Making a new character model is as simple as drawing different shapes in Paint and putting them together.
CardLife's customization isn't limited to modding, however, and whenever players craft an item, a vehicle, or even a dragon, they're able to completely transform their cardboard creation. The system is named for and inspired by connecting the dots. It's a way for kids to create art by filling in the blanks, and that's very much the feel that CardLife is trying to capture by letting you alter the cardboard silhouette of a helicopter or a monster. It's what's ultimately happening when the world gets changed by players terraforming areas, too.
The less noticeable mechanisms and hidden pieces of card take care of themselves. "If you wanted to put big spikes on the back, you can drastically change this central piece of card on the back and make it their own," explains Tyrer. "But a small disc underneath the seat that nobody really sees, there's no point in someone drawing that. It just inherits the scale of all the other pieces that you've drawn. And you can see in the 3D preview how your pieces of card have had an effect."
Parts can be mirrored and duplicated as well, excising some of the busywork. If you're making a new character and give them a leg, you don't then have to make the second leg; the game will create a new leg for you. This, Tyrer hopes, will free people up to just focus on making cool things and exploring their creativity. He doesn't want the art and the gameplay to get in each other's way, so no matter how ridiculous your creation looks, it's still perfectly viable in combat.
Beautiful but Deadly
"We learned in Robocraft that art bots, as in bots that are aesthetically pleasing, generally don't perform well compared to very Robocraft-y bots, which is usually a block with loads of guns on top, because of the way the damage model works. So it's quite hard, for example, so make something that looks like SpongeBob SquarePants* be an effective robot. With CardLife, we wanted to make sure that people would always be able to creatively express themselves without having to worry about the gameplay implications."
PvE servers will be added in an update, but combat and PvP will remain an important part of CardLife. Tyrer envisions big player sieges, with large fortresses made out of cardboard being assaulted by da Vinci-style helicopters, wizards, and soldiers carrying future-tech. Craftable items are split into technological eras, all the way up to the realms of science fiction, but there's magic too, and craftable spells.
Tyrer sees cardboard as liberating, freeing him from genre and setting conventions. "If you're making an AAA sci-fi game, you can't put dragons in it. The parameters of what you can add to the game are set by the pre-existing notions of what sci-fi is. And you can see that in Robocraft. But with cardboard… if I'm a child playing with a cardboard box, nobody is going to pop up and tell me I can't have a dragon fight a mech. I can. That's the beauty of it."
Rather than standing around waiting for tech to get researched, discovering new recipes comes down to simply exploring and digging into the game. Sometimes literally. "It's more like real life," explains Tyrer. "As you find new materials and put them into your crafting forge, it will give you different options to make things. As you dig deeper into the ground and find rare ores, those ores will be associated with different recipes, and those recipes will then allow you to make stronger items."
To Infinity
Unlike its fellow creative sandbox Minecraft*, CardLife's world is not procedural. It is finite and designed in a specific way. If you're standing beneath a huge mountain in one server and go to another, you'll be able to find that very same mountain. It might look different, however, as these worlds are still dynamic and customizable. But they can also be replenished.
"If you want to keep a finite world running for an infinite amount of time, it needs to be replenished," says Tyrer. "Structures of lapsed players will decay, natural disasters will refill caves or reform mountains, and that's our way of making the world feel infinite. We can have a world that's shifting and changing but also one that people can get to know and love."
What's there right now, however, is just an early access placeholder. CardLife's moving along swiftly, though. A large update is due around the beginning of March, potentially earlier, that will introduce armor and weapon stats, building permissions so that you can choose who can hang out in or edit your structure, and item durability. And then another update is due out two months after that.
The studio isn't ready to announce a release date yet, and it's still busy watching its early access test bed and planning new features. More biomes are on the cards, as well as oceans that can be explored. "No avenue is closed," says Tyrer. And that includes platforms, so a console release isn't out of the question. There's even the cardboard connection between this and the Switch* via Nintendo*'s Labo construction kit, though Tyrer only laughs at the suggestion they'd be a good fit. It's not a 'no'.
"We just want to take all the cool things you can think of, put it in a pot, and mix it around."
CardLife is available at Freejam* now and is coming to Steam* soon.
The Path to Production IoT Development
Billions of devices are connecting to the Internet of Things (IoT) every year, and it isn’t happening magically. On the contrary, a veritable army of IoT developers is painstakingly envisioning and prototyping these Internet of Things “things”—one thing at a time. And, in many cases, they are guiding these IoT solutions from the drawing board to commercial production through an incredibly complex gauntlet.
Building a working prototype can be a challenging engineering feat, but it’s the decision-making all along the way that can be especially daunting. Remember that famous poem by Robert Frost, The Road Not Taken, in which the traveler comes to a fork in the road and dithers about which way to go? Just one choice? Big deal. Consider the possible divergent paths in IoT development:
- Developer kits – “out-of-the-box” sets of hardware and software resources to build prototypes and solutions.
- Frameworks – platforms that identify key components of an IoT solution, how they work together, and where essential capabilities need to reside.
- Sensors – solution components that detect or measure a physical property and record, indicate, or otherwise respond to it.
- Hardware – everything from the circuit boards, sensors, and actuators in an IoT solution to the mobile devices and servers that communicate with it.
- IDEs – Integrated Development Environments are software suites that consolidate the basic tools IoT developers need to write and test software.
- Cloud – encompasses all the shared resources—applications, storage and compute—that are accessible via the web to augment the performance, security and intelligence of IoT devices.
- Security – this can be multilayered, from the “thing” to the network to the cloud.
- Analytics – comprises all the logic on board the device, on the network and in the cloud that is required to turn an IoT solution’s raw data streams into decisions and action.
Given all the technologies to choose from that are currently on the market, the possible combinations are infinite. How well those combinations mesh is critical for an efficient development process. And efficiency matters now more than ever—especially in industries and areas of IoT development where the speed of innovation is off the charts.
One area of IoT development where the pace of innovation is quickly taking off is at the intersection of the Internet of Things and Artificial Intelligence (AI). It’s where developers are intent upon driving towards a connected and autonomous world. Whether connecting the unconnected, enabling smart devices, or eventually creating networked autonomous things on a large scale, developers are being influenced by ever-increasing compute performance and monitoring needs. They are responding by designing AI capabilities into highly integrated IoT solutions that are spanning the extended network environment: paving the way to solve highly complex challenges in areas of computer vision, edge computing, and automated manufacturing processes. The results are already making their presence known in enhanced retail customer experiences, beefed up security in cities, and better forecasting of events and human behaviors of all kinds.
For IoT developers, AI represents intriguing possibilities and much more: more tools, more knowledge, and more decision-making. Also, it adds complexity and can slow time to market. But the silver lining is that you don’t have to figure out AI or any aspect of IoT development all by yourself.
Intel stands ready to help with technology, expertise, and support—everything from scalable developer kits to specialized, industry-focused, use-case-driven kits and more. In fact, Intel offers more than 6,000 IoT solutions, from components and software to systems and services—all optimized to work together to sustain developers as their solutions evolve from prototype to production.
We take a practical, prescriptive approach to IoT development—helping you obtain the precise tools and training you need to realize your IoT ambitions. And we can assist you in fulfilling the specialized requirements of your IoT solution design today, while anticipating future requirements so you don’t have to.
Take your first step on the path to straightforward IoT development at https://software.intel.com/iot.
Battery-Powered Deep Learning Inference Engine
Improving visual perception of edge devices
LiPo (lithium polymer) batteries and embedded processors are a boon to the Internet of Things (IoT) market. They have enabled IoT device manufacturers to pack more features and functionality into mobile edge devices, while still providing a long runtime on a single charge. Advancements in sensor technology, especially vision-based sensors, and in the software algorithms that process the large amounts of data these sensors generate, have spiked the need for better computational performance without compromising the battery life or real-time performance of these mobile edge devices.
The Intel® Movidius™ Visual Processing Unit (Intel® Movidius™ VPU) provides real-time visual computing capabilities to battery-powered consumer and industrial edge devices such as Google Clips, DJI® Spark drone, Motorola® 360 camera, HuaRay® industrial smart cameras, and many more. In this article, we won’t replicate any of these products, but we will build a simple handheld device that uses deep neural networks (DNN) to recognize objects in real-time.
[Image: The project in action]
Practical learning!
You will build…
A battery-powered DIY handheld device, with a camera and a touch screen, that can recognize an object when pointed toward it.
You will learn…
- How to create a live image classifier using Raspberry Pi* (RPi) and the Intel® Movidius™ Neural Compute Stick (Intel® Movidius™ NCS)
You will need…
- An Intel Movidius Neural Compute Stick - Where to buy
- A Raspberry Pi 3 Model B running the latest Raspbian* OS
- A Raspberry Pi camera module
- A Raspberry Pi touch display
- A Raspberry Pi touch display case [Optional]
- Alternative option - Pimoroni® case on Adafruit
If you haven’t already done so, install the Intel Movidius NCSDK on your RPi either in full SDK or API-only mode. Refer to the Intel Movidius NCS Quick Start Guide for full SDK installation instructions, or Run NCS Apps on RPi for API-only.
Fast track…
If you would like to see the final output before diving into the detailed steps, download the code from our sample code repository and run it.
mkdir -p ~/workspace
cd ~/workspace
git clone https://github.com/movidius/ncappzoo
cd ncappzoo/apps/live-image-classifier
make run
The above commands must be run on a system that runs the full SDK, not just the API framework. Also make sure a UVC camera is connected to the system (a built-in webcam on a laptop will work).
You should see a live video stream with a square overlay. Place an object in front of the camera and align it to be inside the square. Here’s a screenshot of the program running on my system.
Let’s build the hardware
Here is a picture of how the hardware setup turned out:
Step 1: Display setup
Touch screen setup: follow the instructions on element14’s community page.
Rotate the display: Depending on the display case or stand, your display might appear inverted. If so, follow these instructions to rotate the display 180°.
sudo nano /boot/config.txt
# Add the below line to /boot/config.txt and hit Ctrl-x to save and exit.
lcd_rotate=2
sudo reboot
Skip step 2 if you are using a USB camera.
Step 2: Camera setup
Enable CSI camera module: follow instructions on the official Raspberry Pi documentation site.
Enable the V4L2 driver: For reasons unknown, Raspbian does not load the V4L2 driver for CSI camera modules by default. The example script for this project uses OpenCV-Python, which in turn uses V4L2 to access cameras (via /dev/video0), so we have to load the V4L2 driver.
sudo nano /etc/modules
# Add the below line to /etc/modules and hit Ctrl-x to save and exit.
bcm2835-v4l2
sudo reboot
Let’s code
Being a big advocate of code reuse, I pulled most of the Python* script for this project from the previous article, ‘Build an image classifier in 5 steps’. The main difference is that each ‘step’ (section of the script) has been moved into its own function.
The application is written in such a way that you can run any classifier neural network without having to make much change to the script. The following are a few user-configurable parameters:
- GRAPH_PATH: Location of the graph file against which we want to run the inference
  - By default it is set to ~/workspace/ncappzoo/tensorflow/mobilenets/graph
- CATEGORIES_PATH: Location of the text file that lists the labels of each class
  - By default it is set to ~/workspace/ncappzoo/tensorflow/mobilenets/categories.txt
- IMAGE_DIM: Dimensions of the image as defined by the chosen neural network
  - e.g., MobileNets and GoogLeNet use 224x224 pixels, AlexNet uses 227x227 pixels
- IMAGE_STDDEV: Standard deviation (scaling value) as defined by the chosen neural network
  - e.g., GoogLeNet uses no scaling factor, MobileNets use 127.5 (stddev = 1/127.5)
- IMAGE_MEAN: Mean subtraction is a common technique used in deep learning to center the data
  - For the ILSVRC dataset, the mean is Blue = 102, Green = 117, Red = 123
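For reference, here is a minimal sketch of how these parameters might be defined near the top of the script. The values and paths are taken from the defaults listed above and are illustrative assumptions, not an exact excerpt from the sample code.
import os

# User-configurable parameters (illustrative defaults; adjust to your network).
GRAPH_PATH      = os.path.expanduser( '~/workspace/ncappzoo/tensorflow/mobilenets/graph' )
CATEGORIES_PATH = os.path.expanduser( '~/workspace/ncappzoo/tensorflow/mobilenets/categories.txt' )
IMAGE_DIM       = ( 224, 224 )      # MobileNets and GoogLeNet expect 224x224 input
IMAGE_STDDEV    = ( 1.0 / 127.5 )   # scaling value; use 1.0 for networks with no scaling (e.g., GoogLeNet)
IMAGE_MEAN      = [ 127.5, 127.5, 127.5 ]  # per-channel mean; GoogLeNet uses the ILSVRC mean (B=102, G=117, R=123)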
Before using the NCSDK API framework, we have to import mvncapi module from mvnc library:
import mvnc.mvncapi as mvnc
If you have already gone through the image classifier blog, skip steps 1, 2, and 5.
Step 1: Open the enumerated device
Just like any other USB device, when you plug the NCS into your application processor’s (Ubuntu laptop/desktop) USB port, it enumerates itself as a USB device. We will call an API to look for the enumerated NCS device, and another to open the enumerated device.
# ---- Step 1: Open the enumerated device and get a handle to it -------------
def open_ncs_device():
    # Look for enumerated NCS device(s); quit program if none found.
    devices = mvnc.EnumerateDevices()
    if len( devices ) == 0:
        print( 'No devices found' )
        quit()

    # Get a handle to the first enumerated device and open it.
    device = mvnc.Device( devices[0] )
    device.OpenDevice()

    return device
Step 2: Load a graph file onto the NCS
To keep this project simple, we will use a pre-compiled graph of a pre-trained GoogLeNet model, which was downloaded and compiled when you ran make
inside the ncappzoo
folder. We will learn how to compile a pre-trained network in another blog, but for now let’s figure out how to load the graph into the NCS.
# ---- Step 2: Load a graph file onto the NCS device -------------------------
def load_graph( device ):
    # Read the graph file into a buffer.
    with open( GRAPH_PATH, mode='rb' ) as f:
        blob = f.read()

    # Load the graph buffer into the NCS.
    graph = device.AllocateGraph( blob )

    return graph
Step 3: Pre-process frames from the camera
As explained in the image classifier article, a classifier neural network assumes there is only one object in the entire image. This is hard to control with a LIVE camera feed, unless you clear out your desk and stage a plain background. In order to deal with this problem, we will cheat a little bit. We will use OpenCV API to draw a virtual box on the screen and ask the user to manually align the object within this box; we will then crop the box and send the image to NCS for classification.
# ---- Step 3: Pre-process the images ----------------------------------------
def pre_process_image():
    # Grab a frame from the camera.
    ret, frame = cam.read()
    height, width, channels = frame.shape

    # Extract/crop the frame and resize it.
    x1 = int( width / 3 )
    y1 = int( height / 4 )
    x2 = int( width * 2 / 3 )
    y2 = int( height * 3 / 4 )

    cv2.rectangle( frame, ( x1, y1 ), ( x2, y2 ), ( 0, 255, 0 ), 2 )
    cv2.imshow( 'NCS real-time inference', frame )
    cropped_frame = frame[ y1 : y2, x1 : x2 ]
    cv2.imshow( 'Cropped frame', cropped_frame )

    # Resize image [image size is defined by the chosen network during training].
    cropped_frame = cv2.resize( cropped_frame, IMAGE_DIM )

    # Mean subtraction and scaling [a common technique used to center the data].
    cropped_frame = cropped_frame.astype( numpy.float16 )
    cropped_frame = ( cropped_frame - IMAGE_MEAN ) * IMAGE_STDDEV

    return cropped_frame
Step 4: Offload an image/frame onto the NCS to perform inference
Thanks to the high performance and low power consumption of the Intel Movidius VPU inside the NCS, the only thing the Raspberry Pi has to do is pre-process the camera frames (step 3) and send them over to the NCS. The inference results are made available as an array of probability values, one per class. We can use argmax() to determine the index of the top prediction and pull the label corresponding to that index.
# ---- Step 4: Offload images, read and print inference results --------------
def infer_image( graph, img ):
    # Read all categories into a list.
    categories = [ line.rstrip( '\n' ) for line in
                   open( CATEGORIES_PATH ) if line != 'classes\n' ]

    # Load the image as a half-precision floating point array.
    graph.LoadTensor( img, 'user object' )

    # Get results from the NCS.
    output, userobj = graph.GetResult()

    # Find the index of the highest confidence.
    top_prediction = output.argmax()

    # Print the top prediction.
    print( "Prediction: " + str( top_prediction )
           + " " + categories[top_prediction]
           + " with %3.1f%% confidence" % ( 100.0 * output[top_prediction] ) )

    return
If you are interested in seeing the actual output from the NCS, head over to ncappzoo/apps/image-classifier.py and make this modification:
# ---- Step 4: Read and print inference results from the NCS -----------------
# Get the results from the NCS.
output, userobj = graph.GetResult()

# Print the entire output array.
print( output )
...
# Print the top predictions; 'order' holds the class indices sorted by
# descending confidence (it is computed elsewhere in the full script).
for i in range( 0, 4 ):
    print( "Prediction " + str( i ) + ": " + str( order[i] )
           + " with %3.1f%% confidence" % ( 100.0 * output[order[i]] ) )
...
When you run this modified script, it will print out the entire output array. Here’s what you get when you run an inference against a network that has 37 classes. Notice that the size of the array is 37, and that the top prediction (73.8%) is at index 30 of the array (7.37792969e-01).
[ 0.00000000e+00 2.51293182e-04 0.00000000e+00 2.78234482e-04
0.00000000e+00 2.36272812e-04 1.89781189e-04 5.07831573e-04
6.40749931e-05 4.22477722e-04 0.00000000e+00 1.77288055e-03
2.31170654e-03 0.00000000e+00 8.55255127e-03 6.45518303e-05
2.56919861e-03 7.23266602e-03 0.00000000e+00 1.37573242e-01
7.32898712e-04 1.12414360e-04 1.29342079e-04 0.00000000e+00
0.00000000e+00 0.00000000e+00 6.94580078e-02 1.38878822e-04
7.23266602e-03 0.00000000e+00 7.37792969e-01 0.00000000e+00
7.14659691e-05 0.00000000e+00 2.22778320e-02 9.25064087e-05
0.00000000e+00]
Prediction 0: 30 with 73.8% confidence
Prediction 1: 19 with 13.8% confidence
Prediction 2: 26 with 6.9% confidence
Prediction 3: 34 with 2.2% confidence
Step 5: Unload the graph and close the device
In order to avoid memory leaks and/or segmentation faults, we should close any open files or resources and deallocate any used memory.
# ---- Step 5: Unload the graph and close the device -------------------------
def close_ncs_device( device, graph ):
cam.release()
cv2.destroyAllWindows()
graph.DeallocateGraph()
device.CloseDevice()
return
Congratulations! You just built a DNN-based live image classifier.
The following pictures show this project in action:
- NCS and a wireless keyboard dongle plugged directly into the RPi
- RPi camera setup
- Classifying a bowl
- Classifying a computer mouse
Further experiments
- Port this project onto a headless system like RPi Zero* running Raspbian Lite*.
- This example script uses MobileNets to classify images. Try flipping the camera around and modifying the script to classify your age and gender.
- Hint: Use graph files from ncappzoo/caffe/AgeNet and ncappzoo/caffe/GenderNet.
- Convert this example script to do object detection using ncappzoo/SSD_MobileNet or Tiny YOLO.
Further reading
- @wheatgrinder, an NCS community member, developed a system where live inferences are hosted on a local server, so you can stream it through a web browser.
- Depending on the number of peripherals connected to your system, you may notice throttling issues, as mentioned by @wheatgrinder in his post. Here’s a good read on how he fixed the issue.
Using and Understanding the Intel® Movidius™ Neural Compute SDK
The Intel® Movidius™ Neural Compute Software Development Kit (NCSDK) comes with three tools that are designed to help users get up and running with their Intel® Movidius™ Neural Compute Stick: mvNCCheck, mvNCCompile, and mvNCProfile. In this article, we will aim to provide a better understanding of how the mvNCCheck tool works and how it fits into the overall workflow of the NCSDK.
Fast track: Let’s check a network using mvNCCheck!
You will learn…
- How to use the mvNCCheck tool
- How to interpret the output from mvNCCheck
You will need…
- An Intel Movidius Neural Compute Stick - Where to buy
- An x86_64 laptop/desktop running Ubuntu 16.04
If you haven’t already done so, install NCSDK on your development machine. Refer to the Intel Movidius NCS Quick Start Guide for installation instructions.
Checking a network
Step 1 - Open a terminal and navigate to ncsdk/examples/caffe/GoogLeNet
Step 2 - Let’s use mvNCCheck to validate the network on the Intel Movidius Neural Compute Stick
mvNCCheck deploy.prototxt -w bvlc_googlenet.caffemodel
Step 3 - You’re done! You should see output similar to the one below:
USB: Myriad Connection Closing.
USB: Myriad Connection Closed.
Result: (1000,)
1) 885 0.3015
2) 911 0.05157
3) 904 0.04227
4) 700 0.03424
5) 794 0.03265
Expected: (1000,)
1) 885 0.3015
2) 911 0.0518
3) 904 0.0417
4) 700 0.03415
5) 794 0.0325
------------------------------------------------------------
Obtained values
------------------------------------------------------------
Obtained Min Pixel Accuracy: 0.1923076924867928% (max allowed=2%), Pass
Obtained Average Pixel Accuracy: 0.004342026295489632% (max allowed=1%), Pass
Obtained Percentage of wrong values: 0.0% (max allowed=0%), Pass
Obtained Pixel-wise L2 error: 0.010001560141939479% (max allowed=1%), Pass
Obtained Global Sum Difference: 0.013091802597045898
------------------------------------------------------------
What does mvNCCheck do and why do we need it?
As part of the NCSDK, mvNCCheck serves three main purposes:
- Ensure accuracy when the data is converted from fp32 to fp16
- Quickly find out if a network is compatible with the Intel Movidius Neural Compute Stick
- Quickly debug the network layer by layer
Ensuring accurate results
To ensure accurate results, mvNCCheck compares inference results from the Intel Movidius Neural Compute Stick against those from the network’s native framework (Caffe* or TensorFlow*). Since the Intel Movidius Neural Compute Stick and the NCSDK use 16-bit floating-point data, the incoming 32-bit floating-point data must be converted to 16-bit floats. The conversion from fp32 to fp16 can introduce minor rounding differences in the inference results, and this is where the mvNCCheck tool comes in handy: it can check whether your network is producing accurate results.
First the mvNCCheck tool reads in the network and converts the model to Intel Movidius Neural Compute Stick format. It then runs an inference through the network on the Intel Movidius NCS, and it also runs an inference with the network’s native framework (Caffe or TensorFlow).
Finally the mvNCCheck tool displays a brief report that compares inference results from the Intel Movidius Neural Compute Stick and from the native framework. These results can be used to confirm that a neural network is producing accurate results after the fp32 to fp16 conversion on the Intel Movidius Neural Compute Stick. Further details on the comparison results will be discussed below.
Determine network compatibility with Intel Movidius Neural Compute Stick
mvNCCheck can also be used as a tool to simply check whether a network is compatible with the Intel Movidius Neural Compute Stick. There are a number of limitations that could cause a network not to be compatible with the Intel Movidius Neural Compute Stick, including, but not limited to, memory constraints, unsupported layers, or unsupported neural network architectures. For more information on these limitations, please visit the Intel Movidius Neural Compute Stick documentation website for the TensorFlow and Caffe frameworks. Additionally, you can view the latest NCSDK Release Notes for more information on errata and new features for the NCSDK.
Debugging networks with mvNCCheck
If your network isn’t working as expected, mvNCCheck can be used to debug your network. This can be done by running mvNCCheck with the -in and -on options.
- The -in option allows you to specify a node as the input node
- The -on option allows you to specify a node as the output node
Using the -in and -on arguments with mvNCCheck, it is possible to pinpoint the layer where errors or discrepancies originate by comparing the Intel Movidius Neural Compute Stick results with the Caffe or TensorFlow results in a layer-by-layer or binary-search fashion.
Debugging example:
Let’s assume your network architecture is as follows:
- Input - Data
- conv1 - Convolution Layer
- pooling1 - Pooling Layer
- conv2 - Convolution Layer
- pooling2 - Pooling Layer
- Softmax - Softmax
Let’s pretend you are getting nan (not a number) results when running mvNCCheck. You can use the -on option to check the output of the first convolution layer “conv1” by running the following command:
mvNCCheck user_network -w user_weights -in input -on conv1
With a large network, using a binary search helps reduce the time needed to find the layer where the issue originates.
Understanding the output of mvNCCheck
Let’s examine the output of mvNCCheck above.
- The first block of results (“Result”) shows the top five Intel Movidius Neural Compute Stick inference results
- The second block (“Expected”) shows the top five framework results from either Caffe or TensorFlow
- The comparison output at the bottom shows various comparisons between the two inference results
To understand these results in more detail, we have to understand that the output from the Intel Movidius Neural Compute Stick and the output from Caffe or TensorFlow are each stored in a tensor (put simply, an array of values). Each of the five comparison tests is a mathematical comparison between the two tensors; a small illustrative sketch of these comparisons appears after the definitions below.
Legend:
- ACTUAL – the tensor output by the Intel Movidius Neural Compute Stick
- EXPECTED – the tensor output by the framework (Caffe or TensorFlow)
- Abs – calculate the absolute value
- Max – find the maximum value in a tensor
- Sqrt – find the square root of a value
- Sum – find the sum of a set of values
Min Pixel Accuracy:
This value represents the largest difference between the two output tensors’ values.
Average Pixel Accuracy:
This is the average difference between the two tensors’ values.
Percentage of wrong values:
This value represents the percentage of Intel Movidius NCS tensor values that differ by more than 2 percent from the framework tensor.
Why the 2% threshold? The 2 percent threshold comes from the expected impact of reducing the precision from fp32 to fp16.
Pixel-wise L2 error:
This value is a rough relative error of the entire output tensor.
Sum Difference:
The sum of all of the differences between the Intel Movidius NCS tensor and the framework tensor.
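To make these definitions concrete, here is a small NumPy sketch that computes comparable metrics from two output tensors. It is an illustration of the descriptions above under my own assumption about normalization (relative to the largest framework output value), not the exact code used inside mvNCCheck.
import numpy as np

def compare_tensors( actual, expected, wrong_threshold=0.02 ):
    # Illustrative (not the NCSDK's exact) versions of the mvNCCheck metrics.
    actual   = np.asarray( actual,   dtype=np.float32 ).ravel()
    expected = np.asarray( expected, dtype=np.float32 ).ravel()
    diff     = np.abs( actual - expected )
    scale    = np.max( np.abs( expected ) )   # normalize against the framework output

    min_pixel_accuracy    = 100.0 * np.max( diff ) / scale    # largest single difference
    avg_pixel_accuracy    = 100.0 * np.mean( diff ) / scale   # average difference
    pct_wrong_values      = 100.0 * np.mean( diff > wrong_threshold * scale )
    pixelwise_l2_error    = 100.0 * np.sqrt( np.sum( diff ** 2 ) / np.sum( expected ** 2 ) )
    global_sum_difference = np.sum( diff )

    return ( min_pixel_accuracy, avg_pixel_accuracy, pct_wrong_values,
             pixelwise_l2_error, global_sum_difference )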
How did mvNCCheck run an inference without an input?
When making a forward pass through a neural network, it is common to supply a tensor or array of numerical values as input. If no input is specified, mvNCCheck uses an input tensor of random float values ranging from -1 to 1. It is also possible to specify an image input with mvNCCheck by using the “-i” argument followed by the path of the image file.
Examining a Failed Case
If you run mvNCCheck and your network fails, it could be for one of the following reasons.
Input Scaling
Some neural networks expect the input values to be scaled. If the inputs are not scaled, this can result in the Intel Movidius Neural Compute Stick inference results differing from the framework inference results.
When using mvNCCheck, you can use the -S option to specify the divisor used to scale the input. Images are commonly stored with values from each color channel in the range 0-255. If a neural network expects values from 0.0 to 1.0, then using the -S 255 option will divide all input values by 255 and scale the inputs accordingly to the range 0.0 to 1.0.
The -M option can be used for subtracting the mean from the input. For example, if a neural network expects input values ranging from -1 to 1, you can use the -S 128 and -M 128 options together to scale the inputs to roughly -1 to 1.
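The arithmetic behind these two options is simple; the short sketch below just illustrates the mapping described above (it is not mvNCCheck source code, and the example pixel values are arbitrary).
import numpy as np

pixels = np.array( [ 0, 64, 128, 255 ], dtype=np.float32 )

# -S 255: divide by 255 to map 0-255 pixel values into the 0.0 to 1.0 range.
print( pixels / 255.0 )                 # approximately [0, 0.25, 0.5, 1.0]

# -M 128 -S 128: subtract the mean, then divide, to map 0-255 into roughly -1 to 1.
print( ( pixels - 128.0 ) / 128.0 )     # approximately [-1, -0.5, 0, 0.99]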
Unsupported layers
Not all neural network architectures and layers are supported by the Intel Movidius Neural Compute Stick. If you receive an error message saying “Stage Details Not Supported” after running mvNCCheck, there may be a chance that the network you have chosen requires operations or layers that are not yet supported by the NCSDK. For a list of all supported layers, please visit the Neural Compute Caffe Support and Neural Compute TensorFlow Support documentation sites.
Bugs
Another possible cause of incorrect results is bugs. Please report all bugs to the Intel Movidius Neural Compute Developer Forum.
More mvNCCheck options
For a complete listing of the mvNCCheck arguments, please visit the mvNCCheck documentation website.
Further reading
- Understand the entire development workflow for the Intel Movidius Neural Compute Stick.
- Here’s a good write-up on network configuration, which includes mean subtraction and scaling topics.
Why Survival Sim Frostpunk* is Eerily Relevant
The original article is published by Intel Game Dev on VentureBeat*: Why survival sim Frostpunk is eerily relevant. Get more game dev news and related topics from Intel on VentureBeat.
Six days. That's how long I survived Frostpunk*'s harsh world until the people overthrew me.
Eliminating homelessness and mining coal for our massive heat-emitting generator only got me so far — my citizens were overworked, sick, and out of food. The early press demo I tried was a short but acute example of the challenges Frostpunk has in store when it comes to PC on April 24. It's set in an alternate timeline where, in the late 19th century, the world undergoes a new ice age, and you're in charge of the last city on earth.
As the city ruler, you have to make difficult choices. For example, should you legalize child labor to beef up your manpower, or force people to work 24-hour shifts to keep the city running? These and other questions add a dark twist to the city simulation genre, and Frostpunk creator 11 Bit Studios* certainly knows how to craft anxiety-inducing decisions.
In 2014, the Polish developer released This War of Mine*, which was about helping a small group of people survive the aftermath of a brutal war. It was the company's first attempt at making a game about testing a player's morality. Frostpunk builds on those moral quandaries by dramatically increasing the stakes.
It gives you the unenviable task of keeping hundreds of people alive, a number that'll only grow as your city becomes more civilized.
"It's all about your choices: You can be a good leader or you can be a bad leader. But it's not as black and white as you'd think because if you're a good leader and you listen to your people, the people may not always be right. So this is why we thought this could be a really thrilling idea for a game. It shows how morality can be grey rather than black and white," said partnerships manager Pawel Miechowski.
The idea for Frostpunk actually predates This War of Mine's development. Originally, 11 Bit Studios made a prototype for a city simulation game called Industrial. But as with the company's other games, Industrial went through a long process of iteration, to the point where the final game is completely different from the prototype.
The developers created a stark, frozen world for Frostpunk because they wanted a setting that'd be dire enough to push society to its limits. They were partly inspired by Polish sci-fi author Jacek Dukaj and his book "Ice," an alternate history tale about a continent that also faces its own endless winter.
The Psychology of Survival
Making another game about the complexity of human behavior wasn't an accident. According to Miechowski, the team wanted to explore those themes because they realized that most games were about fun and entertainment and lacked "serious substance." The hunger for more dramatic experiences only grew as they got older (many are in their late 30s and 40s), which eventually led them to create This War of Mine and its expansions.
Along with other indie hits (like Papers, Please* and That Dragon, Cancer), This War of Mine helped usher in a new sub-genre that's geared toward a mature audience.
"We're focused on creating serious, thought-provoking games. I wouldn't say they're sad, depression simulators like This War of Mine, but something that makes you think even when you stop playing. These are the kind of games we want to create and we feel good about this philosophy," said Miechowski.
Like This War of Mine and the real-world conflicts it's loosely based on, Frostpunk's dystopia is partly grounded in reality. This is where the team's sociological and psychological backgrounds came in handy, especially for lead designer Kuba Stokalski (who has a master's degree in psychology).
"We looked both into stories of extreme endurance (early Arctic and Antarctic exploration, the case of Joe Simpson, Uruguayan Air Force Flight 571 or the horrid accident of Aron Ralston) as well as more systemic research on how the human psyche performs under long-term duress when survival is at stake — psychological aspects of living on the International Space Station, or being posted to Antarctic research bases among them," Stokalski said in an email.
"All of this yielded a lot of data, but we still had to distill it into playable systems. … We were able to abstract from data those aspects that we felt were both universal, true to the subject matter, and were fertile ground for gameplay systems that felt fresh and interesting. That's how our core social system in the game — Discontent and Hope — was born, for example."
Discontent and Hope (seen as two bars at the bottom of your screen) are a quick way of gauging how your citizens feel at any given moment. Hope is the population's morale, how confident they feel about you and their chances for survival. Discontent rises when the people are angry about the laws you're passing or the way you handle delicate situations (like starvation and rioting in the streets).
A low Hope bar paired with high Discontent is a formula for disaster: People will grow so frustrated that they'll overthrow you, thus ending the game. You have to find a way to balance both Hope and Discontent as you deal with a variety of social and economic challenges.
When it came to designing all the possible laws and decisions a player could make, the developers found inspiration in a curious social phenomenon called creeping normality.
"Invoked in critical thought since before Aristotle and more recently famously formulated by Jared Diamond, it provides a powerful metaphor for how people accept small decisions that change the status quo only to find themselves in deeply undesirable situations down the line," said Stokalski.
"The rise of tyrannies of the twentieth century are the most stark examples and a lot of our decision-making mechanisms are designed around asking the player what small concessions they're willing to make in order to ensure survival — and showing our interpretations of the consequences of such a process."
With Great Power
Providing food, shelter, and heat is your main priority at the beginning of the game. But as your post-apocalyptic city moves past those needs — and develops sophisticated technology and buildings — Frostpunk's more advanced systems start to open up.
No matter how good of a leader you are, you'll never satisfy every person with your choices, and some of them will even form resistance groups. At this point, you'll have to make what 11 Bit Studios describes as an irreversible choice, a radical method of unifying your fractured society. So far, the developer has only revealed the Order and Discipline option, which allows you to become a dictator and use aggressive force to keep people in line.
"We felt that when talking about survival on a social scale (as opposed to survival of individuals), systems such as communism or fascism show how extreme situations can easily lead to extreme measures. … This is the ultimate question of the game that we want to leave the players with: Is survival worth it no matter the cost? In times of growing social friction worldwide, we feel that it's a very pertinent subject to explore," said Stokalski.
Whether you choose to go down the road of totalitarianism is up to you. That's what makes Frostpunk so intriguing: It puts your back up against the wall and asks what kind of leader you'll be. It's an elaborately crafted mirror for the human psyche, the trolley problem writ large. On some level, we all want to believe we're decent people. But whether that'd hold up in a dying world where humanity is struggling to survive, well … you'll just have to find out for yourself.
"If you ask a player to choose between gameplay gain and maintaining his self-concept (the way we view ourselves, a topic explored by famous researchers such as Carl Rogers or Manford Kuhn) a powerful dilemma forms," said Stokalski.
"Most of us view ourselves as good people. But will you sacrifice gameplay progress, even success, to remain a decent person as a player? This War of Mine exploited that to great effect when asking if you're willing to steal from the elderly, and you can expect many choices along this line in Frostpunk."
Spearhead Games*: From Leaving AAA to Making a Time-Traveling Columbo
The original article is published by Intel Game Dev on VentureBeat*: Spearhead Games: From leaving AAA to making a time-traveling Columbo. Get more game dev news and related topics from Intel on VentureBeat.
Like a lot of the indie studios that have cropped up over the last few years, Spearhead Games*' founders, Atul Mehra and Malik Boukhira, were originally AAA developers. The pair met when they were working at EA in Montreal on a project that wasn't looking very healthy.
"We were working on something that was just dying," Mehra remembers. "But it was a coincidence that we teamed up; I was getting ready to leave EA when one of my old friends, who was working with Malik, said we should talk because we were thinking the same thing. So that's when we started to consider doing something together."
And it's when Mehra visited Boukhira at home to have a look at a prototype and got very excited about it. "That's it," he told Boukhira. "We're quitting our jobs."
But Spearhead Games would take another four months before it was established and able to actually start working on projects. The problem? They had no money.
Enter the Canadian Media Fund, a government and media-funded organization that finances and assists in the development of Canadian television and digital media projects. Competition is intense, and to ensure their application for funding had its best shot, they called on friends in the form of one of Red Barrel Studio's founders, Philippe Morin (Outlast), and Minority Media (Time Machine VR).
"They helped us maneuver our first CMF application, so we were really lucky," Mehra says. "That's how we got the funding for our first game."
"We were among the first people from AAA to jump ship and become indie, but there were still people before us," Boukhira adds. "So we learned from their experiences, their mistakes, and their successes. Now there are about 100 indie studios in Montreal, so it's really exploded since then."
It's indicative of the community in the city, the pair agree. It's grown from around 20 studios in 2010 to over 160 now, and Boukhira and Mehra describe it as very close-knit, with studios happily helping each other out.
One of the reasons they wanted to leave AAA behind—aside from the grim future for the game they were working on and the studio itself, which was closed shortly after they left—was the transformation of the industry. They remember, a decade ago, AAA being more like today's indies, but as projects became larger and team sizes blew up, it changed into something very different; a place where experimentation and risk were shied away from.
The Puzzle of Game Development
Spearhead's first game was almost like a mission statement in regards to their ambitions. "With Tiny Brains*," Mehra recalls, "we wanted to try something a bit ridiculous. We did four-player online co-op with a micro team — less than 10 people — and we decided to charge in and release it on PC, PS4*, and PS3* in six languages. Easy peasy. That's how we think! The only thing stopping us is our stupidity sometimes, when we hit a wall."
While a small team created some obstacles, ultimately it has proved to be a boon compared to the AAA environment. "With small teams, you have that maneuverability," explains Mehra. "If someone has a problem, we talk about the problem and come up with a solution together. But in AAA, there are hundreds of people. You have to get the leads together, schedule meetings, and by the time something happens, days will have gone by. It needs to happen that way for them, because of their size."
And it's Spearhead's agility that allowed the studio to finish their last game, Stories: The Path of Destinies*, an action-RPG, in a blisteringly-fast 11 months. It has not, however, always paid off.
ACE – Arena: Cyber Evolution was Spearhead's MOBA (multiplayer online battle arena) bid. They put their own spin on it by creating a futuristic sport, which they dubbed a MOSA (multiplayer online sports arena). Now they think it came out too early, with even Epic approaching them to say it would have seen a lot more success if it were released now.
"At that point, in 2014, people weren't ready," Mehra admits. "Even Twitch* wasn't ready. We proposed a thing to Twitch about spectator interaction and interactive feedback buttons so, as a streamer, if I want to give my audience a choice to engage them, they could all vote and the result would go into the game and trigger an event. They said no, that's not going to happen."
Now Microsoft*'s streaming platform, Mixer*, is trying just that.
Perhaps ACE will make a comeback; it's certainly something Mehra has been thinking about, but that doesn't mean Spearhead regrets its penchant for taking risks. "As a studio we always try to do something different–something innovative or experimental," says Boukhira. "As an indie studio it's a missed opportunity when we retread old ground because we have the chance to try something different. Maybe it's an esport, maybe it's a co-op game, maybe it's an RPG, but there will always be a twist."
It means they're able to pivot to a new genre quickly because the way players interact with the world is what connects their games, not the genre. They're learning, however, that this does create a problem when it comes to explaining what their games actually are.
For Stories, the task was simpler. It was an action-RPG, something familiar and easy to describe. The narrative and the game's consequence-heavy, time-hopping structure allowed Spearhead to still experiment, but all it took was a glance at a screenshot to understand that it's ultimately an action-RPG.
Foresight
The studio's next game, Omensight, shows how they've learned to balance their ambitions with familiarity. Though it's not a direct sequel to Stories, notes Mehra, it's a spiritual successor set in the same universe, and thus a game that fans will recognize. Mechanically, however, it's very different, and while time-travel and action-RPG combat returns, Omensight is very much its own thing: an action adventure game that's also a murder mystery.
"Your main goal as the Harbinger is to solve a murder," Boukhira explains. "The Harbinger is a cosmic fixer; she's a powerful warrior with magical abilities. The world is going to end by nightfall, and you find out this is going to happen because somebody was murdered. So you have 12 hours to save the world, which isn't a lot. That's where the time loop mechanic comes in: it's a tool used to investigate and stretch the timeline. So you can try to change the course of the day just to see what happens, or force people to meet who never normally would have."
Boukhira describes it as being a bit like walking in Columbo's shoes. Just as the cigar-smoking, raincoat-wearing detective sticks to suspects all day, constantly following them around and annoying them, the Harbinger pesters them too. But through multiple timelines.
This is complicated by the fact that a war is raging, with two factions dedicated to the other's destruction. The Harbinger can interact with both, even teaming up with one or the other in different timelines. It means you can see the war from different perspectives, as well as experimenting with how the outcome of the war affects the murder mystery.
Within these factions are key characters–generals, warriors, sorcerers–who serve as both suspects and companions. They all come with their own behaviors and personalities in and out of battle, and the power of the Harbinger is to influence these behaviors in an effort to change the result of the day. That could mean stabbing a general to get in someone's good graces, but it could also mean joining the same general in a battle and helping them out.
Getting Tactical
The combat is a continuation of what Spearhead created with Stories. "What we've tried to do with the combat system is create something that has a good flow of action but gives a lot of creative room and tactical options to the players," Boukhira explains. "For example, one of your powers lets you make a slow-motion sphere around yourself that affects everything but you, which is cool because you can hit people when they're frozen, but it also gives you a window to do other stuff. Like you have a telekinetic power that lets you grab objects in the environment and move them. So you can freeze time and then change the environment while everyone is frozen."
Companions shake things up too. They might run off and use their own tactics to take out an enemy, or they might actually be friends with the group you're about to fight, letting you sidestep the bout entirely. At least in that timeline. Approach them with a different companion and you might need to get your sword bloody.
Ultimately, combat is just another tool that you can use to change the world and solve the murder mystery at the heart of it. The one thing you can't do is talk. The Harbinger is a mute protagonist. It's an old trope that has a mechanical purpose this time. Spearhead wants players to make gameplay decisions exclusively through taking action.
This doesn't mean that there's no way to communicate, however. Actions speak louder than words. Kill someone's enemy and you'll make your intentions clear. And then there's the titular omensight, arguably the Harbinger's handiest ability.
"Omensight isn't just the title of the game but is also a key power of the Harbinger," Boukhira says. "It's a special power that allows you to share a vision with key characters; a vision that you and they know is true. If you show them something disturbing, it might change their perspective on the world, and then you'll see them do something totally different like switch sides or help you out."
Whatever you do, it's always reversible. It's the team's favorite part of the time loop system–the fact that you can always go back and change things. Every action has consequences, potentially dramatically changing the world and at least putting you one step closer to solving the mystery, but you're not stuck with those consequences. Once you've learned how things play out because of your actions, it's just a matter of starting the day over with that foreknowledge.
A release date has yet to be set, but Mehra teases that the studio is going to be picking one in the next month, and it will be in 2018. It's coming out on PC, but Spearhead is also in talks to have it appear on other platforms.
Using the Intel® Distribution for Python* to Solve the Scene-Classification Problem Efficiently
Abstract: The objective of this task is to get acquainted with image and scene categorization. Initially, we extract image features, train a classifier using the training samples, and then assess the classifier on the test set. Later, we consider pre-trained AlexNet and ResNet models, fine-tune them, and apply them to the considered dataset.
Technology stack: Intel® Distribution for Python*
Frameworks: Intel® Optimization for Caffe* and Keras
Libraries used: NumPy, scikit-learn*, SciPy stack
Systems used: Intel® Core™ i7-6500U processor with 16 GB RAM (Model: HP envy17t-s000cto) and Intel® AI DevCloud
Dataset
The scene database provides pictures from eight classes: coast, mountain, forest, open country, street, inside city, tall buildings, and highway. The dataset is divided into a training set (1,888 images) and a testing set (800 images), which are placed in separate folders. The associated labels are stored in "train labels.csv" and "test labels.csv." The SIFT word descriptors are likewise included in the "train sift features" and "test sift features" directories.
The following are a few of the images from the dataset:
Training set




Testing set




K-nearest neighbor (kNN) classifier
Bag of visual words
We run the K-means clustering algorithm to build a visual word dictionary. The feature dimension of the SIFT descriptors is 128. To build the bag of visual words, we use the SIFT word descriptors included in the "train sift features" and "test sift features" directories.
Classifying the test images
The method used to classify the test images is the k-nearest neighbor (kNN) classifier.
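A minimal sketch of this bag-of-visual-words plus kNN pipeline is shown below, using scikit-learn. The function names, the way descriptors and labels are passed in, and the parameter values are illustrative assumptions and do not reproduce the exact experiment.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def bovw_histograms( sift_per_image, kmeans ):
    # Map each image's 128-D SIFT descriptors to visual words and build a
    # normalized histogram of word counts (the bag-of-visual-words vector).
    histograms = []
    for descriptors in sift_per_image:          # descriptors: (num_keypoints, 128)
        words = kmeans.predict( descriptors )
        hist, _ = np.histogram( words, bins=np.arange( kmeans.n_clusters + 1 ) )
        histograms.append( hist / max( hist.sum(), 1 ) )
    return np.array( histograms )

# train_sift / test_sift are lists of (num_keypoints, 128) arrays loaded from the
# "train sift features" / "test sift features" directories; train_labels /
# test_labels come from the CSV files (loading code omitted here).
def run_knn( train_sift, train_labels, test_sift, test_labels,
             num_clusters=100, k=9 ):
    kmeans = KMeans( n_clusters=num_clusters, random_state=0 )
    kmeans.fit( np.vstack( train_sift ) )       # dictionary from all training descriptors

    x_train = bovw_histograms( train_sift, kmeans )
    x_test  = bovw_histograms( test_sift, kmeans )

    knn = KNeighborsClassifier( n_neighbors=k )
    knn.fit( x_train, train_labels )
    return knn.score( x_test, test_labels )     # accuracy on the 800 test images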
Results
Number of Clusters | k value | Accuracy (%) |
---|---|---|
50 | 5 | 49.375 |
50 | 15 | 52.25 |
64 | 15 | 53.125 |
75 | 15 | 52.375 |
100 | 15 | 54.5 |
100 | 9 | 55.25 |
150 | 18 | 53.125 |
Discriminative Classifier: Support Vector Machines (SVMs)
Bag of visual words
We run the K-means clustering algorithm to build a visual word dictionary. The feature dimension of the SIFT descriptors is 128. We use the same bag-of-visual-words representation as above.
SVMs
Support Vector Machines (SVMs) are inherently two-class classifiers. We use one-vs-all SVMs to train the multiclass classifier, as shown in the sketch below.
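Under the same assumptions as the kNN sketch above, the one-vs-all SVM step might look like the following; scikit-learn's LinearSVC trains one binary classifier per class in its default one-vs-rest mode.
from sklearn.svm import LinearSVC

def run_svm( x_train, train_labels, x_test, test_labels ):
    # x_train / x_test are the bag-of-visual-words histograms built as in the kNN sketch above.
    svm = LinearSVC( multi_class='ovr', C=1.0, max_iter=10000 )  # one-vs-rest
    svm.fit( x_train, train_labels )
    return svm.score( x_test, test_labels )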
Results
Number of Clusters | Accuracy (%) |
---|---|
50 | 40.625 |
64 | 46.375 |
75 | 47.375 |
100 | 52.5 |
150 | 51.875 |
Transfer Learning and Fine Tuning
Transfer learning
A popular approach in deep learning in which pre-trained models developed for one task are used as the starting point for a model on a second task.
Fine tuning
This process takes a network model that has already been trained for a given task and adapts it to perform a second, similar task.
How to use it
- Select source model: A pre-trained source model is chosen from available models. Many research institutions release models trained on large and challenging datasets, which can be included in the pool of candidate models from which to choose.
- Reuse model: The pre-trained model can then be used as the starting point for a model on the second task of interest. This may involve using all or parts of the model, depending on the modeling technique used.
- Tune model: Optionally, the model may need to be adapted or refined on the input-output pair data available for the task of interest.
When and why to use it
Transfer learning is an optimization; it's a shortcut to save time or get better performance.
In general, it is not obvious that there will be a benefit to using transfer learning in the domain until after the model has been developed and evaluated.
There are three possible benefits to look for when using transfer learning:
- Higher start: The initial skill (before refining the model) on the source model is higher than it otherwise would be.
- Higher slope: The rate of improvement of skill during training of the source model is steeper than it otherwise would be.
- Higher asymptote: The converged skill of the trained model is better than it otherwise would be.
We apply transfer learning with the pre-trained AlexNet model to demonstrate the results on the chosen subset of the Places database. We replace only the class-score layer with a new fully connected layer having eight nodes, one for each of the eight classes. A minimal sketch of this kind of fine tuning is shown below.
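The following Keras sketch illustrates the idea of replacing the class-score layer and fine-tuning on eight classes. It uses Keras' built-in ResNet50 ImageNet weights purely for illustration (the article's experiments used Intel Optimization for Caffe models), and the training call is left commented out because the data-loading details are not shown here.
from keras.applications import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model
from keras.optimizers import SGD

NUM_CLASSES = 8   # coast, mountain, forest, open country, street, inside city, tall buildings, highway

# Load a model pre-trained on ImageNet, without its original class-score layer.
base = ResNet50( weights='imagenet', include_top=False, input_shape=( 224, 224, 3 ) )

# Replace the class-score layer with a new fully connected layer of eight nodes.
x = GlobalAveragePooling2D()( base.output )
predictions = Dense( NUM_CLASSES, activation='softmax' )( x )
model = Model( inputs=base.input, outputs=predictions )

# Transfer learning: freeze the pre-trained layers so only the new layer is trained.
for layer in base.layers:
    layer.trainable = False
model.compile( optimizer=SGD( lr=1e-3, momentum=0.9 ),
               loss='categorical_crossentropy', metrics=['accuracy'] )
# model.fit( x_train, y_train, epochs=10, validation_data=( x_test, y_test ) )

# Fine tuning: optionally unfreeze the last few layers and continue with a small learning rate.
for layer in base.layers[-10:]:
    layer.trainable = True
model.compile( optimizer=SGD( lr=1e-4, momentum=0.9 ),
               loss='categorical_crossentropy', metrics=['accuracy'] )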
Results
Architectures Used | Top-1 Accuracy (%) | Top-3 Accuracy (%) | Top-5 Accuracy (%) |
---|---|---|---|
AlexNet | 51.25 | 68.65 | 81.35 |
ResNet | 53.45 | 74 | 87.25 |
GoogLeNet | 52.33 | 71.36 | 82.84 |
Top-1 Accuracy: Accuracies obtained while considering the top-1 prediction.
Top-3 Accuracy: Accuracies obtained while considering the top-3 predictions.
Top-5 Accuracy: Accuracies obtained while considering the top-5 predictions.
Training Time Periods (For Fine Tuning)
Architecture Used | System | Training Time |
---|---|---|
AlexNet | Intel® AI DevCloud | ~23 min |
AlexNet | HP envy17t-s000cto | ~95 min |
ResNet | Intel® AI DevCloud | ~27 min |
ResNet | HP envy17t-s000cto | ~135 min |
GoogLeNet | Intel® AI DevCloud | ~23 min |
GoogLeNet | HP envy17t-s000cto | ~105 min |
Note: We used smaller datasets and experimented to test the speeds and accuracies that can be achieved using the Intel Distribution for Python.
Conclusion
From the above experiments, it is quite clear that deep-learning methods are performing much better than extracting the features using traditional methods and applying machine-learning techniques for the scene-classification problem.
In the future, I want to design a new deep neural network by making some changes to the proposed architecture so that accuracy can be further increased. I would also like to deploy it on AWS* DeepLens and make it run in real time.
Click GitHub for source code.
Please visit Places for more advanced techniques and datasets.
Intel® Data Analytics Acceleration Library API Reference
Intel® Data Analytics Acceleration Library (Intel® DAAL) is the library of Intel® architecture optimized building blocks covering all stages of data analytics: data acquisition from a data source, preprocessing, transformation, data mining, modeling, validation, and decision making.
Algorithms implemented in the library include:
- Moments of low order and quantiles
- K-Means clustering
- Classification algorithms, including boosting algorithms and Naïve Bayes, Support Vector Machine (SVM), and multi-class classifiers
- Neural network algorithms
Intel DAAL provides application programming interfaces (APIs) for C++, Java*, and Python* languages.
For the Developer Guide and previous versions of API reference, see Intel® Data Analytics Acceleration Library - Documentation.
Code Sample: Merging Masked Occlusion Culling Hierarchical Buffers
File(s): | Download Download 2 |
License: | Apache-2 |
Optimized for... | |
---|---|
Operating System: | Microsoft* Windows® 10 (64 bit) |
Hardware: | N/A |
Software: (Programming Language, tool, IDE, Framework) | C++, Visual Studio 2015, LLVM |
Prerequisites: | Familiarity with Visual Studio, 3D graphics, parallel processing. |
Tutorial: | Merging Masked Occlusion Culling Hierarchical Buffers for Faster Rendering |
Introduction
Efficient occlusion culling in dynamic scenes is a very important topic for the game and real-time graphics community, as it accelerates rendering. Masked Software Occlusion Culling [J. Hasselgren, M. Andersson, T. Akenine-Möller] presented a novel algorithm, optimized for SIMD-capable CPUs, that culled 98% of all triangles culled by a traditional occlusion culling algorithm. This update to the original Masked Occlusion library presents an addition to the original work that addresses the problem where silhouette edges can bleed through the buffer when submitting complex unsorted geometry, by splitting a scene into multiple buffers that better fit the local dynamic ranges of the geometry and that can be computed concurrently.
Get Started
Core masked occlusion culling library
The code additions form part of the core Intel Masked Occlusion library and have been integrated into the original source files. The library comes with Visual Studio projects to build the algorithm as a static library for inclusion in another project, a performance benchmark test, and a validation sample that tests the internal correctness of the masked occlusion results. The library does not include a sample of using the library in a gaming environment.
Related samples
An early example of the masked occlusion library used in a 3D workload can be found here, although this sample is in maintenance mode and does not use the latest build of the masked occlusion library.
References
Merging Masked Occlusion Culling Hierarchical Buffers for Faster Rendering
Original Masked Occlusion Whitepaper.
Updated Log
Created March 23, 2017
Merging Masked Occlusion Culling Hierarchical Buffers for Faster Rendering
Abstract
Efficient occlusion culling in dynamic scenes can accelerate rendering, which makes it an essential topic for the game and real-time graphics community. Masked Software Occlusion Culling, the paper published by J. Hasselgren, M. Andersson and T. Akenine-Möller, presented a novel algorithm optimized for SIMD-capable CPUs that culled 98 percent of all triangles culled by a traditional occlusion culling algorithm. While highly efficient and accurate for many use cases, there were still some issues that the heuristics didn't adequately solve. Here, we present an addition to the preceding work by Andersson et al. that addresses many of these problem cases by splitting a scene into multiple buffers that better fit local dynamic ranges of geometry and that can be computed concurrently. We then augment the algorithm's discard heuristics and combine the partial result buffers into a new hierarchical depth buffer, on which applications can perform reliably accurate, efficient occlusion queries.
Introduction
Masked Software Occlusion Culling was invented by J. Hasselgren, M. Andersson, and T. Akenine-Möller of Intel in 2015. It was designed for efficient occlusion culling in dynamic scenes, making it well suited to the game and real-time graphics community. The benefit of the Masked Software Occlusion Culling algorithm subsequently proposed by Andersson, Hasselgren, and Akenine-Möller in 2016 was that it culled 98 percent of all triangles culled by a traditional occlusion culling algorithm, while being significantly faster than previous work. In addition, it took full advantage of single instruction, multiple data (SIMD) instruction sets and, unlike graphics processing unit (GPU)-based solutions, didn't introduce any latency into the system. This is important to game-engine developers, as it frees the GPU from needlessly rendering non-visible geometry, letting it render other, richer game visuals instead.
Figure 1. Left: A visualization of the original Masked Occlusion hierarchical depth representation for the castle scene by Intel, where dark is farther away; conservative merging errors are highlighted in red. Middle: The in-game view of the castle scene, including bounding boxes for meshes. Right: A visualization of the Masked Occlusion hierarchical depth representation for the castle scene using the merging of two separate hierarchical buffers, with significantly improved accuracy.
An updated version [HAAM16] of the algorithm inspired by quad-fragment merging [FBH∗10], which is less accurate but performs better, was also added to the Masked Occlusion library. This approach works best if the incoming data is roughly sorted front to back, which also improves efficiency by reducing overdraw in the depth buffer.
Figure 2. Typical workflow for Masked Occlusion Culling.
A typical workflow for integrating Masked Occlusion Culling into a game engine is shown in Figure 2. This workflow mirrors the traditional graphics pipeline and has been used in this format by developers—including Booming Games* in their title Conqueror's Blade*— with good results.
Figure 3. Typical Masked Occlusion Culling buffer for Conqueror's Blade*
However, some game workloads showed that one area where Masked Occlusion Culling proved less effective was when rendering very large meshes with significant depth overlap, as the overlap made it impossible to do accurate sorting. This proved particularly problematic for the updated [HAAM16] algorithm. Specifically, the issues manifested when rendering a mixture of foreground assets and terrain patches for expansive landscapes. A single terrain patch covered a very wide depth range and couldn't be sorted relative to the foreground occluders in an optimal order. These discontinuities are inherent in the Masked Software Occlusion HiZ buffer creation, as the current heuristics used for discarding layers while constructing the buffer did not have enough context regarding future geometry to keep the most important data. Without the full context of the incoming geometry, the heuristics have to take a conservative approach during depth selection, which increases the number of later occlusion queries that return visible. This, in turn, means the GPU has to render geometry that is eventually occluded and never contributes to the overall scene.
To solve this problem, we [authors Leigh Davies and Filip Strugar] have added the functionality to merge multiple Masked Occlusion hierarchical depth buffers in the Masked Software Occlusion library. This allows the developer to utilize a strategy of subgrouping scene geometry and computing partial results buffers for each subgroup. Subgroups are chosen for their tighter dynamic range of depth values, as well as for geometry sorting behavior. A subgroup of foreground objects and another subgroup of terrain objects is a common situation. The partial occlusion results for such subgroups are later merged into a single hierarchical depth buffer. This merging of the partial buffers uses an extension of the existing discard heuristic for combining layers.
Previous Work
The Masked Software Occlusion rasterization algorithm is similar to any standard two-level hierarchical rasterizer [MM00]. The general flow of the rasterization pipeline is shown in Figure 4:
Figure 4. Masked Occlusion Rasterization Pipeline, shown for Intel® Advanced Vector Extensions 2 (Intel® AVX2).
Both the triangle setup and the tile traversal code have been heavily optimized to use SIMD, with the number of triangles and pixels that can be processed in parallel varying, depending on the flavor of SIMD being used. There are two main exceptions where the Masked Occlusion Culling algorithm differs from a standard software rasterizer, which are described below. First, rather than process a scanline at a time, it instead efficiently computes a coverage mask for an entire tile in parallel, using the triangle edges.
Figure 5. An example triangle rasterized on an Intel® AVX2 capable processor. We traverse all 32 x 8 pixel tiles overlapped by the triangle's bounding box and compute a 256-bit coverage mask using simple bit operations and shifts.
Since Intel AVX2 supports 8-wide SIMD with 32-bit precision, we use 32 x 8 as our tile size, as shown in Figure 5 (tile sizes differ for the Intel® Streaming SIMD Extensions (Intel® SSE2/SSE4.1) and Intel® Advanced Vector Extensions 512 (Intel® AVX-512) implementations). This allows the algorithm to very efficiently compute coverage for 256 pixels in parallel.
The second difference is the hierarchical depth buffer representation, which decouples depth and coverage data, bypassing the need to store a full resolution depth buffer. The Masked Software Occlusion rasterization algorithm uses an inexpensive shuffle to rearrange the mask so that each SIMD-lane maps to a more well-formed 8 x 4 tile. For each 8 x 4 tile, the hierarchical depth buffer stores two floating-point depth values Zmax0 and Zmax1, and a 32-bit mask indicating which depth value each pixel is associated with. An example of a tile populated by two triangles using the Masked Occlusion algorithm can be found in Figure 6.
Figure 6. In this example, an 8 x 4 pixel tile is first fully covered by a blue polygon, which is later partially covered by a yellow triangle. Left: our HiZ-representation seen in screen space, where each sample belongs either to Zmax0 or Zmax1. Right: along the depth axis (z), we see that the yellow triangle is closer than the blue polygon. All the yellow samples (left) are associated with Zmax1 (working layer), while all blue samples are associated with Zmax0 (reference layer).
Limitation of a Single Hierarchical Depth Buffer
Given that we store only two depth values per tile, we require a method to conservatively update the representation each time a triangle is rasterized that partially covers a tile. Referring to the pseudo-code in Figure 7, we begin by assigning Zmax0 as the reference layer, representing the furthest distance visible in the tile, and Zmax1 as the working layer that is partly covered by triangle data.
After determining triangle coverage, we update the working layer as Zmax1 = max (Zmax1, Zmaxtri), where Zmaxtri is the maximum depth of the triangle within the bounds of the tile, and combine the masks. The tile is covered when the combined mask is full, and we can overwrite the reference layer and clear the working layer.
function updateHiZBuffer(tile, tri)
    // Discard working layer heuristic
    dist1t = tile.zMax1 - tri.zMax
    dist01 = tile.zMax0 - tile.zMax1
    if (dist1t > dist01)
        tile.zMax1 = 0
        tile.mask = 0
Figure 7. Update tile pseudo code.
In addition to the rules above, we need a heuristic for when to discard the working layer. This helps prevent the silhouettes of existing data in the buffer leaking through occluders that are rendered nearer the camera if the data is not submitted in a perfect front-to-back order, as illustrated in Figure 8. As shown above in the updateHiZBuffer() function, we discard the working layer if the distance to the triangle is greater than the distance between the working and reference layers.
The Masked Occlusion update procedure is designed to guarantee that Zmax0 ≥ Zmax1, so we may use the signed distances for a faster test, since we never want to discard a working layer if the current triangle is farther away. The rationale is that a large discontinuity in depth indicates that a new object is being rendered, and that consecutive triangles will eventually cover the entire tile. If the working layer isn't covered, the algorithm has still honored the requirement to have a conservative representation of the depth buffer.
Figure 8. Top: two visualizations of the hierarchical depth buffer. The left image is generated without using a heuristic for discarding layers. Note that the silhouettes of background objects leak through occluders, appearing as darker gray outlines on the lighter gray foreground objects. The right image uses our simple layer discard heuristic and retains nearly all the occlusion accuracy of a conventional depth buffer. Bottom: our discard heuristic applied to the sample tile from Figure 2. The black triangle discards the current working layer, and overwrites the Z1max value, according to our heuristic. The rationale is that a large discontinuity in depth indicates that a new object is being rendered, and that consecutive triangles will eventually cover the entire tile.
Referring to Figure 9, the leaking of remaining silhouette edges through occluders happens because the heuristic for discarding working layers is not triggered, since the reference layer is a long way behind the working layer. In the problem case, the reference layer contains the clear value, resulting in a wide dynamic range to the depth values. The working layer is updated with the most conservative value from the working layer and the new triangle. This is in spite of the fact that consecutive triangles do eventually cover the entire tile, and the working layer could have used the nearer Zmax value from the incoming triangles.
Figure 9. Silhouette bleed-through in a well-sorted scene using only a single hierarchical depth buffer.
Removing Silhouette Bleed-through on Terrain
The final effect of silhouette bleed-through at high resolution is shown in Figure 9. The most prominent case of silhouette bleed-through occurs when an object nearer to the viewer is rendered in a tile containing a distant object and the clear color. It is caused by the need to keep the most conservative value in the tile layer without context of what else may be rendered later. One potential solution would be to only merge data into a tile when we have the full data of the new mesh being added, but that would require being able to store more layers in the hierarchical depth buffer.
An alternative way to solve silhouette bleeding is to render the terrain and the foreground objects into their own hierarchical depth buffers, and render them individually. This significantly reduces discontinuities in the hierarchical depth buffers. The foreground objects have almost no bleeding, as the rough front-to-back sort is enough to ensure triangles are rendered in an order that is optimal for the Masked Occlusion discard algorithm. The landscape has some bleeding issues due to internal sorting of triangles within the single mesh, but these are much more localized. The separated hierarchical depth buffers are shown in Figure 10. That just leaves the problem of merging the terrain mesh and the existing foreground objects to generate a final hierarchical depth buffer that may be used for culling.
Figure 10. Castle scene with terrain and foreground objects in separate buffers.
Merging the Buffers
The new Merge function added to the Masked Occlusion API does just that: it takes a second hierarchical depth buffer and merges its data into the first. The merging algorithm works on the same 8 x 4 subtile basis and uses the same SIMD instructions as the merging heuristic used for rasterization, allowing multiple subtiles to be processed in parallel. The merge code has been implemented for both the original merge algorithm [AHAM15] and the updated version [HAAM16]. The flow of the merge algorithm is as follows:
1. Calculate a conservative depth value for the tile using the reference and working layers. This is trivial for HAAM16, as it is simply the reference layer; it is slightly more complex for AHAM15.

    New Reference Layer = _mm256_max_ps(Valid Reference A[0], Valid Reference B[0]);

2. Compare the working layer of buffer A with the new reference layer. Mask out all subtiles failing the depth test and update the new reference layer with the result of the depth test.
3. Treat the working layer of buffer B as an incoming triangle:
- Compare the working layer of buffer B with the new reference layer and mask out all subtiles failing the depth test.
- Use the distance heuristic to discard layer 1 if the incoming triangle is significantly nearer to the observer than the new working layer.
- Update the new mask with the incoming triangle coverage.
- Compute the new value for zMin[1]. Depending on whether the layer is updated, discarded, fully covered, or not updated, this has one of four outcomes: zMin[1] = min(zMin[1], zTriv); zMin[1] = zTriv; zMin[1] = FLT_MAX; or zMin[1] unchanged.
- Propagate zMin[1] back to zMin[0] if the tile was fully covered, and update the mask.
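For illustration only, the scalar, single-subtile sketch below mirrors the control flow of the steps above. It is not the library's AVX2 implementation; the SubTile layout (two depth values plus a 32-bit coverage mask, consistent with the 12 bytes per subtile mentioned later), the depth convention (larger is farther), and all names are assumptions made for the sketch.

```python
from dataclasses import dataclass

FLT_MAX = float("inf")        # stands in for FLT_MAX / an empty working layer
FULL_MASK = (1 << 32) - 1     # one coverage bit per pixel of an 8 x 4 subtile

@dataclass
class SubTile:
    z_ref: float   # zMin[0], conservative reference layer
    z_work: float  # zMin[1], working layer
    mask: int      # per-pixel coverage bits of the working layer

def merge_subtile(a: SubTile, b: SubTile) -> SubTile:
    # Step 1: conservative reference depth for the merged tile
    # (scalar analogue of the _mm256_max_ps above).
    out = SubTile(z_ref=max(a.z_ref, b.z_ref), z_work=FLT_MAX, mask=0)

    # Step 2: keep A's working layer only where it passes the depth test
    # against the new reference layer.
    if a.z_work < out.z_ref:
        out.z_work, out.mask = a.z_work, a.mask

    # Step 3: treat B's working layer as an incoming triangle.
    if b.z_work < out.z_ref:
        if (out.z_work - b.z_work) > (out.z_ref - out.z_work):
            # Distance heuristic: B is significantly nearer, so discard layer 1.
            out.z_work, out.mask = b.z_work, b.mask
        else:
            out.z_work = min(out.z_work, b.z_work)   # update layer 1
            out.mask |= b.mask
    # Otherwise zMin[1] is left unchanged.

    # Propagate zMin[1] back to zMin[0] if the subtile is now fully covered.
    if out.mask == FULL_MASK:
        out.z_ref, out.z_work, out.mask = out.z_work, FLT_MAX, 0
    return out
```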
In practice, we found that for a scene like the castle, we only need to use two layers; with minor modifications, the code could combine multiple buffers. The final merged buffer is shown in Figure 11. The original silhouette issues have been solved completely.
Figure 11. Visual representation of the final hierarchical depth buffer created from the merged foreground and terrain buffers.
The merging of buffers does not solve bleed issues local to an individual buffer—such as the internal issues on the terrain—but it does ensure they don't propagate forward if covered by data in another set of geometry. In the case of the castle, they are covered by the foreground objects.
Table 1 shows the performance of the merge algorithm at different resolutions. Performance scales with resolution, and the code for merging the Hi-Z data uses the same set of SIMD instructions as the rest of the MOC algorithm. The data presented here was generated using Intel AVX2, processing 8 subtiles in parallel; with Intel AVX-512 this can be expanded to up to 16 subtiles.
Table 1. Merge cost relative to resolution; performance measured on Intel® Core™ i7-6950X processor.
| Resolution | Terrain Rasterization Time (ms) | Foreground Rasterization Time (ms) | Total Rasterization Time (ms) | Merge Time (ms) | % Time Spent in Merge |
|---|---|---|---|---|---|
| 640 x 400 | 0.19 | 0.652 | 0.842 | 0.008 | 0.9% |
| 1280 x 800 | 0.26 | 0.772 | 1.068 | 0.027 | 2.5% |
| 1920 x 1080 | 0.31 | 0.834 | 1.144 | 0.051 | 4.4% |
As an optimization, the merge function skips tiles that don't require a merge because only one buffer has valid data in the tile. Although there is a fixed cost for traversing the data set to check this, the low memory footprint of the Masked Occlusion hierarchical depth buffer is the primary reason for the performance of the merge. At 1920 x 1080 resolution, the screen consists of 64,800 8 x 4 subtiles that require only 12 bytes of storage each. The merge function only needs to read 760 KB, compared to over 8.1 MB for a traditional 32-bit depth buffer. Additionally, by using Intel AVX2, we are able to process eight subtiles in parallel. The timing in Table 1 refers to single-threaded performance; experiments on threading the merge showed only minimal benefits due to hitting memory limitations. A 1080p buffer can be merged in under 0.5 percent of a frame on a single core for a PC title running at 60 frames per second.
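The numbers above can be verified with some quick arithmetic; the 12-byte subtile layout is taken from the article, the byte counts are exact, and the quoted KB/MB figures are rounded.

```python
subtiles = (1920 // 8) * (1080 // 4)   # 240 * 270 = 64,800 subtiles of 8 x 4 pixels
hiz_bytes = subtiles * 12              # 777,600 bytes, roughly the 760 KB quoted above
depth_bytes = 1920 * 1080 * 4          # 8,294,400 bytes for a full 32-bit depth buffer
print(subtiles, hiz_bytes, depth_bytes)
```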
Using Buffer Merging to Parallelize Hierarchical Depth Buffer Creation
The Masked Occlusion library already provides a method for parallelizing the creation of the hierarchical depth buffer: a system of tiled rendering in which incoming meshes are sent to a job system and rendered in parallel, the triangle output of the geometry transformation stage is stored in binned lists representing screen-space tiles, and each thread then rasterizes the geometry of one screen-space tile. The merging of hierarchical depth buffers offers an alternative approach that works at a much coarser grain and doesn't require the geometry to be stored in a temporary structure. A typical setup is shown in Figure 12: the task system initially schedules two occluder rendering tasks and creates a merge task that is triggered when both render tasks are complete.
Figure 12. Potential multithreading setup for Masked Occlusion Culling.
Once the merge task is complete, the rendering application's actual occlusion queries can be issued on as many threads as required. In theory, the foreground occluder render tasks could be further subdivided. That may be advantageous if the amount of work between them is very uneven, but the cost of each additional task would be an extra merge task, and the merge function would need to be further modified to manage multiple buffers if required. One side benefit of threading in this way is that it removes the memory and associated bandwidth cost of writing out the geometry between the transform and rasterization passes. Overall, this approach of splitting a scene into two or more Masked Occlusion Culling buffers, and using independent threads to process them, allows for a functional threading paradigm that is suitable for engines that do not support the fine-grained tasking system required for the Intel tile-binning algorithm.
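As a minimal sketch of this coarse-grained task setup, the Python snippet below uses concurrent.futures purely to illustrate the dependency between the two render tasks and the merge task; the buffers and the render/merge functions are placeholders, not the Masked Occlusion Culling API.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Illustrative stand-ins for hierarchical depth buffers and occluder lists;
# in a real engine these would be Masked Occlusion Culling buffers and meshes.
terrain_buffer, foreground_buffer = {"tiles": []}, {"tiles": []}
terrain_meshes, foreground_meshes = [], []

def render_occluders(buffer, meshes):
    """Placeholder: rasterize one group of occluders into its own buffer."""
    buffer["tiles"].extend(meshes)
    return buffer

def merge_buffers(dst, src):
    """Placeholder for the Merge step described above."""
    dst["tiles"].extend(src["tiles"])
    return dst

with ThreadPoolExecutor() as pool:
    render_tasks = [
        pool.submit(render_occluders, terrain_buffer, terrain_meshes),
        pool.submit(render_occluders, foreground_buffer, foreground_meshes),
    ]
    wait(render_tasks)  # the merge task only runs once both render tasks are done
    merged = pool.submit(merge_buffers, foreground_buffer, terrain_buffer).result()
    # Occlusion queries against `merged` can now be issued from as many threads as needed.
```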
Conclusion
Our extensions to Masked Software Occlusion Culling result in a flexible solution that increases the final accuracy of the depth buffer for scenes that cannot be sufficiently presorted, without compromising the performance of the original Masked Occlusion algorithm. The merge time in our approach scales linearly with resolution and is independent of the geometric complexity of the scene. In our test cases, the merge represented only a small part of the total time required for the culling system. A final benefit is that subgrouping partial results enables new thread-level parallelism opportunities. Together, these improvements produce robust, fast, and accurate CPU-based geometry culling, and they significantly reduce the barrier to entry for a game engine to adopt Masked Software Occlusion Culling, freeing GPU rendering resources to render richer game experiences.
References
[AHAM15] M. Andersson, J. Hasselgren, T. Akenine-Möller: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), 188:1–188:9.
[FBH∗10] K. Fatahalian, S. Boulos, J. Hegarty, K. Akeley, W. R. Mark, H. Moreton, P. Hanrahan: Reducing Shading on GPUs Using Quad-Fragment Merging. ACM Transactions on Graphics 29, 4 (2010), 67:1–67:8.
[HAAM16] J. Hasselgren, M. Andersson, T. Akenine-Möller: Masked Software Occlusion Culling. In High Performance Graphics (2016).
[MM00] J. McCormack, R. McNamara: Tiled Polygon Traversal Using Half-Plane Edge Functions. In Graphics Hardware (2000), pp. 15–21.
Optimizing VR Hit Game Space Pirate Trainer* to Perform on Intel® Integrated Graphics
Space Pirate Trainer* was one of the original launch titles for HTC Vive*, Oculus Touch*, and Windows Mixed Reality*. Version 1.0 launched in October 2017, and swiftly became a phenomenon, with a presence in VR arcades worldwide, and over 150,000 units sold according to RoadtoVR.com. It's even the go-to choice of many hardware producers to demo the wonders of VR.
"I'm Never Going to Create a VR Game!" These are the words Dirk Van Welden, CEO of I-Illusions and Creative Director of Space Pirate Trainer. He made the comment after first using the Oculus Rift* prototype. He received the unit as a benefit of being an original Kickstarter* backer for the project, but was not impressed, experiencing such severe motion sickness that he was ready to give up on VR in general.
Luckily, positional tracking came along and he was ready to give VR another shot with a prototype of the HTC Vive. After one month of experimentation, the first prototype of Space Pirate Trainer was completed. Van Welden posted it to the SteamVR* forums for feedback and found a growing audience with every new build. The game caught the attention of Valve*, and I-Illusions was invited to their SteamVR developer showcase to introduce Space Pirate Trainer to the press. Van Welden knew he really had something, so he accepted the invitation. The game became wildly popular, even in the pre-beta phase.
What is Mainstream VR?
Mainstream VR is the idea of lowering the barrier of entry to VR, enabling users to play some of the most popular VR games without heavily investing in hardware. For most high-end VR experiences, the minimum specifications on the graphics processing side require an NVIDIA GeForce* GTX 970 or greater. In addition to purchasing an expensive discrete video card, the player also needs to pay hundreds of dollars (USD) for the headset and sensor combo. The investment can quickly add up.
But what if VR games could be made to perform on systems with integrated graphics? This would mean that any on-the-go user with a top-rated Microsoft Surface* Pro, Intel® NUC, or Ultrabook™ device could play some of the top VR experiences. Pair this with a Windows Mixed Reality headset that doesn't require external trackers, and you have a setup for VR anywhere you are. Sounds too good to be true? Van Welden and I thought so too, but what we found is that you can get a very good experience with minimal trade-offs.
Figure 1. Space Pirate Trainer* - pre-beta (left) and version 1.0 (right).
Rendering Two Eyes at 60 fps on a 13 Watt SoC with Integrated Graphics? What?
So, how do you get your VR game to run on integrated graphics? The initial port of Space Pirate Trainer ran at 12 fps without any changes from the build targeting the NVIDIA GeForce GTX 970. The required frame rate for the mainstream space is only 60 fps, but that still left 48 fps to recover through intense optimization. Luckily, we found a lot of low-hanging fruit: things that greatly affected performance but whose removal brought little compromise in aesthetics. Here is a side-by-side comparison:
Figure 2. Comparison of Space Pirate Trainer* at 12 fps (left) and 60 fps (right).
Getting Started
VR developers probably have little experience optimizing for integrated graphics, and there are a few very important things to be aware of. In desktop profiling and optimization, thermals (heat generation) are generally not an issue: you can typically consider the CPU and GPU in isolation and assume that each will run at its full clock rate. Unfortunately, this isn't the case for SoCs (systems on a chip). "Integrated graphics" means that the GPU is integrated onto the same chip as the CPU. Every time electricity travels through a circuit, some amount of heat is generated and radiates throughout the part. Since this happens on both the CPU and the GPU side, it can contribute a great deal of heat to the total package. To make sure the chip isn't damaged, clock rates for either the CPU or the GPU are throttled to allow for intermittent cooling.
For consistency, it's very helpful to get a test system with predictable thermal patterns to use as a baseline. This enables you to always have a solid reference point to go back to and verify performance improvements or regressions as you experiment. For this, we recommend using the GIGABYTE Ultra-Compact PC GB-BKi5HT2-7200* as a baseline, as its thermals are very consistent. Once you've got your game running at a consistent 60 fps on this machine, you can target individual original equipment manufacturer (OEM) machines and see how they perform. Each laptop has its own cooling solution, so it helps to run the game on popular machines to make sure their cooling solutions can keep up.
Figure 3. System specs for the GIGABYTE Ultra-Compact PC GB-BKi5HT2-7200*.
* Product may vary based on local distribution.
- Features the latest 7th generation Intel® Core™ processors
- Ultra compact PC design at only 0.6L (46.8 x 112.6 x 119.4mm)
- Supports 2.5” HDD/SSD, 7.0/9.5 mm thick (1 x 6 Gbps SATA 3)
- 1 x M.2 SSD (2280) slot
- 2 x SO-DIMM DDR4 slots (2133 MHz)
- Intel® IEEE 802.11ac, Dual Band Wi-Fi & Bluetooth 4.2 NGFF M.2 card
- Intel® HD Graphics 620
- 1 x Thunderbolt™ 3 port (USB 3.1 Type-C™ )
- 4 x USB 3.0 (2 x USB Type-C™ )
- 1 x USB 2.0
- HDMI 2.0
- HDMI plus Mini DisplayPort Outputs (Supports dual displays)
- Intel® Gigabit LAN
- Dual array Microphone (supports Voice wake up & Cortana)
- Headphone/Microphone Jack
- VESA mounting Bracket (75 x 75mm + 100 x 100mm)
- * Wireless module inclusion may vary based on local distribution.
The above system is currently used at Intel for all of our optimizations to achieve consistent results. For the purpose of this article and the data we provide, we'll be looking at the Microsoft Surface Pro:
Figure 4. System specs for the Microsoft Surface* Pro used for testing.
| Item | Value |
|---|---|
| Device name | GPA-MSP |
| Processor | Intel® Core™ i7-7660U CPU @ 2.50 GHz |
| Installed RAM | 16.0 GB |
| Device ID | F8BBBB15-B0B6-4B67-BA8D-58D5218516E2 |
| Product ID | 00330-62944-41755-AAOEM |
| System type | 64-bit operating system, x64-based processor |
| Pen and touch | Pen and touch support with 10 touch points |
For this intense optimization exercise, we used Intel® Graphics Performance Analyzers (Intel® GPA), a suite of graphics analysis tools. I won't go into the specifics of each, but for the most part we are going to be utilizing the Graphics Frame Analyzer. Anyway, on to the optimizations!
Optimizations
To achieve 60 fps without compromising much in the area of aesthetics, we tried a number of experiments. The following list shows the optimizations that gave the biggest bang for the buck in terms of performance gained and art changes minimized. Of course, every game is different, but these are a collection of great first steps for you to experiment with.
Shaders—floor
The first optimization is perhaps the easiest and most effective change to make. The floor of your scenes can take up quite a bit of pixel coverage.
Figure 5. Floor scene, with the floor highlighted.
The above figure shows the floor highlighted in the frame buffer in magenta. In this image, the floor takes up over 60 percent of the scene in terms of pixel coverage. This means that material optimizations affecting the floor can have a huge impact on staying below frame budget. Space Pirate Trainer was using the standard Unity* shader with reflection probes to get real-time reflections on the surface. Reflections are an awesome feature to have, but a bit too expensive to calculate and sample every frame on our target system. We replaced the standard shader with a simple Lambert* shader. Not only did this save the reflection sampling, it also avoided the extra passes required for dynamic lights marked as 'Important' when using the forward rendering path used by Windows Mixed Reality titles.
Figure 6. Original measurements for rendering the floor, before optimizations.
Figure 7. Measurements for rendering the floor, after optimizations.
Looking at the performance comparison above, we can see that the original cost of rendering the floor was around ~1.5 ms per frame, and the cost with the replacement shader was only ~0.3 ms per frame. This is a 5x performance improvement.
Figure 8. The assembly for our shader was reduced from 267 instructions (left) down to only 47 (right) and had significantly fewer pixel shader invocations per sample.
As shown in the figure above, the assembly for our shader was reduced from 267 instructions down to only 47, with significantly fewer pixel shader invocations per sample.
Figure 9. Side-by-side comparison of the same scene, with a standard shader on the left and the optimized Lambert* shader on the right.
The above image shows the high-end build with no changes except for the replacement of the standard shader with the Lambert shader. Notice that after all these reductions and cuts, we're still left with a good-looking, cohesive floor. Microsoft has also created optimized versions of many of the Unity built-in shaders and added them as part of their Mixed Reality Toolkit. Experiment with the materials in the toolkit and see how they affect the look and performance of your game.
Shaders—material batching with unlit shaders
Draw call batching is the practice of bundling up separate draw calls that share common state properties into batches. The render thread is often a point of bottleneck contention, especially on mobile and VR, and batching is one of the main tools in your utility belt for alleviating driver bottlenecks. The common properties required to batch draw calls, as far as the Unity engine is concerned, are materials and the textures used by those materials. There are two kinds of batching that the Unity engine is capable of: static batching and dynamic batching.
Static batching is very straightforward to achieve and typically makes sense to use. As long as all static objects in the scene are marked as static in the inspector, all draw calls associated with the mesh renderer components of those objects will be batched (assuming they share the same materials and textures). It's always best practice to mark objects that will remain static as such in the inspector so the engine can skip unnecessary work for them in its various internal systems, and this is especially true for batching. Keep in mind that for Windows Mixed Reality mainstream, instanced stereoscopic rendering is not implemented yet, so any saved draw calls count two-fold.
Dynamic batching has a little bit more nuance. The only difference in requirements between static and dynamic batching is that the vertex attribute count of dynamic objects must be considered and stay below a certain threshold. Be sure to check the Unity documentation for what that threshold is for your version of Unity. Verify what is actually happening behind the scenes by taking a frame capture in the Intel GPA Graphics Frame Analyzer. See Figures 10 and 11 below for the differences in the frame visualization between a frame of Space Pirate Trainer with batching disabled and enabled.
Figure 10. Batching and instancing disabled; 1,300 draw calls; 1.5 million vertices total; GPU duration 3.5 ms/frame.
Figure 11. Batching and instancing enabled; 8 draw calls; 2 million vertices total; GPU duration 1.7 ms/frame (2x performance improvement).
As shown above, the number of draw calls required to render 1,300 ships (1.5 million vertices total) went from 1,300 all the way down to 8. In the batched example, we actually rendered even more ships (2 million vertices total) to drive the point home. Not only does this save a huge amount of time on the render thread, it also saves quite a bit of time on the GPU by running through the graphics pipeline more efficiently; we get a 2x performance improvement by doing so. To maximize the number of calls batched, we can also leverage a technique called Texture Atlasing.
A Texture Atlas is essentially a collection of textures and sprites used by different objects packed into a single big texture. To utilize the technique, texture coordinates need to be updated to conform to the change. It may sound complicated, but the Unity engine has utilities to make it easy and automated. Artists can also use their modeling tool of choice to build out atlases in a way they're familiar with. Recalling the batching requirement of shared textures between models, Texture Atlases can be a powerful tool to save you from unnecessary work at runtime, helping to get you rendering at less than 16.6 ms/frame.
Key takeaways:
- Make sure all objects that will never move over their lifetime are marked static.
- Make sure dynamic objects you want to batch have fewer vertex attributes than the threshold specified in the Unity docs.
- Make sure to create texture atlases to include as many batchable objects as possible.
- Verify actual behavior with Intel GPA Graphics Frame Analyzer.
Shaders—LOD system for droid lasers
For those unfamiliar with the term, LOD (Level of Detail) systems refer to the idea of swapping various asset types dynamically, based on certain parameters. In this section, we will cover the process of swapping out various materials depending on distance from the camera. The idea being, the further away something is, the fewer resources you should need to achieve optimal aesthetics for lower pixel coverage. Swapping assets in and out shouldn't be apparent to the player. For Space Pirate Trainer, Van Welden created a system to swap out the Unity standard shader used for the droid lasers for a simpler shader that approximates the desired look when the laser is a certain distance from the camera. See the sample code below:
    using System.Collections;
    using UnityEngine;

    public class MaterialLOD : MonoBehaviour
    {
        public Transform cameraTransform = null;                // camera transform
        public Material highLODMaterial = null;                 // complex material
        public Material lowLODMaterial = null;                  // simple material
        public float materialSwapDistanceThreshold = 30.0f;     // swap to low LOD when 30 units away
        public float materialLODCheckFrequencySeconds = 0.1f;   // check every 100 milliseconds

        private WaitForSeconds lodCheckTime;
        private MeshRenderer objMeshRenderer = null;
        private Transform objTransform = null;

        // Kick off the periodic LOD check (the original sample left this to the imagination).
        void Start()
        {
            StartCoroutine(Co_Update());
        }

        IEnumerator Co_Update()
        {
            objMeshRenderer = GetComponent<MeshRenderer>();
            objTransform = GetComponent<Transform>();
            lodCheckTime = new WaitForSeconds(materialLODCheckFrequencySeconds);
            while (true)
            {
                if (Vector3.Distance(cameraTransform.position, objTransform.position) > materialSwapDistanceThreshold)
                {
                    objMeshRenderer.material = lowLODMaterial;   // swap material to simple
                }
                else
                {
                    objMeshRenderer.material = highLODMaterial;  // swap material to complex
                }
                yield return lodCheckTime;
            }
        }
    }
Sample code for the material LOD swap
This is a very simple update loop that will check the distance of the object being considered for material swapping every 100 ms, and switch out the material if it's over 30 units away. Keep in mind that swapping materials could potentially break batching, so it's always worth experimenting to see how optimizations affect your frame times on various hardware levels.
On top of this manual material LOD system, the Unity engine also has a model LOD system built into the editor (access the documentation here). We always recommend forcing the lowest LOD for as many objects as possible on lower-watt parts. For some key pieces of the scene where high fidelity can make all the difference, it's ok to compromise for more computationally expensive materials and geometry. For instance, in Space Pirate Trainer, Van Welden decided to spare no expense to render the blasters, as they are always a focal point in the scene. These trade-offs are what help the game maintain the look needed, while still maximizing target hardware—and enticing potential VR players.
Lighting and post effects—remove dynamic lights
As previously mentioned, real-time lights can heavily affect GPU performance while the engine uses the forward rendering path. This performance impact manifests as additional passes for models affected by the primary directional light as well as by all lights marked as Important in the inspector (up to the Pixel Light Count setting in Quality Settings). If you have a model standing between two important dynamic point lights and the primary directional light, you're looking at at least three passes for that object.
Figure 12. Dynamic lights at the base of each weapon are lit when it fires in Space Pirate Trainer*, doubling the number of dynamic lights and contributing 5 ms of frame time for the floor (highlighted).
In Space Pirate Trainer, the point-lights parented to the barrel of the gun were disabled in low settings to avoid these extra passes, saving quite a bit of frame time. Recalling the section about floor rendering, imagine that the whole floor was sent through for rendering three times. Now consider having to do that for each eye; you'd get six total draws of geometry that cover about 60 percent of the pixels on the screen.
Key takeaways:
- Make sure that all dynamic lights are removed/marked unimportant.
- Bake as much light as possible.
- Use light probes for dynamic lighting.
Post-processing effects
Post-processing effects can take a huge cut of your frame budget if care isn't taken. The optimized "High" settings for Space Pirate Trainer utilize Unity's post-processing stack, but still only take around 2.6 ms/frame on our surface target. See the image below:
Figure 13. "High" settings showing 14 passes (reduced from much more); GPU duration of 2.6 ms/frame.
The highlighted section above shows all of the draw calls involved in post-processing effects for Space Pirate Trainer, and the pop-out window shown is the total GPU duration of those selected calls—around 2.6 ms. Initially, Van Welden and the team tested the mobile bloom to replace the typically used effect, but found that it caused distracting flickering. Ultimately, it was decided that bloom should be scrapped and the remaining stylizing effects could be merged into a single custom pass using color lookup tables to approximate the look of the high-end version.
Merging the passes brought the frame time down from the previously noted 2.6 ms/frame to 0.6 ms/frame (4x performance improvement). This optimization is a bit more involved and may require the expertise of a good technical artist for more stylized games, but it's a great trick to keep in your back pocket. Also, even though the mobile version of Bloom* didn't work for Space Pirate Trainer, testing mobile VFX solutions is a great, quick-and-easy experiment to test first. For certain scene setups, they may just work and are much more performant. Check out the frame capture representing the scene on "low" settings with the new post-processing effect pass implemented:
Figure 14. "Low" settings, consolidating all post-processing effects into one pass; GPU duration of 0.6 ms/frame.
HDR usage and vertical flip
Avoiding the use of high-dynamic-range (HDR) textures on your "low" tier can benefit performance in numerous ways, the main one being that HDR textures and the techniques that require them (such as tone mapping and bloom) are relatively expensive. There is additional calculation for color finalization and more memory required per pixel to represent the full color range. On top of this, the use of HDR textures in Unity causes the scene to be rendered upside down. Typically, this isn't an issue, as the final render-target flip only takes around 0.3 ms/frame; but when your budget is down to 16.6 ms/frame to render at 60 fps, and you need to do the flip once for each eye (~0.6 ms/frame total), it accounts for quite a significant chunk of your frame.
Figure 15. Single-eye vertical flip, 0.291 ms.
Key takeaways:
- Uncheck HDR boxes on scene camera.
- If post-production effects are a must, use the Unity engine's Post-Processing Stack and not disparate Image Effects that may do redundant work.
- Remove any effect requiring a depth pass (fog, etc.).
Post processing—anti-aliasing
Multisample Anti-Aliasing (MSAA) is expensive. For low settings, it's wise to switch to a temporally stable post-production effect anti-aliasing solution. To get a feel for how expensive MSAA can be on our low-end target, let's look at a capture of Space Pirate Trainer on high settings:
Figure 16. ResolveSubresource cost while using Multisample Anti-Aliasing (MSAA).
The ResolveSubresource API call is the fixed-function step of MSAA that determines the final pixels for render targets with MSAA enabled. We can see above that this step alone requires about 1 ms/frame, in addition to the extra per-draw work, which is harder to quantify.
Alternatively, there are several cheaper post-production effect anti-aliasing solutions available, including one Intel has developed called Temporally Stable Conservative Morphological Anti-Aliasing (TSCMAA). TSCMAA is one of the fastest anti-aliasing solutions available to run on Intel® integrated graphics. If rendering at a resolution less than 1280x1280 before upscaling to native head mounted display (HMD) resolution, post-production effect anti-aliasing solutions become increasingly important to avoid jaggies and maintain a good experience.
Figure 17. Temporally Stable Conservative Morphological Anti-Aliasing (TSCMAA) provides an up to 1.5x performance improvement over 4x Multisample Anti-Aliasing (MSAA) with a higher quality output. Notice the aliasing (stair stepping) differences in the edges of the model.
Raycasting CPU-side improvements for lasers
Raycasting operations in general are not super expensive, but when you've got as much action as there is in Space Pirate Trainer, they can quickly become a resource hog. If you're wondering why we were worried about CPU performance when most VR games are GPU-bottlenecked, it's because of thermal throttling. What this means is that any work across a System on Chip (SoC) generates heat across the entire system package. So even though the CPU is not technically the bottleneck, the heat generated by CPU work can contribute enough heat to the package that the GPU frequency, or even its own CPU frequency, can be throttled and cause the bottleneck to shift depending on what's throttled and when.
Heat generation adds a layer of complexity to the optimization process; mobile developers are quite familiar with this concept. Going down the rabbit hole of finding the perfect standardized optimization method for CPUs with integrated GPUs can become a distraction, but it doesn't have to. Just think of holistic optimization as the main goal: using general good practices on both the CPU and GPU will go a long way. Now that my slight tangent is over, let's get back to the raycasting optimization itself.
The idea of this optimization is that raycast checking frequency can fluctuate based on distance. The farther away the raycast, the more frames you can skip between checks. In his testing, Van Welden found that in the worst case, the actual raycast check and response of far-away objects only varied by a few frames, which is almost undetectable at the frame rate required for VR rendering.
    private int raycastSkipCounter = 0;
    private int raycastDynamicSkipAmount;
    private int distanceSkipUnit = 5;

    // playerOriginPos, moveVector, bulletSpeed, laserBulletLayerMask, rh (RaycastHit),
    // checkRaycastHit, and Collision() are members of the surrounding class.
    public bool CheckRaycast()
    {
        checkRaycastHit = false;
        raycastSkipCounter++;

        // The farther the laser is from the player, the more frames we skip between raycast checks.
        raycastDynamicSkipAmount = (int)(Vector3.Distance(playerOriginPos, transform.position) / distanceSkipUnit);

        if (raycastSkipCounter >= raycastDynamicSkipAmount)
        {
            // Scale the ray length by the number of skipped frames, but never skip more than 10 frames.
            if (Physics.Raycast(transform.position, moveVector.normalized, out rh,
                transform.localScale.y + moveVector.magnitude * bulletSpeed * Time.deltaTime * Mathf.Clamp(raycastDynamicSkipAmount, 1, 10),
                laserBulletLayerMask))
            {
                checkRaycastHit = true;
                Collision(rh.collider, rh.point, rh.normal, true);
            }
            raycastSkipCounter = 0;
        }
        return checkRaycastHit;
    }
Sample code showing how to do your own raycasting optimization
Render at Lower Resolution and Upscale
Most Windows Mixed Reality headsets have a native resolution of 1.4k, or greater, per eye. Rendering to a target at this resolution can be very expensive, depending on many factors. To target lower-watt integrated graphics components, it's very beneficial to set your render target to a reasonably lower resolution, and then have the holographic API automatically scale it up to fit the native resolution at the end. This dramatically reduces your frame time, while still looking good. For instance, Space Pirate Trainer renders to a target with 1024x1024 resolution for each eye, and then upscales.
Figure 18. The render target resolution is specified at 1024x1024, while the upscale is to 1280x1280.
There are a few factors to consider when lowering your resolution. Obviously, all games are different and lowering resolution significantly can affect different scenes in different ways. For instance, games with a lot of fine text might not be able to go to such a low resolution, or a different trick must be used to maintain text fidelity. This is sometimes achieved by rendering UI text to a full-size render target and then blitting it on top of the lower resolution render target. This technique will save much compute time when rendering scene geometry, but not let overall experience quality suffer.
Another factor to consider is aliasing. The lower the resolution of the render target you render to, the more potential for aliasing you have. As mentioned before, some quality loss can be recouped using a post-production effect anti-aliasing technique. The pixel invocation savings from rendering your scene at a lower resolution usually come out net positive after the cost of anti-aliasing is considered.
    #define MAINSTREAM_VIEWPORT_HEIGHT_MAX 1400

    void App::TryAdjustRenderTargetScaling()
    {
        HolographicDisplay^ defaultHolographicDisplay = HolographicDisplay::GetDefault();
        if (defaultHolographicDisplay == nullptr)
        {
            return;
        }

        Windows::Foundation::Size originalDisplayViewportSize = defaultHolographicDisplay->MaxViewportSize;

        if (originalDisplayViewportSize.Height < MAINSTREAM_VIEWPORT_HEIGHT_MAX)
        {
            // We are on a 'mainstream' (low-end) device, so request a lower render-target resolution.
            float target = 1024.0f / originalDisplayViewportSize.Height;
            Windows::ApplicationModel::Core::CoreApplication::Properties->Insert(
                "Windows.Graphics.Holographic.RenderTargetSizeScaleFactorRequest", target);
        }
    }
Sample code for adjusting render-target scaling
Render VR hands first and other sorting considerations
In most VR experiences, some form of hand replacement is rendered to represent the position of the player's actual hands. In the case of Space Pirate Trainer, not only were the hand replacements rendered, but also the player's blasters. It's not hard to imagine these things covering a large number of pixels across both eye render targets. Graphics hardware has an optimization called early-z rejection, which allows the hardware to compare the depth of a pixel being rendered with the value already in the depth buffer. If the incoming pixel is farther away, it doesn't need to be written, and the invocation cost of the pixel shader and all subsequent stages of the graphics pipeline is saved. Graphics rendering works like the reverse of the painter's algorithm: painters typically paint from back to front, while you can get tremendous performance benefits by rendering your scene from front to back because of this optimization.
Figure 19. Drawing the blasters in Space Pirate Trainer* at the beginning of the frame saves pixel invocations for all pixels covered by them.
It's hard to imagine a scenario where the VR hands, and the props those hands hold, will not be the closest mesh to the camera. Because of this, we can make an informed decision to force the hands to draw first. This is very easy to do in Unity; all you need to do is find the materials associated with the hand meshes, along with the props that can be picked up, and override their RenderQueue property. We can guarantee that they will be rendered before all opaque objects by using the RenderQueue enum available in the UnityEngine.Rendering namespace. See the figures below for an example.
    namespace UnityEngine.Rendering
    {
        public enum RenderQueue
        {
            Background = 1000,
            Geometry = 2000,
            AlphaTest = 2450,
            GeometryLast = 2500,
            Transparent = 3000,
            Overlay = 4000
        }
    }
RenderQueue enumeration in the UnityEngine.Rendering namespace
    using UnityEngine;
    using UnityEngine.Rendering;

    public class RenderQueueUpdate : MonoBehaviour
    {
        public Material myVRHandsMaterial;

        void Start()
        {
            // Guarantee that VR hands using this material will be rendered before all other opaque geometry.
            myVRHandsMaterial.renderQueue = (int)RenderQueue.Geometry - 1;
        }
    }
Sample code for overriding a material's RenderQueue parameter
Overriding the RenderQueue order of the material can be taken further, if necessary, as there is a logical grouping of items dressing a scene at any given moment. The scene can be categorized (see figure below) and ordered as such:
- Draw VR hands and any interactables (weapons, etc.).
- Draw scene dressings.
- Draw large set pieces (buildings, etc.).
- Draw the floor.
- Draw the skybox (usually already done last if using built-in Unity skybox).
Figure 20. Categorizing the scene can help when overriding the RenderQueue order.
The Unity engine's sorting system usually takes care of this very well, but you sometimes find objects that don't follow the rules. As always, check your scene's frame in GPA first to make sure everything is being ordered properly before applying these methods.
Skybox compression
This last one is an easy fix with some potential advantages. If the skybox textures used in your scene aren't already compressed, a huge gain can be found. Depending on the type of game, the sky can cover a large number of pixels in every frame; making the sampling as light as possible for that pixel shader can have a good impact on your frame rate. Additionally, it may also help to lower the resolution of the skybox textures when your game detects it's running on a mainstream system. See the performance comparison for Space Pirate Trainer shown below:
Figure 21. A 5x gain in performance can be achieved from simply lowering skybox resolution from 4k to 2k. Additional improvements can be made by compressing the textures.
Conclusion
By the end, we had Space Pirate Trainer running at 60 fps on the "low" setting on a 13-watt integrated graphics part. Van Welden subsequently fed many of the optimizations back into the original build for higher-end platforms so that everybody could benefit, even on the high end.
Figure 22. Final results: From 12 fps all the way up to 60 fps.
The "high" setting, which previously ran at 12 fps, now runs at 35 fps on the integrated graphics system. Lowering the barrier to VR entry to a 13-watt laptop can put your game in the hands of many more players, and help you get more sales as a result. Download Intel GPA today and start applying these optimizations to your game.
Resources
Action Classification Using PyTorch*
Abstract
For real-world video classification use cases, it is imperative to capture spatiotemporal features. In such cases, the interwoven patterns in an optical flow are expected to hold higher significance. By contrast, most implementations learn individual image representations disjoint from the previous frames in the video. In an attempt to explore more appropriate methods, this case study revolves around video classification that sends an alert whenever violence is detected. PyTorch*1, running on an Intel® Xeon® Scalable processor, is used as the deep learning framework for better and faster training and inferencing.
Introduction
In typical contemporary scenarios, we frequently observe sudden outbursts of physical altercations such as road rage or a prison upheaval. To address such scenarios, closed-circuit television (CCTV) cameras have been deployed to provide extensive virtual coverage of public places. In the case of any untoward incident, it is common to analyze the footage made available through video surveillance and start an investigation. An intervention by security officials while the violence is taking place could prevent loss of precious lives and minimize destruction of public property. One obvious approach to implementing this is to position human forces for continuous manual monitoring of CCTV cameras. This can be burdensome and error-prone due to the tedious nature of the job and human limitations. A more effective method could be automatic detection of violence in CCTV footage that triggers alerts to security officials, thus reducing the risk of manual errors. While most appealing to the defense and security industries, this solution can also be relevant to authorities responsible for public property.
In this experiment, we implemented the proposed solution using 3D convolutional neural networks (CNN) with ResNet-342 as the base topology. The experiments were performed using transfer learning on pretrained 3D residual networks (ResNet) initialized with weights of the Kinetics* human action video dataset.
The dataset for training was taken from Google's atomic visual action (AVA) dataset. This is a binary classification between a fighting and a non-fighting class (explained further in the Dataset Preparation section). Each class contains an approximately equal number of instances. The videos for the non-fighting class comprise instances from the eat, sleep, and drive classes made available with the AVA dataset3.
Hardware Details
The configuration of the Intel Xeon Scalable processor is as follows:
| Name | Description |
|---|---|
| Intel® architecture | x86_64 |
| CPU op-mode(s) | 32-bit, 64-bit |
| Byte order | Little Endian |
| CPU(s) | 24 |
| On-line CPU(s) list | 0–23 |
| Thread(s) per core | 2 |
| Core(s) per socket | 6 |
| Socket(s) | 2 |
| Non-uniform memory access (NUMA) node(s) | 2 |
| Vendor ID | GenuineIntel |
| CPU family | 6 |
| Model | 85 |
| Model name | Intel® Xeon® Gold 6128 processor @ 3.40 GHz |
| Stepping | 4 |
| CPU MHz | 1199.960 |
| BogoMIPS | 6800.00 |
| Virtualization type | VT-x |
| L1d cache | 32K |
| L1i cache | 32K |
| L2 cache | 1024K |
| L3 cache | 19712K |
| NUMA node0 CPU(s) | 0–5, 12–17 |
| NUMA node1 CPU(s) | 6–11, 18–23 |
Table 1. Intel® Xeon® Gold processor configuration.
Software Configuration
Prerequisite dependencies to proceed with the development of this use case are outlined below:
| Library | Version |
|---|---|
| PyTorch* | 0.3.1 |
| Python* | 3.6 |
| Operating System | CentOS* 7.3.1 |
| OpenCV | 3.3.1 |
| youtube-dl | 2018.01.21 |
| ffmpeg | 3.4 |
| torchvision | 0.2.0 |
Table 2. Software configuration.
Solution Design
In addition to being time consuming, training a CNN from scratch requires millions of data points. In this context, with only 545 video clips in the fighting class and 450 in the non-fighting class, training a network from scratch could result in an over-fitted network. Therefore, we opted for transfer learning, which minimizes the training time and helps attain better inference. The experiment uses a pretrained 3D CNN ResNet, initialized with the weights of the Kinetics video action dataset. The network is fine-tuned by training the final layers with the acquired AVA training dataset customized to the fight classification. This fine-tuned model is later used for inference.
Image-based features extracted using 2D convolutions are not directly suitable for deep learning on video-based classification; learning and preserving spatiotemporal features is vital here. One alternative for capturing this information is a 3D ConvNet4. In 2D ConvNets, convolution and pooling operations are performed spatially, whereas in 3D ConvNets they are performed spatiotemporally. The difference in how multiple frames are treated as input is shown in the figures below:
Figure 1. 2D convolution on multiple frames4.
Figure 2. 3D convolution4.
As shown, 2D convolutions applied to multiple images (treating them as different channels) result in an image (Figure 1). Even though the input is three dimensional (W, H, L, where L is the number of input channels), the output shape is a 2D matrix. Here, convolutions are calculated across two directions and the filter depth matches the number of input channels. Consequently, the temporal information of the input signal is lost after every convolution.

Input shape = [W, H, L], filter = [k, k, L], output = 2D

On the other hand, 3D convolution preserves the temporal information of the input signal and results in an output volume (Figure 2). The same phenomenon applies to 2D and 3D pooling operations as well. Here, the convolutions are calculated across three directions, giving an output shape that is a 3D volume.

Input shape = [W, H, L], filter = [k, k, d], output = 3D

Note: d < L
3D ConvNet models temporal information better because of its 3D convolution and 3D pooling operations.
In our case, video clips are referred to with a size of c × l × h × w, where c is the number of channels, l is the length in number of frames, and h and w are the height and width of the frame, respectively. We also refer to the 3D convolution and pooling kernel size as d × k × k, where d is the kernel temporal depth and k is the kernel spatial size.
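To make the difference concrete, the short PyTorch snippet below shows how a 3D convolution keeps the temporal dimension of a c × l × h × w clip, while stacking the frames into channels and using a 2D convolution collapses it. The layer sizes are illustrative and do not correspond to a specific layer of the network used here.

```python
import torch
import torch.nn as nn

# A batch of one clip: c=3 channels, l=16 frames, h=w=112 pixels.
clip = torch.randn(1, 3, 16, 112, 112)

# 3D convolution: a d x k x k kernel slides over time as well as space,
# so the output keeps a temporal dimension.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 7, 7),
                   stride=(1, 2, 2), padding=(1, 3, 3))
print(conv3d(clip).shape)       # torch.Size([1, 64, 16, 56, 56])

# 2D convolution on the same frames stacked as channels (3 * 16 = 48):
# the temporal axis is folded into the channel axis and is lost.
conv2d = nn.Conv2d(in_channels=48, out_channels=64, kernel_size=7,
                   stride=2, padding=3)
stacked = clip.view(1, 48, 112, 112)
print(conv2d(stacked).shape)    # torch.Size([1, 64, 56, 56])
```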
Dataset Preparation
The dataset for training is acquired from the Google AVA dataset3. The original AVA dataset contains 452 videos split into 242 for training, 66 for validation, and 144 for test. Each video has 15 minutes annotated in one-second intervals, resulting in 900 annotated segments per video. These annotations are specified in two CSV files, ava_train_v2.0.csv and ava_val_v2.0.csv, which have the following columns:
- video_id: YouTube* identifier.
- middle_frame_timestamp: in seconds from the start of the YouTube video.
- person_box: top-left (x1, y1) and bottom-right (x2, y2) normalized with respect to frame size, where (0.0, 0.0) corresponds to the top-left, and (1.0, 1.0) corresponds to the bottom-right.
- action_id: identifier of an action class.
Among the 80 action classes available, only the fighting class (450 samples) is considered for positive samples in the current use case, and an aggregate of 450 samples (150 per class) taken from the eat, sleep, and drive classes forms the non-fighting class. The YouTube videos are downloaded using the command-line utility youtube-dl.
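A minimal sketch of this selection step is shown below. It assumes the seven-column v2.0 CSV layout listed above; the action-id values are placeholders and must be replaced with the real ids of the fight, eat, sleep, and drive classes from the AVA label map.

```python
import csv
from collections import defaultdict

# Placeholder ids; look up the real integer ids in the AVA label map.
FIGHT_IDS = {64}
NON_FIGHT_IDS = {11, 22, 33}   # eat, sleep, drive

segments = defaultdict(set)    # (video_id, timestamp) -> labels
with open("ava_train_v2.0.csv") as f:
    for video_id, timestamp, x1, y1, x2, y2, action_id in csv.reader(f):
        action_id = int(action_id)
        if action_id in FIGHT_IDS:
            segments[(video_id, timestamp)].add("fighting")
        elif action_id in NON_FIGHT_IDS:
            segments[(video_id, timestamp)].add("non_fighting")

print(len(segments), "annotated segments selected")
```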
Each clip is four seconds long and has approximately 25 frames per second. The frames for each clip are extracted into a separate folder with the folder name as the name of the video clip. These are extracted using the ava extraction script.
Data Conversion
The ffmpeg library is used to convert the available AVA video clips to frames. The frames are then converted to FloatTensor type using the Tensor class provided with PyTorch. This conversion results in efficient memory management, as the tensor operations in this class do not make memory copies; the methods either transform the existing tensor or return a new tensor referencing the same storage.
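As a rough sketch of this step, the helper below reads the extracted frames of one clip with OpenCV and packs them into a float tensor with the [c, l, h, w] layout used by the 3D CNN. It is illustrative only; the referenced repository ships its own loader and normalization.

```python
import os
import cv2
import numpy as np
import torch

def load_clip_as_tensor(frame_dir, size=112):
    """Read a folder of extracted .jpg frames (one folder per clip, as
    produced by the extraction step above) into a [c, l, h, w] tensor."""
    frames = []
    for name in sorted(os.listdir(frame_dir)):
        img = cv2.imread(os.path.join(frame_dir, name))        # BGR, uint8
        img = cv2.cvtColor(cv2.resize(img, (size, size)), cv2.COLOR_BGR2RGB)
        frames.append(img)
    clip = np.stack(frames).astype(np.float32) / 255.0          # [l, h, w, c]
    return torch.from_numpy(clip).permute(3, 0, 1, 2)            # [c, l, h, w]
```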
Network Topology
3D CNN ResNet
The architecture followed for the current use case is ResNet based with 3D convolutions. A basic ResNet block consists of two convolutional layers and each convolutional layer is followed by batch normalization and a rectified linear unit (ReLU). A shortcut pass5 connects the top of the block to the layer just before the last ReLU in the block. For our experiments, we use the relatively shallow ResNet-34 that adopts the basic blocks.
Figure 3. Architecture of the 3D CNN ResNet-34.
When the dimensions increase (dotted line shortcuts in the given figure), the following two options are considered:
- Shortcut performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter.
- The projection shortcut is used to match dimensions (done by 1×1 convolutions).
For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
We have used Type A shortcuts with the ResNet-34 basic block to avoid increasing the number of parameters of the relatively shallow network.
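A sketch of such a basic block with a Type A shortcut is shown below. It is written against a current PyTorch API for readability and simplifies the real implementation in the repository referenced later; layer sizes and names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock3D(nn.Module):
    """A 3D ResNet basic block: two 3x3x3 convolutions, each followed by
    batch normalization, with ReLU activations and a Type A shortcut
    (identity plus zero padding, adding no learnable parameters)."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(out_channels)
        self.conv2 = nn.Conv3d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(out_channels)
        self.stride = stride
        self.out_channels = out_channels

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))

        shortcut = x
        if self.stride != 1 or x.size(1) != self.out_channels:
            # Type A: downsample spatially/temporally with average pooling,
            # then zero-pad the extra output channels.
            shortcut = F.avg_pool3d(x, kernel_size=1, stride=self.stride)
            shortcut = F.pad(shortcut,
                             (0, 0, 0, 0, 0, 0, 0, self.out_channels - shortcut.size(1)))
        return F.relu(out + shortcut)
```

When the feature-map size increases across a stage boundary (stride 2), the shortcut is downsampled and zero padded, matching option A described above.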
The 3D convolutions are used to directly extract spatiotemporal features from raw videos. A two-channel approach, using a combination of the RGB color space and optical flow as inputs to the 3D CNNs, was used on the Kinetics dataset to derive the pretrained network. As pretraining on large-scale datasets is an effective way to achieve good performance on small datasets, we expect the 3D ResNet-34 pretrained on Kinetics to perform well for this use case.
Training
The training is performed using the Intel Xeon Scalable processor. The pretrained weights used for this experiment can be downloaded from GitHub*. This model is trained on the Kinetics Video dataset.
A brief description of the pretrained model is provided below:
resnet-34-kinetics-cpu.pth: --model resnet --model_depth 34 --resnet_shortcut A
The solution is based on the 3D-Resnets-PyTorch implementation by Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh.
The number of frames per clip is written to the n_frames files generated using utils/n_frames_kinetics.py. After this, an annotation file is generated in JavaScript* Object Notation (JSON) format using utils/kinetics_json.py. The opts.py file contains the train and test dataset paths and the default values for the fine-tuning parameters, which can be changed to suit the use case. Fine-tuning is done on the conv5_x and fc layers of the pretrained model, as sketched below. The checkpoints are saved as .pth files every 10th epoch. The system was trained for 850 epochs, and the loss converged to approximately 0.22.
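A minimal sketch of this fine-tuning setup is shown below. It assumes the 3D-ResNets-PyTorch repository referenced above is on the Python path (its models/resnet.py builds the network, and layer4 corresponds to conv5_x); the optimizer settings are illustrative, and the defaults actually used live in opts.py.

```python
import torch
import torch.nn as nn
from models import resnet   # from the 3D-ResNets-PyTorch repository

# Build the 3D ResNet-34 with Type A shortcuts and load the Kinetics weights.
model = resnet.resnet34(num_classes=400, shortcut_type='A',
                        sample_size=112, sample_duration=16)
checkpoint = torch.load('resnet-34-kinetics-cpu.pth', map_location='cpu')
state = {k.replace('module.', ''): v for k, v in checkpoint['state_dict'].items()}
model.load_state_dict(state)

# Replace the 400-way Kinetics classifier with a 2-way fight / non-fight head.
model.fc = nn.Linear(model.fc.in_features, 2)

# Freeze everything except conv5_x (layer4 here) and the new classifier.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(('layer4', 'fc'))

optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```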
Inference
The trained model is used for inference on YouTube videos downloaded from the AVA test dataset. The video clips are broken down into frames, which are converted to Torch* tensors and passed to the classifier. The frames obtained per video clip are divided into segments, and a classification score is obtained for each segment. The classification results are written onto the video frames and stitched back into a video. Inferencing is done using the code in this GitHub link.
Given that input videos are located in ./videos, the following command is used for inference:
python main.py --input ./input --video_root ./videos --output ./output.json --model ./resnet-34-kinetics.pth --mode score
Results
The following GIF is extracted from the video results obtained by passing a video clip to the trained PyTorch model.
Figure 4. Inferred GIF.
Conclusion and Future Work
The results were obtained with a high level of accuracy. AVA contains samples from movies that are at least a decade old; hence, to test the efficacy of the trained model, inferencing was done on external videos (CCTV footage, recently captured fight sequences, and so on). The results showed that neither the quality of the video nor the period during which it was captured influenced the accuracy. As future work, we could enhance the classification with detection, and the same experiment could be carried out using recurrent neural network techniques.
About the Authors
Astha Sharma and Sandhiya S are Technical Solution Engineers working with the Intel® AI Academia Program.