
Hands-on with the OpenVINO™ Inference Engine Python* API


Introduction

The OpenVINO™ toolkit 2018 R1.2 Release introduces several new preview features for Python* developers:
  • Inference Engine Python API support.
  • Samples to demonstrate Python API usage.
  • Samples to demonstrate Python* API interoperability between AWS Greengrass* and the Inference Engine.
This paper presents a quick hands-on tour of the Inference Engine Python API, using an image classification sample that is included in the OpenVINO™ toolkit 2018 R1.2. This sample uses a public SqueezeNet* model that contains around one thousand object classification labels.
(Important: As stated in the Overview of Inference Engine Python* API documentation, “this is a preview version of the Inference Engine Python* API for evaluation purpose only. Module structure and API itself will be changed in future releases.”)

Prerequisites

Ensure your development system meets the minimum requirements outlined in the Release Notes. The system used in the preparation of this paper was equipped with an Intel® Core™ i7 processor with Intel® Iris® Pro graphics.
The Inference Engine Python API is supported on Ubuntu* 16.04 and Microsoft Windows® 10 64-bit OSes. The hands-on steps provided in this paper are based on development systems running Ubuntu 16.04.

Install the OpenVINO™ Toolkit

If you already have OpenVINO™ toolkit 2018 R1.2 installed on your computer you can skip this section. Otherwise, you can get the free download here: https://software.intel.com/en-us/openvino-toolkit/choose-download/free-download-linux
Next, install the OpenVINO™ toolkit as described here: https://software.intel.com/en-us/articles/OpenVINO-Install-Linux

Optional Installation Steps

After completing the toolkit installation steps, rebooting, and adding symbolic links as directed in the installation procedure, you can optionally add a command to your .bashrc file to permanently set the environment variables required to compile and run OpenVINO™ toolkit applications.

  1. Open .bashrc:
    cd ~
    
    sudo nano .bashrc
  2. Add the following command at the bottom of .bashrc:
    source /opt/intel/computer_vision_sdk_2018.1.265/bin/setupvars.sh
  3. Save and close the file by typing CTRL+X, Y, and then ENTER.

The installation procedure also recommends adding libOpenCL.so.1 to the library search path. One way to do this is to add an export command to the setupvars.sh script:

  1. Open setupvars.sh:
    cd /opt/intel/computer_vision_sdk_2018.1.265/bin/
    
    sudo nano setupvars.sh
  2. Add the following command at the bottom of setupvars.sh (entries in LD_LIBRARY_PATH are directories, so add the directory that contains libOpenCL.so.1):
    export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
  3. Save and close the file by typing CTRL+X, Y, and then ENTER.
  4. Reboot the system:
    reboot

    Run the Python* Classification Sample

    Before proceeding with the Python classification sample, run the demo_squeezenet_download_convert_run.sh sample script from the demo folder as shown below:

    cd /opt/intel/computer_vision_sdk/deployment_tools/demo
    
    sudo ./demo_squeezenet_download_convert_run.sh

    (Important: You must run the demo_squeezenet_download_convert_run.sh script at least once in order to complete the remaining steps in this paper, as it downloads and prepares the deep learning model used later in this section.)

    The demo_squeezenet_download_convert_run.sh script accomplishes several tasks:

    • Downloads a public SqueezeNet model, which is used later for the Python classification sample.
    • Installs the prerequisites to run Model Optimizer.
    • Builds the classification demo app.
    • Runs the classification demo app using the car.png picture from the demo folder. The classification demo app output should be similar to that shown in Figure 1.

    Figure 1. Classification demo app output

    If you did not modify your .bashrc file to permanently set the required environment variables (as indicated in the Optional Installation Steps section above), you may encounter problems running the demo. If this is the case, run setupvars.sh before running demo_squeezenet_download_convert_run.sh.
    The setupvars.sh script also detects the latest installed Python version and configures the required environment. To check this, type the following command:

    echo $PYTHONPATH

    Python 3.5.2 was installed on the system used in the preparation of this paper, so the path to the preview version of the Inference Engine Python API is:

    /opt/intel/computer_vision_sdk/deployment_tools/inference_engine/python_api/ubuntu_1604/python3

    Go to the Python samples directory and run the classification sample:

    cd /opt/intel/computer_vision_sdk/deployment_tools/inference_engine/samples/python_samples
    
    python3 classification_sample.py -m /opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/ir/squeezenet1.1/squeezenet1.1.xml -i /opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/car.png

    In the second command we are launching Python3 to run classification_sample.py, specifying the same model (-m) and image (-i) parameters that were used in the earlier demo (i.e., demo_squeezenet_download_convert_run.sh). (Troubleshooting: if you encounter the error "ImportError: No module named 'cv2'", run the command sudo pip3 install opencv-python to install the missing library.) The Python classification sample output should be similar to that shown in Figure 2.

    Figure 2. Python classification sample output

    Note that the numbers shown in the second column (#817, #511…) are index numbers into the labels file, which lists all of the objects recognizable by the deep learning model. The labels file is located at:

    /opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo/ir/squeezenet1.1/squeezenet1.1.labels
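    For readers who want to see what classification_sample.py does internally, the following is a minimal sketch of the preview Inference Engine Python API calls it is built around. This is an illustrative sketch only: the module name (inference_engine) and the methods shown (IENetwork.from_ir, IEPlugin.load, infer) reflect the preview API shipped with 2018 R1.2, which Intel states will change in future releases, so compare against the classification_sample.py source in your installation before relying on the exact names.

    import cv2
    import numpy as np
    from inference_engine import IENetwork, IEPlugin  # preview API module path (assumption; verify in your install)

    base = "/opt/intel/computer_vision_sdk_2018.1.265/deployment_tools/demo"
    model_xml = base + "/ir/squeezenet1.1/squeezenet1.1.xml"
    model_bin = base + "/ir/squeezenet1.1/squeezenet1.1.bin"

    # Load the Intermediate Representation produced by the Model Optimizer
    net = IENetwork.from_ir(model=model_xml, weights=model_bin)
    input_blob = next(iter(net.inputs))
    output_blob = next(iter(net.outputs))

    # Load the network onto a device plugin (CPU in this example)
    plugin = IEPlugin(device="CPU")
    exec_net = plugin.load(network=net)

    # Preprocess the demo image: SqueezeNet 1.1 expects a 1 x 3 x 227 x 227 input (shape assumed here)
    image = cv2.imread(base + "/car.png")
    image = cv2.resize(image, (227, 227)).transpose((2, 0, 1)).reshape(1, 3, 227, 227)

    # Run inference and print the indices of the top-10 classes (these map to lines in squeezenet1.1.labels)
    result = exec_net.infer(inputs={input_blob: image})
    probs = result[output_blob].flatten()
    print(np.argsort(probs)[-10:][::-1])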

    Conclusion

    This paper presented an overview of the Inference Engine Python API, which was introduced as a preview in the OpenVINO™ toolkit 2018 R1.2 release. Keep in mind that this preview version of the Inference Engine Python API is intended for evaluation purposes only.

    A hands-on demonstration of Python-based image classification was also presented in this paper, using the classification_sample.py example. This is only one of several Python samples contained in the OpenVINO™ toolkit, so be sure to check out the other Python features contained in this release of the toolkit.



    How Netrolix AI-WAN* Makes Wide-Area Networking More Secure


    Introduction – The Wide-Area Network's (WAN) Contribution to the Growing Attack Surface

    It is often said that complexity is the enemy of security. The reason for this is very simple.

    Attackers work by continuously probing attack surfaces in search of "seams," or weaknesses, in the IT infrastructure. These could be places where there is human interaction, where there is control hand-off and opportunity for error, where invisible infrastructure touches an open network, or where operational information inadvertently reveals potential exploits. It's simple math. Larger and more complex attack surfaces offer more opportunities for attackers to get inside.

    There was a time when minimizing the size and complexity of networks made good management and security sense. Then came the cloud, low-cost internet connectivity, mobility, and the Internet of Things (IoT); suddenly, networks have become anything but simple.

    For example, take an ordinary database application that at one time would have run securely inside a corporate data center. Today, pieces of that solution might be scattered around the globe. You might have data storage from one cloud service provider, an analytics engine from another, access governance from another, and business logic provided by another. You might even have widely distributed connected devices feeding data directly into the database.

    To further complicate this security picture, you also have WAN services that connect you to all your distributed assets and that connect the assets to each other. The combination of cloud services, mobility, and low-cost internet WAN has produced explosive growth in the size and complexity of networks and their attack surfaces. From the attackers' perspective, it's open season, and the conditions are perfect!

    Protecting digital assets in this environment involves applying traditional layered strategies to new network architectures, building security into applications, and using new tools to secure endpoints. There is another part of the infrastructure that does not get the security attention it deserves. That is WAN connectivity, which is increasingly becoming the fabric that holds together the entire distributed enterprise.

    As the one piece of infrastructure that touches all those distributed digital assets, the WAN is an important part of the attack surface, but it may also be the foundation for a more comprehensive approach to securing complex IT infrastructures. Yet people making decisions about WAN services typically focus on cost and performance. Security of the WAN itself is too often a secondary consideration.

    In an earlier article How Netrolix Broke the SD-WAN Barrier with AI-WAN*, we showed how Netrolix has developed a new approach to wide-area networking. It is an artificial intelligence (AI)-driven internet network, an AI-WAN, that has the cost and agility advantages of software-defined wide-area networking (SD-WAN), the quality of service guarantees of private line solutions, and security that is superior to both.

    In this article, we dig deeper into the Netrolix AI-WAN* security advantage. But before doing that, let's look closely at the security strengths and weaknesses of the two most common WAN solutions: Multiprotocol Label Switching (MPLS) private lines and SD-WAN.


    Private Infrastructure and MPLS – Private, but Secure?

    From a performance, reliability, and security perspective, private connections are typically considered the ideal WAN solution. MPLS has become one of the most widely used data-carrying methods in private infrastructure because of its ability to carry a variety of network protocols and access technologies, simplifying the configuration of multipurpose private connections.

    However, these connections are expensive and inflexible. They need to be configured by service providers, which sometimes takes months, and their cost is too high to support all the demand for wide-area networking. Many organizations reserve private connections for their most critical operations and use public infrastructure to fill less sensitive WAN needs.

    This decision is based in part on the assumption that private connections are more secure. But are they really?

    The idea is that because private lines are private, they provide no visibility to potential attackers, and therefore they are proof against outside attack. Being so invulnerable, they were never designed to natively support data-protection methods like encryption. Of course, if a service provider's core network were ever breached, all that unencrypted data would be exposed.

    But that's not the only way MPLS connections can be compromised. Private MPLS circuits often use a layer-2 connection from a local incumbent service provider to send unprotected data across many miles and multiple facilities until it is handed off to the MPLS provider. This is done because, in the case of a customer with widely distributed locations, no single provider has the extended geographical footprint to directly connect all the locations using just its own assets. For example, when service provider A delivers MPLS services in service provider B's territory, provider A purchases backhaul connectivity from provider B back to the agreed upon colocation point. This way of delivering private MPLS connections creates a very easy and well-known man-in-the-middle attack surface.

    The only way to protect against these kinds of MPLS attacks is to encrypt data, but alas, MPLS does not natively support encryption. To encrypt data passing through MPLS private connections, every single application, wherever it is located in the complex, highly distributed infrastructure, must encrypt its data. This can be a big management task with lots of room for error, especially in a complex network infrastructure.


    Public Infrastructure and SD-WAN – The Devil Is In the Details

    WAN connections over public infrastructure rely on the internet to transport data, and it's this internet connection that worries security managers. There's too much visibility and complexity to assure data security in the vastness of the global internet. That's why WAN solutions using public infrastructure often include data encryption.

    For example, SD-WAN is one form of internet WAN that many organizations are adopting to fill their growing need for more WAN connectivity. Although SD-WAN doesn't deliver the quality of service of a private line, it is easier to set up and costs much less. Many contend that SD-WAN is more secure than MPLS because it natively supports data encryption, which makes it easy to encrypt all data moving in the network, regardless of where the data is coming from.

    This sounds great, but as is often the case, it's not the whole story. SD-WAN has real and potential vulnerabilities that need to be considered, including:

    • SD-WAN's low cost and ease of deployment make it possible to expand your WAN quickly, which means that more assets can be moved into the cloud and new services can be easily added for partners and customers. All of this creates rapid growth in network size, complexity, and attack surface.
    • Not all data encryption provides the same level of protection. Some providers use less challenging encryption algorithms to save on computational cost. The types of encryption keys and re-keying practices can also affect the strength of encryption.
    • SD-WAN appliances used by many service providers contain known vulnerabilities, and they do not have adequate protection from administrative or physical access. Because of the way SD-WANs operate, compromising one SD-WAN appliance can provide access to an entire network. As SD-WANs grow, the risk from vulnerable appliances also grows. Since there is no private core network connectivity in most SD-WANs, the individual and unique peer-to-peer connections required to make them work offer no way of seeing or detecting abnormal behavior.


    Netrolix AI-WAN – More Secure Than SD-WAN and MPLS

    As detailed in an earlier article about Netrolix AI-WAN (How Netrolix Broke the SD-WAN Barrier with AI-WAN*), the Netrolix AI-WAN consists of the AI-WAN fabric, which is a vast network of ISPs and host data centers around the globe whose traffic is continuously analyzed and monitored by a proprietary deep-learning analytical engine.

    Netrolix accomplishes this by having hardware and software installed in 70 data centers globally and collecting internet data from 20,000 nodes. This enables continuous analysis of multiple performance factors on all the ISPs on the planet to determine optimal data paths to any endpoint and across the Netrolix AI-WAN core.

    To connect to this AI-WAN fabric, Netrolix has developed a suite of low-cost endpoint devices, which are software-defined gateways (SDGs) that run on either their own Intel® architecture-based bare-metal platforms, or appropriate client-owned equipment. The AI engine monitors the global internet while monitoring and communicating with every endpoint device connected to the AI-WAN fabric. All of Netrolix's services, including MPLS, Virtual Private LAN Service (VPLS), Ethernet private line, SD-WAN, global Virtual Private Network (VPN), cloud services, and other offerings are layered over the AI-WAN fabric.

    This enables Netrolix to deliver WAN performance that is on par or better than traditional private networks from global service providers, with guaranteed throughput at wire speeds, much lower cost, and all with the flexibility and ease of setup that an SD-WAN offers.

    There is another big advantage to Netrolix AI-WAN. It provides a level of WAN security that is unmatched by SD-WAN or MPLS services. Let's see why that is so.


    Netrolix AI-WAN Defense in Depth

    The Netrolix AI-WAN security posture addresses three aspects of network security:

    • Securing data on the network
    • Securing the AI-WAN fabric
    • Integrating with or augmenting existing enterprise security

    [Figure: Netrolix AI-WAN* infographic]

    Netrolix uses defense in depth to secure the AI-WAN fabric while integrating with existing enterprise and cloud defenses.

    Five factors that secure data on the AI-WAN network

    Netrolix applies a multifactor security strategy for protecting data on the network that includes the following elements:

    • Data encryption – All data passing through the Netrolix AI-WAN is automatically encrypted using IKEv2 with elliptic curve cryptography, one of the strongest encryption approaches currently in use.
    • Key management – The Netrolix AI-WAN uses a robust Key Management System (KMS) to generate encryption keys for every device, every element of the AI-WAN network, every storage instance, and every network configuration. Many SD-WAN providers use one encryption key across a network, and key swapping or re-keying is done manually. In that case, if a key is compromised in one location, the entire network is compromised. In the Netrolix AI-WAN, every network element has its own key, and every key in the global AI-WAN is automatically re-keyed every 30 minutes.
    • Hardware Security Module (HSM) authentication – In the Netrolix AI-WAN, every Netrolix SDG uses HSM authentication, which is the same hardware-based authentication used in credit and debit card chips. It prevents access to the encryption keys of any Netrolix SDG unless the device is connected over the AI-WAN to a Netrolix management console, which prohibits unauthorized access.
    • RADIUS attributes – These are used to authenticate any device that connects to the AI-WAN.
    • The AI analytics engine – The Netrolix AI-WAN uses a proprietary deep-learning analytical engine that does several things. It analyzes global internet traffic and optimizes end-to-end data paths from any device connected to the AI-WAN, across the AI-WAN core, to any endpoint (for details about this process, see How Netrolix Broke the SD-WAN Barrier with AI-WAN*). Every device connected to the AI-WAN gets data path re-optimization every five minutes.

    The analytics engine also performs another important security function. It continuously monitors every device connected to the AI-WAN and identifies anomalous data patterns. It monitors not only the AI-WAN fabric itself, but also data coming from or going to devices connected to the AI-WAN, such as IoT devices, industrial control systems, and autonomous devices such as drones or robots. The ability to detect unusual network activity related to specific devices like these is an important capability: when these kinds of devices are added to an environment, they enlarge the attack surface, yet many are built with little understanding of, or regard for, IT security.

    Securing the AI-WAN fabric

    In addition to protecting network data, several of the features described in the previous section also protect the AI-WAN fabric. For instance, by analyzing traffic associated with every device connected to the AI-WAN, the analytics engine is able to prevent someone from unplugging a device from the network and moving it to a new location. This change would immediately be detected and cause the device to be quarantined.

    RADIUS functionality and IPSec prevent unauthorized devices from connecting to the network, and HSM prevents the compromise of the encryption keys. Beyond that, however, there are additional architectural features that harden the Netrolix AI-WAN.

    For instance, it is not possible to locally manage a Netrolix SDG or gain visibility into a device by accessing the underlying operating system. Access can only happen through a management console, which is a containerized application that runs on a hypervisor in redundant, centralized locations. All management functions executed by this console happen over IPSec using HSM authentication. With no access to the underlying architecture and no direct access to the hypervisor, Netrolix's SDGs are highly resistant to unauthorized tampering.

    The Netrolix SDGs are also protected against physical tampering. They were designed to be rigid boxes with no moving parts and no easy way to open. If a Netrolix SDG is forced open, its data is wiped with no possibility of recovery.

    Total integration with enterprise layered security

    The third key part of the Netrolix AI-WAN security strategy is the way it is architected to integrate easily with an existing enterprise security stack. For instance, a Netrolix SDG can be configured as a simple network interface device (NID). If Netrolix AI-WAN users want to keep the Fortinet, Juniper, Cisco, or other network devices they have in place today, those devices can connect to the AI-WAN through a Netrolix SDG configured as an NID.

    But a Netrolix SDG can do more. It can work as a network access point plus a router, a switch, and a firewall, all in one solution. And it can be further configured with edge compute capabilities so that it combines network access, router, switch, firewall, and multi-access edge computing (MEC) in one solution.

    Netrolix makes it very easy to configure their SDGs through the management console. When a new Netrolix SDG is plugged into your internet service, the AI-WAN immediately discovers it, optimizes it, begins encryption and key management, and enables its functions to be configured through the highly secure management console. This makes integration with existing security stacks an easy process.


    How Intel Enables Netrolix AI-WAN Security

    Netrolix considered several factors when choosing technology from Intel for the bare-metal platform that is the basis of all their SDGs. These considerations are detailed in an earlier article (How Netrolix Broke the SD-WAN Barrier with AI-WAN*).

    Ultimately, it was the flexibility of chipsets from Intel in supporting Netrolix's architectural needs and the supporting software that were deciding factors. From the very beginning, designing an internet WAN that was more secure than any currently available public or private WAN option was central to those architectural needs.

    The earlier article details chipsets used in different Netrolix SDG platforms, but several Intel technologies play an important role in supporting Netrolix's underlying AI-WAN security, including virtualization, secure hardware sharing, and hardware-based encryption. These include:

    • Intel® Virtualization Technology (Intel® VT) for IA-32, Intel® 64 and Intel® Architecture (Intel® VT-x)
    • Intel® Virtualization Technology (Intel® VT) for Directed I/O (Intel® VT-d) – secure hardware sharing
    • Intel® Advanced Encryption Standard New Instructions (Intel® AES-NI) – CPU-based encryption


    Netrolix AI-WAN Delivers a New Level of Internet WAN Security

    One big challenge facing IT security managers today is that networks are growing so fast, traditional security practices are unable to keep up. In a world in which connected devices, distributed processing, and lots of internetworking are all happening beyond the direct control of those responsible for securing digital assets, WAN security is becoming fundamentally important.

    Netrolix has created a new approach to internet WAN, an AI-WAN that is optimized and secured by a proprietary, deep-learning analytics engine. Netrolix's multifactor security strategy has effectively created a "defense in depth" approach to WAN security that does more than provide new levels of protection. It also extends WAN security beyond the wires to integrate with IT systems and existing enterprise security stacks. These are all the ways Netrolix AI-WAN makes wide-area networking more secure.

    To learn more about the Netrolix AI-WAN and Netrolix's many networking services built on the AI-WAN fabric, see the article How Netrolix Broke the SD-WAN Barrier with AI-WAN and visit the Netrolix website.

    Also, visit the Intel® Network Builders website and explore a vibrant resource center that provides support to anyone interested in software-defined networking and network function virtualization.

    Intel® System Studio - Download and Install Intel® C++ Compiler


     

            Intel® C++ Compiler, known as icc, is a high-performance compiler that can be used to build and optimize your C/C++ project. icc is distributed as part of the Intel® System Studio product, so Intel® System Studio must be installed on your build system before you can use icc to build your project. If you are new to Intel System Studio, visit the Choose & Download page on the Intel System Studio website to acquire a free, renewable commercial license for 90-day use. If you need a long-term license with priority support, please contact your Intel representative or email intelsystemstudio@intel.com for more information.

            This document describes the steps a new user follows to register, download, and install Intel® System Studio. By following the steps, you can install Intel® System Studio in command-line mode on your host machine without GUI support. The Intel® C++ Compiler is installed as part of Intel® System Studio, along with all the other components. For the detailed list of components in the Intel® System Studio product, please refer to the Intel System Studio website.

           Click Here to Download the Document

    Understanding Capsule Network Architecture


    Capsule networks (CapsNet) are a new neural network architecture, an advance over previous designs, particularly for computer vision tasks. To date, convolutional neural networks (CNNs) have been the standard approach for computer vision tasks. Although CNNs achieve high accuracy on many of these tasks, they still have some shortcomings.

     
    Drawback of Pooling Layers

    CNNs were originally built to classify images; they do so by using successive layers of convolutions and pooling. The pooling layer in a convolutional block reduces the data dimension and achieves something called spatial invariance, which means the network identifies and classifies the object regardless of where it is placed in the image. While this is a powerful concept, it has drawbacks. One is that pooling discards a lot of information, information that is particularly useful for tasks such as image segmentation and object detection. When the pooling layers lose the spatial information about the rotation, location, scale, and other positional attributes of the object, object detection and segmentation become much more difficult. Modern CNN architectures manage to reconstruct some of this positional information using various advanced techniques, but they are not fully accurate, and the reconstruction itself is a tedious process. Another drawback of pooling is that if the position of the object changes slightly, the activations do not change proportionally; this yields good image-classification accuracy but poor performance when you want to locate exactly where the object is in the image.
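    To make the spatial-invariance point concrete, here is a small illustrative NumPy example (not taken from the article's code): two inputs whose only difference is the position of a single activation produce exactly the same result after 2 x 2 max pooling, so the position information is lost.

    import numpy as np

    def max_pool_2x2(x):
        # 2 x 2 max pooling with stride 2 on a 2D feature map
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    a = np.zeros((4, 4)); a[0, 0] = 1.0   # activation at the corner of a pooling window
    b = np.zeros((4, 4)); b[1, 1] = 1.0   # the same activation, shifted within that window

    print(max_pool_2x2(a))   # [[1. 0.] [0. 0.]]
    print(max_pool_2x2(b))   # [[1. 0.] [0. 0.]] -- identical output; the shift is no longer visible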

     
    Capsule Networks

    To overcome such difficulties, a new approach called the capsule network1 was proposed by Geoffrey Hinton. A capsule is a group of neurons that stores different information about the object it is trying to identify in a given image, mostly about its position, rotation, scale, and other positional attributes, in a high-dimensional vector space (8 or 16 dimensions), with each dimension representing something about the object that can be understood intuitively (see Figure 4).

    In computer graphics there is a concept called rendering, which means taking various internal representations of an object, such as its position, rotation, and scale, and converting them into an image on screen. In contrast, our brain works in the opposite way, called inverse graphics: when we look at any object, we internally deconstruct it into hierarchical sub-parts, and we tend to develop relationships between these internal parts and the whole object. This is how we recognize objects, and because of this our recognition does not depend on a particular view or orientation of the object. This concept is the building block of capsule networks.

    To understand how this works in a capsule network let’s look at its architectural design. The architecture of a capsule network is divided into three main parts and each part has sub operations in it. They are:

    • Primary capsules
      • Convolution
      • Reshape
      • Squash
    • Higher layer capsules
      • Routing by agreement
    • Loss calculation
      • Margin loss
      • Reconstruction loss

    1. Primary capsules

    This is the first layer of the capsule network and this is where the process of inverse graphics takes place. Suppose you are feeding the network with the image of a boat or a house, like in the following images:

    [Figure: primary capsules, the process of inverse graphics]

    Now, these images are broken down into their sub hierarchical parts in this layer. Let’s assume for the sake of simplicity that these images are constructed out of two distinct sub parts; that is, one rectangle and one triangle.

    [Figure: primary capsules, one rectangle and one triangle]

    In this layer, capsules representing the triangle and rectangle will be constructed. Let us suppose we initialize this layer with 100 capsules, 50 representing the rectangle and 50 representing the triangle. The output of these capsules is represented with the help of arrows in the image below; the black arrows representing the rectangle’s output, and the blue arrows representing the triangle’s. These capsules are placed in every location of the image, and the output of these capsules denotes whether or not that object is located in that position. In the picture below you can see that in the location where the object is not placed, the length of the arrow is shorter and where the object is placed, the arrow is longer. The length represents whether the object is present, and the pose of the arrow represents the orientation of that particular object (position, scale, rotation, and so on) in the given image.

    [Figure: primary capsules, object position, scale, and rotation]

    An interesting thing about this representation is that, if we slightly rotate the object in our input image, the arrows representing these objects will also slightly rotate with proportion to its input counterpart. This slight change in input resulting in a slight change in the corresponding capsule’s output is known as equivariance. This enables the capsule networks to locate the object in a given image with its precise location, scale, rotation, and other positional attributes associated with it.

    [Figure: primary capsules, equivariance]

    This is achieved using three distinct processes:

    1. Convolution
    2. Reshape function
    3. Squash function

    In this layer the input image is fed into a couple of convolution layers, which output an array of feature maps; let's say 18 feature maps. Now we apply the Reshape function to these feature maps; let's say we reshape them into two vectors of nine dimensions each (18 = 2 x 9) for every location in the image, similar to the image above representing the rectangle and triangle capsules. The last step is to make sure that the length of each vector is not greater than 1, because the length of each vector is the probability of whether or not that object is located at that given location in the image, so it should be between 0 and 1. To achieve this we apply something called the Squash function. This function simply makes sure that the length of each vector is between 0 and 1 without destroying the positional information located in the higher dimensions of the vector.

    [Figure: primary capsules, squash function]

    Now we need to figure out what these sub parts or capsules are related to. That is, if we consider our example of boat and house, we need to figure out which triangle and rectangle is part of the house and which is part of the boat. So far, we know where in the image these rectangles and triangles are located, using these convolutions and squashing functions. Now we need to figure out whether a boat or a house is located there and how these triangles and rectangles are related to the boat and the house.

    2. Higher layer capsules

    Before we get into higher layer capsules, there is still one major function performed by the primary capsule layer. Right after the Squash function, every capsule in the primary layer tries to predict the output of every capsule in the higher layer of the network. For example, we have 100 primary capsules: 50 rectangle capsules and 50 triangle capsules. Now, suppose we have two types of capsules in the higher layer, one for house and another for boat. Depending upon the orientation of the triangle and rectangle capsules, they will make the following predictions about the higher layer capsules, giving rise to the following scenario:

    [Figure: higher layer capsules, triangle and rectangle predictions]

    As you can see, given their original orientations, the rectangle capsule and the triangle capsule both predict a boat in one of their predictions. They both agree that the boat capsule should be activated in the higher layer. This means that the rectangle and triangle are more likely part of a boat than of a house, and that selecting the boat capsule best explains their own orientations in the primary layer. In this way, both primary layer capsules agree to select the boat capsule in the next layer as the likely object in the image. This is called routing by agreement.

    [Figure: higher layer capsules, routing by agreement]

    This particular technique has several benefits. Once the primary capsules agree to select a certain higher-level capsule, there is no need to send a signal to another capsule in another higher layer, and the signal in the agreed-on capsule can be made stronger and cleaner and can help in accurately predicting the pose of the object. Another benefit is that if we trace the path of the activation, from the triangle and rectangle to the boat capsule in a higher layer, we can easily sort out the hierarchy of the parts and understand which part belongs to which object; in this particular case, rectangle and triangle belong to the boat object.

    So far we have dealt with the primary layer; now the actual work of the higher capsule layer comes in. Even though the primary layer predicted some output for the higher layer, it still needs to calculate its own output and cross-check which prediction matches with its own calculation.

    The first step the higher capsule layer takes to calculate its own output is to set up routing weights. We have the predictions given by the primary layer; at the first iteration, the routing weight for every prediction is initialized to zero. These initial routing weights are fed into a Softmax function and the resulting values are assigned to the predictions.

    [Figure: higher layer capsules, routing weights fed into the Softmax function]

    Now, after assigning the Softmax outputs to the predictions, the layer calculates the weighted sum for each capsule in this higher layer. This gives us two capsule outputs from the whole set of predictions, which is the actual output of the higher layer for the first round, or first iteration.

    [Figure: higher layer capsules, weighted sum]

    Now we can find which prediction is the most accurate compared to the actual output of the layer.

    [Figure: higher layer capsules, predictions compared to the layer's actual output]

    After selecting the accurate prediction, we calculate new routing weights for the next round by taking the scalar product of each prediction and the actual output of the layer and adding it to the existing routing weight, as given by the equation:

    U^ij (prediction made by primary layer capsule i for higher layer capsule j)

    Vj (actual output of higher layer capsule j)

    Bij += U^ij · Vj

    Now, if the prediction and the output match, the new routing weights will be large, and if not, the weight will be low. Again, the routing weights are fed into the Softmax function and the values are assigned to the predictions. You can see that the strong agreed predictions have large weights associated with them, whereas others have low weights.

    [Figure: higher layer capsules, routing weight values assigned to predictions]

    Again we compute the weighted sum of these predictions with the new weights. But now we find that the boat capsule has a longer vector associated with it than the house capsule, because the weights were in favor of the boat capsule, so the layer chooses the boat capsule over the house capsule in just two iterations. In this way we compute the output of this higher layer and choose which capsule to select from it for the next steps in the capsule network.

    I have only described two iterations or rounds for the sake of simplicity, but actually it can take longer, depending upon the task you are performing.

    [Figure: higher layer capsules, computed output of the higher layer]
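    The routing-by-agreement procedure described above can be summarized in a few lines of NumPy. This is an illustrative sketch with toy shapes (100 primary capsules, 2 higher-layer capsules, 16-dimensional outputs, and random predictions), not the paper's implementation; the Keras layer later in this article performs the same computation on real tensors.

    import numpy as np

    def squash(s, axis=-1, eps=1e-10):
        norm_sq = np.sum(np.square(s), axis=axis, keepdims=True)
        return s * norm_sq / ((1.0 + norm_sq) * np.sqrt(norm_sq + eps))

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    n_primary, n_higher, dim = 100, 2, 16
    u_hat = np.random.randn(n_primary, n_higher, dim)      # predictions u^_ij from each primary capsule

    b = np.zeros((n_primary, n_higher))                     # routing weights b_ij, initialized to zero
    for _ in range(3):                                      # a few routing iterations
        c = softmax(b, axis=1)                              # softmax over the higher layer capsules
        s = (c[:, :, None] * u_hat).sum(axis=0)             # weighted sum of predictions per higher capsule
        v = squash(s)                                       # actual output v_j of each higher capsule
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)        # agreement update: b_ij += u^_ij . v_j

    print(np.linalg.norm(v, axis=-1))   # vector lengths, i.e., presence probabilities for house vs. boat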

    3. Loss calculation

    Now that we have decided what object is in the image using the routing by agreement method, we can perform classification. Since the higher layer has one capsule per class (one for boat and one for house), we can add a layer on top of it that computes the length of each activation vector and, depending on that length, assigns a class probability to make an image classifier.

    In the original paper, margin loss was used to calculate the class probabilities for multiple classes to create such an image classifier. Margin loss simply means that if an object of a given class is present in the image, then the length of the corresponding capsule's output vector must not be less than 0.9. Similarly, if that object is not present in the image, then the length of that vector should not be more than 0.1.

    Suppose |Vk| is the length of the output vector for the class K object. If the class K object is present, then |Vk| >= 0.9. Similarly, if the class K object is not present, then |Vk| <= 0.1.

    In addition to the margin loss, there is an additional unit called the decoder network connected to the higher capsule layer. This decoder network consists of three fully connected layers, the first two with rectified linear unit (ReLU) activations and the last with a sigmoid activation, and it is used to reconstruct the input image.


    This decoder network learns to reconstruct the input image by minimizing the squared difference between the reconstructed image and the input image:

    Reconstruction Loss = (Reconstructed Image - Input Image)^2

    Now, we have total loss as:

    Total Loss = Margin Loss + alpha * Reconstruction Loss

    Here, the value of alpha (a constant that scales down the reconstruction loss) used in the paper1 is 0.0005 (no extra information is given on why this particular value was chosen). The reconstruction loss is scaled down considerably so that the margin loss dominates the training process. The importance of the reconstruction unit and the reconstruction loss is that they force the network to preserve, up to the highest capsule layer, the information required to reconstruct the image. This also acts as a regularizer to avoid over-fitting during training.

    In the paper1, the capsule network is used to classify MNIST* digits. As you can see below (in Figure 1), the paper shows the different units of CapsNet for MNIST classification. Here, the input, after being fed through two convolutional layers, is reshaped and squashed to form the primary capsules: 32 channels of 6 x 6 capsules, each an 8-dimensional vector. These primary capsules are fed into the higher layer capsules, a total of 10 capsules with 16 dimensions each, and finally the margin loss is calculated on these higher layer capsules to give the class probabilities.


    Figure 1: Dynamic routing between capsules1

    Figure 2 shows the decoder network used to calculate the reconstruction loss. A higher layer capsule is connected to three fully connected layers, with the last layer being sigmoid activated; it outputs 784 pixel-intensity values (a 28 x 28 reconstructed image).


    Figure 2: Decoder structure to reconstruct a digit1

    An interesting thing about this higher capsule layer is that each dimension on this layer is interpretable. That is, if we take the example from the paper on the MNIST dataset, each dimension from the 16-dimension activation vector can be interpreted and signifies certain characteristics of the object. If we modify one of the 16 dimensions we can play with the scale and thickness of the input; similarly, another can represent stroke thickness, another width and translation, and so on.


    Figure 3: Dimension perturbations1

    Let’s look at how we can implement3 it using Keras* with TensorFlow* backend. You start by importing all the required libraries:

    from keras import layers, models, optimizers
    from keras.layers import Input, Conv2D, Dense
    from keras.layers import Reshape, Layer, Lambda
    from keras.models import Model
    from keras.utils import to_categorical
    from keras import initializers
    from keras.optimizers import Adam
    from keras.datasets import mnist
    from keras import backend as K
    
    import numpy as np
    import tensorflow as tf
    

    First, let’s define the Squash function:

    def squash(output_vector, axis=-1):
        norm = tf.reduce_sum(tf.square(output_vector), axis, keep_dims=True)
        return output_vector * norm / ((1 + norm) * tf.sqrt(norm + 1.0e-10))
    

    After defining the Squash function, we can define the masking layer:

    class MaskingLayer(Layer):
        def call(self, inputs, **kwargs):
            input, mask = inputs
            return K.batch_dot(input, mask, 1)
    
        def compute_output_shape(self, input_shape):
            *_, output_shape = input_shape[0]
            return (None, output_shape)
    

    Now, let’s define the primary Capsule function:

    def PrimaryCapsule(n_vector, n_channel, n_kernel_size, n_stride, padding='valid'):
        def builder(inputs):
            output = Conv2D(filters=n_vector * n_channel, kernel_size=n_kernel_size, strides=n_stride, padding=padding)(inputs)
            output = Reshape( target_shape=[-1, n_vector], name='primary_capsule_reshape')(output)
            return Lambda(squash, name='primary_capsule_squash')(output)
        return builder
    

    After that, let’s write the capsule layer class:

    class CapsuleLayer(Layer):
        def __init__(self, n_capsule, n_vec, n_routing, **kwargs):
            super(CapsuleLayer, self).__init__(**kwargs)
            self.n_capsule = n_capsule
            self.n_vector = n_vec
            self.n_routing = n_routing
            self.kernel_initializer = initializers.get('he_normal')
            self.bias_initializer = initializers.get('zeros')
    
        def build(self, input_shape): # input_shape is a 4D tensor
            _, self.input_n_capsule, self.input_n_vector, *_ = input_shape
            self.W = self.add_weight(shape=[self.input_n_capsule, self.n_capsule, self.input_n_vector, self.n_vector], initializer=self.kernel_initializer, name='W')
            self.bias = self.add_weight(shape=[1, self.input_n_capsule, self.n_capsule, 1, 1], initializer=self.bias_initializer, name='bias', trainable=False)
            self.built = True
    
        def call(self, inputs, training=None):
            # inputs: (batch, input_n_capsule, input_n_vector) from the primary capsule layer
            input_expand = tf.expand_dims(tf.expand_dims(inputs, 2), 2)
            input_tiled = tf.tile(input_expand, [1, 1, self.n_capsule, 1, 1])
            # Predictions u_hat: each input capsule multiplied by the transformation matrix W
            input_hat = tf.scan(lambda ac, x: K.batch_dot(x, self.W, [3, 2]), elems=input_tiled, initializer=K.zeros( [self.input_n_capsule, self.n_capsule, 1, self.n_vector]))
            for i in range(self.n_routing): # routing by agreement
                c = tf.nn.softmax(self.bias, dim=2)  # routing weights over the higher layer capsules
                outputs = squash(tf.reduce_sum( c * input_hat, axis=1, keep_dims=True))  # weighted sum, then squash
                if i != self.n_routing - 1:
                    self.bias += tf.reduce_sum(input_hat * outputs, axis=-1, keep_dims=True)  # agreement update
            return tf.reshape(outputs, [-1, self.n_capsule, self.n_vector])
    
        def compute_output_shape(self, input_shape):
            # output current layer capsules
            return (None, self.n_capsule, self.n_vector)
    

    The class below will compute the length of the capsule:

    class LengthLayer(Layer):
        def call(self, inputs, **kwargs):
            return tf.sqrt(tf.reduce_sum(tf.square(inputs), axis=-1, keep_dims=False))
    
        def compute_output_shape(self, input_shape):
            *output_shape, _ = input_shape
            return tuple(output_shape)
    

    The function below will compute the margin loss:

    def margin_loss(y_ground_truth, y_prediction):
        _m_plus = 0.9
        _m_minus = 0.1
        _lambda = 0.5
        L = y_ground_truth * tf.square(tf.maximum(0., _m_plus - y_prediction)) + _lambda * ( 1 - y_ground_truth) * tf.square(tf.maximum(0., y_prediction - _m_minus))
        return tf.reduce_mean(tf.reduce_sum(L, axis=1))
    

    After defining the different necessary building blocks of the network we can now preprocess the MNIST dataset input for the network:

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    x_test = x_test.reshape(-1, 28, 28, 1).astype('float32') / 255.0
    y_train = to_categorical(y_train.astype('float32'))
    y_test = to_categorical(y_test.astype('float32'))
    X = np.concatenate((x_train, x_test), axis=0)
    Y = np.concatenate((y_train, y_test), axis=0)
    

    Below are some variables that will represent the shape of the input, number of output classes, and number of routings:

    input_shape = [28, 28, 1]
    n_class = 10
    n_routing = 3
    

    Now, let’s create the encoder part of the network:

    x = Input(shape=input_shape)
    conv1 = Conv2D(filters=256, kernel_size=9, strides=1, padding='valid', activation='relu', name='conv1')(x)
    primary_capsule = PrimaryCapsule( n_vector=8, n_channel=32, n_kernel_size=9, n_stride=2)(conv1)
    digit_capsule = CapsuleLayer( n_capsule=n_class, n_vec=16, n_routing=n_routing, name='digit_capsule')(primary_capsule)
    output_capsule = LengthLayer(name='output_capsule')(digit_capsule)
    

    Then let’s create the decoder part of the network:

    mask_input = Input(shape=(n_class, ))
    mask = MaskingLayer()([digit_capsule, mask_input])  # two inputs
    dec = Dense(512, activation='relu')(mask)
    dec = Dense(1024, activation='relu')(dec)
    dec = Dense(784, activation='sigmoid')(dec)
    dec = Reshape(input_shape)(dec)
    

    Now let’s create the entire model and compile it:

    model = Model([x, mask_input], [output_capsule, dec])
    model.compile(optimizer='adam', loss=[ margin_loss, 'mae' ], metrics=[ margin_loss, 'mae', 'accuracy'])
    

    To view the layers and overall architecture of the entire model, we can use this command: model.summary()

    Finally, we can train the model for three epochs and find out how it will perform:

    model.fit([X, Y], [Y, X], batch_size=128, epochs=3, validation_split=0.2)
    

    After training the model for only three epochs, the training set output accuracy of the model on the MNIST dataset was 0.9914, and 0.9919 for the validation set, which is 99 percent accurate for both the training and validation sets.
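    As a quick sanity check of the trained model, you can run a prediction and take the capsule with the longest output vector as the predicted class. The snippet below is a sketch continuing the variables defined above (model, x_test, y_test), not part of the original code. Note that this model also takes the mask input, so the labels are passed here only to drive the decoder branch, and because the training above used all of X, this is not a proper held-out evaluation.

    # The model returns [capsule lengths per class, reconstructed images]
    capsule_lengths, reconstructions = model.predict([x_test, y_test], batch_size=128)

    predicted_classes = np.argmax(capsule_lengths, axis=1)   # the longest capsule vector wins
    true_classes = np.argmax(y_test, axis=1)
    print('Accuracy on the MNIST test images:', np.mean(predicted_classes == true_classes))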

    For the above implementation, the Intel® AI DevCloud was used to train the network. Intel AI DevCloud is available free of charge for academic and personal research purposes, and access can be requested here: https://software.intel.com/en-us/ai-academy/tools/devcloud.

    In this way, you can implement the capsule network using Keras and TensorFlow backend.

    Now let’s look at some of the pros and cons of a capsule network.


    Pros

    1. Requires less training data
    2. Equivariance preserves positional information of the input object
    3. Routing by agreement is great for overlapping objects
    4. Automatically calculates hierarchy of parts in a given object
    5. Activation vectors are interpretable
    6. Reached high accuracy in MNIST

    Cons

    1. Results are not state of the art in difficult datasets like CIFAR10*
    2. Not tested in larger datasets like ImageNet*
    3. Slow training process due to inner loop
    4. Problem of crowding—not being able to distinguish between two identical objects of the same type placed close to one another.


    References

    1. Dynamic Routing Between Capsules by Sara Sabour, Nicholas Frosst and Geoffrey E Hinton: https://arxiv.org/pdf/1710.09829.pdf
    2. Capsule Networks (CapsNets) – Tutorial created by Aurélien Géron: https://www.youtube.com/watch?v=pPN8d0E3900
    3. The code used above is adapted from the fengwang/minimal-capsule GitHub* repository: https://github.com/fengwang/minimal-capsule


    Intel® Math Kernel Library (Intel® MKL) 2019 System Requirements


    Operating System Requirements

    The Intel MKL 2019 release supports the IA-32 and Intel® 64 architectures. For a complete explanation of these architecture names please read the following article:

    Intel Architecture Platform Terminology for Development Tools

    The lists below pertain only to the system requirements necessary to support developing applications with Intel MKL. Please review the hardware and software system requirements for your compiler (gcc*, Microsoft* Visual Studio*, or Intel® Compiler Pro) in the documentation provided with that product to determine the minimum development system requirements for your compiler.

    Supported operating systems: 

    • Windows 10 (IA-32 / Intel® 64)
    • Windows 8.1* (IA-32 / Intel® 64)
    • Windows 7* SP1 (IA-32 / Intel® 64)
    • Windows HPC Server 2016 (Intel® 64)
    • Windows HPC Server 2012 (Intel® 64)
    • Windows HPC Server 2008 R2 (Intel® 64) 
    • Red Hat* Enterprise Linux* 6 (IA-32 / Intel® 64)
    • Red Hat* Enterprise Linux* 7 (IA-32 / Intel® 64)
    • Red Hat* Enterprise Linux* 7.5 (IA-32 / Intel® 64)
    • Red Hat Fedora* core 28 (IA-32 / Intel® 64)
    • Red Hat Fedora* core 27 (IA-32 / Intel® 64)
    • SUSE Linux Enterprise Server* 11 
    • SUSE Linux Enterprise Server* 12
    • SUSE Linux Enterprise Server* 15
    • openSUSE* 13.2
    • CentOS 7.1
    • CentOS 7.2
    • Debian* 8 (IA-32 / Intel® 64)
    • Debian* 9 (IA-32 / Intel® 64)
    • Ubuntu* 16.04 LTS (IA-32/Intel® 64)
    • Ubuntu* 17.10 (IA-32/Intel® 64)
    • Ubuntu* 18.04 LTS (IA-32/Intel® 64)
    • WindRiver Linux 8
    • WindRiver Linux 9
    • WindRiver Linux 10
    • Yocto 2.3
    • Yocto 2.4
    • Yocto 2.5
    • Yocto 2.6
    • macOS* 10.13 (Xcode 6.x) and macOS* 10.14 (Xcode 6.x) (Intel® 64)

             Note: Intel® MKL is expected to work on many more Linux distributions as well. Let us know if you have trouble with the distribution you use.

    Supported C/C++ and Fortran compilers for Windows*:

    • Intel® Fortran Composer XE 2019 for Windows* OS
    • Intel® Fortran Composer XE 2018 for Windows* OS
    • Intel® Fortran Composer XE 2017 for Windows* OS
    • Intel® Visual Fortran Compiler 19.0 for Windows* OS
    • Intel® Visual Fortran Compiler 18.0 for Windows* OS
    • Intel® Visual Fortran Compiler 17.0 for Windows* OS
    • Intel® C++ Composer XE 2019 for Windows* OS
    • Intel® C++ Composer XE 2018 for Windows* OS
    • Intel® C++ Composer XE 2017 for Windows* OS
    • Intel® C++ Compiler 19.0 for Windows* OS
    • Intel® C++ Compiler 18.0 for Windows* OS
    • Intel® C++ Compiler 17.0 for Windows* OS
    • Microsoft Visual Studio* 2017 - help file and environment integration
    • Microsoft Visual Studio* 2015 - help file and environment integration
    • Microsoft Visual Studio* 2013 - help file and environment integration

    Supported C/C++ and Fortran compilers for Linux*:

    • Intel® Fortran Composer XE 2019 for Linux* OS
    • Intel® Fortran Composer XE 2018 for Linux* OS
    • Intel® Fortran Composer XE 2017 for Linux* OS
    • Intel® Fortran Compiler 19.0 for Linux* OS
    • Intel® Fortran Compiler 18.0 for Linux* OS
    • Intel® Fortran Compiler 17.0 for Linux* OS
    • Intel® C++ Composer XE 2019 for Linux* OS
    • Intel® C++ Composer XE 2018 for Linux* OS
    • Intel® C++ Composer XE 2017 for Linux* OS
    • Intel® C++ Compiler 19.0 for Linux* OS
    • Intel® C++ Compiler 18.0 for Linux* OS
    • Intel® C++ Compiler 17.0 for Linux* OS
    • GNU Compiler Collection 4.4 and later
    • PGI* Compiler version 2018
    • PGI* Compiler version 2017

    Note: Using the latest version of the Intel® Manycore Platform Software Stack (Intel® MPSS) is recommended on the Intel® MIC Architecture. It is available from the Intel® Software Development Products Registration Center at http://registrationcenter.intel.com as part of your Intel® Parallel Studio XE for Linux* registration.

    Supported C/C++ and Fortran compilers for OS X*:

    • Intel® Fortran Compiler 19.0 for macOS *
    • Intel® Fortran Compiler 18.0 for macOS *
    • Intel® Fortran Compiler 17.0 for macOS *
    • Intel® C++ Compiler 19.0 for macOS *
    • Intel® C++ Compiler 18.0 for macOS *
    • Intel® C++ Compiler 17.0 for macOS *
    • CLANG/LLVM Compiler 9.0
    • CLANG/LLVM Compiler 10.0

    MPI implementations that Intel® MKL for Windows* OS has been validated against:

    • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
    • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
    • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
    • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
    • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
    • MS MPI, CCE or HPC 2012 on Intel® 64 (http://www.microsoft.com/downloads)

    MPI implementations that Intel® MKL for Linux* OS has been validated against:

    • Intel® MPI Library Version 2019 (Intel® 64) (http://www.intel.com/go/mpi)
    • Intel® MPI Library Version 2018 (Intel® 64) (http://www.intel.com/go/mpi)
    • Intel® MPI Library Version 2017 (Intel® 64) (http://www.intel.com/go/mpi)
    • MPICH version 3.3  (http://www-unix.mcs.anl.gov/mpi/mpich)
    • MPICH version 3.1  (http://www-unix.mcs.anl.gov/mpi/mpich)
    • MPICH version 2.14  (http://www-unix.mcs.anl.gov/mpi/mpich)
    • Open MPI 1.8.x (Intel® 64) (http://www.open-mpi.org)

    Note: Usage of MPI and linking instructions can be found in the Intel Math Kernel Library Developer Guide

    Other tools supported for use with example source code:

    • uBLAS examples: Boost C++ library, version 1.x.x
    • JAVA examples: J2SE* SDK 1.4.2, JDK 5.0 and 6.0 from Sun Microsystems, Inc.

    Note: Parts of Intel® MKL have FORTRAN interfaces and data structures, while other parts have C interfaces and C data structures. The Intel Math Kernel Library Developer Guide  contains advice on how to link to Intel® MKL with different compilers and from different programming languages.
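    As a quick illustration only (the Developer Guide remains the authoritative reference, and the exact library list depends on the interface, threading layer, and compiler you choose), a common dynamic link line for a C program on Linux* using the LP64 interface with sequential threading looks like the following, assuming MKLROOT has been set by the Intel environment scripts:

    gcc myprog.c -m64 -I${MKLROOT}/include \
        -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core \
        -lpthread -lm -ldl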

    Deprecation Notices:

    • Dropped support for all IA-32 MPI implementations
    • Red Hat Enterprise Linux* 5.0 support has been dropped
    • Windows XP* support has been removed
    • Windows Server 2003* and Windows Vista* are not supported

     

    Improving Cycle-GAN using Intel® AI DevCloud


    Introduction

    In this article, we will examine opportunities for optimization in Cycle-GAN for unpaired image-to-image translation and propose a new architecture. We will also dive deeper into using Intel® AI DevCloud to further speed up the training process by harnessing multiple compute nodes of the cluster.

    Image-to-image translation involves transferring the characteristics of an image from one domain to another. For learning such mapping, the training dataset of images can be either paired or unpaired. Paired images imply that each example is in the form of a pair, having an image from both source and target domain; the dataset is said to be unpaired when there is no one-to-one correspondence between training images from input domain X and target domain Y.

    Paired versus unpaired Image dataset
    Figure 1. Paired versus unpaired image dataset. The paired image dataset contains examples such that for every ith example, there is an image pair xi and yi. Here, xi and yi are a sketch and its corresponding actual photograph, respectively. The unpaired image dataset contains a set of images separately for actual photographs (X) and paintings (Y). Source: Cycle-GAN paper

    Previous works such as pix2pix* have offered image-to-image translation using paired training data; for example, converting a photograph from daytime to nighttime. In that case, paired data can be obtained by taking pictures of the same location in the daytime as well as at night.

    Applications pix2pix trained on paired image dataset
    Figure 2. Some applications of pix2pix* trained on a paired image dataset; that is, when it is possible to obtain images of the same subject under different domains. Source: pix2pix paper

    However, obtaining paired training data can be difficult and sometimes impossible. For example, to convert a horse into a zebra in an image, it is impossible to obtain a pair of images of a horse and a zebra in exactly the same location and in the same posture. This is where unpaired image-to-image translation is desired. Still, it is challenging to convert an image from one domain to another when no paired examples are available. For example, such a system would have to convert the part of the image where the horse is detected without altering the background, so that one-to-one correspondence exists between the source image and the target image.

    Cycle-GAN provides an effective technique for learning mappings from unpaired image data. Some of the applications of using Cycle-GAN are shown below:

    Applications of Cycle G A N
    Figure 3. Applications of Cycle-GAN. This technique uses an unpaired dataset for training and is still able to effectively learn to translate images from one domain to another. Source: Cycle-GAN Paper

    Cycle-GAN has applications in domains where a paired image dataset is not available. Even when paired images could be obtained, it is easier to collect images from both domains separately than to selectively obtain pairs. Also, a dataset of unpaired images can be built much larger and faster. Cycle-GAN is further discussed in the next section.

    Background

    Generative adversarial network

    A generative adversarial network (GAN) is a framework for estimating generative models. As an example, a generative model can generate the next likely video frame based on previous frames.

    Generative adversarial networks for image generation
    Figure 4. Generative adversarial networks for image generation—the generator draws samples from latent random variables and the discriminator tells whether the sample came from the generator or the real world. Source: Deep Learning for Computer Vision: Generative models and adversarial training (UPC 2016)

    The generative adversarial network involves not only a neural network for generating content (the generator), but also a neural network for determining whether the content is real or fake, called the adversarial (discriminator) network. The generator and discriminator are trained simultaneously, each optimizing against the other in a two-player zero-sum game, until the two networks reach an equilibrium point (a Nash equilibrium) of the game.
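
    For reference, this adversarial training is usually formalized (following Goodfellow et al.) as the minimax objective

    \min_G \max_D \; V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]

    where the discriminator D tries to maximize this value and the generator G tries to minimize it.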

    The combination of a generator network and an adversarial (discriminator) network opened up far more creative tasks for computers than were previously possible by other methods. Facebook*'s AI research director Yann LeCun referred to the adversarial training of GANs as "the most interesting idea in the last 10 years in ML." However, despite the plethora of creative possibilities that GANs enable, one of the weaknesses of early GANs was limited training stability.

    Cycle-GAN

    The Cycle-GAN architecture was proposed in the paper Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Regarding the use of a simple GAN for this problem, Jun-Yan Zhu and his colleagues (2017) suggested:

    "With large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, an adversarial loss alone cannot guarantee that the learned function can map an individual input to a desired output."

    In other words, a vanilla GAN would have no sense of direction for maintaining the correspondence between the source and the target image. In order to provide this sense of direction to the network, the authors introduced the cycle-consistency loss.

    Cycle-consistency loss in Cycle-GAN
    Figure 5. Cycle-consistency loss in Cycle-GAN. If an input image A from domain X is transformed into a target image B from domain Y via some generator G, then when image B is translated back to domain X via some generator F, this obtained image should match the input image A. The difference between these two images is defined as the cycle-consistency loss. This loss is similarly applicable to the image from domain Y. Source: Cycle-GAN Paper
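
    In the notation of the Cycle-GAN paper, with generators G: X → Y and F: Y → X, the cycle-consistency loss is written as

    \mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big] + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]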

    This approach requires creating two pairs of generators and discriminators: one for A2B (source to target conversion) and another for B2A (target to source conversion).

    A simplified architecture of Cycle-GAN
    Figure 6. A simplified architecture of Cycle-GAN. Considering an example for converting an image of a horse into a zebra, Cycle-GAN requires two generators. The generator A2B converts a horse into a zebra and B2A converts a zebra into a horse. Both train together to ensure one-to-one correspondence between the input horse image and the generated image of the zebra. The two discriminators determine real or fake images for horse and zebra, respectively. Source: Understanding and Implementing CycleGAN in TensorFlow

    Coming Up with a Better Architecture

    I took a special interest in Cycle-GAN due to its impressive results. Initially, my goal was to implement the approach provided in the paper on the TensorFlow* framework and study the technical aspects in detail.

    While implementing it, I noticed that the training process was time consuming and that there was scope for optimization in Cycle-GAN:

    • In order to create a system that learns the mapping from domain A to domain B, Cycle-GAN involves an extra module for learning the mapping from domain B back to domain A. This extra module consumes an equal share of computational resources, so about half of the resources are spent building a system that has no utility after the training process.

    Let us reconsider the purpose for introducing cyclic-loss in Cycle-GAN:

    Problem with Vanilla GANs

    While generative adversarial networks are found to be performing very well at generating samples or images, in the case of the unpaired image-to-image translation problem, the correspondence between the input image and the target output image is desired.

    The correspondence between input image and target output
    Figure 7. When converting an image of a horse into a zebra, if the generator creates an image of a zebra, which has no relation with the input horse image, the discriminator will be okay with such an image too.

    It turns out that GANs do not force the output to correspond to its input. To address this issue, that is, to preserve the true content/characteristics of the input image in the translated image, the authors of the Cycle-GAN paper introduced cycle-consistency loss.

    Optimization in generator network

    I questioned whether there was any other way through which this goal could be achieved without having to create a second generator-discriminator pair.

    This idea for optimizing an already well-performing architecture came to my mind by taking inspiration from Gandhian Engineering, which talks about reducing the cost of a product through innovation. The core of this approach is to create a product that has more value and yet is accessible to more people; that is, more for less for more. The key idea for doing so is that nothing goes unquestioned.

    For this, I specifically targeted the problem of converting an image of an apple into an image of an orange. Thus, in this case the goal would be to modify the portion of the input image where the apple is detected, but keep the background intact.

    This is a different perspective from that taken in Cycle-GAN, which tries to modify the complete image without assuming that the background will remain intact. That way, the second discriminator has to learn and enforce this assumption, which results in extra time spent learning this fact.

    I figured this goal could be achieved using a single neural network. For this, we essentially need a system that takes images from both domains—A (source) and B (target)—and outputs only images from domain B.

    Cycle-consistency loss versus deviation loss
    Figure 8. Cycle-consistency loss versus deviation loss. By applying deviation loss, only one generator can ensure one-to-one correspondence between source and target image. This eliminates the need for two generator networks.

    To ensure that images from domain B do not change, I introduced deviation loss, which is defined as the difference between the encodings of an image and the output of the generator network. This loss replaces the cycle-consistency loss present in the Cycle-GAN architecture. The deviation loss regularizes the training of the translator by directing it to translate only the bare-minimum part of the encoded image from domain A needed to make it appear like a real encoding from domain B. It also enforces that the spatial features are kept intact throughout the translation.
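
    A minimal way to write the deviation loss described above, assuming an encoder E, a translator T that operates on encodings, and an L1 penalty (the exact norm is not specified in the article), is

    \mathcal{L}_{dev}(T) = \mathbb{E}_{b \sim p_{\text{data}}(B)}\big[\lVert T(E(b)) - E(b) \rVert_1\big]

    That is, the translator is penalized whenever it changes the encoding of an image that already belongs to the target domain B.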

    Optimization in discriminator and encoder-decoder pair

    I found another opportunity for optimization in the discriminator of Cycle-GAN or convolutional GANs in general.

    I started rethinking generative adversarial networks. As mentioned earlier, a GAN is essentially an optimization problem for a two-player zero-sum game, in which the generator tries to fool the discriminator and the discriminator tries to detect fake samples. There is an entire field of research around game theory that has not been applied to GANs, even though GAN training is a game-theoretic problem at its core, and most game-playing strategies involve acquiring as much information as possible about the opponent's thinking. Thus, it makes sense that the two competing players share some of their perspective on the game.

    So, the generator and discriminator should share as much information as possible while maintaining enough exclusiveness to keep the competition alive.

    This led to another modification in the way the generator and discriminator receive their inputs. In the case of Cycle-GAN, the discriminator takes the whole image and predicts whether the image looks real or fake. We can consider this discriminator to be working in two parts: the first part encodes the input image and the second part makes the prediction from the encoding.

    The discriminator in Cycle-GAN
    Figure 9. The discriminator in Cycle-GAN. The second half of discriminator (Conv-2) needs feature encodings of the image, which was already available from the output of the translator network; thus, unnecessary upsampling of this encoding from the decoder and again encoding it from the first part of the discriminator (Conv-1) will not just induce an error into the encodings but will also take more training resources.

    So, the latter half of the discriminator need not take input from the output of the generator's decoder (a whole image), but it can directly take the translated feature encodings from the translator network as an input.

    The discriminator in Cycle G A N versus Proposed Architecture
    Figure 10. The discriminator in Cycle-GAN versus the discriminator in Proposed Architecture. Note that the need for the first part of the discriminator is completely eliminated. The generator can provide feature encodings directly to the discriminator.

    Due to this change, the decoder part of the generator will be unable to take part in the generator-discriminator optimization. However, it can be optimized separately, along with the encoder, in the same way as an AutoEncoder.

    Also, due to this major change of using only one generator-discriminator pair instead of two, further optimization seems possible. In the Cycle-GAN architecture there were two separate encoder-decoder pairs: one pair for encoding-decoding the images from the source domain and the other pair for encoding-decoding the images from the target domain.

    Since there is only one generator now, only a single encoder-decoder pair can manage to encode and decode the images from both domains. Therefore, this pair can even be trained separately, which has its own advantages.

    The separate step for training can be governed by the cyclic loss or reconstruction loss, which is defined as the difference between the original image and the image obtained when it is encoded and decoded to get the same image. This is similar to AutoEncoder, but between these pairs, the translator (generator) network is sandwiched.

    For training the discriminator network, the conventional loss for GAN's discriminator is used. If the discriminator correctly classifies the encodings of the fake image the translator network is penalized. If the discriminator incorrectly classifies either of the real or fake encodings the discriminator network is penalized. This loss is kept similar to that of Cycle-GAN, but the structure of the discriminator has changed in the proposed architecture.
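
    Using the conventional GAN losses mentioned above, one possible sketch of the adversarial terms in this setup, with encoder E, translator T, and a discriminator D that operates on encodings (the exact formulation in the original implementation may differ), is

    \mathcal{L}_{D} = -\mathbb{E}_{b}\big[\log D(E(b))\big] - \mathbb{E}_{a}\big[\log\big(1 - D(T(E(a)))\big)\big]

    \mathcal{L}_{T}^{adv} = -\mathbb{E}_{a}\big[\log D(T(E(a)))\big]

    so the discriminator is penalized for misclassifying real or fake encodings, and the translator is penalized when the discriminator correctly identifies its output as fake.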

    Optimization for Intel® AI DevCloud

    When I started working on implementing Cycle-GAN, I soon realized that I lacked the computational resources to do so, as generative adversarial networks in general, and Cycle-GAN in particular, are very sensitive to initialization and to choosing just the right hyperparameters. Training such a big network on a local system using only a CPU is not a good idea.

    Intel AI DevCloud works especially well for testing research ideas. This is because it can independently perform computations on multiple nodes of the cluster. Thus, several ideas can be tested simultaneously without waiting for others to complete execution.

    To utilize multiple nodes of the cluster for higher performance, I created several versions of the implementation to find the right set of hyperparameters. For example, I created one job with a learning rate of 0.005, another job with a learning rate of 0.01, another with 0.02, and so on. In this way, if three jobs are submitted simultaneously, the process is effectively sped up by 3x compared to running each version sequentially. This technique is very general and can be used for training any model on Intel AI DevCloud.

    For this optimized architecture specifically, further possibilities emerge to speed up the training process. The architecture consists of three main modules:

    1. Encoder-decoder pair
    2. Translator network (generator)
    3. Discriminator network

    I observed that each of these modules can be trained on a separate compute node. The only catch is that the translator and discriminator networks' inputs depend on the output of the encoder, which itself needs to be trained. Also, the discriminator network's input depends on the translator's output, and the translator's loss depends on the discriminator's output. Thus, if these three networks are trained on separate compute nodes, they must periodically share their updated weights with the other networks. Since all the submitted jobs use the same storage area, I chose to update the weights at the end of each epoch. Three separate checkpoints are created, one by each job; the translator and discriminator jobs update their encoder-decoder pairs at the end of each epoch and train only their corresponding network's weights. That is, the translator job trains only the translator network, but updates the encoder-decoder pair and discriminator network every epoch, using them only for inference. Similarly, the discriminator job uses the other two networks, whose weights are periodically updated, for inference, while only the discriminator network is trained.

    Therefore, this technique can further speed up the training of a single implementation by up to 3x. If combining this technique with submitting multiple jobs for different implementations, three different implementations can result in up to 9x speed-up.

    Proposed Approach

    The final proposed architecture for unpaired image-to-image translation:

    Proposed architecture
    Figure 11. Proposed architecture. The aim is to obtain image Fake B from image Input A. Neural networks are represented by solid boundaries and those having the same color represent the same network. The orange-colored circles indicate loss terms.

    Explanation: Consider an example for converting the image of an apple to an orange. The goal is to perform this task while keeping the background intact. Forward pass involves downsampling of the input image of an apple, translating it to the encoding of an orange and upsampling it, to produce the image of an orange. Deviation loss ensures that the output of the translator is always the feature encodings of the orange. Thus, the image of an orange is unchanged (including the background), whereas the image of an apple changes in such a way that apples are converted into oranges (since the discriminator network is forcing this conversion), while everything in the background is unchanged (since deviation-loss is resisting this change). The key idea is that the translator network learns to not alter the background and the orange but to convert the apple to an orange.

    Experimental Evaluation

    The performance of this architecture is compared with the Cycle-GAN implementation on the TensorFlow Framework on Intel AI DevCloud using Intel® Xeon® Gold 6128 processors.

    Table 1. Comparison of time taken by Cycle-GAN and proposed architecture.

    No. of Epoch(s) | Time by Cycle-GAN | Time by Proposed Architecture | Speed-up
    1               | 66.27 minutes     | 32.92 minutes                 | 2.0128x
    2               | 132.54 minutes    | 65.84 minutes                 | 2.0130x
    3               | 198.81 minutes    | 98.76 minutes                 | 2.0138x
    15              | 994.09 minutes    | 493.80 minutes                | 2.0131x

    Furthermore, this speed-up is achieved by using only a single compute node. By using multiple nodes on Intel AI DevCloud, the speed-up can be as high as 18x. Also, it is observed that due to using the same neural network for encoding and decoding, and also using a less-complex decoder, the proposed system converges nearly twice as fast; that is, it needs nearly half the number of epochs required by Cycle-GAN to produce the same result.

    Results

    The neural networks were trained on images of apples and oranges collected from ImageNet* and were directly available from Taesung Park's Cycle-GAN Datasets. The images were 256 x 256 pixels. The training set consisted of 1177 images of class apple and 996 images of class orange.

    Input apples converted into oranges in the output
    Figure 12. Results. The input images of apples are converted into oranges in the output. Note that the image background has not changed in the process.

    Conclusion

    Summary of the proposed architecture:

    1. Elimination of a second translator (to translate B to A).
    2. Using the same neural network to encode images from both domains (A or B), and the same neural network to decode images from both domains (A or B).
    3. The discriminator takes downsampled image encoding as input, as opposed to taking the whole image, which was the case with the discriminator in Cycle-GAN.
    4. Use of deviation loss, instead of cycle-consistency loss, from Cycle-GAN.

    This optimized architecture speeds up the training process by at least 2x; it is also observed that convergence is achieved in fewer epochs than with Cycle-GAN. Also, by using optimization techniques specific to Intel AI DevCloud, up to 18x speed-up can be achieved.

    Using Modern C++ Techniques to Enhance Multi-core Optimizations



    With multi-core processors now commonplace in PCs, and core counts continually climbing, software developers must adapt. By learning to tackle potential performance bottlenecks and concurrency issues, engineers can future-proof their code to seamlessly handle additional cores as they are added to consumer systems.

    To help with this effort, Intel software teams have created a graphics toolkit that shows how parallel processing techniques are best applied to eight different graphics filters. The entire source code package contains C++ files, header files, project files, filters, and database files. A DLL overlay with a simple user interface shows the speed at which each filter can be applied, both in a single-core system and when using parallel-processing techniques.

    In this white paper, readers learn to use modern C++ techniques to process data in parallel, across cores. By studying the sample code, downloading the application, and learning the techniques, developers will better understand Intel® architecture and multi-core technology.


    Getting Started with Parallel Processing

    There are countless articles and books written on parallel processing techniques. Ian Foster has a good recap, and multiple papers have been presented at SIGGRAPH, including one by John Owens. A good reference is the 2015 book Programming Models for Parallel Computing, edited by Pavan Balaji. It covers a wide range of parallel programming models, starting with a description of the Message Passing Interface (MPI), the most common parallel programming model for distributed memory computing.

    With applications across the computing spectrum, from database processing to image rendering, parallel processing is a key concept for developers to understand. Readers are assumed to have some experience and background in computer science to benefit from the concepts described here. The source code was written for C++, but the concepts extend to other languages, and will be of interest to anyone looking to better understand how to optimize their code for multi-core processors.

    An in-depth discussion of the Intel architecture is beyond the scope of this article. Software developers should register at the Intel® Developer Zone and check out the documentation download page for Intel architecture to read the relevant software developer manuals.


    Getting Started

    The system requirements to explore the Image Filters project solution are minimal. Any multi-core system with Windows® 10 is sufficient.

    This project assumes that you have a C++ toolkit, such as Microsoft Visual Studio* with the .NET framework. Freebyte* has a full set of links here if you want to explore different C++ tools. To simply look through code, you may want to use a free code viewer such as NotePad++* or a similar product.

    To begin exploring the Image Filters project, follow these steps:

    1. Create an empty directory on your computer with a unique title such as "Image Filters Project" or "Intel C++ DLL Project".
      Use whatever naming strategy you are comfortable with; you can include the year, for example.
    2. Download the .zip file to your new directory.
      The file is not large—about 40 KB. After extracting the files and building the project, you will consume about 65 MB of disk space.
    3. Extract the files in the new directory.
      For example, if using 7-Zip*, right-click on the .zip file and select 7-Zip > Extract here. You can use any file compression software, such as 7-Zip, WinZip*, or WinRAR*.
    4. Open Microsoft Visual Studio or similar C++ tool. These instructions assume that you loaded the most current version of Microsoft Visual Studio.
    5. Open the project by using File > Open > Project/Solution and locating the ImageFilters.sln file.
      The ImageFilters.sln file should appear in the Solution Explorer on the left.
      The ImageFilters solution has two projects:
      a) ImageFiltersWPF — The client application that utilizes the ImageProcessing DLL and shows how to interact with it using C#.
      b) ImageProcessing — The C++ DLL that contains the multi-core image processing algorithms.

      build both projects inside the Solution Explorer
      Figure 1. You must build both projects inside the Solution Explorer.

    6. From the Solution Explorer, select the ImageFiltersWPF project; then hold down the CTRL key and select the ImageProcessing project.
    7. Right-click on one of the highlighted files to pull up an Actions menu and select Build Selection. This starts the compiler for both.
    8. Wait while the system quickly compiles the existing source files into the binary, .DLL, and .EXE files.

    The following files display in the project solution's bin directory:

    Compiled files in the bin
    Figure 2. Compiled files in the bin > debug folder, including the ImageFiltersWPF executable.


    Multithreading Technique

    By default, applications run on a single processing core of a system. Because nearly all new computing systems feature a CPU with multiple cores and threads, complex calculations can be distributed intelligently across cores, greatly reducing computation times.

    OpenMP* (Open Multi-Processing) is an API, first published in 1997 with version 1.0 for Fortran, that supports multiplatform shared-memory multiprocessing programming in C, C++, and Fortran on most platforms, instruction set architectures, and operating systems. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

    In the case of the C++ DLL, which is where the execution is actually happening, OpenMP involves using compiler directives to execute the filtering routines in parallel. For picture processing, each pixel of the input image has to be processed in order to apply the routine to that image. Parallelism offers an interesting way of optimizing the execution time by spreading out the work across multiple threads, which can work on different areas of the image.

    #pragma omp parallel

    In each filtering routine in the application, processing is implemented as a loop. The goal is to examine every pixel in the image, one by one. The "#pragma omp parallel for" compiler directive causes the loop iterations to be divided and distributed across the cores.

    #pragma omp parallel for if(openMP)			
    	for (int i = 0; i < height; ++i) {
    		auto offset = i * stride;
    		BGRA* p = reinterpret_cast<BGRA*>(inBGR + offset);
    		BGRA* q = reinterpret_cast<BGRA*>(outBGR + offset);
    		for (int j = 0; j < width; ++j) {
    			if (i == 0 || j == 0 || i == height - 1 || j == width - 1)
    				q[j] = p[j];	// if conv not possible (near the edges)
    			else {
    				BYTE R, G, B;
    				double red(0), blue(0), green(0);
    				// Apply the conv kernel to every applicable 
    				// pixel of the image
    				for (int jj = 0, dY = -radius; jj < size; jj++, dY++) {
    					for (int ii = 0, dX = -radius; ii < size; ii++, dX++) {
    						int index = j + dX + dY * width;
    						// Multiply each element in the local 
    						// neighborhood 
    						// of the center pixel by the corresponding
    						// element in the convolution kernel
    						// For the three colors
    						blue += p[index].B * matrix[ii][jj];
    						red += p[index].R * matrix[ii][jj];
    						green += p[index].G * matrix[ii][jj];
    					}
    				}
    				// Writing the results to the output image
    				B = blue;
    				R = red;
    				G = green;
    				q[j] = BGRA{ B,G,R,255 };
    			}
    		}
    	}
    

    Sample code for setting up parallel processing for BoxBlur.cpp

    If you follow the comments in the code, "BoxBlur.cpp" is setting up offsets, handling calculations when edge conditions make convolution impossible, and applying the convolution kernel to each element for red, blue, and green colors.

    #pragma omp parallel for if(openMP)
    		for (int i = 0; i < height; ++i) {
    			auto offset = i * stride;
    			BGRA* p = reinterpret_cast<BGRA*>(tmpBGR + offset);
    			BGRA* q = reinterpret_cast<BGRA*>(outBGR + offset);
    			for (int j = 0; j < width; ++j) {
    				if (i == 0 || j == 0 || i == height - 1 || j == width - 1)
    					q[j] = p[j];	// if conv not possible (near the edges)
    				else {
    					double _T[2];
    					_T[0] = 0; _T[1] = 0;
    					// Applying the two Sobel operators (dX dY) to 
    					// every available pixel
    					for (int jj = 0, dY = -radius; jj < size; jj++, dY++) {
    						for (int ii = 0, dX = -radius; ii < size; ii++, dX++) {
    							int index = j + dX + dY * width;
    							// Multiplicating each pixel in the 
    							// neighborhood by the two Sobel 
    							// Operators
    							// It calculates the vertical and 
    							// horizontal derivatives of the image 
    							// at a point.
    							_T[1] += p[index].G * M[1][ii][jj];
    							_T[0] += p[index].G * M[0][ii][jj];
    						}
    					}
    					// Then is calculated the magnitude of the 
    					// derivatives
    					BYTE a = sqrt((_T[0] * _T[0]) + (_T[1] * _T[1]));
    					// Condition for edge detection
    					q[j] = a > 0.20 * 255 ? BGRA{ a,a,a,255 } : BGRA{ 0,0,0,255 };
    				}
    			}
    		}
    
    		// Delete the allocated memory for the temporary grayscale image
    		delete[] tmpBGR;
    	}
    	return 0;
    }
    

    Parallelism structure for SobelEdgeDetector.cpp

    In the second example of "omp parallel for" taken from "SobelEdgeDetector.cpp", similar filtering operations take place, with the edge detector working with grayscale pictures.


    Memory Management

    In software development, developers must be careful about memory management to avoid serious impacts on application performance. In the case of the Harris corner detector and the Shi-Tomasi corner detector, memory management is crucial to creating three matrices and storing the results of Sx, Sy and Sxy.

    // Creating a temporary memory to keep the Grayscale picture
    	BYTE* tmpBGR = new BYTE[stride*height * 4];
    	if (tmpBGR) {
    		// Creating the 3 matrices to store the Sobel results, for each thread
    		int max_threads = omp_get_max_threads();
    		double *** Ix = new double**[max_threads];
    		double *** Iy = new double**[max_threads];
    		double *** Ixy = new double**[max_threads];
    		for (int i = 0; i < max_threads; i++) {
    			Ix[i] = new double*[size_kernel];
    			Iy[i] = new double*[size_kernel];
    			Ixy[i] = new double*[size_kernel];
    			for (int j = 0;j < size_kernel;j++) {
    				Ix[i][j] = new double[size_kernel];
    				Iy[i][j] = new double[size_kernel];
    				Ixy[i][j] = new double[size_kernel];
    			}
    		}
    

    Setting up temporary memory for the Shi-Tomasi corner detector filter

    Allocating such matrices for every pixel of the source would require considerable memory, and multithreading probably wouldn't be beneficial when applied. In fact, it could even result in slower calculations, due to the overhead of working with a large memory space.

    In order to avoid memory issues and set up a scenario where multithreading makes the calculations faster, these three matrices can be considered as a set of matrices, with each available thread containing its own set of matrices. The application can then be set up to allocate, outside of the parallel section, as many sets of matrices as there are available threads for this application. To get the maximum number of threads the following function is used: "omp_get_max_threads()" from the "omp.h" file. This file is found with the rest of the header files in the ImageProcessing > External Dependencies directory.

    As described at the Oracle* support site, this function should not be confused with the similarly named "omp_get_num_threads()". The "max" call returns the maximum number of threads that can be put to use in a parallel region. The "num" call returns the number of threads that are currently active and performing work. Obviously, those are different numbers. In a serial region "omp_get_num_threads" returns 1; in a parallel region it returns the number of threads that are being used.

    The call "omp_set_num_threads" sets the maximum number of threads that can be used (equivalent to setting the OMP_NUM_THREADS environment variable).

    In the processing loop, each thread accesses its proper set of matrices. These sets are stored in a dynamically allocated array of sets. To access the correct set, the index of the actual thread is used with the function "omp_get_thread_num()". Once the routine is executed, the three matrices are reset to their initial values, so that the next time the same thread has to execute the routine for another pixel, the matrices are already prepared for use.
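
    The following minimal sketch illustrates this per-thread scratch-matrix pattern; the function and variable names, and the single flattened matrix shown here, are illustrative and are not taken verbatim from the project source:

    #include <omp.h>
    #include <vector>

    void ProcessImage(int width, int height, int size_kernel)
    {
    	// Allocate one scratch matrix per thread, outside the parallel region.
    	const int max_threads = omp_get_max_threads();
    	std::vector<std::vector<double>> Ix(max_threads,
    		std::vector<double>(size_kernel * size_kernel, 0.0));

    #pragma omp parallel for
    	for (int i = 0; i < height; ++i) {
    		// Each thread looks up its own scratch set by thread index.
    		double* myIx = Ix[omp_get_thread_num()].data();
    		for (int j = 0; j < width; ++j) {
    			// ... fill myIx for the window around pixel (i, j) and use it ...
    			// Reset the scratch values so they are ready for the next pixel.
    			for (int k = 0; k < size_kernel * size_kernel; ++k)
    				myIx[k] = 0.0;
    		}
    	}
    }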


    Principle of Convolution

    Image filtering is a good showcase for multi-core processing because it involves the principle of convolution. Convolution is a mathematical operation that accumulates effects; think of it as starting with two functions, such as f and g, to produce a third function that represents the amount of overlap as one function is shifted over another.

    In image filtering, convolution is the process of adding each element of the image to its local neighbors, weighted by a kernel. Depending on the kernel used, convolution can produce blurring, sharpening, embossing, edge detection, and more. There are numerous resources available with a quick search, such as detailed discussions of kernels and image processing.

    In the case of image filtering, convolution works with a kernel, which is a matrix of numbers. This kernel is then applied to every pixel of the image, with the center element of the convolution kernel placed over the source pixel (Px). The source pixel is then replaced by the weighted sum of itself and nearby pixels. Multithreading helps filter the image faster by breaking the process into pieces.
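
    In the usual discrete form, for a source image P and a (2r+1) x (2r+1) kernel K, the filtered pixel value is the weighted sum

    P'(i, j) = \sum_{dy=-r}^{r} \sum_{dx=-r}^{r} K(dx, dy)\, P(i + dy, j + dx)

    which is exactly the nested loop over the kernel elements that appears in the filter code shown earlier.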

    Convolution using weighting in 3 x 3 kernel
    Figure 3. Convolution using weighting in a 3 x 3 kernel (source: GNU Image Manipulation Program).

    In this example, the convolution kernel is a 3 x 3 matrix represented by the green box. The source pixel is 50, shown in red at the center of the 3 x 3 matrix. All local neighbors, or nearby pixels, are the pixels directly within the green square. The larger the kernel, the larger the number of neighbors.

    In this case, the only nonzero weight is the second element of the first row, represented by a 1. All other elements in the 3 x 3 matrix are 0. The operation multiplies each neighboring pixel by zero, removing it, except for the single pixel represented by 42. So, the new source pixel value is 42 x 1 = 42. Thus, the pixel just above the source pixel is the one overlapped by the weight 1 of the convolution kernel.

    If you imagine each weighting as a fraction rather than zero, you can picture how images could be blurred by analyzing and processing each surrounding pixel.

    Filtering Techniques

    To see the result of a filtering technique, you'll have to build the project as described in the "Getting Started" section. Then double-click on the Image Filters > ImageFiltersWPF > Bin > Debug > ImageFiltersWPF.exe file.

    ImageFilters W P F executable main window
    Figure 4. ImageFiltersWPF executable main window. Use the small gray box at the top-right corner of the screen to locate directories with images you want to experiment with.

    The interface is very simple. You can select images on your system using the directory search feature in the gray box in the upper-right corner. Use the "Stretch" button to make sure an image completely fills the graphical user interface (GUI). Select an image, then apply a filter. Watch the "Time in seconds" calculations at the bottom-left of the interface to see how long a filter would take to apply in a multi-core system versus a system with a single core.

    There are eight filters in all; each alters the original image, but some filters create more dramatic changes.

    Box blur filter

    A box blur is a spatial domain linear filter in which each pixel in the resulting image has a value equal to the average value of its neighboring pixels in the input image. Due to its property of using equal weights, it can be implemented using a simple accumulation algorithm, and is generally a fast way to achieve a blur. The name refers to the box-shaped kernel of equal weights.

    The weights of the convolution kernel for the box blur are all the same. Assuming that we have a 3 x 3 matrix, this means that we have nine elements inserted into the matrix in total.

    w(x, y) = \frac{1}{N^{2}} \qquad \text{(for a 3 x 3 kernel, each weight is } 1/9\text{)}

    The weight for every element is calculated so that the sum of every element is 1.

    First image is vivid, the other is blurred
    Figure 5. The original image on the left is vivid, with good detail, while the image on the right has had the Box Blur effect applied.

    When using the app, it was calculated that a single core system would take 0.1375 seconds to apply the Box Blur while a multi-core system, in this case with an Intel® Core™ i7-4220 processor, took 0.004 seconds.

    Let's look in depth at what is going on in BoxBlur.cpp to understand the multithreading principles.

    include "stdafx.h"
    #include <fstream>
    #include "omp.h"
    
    using namespace std;
    
    extern "C" __declspec(dllexport) int __stdcall BoxBlur(BYTE* inBGR, BYTE* outBGR, int stride, int width, int height, KVP* arr, int nArr)
    {
    	// Pack the following structure on one-byte boundaries: 
    	// smallest possible alignment
    	// This allows us to use the minimal memory space for this type: 
    	// exact fit - no padding 
    #pragma pack(push, 1)
    	struct BGRA {
    		BYTE B, G, R, A;
    	};
    

    Beginning of the BoxBlur.cpp file

    First, the file is set up to include the "stdafx.h" precompiled header and <fstream>. The omp.h header file brings in the OpenMP directives and runtime functions.

    BoxBlur is then declared with extern "C" linkage and exported so it can be called from the C# client. The rest of the C++ file is devoted to the multi-core functionality. First, using "#pragma pack(push, 1)", the file defines how to efficiently handle a BGRA (blue green red alpha) color component packing structure on one-byte boundaries using the smallest possible alignment.

    Next, the file declares "#pragma pack(pop)" to restore the default packing mode, defines the Boolean flag for whether multiple cores have been detected, sets up the convolution kernel, and allocates memory.

    Finally, if there are multiple cores (OpenMP = true), the file uses "#pragma omp parallel for if (openMP)". The code determines offsets and casts, and handles situations at the edges where convolution is not possible. Results are written to the output image, and the allocated memory is cleared for the convolution kernel. There are similar sections of code in each of the filters.

    Gaussian blur filter

    Gaussian blur is the result of blurring an image with a Gaussian kernel to reduce image noise and detail. It is similar to Box Blur filtering: the center element of the Gaussian kernel is placed over each image pixel, the values in the original image are multiplied by the kernel elements that overlap them, and the resulting products are added up; that result is used for the value at the destination pixel.

    The weights of the elements of a Gaussian matrix N x N are calculated by the following:

    w(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{\left(x - \frac{N-1}{2}\right)^{2} + \left(y - \frac{N-1}{2}\right)^{2}}{2\sigma^{2}}\right)

    Here x and y are the coordinates of the element in the convolution kernel. The top-left corner element is at the coordinates (0, 0), and the bottom-right at the coordinates (N-1, N-1).

    For the same reason as the Box Blur, the sum of every element has to be 1. Thus, at the end, we need each element of the kernel to be divided by the total sum of the weights of the kernel.
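
    As an illustration of the two steps just described (computing the weights and then normalizing them so that they sum to 1), here is a minimal sketch; the kernel size N and sigma are illustrative parameters and are not values taken from the project:

    #include <cmath>
    #include <vector>

    // Build an N x N Gaussian kernel with standard deviation sigma,
    // normalized so that all of its weights sum to 1.
    std::vector<std::vector<double>> MakeGaussianKernel(int N, double sigma)
    {
    	std::vector<std::vector<double>> kernel(N, std::vector<double>(N, 0.0));
    	const double center = (N - 1) / 2.0;
    	double sum = 0.0;
    	for (int y = 0; y < N; ++y) {
    		for (int x = 0; x < N; ++x) {
    			const double dx = x - center;
    			const double dy = y - center;
    			kernel[y][x] = std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
    			sum += kernel[y][x];
    		}
    	}
    	// Divide each element by the total sum of the weights.
    	for (int y = 0; y < N; ++y)
    		for (int x = 0; x < N; ++x)
    			kernel[y][x] /= sum;
    	return kernel;
    }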

    Threshold filter

    The threshold routine is the only technique that does not use the convolution principle. Threshold filters examine each value of the input dataset and change all values that do not meet the boundary conditions. The result is that, if the luminance is smaller than the threshold, the pixel is turned to black; otherwise, it remains the same.

    Threshold filtering example
    Figure 6. Threshold filtering example.
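
    A minimal sketch of the thresholding rule described above, applied to a single BGRA pixel, is shown below; the luminance weighting used here is the common Rec. 601 formula, which is an assumption rather than the exact formula used in the project:

    struct BGRA { unsigned char B, G, R, A; };

    // Turn the pixel black if its luminance falls below the threshold;
    // otherwise leave the pixel unchanged.
    inline BGRA ApplyThreshold(BGRA p, double threshold)
    {
    	// Assumed Rec. 601 luminance weighting.
    	const double luminance = 0.299 * p.R + 0.587 * p.G + 0.114 * p.B;
    	return (luminance < threshold) ? BGRA{ 0, 0, 0, 255 } : p;
    }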

    Sobel edge detection

    The Sobel operator is an edge detector that relies on two convolutions using two different kernels. These two kernels calculate the horizontal and vertical derivatives of the image at a point; together they are used to detect the edges inside a picture. Applying these kernels provides a score indicating whether or not the pixel can be considered part of an edge. If this score is greater than a given threshold, the pixel is considered part of an edge.

    Edge detection relies on two different kernels
    Figure 7. Sobel edge detection relies on two different kernels.

    This means that for each pixel there are two results of convolution, Gx and Gy. Considering them as a scalar of a 2D vector, the magnitude G is calculated as follows:

    G = \sqrt{G_x^{2} + G_y^{2}}

    Laplacian edge detector

    This filter technique is quite similar to Box Blur and Gaussian Blur. It relies on a single convolution, using a kernel such as the one below to detect edges within a picture. The results can be applied to a threshold so that the visual results are smoother and more accurate.

    Example of a Laplacian edge-detection kernel

    Laplacian of Gaussian

    The Laplacian edge detector is particularly sensitive to noise so, to get better results, we can apply a Gaussian Blur to the whole image before applying the Laplacian filter. This technique is named "Laplacian of Gaussian".

    Harris corner detector

    Convolutions are also used for the Harris corner detector and the Shi-Tomasi corner detector, and the calculations are more complex than earlier filter techniques. The vertical and horizontal derivatives (detailed in the Sobel operator) are calculated for every local neighbor of the source pixel Px (including itself). The size of this area (window) has to be an odd number so that it has a center pixel, which is called the source pixel.

    Thus, Sx and Sy are calculated for every pixel in the window. Sxy is calculated as Sxy = Sx * Sy.

    These results are stored in three different matrices. These matrices respectively represent the values Sx, Sy, and Sxy for every pixel around the source pixel (and also itself).

    A Gaussian matrix (the one used for the Gaussian blur) is then applied to these three matrices, which results in three weighted values of Sx, Sy, and Sxy. We will name them Ix, Iy, and Ixy.

    These three values are stored in a 2 x 2 matrix A:

    A = \begin{pmatrix} I_x & I_{xy} \\ I_{xy} & I_y \end{pmatrix}

    Then a score k is calculated, representing whether that source pixel can be considered as a part of a corner or not. In the Harris detector, this score is computed from the matrix A as k = det(A) - c * trace(A)^2, where c is a small empirical constant.

    Then, if k is greater than a given threshold, the source pixel is turned into white, as part of a corner. If not, it is set to black.

    Shi-Tomasi corner detector

    This detector is based on the Harris corner detector; however, there is a change relating to the condition of corner detection. This detector has better performance than the previous one.

    Once the matrix A is calculated, in the same way as above, the eigenvalues λ1 and λ2 of the matrix A are calculated. The eigenvalues of a matrix are the solutions of the equation det(A - λI) = 0.

    \lambda_{1,2} = \frac{I_x + I_y}{2} \pm \sqrt{\left(\frac{I_x - I_y}{2}\right)^{2} + I_{xy}^{2}}

    As with the Harris corner detector, the source pixel will or will not be considered part of a corner depending on whether the resulting score exceeds the given threshold.

    Example of Shi-Tomasi corner detector filter
    Figure 8. Example of Shi-Tomasi corner detector filter, resulting in almost total conversion to black pixels. The filtering took 61 seconds using a single core, versus 14 seconds using multiple cores.


    Conclusion

    This C++ DLL application is a good example of how important it is to apply multithreading techniques to software development projects. In almost every scenario, applying the various filters on a four-core system took about three times as long using a single core as it did using multi-core techniques.

    Developers should not expect to get an N times speedup when running a program parallelized using OpenMP on an N processor platform. According to sources, there are several reasons why this is true:

    • When a dependency exists, a process must wait until the data it depends on is computed.
    • When multiple processes share a nonparallel proof resource (like a file to write in), their requests are executed sequentially. Therefore, each thread must wait until the other thread releases the resource.
    • A large part of the program may not be parallelized by OpenMP, which means that the theoretical upper limit of speedup is limited, according to Amdahl's law.
    • N processors in symmetric multiprocessing (SMP) may have N times the computation power, but the memory bandwidth usually does not scale up N times. Quite often, the original memory path is shared by multiple processors, and performance degradation may be observed when they compete for the shared memory bandwidth.
    • Many other common problems affecting the final speedup in parallel computing also apply to OpenMP, like load balancing and synchronization overhead.

    With current systems powered by processors such as the Intel® Core™ i9-7980XE Extreme Edition processor, which has 18 cores and 36 threads, the advantages of developing code that is optimized for multithreading are obvious. To learn more, download the app, analyze it with an integrated development environment such as Microsoft Visual Studio, and get started with your own project.


    Appendix A. About ImageFiltersWPF

    ImageFiltersWPF is a Windows* WPF client application that uses Extensible Application Markup Language (XAML) to display its GUI. The entry point into this app is MainWindow.xaml/MainWindow.xaml.cs. Along with the main window are several supporting classes to help keep functionality clean.

    ImageFileList.cs

    This class's primary function is to generate a list of image files that can be selected in order to apply filters.

    ImageFilterList.cs

    This is a very simple list that encapsulates a list of the filters that the C++ DLL provides. Once created, it is used by the GUI element lbImageFilters.

    BitmapWrapper.cs

    This class accepts a source bitmap image and turns it into a byte array that can be consumed by the C++ DLL.

    ImageProcessWrapper.cs

    This class loads up the DLL and supplies the function to call, which passes the byte arrays to the DLL.

    MainWindow.xaml.cs

    This is the MainWindow GUI code. It gets the current filter name and sends the BitmapWrapper instance to the DLL. It does this twice, once for single-core, then again for multi-core. After each of those runs, it updates the labels that contain the number of seconds it took to process the image. Once the processing is complete, the new image is displayed.


    Resources

    OpenMP

    Intel® Many Integrated Core (Intel® MIC) Architecture

    Code Sample: Multicore Photo Editing

    File(s): Download
    License: Intel Sample Source Code License Agreement
    Optimized for...
    OS: Windows® 10
    Hardware: N/A
    Software (Programming Language, tool, IDE, Framework): C++, C#, Microsoft Visual Studio*
    Prerequisites: Familiarity with Microsoft Visual Studio, C++ and C#, multi-core software development


    Introduction

    This software example demonstrates how to use multi-core technologies to edit images. There are two parts to this project: a .NET Windows application front end written using C# and Windows Presentation Foundation (WPF), and a C++ DLL that is responsible for the actual manipulation of the image.

    The image editing is done by applying filters to images, where each filter is a different function in the C++ DLL. The C# front end passes the image bitmap data to the C++ DLL, the DLL processes the image by applying the chosen filter, and then the DLL passes the newly created image back to the C# GUI. Further, this app allows the user to see the performance difference between running single-core versus multi-core.


    Get Started

    Download the code from GitHub* and read the article Using Modern C++ Techniques to Enhance Multicore Optimizations for a better understanding of how to perform multicore development.


    Update Log

    Created June 19, 2018


    Developer Success Stories Library


    Intel® Parallel Studio XE | Intel® System Studio | Intel® Media Server Studio

    Intel® Advisor | OpenVINO™ Toolkit | Intel® Data Analytics Acceleration Library

    Intel® Distribution for Python* | Intel® Inspector XE | Intel® Integrated Performance Primitives

    Intel® Math Kernel Library | Intel® Media SDK | Intel® MPI Library | Intel® Threading Building Blocks

    Intel® VTune™ Amplifier

     


    Intel® Parallel Studio XE


    Altair Creates a New Standard in Virtual Crash Testing

    Altair advances frontal crash simulation with help from Intel® Software Development products.


    CADEX Resolves the Challenges of CAD Format Conversion

    Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


    Envivio Helps Ensure the Best Video Quality and Performance

    Intel® Parallel Studio XE helps Envivio create safe and secured code.


    ESI Group Designs Quiet Products Faster

    ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


    F5 Networks Profiles for Success

    F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


    Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

    As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


    Golaem Drives Virtual Population Growth

    Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


    Lab7 Systems Helps Manage an Ocean of Information

    Lab7 Systems optimizes BioBuilds™ tools for superior performance using Intel® Parallel Studio XE and Intel® C++ Compiler.


    Mentor Graphics Speeds Design Cycles

    Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.


    Massachusetts General Hospital Achieves 20X Faster Colonoscopy Screening

    Intel® Parallel Studio helps optimize key image processing libraries, reducing compute-intensive colon screening processing time from 60 minutes to 3 minutes.


    Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

    Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


    NERSC Optimizes Application Performance with Roofline Analysis

    NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


    Nik Software Increases Rendering Speed of HDR by 1.3x

    By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


    Novosibirsk State University Gets More Efficient Numerical Simulation

    Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


    Pexip Speeds Enterprise-Grade Videoconferencing

    Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


    Schlumberger Parallelizes Oil and Gas Software

    Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


    Ural Federal University Boosts High-Performance Computing Education and Research

    Intel® Developer Tools and online courseware enrich the high-performance computing curriculum at Ural Federal University.


    Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

    Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


    Intel® System Studio


    CID Wireless Shanghai Boosts Long-Term Evolution (LTE) Application Performance

    CID Wireless boosts performance for its LTE reference design code by 6x compared to the plain C code implementation.


    GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

    GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


    NERSC Optimizes Application Performance with Roofline Analysis

    NERSC boosts the performance of its scientific applications on Intel® Xeon Phi™ processors up to 35% using Intel® Advisor.


    Daresbury Laboratory Speeds Computational Chemistry Software 

    Scientists get a speedup to their computational chemistry algorithm from Intel® Advisor’s vectorization advisor.


    Novosibirsk State University Gets More Efficient Numerical Simulation

    Novosibirsk State University boosts a simulation tool’s performance by 3X with Intel® Parallel Studio, Intel® Advisor, and Intel® Trace Analyzer and Collector.


    Pexip Speeds Enterprise-Grade Videoconferencing

    Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


    Schlumberger Parallelizes Oil and Gas Software

    Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


    OpenVINO™ Toolkit


    GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

    GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


    Intel® Data Analytics Acceleration Library


    MeritData Speeds Up a Big Data Platform

    MeritData Inc. improves performance—and the potential for big data algorithms and visualization.


    Intel® Distribution for Python*


    DATADVANCE Gets Optimal Design with 5x Performance Boost

    DATADVANCE discovers that Intel® Distribution for Python* outpaces standard Python.
     


    Intel® Inspector XE


    CADEX Resolves the Challenges of CAD Format Conversion

    Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


    Envivio Helps Ensure the Best Video Quality and Performance

    Intel® Parallel Studio XE helps Envivio create safe and secured code.


    ESI Group Designs Quiet Products Faster

    ESI Group achieves up to 450 percent faster performance on quad-core processors with help from Intel® Parallel Studio.


    Fixstars Uses Intel® Parallel Studio XE for High-speed Renderer

    As a developer of services that use multi-core processors, Fixstars has selected Intel® Parallel Studio XE as the development platform for its lucille* high-speed renderer.


    Golaem Drives Virtual Population Growth

    Crowd simulation is one of the most challenging tasks in computer animation―made easier with Intel® Parallel Studio XE.


    Schlumberger Parallelizes Oil and Gas Software

    Schlumberger increases performance for its PIPESIM* software by up to 10 times while streamlining the development process.


    Intel® Integrated Performance Primitives


    JD.com Optimizes Image Processing

    JD.com Speeds Image Processing 17x, handling 300,000 images in 162 seconds instead of 2,800 seconds, with Intel® C++ Compiler and Intel® Integrated Performance Primitives.


    Tencent Optimizes an Illegal Image Filtering System

    Tencent doubles the speed of its illegal image filtering system using SIMD Instruction Set and Intel® Integrated Performance Primitives.


    Tencent Speeds MD5 Image Identification by 2x

    Intel worked with Tencent engineers to optimize the way the company processes millions of images each day, using Intel® Integrated Performance Primitives to achieve a 2x performance improvement.


    Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

    Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


    Intel® Math Kernel Library


    DreamWorks Puts the Special in Special Effects

    DreamWorks Animation’s Puss in Boots uses Intel® Math Kernel Library to help create dazzling special effects.


    GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

    GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.

     


    MeritData Speeds Up a Big Data Platform

    MeritData Inc. improves performance―and the potential for big data algorithms and visualization.


    Qihoo360 Technology Co. Ltd. Optimizes Speech Recognition

    Qihoo360 optimizes the speech recognition module of the Euler platform using Intel® Math Kernel Library (Intel® MKL), speeding up performance by 5x.


    Intel® Media SDK


    NetUP Gets Blazing Fast Media Transcoding

    NetUP uses Intel® Media SDK to help bring the Rio Olympic Games to a worldwide audience of millions.


    Intel® Media Server Studio


    ActiveVideo Enhances Efficiency

    ActiveVideo boosts the scalability and efficiency of its cloud-based virtual set-top box solutions for TV guides, online video, and interactive TV advertising using Intel® Media Server Studio.


    Kraftway: Video Analytics at the Edge of the Network

Today’s sensing, processing, storage, and connectivity technologies enable the next step in distributed video analytics, where each camera itself is a server. With Kraftway* video software, platforms can encode up to three 1080p60 streams at different bit rates with close to zero CPU load.


    Slomo.tv Delivers Game-Changing Video

    Slomo.tv's new video replay solutions, built with the latest Intel® technologies, can help resolve challenging game calls.


    SoftLab-NSK Builds a Universal, Ultra HD Broadcast Solution

    SoftLab-NSK combines the functionality of a 4K HEVC video encoder and a playout server in one box using technologies from Intel.


    Vantrix Delivers on Media Transcoding Performance

    HP Moonshot* with HP ProLiant* m710p server cartridges and Vantrix Media Platform software, with help from Intel® Media Server Studio, deliver a cost-effective solution that delivers more streams per rack unit while consuming less power and space.


    Intel® MPI Library


    Moscow Institute of Physics and Technology Rockets the Development of Hypersonic Vehicles

    Moscow Institute of Physics and Technology creates faster and more accurate computational fluid dynamics software with help from Intel® Math Kernel Library and Intel® C++ Compiler.


    Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

    Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


    Intel® Threading Building Blocks


    CADEX Resolves the Challenges of CAD Format Conversion

    Parallelism Brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


    Johns Hopkins University Prepares for a Many-Core Future

    Johns Hopkins University increases the performance of its open-source Bowtie 2* application by adding multi-core parallelism.


    Mentor Graphics Speeds Design Cycles

    Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.

     


    Pexip Speeds Enterprise-Grade Videoconferencing

    Intel® analysis tools enable a 2.5x improvement in video encoding performance for videoconferencing technology company Pexip.


    Quasardb Streamlines Development for a Real-Time Analytics Database

    To deliver first-class performance for its distributed, transactional database, Quasardb uses Intel® Threading Building Blocks (Intel® TBB), Intel’s C++ threading library for creating high-performance, scalable parallel applications.


    University of Bristol Accelerates Rational Drug Design

    Using Intel® Threading Building Blocks, the University of Bristol helps slash calculation time for drug development—enabling a calculation that once took 25 days to complete to run in just one day.


    Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

    Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


    Intel® VTune™ Amplifier


    CADEX Resolves the Challenges of CAD Format Conversion

    Parallelism brings CAD Exchanger* software dramatic gains in performance and user satisfaction, plus a competitive advantage.


    F5 Networks Profiles for Success

    F5 Networks amps up its BIG-IP DNS* solution for developers with help from Intel® Parallel Studio and Intel® VTune™ Amplifier.


    GeoVision Gets a 24x Deep Learning Algorithm Performance Boost

    GeoVision turbo-charges its deep learning facial recognition solution using Intel® System Studio and the OpenVINO™ toolkit.


    Mentor Graphics Speeds Design Cycles

    Thermal simulations with Intel® Software Development Tools deliver a performance boost for faster time to market.

     


    Nik Software Increases Rendering Speed of HDR by 1.3x

    By optimizing its software for Advanced Vector Extensions (AVX), Nik Software used Intel® Parallel Studio XE to identify hotspots 10x faster and enabled end users to render high dynamic range (HDR) imagery 1.3x faster.


    Walker Molecular Dynamics Laboratory Optimizes for Advanced HPC Computer Architectures

    Intel® Software Development tools increase application performance and productivity for a San Diego-based supercomputer center.


     

    Intel® Graphics Performance Analyzers (Intel® GPA) 2018 R2 Release Notes


    Thank you for choosing the Intel® Graphics Performance Analyzers (Intel® GPA), available as a standalone product and as part of Intel® System Studio.

    Contents

    Introduction
    What's New
    System Requirements and Supported Platforms
    Installation Notes
    Technical Support and Troubleshooting
    Known Issues and Limitations
    Legal Information

    Introduction

    Intel® GPA provides tools for graphics analysis and optimizations for making games and other graphics-intensive applications run even faster. The tools support the platforms based on the latest generations of Intel® Core™ and Intel Atom™ processor families, for applications developed for  Windows*, Android*, Ubuntu*, or macOS*.

    Intel® GPA provides a common and integrated user interface for collecting performance data. Using it, you can quickly see performance opportunities in your application, saving time and getting products to market faster.

    For detailed information and assistance in using the product, refer to the following online resources:

    • Home Page - view detailed information about the tool, including links to training and support resources, as well as videos on the product to help you get started quickly.
    • Getting Started - get the main features overview and learn how to start using the tools on different host systems.
    • Training and Documentation - learn at your level with Getting Started guides, videos and tutorials.
    • Online Help for Windows* Host - get details on how to analyze Windows* and Android* applications from a Windows* system.
    • Online Help for macOS* Host - get details on how to analyze Android* or macOS* applications from a macOS* system.
    • Online Help for Ubuntu* Host - get details on how to analyze Android* or Ubuntu* applications from an Ubuntu* system.
    • Support Forum - report issues and get help with using Intel® GPA.

    What's New

    Intel® GPA 2018 R2 offers the following new features:

    New Features for Analyzing All Graphics APIs

    System Analyzer

    • View all available Intel® GPU metrics in the System View on Windows* platforms, with the ability to switch between counter sets using the Ctrl+M hotkey

    Graphics Frame Analyzer

    • Search for and pin interesting metrics to the top of the metrics table
    • Copy resource names in the Resource Viewer using Ctrl+C

    All Tools

    • Modified the Dark Mode color scheme for improved usability

    New Features for analyzing OpenGL* applications

    New Platforms

    • Support for macOS High Sierra (10.13.4) has been added for this release including support for:
      • Real-time metrics in System Analyzer
      • Per-region/per-event metrics in Graphics Frame Analyzer

    Graphics Monitor

    • OpenGL applications downloaded from Apple AppStore can be launched through Graphics Monitor or Graphics Frame Analyzer launch dialog without Sandbox removal
    • User-configurable frame delimiters have been added. The delimiters (SwapBuffer, MakeCurrent context, Clear, Flush, Finish, or BlitFramebuffer) can be used individually or in combination

    System Analyzer HUD

    • Updated Heads-up display for OpenGL applications on Windows, Ubuntu, and macOS platforms

    New Features for analyzing Microsoft DirectX* applications

    New Platforms 

    • Metrics for AMD* Radeon RX Vega M (in the new Intel® NUC KIT NUC8I7HVK) are available in System Analyzer and Graphics Frame Analyzer for DirectX 11 and DirectX 12 applications

    Graphics Monitor

    • Graphics applications launched in “Auto-detect launched applications” mode are automatically added to the recent applications list

    Graphics Frame Analyzer

    • Any DirectX* 11 shader resource view (SRV) can now be replaced with a simple 2x2 texture or clamped to a selected MIP map level independently from other input textures
    • Shader DXBC and ISA code is updated whenever a shader is modified
    • Support for DirectX 12 Unreal Engine 4.19 applications running on multi-GPU systems has been added
    • Multi-sampled render targets (including depth and stencil textures) are now viewable in DirectX 12 frames
    • Pixel History for DirectX 11 supports rendering to layers and mip levels, and respects applied experiments
    • View the per-target, post-transformation geometry for a range of selected draw calls in DirectX 11 frames

    New Features for analyzing Android* Open GLES applications

    System Analyzer

    • An ability to view and profile any Android process has been added to the System Analyzer settings

    New Features for analyzing macOS* Metal applications

    Graphics Frame Analyzer for Metal

    • Additional title support
    • Modified the Stream file format to improve performance and stability
    • Stream files play back instantly within Graphics Frame Analyzer

    System Requirements and Supported Platforms

    The minimum system requirements are: 

    • Host Processor: Intel® Core™ Processor
    • Target Processor: See the list of supported Windows* and Android* devices below
    • System Memory: 8GB RAM
    • Video Memory: 512MB RAM
    • Minimum display resolution for client system: 1280x1024
    • Disk Space: 300MB for minimal product installation

    The table below shows the platforms and applications supported by Intel® GPA 2018 R2. Each entry lists the target system (the system where your game runs), the host system (your development system where you run the analysis), and the target applications (types of supported applications running on the target system).

    • Target: Windows* 7 SP1/8.1/10; Host: Windows* 7 SP1/8.1/10; Applications: Microsoft* DirectX* 9/9Ex, 10.0/10.1, 11.0/11.1/11.2/11.3
    • Target: Windows* 10; Host: Windows* 10; Applications: Microsoft* DirectX* 12, 12.1
    • Target: Google* Android* 4.1, 4.2, 4.3, 4.4, 5.x, 6.0; Host: Windows* 7 SP1/8.1/10, macOS* 10.11, 10.12, or Ubuntu* 16.04; Applications: OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, 3.2
    • Target: Ubuntu* 16.04; Host: Ubuntu* 16.04; Applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile
    • Target: macOS* 10.12 and 10.13; Host: macOS* 10.12 and 10.13; Applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile and Metal* 1 and 2

    Intel® GPA does not support the following Windows* configurations: All server editions, Windows* 8 RT, or Windows* 7 starter kit.

    Supported Windows* Graphics Devices

    Intel® GPA supports the following graphics devices as targets for analyzing Windows* workloads. All these targets have enhanced metric support:

    • Intel® UHD Graphics 630: 8th generation Intel® Core™ processor
    • Intel® UHD Graphics 630: 7th generation Intel® Core™ processor
    • Intel® UHD Graphics 620: 7th generation Intel® Core™ processor
    • Intel® HD Graphics 620: 7th generation Intel® Core™ processor
    • Intel® HD Graphics 615: 7th generation Intel® Core™ m processor
    • Intel® HD Graphics 530: 6th generation Intel® Core™ processor
    • Intel® HD Graphics 515: 6th generation Intel® Core™ m processor
    • Iris® graphics 6100: 5th generation Intel® Core™ processor
    • Intel® HD Graphics 5500 and 6000: 5th generation Intel® Core™ processor
    • Intel® HD Graphics 5300: 5th generation Intel® Core™ m processor family
    • Iris® Pro graphics 5200: 4th generation Intel® Core™ processor
    • Iris® graphics 5100: 4th generation Intel® Core™ processor
    • Intel® HD Graphics 4200, 4400, 4600, and 5000: 4th generation Intel® Core™ processor

    Although the tools may appear to work with other graphics devices, these devices are unsupported. Some features and metrics may not be available on unsupported platforms. If you run into an issue when using the tools with any supported configuration, please report this issue through the Support Forum.

    Driver Requirements for Intel® HD Graphics

    When running Intel® GPA on platforms with supported Intel® HD Graphics, the tools require the latest graphics drivers for proper operation. You may download and install the latest graphics drivers from http://downloadcenter.intel.com/.

    Intel® GPA inspects your current driver version and notifies you if your driver is out-of-date.

    Supported Android* Devices

    Intel® GPA supports both Intel® and ARM*-based Android devices, with known limitations. For further information, see this article.

    Installation Notes

    Installing Intel® GPA 

    Download the Intel® GPA installer from the Intel® GPA Home Page.

    Installing Intel® GPA on Windows* Target and Host Systems

    To install the tools on Windows*, download the *.msi package from the Intel® GPA Home Page and run the installer file.

    The following prerequisites should be installed before you run the installer:

    • Microsoft .NET 4.0 (via redirection to an external web site for download and installation)

    If you use the product in a host/target configuration, install Intel® GPA on both systems. For more information on the host/target configuration, refer to Best Practices.

    For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

    Installing Intel® GPA on Ubuntu* Host System

    To install Intel® GPA on Ubuntu*, download the .sh file from the Intel® GPA Home Page and run the installer script.

    It is not necessary to explicitly install Intel® GPA on the Android* target device since the tools automatically install the necessary files on the target device when you run System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

    Installing Intel® GPA on macOS* Host System

    To install the tools on macOS*, download from the Intel® GPA Home Page and run the .pkg installer.

    It is not necessary to explicitly install Intel® GPA on the Android* target device because the tools automatically install the necessary files on the target device when you run the System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

    Technical Support and Troubleshooting

    For technical support, including answers to questions not addressed in the installed product, visit the Support Forum.

    Troubleshooting Android* Connection Problems

    If the target device does not appear when the adb devices command is executed on the client system, do the following:

    1. Disconnect the device
    2. Execute $ adb kill-server
    3. Reconnect the device
    4. Run $ adb devices

    If these steps do not work, try restarting the system and running $ adb devices again. Consult the product documentation for your device to see if a custom USB driver needs to be installed.

    Known Issues and Limitations

    • Full Intel GPA metrics are not supported on macOS* 10.13.4 for Skylake-based and Kaby Lake-based Mac Pro systems.  For full metric support, please do not upgrade to macOS* 10.13.4.
    • Metrics in the System Analyzer's system view are inaccurate for Intel® Graphics Driver for Windows* Version 15.65.4.4944. You can use Intel® Graphics Driver for Windows* Version 15.60.2.4901 instead.
    • Playback of the Metal stream files captured with earlier Intel® GPA versions is not supported. Old Metal stream files can be converted to the new stream format using the following steps:
      1. Open Terminal and change the directory to /Applications/Intel/FrameAnalyzer.app/Contents/Resources/metal.
      2. Capture a new stream of the old player running the .gpa_stream file that you want to convert by the following command:
        ./gen2/gpa-capture ./gpa-playback --layer capture -- <path-to-old-.gpa_stream-file>
      3. The newly converted stream is automatically added to ~/Documents/GPA/ and is displayed in the Graphics Frame Analyzer open file dialog.
    • Intel® Graphics Performance Analyzers Sample (gpasample.exe) cannot be launched with Global Injection Mode enabled on Windows* 7 platforms
    • macOS users who are running OS X El Capitan or newer must disable System Integrity Protection (SIP) in order to profile Steam applications. If SIP is enabled on your machine, a message will appear at the top of Graphics Monitor directing you to disable it. If you would prefer not to disable SIP but need to launch into a Steam application, use the following process: 
      1. Launch and sign into Steam 
      2. Locate the executable of the desired application and copy the location; it typically looks something like this: 
        /Users/YOUR_USER_NAME/Library/Application\ Support/Steam/steamapps/common/YOUR_APPLICATION_BINARY 
      3. Launch Graphics Monitor
      4. Paste the location of the desired application in the first input box and hit Start
      5. GPA will now be injected into the executable, allowing for live profiling and Trace/Frame Capture

    *Other names and brands may be claimed as the property of others.

    ** Disclaimer: Intel disclaims all liability regarding rooting of devices. Users should consult the applicable laws and regulations and proceed with caution. Rooting may or may not void any warranty applicable to your devices.

    Explore the Possibilities of Generative Modeling



    Investigation into the capabilities of GANs, conducted by Intel Student Ambassador Prajjwal Bhargava, provides insights into using Intel® architecture-based frameworks to understand and create practical applications using this technology.

    "GANs' (generative adversarial networks) potential is huge, because they can learn to mimic any distribution of data. That is, GANs can be taught to create worlds eerily similar to our own in any domain: images, music, speech, prose. They are robot artists in a sense, and their output is impressive—poignant even."1

      Excerpt from "GAN: A Beginner's Guide to Generative Adversarial Networks"


    Challenge

    Past efforts at building unsupervised learning capabilities into deep neural networks have been largely unsuccessful. A new modeling approach that uses opposing neural networks, one functioning as a generator and the other as a discriminator, has opened innovative avenues for research and practical applications.


    Solution

    The possibilities of using GANs to accelerate deep learning in an unsupervised training environment are progressively being revealed through ongoing exploration and experimentation. Prajjwal's work in this area promises to uncover paths likely to yield positive results as applications move from speculative to real-world implementations.


    Background and Project History

    An increasingly important area of generative modeling, known as generative adversarial networks (GANs), offers a means to endow computers with a better understanding of the surrounding world through unsupervised learning techniques. This field of inquiry has been the focus of Prajjwal Bhargava in his work for the Intel® AI Academy.

    Prior to becoming a Student Ambassador for the Intel AI Academy, Prajjwal sharpened his expertise in convolutional neural networks for image recognition, data structures and algorithms, deep-learning coding techniques, and machine learning. These topics have been useful in his research on GANs. "Initially, I started off with computer vision," Prajjwal said. "Back then I was learning how convolutional neural networks worked and how they do what they do. That required going deeper into the architectures." After getting into them, he started working with Recurrent Neural Networks (RNNs) and complex architectures like Long Short-Term Memory (LSTM).

    "I later learned more about GANs," he continued, "and it was quite fascinating to me. I knew there were some significant challenges. For example, training a GAN— with the generator and discriminator getting updated independently—can have a serious impact on reaching convergence."

    Prajjwal observed that the original GAN paper didn't fully address this issue. It became clear that a different mechanism was needed for effectively resolving this problem. He looked into the issue further and found the paper describing this approach, "Wasserstein GAN", to be very influential and revolutionary.

    "The theory was explained brilliantly and it supported their experiment well," Prajjwal said. From this perspective, he started working on implementations using a variety of architectures to see which approaches could yield the greatest results.

    "Since Ian Goodfellow presented his landmark paper at the NIPS [Neural Information Processing Systems] conference in 2014, I've always felt that this architecture [GANs] is quite revolutionary by itself. I feel that these networks have changed the way we look at deep learning compared to a few years back. It has enabled us to visualize data in ways that couldn't have been accomplished through other techniques."

      Prajjwal Bhargava, Student Ambassador for Artificial Intelligence, Intel AI Academy

    Prajjwal has been working on GANs for over a year, and he doesn't see an end to his research. Each new research paper that is published offers fresh ideas and different perspectives. His own paper, "Better Generative Modeling through Wasserstein GANs," provides the insights he has gained over the course of his work with Intel AI Academy.

    "I want to try all possible variants of GANs," Prajjwal said. "There are so many, each one performing a new task in the best possible manner. However, I think the future calls for something universal and I think this applies to GANs as well. The more we are able to let our network generalize, the better it is. There's so much more to do and hopefully I will continue to contribute towards this research."

    "Training and sampling from generative models is an excellent test of our ability to represent and manipulate high-dimensional probability distributions. High-dimensional probability distributions are important objects in a wide variety of applied math and engineering domains."2

      Ian Goodfellow, Staff Research Scientist, Google Brain


    Key Findings of the Experimentation

    As Prajjwal continues to research GAN variants, the work that he has accomplished so far has led him to a key conclusion. In summary, he noted, "GANs are essentially models that try to learn the distribution of real data by minimizing divergence (difference in probability distribution) through generation of adversarial data. In the original [Goodfellow] paper, convergence in the minimax objective is interpreted as minimizing Jensen-Shannon divergence. Wasserstein is a better alternative to Jensen-Shannon divergence. It gives a smooth representation in between."

    "If we have two probability distributions—P and Q—there is no overlap when they are not equal, but when they are equal, the two distributions just overlap," Prajjwal continued. "If we calculate D(KL), we get infinity if the two distributions are disjoint. So, the value of D(JS) jumps, and the curve isn't differentiable at θ = 0."

    "The Wasserstein metric provides a smooth measure. This helps ensure a stable learning process using gradient descents," he added.

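    To make the contrast concrete, here is a small illustrative sketch (not from Prajjwal's project; the setup and the SciPy helpers used are assumptions for demonstration) that compares KL divergence, JS divergence, and the Wasserstein distance for two point masses separated by a distance θ:

    import numpy as np
    from scipy.stats import entropy, wasserstein_distance
    from scipy.spatial.distance import jensenshannon

    p = np.array([1.0, 0.0])   # P puts all of its mass at position 0
    q = np.array([0.0, 1.0])   # Q puts all of its mass at position theta

    for theta in [1.0, 0.5, 0.1, 0.01]:
        positions = np.array([0.0, theta])
        kl = entropy(p, q)                                    # KL(P || Q): infinite whenever the supports are disjoint
        js = jensenshannon(p, q, base=2) ** 2                 # JS divergence: stuck at 1 bit (log 2), so no useful gradient
        w = wasserstein_distance(positions, positions, p, q)  # Wasserstein: shrinks smoothly as theta approaches 0
        print(f"theta={theta:5.2f}  KL={kl}  JS={js:.3f}  Wasserstein={w:.3f}")

    As θ shrinks, only the Wasserstein distance decreases smoothly toward zero, which is the property that keeps gradients informative during training.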
    Figure 1. Double feedback loop used for a generative adversarial network (GAN).

    The research being done on GANs suggests a wide variety of use cases across multiple industries, Prajjwal believes. Some of the promising possibilities include the following:

    • Accelerating drug discovery and finding cures for previously incurable diseases. The Generator could propose a drug for treatment and the Discriminator could determine whether the drug would be likely to produce a positive outcome.
    • Advancing molecule development in oncology, generating new anti-cancer molecules within a defined set of parameters.
    • Performing text translation to describe the content of images accurately.
    • Generating super-resolved images from downsampled original images to improve the perceptual qualities.
    • Boosting creativity in fields where variety and innovation are important, such as fashion or design.

    "Unsupervised learning is the next frontier in artificial intelligence," Prajjwal said, "and we are moving rapidly in that direction, even though we still have a long way to go."


    Enabling Technologies

    The primary enabling technologies that were used for research during this project include:

    • PyTorch*, which includes the use of the Intel® Math Kernel Library (Intel® MKL), is a library based on Python* that was used to build the architecture for GAN research.
    • Intel® AI DevCloud powered by Intel® Xeon Phi™ processors (current versions of the Intel AI DevCloud use Intel® Xeon® Scalable processors).

    "Intel MKL was really useful for optimizing matrix calculations and vector operations on my platform," Prajjwal commented, "and I have gone through technical articles on the Intel® Developer Zone (Intel® DZ) to better understand how to improve optimization on the architecture that I was using. A number of tutorials targeting Intel architecture were also quite useful."

    One of the key challenges that Prajjwal encountered was training GANs efficiently on Intel architecture-based systems. The difficulties included managing updates for the Generator and Discriminator concurrently, rather than independently. As it stands, reaching convergence can be a challenge. Part of the solution will require optimizing the training models so that the workflow proceeds more efficiently, taking better advantage of Intel architecture capabilities and built-in features.

    "It's been a year since I started working with Intel in the Intel AI Academy," Prajjwal noted. "And over this time, I've learned a lot. I've received much help and gained expertise working with Intel architecture-based hardware. It's great to see so many other Student Ambassadors working across the world in the same field. I've gotten to know so many people through conferences and online communities. Intel goes a long way to share the projects that we've done so that Student Ambassadors get recognition. Also, Intel provides a really good platform to publish our research and findings. I am really grateful that I got to become part of this AI Academy program and hope to do some more great work in the future."


    AI is Expanding the Boundaries of Generative Modeling

    Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to advance solutions that address major challenges in the sciences.

    In terms of real-world applications of GAN techniques, the collaborative work accomplished by the NASA Frontier Development Lab (FDL) offers a striking example. FDL brings together companies, Intel being one, to share resources and expertise in a cooperative effort to solve space exploration challenges.

    During the Planetary Defense segment of the 2016 session, a GAN was developed to help detect potentially hazardous asteroids and determine the shape and the spin axis of the asteroid.

    One of the participants on this project, Adam Cobb, described the challenge of handling the input data: "Our predominant form of input data consisted of a series of delay-Doppler images. These are radar images that are defined in both time delay and frequency. Although to the untrained eye...these images might look like they are optical images, they actually have a non-unique mapping to the true asteroid shape. This many-to-one relationship added an extra level of complexity to the already difficult challenge of going from 2D to 3D representations. In order to go about solving this task we applied deep-learning architectures such as autoencoders, variational autoencoders, and generative adversarial networks to generate asteroid shapes and achieved promising results."

    Beyond the challenge of asteroid shape modeling, another challenge in the Planetary Defense area, Asteroid "Deflector Selector" Decision Support, used machine learning to determine the most effective deflection strategies to prevent an asteroid from colliding with Earth (see Figure 2 for an artist's rendering of this scenario).

    Figure 2. NASA rendering of an asteroid in proximity to Earth.

    The NASA FDL is hosted by The SETI Institute in Mountain View, California with support from the NASA Ames Research Center. Intel provided hardware and technology, software and training, as well as expertise to the endeavor. Other corporate participants included NVIDIA Corporation, IBM* and Lockheed Martin, ESA, SpaceResources Luxembourg, USC MASCLE, Kx Systems*, and Miso Technologies.

    In these early stages of AI, at a time when commercial GAN implementations haven’t been widely released to the field, some of the best examples of the potential of this technique come from research papers and student implementations exploring the mechanisms to discover how GANs can be applied to real-world scenarios.

    One of the more interesting examples along this line is image-to-image translation with CycleGANs. A collection of resources on this topic, including code, interactive demos, videos, and a research paper, have been compiled by members of the University of California, Berkeley research team and can be found here: https://phillipi.github.io/pix2pix/.

    In image-to-image translation, the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. Practically speaking, paired training data is not usually available, so the network must learn the mapping from domain X to domain Y without it. The objective in this approach is to learn a mapping G: X → Y, such that the distribution of images from G(X) is indistinguishable from the distribution Y, using an adversarial loss.

    Paired, pre-formatted images that maintain a strong correlation between both domains are normally required, but collecting that data can be time consuming and impractical. CycleGANs build on the pix2pix architecture while supporting modeling of unpaired collections of images; in the process, the network can learn to translate images between two styles without requiring tightly matched X/Y training pairs.
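    To illustrate the unpaired-training idea, the sketch below shows a generic cycle-consistency term of the kind CycleGAN combines with its adversarial losses; the generator names G and F_inv and the weight lambda_cyc are placeholders, and this is a simplified illustration rather than the Berkeley team's implementation.

    # Simplified sketch of a cycle-consistency loss (placeholder names).
    import torch.nn.functional as F

    def cycle_consistency_loss(real_x, real_y, G, F_inv, lambda_cyc=10.0):
        """G maps domain X to Y; F_inv maps domain Y back to X."""
        reconstructed_x = F_inv(G(real_x))   # X -> Y -> X should reproduce the original X image
        reconstructed_y = G(F_inv(real_y))   # Y -> X -> Y should reproduce the original Y image
        return lambda_cyc * (F.l1_loss(reconstructed_x, real_x) +
                             F.l1_loss(reconstructed_y, real_y))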

    Figure 3 shows some specific image-to-image translation processes that highlight the capabilities of a CycleGAN.4

    The Intel® AI technologies used in this implementation included:

    • Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.
    • Framework optimization: Achieve faster training of deep neural networks on a robust, scalable infrastructure.

    For Intel AI Academy members, the Intel AI DevCloud provides a cloud platform and framework for machine-learning and deep-learning training. Powered by Intel Xeon Scalable processors, the Intel AI DevCloud is available for up to 30 days of free remote access to support projects by academy members.

    Join today at: https://software.intel.com/ai/sign-up

    For a complete look at our AI portfolio, visit https://ai.intel.com/technology.

    Figure 3. Image-to-image translation examples (courtesy of Berkeley AI researchers).


    "At Intel, we're encouraged by the impact that AI is having, driven by its rich community of developers. AI is mapping the brain in real time, discovering ancient lost cities, identifying resources for lunar exploration, helping to protect  Earth's oceans, and fighting fraud that costs the world billions of dollars per year, to name just a few projects. It is our privilege to support this community as it delivers world-changing AI across verticals, use cases, and geographies."5

      Naveen Rao, Vice President and General Manager, Artificial Intelligence Products Group, Intel


    Resources

     

    1. "GAN: A Beginner's Guide to Generative Adversarial Networks." DL4J. 2017.
    2. Goodfellow, Ian. "NIPS 2016 Tutorial." 2016.
    3. Cobb, Adam. "3D Shape Modelling of Asteroids." 2017.
    4. Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, Alexei Efros. "Image-to-Image Translation with Conditional Adversarial Networks." Berkeley AI Research Laboratory. 2017.
    5. Rao, Naveen. "Helping Developers Make AI Real." Intel. May 16, 2018.

    Part 1: Using Transfer Learning to Introduce Generalization in Models



    Source: distill.pub  

    Author’s note: The research was conducted using Intel® AI DevCloud, a cloud-hosted hardware and software platform available for developers, researchers and startups to learn, sandbox and start their Artificial Intelligence projects. This free cloud compute is available for Intel® AI Academy members.

    Abstract

    Researchers often try to capture as much information as they can, either by using existing architectures, creating new ones, going deeper, or employing different training methods. This paper compares different ideas and methods that are used heavily in Machine Learning to determine what works best. These methods are prevalent in various domains of Machine Learning, such as Computer Vision and Natural Language Processing (NLP).

    Transfer Learning is the Key

    Throughout our work, we have tried to bring generalization into context, because that’s what matters in the end. Any model should be robust and able to work outside your research environment. When a model lacks generalization, very often we try to train the model on datasets it has never encountered … and that’s when things start to get much more complex. Each dataset comes with its own added features which we have to adjust to accommodate our model.

    One common way to do so is to transfer learning from one domain to another.

    Given a specific task in a particular domain for which we need labelled images, we train our model on a dataset from that task and domain. In practice, the dataset is usually the largest in that domain so that we can leverage the extracted features effectively. In computer vision, it's mostly ImageNet, which has 1,000 classes and more than 1 million images. When training your network on it, it's bound to extract features2 that are difficult to obtain otherwise. Initial layers usually capture small, fine details, and as we go deeper, ConvNets try to capture task-specific details; this makes ConvNets fantastic feature extractors.

    Normally we let the ConvNet capture features by training it on a larger dataset and then modify the network: the fully connected layers at the end can be replaced with whatever combination of linear layers the classification task requires. This makes it easy to transfer the knowledge of our network to carry out another task.

    Transfer Learning in Natural Language Processing

    A recent paper, Universal LM for Text Classification,3 showed how to apply transfer learning to Natural Language Processing. This method has not been applied widely in this field. Instead of relying only on fixed embeddings, we can use pretrained language models trained on WikiText-103. Embeddings are word representations that allow words with similar meaning to have similar representation. If you visualize their embeddings, they would appear close to one another. It's basically a fixed representation, so their scope is limited in some ways. But a language model that has learned to capture semantic relationships within languages is bound to work better on newer datasets, as evidenced by results from the paper. So far, it has been tested on language modeling tasks and the results are impressive. This applies to Seq2Seq learning as well in instances where the lengths of inputs and outputs are variable. This can be expanded further to many other tasks in NLP. Read more: Introducing state of the art text classification with universal language models.

    Figure 1. Diagrams of language model training and tuning.

    Learning Without Forgetting

    Another paper, Learning without Forgetting,4 provides context for what's been done earlier to make our network remember what it was trained on, and how it can be made to remember new data without forgetting earlier learning. The paper discussed the researchers' methods compared with other prevalent, widely used methods such as transfer learning, joint training, feature extraction, and fine tuning. And they tried to capture differences in how learning is carried out.

    For example, fine tuning is an effective way to extend the learning of neural networks. Using fine tuning, we usually take a model trained on a larger dataset – let's say ResNet50 trained on ImageNet. A pretrained ResNet50 has 25.6 million parameters.5 ResNets let you go deeper without increasing the number of parameters relative to their counterparts. The number of parameters is so great that you can expect to use the model to fit any other dataset in a very efficient manner: you simply load the model, remove the fully connected layers, which are task specific, freeze the model, add linear layers as per your own needs, and train it on your own dataset. It's that simple and very effective. The pretrained model has so many capabilities and reduces our workload by a huge factor; we recommend using fine tuning wherever you can.
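    As a concrete illustration of that recipe, here is a minimal PyTorch sketch (the head sizes and class count are placeholders, not the exact code used in this work):

    # Minimal fine-tuning sketch: freeze a pretrained ResNet50 backbone and
    # replace its task-specific head with new linear layers.
    import torch
    import torch.nn as nn
    from torchvision import models

    model = models.resnet50(pretrained=True)   # backbone pretrained on ImageNet

    # Freeze the pretrained parameters so only the new head is trained.
    for param in model.parameters():
        param.requires_grad = False

    num_classes = 10   # placeholder for the target dataset
    model.fc = nn.Sequential(
        nn.Linear(model.fc.in_features, 256),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(256, num_classes),
    )

    # Only the parameters of the new head are handed to the optimizer.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)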

    What We’ve Actually Been Doing: Curve Fitting

    Judea Pearl recently published a paper6 in which he states that although we have gained a strong grasp of probability, we still can't do cause-and-effect reasoning. Instead, what we've basically been doing is curve fitting. So many different domains can be unlocked with do-calculus and causal modelling.

    The Three Layer Causal Hierarchy

    1. Association, P(y | x)
       Typical activity: Seeing
       Typical questions: What is? How would seeing X change my belief in Y?
       Examples: What does a symptom tell me about a disease? What does a survey tell us about the election results?

    2. Intervention, P(y | do(x), z)
       Typical activity: Doing, intervening
       Typical questions: What if? What if I do X?
       Examples: What if I take aspirin, will my headache be cured? What if we ban cigarettes?

    3. Counterfactuals, P(y_x | x′, y′)
       Typical activity: Imagining, retrospection
       Typical questions: Why? Was it X that caused Y? What if I had acted differently?
       Examples: Was it the aspirin that stopped my headache? Would Kennedy be alive had Oswald not shot him? What if I had not been smoking the past 2 years?

    Returning to where we were, we implemented learning without forgetting to measure how well the model does compared to the other discussed methods on some computer vision tasks. The paper defines three types of parameters: θs, θo, and θn. θs is the shared set of parameters, θo is the set of parameters learned on previous tasks (with a different dataset), and θn is the set of parameters the model will have when trained on the new dataset.

    How to Perform Training

    First, we used ResNet50 (the authors used 5 conv layers + 2 FC layers of AlexNet) instead of the stated architecture, with pretrained weights. The purpose behind pretrained weights is that our model will be used for domain adaptation and will rely heavily on fine tuning. It's necessary that the convolutional layers have already extracted rich features that will help in many computer vision tasks, preferably by pretraining on ImageNet (the pretrained ResNet50 itself has roughly 25.6 million parameters, as noted above). If you want to go deeper, consider using other ResNet variants like ResNet101. After that, our model must be trained using the architecture prescribed in the paper:

    Figure 2. The ResNet50-based model.

    The model in between is ResNet50 as per our implementation. We removed the last two layers and added two FC (fully connected) layers. We dealt with FC layers in a different manner appropriate to our task, but it can be modified for each use case. Add multiple FC layers depending on how many tasks you plan to perform.

    After creating the architecture, it's necessary to freeze the second FC layer. This is done to ensure that the first FC layer can perform better on this task when the model is later trained on another task with a significantly lower learning rate.
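    One way to express this arrangement as a minimal PyTorch sketch (the names and head sizes are assumptions, not the exact implementation used in this work) is a shared ResNet50 backbone with one FC head per task, where the new-task head is frozen during this stage:

    # Sketch of a shared backbone with two task-specific FC heads.
    import torch.nn as nn
    from torchvision import models

    backbone = models.resnet50(pretrained=True)
    feature_dim = backbone.fc.in_features
    backbone.fc = nn.Identity()                # strip the original ImageNet classifier

    old_head = nn.Linear(feature_dim, 1000)    # FC head for the original task
    new_head = nn.Linear(feature_dim, 10)      # FC head for the new task (placeholder size)

    # Stage 1: keep the new-task head frozen while the rest of the network trains.
    for param in new_head.parameters():
        param.requires_grad = False

    def forward(images):
        features = backbone(images)
        return old_head(features), new_head(features)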

    This method solves a big challenge: after training, the older dataset is no longer required, whereas other methods of training do still require it.

    Features of Learning Without Forgetting vs. Commonly Used Deep Learning Training Models

    • Fine tuning: new task performance good; original task performance bad; training fast; testing fast; storage requirement medium; does not require previous task data
    • Duplicate and fine tuning: new task performance good; original task performance good; training fast; testing slow; storage requirement large; does not require previous task data
    • Feature extraction: new task performance medium; original task performance good; training fast; testing fast; storage requirement medium; does not require previous task data
    • Joint training: new task performance best; original task performance good; training slow; testing fast; storage requirement large; requires previous task data
    • Learning without forgetting: new task performance best; original task performance good; training fast; testing fast; storage requirement medium; does not require previous task data

    This addresses a big challenge: to make incremental learning more natural, dependence on older datasets must be removed. After that training, we freeze the base architecture (in our case, ResNet50) and the first FC layer, with only the second FC layer turned on, and train the model with this arrangement.

    The rationale for this training approach

    The base model (ResNet in our case) already had fine-tuned weights, and its convolutional layers do an excellent job of feature extraction. As we fine-tune the base model, we update the weights for the dataset we're using. When we freeze the base model and train with only the other FC layer turned on, it implies that we have gone task specific, but we don't want to go much deeper into that task. By training the base model on a particular task and then re-training it this way, the model captures the weights required to perform well on the default dataset. If we want to perform domain adaptation, the earlier and middle layers should be very good at feature extraction and bring generalization into context rather than being task specific.

    Learning without forgetting

    (training formula from the Learning without Forgetting paper)

    After performing this training, we must jointly train all the layers. This implies turning on both FC layers of the base model and training them to convergence.
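    Continuing the earlier sketch (same assumed backbone, old_head, and new_head names), the later stages reduce to toggling which parameters receive gradients:

    # Toggle which parts of the network are trainable at each stage.
    def set_requires_grad(module, flag):
        for param in module.parameters():
            param.requires_grad = flag

    # Stage 2: freeze the backbone and the first (old-task) head; train only the new head.
    set_requires_grad(backbone, False)
    set_requires_grad(old_head, False)
    set_requires_grad(new_head, True)

    # Final stage: unfreeze everything and jointly train all layers to convergence.
    set_requires_grad(backbone, True)
    set_requires_grad(old_head, True)
    set_requires_grad(new_head, True)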

    Use any loss function your task requires. The authors used modified cross entropy (knowledge distillation loss), which proved to work well for encouraging the outputs of one network to approximate the outputs of another.

    (knowledge distillation loss formula from the paper)

    In our work, we tried triplet loss and cross-entropy loss functions.
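    For reference, a generic knowledge-distillation-style loss (a common formulation, not necessarily the paper's exact equation) can be sketched in PyTorch as follows:

    # Generic distillation-style loss: soften the recorded old-task outputs and the
    # current outputs with a temperature, then penalize the difference.
    import torch
    import torch.nn.functional as F

    def distillation_loss(new_logits, recorded_logits, temperature=2.0):
        """Keep the network's old-task outputs close to the outputs recorded
        before training on the new task."""
        soft_targets = F.softmax(recorded_logits / temperature, dim=1)
        log_probs = F.log_softmax(new_logits / temperature, dim=1)
        return -(soft_targets * log_probs).sum(dim=1).mean()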

    Observations

    This method seems to work well when the number of tasks is kept to a minimum (in our case, two). It may outperform fine-tuning for new tasks because the base model is not being retrained repeatedly, only the FC layers. Performance is similar to joint training when new tasks are being added. But, this method is bound to work poorly on older tasks as new tasks are added.

    This is because the same convolutional layers are used when we freeze them, which means every task shares the same feature extractor. We can't expect the model to outperform all of the above-mentioned training methods just by adjusting the FC layers.

    Figure 3. Adding more task-specific layers (network expansion).

    You can add more task-specific layers to introduce more generalization. But, as you go deep, you will make the model task-specific. This method addresses the problem of adapting to different domains of computer vision without relying on older datasets that were used in earlier training. It can be regarded as a hybrid of knowledge distillation and fine-tuning training methods.

    This is an incremental step toward bringing generalization to neural networks, but we still lack ways to achieve full generalization, wherein we can expect to make our networks learn just like we do. We still have a long way to go, but research is in progress.

    Technologies Involved

    Since we were dealing with image-related datasets, we wanted the transfer between images and CPU to be fast, and DevCloud seemed to hasten the process. We performed all preprocessing on DevCloud; in addition, we trained our model incrementally. We also used Nvidia GTX 1080 for some parts of the training.

    Intel Development Tools Used

    The project made use of Jupyter notebook on the Intel AI DevCloud (using Intel® Xeon® Scalable processors) to write the code and for visualization purposes. We also used information from the Intel AI Academy forum. The code we used can be found in this GitHub* repository.

    Join the Intel® AI Academy

    Sign up for the Intel AI Academy and access essential learning materials, community, tools and technology to boost your AI development. Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developers.

    Contact Intel AI Student Ambassador Prajjwal Bhargava on Twitter or GitHub.

    References:

    1. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks
    2. Visualizing and Understanding Convolutional Networks
    3. Universal Language Model Fine-tuning for Text Classification
    4. Learning Without Forgetting
    5. Deep Residual Learning for Image Recognition
    6. Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution

    Expanding the Possibilities of Computer Vision with AI



    The benefits and possibilities of computer vision are amplified and expanded through a generous influx of AI and IoT technologies.

    “The computer vision market will include a mix of consumer-facing applications like augmented reality and virtual reality, robots, drones, and self-driving cars, along with business applications like medical image analysis, video surveillance, real estate development optimization, ad insertions, and converting paperwork into digital data. While the consumer-facing applications often generate more buzz, the big shift is that enterprises are moving beyond data analytics to embrace AI-based business applications that utilize computer vision capabilities.”1

    Aditya Kaul, Research Director, Tractica  

    Challenge

    Computer vision has unlocked a multitude of possibilities for both consumer and enterprise business applications, but traditional technologies have been hampered by numerous vexing implementation obstacles. 

    Solution

    With advanced computer vision technologies that embed intelligence at the network edge and solutions enhanced by artificial intelligence (AI), innovative new use cases are being developed that are generating increasing real-world value. 

    Background and Project History

    From an early career as a DJ, Adam Milton-Barker gained an interest in coding while building websites to promote his business, which spiraled into deeper interests, including AI and the Internet of Things (IoT). Over the course of several years, numerous jobs, and time spent managing his own company, Adam continued to accumulate AI expertise, at one stage leading a medical neural network project for a team based in Sao Paulo. In 2014, an ad caught his attention: a challenge from the Microsoft Development Program offering an Intel® Galileo development board to each of the winners. “At that point,” Adam said, “I was primarily involved in web, business administration applications, and natural linguistics programs, using AIML (Artificial Intelligence Markup Language). I also had a stint building games and apps as a member of the Facebook developer forum, as well as teaching coding. I had never come across IoT. I really liked the idea of the Internet of Things. And, because I had a lot of equipment in my home, an obvious project for me would be a security system.”

    Adam developed a facial recognition solution that he dubbed TASS (TechBubble Assisted Security System) and was awarded the Intel Galileo from Microsoft* for the project idea. He then bought a standard Intel Galileo board to be able to include Linux* in his development efforts. TASS debuted at the Intel booth at Codemotion Amsterdam as part of the Intel Innovator program and the solution became the focus for a number of conference presentations and demos at worldwide venues. After considering launching TASS as a full-fledged product, Adam decided to release the specifications and project details to the open source community. “TASS,” he said, “is now open source, IoT-connected computer vision software that runs on the edge. There are several versions of TASS that have been created over the last few years, each using different techniques and technologies to provide facial recognition applications without the need for cloud services.”

    The initial TASS project expanded in several productive directions. “During the next few years,” Adam said, “I was a semifinalist in the IBM Global Mobile Innovators Tournament with a project that included TASS and the IoT JumpWay*, which was then built on a Raspberry Pi*, but is now a free cloud service for people that want to learn about the IoT and AI. The project was one of the top five in Europe. I was also a first phase winner in the World’s Largest Arduino* Maker Challenge and I worked with a team at the AT&T Foundry Hackathon where we won first place for using a brain-computer interface to detect seizures; although, as a proof of concept the project never went beyond the demo stage. After a version of TASS won the Intel Experts Award at the Intel and Microsoft IoT Solutions World Congress Hackathon, I was invited to join the Intel Innovators program. This had been a goal of mine since the early days of my involvement in IoT. I joined the program in the areas of AI and IoT and have since added a new area—VR.”

    The landmark accomplishments that Adam has achieved over several years were attained without his earning a technology degree or taking any formal college courses. “My work, project experiences, and self-learning path led me to the Intel® Software Innovators Program, which opened global opportunities. I’ve demonstrated my projects at some of the biggest tech conferences in the world. Ultimately, this led me to my dream job at Bigfinite as an IoT network engineer.”

    “Moving to Barcelona and working at Bigfinite gave me a totally new life; I now work in an environment where I am not only surrounded by like-minded people, but people that know a lot more than me. It is an amazing place for me to continue learning, something that I have never experienced at other companies where I have worked. Bigfinite is also fully supportive of my personal projects and role in the Intel® Software Innovator program and promote my projects on our social media. We also have an initiative called beTED where I can continue helping people learn about AI and IoT at work.”

    Adam Milton-Barker, Intel Software Innovator and Bigfinite IoT Engineer

    “The project is ongoing,” Adam said. “I originally began developing it in 2014 and since then there have been many different versions. All of these versions are available on the IoT JumpWay GitHub* repos. As new technologies emerge, I create new versions of TASS.” 

    Refining Facial Recognition Technology

    Through his development experience and ongoing research, Adam has identified key areas that could guide developers in productive directions when implementing facial recognition capabilities into their apps. Foremost among the concerns is the open set recognition issue. “The open set recognition issue is something that not many people talk about when they promote their computer vision projects or products,” Adam commented, “as it is an unmistakable flaw in almost all computer vision projects. What happens is this: Say that you had trained a network to know two people and then introduce an unknown person for classification. By default, the AI application will predict that it is one of the people it has been previously trained on. Because of this, any application that relies on detecting who is actually unknown will fail.”

    Figure 1. Facial recognition is implemented through a polygonal grid linked to features.

    Overcoming the issue, according to Adam, can be done in two different ways. First, you can introduce an unknown class composed of, for example, 500 images. This approach works well in small environments, but within a larger environment you have a greater likelihood of seeing someone that looks like someone from the unknown dataset. This implementation, however, doesn’t work in TensorFlow* Inception v3, but it does work within an OpenFace* implementation (which is available in the GitHub repository).

    Another way to contend with the issue involves using FaceNet, which calculates the distance between faces in frames and a known database of images. On its own, this approach will typically not work well in the real world. If your application relies on thousands of known people, the program must loop through every single person in the known dataset and compare it to the person or persons in a frame. If you have a very powerful server and abundant resources, this may not be a serious issue. But, on the network edge where compute resources are limited, it becomes more of a challenge.

    Along this line, Adam continued, “My next step will be to combine my earlier models with FaceNet and use FaceNet as a backup to check known faces, eliminating the need to loop through all of the known people. Because we know exactly what image to compare against—due to using the result from the first classification—if the second classification confirms, then it is more than likely that it is that person and not a false positive. The only requirement is to retrieve the classification from model 1 and use the ID to directly reference the corresponding image in the known database. Currently, I believe that this is the best way to solve the issue, but it is kind of a hacky way of doing things. This approach was suggested to me by a colleague, Stuart Gray, a fellow member of the Artificial Intelligence and Deep Learning group on Facebook.”
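
    To make the idea concrete, here is a minimal sketch of that two-stage check, not Adam’s actual code: classify_face and embed_face are hypothetical stand-ins for a first-stage classifier and a FaceNet-style embedding model, known_embeddings is assumed to map each person ID to the precomputed embedding of that person’s reference image, and the distance threshold is purely illustrative and would need tuning.

    import numpy as np

    # Hypothetical two-stage identification: a coarse classifier proposes an
    # identity, then a single embedding comparison confirms or rejects it.
    def identify(face_image, classify_face, embed_face, known_embeddings, threshold=1.0):
        person_id, _ = classify_face(face_image)        # stage 1: coarse prediction
        reference = known_embeddings.get(person_id)
        if reference is None:
            return "unknown"
        # Stage 2: compare against only the one referenced embedding instead of
        # looping over the entire known database.
        distance = np.linalg.norm(embed_face(face_image) - reference)
        return person_id if distance < threshold else "unknown"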

    Two other issues that bear consideration:

    Lighting, whether too dark or too bright, presents a challenge. Intel® RealSense™ technology minimizes lighting issues significantly, but developers need to be aware of scenarios where a poor lighting situation completely shuts down the recognition process.

    Photos designed to fool a computer vision system and foil either security protections or the facial recognition accuracy represent a current challenge that requires more attention as facial recognition moves into mainstream applications. 

    Enabling Technologies

    Adam uses a range of different Intel solutions in his projects, building new iterations of TASS to take advantage of emerging technologies. “Different versions have used different technologies,” Adam said. “Initially it was built on a Raspberry Pi. At the IoT World Congress Hackathon, we built it on an Intel® Joule™ development kit (now discontinued). The server version was built on an Intel® NUC DE3815TYKE and also an Intel NUC I7 using the OpenVINO™ toolkit. I have used Intel® RealSense™ cameras in some versions that helped with issues such as lighting. The more current versions use the UP Squared* Grove* IoT Development Kit and Intel® Movidius™ technology and they are trainable using the Intel® AI DevCloud. I will soon be working on a version that uses virtual reality using the hardware provided by Intel.”

    Among the specific benefits that Adam gained from the use of Intel technology:

    • Intel RealSense technology helped improve management of lighting issues.
    • Intel AI DevCloud was effective for training small models.
    • Intel Movidius technology has enhanced the capabilities of running AI on the edge.
    • Sample code and other resources available through Intel helped him gain a better understanding of the hardware used in the solutions.
    • The OpenVINO™ toolkit substantially improved the project results, adding speed and accuracy to the solution.

    “Each time I have implemented Intel technologies it has drastically increased the functionality of the project. In addition to increasing the capabilities of the project, the support I have received from the Intel Innovators in the Intel® AI Academy program has been amazing. The hardware and opportunities to demonstrate at various events that were provided through the program have helped the project reach new heights.”

    Adam Milton-Barker, Intel Software Innovator and IoT Engineer at Bigfinite, Inc. 

    Bringing Vision to the Edge: The OpenVINO™ Toolkit

    The release of the Open Visual Inference and Neural Network Optimization (OpenVINO) toolkit by Intel gives developers a rapid way to implement deep learning inference solutions using computer vision at the network edge. This addition to the current slate of Intel® Vision Products is based on convolutional neural network (CNN) principles, making it easier to design, develop, and deploy effective computer vision solutions that leverage IoT to support business operations.

    The components in the toolkit include three APIs:

    • A deep learning inference toolkit supporting the full range of Intel Vision Products.
    • A deep learning deployment toolkit for streamlining distribution and use of AI-based computer vision solutions.
    • A set of optimized functions for OpenCV and OpenVX*.

    Currently supported frameworks include TensorFlow, Caffe*, and MXNet. The toolkit helps boost solution performance with numerous Intel-based accelerators, including CPUs, integrated graphics processing units (GPUs), field-programmable gate arrays, video processing units, and image processing units.

    “Processing high-quality video requires the ability to rapidly analyze vast streams of data near the edge and respond in real time, moving only relevant insights to the cloud asynchronously. The OpenVINO toolkit is designed to fast-track development of high-performance computer vision and deep learning inference applications at the edge.”2

    Tom Lantzsch, Senior Vice President and General Manager, IoT Group, Intel

    Substantial performance improvements are available through the OpenVINO toolkit (click here and zoom in on the Increase Deep Learning Performance chart for full details). The components also enable a single-source approach to creating solutions, allowing developers to develop once and deploy anywhere, taking any model and optimizing it for a large number of Intel hardware platforms.

    A free download of the OpenVINO toolkit is available today, putting developers on a path to produce optimized computer vision solutions that maximize performance on Intel acceleration platforms. Ecosystem partners in the Intel® IoT Solutions Alliance offer additional tools and technologies to help build innovative computer vision and IoT solutions. 

    Forward-Looking Development Perspectives

    Opportunities in the burgeoning fields of AI and IoT are abundant, and numerous resource and learning tools are available to anyone with the initiative to explore the various applications. International Data Corporation (IDC) projects that worldwide spending on IoT will reach USD 772 billion in 2018, up from USD 674 billion in 2017. IoT hardware represents the largest technology category in 2018; sales of modules, sensors, infrastructure, and security are projected to total USD 239 billion, with services listed as the next largest category.3

    Aerial drone technology
    Figure 2. Aerial drone technology opens up numerous vision computing opportunities.

    Industry reports project that strong growth will continue in the computer vision market:

    • By 2022, the video analytics market is projected to reach USD 11.17 billion.4
    • By 2023, the overall computer vision market should reach USD 17.38 billion.5
    • Deep learning revenue is projected to increase from USD 655 million in 2016 to USD 35 billion by 2025.6

    Developers interested in taking advantage of these technology opportunities have a number of different channels for gaining knowledge and expertise in AI and deep learning.

    “I would recommend the Coursera Deep Learning Specialization and Stanford Engineering Everywhere Machine Learning course for people wanting to know more about the inner workings of modern AI,” Adam said. “For those who just want to dive straight in head first as I did (and do), I have created a number of complete walkthroughs and provided source code that is freely available through the IoT JumpWay Developer Program.” 

    AI is Expanding the Boundaries of Computer Vision

    Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to uncover and advance solutions that address major challenges in the sciences.

    For example, working with the engineering team at Honeywell, Intel is combining IoT technology and computer vision tools to help ensure safe and secure buildings.

    “The Internet of Things is creating huge advancements in the way we use video to ensure safe and secure buildings. With new emerging technology like analytics, facial recognition, and deep learning, Honeywell and Intel are connecting buildings like never before. Intel is an important partner in establishing the vision of smarter video solutions for the industry, and we look forward to continued collaboration that benefits customers.”7

    Jeremy Kimber, Marketing Director, Video Solutions, Honeywell

    The Intel® AI technologies used in this implementation included:

    Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.


    Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.


    Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.


    Intel® AI DevCloud: Free cloud compute for machine learning and deep learning training, powered by Intel® Xeon® Scalable processors.


    Internet of Things: IoT consists of a network of devices that exchange and analyze data across linked and wireless interconnections.


    For a complete look at the Intel® AI portfolio, visit https://ai.intel.com/technology.

    “With the OpenVINO toolkit, we are now able to optimize inferencing across silicon from Intel, exceeding our throughput goals by almost six times. We want to not only keep deployment costs down for our customers, but also offer a flexible, high-performance solution for a new era of smarter medical imaging. Our partnership with Intel allows us to bring the power of AI to clinical diagnostic scanning and other healthcare workflows in a cost-effective manner.”8

    David Chevalier, Principal Engineer, General Electric (GE) Healthcare* 

    Resources

    References

    1. "Computer Vision Hardware, Software, and Services Market to Reach $26.2 Billion by 2025, According to Tractica." Business Wire, 2018.
    2. Wheatley, Mike. "Intel's OpenVINO toolkit enables computer vision at the network edge." SiliconANGLE, 2018.
    3. "Worldwide Semiannual Internet of Things Spending Guide." International Data Corporation (IDC), 2017.
    4. "Video Analytics Market." MarketsandMarkets, 2017.
    5. "Computer Vision Market." MarketsandMarkets, 2017.
    6. Tractica, 2Q 2017.
    7. "Intel Customer Quote Sheet." Intel Newsroom, 2018.
    8. "OpenVINO Toolkit: What Customers Are Saying." Intel, 2018.

    Visionary AI Explorations: From Finding Moon Resources to Design to Tracking Whales


    "[Scientists] gathering of data far outpaces their ability to make sense of it. The data NASA collects far exceeds its ability to understand. The research world usually has less access to the latest and greatest compute tools than a lot of the companies out there. But as a scientist, I fundamentally believe that we need to make sure we support those efforts.”1

    Naveen Rao, General Manager of Artificial Intelligence Products, Intel

    Research insights gained through artificial intelligence (AI) techniques deepen our understanding of the world around us, as well as delivering discoveries about off-world environments. For example, NASA’s Frontier Development Lab (FDL)—hosted by the SETI Institute in partnership with the NASA Ames Research Center—provides a platform for applying AI solutions to the challenges of space exploration. A recent project sponsored by Intel focused on using AI to identify useful resources on the moon. Ongoing research through FDL is revealing new ways in which AI can be used in space exploration, as well as charting paths for future scientific research across diverse fields of inquiry.  

    Challenge

    Applications grounded in AI technologies continue to gain traction and demonstrate efficacy in science, medicine, finance, agriculture, and other sectors. At the same time, prospective early adopters seek tangible examples to guide project development and serve as proofs of concept for AI techniques.  

    Solution

    Practical examples of the ways in which AI can address real-world challenges are appearing with increasing frequency. This, in turn, is encouraging wider acceptance of AI technology, with successful projects ranging from space exploration breakthroughs for NASA to 3D-printed orthopedic braces that add intelligence and personalization to medical devices. These achievements are helping demonstrate applied innovation techniques and establishing a foundation for new use cases.  

    Discoveries Based on Landmark AI Projects

    Pioneering projects in AI are reshaping the nature of scientific inquiries and providing a richer, full-spectrum view of our surroundings, our bodies, and our human potential. Innovation is fueled by new technologies, with intelligent agents springing up everywhere from the core of data centers to the furthest reaches of the network edge. Advances driving these capabilities include improvements in processors, specialized integrated circuits (ICs) optimized for AI operations, advances in computer vision, and software enhancements tailored to deep learning and machine learning. Innovators are imaginatively applying AI tools and techniques to further our knowledge about the natural world and extend research into space, as well as solving problems that benefit individuals at a one-to-one level.

    To illustrate the ways in which AI is being practically applied, Shashi Jain, Innovation Manager in the Developer Relations Division of the Intel Visual Computing Group, has led several projects that ventured into applied innovation techniques, combining diverse technologies into fresh solutions that encompass pathfinding in the Internet of Things (IoT), machine learning, virtual reality (VR), and 3D-printing technology.

    “We do experiments to find out what problems we can solve with our technology,” Shashi said. “As an example, a few years ago we developed an industrial wearable device built around the Intel® Edison module to help reduce back injuries in the workplace.”

    “We also put microcontrollers on wine bottles,” Shashi continued, “and did some really interesting things identifying the right wine pairing for a meal. Or, finding all the bottles in your collection that meet certain criteria. Using this technique, you could track wine from the moment of bottling at the winery through distribution to a retailer to an individual wine cellar.”

    “Another thing that we did,” Shashi said, “was to put a microcontroller with a sensor on a scoliosis brace. We captured pressure data to determine how well it was fitting and how long it was worn and presented this data to the user in an app, hoping it would improve compliance. We achieved that, but the real magic is what the designer of the brace did with the sensor data. The brace is fully 3D printed and it starts with a body scan. The designer used the sensor to incrementally improve the design of the brace, based on the individual patient’s own sensor data.”

    “We are looking for all of these interesting use cases and fits for our technology and generating the insights that we can’t get any other way,” Shashi commented.

    The following sections offer more insights into applied innovation techniques.  

    Identifying Moon Resources

    The NASA FDL is ushering in a new age of discovery by hosting collaborative interdisciplinary projects that address specific scientific challenges using AI-based research. Intel, along with other key private-sector partners, contributes hardware and technology to advance this endeavor.

    Shashi recently led Intel’s sponsorship of the FDL and brought together Intel engineers to collaborate with researchers, applying AI to identify potential resources on the moon. The research relied on a massive dataset of images from the lunar polar regions and AI-guided crater detection. “FDL,” Shashi explained, “is part of a public-private partnership between the government, the SETI Institute, and industry partners to apply commercial AI to space exploration challenges. The program focuses on accelerating the pace of modern machine-learning and AI techniques at NASA. NASA has some 50 years’ worth of data on lunar missions and space missions out to the planets—and everything in between—and now we have a chance to do space exploration using that data without leaving Earth.”

    Shashi continued, “We bring together teams of experts in their areas: machine learning, generally at a post-doctorate level, a doctorate level, or anything in-between. They can either be university researchers, industry researchers, or people with a published background. We bring them together for eight weeks to focus on the challenges of space exploration that are relevant to NASA or to the commercial space industry and spend a good amount of time defining the problems in advance.”

    “Beyond Space, AI is proving a vital tool in identifying gene activations, diagnosing tumors, managing power and heat, developing new molecules and even teaching robots to walk in a constantly moving environment.”2

    James Parr, Director, NASA Frontier Development Lab

    Lunar water and volatiles project

    In 2017, Intel sponsored the Lunar Water and Volatiles challenge, assembling a team to focus on recovering water and volatile chemicals from the moon. As this is an applied research accelerator, Intel guided the team to identify and define challenges relevant to actual users, who turned out to be engineers at the NASA Jet Propulsion Laboratory (JPL) and NASA’s Ames Research Center, as well as companies focused on lunar missions. What’s remarkable is that missions to recover moon and planetary resources are being planned and may be launched within five years.

    “There are 10,000 decisions that need to be made for any of these missions to happen,” Shashi said. “And we are right at the front of that process. So, the engineers we talked with articulated that their missions included topographical maps of the moon. We said, ‘OK, maybe we can help you with that.’ The objective was to identify craters using the imagery these agencies had obtained from the Lunar Reconnaissance Orbiter and LCROSS missions. If you can identify craters, you can identify orientation and shadowing and create a better topological map of the moon by ordering and combining images. The output is very precise. Right now, there are only a few areas around the equator that are very well mapped for upcoming lunar missions. The machine-learning operations will open up the other regions of the moon for a deeper analysis, including the permanently shadowed regions, which are at the poles. This is where NASA believes most of the water is.”

    Accelerated identification with machine learning

    The team used a methodology to combine the optical imagery with an overlay of depth sensor imagery. “We can’t fully analyze a flat image to identify a potential water source,” Shashi explained. “However, once we overlay the depth sensor data with the optical imagery and run the computer vision algorithms, we can get a positive identification of those craters likely to have water. Using this approach, we can look all over the moon for the right kind of craters even if they are in shadowed regions.”

    As shown in Figure 1, two different datasets were used, relying on the craters themselves to register the optical images from the Lunar Reconnaissance Orbiter (LRO) Narrow Angle Camera (NAC) with elevation data captured using the Lunar Orbiter Laser Altimeter Digital Elevation Model (LOLA DEM). The computer vision algorithm developed by the team relied on a convolutional neural network (CNN) to analyze the optical images and elevation data using an adaptive convolution filter.
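
    As a rough illustration of how co-registered optical and elevation data can be fed to a CNN, the sketch below stacks the two sources as input channels. The patch names and normalization are illustrative assumptions, not the FDL team's actual pipeline.

    import numpy as np

    # Build a two-channel training tile from an optical patch (e.g., an LRO NAC
    # crop) and a co-registered elevation patch (e.g., a LOLA DEM crop) of the
    # same size. A CNN can then learn crater signatures that combine shadowing
    # in the optical channel with depth in the elevation channel.
    def make_training_tile(optical_patch, elevation_patch):
        optical = (optical_patch - optical_patch.mean()) / (optical_patch.std() + 1e-8)
        elevation = (elevation_patch - elevation_patch.mean()) / (elevation_patch.std() + 1e-8)
        return np.stack([optical, elevation], axis=-1)   # shape: (height, width, 2)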

    Lunar Reconnaissance Orbiter
    Figure 1. Elevation measures (left) were aligned with optical images (right) to create precise lunar maps.

    Using technology provided by Intel, the team sped up the crater identification process, requiring only 1 minute to classify 1,000 images. Results from this project showed that the AI-based crater detection identified craters roughly 100 times faster than a human analyst, with a success rate of 98.4 percent.3 The Intel research team is planning to improve the algorithms that were developed so that NASA can use the technology in a potential future moon mission to harvest resources. Complete validation of the machine-learning techniques could follow when manned missions to the moon resume. At that time, maps could be adjusted and refined, and resource accessibility could be reassessed.

    “We have 50 years’ worth of NASA imagery from all sides of the moon. We’ve only recently begun to combine them and make one big, awesome map.”4

    Shashi Jain, Innovation Manager, Intel Visual Computing Group

    N A S A lunar vehicle
    Figure 2. Eugene Cernan at the controls of the Apollo 17 lunar rover (courtesy of NASA).  

    Converging Technologies Lead to a Personalized 3D-Printed Back Brace

    Another area rich with possibilities is using data and generative design processes to inform customizations of personalized medical devices. Intel contributed to a project that used data captured from a microcontroller to construct a 3D-printed orthopedic brace for people with scoliosis. Scoliosis—an abnormal curvature of the spine—affects millions of people and is common in children. Past-generation scoliosis braces have typically been heavy, uncomfortable, and burdensome, often causing wearers to remove them to gain some relief.

    In comparison, a 3D-printed brace, designed by Studio Bitonti and commercialized by UNYQ, incorporates built-in sensors to monitor the wearer’s personal data. The sensor data contributes to the generative design process by using AI techniques to introduce incremental improvements, an area in which Intel provided expertise and design assistance. As a result, the customized, lightweight, comfortable braces can be worn for many hours during the day and are stylish enough to be worn as a fashion item outside of clothing.

    The design prototype incorporated a compact Intel microcontroller that included an accelerometer and gyro, pressure sensors, and Bluetooth® technology capabilities. An app developed by Intel monitored and logged activity, pressure points, temperature, and other data. The designer, Francis Bitonti, recognized the potential in linking data to design and optimized the design to remove plastic that wasn’t therapeutic. Using a generative design technique, through multiple iterations, he enhanced the structure and design of the brace (Figure 3). Feeding data into a machine-learning system or deep-learning system provides a mechanism for shaping a design for aesthetics, functionality, and materiality.

    U N Y Q Align design
    Figure 3. The UNYQ Align* brace detects stress and weight points during a generative design process.

    The capabilities of generative design extend to fashion as well, and collaborations with Bitonti Technology, Chromat, and Intel have produced landmark designs such as the Adrenaline Dress (see Figure 4), which employs fabrics that respond to changes in the wearer’s breathing, temperature, and heart rate, expanding and contracting dynamically. As Bitonti states on their website, “Our design process is a collaboration with artificial intelligence.”

    In a presentation given at a TCT Inside 3D Printing conference, Shashi talked about 3D printing, smart devices, and new manufacturing methods: “Where I think it gets much more interesting is when we start taking third-party datasets: electronic medical records, sports data, FitBit data. Every one of us has a phone, every one of us has step counting and tracking sensors. We need to take this data and funnel it into these systems to generate these same design hints. We need to apply those third-party datasets to find optimizations that make a medical product fit a user’s lifestyle better.”5

    Intel’s Adrenaline dress
    Figure 4. Intel’s Adrenaline dress uses smart fabric to map to wearer’s emotions.

    “We believe the next generation of material innovation will be both digital and physical. In other words, designers can work with a synthesis of information and design parameters and turn it into design.”6

    Francis Bitonti, Studio Bitonti  

    Conducting Scientific Research with Drones and AI

    Tracking whales and identifying individuals over hundreds of miles of ocean is a significant challenge that is made easier through a combination of Intel machine learning technology and unmanned aerial drones. A collaborative effort (dubbed Project Parley SnotBot) involving Parley for the Oceans, the Ocean Alliance, and Intel used drones to harvest whale spout water, emitted through the whale’s blowhole, to evaluate the biological data contained within it. Machine-learning algorithms devised by Intel can identify individual whales and perform real-time assessment of a number of factors, including the overall health of the whale. Despite limited visibility in the ocean and the unpredictable movements of the whales, drone tracking and analyzed data gives researchers a means to make decisions in the field and rapidly gain access to factors, such as DNA readings, presence of harmful viruses and bacteria, exposure to toxins, hormones associated with stress or pregnancy, and other conditions.

    The founder of Parley for the Oceans, Cyrill Gutsch, commented, “Our vision is to create a global network of digital exploration tools that generate the big data we need to identify threats with new speed and precision, so we can act on them instantly.”7

    Novel forms of data collection are one of the earmarks of AI-based solutions. Drones can collect data in difficult environments under challenging conditions. As in the previous example of tracking whales, AI techniques could be employed to use the thermal image data collected by drones to automate the identification of individual polar bears located in different environments.

    Polar bears are among the most elusive, wide-ranging animals on the planet. They are especially difficult to observe and track because their white fur provides little contrast against the snowpack. With their habitat threatened by the impact of global climate disruption, polar bears are struggling to adapt and survive. As part of a research project to learn more about polar bear migration and behavior in the Arctic, Intel teamed with Parley for the Oceans and noted wildlife photographer Ole Jørgen Liodden. Using an Intel® Falcon™ 8+ drone equipped with a thermal camera, the team was able to get close to the bears (within 50 to 100 meters) without disturbing them and collect data to better understand the bears’ habits and health status. The data helps inform wildlife researchers as well as climate change scientists as they work to determine the effects of changing weather patterns on the animals living in this region, along with the broader environmental impacts.

    Data tracking whales
    Figure 5. Data captured tracking whales can help ensure their survival.

    drone camera sleeping bear
    Figure 6. Sleeping bear observed by the drone camera.

    Traditional methods of observing polar bears include helicopter exploration, which is invasive and dangerous, and observation from a vessel, which is typically difficult because of the harsh arctic conditions. These methods can be easily retired in favor of using aerial drones equipped with cameras (Figure 7). Research projects, such as this one, provide opportunities for taking advantage both of unmanned aerial drones and AI-based data collection.

    aerial drones unlock
    Figure 7. Unmanned aerial drones unlock opportunities for new and exciting scientific research.

    “Polar bears are a symbol of the Arctic. They are strong, intelligent animals. If they become extinct, there will be challenges with our entire ecosystem. Drone technology can hopefully help us get ahead of these challenges to better understand our world and preserve the Earth’s environment.”8

    Ole J. Liodden  

    Enabling Technologies from Intel

    Hardware compute resources

    Intel® AI DevCloud, powered by Intel® Xeon® Scalable processors, provides an ideal platform for machine-learning and deep-learning training and inference computing. Developers in the Intel® AI Academy like the easy access and the pre-configured environment of the Intel AI DevCloud. Portions of the projects discussed in this success story also ran, at various points, on the Intel® Deep Learning Cloud (Intel® DL Cloud) & System, which is tailored for enterprise developers.

    Optimized frameworks

    The Intel® Optimization for Caffe* framework, available through the Intel AI Academy, contains many optimization features tuned for CPU-based models. Intel’s contributions to Caffe*, a community-based framework developed by Berkeley AI Research, improved performance when running algorithms on Intel® Xeon® processors.

    Additionally, a customized deep-learning framework, Extended-Caffe*, provided an addition to the software stack so that CPU architecture can efficiently support 3D CNN computations. This makes it possible for researchers, data scientists, and developers to effectively implement projects using the CPU for 3D CNN model development, similar to the CNN techniques that proved successful for the Intel team working on the NASA FDL project.

    “[People] think we are recreating a brain. But we want to go beyond that, we want to create a new kind of AI that can understand the statistics of data used in business, in medicine, in all sorts of areas, and that data is very different in nature than the actual world.”9

    Amir Khosrowshahi, Chief Technology Officer, Artificial Intelligence Products Group, Intel Corporation  

    AI is Expanding the Boundaries of Scientific Exploration

    Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, educational institutions, and corporations to discover and advance solutions that address major challenges across diverse sectors.

    To bring a new generation of AI-savvy developers into the fold, Intel sponsors challenges and events designed to encourage imaginative solutions to difficult problems. For example, the Intel® AI Interplanetary Challenge, launched on May 21, 2018, brings together the Planetary Society and Intel AI experts with others interested in crafting solutions to real-world space exploration challenges.

    “Intel’s AI portfolio of products, tools, and optimized frameworks is uniquely designed to enable researchers and data scientists to use AI to solve some of the world’s biggest challenges, and it’s ideal for a problem such as accelerating space travel. From the moment we heard about this challenge, we were committed to applying our expertise and technology solutions to the groundbreaking work being done on applications of AI for space research. Congratulations to the research teams, and to the Intel mentors, who are advancing technology that could take us to Mars and beyond.”10

    Naveen Rao, Corporate Vice President and General Manager, Artificial Intelligence Products Group, Intel

    The Intel® AI technologies used in this implementation included:

    Intel® Xeon® Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.


    Framework optimization: Achieves faster training of deep neural networks on a robust scalable infrastructure.


    Intel® AI DevCloud: Offers a free cloud compute platform for machine-learning and deep-learning training and inference.


    Join today at: https://software.intel.com/ai/sign-up

    For a complete look at our AI portfolio, visit https://ai.intel.com/technology.

    “Scientists need to partner with AI. They can greatly benefit from mastering the tools of AI, such as deep learning and others, in order to explore phenomena that are less defined, or when they need faster performance by orders of magnitude to address a large space. Scientists can partner with machine learning to explore and investigate which new possibilities have the best likelihood of breakthroughs and new solutions.”11

    Gadi Singer, Vice President and Architecture General Manager of Intel’s Artificial Intelligence Products Group  

    Resources

    References

    1. Heater, Brian. "NASA is using Intel's deep learning to build better moon maps." TechCrunch, 2017.
    2. "Artificial Intelligence Accelerator 2018." NASA Frontier Development Lab, 2018.
    3. Backes, Dietmar, Bohacek, E., Dobrovolskis, A., and Seabrook, T. "Automated Crater Detection Using Deep Learning." NASA FDL, 2017.
    4. Gilbert, Elissa. "Using AI to Discover the Moon's Hidden Treasures." iQ by Intel, 2018.
    5. Jain, Shashi. "Robotic Design: How to Achieve Customisation at Scale." YouTube, 2017.
    6. Bonime, Western. "Get Personal, The Future of Artificial Intelligence Design." Forbes, 2017.
    7. "From Polar Bears to Whales: Intel Pushes the Boundaries of Wildlife Research with Drone and Artificial Intelligence." Intel Newsroom, 2018.
    8. Miller Landau, Deb. "Researchers Deploy Test Drones to Track Arctic Polar Bears." iQ by Intel, October 2018.
    9. "The Many Ways to Define Artificial Intelligence." Intel Newsroom, 2018.
    10. "Intel Showcases Application of AI for Space Research at NASA FDL Event."
    11. "How is Artificial Intelligence Changing Science?" Intel Newsroom, 2018.

    Get Started With Unity* Machine Learning Using Intel® Distribution for Python* (Part 1)


    Abstract

    This article shows game developers how to use reinforcement learning to create better artificial intelligence (AI) behavior. Using Intel® Distribution for Python—a performance-oriented distribution of the popular object-oriented, high-level programming language—readers will learn how to train pre-existing machine learning (ML) agents so that they adapt and improve. In this scenario, we will use Intel® Optimization for TensorFlow* to run Unity* ML-Agents in a local training environment.


    Introduction

    Unity ML-Agents are a good way for game developers to learn how to apply reinforcement learning concepts while creating a scene in the popular Unity engine. We used the ML-Agents plugin to create a simulated environment, then configured training to generate an output file from TensorFlow that the Unity scene can consume to improve the simulation.

    The basic steps are as follows:

    1. Start with an introduction to reinforcement learning.
    2. Perform the setup from the "requirements.txt" file, which installs TensorFlow 1.4 and other dependencies.
    3. Train a pre-existing ML-Agent.


    System Configuration

    The following configuration was used:

    • Standard ASUS laptop
    • 4th Generation Intel® Core™ i7 processor
    • 8 GB RAM
    • Windows® 10 Enterprise Edition


    What is Reinforcement Learning?

    Reinforcement learning is a method for "training" intelligent programs—known as agents—to constantly adapt and learn in a known or unknown environment. The system advances based on receiving points that might be positive (rewards) or negative (punishments). Based on the interaction between the agents and their environment, we then infer which action needs to be taken.

    Some important points about reinforcement learning:

    • It differs from normal machine learning, as we don't look at a training dataset.
    • It works not with data, but with environments, through which we depict real-world scenarios.
    • It is based upon environments, so many parameters come into play as "RL" takes in lots of information to learn and act accordingly.
    • It uses potentially large-scale environments that are real-world scenarios; they might be 2D or 3D environments, simulated worlds, or a game-based scenario.
    • It relies on learning objectives to reach a goal.
    • It obtains rewards from the available environment.

    The reinforcement learning cycle is depicted below.

    reinforcement loop example
    Figure 1. Reinforcement learning cycle.


    How the Reward System Works

    Rewards work by offering points when single or multiple agents transition from one state to another during interaction with their environment. These points are known as rewards. The more we train, the more rewards we receive, and thus the more accurate the system becomes. Environments can have many different features, as explained below.
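
    The interaction can be summarized in a few lines of Python. The env and agent objects below are hypothetical placeholders that follow the common reset()/step() pattern; they do not refer to any specific library.

    # Minimal sketch of the reward loop: the agent acts, the environment responds
    # with a new state and a positive or negative reward, and the agent learns.
    def run_episode(env, agent, max_steps=1000):
        state = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            action = agent.choose_action(state)
            next_state, reward, done = env.step(action)
            agent.learn(state, action, reward, next_state)
            total_reward += reward
            state = next_state
            if done:
                break
        return total_reward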

    Agents

    Agents are software routines that make intelligent decisions. An agent should be able to perceive what is happening around it in the environment and, based on that perception, make decisions that result in actions. Ideally, the action that the agent performs is the optimal one available. Software agents might be autonomous, or might work together with other agents or people.

    flowchart
    Figure 2. Flow chart showing the environment workflow.

    Environments

    Environments determine the parameters within which the agent interacts with its world. The agent must adapt to the environmental factors in order to learn. Environments may be a 2D or 3D world or grid.

    Some important features of environments:

    a) Deterministic

    b) Observable

    c) Discrete or continuous

    d) Single or multiagent

    Each of these features is explained below.

    Deterministic

    If we can logically infer and predict what will happen in an environment based on inputs and actions, the case is deterministic. In a deterministic environment, the changes that happen are very predictable for the AI, and the reinforcement learning problem becomes easier because everything is known.

    Deterministic Finite Automata (DFA)

    In automata theory, a system is described as "DFA" if each of its transitions is uniquely determined by its source state and input symbol. Reading an input symbol is required for each state-transition. Such systems work through a finite number of steps and can only perform one action for a state.

    Non-Deterministic Finite Automata (NDFA)

    If we are working in a scenario where it cannot be guaranteed which exact state the machine will move into, then it is described as "NDFA." There is still a finite number of steps, but the transitions are not unique.
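
    A small, hypothetical Python example makes the distinction concrete: in the deterministic table below, every (state, input symbol) pair maps to exactly one next state, whereas an NDFA would allow a pair to map to a set of possible next states.

    # Deterministic transition table: one unique next state per (state, symbol) pair.
    TRANSITIONS = {
        ("locked", "coin"): "unlocked",
        ("locked", "push"): "locked",
        ("unlocked", "push"): "locked",
        ("unlocked", "coin"): "unlocked",
    }

    def run_dfa(start_state, inputs):
        state = start_state
        for symbol in inputs:
            state = TRANSITIONS[(state, symbol)]   # the outcome is fully predictable
        return state

    print(run_dfa("locked", ["coin", "push"]))     # prints "locked"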

    Observable

    If we can say the environment around us is fully observable, that environment is suitable for implementing reinforcement learning. If you consider a chess game, the environment is predictable, with a finite number of potential moves. In contrast, a poker game is not fully observable, because the next card is unknown.

    Discrete or continuous

    Continuing with the chess/poker scenarios, when the set of possible moves or plays is finite and countable, the environment is in a discrete state. When the possible states or actions span a continuous range of values, we call the environment continuous.

    Single or multiagent

    Solutions in reinforcement learning can use single agents or multiple agents. When we are dealing with non-deterministic problems, we often use multiagent reinforcement learning. The key to understanding reinforcement learning is in how we use the learning techniques. In multiagent solutions, the number of interactions between agents and their environments is enormous, so the key is understanding what kind of information is generally available.

    Single agents struggle to converge in dynamic environments, so when convergence is an issue in reinforcement learning it is often handled by multiple agents. In multiagent models, each agent's goals and actions impact the environment.

    The following figures depict the differences between single-agent and multiagent models.

    diagram
    Figure 3. Single-agent system.

    diagram
    Figure 4. Multiagent system.


    Getting Started

    We will be using the Unity Integrated Development Engine (IDE) to demonstrate reinforcement learning in game-based simulations. After creating the simulation from scratch, we will use Unity ML-Agents to showcase how reinforcement learning is implemented in the created project and observe how accurate the results are.

    Step 1: Create the environment

    To start, we will create an environment for the Intel Distribution for Python.

    Prerequisites

    Make sure you have the Anaconda* IDE installed. Anaconda is a free and open-source distribution of the Python programming language for data science and machine learning-related applications. Through Anaconda, we can install different Python libraries which are useful for machine learning.

    The download link is here: https://www.anaconda.com/download/.

    The command to create a new environment with an Intel build is shown below.

    conda create -n idp intelpython3_core python=3

    Once the environment has been created and its packages installed, we proceed to step 2.

    Step 2: Activate the environment

    Now we will activate the environment. The command is shown below (on Windows*, omit source and simply run activate idp).

    source activate idp

    Step 3: Inspect the environment

    As we have activated the environment, let us check the Python version. (It should reflect the Intel one.)

    (idp) C:\Users\abhic>python

    Python 3.6.3 |Intel Corporation| (default, Oct 17 2017, 23:26:12) [MSC v.1900 64 bit (AMD64)] on win32

    Type "help", "copyright", "credits" or "license" for more information.

    Intel® Distribution for Python is brought to you by Intel Corporation.

    Please see: https://software.intel.com/en-us/python-distribution

    Step 4: Clone the GitHub* repository

    We need to clone or copy the Unity ML repo from the GitHub* link while inside the activated Intel-optimized environment (i.e., named idp). To clone the repo, we use the following command:

    (idp) C:\Users\abhic\Desktop>git clone https://github.com/Unity-Technologies/ml-agents.git

    Step 5: Install requirements

    Once the repository is cloned, we need to install certain requirements. The requirements.txt file is found in the python subdirectory.

    (idp) C:\Users\abhic\Desktop\ml-agents\python>pip install -r requirements.txt

    This will install the mandatory dependencies.
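
    As an optional sanity check (not part of the original walkthrough), you can confirm that TensorFlow imports cleanly inside the idp environment before moving on; the reported version should match the 1.4 release pinned by requirements.txt.

    (idp) C:\Users\abhic\Desktop\ml-agents\python>python -c "import tensorflow as tf; print(tf.__version__)"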

    Step 6: Create the build

    The build is created inside the Unity IDE and the executable is generated. The crawler executable is shown below.

    3d objects moving across plane
    Figure 5. Crawler executable before training.

    Step 7: Optimize the build

    To make the training go faster with Intel Distribution for Python, issue the following command from the Python subdirectory:

    (idp) C:\Users\abhic\Desktop\ml-agents\python>python learn.py manisha.exe --run-id=manisha --train

    Once the training has completed a full run, we get the bytes file needed by the brain, which is a child object of Academy:

    INFO: unityagents: Saved Model
    INFO: unityagents: Ball3DBrain: Step: 50000. Mean Reward: 100.000. STD of Reward: 0.000.
    INFO: unityagents: Saved Model
    INFO: unityagents: Saved Model
    INFO: unityagents: List of nodes to export:
    INFO: unityagents:       action
    INFO: unityagents:       value_estimate
    INFO: unityagents:       action_probs
    INFO: tensorflow: Restoring parameters from ./models/manisha\model-50000.cptk
    INFO: tensorflow: Restoring parameters from ./models/manisha\model-50000.cptk
    INFO: tensorflow: Froze 15 variables.
    INFO: tensorflow:Froze 15 variables.
    Converted 15 variables to const ops.
    

    The byte file we generated is used for making the simulation work with machine learning.


    Advantages of Using Intel® Distribution for Python*

    Standard Python is limited when it comes to multithreading: the global interpreter lock allows only one thread to execute Python bytecode at a time, so pure-Python threads cannot run Python code in parallel. Intel Distribution for Python features thread-enabled libraries, so consider Intel® Threading Building Blocks (Intel® TBB) as a potential tool for multithreading.

    The advantages of using Intel Distribution for Python with Unity ML-Agents are as follows:

    • The training process is much faster.
    • The CPU version of TensorFlow involves less overhead.
    • Handling multiple agents using the Intel-optimized pipeline is easier and faster.


    Unity* ML-Agents v 0.3

    Unity ML-Agents are constantly evolving, with updates responding to community feedback. Version 0.3 adds support for imitation learning, which is different from reinforcement learning. The most common imitation learning method is "behavioral cloning," in which a neural network (often a convolutional neural network, or CNN) is trained to replicate a behavior, such as in a self-driving car environment, where the goal is for the system to drive the car as humans do.
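
    For readers who want to see what behavioral cloning looks like in code, here is a minimal, self-contained sketch that treats it as supervised learning on recorded state-action pairs. The .npy file names, network sizes, and training settings are illustrative assumptions and are not part of the ML-Agents workflow.

    import numpy as np
    import tensorflow as tf

    # Hypothetical demonstration data: observations recorded from a human
    # demonstrator, and the actions the demonstrator took in those observations.
    states = np.load("demo_states.npy")      # shape: (num_samples, observation_size)
    actions = np.load("demo_actions.npy")    # shape: (num_samples, action_size)

    # Behavioral cloning is plain supervised learning: regress the demonstrated
    # action from the observed state.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(states.shape[1],)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(actions.shape[1]),
    ])
    model.compile(optimizer="adam", loss="mse")
    model.fit(states, actions, epochs=10, batch_size=64)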


    Imitation Learning

    Generally, when we talk about "imitation learning," we refer to learning by demonstration. The demonstrations provide behavior patterns that are analyzed and turned into learning signals for the agent. The table below summarizes the differences between imitation learning and reinforcement learning.

    Imitation learning versus reinforcement learning

    Imitation learning:

    • Learning happens through demonstration.
    • No mechanism for rewards and punishments is required; rewards are not necessary.
    • Generally evolved for real-time interaction.
    • After training, the agent becomes “human-like” at performing a task.

    Reinforcement learning:

    • Learning happens through rewards and punishments.
    • Based on trial-and-error methods.
    • Specifically meant for speedy simulation methods.
    • After training, the agent becomes “optimal” at performing a task.


    TensorFlowSharp

    In this section, we will cover some basics of TensorFlowSharp. First released in 2015, TensorFlow is Google's open-source library for dataflow programming and a framework for deep learning. TensorFlow doesn't provide a native C# API, and because the ML-Agents internal brain is written in C#, it is not natively supported. To enable the internal brain for machine learning, we need to use TensorFlowSharp, a third-party library whose specific purpose is to bind the .NET framework to TensorFlow.


    Running the Examples

    We will now go through an example of a Unity ML-Agents project to implement imitation learning. The process will involve the following steps.

    1. Include the TensorFlowSharp Unity Plugin.
    2. Launch Unity IDE.
    3. Find the example folder, which is inside Assets. There is a subfolder within the ML-Agents project folder named "Examples." We will work with the example named Crawler. Every change will occur inside this folder.

    As we are working to create an environment for training, we will have to set the brain used by the agents to External. Doing this will allow the agents to communicate with the external training process when they are trying to make decisions.

    We are exploring the example project Crawler. The setup is a creature with four limbs, from each of which extends a shorter limb, or forearm (see Figure 5 above). For this scenario to be successful, we have the following goal: The agent must move its body along the x axis without falling.

    We need to set some parameters to fine-tune the example. The environment contains three agents linked to a single brain. Inside Academy, locate the child object "CrawlerBrain" within the Inspector window. Set the Brain type to External.

    Next, open Player Settings:

    1. Go to Edit > Project Settings > Player.
    2. Go to Resolution and Presentation.

    Check that "Run in Background" is selected and that the display resolution dialog is set to "Disabled." Finally, click "Build," save the build within the python folder, and name it "Manisha1."

    Unity interface
    Figure 6. Saving the build within the Python* folder.


    Train the Brain

    Now we will train the brain. To open the Anaconda prompt, use the Windows search option and type Anaconda. Once inside the Anaconda prompt, we need to find out which environments are available.

    (base) C:\Users\abhic>conda info --envs

    # conda environments:
    #
    base                  *  C:\ProgramData\Anaconda3
    idp                      C:\Users\abhic\AppData\Local\conda\conda\envs\idp
    tensorflow-gpu           C:\Users\abhic\AppData\Local\conda\conda\envs\tensorflow-gpu
    tf-gpu                   C:\Users\abhic\AppData\Local\conda\conda\envs\tf-gpu
    tf-gpu1                  C:\Users\abhic\AppData\Local\conda\conda\envs\tf-gpu1
    tf1                      C:\Users\abhic\AppData\Local\conda\conda\envs\tf1
    

    We will pass the following command:

    (base) C:\Users\abhic>activate idp

    Intel Distribution for Python and Intel Optimization for TensorFlow are installed in the idp environment. Next, we activate idp and change to the cloned folder on the desktop.

    (idp) C:\Users\abhic\Desktop\ml-agents>

    As we have saved the .exe file in the Python subdirectory, we will locate it there.

    (idp) C:\Users\abhic\Desktop\ml-agents>cd python

    Using the directory command dir we can list the items in the Python subfolder:

    We display the contents of the folder to make it easier to identify the files that reside in the python subfolder. This subfolder is important because the default training code and other supporting libraries reside here, so it is the most efficient place to make changes to the way we train the machine learning agents. Because we created the build for the Unity scene, an executable file named "manisha1.exe" was generated, along with a data folder named "manisha1_Data."

    Directory of C:\Users\abhic\Desktop\ml-agents\python

    28-05-2018  06:28    <DIR>          .
    28-05-2018  06:28    <DIR>          ..
    21-05-2018  11:34             6,635 Basics.ipynb
    21-05-2018  11:34    <DIR>          curricula
    21-05-2018  11:34             2,685 learn.py
    29-01-2018  13:48           650,752 manisha.exe
    29-01-2018  13:24           650,752 manisha1.exe
    28-05-2018  06:28    <DIR>          manisha1_Data
    21-05-2018  11:58    <DIR>          manisha_Data
    21-05-2018  12:00    <DIR>          models
    21-05-2018  11:34               101 requirements.txt
    21-05-2018  11:34               896 setup.py
    21-05-2018  12:00    <DIR>          summaries
    21-05-2018  11:34    <DIR>          tests
    21-05-2018  11:34             3,207 trainer_config.yaml
    21-05-2018  12:00                24 unity-environment.log
    21-05-2018  12:00    <DIR>          unityagents
    29-01-2018  13:55        36,095,936 UnityPlayer.dll
    21-05-2018  12:00    <DIR>          unitytrainers
    18-01-2018  04:44            42,704 WinPixEventRuntime.dll
                  10 File(s)     37,453,692 bytes
                  10 Dir(s)  1,774,955,646,976 bytes free

    Look inside the subdirectory to locate the executable "manisha1." We are now ready to use Intel Distribution for Python and Intel Optimization for TensorFlow to train the model. For training, we will use the learn.py file. The command for using Intel-optimized Python is shown below.

    (idp) C:\Users\abhic\Desktop\ml-agents\python>python learn.py manisha1.exe --run-id=manisha1 --train

    INFO:unityagents:{'--curriculum': 'None',
     '--docker-target-name': 'Empty',
     '--help': False,
     '--keep-checkpoints': '5',
     '--lesson': '0',
     '--load': False,
     '--run-id': 'manisha1',
     '--save-freq': '50000',
     '--seed': '-1',
     '--slow': False,
     '--train': True,
     '--worker-id': '0',
     '': 'manisha1.exe'}
    INFO:unityagents:
    'Academy' started successfully!
    Unity Academy name: Academy
            Number of Brains: 1
            Number of External Brains: 1
            Lesson number: 0
            Reset Parameters:
    Unity brain name: CrawlerBrain
            Number of Visual Observations (per agent): 0
            Vector Observation space type: continuous
            Vector Observation space size (per agent): 117
            Number of stacked Vector Observation: 1
            Vector Action space type: continuous
            Vector Action space size (per agent): 12
            Vector Action descriptions: , , , , , , , , , , ,
    2018-05-28 06:57:56.872734: I C:\tf_jenkins\home\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    C:\Users\abhic\AppData\Local\conda\conda\envs\idp\lib\site-packages\tensorflow\python\ops\gradients_impl.py:96: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
    "Converting sparse IndexedSlices to a dense Tensor of unknown shape."
    INFO: unityagents: Hyperparameters for the PPO Trainer of brain CrawlerBrain:
            batch_size:     2024
            beta:   0.005
            buffer_size:    20240
            epsilon:        0.2
            gamma:  0.995
            hidden_units:   128
            lambd:  0.95
            learning_rate:  0.0003
            max_steps:      1e6
            normalize:      True
            num_epoch:      3
            num_layers:     2
            time_horizon:   1000
            sequence_length:        64
            summary_freq:   3000
            use_recurrent:  False
            graph_scope:
            summary_path:   ./summaries/manisha1
            memory_size:    256
    INFO:unityagents: CrawlerBrain: Step: 3000. Mean Reward: -5.349. Std of Reward: 3.430.
    INFO:unityagents: CrawlerBrain: Step: 6000. Mean Reward: -4.651. Std of Reward: 4.235.
    The parameters above set up the training process. After the training process is complete (it can be lengthy) we get the following details in the console:
    INFO: unityagents: Saved Model
    INFO: unityagents: CrawlerBrain: Step: 951000. Mean Reward: 2104.477. Std of Reward: 614.015.
    INFO: unityagents: CrawlerBrain: Step: 954000. Mean Reward: 2203.703. Std of Reward: 445.340.
    INFO:unityagents: CrawlerBrain: Step: 957000. Mean Reward: 2205.529. Std of Reward: 531.324.
    INFO:unityagents: CrawlerBrain: Step: 960000. Mean Reward: 2247.108. Std of Reward: 472.395.
    INFO:unityagents: CrawlerBrain: Step: 963000. Mean Reward: 2204.579. Std of Reward: 554.639.
    INFO:unityagents: CrawlerBrain: Step: 966000. Mean Reward: 2171.968. Std of Reward: 547.745.
    INFO:unityagents: CrawlerBrain: Step: 969000. Mean Reward: 2154.843. Std of Reward: 581.117.
    INFO:unityagents: CrawlerBrain: Step: 972000. Mean Reward: 2268.717. Std of Reward: 484.157.
    INFO:unityagents: CrawlerBrain: Step: 975000. Mean Reward: 2244.491. Std of Reward: 434.925.
    INFO:unityagents: CrawlerBrain: Step: 978000. Mean Reward: 2182.568. Std of Reward: 564.878.
    INFO:unityagents: CrawlerBrain: Step: 981000. Mean Reward: 2315.219. Std of Reward: 478.237.
    INFO:unityagents: CrawlerBrain: Step: 984000. Mean Reward: 2156.906. Std of Reward: 651.962.
    INFO:unityagents: CrawlerBrain: Step: 987000. Mean Reward: 2253.490. Std of Reward: 573.727.
    INFO:unityagents: CrawlerBrain: Step: 990000. Mean Reward: 2241.219. Std of Reward: 728.114.
    INFO:unityagents: CrawlerBrain: Step: 993000. Mean Reward: 2264.340. Std of Reward: 473.863.
    INFO:unityagents: CrawlerBrain: Step: 996000. Mean Reward: 2279.487. Std of Reward: 475.624.
    INFO:unityagents: CrawlerBrain: Step: 999000. Mean Reward: 2338.135. Std of Reward: 443.513.
    INFO:unityagents:Saved Model
    INFO:unityagents:Saved Model
    INFO:unityagents:Saved Model
    INFO:unityagents:List of nodes to export :
    INFO:unityagents:       action
    INFO:unityagents:       value_estimate
    INFO:unityagents:       action_probs
    INFO:tensorflow:Restoring parameters from ./models/manisha1\model-1000000.cptk
    INFO:tensorflow:Restoring parameters from ./models/manisha1\model-1000000.cptk
    INFO:tensorflow:Froze 15 variables.
    INFO:tensorflow:Froze 15 variables.
    Converted 15 variables to const ops.
    

    Integration of the Training Brain with the Unity Environment

    The idea behind using the Intel® Distribution for Python* is to improve the training process. Some examples will require more time to complete training because of the large number of steps involved.

    Since TensorFlowSharp is still in the experimental phase, it is disabled by default. To enable TensorFlow and make the internal brain available, follow these steps:

    1. Make sure that the TensorFlowSharp plugin is present in the Assets folder. Within the Project tab, navigate to this path: Assets->ML-Agents->Plugins->Computer.
    2. Open Edit->Project Settings->Player to enable TensorFlow and .NET support. Set Scripting Runtime Version to Experimental (.NET 4.6 Equivalent).
    3. In Scripting Define Symbols, add the following text: ENABLE_TENSORFLOW.
    4. Press the Enter key and save the project.


    Bringing the Trained Model to Unity

    After the training process is over, the TensorFlow framework creates a bytes file for the project. Locate the model created during the training process under crawler/models/manisha1.

    The name of the bytes file is derived from the executable built for the Crawler scene: when training is complete, a file named after the environment is generated with the .bytes extension.

    If "manisha1.exe" is the executable file, then the byte file generated will be "manisha1.bytes," which follows the convention <env-name>.bytes.

    1. Copy the generated bytes file from the models folder to the TF Models subfolder.
    2. Open up the Unity IDE and select the crawler scene.
    3. Select the brain from the scene hierarchy.
    4. Change the type of brain to internal.
    5. Drag the .bytes file from the project folder to the graph model placeholder in the brain inspector, and hit play to run it.

    The output should look similar to the screenshot below.

    Unity interface
    Figure 7. Executable created without the internal brain activated.

    We then build the project with the internal brain. An executable is generated.

    Unity interface
    Figure 8. The executable after building the project with the internal brain.


    Summary

    Unity and Intel are lowering the entry barrier for game developers who seek more compelling AI behavior to boost immersion. Intelligent agents, each acting with dynamic and engaging behavior, offer promise for more realism and better user experiences. The use of reinforcement learning in game development is still in its early phase, but it has the potential to be a disruptive technology. Use the techniques and resources listed in this article to get started creating your own advanced game-play, and see what the excitement is all about.


    Resources


    Using Intel® Xeon® processors for Multi-node Scaling of TensorFlow* with Horovod*


    TensorFlow* is one of the leading Deep Learning (DL) and machine learning frameworks today. In 2017, Intel worked with Google* to incorporate optimizations for Intel® Xeon® processor-based platforms using the Intel® Math Kernel Library (Intel® MKL)[4]. Optimizations such as these with multiple popular frameworks have led to orders of magnitude improvement in performance—up to 127 times[2] higher performance for training and up to 198 times[1] higher performance for inference. For TensorFlow, Intel updated the optimizations and performance results for a number of DL models running on the Intel® Xeon® Scalable processor[2][3].

    Intel has mainly reported Intel® Optimization for TensorFlow performance improvements on single nodes[2][3]. However, some complex DL models train more efficiently using multi-node training configurations: either they don't fit in one machine, or their time-to-train can be significantly reduced by training on a cluster of machines. Therefore, Intel has also performed scaling studies on multi-node clusters of Intel Xeon Scalable processors. This article describes distributed training performance on a cluster of Intel® Xeon® platforms using a Horovod*-based configuration option for the TensorFlow framework.

    Horovod, which was developed by Uber*, uses the Message Passing Interface (MPI) as the main mechanism of communication. It uses MPI concepts such as allgather and allreduce to handle the cross-replicas communication and weight updates. OpenMPI* can be used with Horovod to support these concepts. Horovod is installed as a separate Python* package. By calling Horovod's API from the Deep Learning Neural Network's model script, a regular build of TensorFlow can be used to run distributed training. With Horovod, there is no additional source code change required in TensorFlow to support distributed training with MPI.
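    For reference, the Horovod additions to a TensorFlow 1.x training script amount to only a few lines. The sketch below is illustrative only and is not taken from the benchmark code; the learning-rate value and checkpoint directory are placeholders:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    hvd.init()  # one Horovod rank per MPI process

    # Scale the learning rate by the number of workers (a common Horovod convention).
    opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt)  # wraps the optimizer with MPI allreduce

    # Broadcast the initial variables from rank 0 so all workers start identically,
    # and let only rank 0 write checkpoints.
    hooks = [hvd.BroadcastGlobalVariablesHook(0)]
    checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None

    # train_op = opt.minimize(loss)  # loss comes from the user's model definition
    # with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
    #                                        hooks=hooks) as sess:
    #     while not sess.should_stop():
    #         sess.run(train_op)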

    Scaling Results Using Uber Horovod* with TensorFlow* 1.7

    In this section, we show the performance numbers of Intel Xeon processor optimizations for TensorFlow 1.7 + ResNet-50* and Inception-v3* training, running on up to 64 nodes containing Intel® Xeon® Gold processors. A real training dataset was used to perform these runs. As shown in the charts below, by running one MPI process per node, ResNet-50 was able to maintain at least 89.1 percent scalability for up to 64 nodes, while Inception-v3 could achieve at least 89.4 percent[5]. So, with the higher throughput for ResNet-50 and Inception-v3, time to train is reduced significantly. Although this study shows the scaling for up to 64 nodes, it is expected that the same scalability rate would carry over to 128 nodes.

    Performance Scaling for ResNet 50 and InceptionV3
    Figure 1. Up to 89 percent (ResNet-50* and Inception-v3*) of scaling efficiency for TensorFlow* 1.7 can be achieved for 64 nodes of Intel® Xeon® Gold processors using one MPI process/node.

    The user can also run the same models by having two MPI processes running on each node. As shown in the charts below, we can get up to 17 percent and 24 percent performance improvements for ResNet-50 and Inception-v3, respectively[5], with no extra hardware cost. Please note that the batch size per node remains the same as what we used for running one MPI process per node.

    Model           Batch Size per Node    Gain of TensorFlow* with Horovod* versus without Horovod on Two Sockets
    ResNet-50*      128                    17%
    Inception-v3*   128                    24%

    Thus, by running two MPI processes per node as shown in the two graphs below, ResNet-50 was able to maintain at least 94.1 percent scalability for up to 64 nodes, while Inception-v3 could achieve at least 87.4 percent[5]. So, with higher throughput for ResNet-50 and Inception-v3, time to train is reduced significantly, even more than when using one MPI process per node.

    Performance Scaling for ResNet 50 and InceptionV3
    Figure 2. Up to 94 percent of scaling (parallel efficiency) can be achieved for TensorFlow* 1.7 for 64 Intel® Xeon® Gold processors, using two MPI processes/node.

    Gathering and Installing Relevant Software Tools

    1. OpenMPI can be installed via the Yellowdog Updater, Modified* (YUM) software on recent versions of CentOS*. Some existing clusters already have OpenMPI available. In this article, we use OpenMPI 3.0.0, which can be installed following the instructions at Open MPI: Version 3.0.

    2. A recent GNU Compiler Collection* (GCC) is needed; GCC version 6.2 or newer is recommended. See GCC, the GNU Compiler Collection for the latest installation.

    3. Python versions 2.7 and 3.6 have both been tested.

    4. Uber Horovod supports running TensorFlow in distributed fashion. Horovod can be installed as a standalone Python package as follows:

    pip install --no-cache-dir horovod (for example, horovod-0.11.3)

    Alternatively, Horovod can be installed from source.

    5. The current TensorFlow benchmarks have recently been modified to use Horovod. You can obtain the benchmark code from GitHub* and run tf_cnn_benchmarks.py as explained below.

    $ git clone https://github.com/tensorflow/benchmarks
    $ cd benchmarks/scripts/tf_cnn_benchmarks
    $ python tf_cnn_benchmarks.py

    Running TensorFlow Benchmark Using Horovod with TensorFlow

    Here, we discuss the commands needed to run distributed TensorFlow using the Horovod framework. For the hardware platform, we use a dual-socket Intel® Xeon® Gold 6148 processor-based cluster system. For networking, 10 Gb Ethernet is used; Mellanox InfiniBand* or Intel® Omni-Path Architecture (Intel® OPA) can also be used for networking the cluster.

    Running two MPI processes on a single node:

    export LD_LIBRARY_PATH=<path to OpenMPI lib>:$LD_LIBRARY_PATH
    export PATH=<path to OpenMPI bin>:$PATH
    export inter_op=2
    export intra_op=18 {# cores per socket}
    export batch_size=64
    export MODEL=resnet50 {or inception3}
    export python_script={path to tf_cnn_benchmarks.py script}

    mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by socket --oversubscribe --report-bindings -n 2 python $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>
    
    

    For one MPI process per node, the configuration is as follows. The other environment variables will be the same:

    export intra_op=38
    export batch_size=128 
    
    mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS --bind-to none --report-bindings  -n 1 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>
    

    Please note that if you want to train models to achieve good accuracy, use the configuration flag --distortions=True. Other hyperparameters may also need to be adjusted.

    To run a model on a multi-node cluster, a script similar to the one above is used. For example, to run on a 64-node cluster (two MPI processes per node), where each node is based on the Intel Xeon Gold 6148 processor, the distributed training can be launched as shown below. All the exports are the same as above:

    mpirun -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -cpus-per-proc 20 --map-by node  --report-bindings -hostfile host_names  -n 128 python  $python_script --mkl=True --forward_only=False --num_batches=200 --kmp_blocktime=0 --num_warmup_batches=50 --num_inter_threads=$inter_op --distortions=False --optimizer=sgd --batch_size=$batch_size --num_intra_threads=$intra_op --data_format=NCHW --model=$MODEL --variable_update horovod --horovod_device cpu --data_dir <path-to-real-dataset> --data_name <dataset_name>

    Here, the host_names file is the list of hosts that you want to run the workload on.
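    As an illustration, an OpenMPI host file simply lists one node per line; the node names below are hypothetical, and the optional slots value caps the number of MPI ranks placed on each node:

    node001 slots=2
    node002 slots=2
    node003 slots=2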

    What Distributed TensorFlow Means for Deep Learning Training on Intel® Xeon® Processors

    Various efforts have been made to implement distributed TensorFlow on CPUs and graphics processing units; for example, Remote Procedure Call (gRPC), Remote Direct Memory Access (RDMA), and TensorFlow's built-in MPI support—all of these technologies are incorporated within the TensorFlow codebase. Uber Horovod is one distributed TensorFlow technology that was able to harness the power of Intel Xeon processors. It uses MPI underneath, with ring-based allreduce and allgather for the DL parameters. As shown above, Horovod on Intel Xeon processors demonstrates great scaling for existing DL benchmark models, such as ResNet-50 (up to 94 percent) and Inception-v3 (up to 89 percent) for 64 nodes[5]. In other words, time to train a DL network can be accelerated by as much as 57 times (ResNet-50) and 58 times (Inception-v3) using 64 Intel Xeon processor nodes, compared to a single Intel Xeon processor node. Thus, Intel recommends that TensorFlow users adopt Intel® Optimization for TensorFlow and Horovod with MPI for multi-node training on Intel Xeon Scalable processors.

    Acknowledgements

    The authors (Mahmoud Abuzaina, Ashraf Bhuiyan, Wei Wang) would like to thank Vikram Saletore, Mikhail Smorkalov, and Srinivas Sridharan for their collaboration with us on using Horovod with TensorFlow.

    References

    1. Performance is reported at Amazing Inference Performance with Intel® Xeon® Scalable Processors

    2. The results are reported at TensorFlow* Optimizations on Modern Intel® Architecture

    3. The updated results are in TensorFlow* Optimizations for the Intel® Xeon® Scalable Processor

    4. Refer to GitHub for more details on Intel® MKL-DNN optimized primitives

    5. System configuration

    TensorFlow* Source Code: https://github.com/tensorflow/tensorflow
    TensorFlow Commit ID: 024aecf414941e11eb643e29ceed3e1c47a115ad
    CPU:
       Thread(s) per core: 2
       Core(s) per socket: 20
       Socket(s): 2
       NUMA node(s): 2
       CPU family: 6
       Model: 85
       Model name: Intel® Xeon® Gold 6148 Processor @ 2.40GHz
       Stepping: 4
       Hyper-Threading: ON
       Turbo: ON
    Memory: 192GB (12 x 16GB), 2666 MT/s
    Disks: Intel RS3WC080 x 3 (800GB, 1.6TB, 6TB)
    BIOS: SE5C620.86B.00.01.0013.030920180427
    OS: Red Hat* Enterprise Linux* Server release 7.4 (Maipo)
    Kernel: 3.10.0-693.21.1.0.1.el7.knl1.x86_64

    Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown". Implementation of these updates may make these results inapplicable to your device or system.

    Use Machine Learning to Detect Defects on the Steel Surface


    Definition

    Project overview

    Surface quality is an essential parameter for steel sheet. In the steel industry, manual defect inspection is a tedious assignment; consequently, it is difficult to guarantee a flawless steel surface. To meet user requirements, vision-based automatic steel surface inspection strategies have proven to be exceptionally powerful and prevalent solutions over the past two decades1.

    The input is taken from the NEU surface defect database2, which is available online. This database contains six types of defects including crazing, inclusion, patches, pitted surface, rolled-in-scale, and scratches.

    Problem statement

    The challenge is to provide an effective and robust approach to detect and classify metal defects using computer vision and machine learning.

    Image processing techniques such as filtering, together with extracting features from the image, provide a good basis for training a model that can determine which type of defect the steel plate has. This solution can even be used in real-time applications.

    Metrics

    The evaluation is done using the accuracy metric. The accuracy of the system is given by:

    Accuracy = (number of correctly classified samples) / (total number of samples)

    Because the classes are balanced, accuracy is an appropriate metric to evaluate the project. The accuracy tells us about how well the algorithm is classifying the defects.

    Analysis

    Data exploration

    The NEU surface dataset2 contains 300 images of each of six defect types (a total of 1,800 images). Each image is 200 × 200 pixels. The images in the dataset are gray-level images in .bmp format, about 40.1 KB each. A few samples are shown in figure 1.

    samples of defects
    Figure 1. Samples of defects (a) crazing, (b) inclusion, (c) patches, (d) pitted surface, (e) rolled-in-scale, and (f) scratches.

    Exploratory visualization

    The following chart shows the histogram of images per class.

    histograms of sample defects
    Figure 2. Histogram samples of defects: (a) crazing, (b) inclusion, (c) patches, (d) pitted surface, (e) rolled-in-scale, and (f) scratches.

    An image histogram is a graphical representation of the tonal distribution in a digital image. The horizontal axis of the graph represents the intensity values; the vertical axis represents the number of pixels with that particular intensity. A histogram therefore gives an idea of the contrast of the image, which is used as a feature. From figure 2 it is observed that the histogram of each class is visually distinguishable, which makes contrast an important feature to include in the feature vector.

    As said earlier, the classes are well balanced, justifying accuracy as an evaluation metric.

    Algorithms and techniques

    Different classifiers such as k-nearest neighbors (KNN), support vector classifier (SVC), gradient boosting, random forest classifier, AdaBoost (adaptive boosting), and decision trees will be compared.

    Texture features such as contrast, dissimilarity, homogeneity, energy, and asymmetry will be extracted from the gray-level co-occurrence matrix (GLCM), and used for training the classifiers.
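    As a rough sketch of this feature-extraction step (not the author's exact code), the GLCM properties can be computed with scikit-image. The "asymmetry" feature is approximated here by the GLCM ASM property, and newer scikit-image releases rename greycomatrix/greycoprops to graycomatrix/graycoprops:

    import numpy as np
    from skimage.io import imread
    from skimage.util import img_as_ubyte
    from skimage.feature import greycomatrix, greycoprops

    PROPS = ['contrast', 'dissimilarity', 'homogeneity', 'energy', 'ASM']

    def glcm_features(path):
        img = img_as_ubyte(imread(path, as_gray=True))          # 200 x 200 gray image
        glcm = greycomatrix(img,
                            distances=[1, 2, 3, 4],             # four distances
                            angles=[0, np.pi/4, np.pi/2, 3*np.pi/4],  # 0, 45, 90, 135 degrees
                            levels=256, symmetric=True, normed=True)
        # Each property yields a (distances x angles) array; flatten them into one vector.
        return np.hstack([greycoprops(glcm, p).ravel() for p in PROPS])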

    SVM

    SVMs are classified into linear and nonlinear variants. Data that are linearly separable can be separated with a hyperplane, while data that are not linearly separable are handled by a nonlinear classifier that maps the input pattern into a higher-dimensional feature space using a kernel function, such as a higher-order polynomial. The SVM classification algorithm is based on different kernel methods, such as the radial basis function (RBF), linear, and quadratic kernel functions. The RBF kernel is applied to two samples, x and x', which represent feature vectors in some input space, and can be defined as:

    K(x, x') = exp(-||x - x'||^2 / (2σ^2))

    The value of the kernel function decreases with distance, ranging between zero (in the limit) and one (when x = x').

    graph
    Figure 3. Hyperplane in feature space.

    AdaBoost algorithm

    Input: Data set D = { (x1 , y1) ,( x2 , y2) ,......,(xm , ym) }

    Base learning algorithm L; number of learning rounds T.

    Algorithm:

    Initialize the weight distribution: D1(i) = 1/m.

    for t = 1,...,T;

    Train a learner ht from D using distribution Dt: ht = L(D, Dt);

    Measure the error of ht: equation

    If εt > 0.5 then break

    Find weak classifier ht(x) using a perturbed empirical distribution: equation

    Update the distribution, where Zt is a normalization factor that ensures D(t+1) is a valid distribution:

    equation

    K-Nearest neighbor algorithm

    1. A value of K is defined (K>0), along with the new data sample.
    2. We select the K entries in our database that are near the new testing sample.
    3. We find the most common classification among these entries.
    4. This is the classification we give to the new sample using the value of K.
    5. If the result is not adequate, change the value of K until the reasonable level of correctness is achieved.

    Decision trees algorithm

    1. Create a root node for the tree.
    2. If all examples are positive, return leaf node ‘positive’.
    3. Else if all examples are negative, return leaf node ‘negative’.
    4. Calculate the entropy of the current state.
    5. For each attribute, calculate the entropy with respect to the attribute ‘x’.
    6. Select the attribute that has maximum value of information gain (IG).
    7. Remove the attribute that offers highest IG from the set of attributes.
    8. Repeat until we run out of all attributes or the decision tree has all leaf nodes.

    Random Forest

    Random forest is an ensemble of decision trees. It avoids the over-fitting that is usually seen when a single decision tree is fit to the entire dataset.

    equation

    Benchmark

    I uploaded to GitHub* a basic model that uses the KNN algorithm to classify the images; it achieves 75.27 percent accuracy. This is the benchmark model whose accuracy I will try to improve upon. The link is provided at the steel_plate repository.

    Methodology

    Data preprocessing

    No preprocessing is used on the input images, as the defects of the steel plate heavily depend on the texture of its surface and, as we are using textural features, any preprocessing method such as smoothing or sharpening will change its texture.

    Implementation

    The following flowchart represents the entire workflow of the project.

    workflow chart
    Figure 4. Project workflow

    The project starts with loading the images and extracting texture features such as contrast, dissimilarity, homogeneity, energy, and asymmetry. The features and their labels are then given to the train-test split function provided by the scikit-learn library, which splits the data and labels into 80 percent for training and 20 percent for testing.

    The 80 percent portion was used to train the different classifiers, and testing was done on the remaining 20 percent. The model that gave the highest accuracy was then selected as the main model.

    The GLCM feature extraction is given below:

    GLCM is a texture-analysis technique that characterizes the patterns in an image by representing the surface as a two-dimensional array of gray-level variations. The GLCM features are derived from the arrangement of the matrix elements, capturing the contrast of the pixels and the energy of the region of interest. The GLCM is calculated in four directions (0°, 45°, 90°, and 135°) and for four distances (1, 2, 3, and 4).

    GLCM is a well-established numerical technique for feature extraction. It records how often different combinations of pixel gray levels occur in an image. A co-occurrence matrix depicts the joint gray-level histogram of the image (or a region of the image) in the form of a matrix with dimensions Ng x Ng.

    Directional analysis graph
    Figure 5. Directional analysis of GLCM.

    The integer array specifies the distance between the pixel of interest and its neighbor. Each row in the array is a two-element vector that specifies the relationship, or displacement, of a pair of pixels. Because the offset is often expressed as an angle, the following figure lists the offset values that specify the common angles, given the pixel distance D.

    formation of a GLCM matrix
    Figure 6. Formation of GLCM matrix.

    Features used in this method are as follows: contrast, dissimilarity, homogeneity, energy, and asymmetry.

    Table 1. GLCM features.

    Sr. No.   Feature         Formula
    1.        Contrast        equation
    2.        Homogeneity     equation
    3.        Dissimilarity   equation
    4.        Energy          equation
    5.        Asymmetry       equation

    Gradient boosting is the combination of two methods: the gradient descent method and AdaBoost. It builds a model in a forward, stage-wise fashion and optimizes a differentiable loss function. The algorithm is highly customizable for a specific application. AdaBoost has the advantage that it boosts the outliers near classification boundaries, which helps to increase the accuracy of the classifier.

    The gradient boosting algorithm in detail is as follows:

    Input: Training feature set {(x_i, y_i)}, i = 1, ..., n; a differentiable loss function L(y, F(x)); and the number of iterations M.

    Algorithm

    1. Initialize model with a constant value:equation
    2. For m=1,2….., M
      • Compute so-called pseudo-residuals:equation
      • Fit a base learner hm(x) to pseudo-residuals; that is, train it using the training set equation
      • Compute multiplier γm by solving the following 1D optimization problem:equation
      • Update the model:
        equation
      • Output: equation

    Initially, smoothing or sharpening of the image was considered for preprocessing. It was later observed that such preprocessing disrupts the textural features of the image, which has a negative impact on the output of the classifier. This resolved the question of preprocessing: as mentioned in the Data preprocessing section, no preprocessing was used in this project.

    Refinement

    The selection of algorithms and parameter tuning are important aspects of machine learning. In this approach, the gradient boosting algorithm is selected, which is the combination of two machine learning approaches: gradient descent and AdaBoost. The AdaBoost component boosts the weak learners to minimize false alarms and improve accuracy. The boosting stages are fine-tuned to obtain promising accuracy.

    In the gradient boosting model, the boosting factor (n_estimators) was tweaked to 80 (from the default value of 100).

    Table 2. Hyperparameter values and accuracy.

    n_estimators    Accuracy (%)
    80              92.5
    90              92.22
    100             91.6
    110             91.38
    500             90.00

    The default value of n_estimators is 100, which gave 91.6 percent accuracy in the initial results. When the value of n_estimators is set to 80, the accuracy increases to 92.5 percent, which is our final result.
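    A minimal sketch of the 80/20 split and the tuned model is shown below, assuming X holds the GLCM feature vectors and y the defect labels (these variable names and the random seed are assumptions, not the author's code):

    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)          # 80/20 split

    clf = GradientBoostingClassifier(n_estimators=80)  # tuned down from the default of 100
    clf.fit(X_train, y_train)
    print('Test accuracy:', accuracy_score(y_test, clf.predict(X_test)))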

    Results

    The following table shows the accuracy comparison of different classifiers:

    Table 3. Performance evaluation.

    Sr. No   Classifier          Accuracy (%)
    1        KNN                 75.27
    2        AdaBoost            51.11
    3        SVC                 14.72
    4        Decision Tree       88.33
    5        Random Forest       89.44
    6        Gradient Boosting   92.50

    accuracy comparison graph
    Figure 7. Accuracy comparison graph.

    From the above table and graph we can observe that gradient boosting gives the highest accuracy of 92.5 percent. The confusion matrix of testing by using gradient boosting is given below.

    As the extracted textural features are based on GLCM, variations in light intensities may negatively affect the result of the model.

    Table 4. KFold CV results.

    Sr. No   Classifier          CV accuracy (%)                                                CV mean accuracy (%)
                                 Folds=5,            Folds=10,           Folds=15,
                                 Random state=9      Random state=70     Random state=35
    1        KNN                 75.3472             74.930556           74.722222              74.99999
    2        AdaBoost            49.375000           47.569              50.416667              49.12022
    3        SVC                 15.486111           14.305556           13.819444              14.53704
    4        Decision Tree       84.861111           85.625000           85.069444              85.18519
    5        Random Forest       87.013889           88.819444           87.083333              87.63889
    6        Gradient Boosting   87.708333           88.611111           88.750000              88.35648
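    The cross-validation columns in Table 4 can be reproduced with a sketch like the following, again assuming X and y from the feature-extraction step; the fold counts and random states mirror the table headers:

    from sklearn.model_selection import KFold, cross_val_score
    from sklearn.ensemble import GradientBoostingClassifier

    for n_splits, seed in [(5, 9), (10, 70), (15, 35)]:
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
        scores = cross_val_score(GradientBoostingClassifier(n_estimators=80), X, y, cv=cv)
        print(n_splits, 'folds:', scores.mean() * 100)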

    graph of confusion matrix
    Figure 8. Confusion matrix.

    From the confusion matrix of the gradient boosting classifier output it is seen that out of 360 testing images 333 are correctly classified and 27 are misclassified.

    Justification

    The gradient boosting classifier achieved an accuracy of 92.5 percent, which is more than the KNN benchmark model with 75.27 percent accuracy.

    In KNN the data points at the boundaries of classes can be misclassified, and this is where the gradient boosting algorithm excels over KNN for this specific problem, as weak classifiers are transformed into strong classifiers.

    Conclusion

    In the proposed system, a machine learning-based steel plate defect detection approach was implemented.

    The input images were taken from the NEU dataset2, which is freely available.

    No preprocessing was done, as mentioned in the Data preprocessing section.

    The texture features were extracted by the GLCM, and extracted features were further classified into six respective classes (crazing, inclusion, patches, pitted surface, rolled-in-scale, and scratches) using different classification algorithms.

    A train-test split of the extracted features was then performed.

    The gradient boosting classifier had the highest testing accuracy. Then, the hyperparameter of the boosting factor was tuned (which was difficult) to get even more accuracy, as mentioned in the refinement section. This approach achieved the classification accuracy of 92.5 percent.

    In the future, this approach can be implemented using deep learning algorithms if the large dataset is available.

    This was an interesting project, as this model can be implemented in real-life scenarios in the steel industry, which suffers from the problem of steel plate defects.

    Intel® AI DevCloud development tools

    Intel® AI DevCloud was used to train the network for the above implementation. Intel AI DevCloud is available for academic and personal research purposes for free and the request can be made from the Intel AI DevCloud website. The code developed can be found in this GitHub* repository.

    Join the Intel® AI Academy

    Sign up for the Intel® AI Academy and access essential learning materials, community, tools, and technology to boost your AI development. Apply to become an Intel® Student Ambassador and share your expertise with other student data scientists and developers.

    References

    1. Yong Jie Zhao, Yun Hui Yan, Ke Chen Song, Vision-based automatic detection of steel surface defects in the cold rolling process: considering the influence of industrial liquids and surface textures, The International Journal of Advanced Manufacturing Technology, 2017, Volume 90, Number 5-8, Page 1665.
    2. NEU surface defect database, http://faculty.neu.edu.cn/yunhyan/NEU_surface_defect_database.html
    3. steel_plate repository, https://github.com/rajendraprasadlawate/steel_plate

    Exciting Innovations Showcased at Unity* Unite Beijing 2018


    crowd at a conference

    Unity* Unite events continue to grow in popularity, drawing developers from around the globe in increasing numbers. Having already visited Seoul and Tokyo this year, the event moved to Beijing in May, settling at the China National Convention Center. Top technology experts and industry talents from across the world provided the audience with over 80 multi-themed technical courses and activities, including keynote presentations focusing on Unity's next-generation technology developments. Intel was there as well, presenting a technical keynote that demonstrated Intel technology in four new games.

    Optimizing with Intel® Graphics Performance Analyzers

    Peng Tao, senior software engineer at Intel China, gave a presentation with the title of "Increasing Gaming Audience: Intel® HD Graphics Can Also Run MR Games." He outlined how Intel® Graphics Performance Analyzers (Intel® GPA) can be used for game performance optimization during development. This helps achieve the goal of running mixed reality (MR) games smoothly on integrated graphics from Intel, and helps gamers run virtual reality (VR) games on a wider range of hardware platforms, expanding the target market for VR games.

    Using the MR game Space Pirate Trainer* as an example, before the use of Intel GPA, the frame-rate on a certain platform (Intel® Core™ i5 processors, Intel HD Graphics 620, Intel® NUC) was only 12 frames per second (fps), which was far from a smooth gaming experience. That low performance resulted in lagging and dizziness in players. Even when some of the special effects were removed, a frame rate of only 30 fps was achieved, still far from optimal.

    After game performance optimization using the Intel GPA toolkit, some image quality was compromised, but Space Pirate Trainer achieved a rate of 60 fps on this configuration, which met the Windows* MR application requirements.

    Intel GPA is a free graphics performance analysis tool that enables game developers to optimize the performance potential of their gaming platforms. The toolkit includes GPA Monitor, which connects Intel GPA with applications; System Analyzer HUD, which displays application performance indicators in real time; Graphics Frame Analyzer, which enables the captured frame files to be viewed in detail; and GPA Platform Analyzer, which enables detailed analysis of the running source code on all threads.

    Intel GPA helps developers perform detailed analysis without changing the game's source code. Intel recommends the use of this tool for game performance optimization by all developers.

    Intel showed how to analyze intricate game scenarios using Intel GPA during the performance optimization of Space Pirate Trainer. The demands for hardware performance were reduced by approaching the optimization from the aspects of the shader, materials processing, lighting, post-processing, CPU performance, power consumption, and other areas.

    Visionary Games and Evolutionary Gaming Experience

    The four games that were demonstrated at the Intel booth for Unite Beijing 2018—Seeking Dawn*, TEOT: The End of Tomorrow*, Candleman: The Complete Journey*, and Enlightenment*—were recommended by the Intel China team. Intel showed how its graphical support, architecture, and other technologies were used to improve overall game performance, and how Intel can assist game developers to optimize their gaming experience and satisfy the high demands of the gaming market.

    Seeking Dawn combines elements of science fiction, survival, and exploration with different visual performances on different hardware platforms. On a gaming platform equipped with an Intel® Core™ i7 processor, physical effects and other aspects of Seeking Dawn showed considerable improvement when compared to a platform powered by an Intel Core i5 processor.

    CEO Freeman introducing Seeking Dawn Game
    Figure 2. Freeman, founder and CEO of Multiverse, introduced Seeking Dawn*.

    Candleman is a uniquely creative, highly original game about the dreams of ordinary people. It has received positive reviews for its use of dynamic lighting, linear color space, and other visual techniques. Candleman was successfully migrated to Intel HD Graphics, resulting in smoother gameplay and more enabled visual effects than before optimization started.

    Gao Ming Introducing Candleman
    Figure 3. Gao Ming, game producer and co-founder of Spotlightor Interactive, introduced Candleman*.

    The game TEOT: The End of Tomorrow offers realistic 3D scenes and an attractive storyline with interesting gameplay. Collaboration with Intel helped developers improve game performance, accurately detect bottlenecks, and provide more gaming solutions. Thanks to performance optimization, TEOT can now run smoothly on Intel HD Graphics, potentially offering more sales.

    Convergence of the Latest Gaming Technology Trends

    The interactive activities set up at the Intel booth attracted both game developers and game players. Intel showcased products and emerging technologies for upgrading visual experiences and improving performance. The Intel booth attracted over 1,000 registrations from developers. Intel conducted live interviews in every booth, and developers learned how to optimize gaming experiences via cooperation with Intel.

    Large crowd at Unite Beijing 2018
    Figure 4. Standing room only: The crowd at the Unite Beijing 2018 presentations.

    Many leading gaming technology trends affected by Unity were also showcased at Unite Beijing 2018. The core topics of the event included:

    • Fully upgraded Unity 2018 release. Unity presented its latest 2018 version, which features improvements to the two core concepts of low-level rendering and default performance. Other improvements are in the areas of GPU Instancing support for GI (global illumination, or bounced lighting); presets for settings import and component editor; an all-new particle system upgrade; the new scriptable render pipeline (SRP) real-time rendering architecture option; and more. The new functions can turn the Unity engine into a high-end studio.
    • Unity 2018 feature analysis. Unity summarized the pioneering technologies available on its 2018 version, and provided suggestions to developers on how to utilize these new technologies. These new technologies include next-generation rendering features such as the SRP, Post-processing Stack v2, the Unity Shader Graph, and more. These technologies can help developers efficiently compile high-performance code for the C# Job System and the new generation Entity Component System. Unity also explained the new generation Unity particle system and the custom rendering texture feature, and discussed optimization tips.
    • Application of AI and machine learning in game development. Unity showcased their progress with artificial intelligence (AI) and machine learning, explaining how to use the revolutionary AI and machine learning tool Machine Learning Agents (Unity ML-Agents). Better AI features promise to bring new possibilities to game development and content creation, and help developers create smarter games and systems.
    • Efficient creations with real-time image rendering. Unity demonstrated its high quality, real-time rendering functions along with the Timeline, Cinemachine, Post-processing Stack, and other additional film production modules. Developers and artists can use Unity 2018 to create animations, add movie effects, edit scenes, and include other content while greatly reducing development and production time.

    Unity* Experts Take the Stage

    During the keynote presentations on May 11, several guests from Unity shared their views surrounding Unity's content creation engine, Unity's market strategy, and other topics.

    Chief marketing officer Clive Downie shared some of Unity's market penetration data. He said that Unity is seeing success in VR titles across all major platforms—69 percent of VR content on the Oculus Rift* platform, 74 percent of VR content on the HTC Vive* platform, 87 percent of VR content on the Samsung* Gear VR platform, and 91 percent of content on the Microsoft HoloLens* platform—all developed using Unity.

    Clive Downie from Unity
    Figure 5. Chief marketing officer Clive Downie discussed Unity's impressive market penetration statistics. 

    Carl Callewaert, the global director of evangelism at Unity, did a deep technology dive by introducing a series of new features in Unity, such as the all-new art tool, next-generation rendering pipeline, real-time ray tracing, GPU lightmapper, and Nested Prefabs.

    Carl Callewaert from Unity
    Figure 6. Unity's Carl Callewaert, global director of evangelism, discussed the new rendering pipeline.

    Andy Touch, Unity's global content evangelist, presented function highlights of Unity since 2015 by comparing the contents of a Unity demo in different stages, and introduced the feature of a High Definition Render Pipeline (HDRP) through Unity's latest real-time rendering film, Book of the Dead*.

    Andy Touch from Unity
    Figure 7. Unity's Andy Touch introduced the High Definition Render Pipeline.

    Unity Evangelist Mark Schoennagel presented the lightweight version of Unity, long-awaited by developers. As a web-based application, the file size of the new Unity core is a mere 72 KB. At the same time, Unity also performed optimization on the asset pipeline, reducing file sizes further, and implemented optimizations on the lightweight project more efficiently.

    Mark Schoennagel from Unity
    Figure 8. Unity Evangelist Mark Schoennagel discussed the new, smaller footprint for the Unity* core.

    Danny Lange, Vice President of Unity AI and Machine Learning, shared new developments in the field of machine learning. Unity strives to reduce the entrance threshold for machine learning and help developers apply this technology to their games. Unity's open-source AI toolkit, Unity ML-Agents, can help developers and researchers train machine agents in real and complex environments, and help developers enter the age of smart development.

    Danny Lange from Unity
    Figure 9. Unity's Danny Lange, VP of AI and Machine Learning, discussed Unity ML-Agents, an open-source toolkit to help developers boost immersion and realism.

    Hu Min, Customer Management Director of Unity Ads for the Greater China Region, announced the official Unity Ads direct advertising solution for China's Android market. This is an easy-to-deploy and extremely secure solution that can help Chinese developers generate revenue through advertising.

    Hu Min from Unity
    Figure 10. Unity's Hu Min, Customer Management Director of Unity Ads for the Greater China Region, introduced the official Unity Ads direct advertising solution.

    Anuja Dharkar, head of Learning, Global Education at Unity, introduced Unity's global education certification system. Developers can utilize a variety of official channels, such as online interactive tutorials and official offline training, to fully master Unity skills and obtain official certification. Unity's Senior Operations Director of the Greater China Region, Beeling Chua, introduced Unity's education strategy in the Greater China Region.

    Anuja Dharkar from Unity
    Figure 11. Unity's Anuja Dharkar, head of learning, introduced Unity's global education certification system.

    Zhang Zhunbo, General Manager of the Greater China Region and Global Vice President at Unity, introduced Unity's service system in China, which includes technical support, technology presentations, education, the Unity Asset Store*, industry solutions, advertising services, strategic platform cooperation, and more. He also introduced products and services that were developed locally.

    Zhang Zhunbo from Unity
    Figure 12. Unity's Zhang Zhunbo, GM of GC and VP of Unity, introduced Unity's service system in China.

    Abundance of Demos in the Exhibition Area

    Multiple gaming hardware manufacturers, game developers, game development tool providers, and other vendors set up shop in the exhibition area.

    Conference attendees
    Figure 13. Conference attendees had an opportunity to try out the latest gear and demos.

    HTC Vive*, Windows Mixed Reality, 3Glasses*, Pico Interactive, HoloLens, Lenovo Mirage* AR, and others presented VR head-mounted displays to the public. Demo games attracted enthusiastic players and were very popular, with shooting games making up the bulk of the VR titles at the show.

    With augmented reality (AR) development tools becoming widely available, plenty of AR content also appeared at the event. Automotive AR applications such as YuME and Meow!* have caught the attention of many players. Google demonstrated its ARCore* software, which has three main functions to help create realistic AR experiences: plane recognition, motion tracking, and light estimation.

    Different types of games that were developed using the Unity engine were presented at the event, and many of those games were developed by independent developer teams in China. Horizontal games, puzzle solving games, martial arts role-playing games, action role-playing games, and others were represented, all showcasing the effectiveness of the Unity engine across gaming genres.

    Besides VR gaming, the introduction of VR into education and other industries was also a major focus in the exhibition area. The VR product developed by Unity Education and its application in the Ruiyu Imagination Classes at Sichuan Normal University was showcased. Developed using the Unity engine, this experimental K12 product utilizes VR technology and includes all physics, chemistry, biology, and scientific experiments from elementary school through high school (sourced from Chinese textbooks). Experiencing laboratories and experiments through VR allows students to safely walk through methodology and application; errors can be corrected immediately, and processes repeated, in order to better remember and learn.

    Interesting Technical Topics

    More than 80 courses and activities ran at Unite Beijing 2018, focusing on Unity's next-generation technology developments, hopefully spurring game developers to envision greater possibilities.

    demo show of Unity Interface
    Figure 14. Demo shown at Unite Beijing 2018.

    • High-end real-time rendering. Unity highlighted its creative experience in film and animation and introduced how to use the new generation Unity HDRP to develop film clips with realistic qualities.
    • Quality real-time rendering animation function. Jiang Yibing, Global Technical Art Supervisor at Unity, showcased the process of creating her upcoming animated short Windup*, taking the audience from the initial concept through writing a captivating script, time-lining, camera setup, role creation, scene building, and lighting control, all the way to the final animation production.
    • Game production process based on image reconstruction. Zhang Liming, Technical Director of the Greater China Region at Unity, revealed the creation process of the Book of the Dead and introduced how the newly added rendering features in Unity 2018 were used to bring the real-time rendering quality to unprecedented heights.
    • Quickly creating virtual worlds in Unity. Yang Dong, Technical Director of the Unity Platform Department of the Greater China Region, explained the various function models of ProBuilder*, and how to use its features for level development.
    • Machine learning in Unity. The machine learning framework published by Unity allows developers to use the Python* API to apply machine learning capabilities to game production and various VR simulations within the Unity environment. Sun Zhipeng from Unity introduced Unity's machine learning framework and its tuning to the audience, and shared some practical cases.
    • Analysis of underlying memory usage in iOS*. A detailed analysis of underlying memory use in iOS was presented, and the use of Unity Profiler, Xcode*, and the Instruments tool to compile data was introduced. This allows for the discovery of where memory consumption is highest within the game and the corresponding optimization method.
    • Geometry and compute shaders in Unity. The familiar rendering pipeline was reviewed as a way to introduce the application of the geometry shader, which is used in Unity to implement more realistic grass renderings.
    • Best practices for baked lightmaps. This presentation focused on baked lightmap effects, baking time, and lightmap memory usage. Some best practices for baking system tuning and optimization were discussed so that developers could obtain the optimal baking effects in the shortest amount of time while optimizing lightmap memory usage.

    Emerging Technologies Introduce New Innovations into Traditional Industries

    Aside from having a wide range of applications in the world of gaming, VR/AR/MR, other new and emerging technologies can also be applied to manufacturing, automotive, construction, animation, design, and education.

    an architectural rendering
    Figure 15. Unity technology is now being applied to traditional industries such as construction, architecture, and design.

    The application of Unity's technology in traditional industries was introduced under the broad subject of industrial applications. Speakers touched on the following areas:

    • Quickly generate VR/AR workflows for industrial engineering design. Zhong Wu, Global Development Consultant at Autodesk, showed how to utilize the seamless interconnection between industry-leading cloud services technology and Unity's workflow. The goal is to quickly and efficiently solve the challenges of the engineering design industry and to create an all-new, highly efficient design data presentation in VR/AR.
    • Upgrade project management communication. Using building information model (BIM) with VR, users can upgrade their project management communication scenarios and bring forth a more intuitive project management mode to overhaul the traditional project management communication model.
    • Apply interactive MR in broadcasting. DataMesh CTO Wu Hao shared how to use Unity to create interactive MR experiences and implement broadcasting-grade live mixed-reality content.
    • Apply MR in industrial IoT. Liu Hongyu, co-founder and architect of Holoview Lab (Shanghai) Co. Ltd., shared how to apply Unity to development for industrial Internet of Things (IoT), and used it with the HoloLens mixed-reality device, Microsoft Azure*, GE Predix*, and other platforms to share the development experience with mixed-reality IoT applications.
    • Transform video technology. High-dimension video digital assets based on the Unity engine will become core elements in the various layers of video content in the future, and the use of Unity within the video industry will probably also increase.
    • Create industrial AR/VR applications. Zhou Guichun, Research Engineer at United Technologies Corporations, shared how to obtain data from BIMs, and how to analyze data from 3D models developed using Unity. He also described his experience with developing Unity animations, and explained how to design UI/UX for AR/VR applications and Unity's AR/VR development framework.

    The Future is Here

    Over three days, Unity demonstrated the tremendous energy and enthusiasm in the growing Unity ecosystem with cutting-edge technology and the spirit of community at its core. Speaker after speaker presented Unity's most advanced technical features, while challenging developers to realize their potential to bring revolutionary changes to various industries.

    As a major collaborator with Unity, Intel demonstrated how game developers can take full advantage of multicore technology to improve their game development capabilities. Intel showcased a full line of optimization tools that work with the Unity engine, and clearly established that they will continue to help drive gaming technology and concept innovation. If you want to get in on the action at an upcoming Unite conference, check out their upcoming schedule and make your plans to attend.

    Code Sample: Rendering Objects in Parallel Using Vulkan* API

    File(s): Download
    License: Intel Sample Source Code License Agreement
    Optimized for...
    OS: 64-bit Windows* 7, 8.1 or Windows® 10
    Hardware: GPU required
    Software (Programming Language, tool, IDE, Framework): Microsoft Visual Studio* 2017, Qt Creator 4.5.0, C++ 17, Qt 5.10, Vulkan* 1.065.1 SDK, ASSIMP 4.1.0 library
    Prerequisites: Familiarity with Visual Studio, Vulkan* API, 3D graphics, parallel processing.


    Introduction

    One of the industry’s hottest new technologies, Vulkan APIs support multithreaded programming, simplify cross-platform development and have the backing of major chip, GPU and device-makers. The API is a collaborative effort by the industry to meet current demands of computer graphics. It is a new approach that emphasizes hiding the CPU bottleneck through parallelism, and allowing much more flexibility in application structure. Aside from components related only to graphics, the Vulkan API also defines the compute pipeline for numerical computation. In all, Vulkan APIs are positioned to become one of the next dominant graphics rendering platforms.

    This code and accompanying article (see References below) discuss the process of rendering multiple FBX (Filmbox) and OBJ (Wavefront) objects using Vulkan APIs. The application employs a non-touch graphical user interface (GUI) that reads and displays multiple 3D object files in a common scene. Files are loaded and rendered using linear or parallel processing, selectable for the purpose of comparing performance. In addition, the application allows objects to be moved, rotated, and zoomed through a simple UI. We recommend that you read the article while looking at the code. Make sure you have the examples downloaded and use your favorite code browsing tool.

    The code demonstrates the following concepts:

    1. Loaded models displayed in a list
    2. Selected objects identified on-screen with a bounding box
    3. An object information and statistics display showing the number of vertices
    4. The ability to specify either delta or absolute coordinates and rotations
    5. An option to view objects in wireframe mode
    6. Statistics for comparing single- versus multi-threading when reading object files


    Get Started

    At a high level, when programming using Vulkan, the goal is to construct a virtual device to which drawing commands will be submitted. The draw commands are submitted to constructs called “queues”. The number of queues available and their capabilities depend upon how they were selected during construction of the virtual device, and the actual capabilities of the hardware. The power of Vulkan lies in the fact that the workload submitted to queues could be assembled and sent in parallel to already executing tasks. Vulkan offers functionality to coherently maintain the resources and perform synchronization.
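    As a minimal illustration of this idea (not code from the sample), a logical device with a single graphics queue can be created as follows; physicalDevice and graphicsFamilyIndex are assumed to have been selected during physical-device enumeration:

    #include <vulkan/vulkan.h>

    // Sketch: create a virtual (logical) device and retrieve one graphics queue.
    VkDevice createDeviceWithGraphicsQueue(VkPhysicalDevice physicalDevice,
                                           uint32_t graphicsFamilyIndex,
                                           VkQueue* outQueue)
    {
        float priority = 1.0f;

        VkDeviceQueueCreateInfo queueInfo{};
        queueInfo.sType            = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
        queueInfo.queueFamilyIndex = graphicsFamilyIndex;
        queueInfo.queueCount       = 1;                 // ask for one queue from this family
        queueInfo.pQueuePriorities = &priority;

        VkDeviceCreateInfo deviceInfo{};
        deviceInfo.sType                = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
        deviceInfo.queueCreateInfoCount = 1;
        deviceInfo.pQueueCreateInfos    = &queueInfo;

        VkDevice device = VK_NULL_HANDLE;
        vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device);

        // Retrieve the queue handle that draw commands will later be submitted to.
        vkGetDeviceQueue(device, graphicsFamilyIndex, 0, outQueue);
        return device;
    }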

    Tutorial: Rendering Objects in Parallel Using Vulkan* APIs

    Reading FBX and OBJ files

    The first step is to set up and create the user interface. As we said, this UI is keyboard- and mouse-driven, but it could be enhanced to support touch.

    Once the UI is in place, the work begins with reading either an FBX or OBJ file and loading it into memory. The application supports doing this using a single or multiple threads so you can see the difference in performance. We are going to cheat here and use the Open Asset Import Library (assimp) to read and parse the files. Once loaded, the object will be placed in a data structure (Object3D) that we can hand to Vulkan. This is described in detail in the article.
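    A hypothetical loader using assimp might look like the sketch below; the sample's actual Object3D structure holds more than raw positions, so this only illustrates the ReadFile call and the mesh traversal:

    #include <assimp/Importer.hpp>
    #include <assimp/postprocess.h>
    #include <assimp/scene.h>
    #include <string>
    #include <vector>

    // Sketch: read an FBX or OBJ file and flatten all mesh vertex positions.
    std::vector<float> loadVertexPositions(const std::string& path)
    {
        Assimp::Importer importer;
        const aiScene* scene = importer.ReadFile(
            path, aiProcess_Triangulate | aiProcess_JoinIdenticalVertices);

        std::vector<float> positions;
        if (!scene) {
            return positions;  // importer.GetErrorString() describes the failure
        }
        for (unsigned int m = 0; m < scene->mNumMeshes; ++m) {
            const aiMesh* mesh = scene->mMeshes[m];
            for (unsigned int v = 0; v < mesh->mNumVertices; ++v) {
                positions.push_back(mesh->mVertices[v].x);
                positions.push_back(mesh->mVertices[v].y);
                positions.push_back(mesh->mVertices[v].z);
            }
        }
        return positions;
    }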

    Displaying and manipulating the 3D objects

    The main area of the user interface is a canvas where the loaded objects are displayed. These are placed in a default location but can be moved anywhere on the canvas so they do not overlap. When you select an object from the list of loaded items, it is highlighted with a bounding box. Once selected, you can move, rotate, or resize the object by entering new coordinates or a new size into the form. Again, you can read the details in the accompanying article.

    Using Vulkan to render the 3D objects

    Loading the objects from memory and displaying them on the screen is handled gracefully by Vulkan. The source code contains code to show how to load an object file using Vulkan. About a dozen lines in, the loaded file is sent to the renderer with support for a secondary command buffer to allow object-loading in parallel. The system processor, GPU, and other factors of the host system as well as the size of the object file will determine single- and multi-threaded object rendering times. Your results will vary.

    Because of the complexities of the Vulkan APIs, the biggest challenge was building Renderer, which implements the application-specific rendering logic for the Vulkan Window. Especially challenging was synchronizing the worker and UI threads without using mutually exclusive locks in the rendering and resource-releasing phases. In the rendering phase, this is achieved by separating command pools and secondary command buffers for each Object3D instance. In the resource-releasing phase, it is necessary to make sure that both the host and the GPU have finished rendering.
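    The following sketch illustrates the secondary command buffer pattern described above; all handles (device, per-thread command pool, render pass, framebuffer, pipeline, primary command buffer) are assumed to exist, and the primary buffer's render pass must have been begun with VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS:

    #include <vulkan/vulkan.h>

    void recordObjectSecondary(VkDevice device, VkCommandPool threadPool,
                               VkRenderPass renderPass, VkFramebuffer framebuffer,
                               VkPipeline pipeline, VkCommandBuffer primaryCmd)
    {
        // Each worker thread allocates from its own command pool, so no locking
        // is needed while recording.
        VkCommandBufferAllocateInfo allocInfo{};
        allocInfo.sType              = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
        allocInfo.commandPool        = threadPool;
        allocInfo.level              = VK_COMMAND_BUFFER_LEVEL_SECONDARY;
        allocInfo.commandBufferCount = 1;
        VkCommandBuffer secondary = VK_NULL_HANDLE;
        vkAllocateCommandBuffers(device, &allocInfo, &secondary);

        // Secondary buffers must state which render pass they will execute inside.
        VkCommandBufferInheritanceInfo inheritance{};
        inheritance.sType       = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
        inheritance.renderPass  = renderPass;
        inheritance.framebuffer = framebuffer;

        VkCommandBufferBeginInfo beginInfo{};
        beginInfo.sType            = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        beginInfo.flags            = VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT;
        beginInfo.pInheritanceInfo = &inheritance;

        vkBeginCommandBuffer(secondary, &beginInfo);
        vkCmdBindPipeline(secondary, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);
        // ... bind vertex/index buffers and issue vkCmdDraw* calls for one Object3D ...
        vkEndCommandBuffer(secondary);

        // The UI thread's primary command buffer replays the recorded work.
        vkCmdExecuteCommands(primaryCmd, 1, &secondary);
    }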

    The key functions of interest in Renderer are:

    • void Renderer::startNextFrame()
    • void Renderer::endFrame()
    • void Renderer::drawObject()
    • void Renderer::initPipeline()

    This latter method was required in order to handle different types of graphical objects – those loaded from files and those dynamically generated in the form of bounding boxes that surround the selected object. This caused a problem because they use differing shaders, primitive topologies and polygon modes. The goal was to unify code as much as possible for the different objects to avoid replicating similar code. Both types of objects are expressed by single-class Object3D.


    Conclusion

    Coding flexibility is a hallmark of the low-level Vulkan APIs, but it is critical to remain focused on what is going on in each Vulkan step. These lower-level programming capabilities also allow for fine-tuning certain aspects of hardware access not available with OpenGL*. If you take it slow and build your project in small, incremental steps, the payoffs will include far greater rendering performance, a much lower runtime footprint, and greater portability to a multitude of devices and platforms.


    References

    Alexey Korenevsky, Integrated Computing Solutions, Inc., Vulkan Code Sample: Rendering Objects in Parallel, Rendering Objects in Parallel Using Vulkan* APIs, 2018

    Open Asset Import Library


    Updated Log

    Created May 23, 2018

    Rendering Objects in Parallel Using Vulkan* APIs


    If you're a game developer and not yet up to speed on Vulkan*, you should be. Vulkan APIs are one of the industry's hottest new technologies. They support multithreaded programming, simplify cross-platform development and have the backing of makers of major chips, GPUs, and devices. Vulkan APIs are positioned to become one of the next dominant graphics rendering platforms. Characteristics of the platform help apps gain longevity and run in more places. You might say that Vulkan lets apps live long and prosper—and this code sample will help get you started.

    The APIs were introduced by the Khronos Group* in 2015, and quickly gained the support of Intel and Google*. Unity Technologies* came on board in 2016, and Khronos* confirmed plans to add automatic support for multiple discrete GPUs to Vulkan. By 2017, as the Vulkan APIs matured, an increasing number of game makers announced that they would begin adopting them. Vulkan became available for Apple's macOS* and iOS* platforms in 2018.

    Vulkan carries low overhead while providing greater control over threading and memory management, along with more direct access to the GPU than OpenGL* and other predecessor APIs offer. These features combine to give the developer the versatility to target an array of platforms with essentially the same code base. With early backing from major industry players, the Vulkan platform has tremendous potential, and developers are advised to get on board soon. Vulkan is built for now.

    To help experienced pro and indie developers prepare for Vulkan, this article walks through the code of a sample app that renders multiple .fbx and .obj objects using Vulkan APIs. The app employs a non-touch graphical user interface (GUI) that reads and displays multiple object files in a common scene. Files are loaded and rendered using linear or parallel processing, selectable for the purpose of comparing performance. In addition, the app allows objects to be moved, rotated, and zoomed through a simple UI.

    Figure 1. Multiple rendered objects displayed simultaneously; the selected object is indicated with a bounding box.

    The app also features:

    • Loaded models displayed in a list
    • Selected objects identified on-screen with a bounding box
    • An object info and stats display showing the number of vertices
    • The ability to specify either delta or absolute coordinates and rotations
    • The ability to open object files from a file explorer window
    • An option to view objects in wireframe mode
    • Stats displays comparing single- and multithreaded reading and rendering

    Keeping developers informed and educated on the latest technologies and development techniques is an important part of ensuring their success and prosperity. To that end, all source code and libraries from this project are available for download, so you can build and learn from the app on your own and adapt the functions for use in your own apps.

    For people new to Vulkan, the learning curve can be steep. Because it gives developers rich features and a broad level of control, Vulkan contains far more structures and requires a greater number of initializations than OpenGL and other graphics libraries. For the sample app, the renderer alone (renderer.cpp) required more than 500 lines of code.
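
    To give a sense of that verbosity, here is a minimal, self-contained illustration of just creating a Vulkan instance. It is not taken from the sample; the application name is an arbitrary placeholder.

    #include <vulkan/vulkan.h>
    #include <cstdio>

    int main()
    {
        // Even the simplest Vulkan step, creating an instance, requires two
        // structures to be filled in before a single API call can be made.
        VkApplicationInfo appInfo = {};
        appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
        appInfo.pApplicationName = "minimal-instance"; // placeholder name
        appInfo.apiVersion = VK_API_VERSION_1_0;

        VkInstanceCreateInfo createInfo = {};
        createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
        createInfo.pApplicationInfo = &appInfo;

        VkInstance instance = VK_NULL_HANDLE;
        VkResult res = vkCreateInstance(&createInfo, nullptr, &instance);
        if (res != VK_SUCCESS) {
            std::printf("vkCreateInstance failed: %d\n", res);
            return 1;
        }

        vkDestroyInstance(instance, nullptr);
        return 0;
    }

    Multiply that kind of setup across devices, swap chains, render passes, pipelines, and descriptors, and the 500-plus lines in renderer.cpp are easy to understand.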

    To minimize the amount of code required, this sample app focuses heavily on architecting a unified means of rendering different object types. Common initialization steps are identified and separated from the general pipeline setup, while the parts specific to a particular 3D object instance are handled when it is loaded from a file and rendered. The bounding box is another type of object and requires its own shaders, settings, and pipeline, although only one instance of it ever exists. Minimizing the coding differences between object types also helped to improve flexibility and simplify the code.

    One of the most significant challenges of developing this sample involved multithreaded rendering. Though the Vulkan APIs are considered "thread-safe," some objects require explicit synchronization on the host side; in particular, a command pool and the command buffers allocated from it must not be used from several threads at once. When an object records its command buffer, that buffer is allocated from a command pool, and if the pool were accessed in parallel from several threads the app could crash or Vulkan would report a validation warning. One answer would be to use mutual exclusions, or mutexes, to serialize access to a shared command pool, but this would eliminate the advantage of parallel processing because threads would compete and block each other. Instead, the sample app implements separate command buffers and command pools for each 3D object instance, which in turn requires extra code to release those resources.
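
    The sketch below shows that idea in isolation, using the raw Vulkan API rather than the Qt wrappers the sample uses. The PerObjectCommands struct and the helper functions are illustrative names, not part of the sample.

    #include <vulkan/vulkan.h>

    // Illustrative container: one command pool plus one secondary command
    // buffer per renderable object, so worker threads never share a pool.
    struct PerObjectCommands {
        VkCommandPool   pool      = VK_NULL_HANDLE;
        VkCommandBuffer secondary = VK_NULL_HANDLE;
    };

    // Hypothetical helper: create the per-object pool and allocate one
    // secondary command buffer from it.
    bool createPerObjectCommands(VkDevice device, uint32_t queueFamilyIndex,
                                 PerObjectCommands &out)
    {
        VkCommandPoolCreateInfo poolInfo = {};
        poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
        poolInfo.flags = VK_COMMAND_POOL_CREATE_RESET_COMMAND_BUFFER_BIT;
        poolInfo.queueFamilyIndex = queueFamilyIndex;
        if (vkCreateCommandPool(device, &poolInfo, nullptr, &out.pool) != VK_SUCCESS)
            return false;

        VkCommandBufferAllocateInfo allocInfo = {};
        allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
        allocInfo.commandPool = out.pool;
        allocInfo.level = VK_COMMAND_BUFFER_LEVEL_SECONDARY; // recorded in a worker thread
        allocInfo.commandBufferCount = 1;
        return vkAllocateCommandBuffers(device, &allocInfo, &out.secondary) == VK_SUCCESS;
    }

    // Destroying the pool frees every command buffer allocated from it.
    void destroyPerObjectCommands(VkDevice device, PerObjectCommands &cmds)
    {
        if (cmds.pool != VK_NULL_HANDLE)
            vkDestroyCommandPool(device, cmds.pool, nullptr);
        cmds = PerObjectCommands{};
    }

    Because each Object3D owns its own pool, the sample's worker threads can record their secondary command buffers concurrently without any locking.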


    What You'll Need

    The minimum requirement for developing with Vulkan APIs on graphics processing units (GPUs) from Intel is a processor from the 6th Generation Intel® processor family (introduced in August 2015), running 64-bit Windows* 7, 8.1, or 10. Intel also offers a 64-bit Windows® 10-only driver for 6th, 7th, or 8th Generation processors. Vulkan drivers are now included with the Intel® HD Graphics Driver, which helps simplify the setup process. Instructions are available for installing Vulkan drivers on Intel®-based systems running Unity* or Unreal* Engine 4.


    Code Walk-Through

    This app was built as an aid to developers learning to use Vulkan. This walk-through explains the techniques used to make the sample app, simplifying the work of getting started on your own. To reduce time spent on planning the architecture, the app was developed using an incremental, iterative process, which helps minimize changes during the coding phase. The project was divided into three parts: UI (MainWindow and VulkanWindow), model loader (Model.cpp/h), and rendering (Renderer.cpp/h). The feature list was prioritized and sorted by difficulty of implementation. Coding then started with the easiest features—refactoring and changing design only when needed.


    MainWindow.cpp

    In the sample app's main window, object files are loaded using either a single thread or in parallel. Either way, a timer counts the total loading time to allow for comparison. When files are processed in parallel, the QtConcurrent framework is used to run the work in worker threads.

    The "loadModels()" function starts the parallel or linear processing of files. In the first few lines, a counter is started. Then the loading times for file(s) are counted and an aiScene is created using the Assimp* external library. Next, the aiScene is converted to a class model created for this app that's more convenient to Vulkan. A progress dialog is created and presented while parallel file processing takes place.

    void MainWindow::loadModels()
    {
        clearModels();
        m_elapsedTimer.start(); // counts total loading time
    
        std::function<QSharedPointer<Model>(const QString &)> load = [](const QString &path) {
            QSharedPointer<Model> model;
            QFileInfo info(path);
            if (!info.exists())
                return model;
            QElapsedTimer timer;
            timer.start(); // loading time for this file
            Assimp::Importer importer;
    // read file from disk and create aiScene instance (external library Assimp)
            const aiScene* scene = importer.ReadFile(path.toStdString(),
                                                     aiProcess_Triangulate |
                                                     aiProcess_RemoveComponent |
                                                     aiProcess_GenNormals |
                                                     aiProcess_JoinIdenticalVertices);
    
            qDebug() << path << (scene ? "OK" : importer.GetErrorString());
            if (scene) {
    // aiScene format is not very convenient for renderer so we designed class Model to keep data ready for Vulkan renderer.
                model = QSharedPointer<Model>::create(info.fileName(), scene);  //convert aiScene to class Model (Model.cpp) that’s convenient for Vulkan renderer
    
                if (model->isValid()) {
                    model->loadingTime = timer.elapsed();
                } else {
                    model.clear();
                }
            }
            return model;
        };
    // create a progress dialog for app user
        if (m_progressDialog == nullptr) {
            m_progressDialog = new QProgressDialog(this);
            QObject::connect(m_progressDialog, &QProgressDialog::canceled, &m_loadWatcher, &QFutureWatcher<void>::cancel);
            QObject::connect(&m_loadWatcher,  &QFutureWatcher<void>::progressRangeChanged, m_progressDialog, &QProgressDialog::setRange);
            QObject::connect(&m_loadWatcher, &QFutureWatcher<void>::progressValueChanged,  m_progressDialog, &QProgressDialog::setValue);
        }
        // using QtConcurrent for parallel file processing in worker threads
        QFuture<QSharedPointer<Model>> future = QtConcurrent::mapped(m_files, load);
        m_loadWatcher.setFuture(future);
    //present the progress dialog to app user
        m_progressDialog->exec(); 
    }
    

    The "loadFinished()" function processes results of the parallel or linear processing, adds object file names to "listView," and passes models to the renderer.

    void MainWindow::loadFinished() {
        qDebug("loadFinished");
        Q_ASSERT(m_vulkanWindow->renderer());
        m_progressDialog->close(); // close the progress dialog
    // iterate around result of file load
        const auto & end = m_loadWatcher.future().constEnd();
    
    // loop for populating list of file names
        for (auto it = m_loadWatcher.future().constBegin() ; it != end; ++it) {
            QSharedPointer<Model> model = *it;
            if (model) {
                ui->modelsList->addItem(model->fileName); // populates list view
    // pass object to renderer (created in vulkanWindow, which is part of the mainWindow)
                m_vulkanWindow->renderer()->addObject(model); 
            }
        }
        ...
    }

    Identify the selected object on the screen by surrounding it with a bounding box.

    mainwindow.cpp: MainWindow::currentRowChanged(int row)
    {
    ...
       if (m_vulkanWindow->renderer())
               m_vulkanWindow->renderer()->selectObject(row);
    
    renderer.cpp: Renderer::selectObject(int index) - inflates the BoundaryBox object’s model
    ...
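
    For context, the box that gets inflated is essentially an axis-aligned min/max pass over the selected model's vertices; the eight corners can then be drawn with the line-list pipeline described later. The sketch below illustrates that general idea and is not the sample's actual selectObject() or BoundaryBox code; the Vec3 type and function name are assumptions.

    #include <array>
    #include <vector>
    #include <algorithm>
    #include <limits>

    // Hypothetical minimal vertex type; the sample keeps positions inside Model.
    struct Vec3 { float x, y, z; };

    // Returns the eight corners of the axis-aligned bounding box that
    // encloses all vertices of the selected model.
    std::array<Vec3, 8> boundingBoxCorners(const std::vector<Vec3> &vertices)
    {
        const float inf = std::numeric_limits<float>::max();
        Vec3 mn = {  inf,  inf,  inf };
        Vec3 mx = { -inf, -inf, -inf };

        for (const Vec3 &v : vertices) {
            mn.x = std::min(mn.x, v.x); mx.x = std::max(mx.x, v.x);
            mn.y = std::min(mn.y, v.y); mx.y = std::max(mx.y, v.y);
            mn.z = std::min(mn.z, v.z); mx.z = std::max(mx.z, v.z);
        }

        // All min/max combinations give the corners, which a
        // VK_PRIMITIVE_TOPOLOGY_LINE_LIST pipeline can draw as box edges.
        return {{
            { mn.x, mn.y, mn.z }, { mx.x, mn.y, mn.z },
            { mn.x, mx.y, mn.z }, { mx.x, mx.y, mn.z },
            { mn.x, mn.y, mx.z }, { mx.x, mn.y, mx.z },
            { mn.x, mx.y, mx.z }, { mx.x, mx.y, mx.z }
        }};
    }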
    

    Display object info and statistics (e.g., the number of vertices) for the selected object on the screen. Here, object-specific statistics are prepared, and the loading time for the scene is displayed.

    MainWindow::currentRowChanged(int row) - shows statistic for selected object:
    {
    …
    // prepare object-specific statistics (vertices, etc.)
    QString stat = tr("Loading time: %1ms. Vertices: %2, Triangles: %3")
                   .arg(item->model->loadingTime)
                   .arg(item->model->totalVerticesCount())
                   .arg(item->model->totalTrianglesCount());
    ui->objectStatLabel->setText(stat);
    
    // display total scene loading time
    void MainWindow::loadFinished() 
    ui->totalStatLabel->setText(tr("Total loading time: %1ms").arg(m_elapsedTimer.elapsed()));
    
    // show rendering performance in frames per second
    void MainWindow::timerEvent(QTimerEvent *) 
    ui->fpsLabel->setText(tr("Performance: %1 fps").arg(renderer->fps(), 0, 'f', 2, '0'));
    ...
    

    Enable users of the app to specify absolute coordinates and rotations.

    void MainWindow::positionSliderChanged(int)
    {
        const int row = ui->modelsList->currentRow();
        if (row == -1 || m_ignoreSlidersSignal || !m_vulkanWindow->renderer())
            return;
        m_vulkanWindow->renderer()->setPosition(row, ui->posXSlider->value() / 100.0f, ui->posYSlider->value() / 100.0f,
                                    ui->posZSlider->value() / 100.0f );
    }
    
    void MainWindow::rotationSliderChanged(int)
    {
        const int row = ui->modelsList->currentRow();
        if (row == -1 || m_ignoreSlidersSignal || !m_vulkanWindow->renderer())
            return;
         m_vulkanWindow->renderer()->setRotation(row, ui->rotationXSlider->value(), ui->rotationYSlider->value(),
                                    ui->rotationZSlider->value());
    }
    

    Figure 2. The sample app implements a file explorer window for finding and opening objects to render.

    Allow the app to open object files using a file explorer window.

    MainWindow::MainWindow(QWidget *parent)
        : QWidget(parent),
          ui(new Ui::MainWindow)
    {
    …
    
    connect(ui->loadButton, &QPushButton::clicked, this, [this] {
           const QStringList & files = QFileDialog::getOpenFileNames(this, tr("Select one or more files"), QString::null, "3D Models (*.obj *.fbx)");
           if (!files.isEmpty()) {
               m_files = files;
               loadModels();
               ui->reloadButton->setEnabled(true);
           }
       });
    ...
    

    Figure 3. Objects rendered in wireframe mode; the selected object is indicated by a bounding box.

    Allow the user to display objects in wireframe mode.

    MainWindow::MainWindow(QWidget *parent)
        : QWidget(parent),
          ui(new Ui::MainWindow)
    {
    ... 
     connect(ui->wireframeSwitch, &QCheckBox::stateChanged, this, [this]{
           if (m_vulkanWindow->renderer()) {
               m_vulkanWindow->renderer()->setWirefameMode(ui->wireframeSwitch->checkState() == Qt::Checked);
           }
       });
    Renderer.cpp (lines 386-402):
    void Renderer::setWirefameMode(bool enabled)
    ...
    

    Renderer.cpp

    Because of the complexities of the Vulkan APIs, the biggest challenge to this app's developer was building Renderer, which implements application-specific rendering logic for VulkanWindow.

    Figure 4. Thread selection is simplified with a drop-down window; the ideal number is based on cores in the host system.

    Especially challenging was synchronizing the worker and UI threads without using mutually exclusive locks during the rendering and resource-releasing phases. In the rendering phase, this is achieved by giving each Object3D instance its own command pool and secondary command buffer. In the resource-releasing phase, it is necessary to make sure both the host and GPU rendering phases have finished.

    Figure 5. Total loading time and vertices count of an object file allow comparison of single- and multithreaded loading times.


    Rendering Results May Vary

    The system processor, GPU, and other characteristics of the host system, as well as the size of the object file, will determine single- and multithreaded object rendering times. Your results will vary. Normally, the host rendering phase is finished when "Renderer::m_renderWatcher" emits a "finished" signal and "Renderer::endFrame()" is called. The resource-releasing phase might be initiated in cases such as:

    1. The Vulkan window is resized or closed.
      "Renderer::releaseSwapChainResources" and "Renderer::releaseResources" will be called.
    2. The wireframe mode is changed—"Renderer::setWirefameMode".
    3. Objects are deleted—"Renderer::deleteObjects".
    4. Objects are added—"Renderer::addObject".

    In those situations, the first things we need to do are:

    1. Wait until all worker threads are finished.
    2. Explicitly finish the rendering phase by calling "Renderer::endFrame()", which also sets the flag "m_framePreparing = false" so that any results from worker threads that still arrive asynchronously are ignored.
    3. Wait until the GPU finishes all graphical queues using the "m_deviceFunctions->vkDeviceWaitIdle(m_window->device())" call.

    This is implemented in "Renderer::rejectFrame":

    void Renderer::rejectFrame()
    {
       m_renderWatcher.waitForFinished(); // all workers must be finished
       endFrame(); // flushes current frame
       m_deviceFunctions->vkDeviceWaitIdle(m_window->device()); // all graphics queues must be finished
    }
    

    Parallel preparation of the command buffers that render the 3D objects is implemented in the following three functions; the code for each follows below:

    1. Renderer::startNextFrame—This is called when the draw commands for the current frame need to be added; it begins the render pass and dispatches command-buffer recording to worker threads.
    2. Renderer::endFrame—This finishes the render pass for the current command buffer, reports to VulkanWindow that the frame is ready, and requests an immediate update to keep rendering.
    3. Renderer::drawObject—This records commands to an object's secondary command buffer and runs in a worker thread. When it's done, the buffer is reported to the UI thread to be recorded into the primary command buffer.


    Function 1: Renderer::startNextFrame()

    This section contains mainly Vulkan-specific code that is not likely to need modification. The snippet shows how a frame is prepared for the loaded objects: about a dozen lines in, the render pass is begun with support for secondary command buffers, which allows the per-object command buffers to be recorded in parallel by worker threads.

    void Renderer::startNextFrame()
    {
        m_framePreparing = true;
    
        const QSize imageSize = m_window->swapChainImageSize();
    
        VkClearColorValue clearColor = { 0, 0, 0, 1 };
    
        VkClearValue clearValues[3] = {};
        clearValues[0].color = clearValues[2].color = clearColor;
        clearValues[1].depthStencil = { 1, 0 };
    
        VkRenderPassBeginInfo rpBeginInfo = {};
        memset(&rpBeginInfo, 0, sizeof(rpBeginInfo));
        rpBeginInfo.sType = VK_STRUCTURE_TYPE_RENDER_PASS_BEGIN_INFO;
        rpBeginInfo.renderPass = m_window->defaultRenderPass();
        rpBeginInfo.framebuffer = m_window->currentFramebuffer();
        rpBeginInfo.renderArea.extent.width = imageSize.width();
        rpBeginInfo.renderArea.extent.height = imageSize.height();
        rpBeginInfo.clearValueCount = m_window->sampleCountFlagBits() > VK_SAMPLE_COUNT_1_BIT ? 3 : 2;
        rpBeginInfo.pClearValues = clearValues;
    
        // starting render pass with secondary command buffer support
        m_deviceFunctions->vkCmdBeginRenderPass(m_window->currentCommandBuffer(), &rpBeginInfo,  VK_SUBPASS_CONTENTS_SECONDARY_COMMAND_BUFFERS);
    
        if (m_objects.size()) {
            // start parallel command buffer generation in worker threads using QtConcurrent
            auto drawObjectFn = std::bind(&Renderer::drawObject, this, std::placeholders::_1);
            QFuture<VkCommandBuffer> future = QtConcurrent::mapped(m_objects, drawObjectFn);
            m_renderWatcher.setFuture(future);
        } else {
            // if no object exists, end the frame immediately
            endFrame();
        }
    }
    


    Function 2: Renderer::endFrame()

    This function ends the render pass and tells VulkanWindow that all command buffers for the frame are ready to be submitted to the GPU; it then requests an update so that rendering continues.

    void Renderer::endFrame()
    {
        if (m_framePreparing) {
            m_framePreparing = false;
            m_deviceFunctions->vkCmdEndRenderPass(m_window->currentCommandBuffer());
            m_window->frameReady();
            m_window->requestUpdate();
            ++m_framesCount;
        }
    }
    

    Function 3: Renderer::drawObject()

    This function records the secondary command buffer that draws one object, ready to be executed from the primary buffer and sent to the GPU. As above, the Vulkan-specific code in this snippet runs in a worker thread and is not likely to need modification for use in other apps.

    // running in a worker thread
    VkCommandBuffer Renderer::drawObject(Object3D * object)
    {
        if (!object->model)
            return VK_NULL_HANDLE;
    
        const PipelineHandlers & pipelineHandlers = object->role == Object3D::Object ? m_objectPipeline : m_boundaryBoxPipeline;
        VkDevice device = m_window->device();
    
        if (object->vertexBuffer == VK_NULL_HANDLE) {
            initObject(object);
        }
    
        VkCommandBuffer & cmdBuffer = object->cmdBuffer[m_window->currentFrame()];
    
        VkCommandBufferInheritanceInfo inherit_info = {};
        inherit_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_INHERITANCE_INFO;
        inherit_info.renderPass = m_window->defaultRenderPass();
        inherit_info.framebuffer = m_window->currentFramebuffer();
    
        VkCommandBufferBeginInfo cmdBufBeginInfo = {
            VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
            nullptr,
            VK_COMMAND_BUFFER_USAGE_RENDER_PASS_CONTINUE_BIT,
            &inherit_info
        };
        VkResult res = m_deviceFunctions->vkBeginCommandBuffer(cmdBuffer, &cmdBufBeginInfo);
        if (res != VK_SUCCESS) {
            qWarning("Failed to begin frame command buffer: %d", res);
            return VK_NULL_HANDLE;
        }
    
        const QSize & imageSize = m_window->swapChainImageSize();
    
        VkViewport viewport;
        viewport.x = viewport.y = 0;
        viewport.width = imageSize.width();
        viewport.height = imageSize.height();
        viewport.minDepth = 0;
        viewport.maxDepth = 1;
        m_deviceFunctions->vkCmdSetViewport(cmdBuffer, 0, 1, &viewport);
    
        VkRect2D scissor;
        scissor.offset.x = scissor.offset.y = 0;
        scissor.extent.width = imageSize.width();
        scissor.extent.height = imageSize.height();
        m_deviceFunctions->vkCmdSetScissor(cmdBuffer, 0, 1, &scissor);
    
        QMatrix4x4 objectMatrix;
        objectMatrix.translate(object->translation.x(), object->translation.y(), object->translation.z());
        objectMatrix.rotate(object->rotation.x(), 1, 0, 0);
        objectMatrix.rotate(object->rotation.y(), 0, 1, 0);
        objectMatrix.rotate(object->rotation.z(), 0, 0, 1);
        objectMatrix *= object->model->transformation;
    
    
        m_deviceFunctions->vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineHandlers.pipeline);
    
        // pushing view-projection matrix to constants
        m_deviceFunctions->vkCmdPushConstants(cmdBuffer, pipelineHandlers.pipelineLayout, VK_SHADER_STAGE_VERTEX_BIT, 0, 64, m_world.constData());
    
        const int nodesCount = object->model->nodes.size();
        for (int n = 0; n < nodesCount; ++n) {
            const Node &node = object->model->nodes.at(n);
            const uint32_t frameUniSize = nodesCount * object->uniformAllocSize;
            const uint32_t frameUniOffset = m_window->currentFrame() * frameUniSize + n * object->uniformAllocSize;
            m_deviceFunctions->vkCmdBindDescriptorSets(cmdBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineHandlers.pipelineLayout, 0, 1,
                                                       &object->descSet, 1, &frameUniOffset);
    
            // mapping uniform buffer to update matrix
            quint8 *p;
            res = m_deviceFunctions->vkMapMemory(device, object->bufferMemory, object->uniformBufferOffset + frameUniOffset,
                                                          MATRIX_4x4_SIZE, 0, reinterpret_cast<void **>(&p));
            if (res != VK_SUCCESS)
                qFatal("Failed to map memory: %d", res);
    
            QMatrix4x4 nodeMatrix = objectMatrix * node.transformation;
            memcpy(p, nodeMatrix.constData(), 16 * sizeof(float)); //updating matrix
    
            m_deviceFunctions->vkUnmapMemory(device, object->bufferMemory);
    
            // drawing meshes
            for (const int i: qAsConst(node.meshes)) {
                const Mesh &mesh = object->model->meshes.at(i);
                VkDeviceSize vbOffset = mesh.vertexOffsetBytes();
                m_deviceFunctions->vkCmdBindVertexBuffers(cmdBuffer, 0, 1, &object->vertexBuffer, &vbOffset);
                m_deviceFunctions->vkCmdBindIndexBuffer(cmdBuffer, object->vertexBuffer, object->indexBufferOffset + mesh.indexOffsetBytes(), VK_INDEX_TYPE_UINT32);
    
                m_deviceFunctions->vkCmdDrawIndexed(cmdBuffer, mesh.indexCount, 1, 0, 0, 0);
            }
        }
    
        m_deviceFunctions->vkEndCommandBuffer(cmdBuffer);
    
        return cmdBuffer;
    }
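
    One detail worth noting in drawObject() is how the dynamic uniform-buffer offset is computed: each frame in flight owns a block of per-node matrices, and the offset selects the matrix for the current frame and node. The small standalone illustration below mirrors that arithmetic; the values in main() are arbitrary examples, not taken from the sample.

    #include <cstdint>
    #include <cstdio>

    // Mirrors the offset arithmetic used in Renderer::drawObject().
    uint32_t frameUniformOffset(uint32_t currentFrame, uint32_t nodesCount,
                                uint32_t nodeIndex, uint32_t uniformAllocSize)
    {
        const uint32_t frameUniSize = nodesCount * uniformAllocSize; // bytes per frame in flight
        return currentFrame * frameUniSize + nodeIndex * uniformAllocSize;
    }

    int main()
    {
        // Example: frame index 1, a model with 4 nodes, node 2, and a
        // 256-byte aligned uniform allocation per matrix.
        std::printf("%u\n", frameUniformOffset(1, 4, 2, 256)); // prints 1536
    }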
    

    The completed secondary command buffer is reported back to the GUI thread, and its commands are executed on the primary buffer (unless frame rendering has been canceled):

    Renderer.cpp (line 31-38):
    QObject::connect(&m_renderWatcher, &QFutureWatcher<VkCommandBuffer>::resultReadyAt, [this](int index) {
           // secondary command buffer of some object is ready
           if (m_framePreparing) {
               const VkCommandBuffer & cmdBuf = m_renderWatcher.resultAt(index);
               if (cmdBuf)
                   this->m_deviceFunctions->vkCmdExecuteCommands(this->m_window->currentCommandBuffer(), 1, &cmdBuf);
           }
       });
    ...
    

    Another major challenge in developing the renderer came in handling the two types of graphical objects—those loaded from files and the dynamically generated bounding box that surrounds the selected object. They pose a problem because they use different shaders, primitive topologies, and polygon modes. The goal was to unify the code for the different object types as much as possible and avoid replicating similar logic. Both types are expressed by a single class, Object3D.
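
    As a reading aid, here is what Object3D appears to contain, inferred purely from the members the renderer code above touches (role, model, vertexBuffer, cmdBuffer, descSet, and so on). It is a reconstruction for illustration only; the actual class in the sample may declare additional or differently typed members, and the BoundaryBox enumerator name is an assumption.

    #include <vulkan/vulkan.h>
    #include <QSharedPointer>
    #include <QVector3D>
    #include <QVulkanWindow>

    class Model; // the sample's converted-aiScene geometry (Model.cpp/h)

    // Reconstructed sketch of the per-object data handed to the renderer.
    struct Object3D
    {
        enum Role { Object, BoundaryBox }; // BoundaryBox name assumed
        Role role = Object;

        QSharedPointer<Model> model; // geometry converted from an aiScene

        // Per-object GPU resources (one secondary command buffer per frame in flight).
        VkBuffer        vertexBuffer        = VK_NULL_HANDLE;
        VkDeviceMemory  bufferMemory        = VK_NULL_HANDLE;
        VkDeviceSize    indexBufferOffset   = 0;
        VkDeviceSize    uniformBufferOffset = 0;
        uint32_t        uniformAllocSize    = 0;
        VkDescriptorSet descSet             = VK_NULL_HANDLE;
        VkCommandBuffer cmdBuffer[QVulkanWindow::MAX_CONCURRENT_FRAME_COUNT] = {};

        // UI-driven transform applied in drawObject().
        QVector3D translation;
        QVector3D rotation;
    };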

    In the "Renderer::initPipelines()" function, differences were isolated as function parameters and called in this way:

    initPipeline(m_objectPipeline,
                 QStringLiteral(":/shaders/item.vert.spv"),
                 QStringLiteral(":/shaders/item.frag.spv"),
                 VK_PRIMITIVE_TOPOLOGY_TRIANGLE_LIST,
                 m_wireframeMode ? VK_POLYGON_MODE_LINE : VK_POLYGON_MODE_FILL);

    initPipeline(m_boundaryBoxPipeline,
                 QStringLiteral(":/shaders/selection.vert.spv"),
                 QStringLiteral(":/shaders/selection.frag.spv"),
                 VK_PRIMITIVE_TOPOLOGY_LINE_LIST,
                 VK_POLYGON_MODE_LINE);
    

    It also proved helpful to unify initialization of particular objects according to their role. This is handled by the "Renderer::initObject()" function:

    const PipelineHandlers & pipelineHandlers = object->role == Object3D::Object ? m_objectPipeline : m_boundaryBoxPipeline;

    "Function: Renderer::initPipeline()" shows the full function. Note that in addition to object files, the boundary box is another type of object and requires its own shaders, settings, and pipeline. Minimizing coding differences between object types also helped to improve flexibility and simplify the code.

    void Renderer::initPipeline(PipelineHandlers & pipeline, const QString & vertShaderPath, const QString & fragShaderPath,
                                VkPrimitiveTopology topology, VkPolygonMode polygonMode)
    {
        VkDevice device = m_window->device();
        VkResult res;
        VkVertexInputBindingDescription vertexBindingDesc = {
            0, // binding
            6 * sizeof(float), //x,y,z,nx,ny,nz
            VK_VERTEX_INPUT_RATE_VERTEX
        };
    
        VkVertexInputAttributeDescription vertexAttrDesc[] = {
            { // vertex
                0,
                0,
                VK_FORMAT_R32G32B32_SFLOAT,
                0
            },
            { // normal
                1,
                0,
                VK_FORMAT_R32G32B32_SFLOAT,
                3 * sizeof(float) // normal follows the x,y,z position
            }
        };
    
    
        VkPipelineVertexInputStateCreateInfo vertexInputInfo = {};
        vertexInputInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_VERTEX_INPUT_STATE_CREATE_INFO;
        vertexInputInfo.vertexBindingDescriptionCount = 1;
        vertexInputInfo.pVertexBindingDescriptions = &vertexBindingDesc;
        vertexInputInfo.vertexAttributeDescriptionCount = 2;
        vertexInputInfo.pVertexAttributeDescriptions = vertexAttrDesc;
    
    
        VkDescriptorSetLayoutBinding layoutBinding = {};
        layoutBinding.descriptorType = VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER_DYNAMIC;
        layoutBinding.descriptorCount = 1;
        layoutBinding.stageFlags =  VK_SHADER_STAGE_VERTEX_BIT;
    
        VkDescriptorSetLayoutCreateInfo descLayoutInfo = {
            VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO,
            nullptr,
            0,
            1,
            &layoutBinding
        };
    
        //! View-projection matrix to be pushed to the vertex shader push constants.
        VkPushConstantRange push_constant = {
                VK_SHADER_STAGE_VERTEX_BIT,
                0,
                64
            };
    
        res = m_deviceFunctions->vkCreateDescriptorSetLayout(device, &descLayoutInfo, nullptr, &pipeline.descSetLayout);
        if (res != VK_SUCCESS)
            qFatal("Failed to create descriptor set layout: %d", res);
    
    
        // Pipeline layout
        VkPipelineLayoutCreateInfo pipelineLayoutInfo = {};
        pipelineLayoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
        pipelineLayoutInfo.setLayoutCount = 1;
        pipelineLayoutInfo.pSetLayouts = &pipeline.descSetLayout;
        pipelineLayoutInfo.pushConstantRangeCount = 1;
        pipelineLayoutInfo.pPushConstantRanges = &push_constant;
    
        res = m_deviceFunctions->vkCreatePipelineLayout(device, &pipelineLayoutInfo, nullptr, &pipeline.pipelineLayout);
        if (res != VK_SUCCESS)
            qFatal("Failed to create pipeline layout: %d", res);
    
        // Shaders
        VkShaderModule vertShaderModule = loadShader(vertShaderPath);
        VkShaderModule fragShaderModule = loadShader(fragShaderPath);
    
        // Graphics pipeline
        VkGraphicsPipelineCreateInfo pipelineInfo;
        memset(&pipelineInfo, 0, sizeof(pipelineInfo));
        pipelineInfo.sType = VK_STRUCTURE_TYPE_GRAPHICS_PIPELINE_CREATE_INFO;
    
        VkPipelineShaderStageCreateInfo shaderStageCreationInfo[2] = {
            {
                VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
                nullptr,
                0,
                VK_SHADER_STAGE_VERTEX_BIT,
                vertShaderModule,
                "main",
                nullptr
            },
            {
                VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO,
                nullptr,
                0,
                VK_SHADER_STAGE_FRAGMENT_BIT,
                fragShaderModule,
                "main",
                nullptr
            }
        };
        pipelineInfo.stageCount = 2;
        pipelineInfo.pStages = shaderStageCreationInfo;
    
        pipelineInfo.pVertexInputState = &vertexInputInfo;
    
        VkPipelineInputAssemblyStateCreateInfo inputAssemblyInfo = {};
        inputAssemblyInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_INPUT_ASSEMBLY_STATE_CREATE_INFO;
        inputAssemblyInfo.topology = topology;
        pipelineInfo.pInputAssemblyState = &inputAssemblyInfo;
    
        VkPipelineViewportStateCreateInfo viewportInfo = {};
        viewportInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_VIEWPORT_STATE_CREATE_INFO;
        viewportInfo.viewportCount = 1;
        viewportInfo.scissorCount = 1;
        pipelineInfo.pViewportState = &viewportInfo;
    
        VkPipelineRasterizationStateCreateInfo rasterizationInfo = {};
        rasterizationInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_RASTERIZATION_STATE_CREATE_INFO;
        rasterizationInfo.polygonMode = polygonMode;
        rasterizationInfo.cullMode = VK_CULL_MODE_NONE;
        rasterizationInfo.frontFace = VK_FRONT_FACE_COUNTER_CLOCKWISE;
        rasterizationInfo.lineWidth = 1.0f;
        pipelineInfo.pRasterizationState = &rasterizationInfo;
    
        VkPipelineMultisampleStateCreateInfo multisampleInfo = {};
        multisampleInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_MULTISAMPLE_STATE_CREATE_INFO;
        multisampleInfo.rasterizationSamples = m_window->sampleCountFlagBits();
        pipelineInfo.pMultisampleState = &multisampleInfo;
    
        VkPipelineDepthStencilStateCreateInfo depthStencilInfo = {};
        depthStencilInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_DEPTH_STENCIL_STATE_CREATE_INFO;
        depthStencilInfo.depthTestEnable = VK_TRUE;
        depthStencilInfo.depthWriteEnable = VK_TRUE;
        depthStencilInfo.depthCompareOp = VK_COMPARE_OP_LESS_OR_EQUAL;
        pipelineInfo.pDepthStencilState = &depthStencilInfo;
    
        VkPipelineColorBlendStateCreateInfo colorBlendInfo  = {};
        colorBlendInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_COLOR_BLEND_STATE_CREATE_INFO;
        VkPipelineColorBlendAttachmentState att = {};
        att.colorWriteMask = 0xF;
        colorBlendInfo.attachmentCount = 1;
        colorBlendInfo.pAttachments = &att;
        pipelineInfo.pColorBlendState = &colorBlendInfo;
    
        VkDynamicState dynamicEnable[] = { VK_DYNAMIC_STATE_VIEWPORT, VK_DYNAMIC_STATE_SCISSOR };
        VkPipelineDynamicStateCreateInfo dynamicInfo = {};
        dynamicInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_DYNAMIC_STATE_CREATE_INFO;
        dynamicInfo.dynamicStateCount = 2;
        dynamicInfo.pDynamicStates = dynamicEnable;
        pipelineInfo.pDynamicState = &dynamicInfo;
    
        pipelineInfo.layout = pipeline.pipelineLayout;
        pipelineInfo.renderPass = m_window->defaultRenderPass();
    
        res = m_deviceFunctions->vkCreateGraphicsPipelines(device, m_pipelineCache, 1, &pipelineInfo, nullptr, &pipeline.pipeline);
        if (res != VK_SUCCESS)
            qFatal("Failed to create graphics pipeline: %d", res);
    
        if (vertShaderModule)
            m_deviceFunctions->vkDestroyShaderModule(device, vertShaderModule, nullptr);
        if (fragShaderModule)
            m_deviceFunctions->vkDestroyShaderModule(device, fragShaderModule, nullptr);
    }
    


    Conclusion

    Coding flexibility is a hallmark of low-level Vulkan APIs, but it's critical to remain focused on what's going on in each Vulkan step. Lower-level programming also allows for precise fine-tuning of certain aspects of hardware access not available with OpenGL. If you take it slow and build your project in small, incremental steps, the payoffs will include far greater rendering performance, much lower runtime footprint, and greater portability to a multitude of devices and platforms.

    Pros and indies alike should prepare for Vulkan. This article provided a walk-through of an app that shows how to use Vulkan APIs to render multiple .fbx and .obj objects, reading and displaying multiple object files in a common scene. You've also seen how to integrate a file explorer window to load and render files using linear or parallel processing and how to compare the performance of each in the UI. The code also demonstrates a simple UI to move, rotate, and zoom objects; enclose the selected object in a bounding box; render objects in wireframe mode; display object info and stats; and specify absolute coordinates and rotations.


    APPENDIX: How to Build the Project

    As described earlier, the minimum requirement for developing with Vulkan APIs on GPUs from Intel is a 6th Gen Intel® processor running 64-bit Windows 7, 8.1, or 10. Vulkan drivers are now included with the latest Intel HD Graphics drivers. Follow the step-by-step instructions for installing Vulkan drivers on Intel-based systems running Unity or Unreal Engine 4, and then return here.

    The following steps are for building this project using Microsoft Visual Studio* 2017 from a Windows command prompt.

    Preparing the build environment

    1. Download the Vulkan 3D Object Viewer sample code project to a convenient folder on your hard drive.

    2. Make sure your Microsoft Visual Studio 2017 setup has Visual C++. If it doesn't, download and install it from Visual Studio site.

    3. The sample app relies on the Open Asset Import Library (assimp), but the pre-built version of this library doesn't work with Visual Studio 2017; it has to be re-built from scratch. Download it from SourceForge*.

    4. CMake is the preferred build system for assimp. You can download the latest version from cmake.org or use the one bundled with Visual Studio (YOUR_PATH_TO_MSVS\2017\Community\Common7\IDE\CommonExtensions\Microsoft\CMake\CMake\bin). Follow these steps to build assimp:

    a. Open a command prompt (cmd.exe).

    b. Set "PATH=PATH_TO_CMAKE\bin;%PATH%" (skip this step If you already set this variable permanently in your system environment variables. To do that, go to: Control Panel->System->Advanced System Settings->Environment Variables and add the line above to the list).

    c. Enter "cmake -f CMakeLists.txt -G "Visual Studio 15 2017 Win64".

    d. Open the generated "assimp.sln" solution file in Visual Studio, go to Build->Configuration Manager and select "Release" under Configuration (unless you need to debug assimp for some reason, building the release version is recommended for the best performance).

    e. Close the configuration manager and build assimp.

    5. Download and install the Vulkan SDK from the LunarG Vulkan SDK site.

    6. Download and install Qt. The sample app uses Qt 5.10 UI libraries, which is the minimum version required for Vulkan support. Open-source and commercial versions will do the job here, but you'll need to register either way. To get Qt:

    a. Go to qt.io and select a version.

    b. Log in or register and follow prompts to set up the Qt Online Installer.

    c. Next, you'll be prompted to select a version. Pick Qt 5.10 or higher and follow prompts to install.

    7. Clone or download the sample app repository to your hard drive.

    Building the app

    8. The file "env_setup.bat" is provided to help you set environment variables locally for the command processor. Before executing it:

    a. Open "env_setup.bat" and check whether listed variables point to the correct locations of your installed dependencies:

    I. "_VC_VARS"—path to Visual Studio environment setup vcvarsall.bat

    II. "_QTDIR"—path to Qt root

    III. "_VULKAN_SDK"—Vulkan SDK root

    IV. "_ASSIMP"—assimp root

    V. "_ASSIMP_BIN"—path to Release or Debug configuration of binaries

    VI. "_ASSIMP_INC"—path to assimp's header files

    VII. "_ASSIMP_LIB"—points to Release or Debug configuration of assimp lib

    b. Output from the batch file will report any paths you might have missed.

    c. Alternatively, add the following to the system's (permanent) environment variables:

    I. Create new variables:

    1. "_QTDIR"—path to Qt root

    2. "_VULKAN_SDK"—Vulkan SDK root

    3. "_ASSIMP"—assimp root

    II. Add to variable "PATH" values:

    1. %_QTDIR%\bin

    2. %_VULKAN_SDK%\bin

    3. %_ASSIMP%\bin

    III. Create the system variable "LIB" if it doesn't exist and add the value: %_ASSIMP%\lib

    IV. Create the system variable "INCLUDE" if it doesn't exist and add the values:

    1. %_VULKAN_SDK%\Include

    2. %_ASSIMP%\Include

    d. At the command prompt, set the current directory to the project root folder (which contains the downloaded project).

    e. Run qmake.exe.

    f. Start build:

    I. For release: nmake -f Makefile.Release

    II. For debug: nmake -f Makefile.Debug

    9. Run app:

    a. For release: WORK_DIR\release\model-viewer-using-Vulkan.exe

    b. For debug: WORK_DIR\debug\model-viewer-using-Vulkan.exe

    10. Execute the newly built Vulkan object viewer app.

    11. Select the number of threads to use or check "single thread." By default, the app selects the optimal number of threads based on logical cores in the host system.

    12. Click "Open models..." to load some models with a selected number of threads. Then change the number of threads and click "Reload" to load the same models with new thread settings for comparison.
