
Intel® Parallel Computing Center at Brigham & Women’s Hospital


Principal Investigator

""Dr. Patsopoulos the past few years has been leading the genetics of multiple sclerosis. He has analyzed the raw genetic data of more than 100,000 individuals, modelling TBs of data to unravel the genetic architecture of multiple sclerosis. He has been applying and developing advanced statistical models to enable analysis of large-scale data sets with millions of genetic positions and analyzed subjects.  

Description

Leveraging our hands-on experience with large-scale genetic data sets and the exhaustive number of analyses they support, we have designed a framework for fine-mapping. Modern genetics studies involve millions of analyzed positions in the genome, most of which are linked together. Fine-mapping is the application of algorithms to identify statistically independent positions in the genome that contribute to disease susceptibility. We have developed Effect Fine Mapping (EFM), a framework that not only identifies independent positions but also quantifies the probability that any of the linked genetic variants is the one truly associated with the disease. This empowers translational studies of genetic associations by providing highly accurate lists of disease-associated genetic variants. EFM can analyze millions of genetic variants and millions of subjects on any production machine, and is optimized for multi-threaded CPUs such as Intel® Xeon Phi™ processors.
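To make the probability idea concrete, here is a minimal, hedged sketch (not the EFM code base, and not necessarily its algorithm) of the standard single-causal-variant calculation that fine-mapping frameworks commonly build on: given a per-variant Bayes factor for association, the posterior probability that a given variant is the truly associated one is its Bayes factor divided by the sum over all variants in the region. The OpenMP pragmas illustrate how such a loop can be spread across a multi-threaded CPU; the function name, input format, and toy values are assumptions for illustration only.

// Hypothetical sketch: posterior probability that each variant in a region is
// the causal one, assuming a single causal variant and per-variant (natural)
// log Bayes factors as input. Illustrative only; this is not EFM.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

std::vector<double> posterior_probabilities(const std::vector<double>& log_bf)
{
    if (log_bf.empty()) return {};
    const long long n = static_cast<long long>(log_bf.size());
    std::vector<double> posterior(log_bf.size(), 0.0);

    // Subtract the maximum log Bayes factor before exponentiating
    // (log-sum-exp trick) for numerical stability.
    double max_lbf = log_bf[0];
    for (long long i = 1; i < n; ++i)
        if (log_bf[i] > max_lbf) max_lbf = log_bf[i];

    double total = 0.0;
    #pragma omp parallel for reduction(+ : total)
    for (long long i = 0; i < n; ++i)
    {
        posterior[i] = std::exp(log_bf[i] - max_lbf);
        total += posterior[i];
    }

    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        posterior[i] /= total;  // normalize so the probabilities sum to 1

    return posterior;
}

int main()
{
    // Toy region with four linked variants (made-up log Bayes factors).
    const std::vector<double> lbf = {2.1, 5.4, 5.2, 0.3};
    for (double p : posterior_probabilities(lbf))
        std::cout << p << '\n';
    return 0;
}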

Publications

List of peer-reviewed publications: http://patslab.bwh.harvard.edu/publications-full-list

Preprints: https://www.biorxiv.org/search/author1%3Apatsopoulos%20numresults%3A10%20sort%3Arelevance-rank%20format_result%3Astandard

Related Websites

Laboratory: http://patslab.bwh.harvard.edu

Software supported by IPCC: http://patslab.bwh.harvard.edu/efm

Software supported by IPCC (git): https://bitbucket.org/patslab/efm


Using Google Blocks* for Prototyping Assets in VR


House render

This article discusses how to use Google Blocks* to quickly model things in VR, improving the workflow for your virtual reality (VR) projects. Before we delve into using the software, let’s look at a typical workflow.

Workflows

Typical workflow for VR development

A typical VR project is an iterative combination of code and assets. Usually there’s a period of preproduction to concept interactivity and assets (via sketches). However, once production begins, you often face many starts and stops, because the code-development side must wait for finished assets and vice versa.

This situation is made worse when developing for VR, because VR’s unique visual perspective creates issues of scale, placement, and detail that don’t arise in 3D projects presented on a 2D monitor. So you must deal with more back and forth on asset development to get ideas right. You’re also often left with low-quality visual presentation in the prototyping stages, which can hinder attracting funding or interest.

Benefits of using VR to prototype

Creating prototype model assets in VR itself creates a much smoother flow for both code and asset developers. This approach allows code developers to more quickly design and adjust the rough versions of what they need, have playable objects, and provide reference material to asset developers, compared to what sketches alone can provide. Modeling normally requires specialized skills in using many different tools and camera perspectives in order to work on a 3D object using 2D interfaces, lending precision and detail but at the cost of heavy workloads.

On the other hand, modeling in VR lets you work in the natural 3D environment by actually moving and shaping in the real-world, room-scale environment. In this way, 3D modeling is more like working with clay than it is adjusting vertices in a modeling app. This approach is not only much faster when creating low-detail prototype assets, but also much more accessible to people with no modeling skill whatsoever.

Benefits of Google Blocks low poly models

Google Blocks provides a simple “build by creating and modifying primitives” approach for VR and allows quick volumetric “sketching.” This combination of modified shapes and simple colored surfaces also lends itself to an aesthetic style that is clean-looking and low poly. This means much better performance, which is going to be extremely useful during the prototyping stage where performance will not yet be optimized. The clean look can even be used as-is for a fairly good-looking presentation.

Revised workflow for VR development

The new workflow for production shifts from start and stop “waterfall” development to one where the code team can provide prototypes of anything they need during the prototyping stage without waiting on the asset team. This approach allows the asset team to take the prototype models and preproduction sketches and develop finished assets that can be simply swapped into the working prototype that the code team has already put in place.

It’s easy to think you can just use primitives within a development tool like Unity* software to do all the prototype “blocking” you need, but the lack of actual rough models can lead to difficulty in developing proper interactions and testing. You will often find your progress hindered when trying to build things out using cubes, spheres, and cylinders. With the new workflow, you can quickly obtain shapes closer to finished assets and provide additional direction for the development of finalized assets as a result of what you learn during interaction development.

Tools Overview

Let’s lay out the tools we’ll use for development. All of them, except HTC Vive*, are free. As mentioned, we’ll use the Google Blocks app to build models in VR using HTC Vive. Next we’ll export and share the models using the Google Poly* cloud service. Then we’ll use the models in Unity software for actual development. Finally, we’ll work with the models in Blender* for finalized asset development. You can also swap Blender with Maya*, or with whatever your preferred 3D modeling app is, and replace Unity with the Unreal Engine* (also free-ish).

Using Google Blocks

Because we will be using HTC Vive for our VR head-mounted display (HMD), first download Steam*, and then install Blocks from here:
http://store.steampowered.com/app/533970/Blocks_by_Google/

When you start Blocks for the first time, you’ll get a tutorial for making an ice cream cone with all the fixings:
https://www.youtube.com/watch?v=kTCcM5sRz74&feature=youtu.be

This tutorial is a good introduction to using the basics in Blocks.

  • Moving your object
  • Creating and scaling shapes
  • Painting objects
  • Selecting and copying
  • Erasing

You have many more tools and options at your disposal through the simple interface provided. Here’s an overview of all of them:
https://youtu.be/41IljbEcGzQ

Tools list

  • Shapes: Cone, Sphere, Cube, Cylinder, Torus
  • Strokes: Shapes that are three-sided or more
  • Paintbrush: Color options on back side of the Palette, Paint Objects, or Faces
  • Hand: Select, Grab, Scale, Copy, Flip, Group/Ungroup
  • Modify: Reshape, Subdivide, Extrude
  • Eraser: Object, Face

Palette controls

  • Tutorial, Grid, Save/Publish, New scene, Colors

Extra controls

  • Single Grip button to move the scene, both grip buttons to zoom/rotate
  • Left/Right on the Left Trackpad to Undo/Redo
  • Left Trigger to create symmetrically
  • Right Trigger to place things

Files (left controller menu button)

  • Yours, Featured, Liked, Environments

There is also an option you need to initiate with your mouse on the desktop: importing a reference image (or images) that you can place in the scene. This is a great way to go from preproduction sketches to prototyping without having to rely on memory. To import an image, click the Add reference image button in the top center of the screen on your desktop.

To import an image

Google Poly

Before we go any further into modeling in Blocks, let’s take a look at Poly, Google’s online “warehouse” of published models: https://youtu.be/qnY4foiOEgc

Poly incorporates published works from both Tilt Brush* and Blocks, but you can browse specifically for either by selecting Blocks from the sidebar. Take a moment to browse and search to see the kinds of things you can make using shapes and colors without any textures. As you browse, be sure to click Like on models you might want to use, which makes them easy to find inside Blocks.

Like inside Blocks

Also be sure to note which models are under a sharing or editing license. You can only modify and publish changes to a model if it is marked remixable. Currently any models marked remixable require you to also credit the author, so be sure to do so if you use any remixable models as a base for something in an actual project.

Remix content

Now let’s take a look at the specific Blocks model you’ll be using as a base to build content for: https://poly.google.com/view/bvLXsDt9mww

Render

After you’ve Liked it, to load it easily from inside Blocks, click the menu button on the left controller, and then click the heart option to see your Liked models. Then select the house and press Insert.

Blocks

Starting out, the house will be scaled like a diorama. To make it bigger, grab it with the grab tool (the hand) and then press and hold Up on the right Trackpad (the +) to scale up. Once the size is what you want, you can use the controller grips to move and rotate it.

Rendering performance may suffer depending on your computer due to the complexity of the model, but this shows you an example of how such a complex scene is composed and lets you even modify it for your own purposes. Once we start using Unity software we will be using a prefab version I’ve modified to reduce complexity and add collision boxes, saving time. Now let’s set up the Unity software with VR implemented so we have a place to put what we make.

Unity* Software Project Setup

Importing plug-ins

First we’ll create a new project (save it wherever you like). Next we’ll need to import the SteamVR* plug-in to support Vive: https://assetstore.unity.com/search?q=steamvr

SteamVR

You may have to tweak your project settings to meet its recommendations. Click Accept all to do this the first time. It may ask you again after this, unless you’ve modified your project settings. Just close the dialog in the future.

Next we’ll grab a great toolkit called VRTK, which makes movement and object interactions easier: https://assetstore.unity.com/search/?q=vrtk

We won’t cover the details of how to use the VRTK plug-in for this tutorial, but we will be using a sample scene to handle some things like touchpad walking. The VRTK plug-in is fantastic for making quick interactions available to the things you make in Blocks. To find out more about using VRTK, watch this video:

VRTK virtual reality toolkit

We’ll use the sample scene “017_CameraRig_TouchpadWalking” as a quick starting scene, so load that scene up and delete everything except the following:

  • Directional Light
  • Floor
  • [VRTK_SDKManager]
  • [VRTK_Scripts]

Sample scene

Next, scale the X/Z of the Floor larger to give you more space to walk around and place things.

scale the X/Z

Importing the prefab

Grab the prefab of the house model we looked at earlier to import into your scene: https://www.dropbox.com/s/87l8k23pc3h4a40/house.unitypackage?dl=0

Be sure to move the house to be pretty flush with the ground:

move the house

Then you should be able to run the project, put on your HMD, and walk around the scene using the touchpad to move. Alternatively, if touchpad movement is uncomfortable for you, you could import this into a different example scene that uses another form of locomotion, such as teleporting (all the example scenes are labeled).

If you want to skip these set-up steps and go directly to importing the house, you can download a copy of the fully set-up Unity software project here: https://www.dropbox.com/s/cehr2wxhi6nmh6c/blocks.zip?dl=0

Right now, aside from colliding with objects, you can’t interact with anything as you move through the scene. To see the original FBX that came from Poly, open model 1 in the Test Objects folder. Now you have a nice little VR scene to start adding things to.

Building Objects

Now that the scene is set up, we have to decide what to prototype to add to it. Depending on what we want to add, there are two approaches: build something from scratch or remix something from Poly. For demonstration purposes, we’ll do both.

Let’s say we want to turn the house into a kind of cooking and training simulator, and we need more interactive props with shapes that are a little more complex than primitives can offer. We’ll start with a simple remix of two components to create a frying pan with a lid.

Combining objects

We’ll use the following two remixable models to mix into one, so be sure to Like them:

Let’s combine them: https://youtu.be/SbjSs_rcFbk

You might be wondering why we couldn’t just bring both objects into Unity software and do the mixing of objects there. This highlights the importance of the two download options you can get from Poly: OBJ and FBX. When you download an OBJ, the entire model is a single mesh, but with an FBX the model is a group of all the individual shape pieces.

This makes a difference in how you can use the model within the Unity software. Having the entire mesh as a single object can be useful when adding mesh colliders and setting up objects to be interactive via VRTK. The two models we are remixing are available only as OBJ files, so we can’t modify the individual parts within the Unity software. However, our new model will be available as both (sometimes it takes time for the OBJ option to show up under download).

Now let’s download the FBX and import it into the Unity software using drag and drop, as we would with any other file, and then we’ll drop the object into the scene.

Testing the scene

Once you’ve got the object placed, click play to hop into the scene and check the scale. If the scale doesn’t look right, select the whole model group to adjust it. Now you have a simple object to use for interactive prototyping with VRTK that will be much more effective than using a couple of unmodified primitives. You also have a great starting point for refining the model or adding extra details like textures.

What’s really cool is that you could also have inserted the house object into Blocks first and modeled the pan (or any other accessory) while having the full house for reference, and then deleted the single house object before saving and publishing, without even having to go into the Unity software.

Creating a model

Now let’s look at making a quick, simple model—a cheese grater—from scratch. This model is more than a simple primitive, but not overly complex. We’ll also make a block of cheese with the kind of crumbly edge you get when you slice it. You’ll notice that I used the menu button on the right wand to group and ungroup objects for easier manipulation, and I also used the modify tool for both edges and faces. See Video:

Because you can import this model as an FBX into the Unity software, you can easily separate the cheese from the grater for different interactions but do it as a single import.

If you prefer using Unreal Engine over the Unity software, first read about FBX workflows with that engine: https://docs.unrealengine.com/latest/INT/Engine/Content/FBX/index.html

Working with a Modeler

Importing into Blender

Once you have the FBX or OBJ files you or a modeler want to work with in a full modeling package, you can import them into a free program such as Blender.

Blender

From there you can edit the model, animate it, or perform other operations.

Exporting for Poly

If you want to be able to share the model using Poly again, you can export it as an OBJ (with an MTL materials file).

model using Poly

Next, click the upload OBJ button on the Poly site.

button on the Poly site

Finally, drag and drop the .obj and .mtl files onto the page, and then click Publish to publish the model.

drag and drop the .obj

The disadvantage, however, is that the model is a single mesh OBJ, and you also can’t use it in Blocks to remix or view, so it’s useful mostly as a way to quickly share a full model. But this can also be a great way for a modeler to show you the work in progress (as well as allow you to download the OBJ for testing in the Unity software). Keep in mind that files uploaded this way won’t show up in Blocks even if you Like them. So pay attention to objects you Like to see if they say “Uploaded OBJ File,” because those won’t be usable in Blocks.

Review

Let’s review what we covered in this article.

  • Take preproduction reference images into Blocks when you have them.
  • Quickly “sketch” usable models in Blocks.
    • Use remixable models as starting points when it makes sense.
    • You can bring in other models to use for reference or scaling purposes.
  • Publish your models to Poly (you can choose to make them unlisted).
  • Download the OBJ or FBX, depending on your needs.
  • Import the model into Unity software for prototyping.
  • Share the Poly page with a modeler so they can modify the OBJ and FBX or simply use it as a reference along with the preproduction sketch and even screenshots (or a playable version) of the Unity software scene to begin developing a finalized asset in a tool like Blender.
  • The modeler can also use Poly as a way to provide you with quick previews that you can download and insert in your scene.
  • Rinse and repeat to quickly build your commercial or gaming project!

In the future, Google Blocks may also incorporate animation (see this article: https://vrscout.com/news/animating-vr-characters-google-blocks/), so watch for that to make your future workflow even more awesome.

Intel Software Engineers Assist with Unreal* Engine 4.19 Optimizations


Unreal Engine Logo

The release of Epic’s Unreal* Engine 4.19 marks a new chapter in optimizing for Intel technology, particularly for multicore CPUs. Game engines have traditionally followed console design points in terms of graphics features and performance, and most games weren’t optimized for the CPU, which can leave a lot of PC performance sitting idle. Intel’s work with Unreal Engine 4 seeks to unlock the potential of games as soon as developers work in the engine, taking full advantage of all the extra CPU computing power that a PC platform provides.

Intel's enabling work for Unreal Engine version 4.19 delivered the following:

  • Increased the number of worker threads to match a user’s CPU
  • Increased the throughput of the cloth physics system
  • Integrated support for Intel® VTune™ Amplifier

Each of these advances enables Unreal Engine users to take full advantage of Intel® Architecture and harness the power of multicore systems. Systems such as cloth physics, dynamic fracturing, and CPU particles will all benefit, as will interoperability with Intel tools such as Intel VTune Amplifier and the Intel C++ Compiler. This white paper discusses the key improvements in detail and gives developers more reasons to consider the Unreal Engine for their next PC title.

Unreal* Engine History

Back in 1991, Tim Sweeney founded Epic MegaGames (later dropping the “Mega”) while still a student at the University of Maryland. His first release was ZZT*, a shareware puzzle game. He wrote the game in Turbo Pascal using an object-oriented model, and one of the happy results was that users could actually modify the game’s code. Level editors were already common, but this was a big advance.

In the years that followed, Epic released popular games such as Epic Pinball*, Jill of the Jungle*, and Jazz Jackrabbit*. In 1995, Sweeney began work on a first-person shooter to capitalize on the success of games such as DOOM*, Wolfenstein*, Quake*, and Duke Nukem*. In 1998, Epic released Unreal*, probably the best-looking shooter of its time, offering more detailed graphics and capturing industry attention. Soon, other developers were calling and asking about licensing the Unreal Engine (UE) for their own games.

In an article for IGN in 2010, Sweeney recalled that the team was thrilled by the inquiries, and said their early collaboration with those partners defined the style of their engine business from day one. They continue to use, he explained, “a community-driven approach, and open and direct communication between licensees and our engine team.” By focusing on creating cohesive tools and smoothing out technical hurdles, their goal was always to unleash the creativity of the gaming community. They also provided extensive documentation and support, something early engines often lacked.

Today, the UE powers most of the top revenue-producing titles in the games industry. In an interview with VentureBeat in March 2017, Sweeney said developers have made more than USD 10 billion to date with Unreal games. “We think that Unreal Engine’s market share is double the nearest competitor in revenues,” Sweeney said. “This is despite the fact that Unity* has more users. This is by virtue of the fact that Unreal is focused on the high end. More games in the top 100 on Steam* in revenue are Unreal, more than any other licensable engine competitor combined.”

Intel Collaboration Makes Unreal Engine Better

Game developers who currently license the UE can easily take advantage of the optimizations described here. The work will help them grow market share for their games by broadening the range of available platforms, from laptops and tablets with integrated graphics to high-end desktops with discrete graphics cards. The optimizations will benefit end users on most PC-based systems by ensuring that platforms can deliver high-end effects such as dynamic cloth and interactive physics. In addition, optimized Intel tools will continue to make Intel Architecture a preferred platform.

According to Jeff Rous, Intel Developer Relations Engineer, the teams at Intel and Epic Games have collaborated since the late 1990s. Rous has personally worked on UE optimization for about six years, involving extensive collaboration and vibrant communication with Epic engineers over email and conference calls, as well as visits to Epic headquarters in North Carolina two or three times a year for week-long deep dives. He has worked on specific titles, such as Epic’s own Fortnite* Battle Royale, as well as UE code optimization.

Prior to the current effort, Intel worked closely with Unreal on previous UE4 releases. There is a series of optimization tutorials at the Intel® Developer Zone, starting with the Unreal* Engine 4 Optimization Tutorial, Part 1. The tutorials cover the tools developers can use inside and outside of the engine, as well as some best practices for the editor, and scripting to help increase the frame rate and stability of a project.

Intel® C++ Compiler Enhancements

For UE 4.12, Intel added support for the Intel C++ Compiler to the public engine release. Intel C++ Compilers are standards-based C and C++ tools that speed application performance. They offer seamless compatibility with other popular compilers, development environments, and operating systems, and boost application performance through superior optimizations, single instruction multiple data (SIMD) vectorization, integration with Intel® Performance Libraries, and support for the latest OpenMP* 5.0 parallel programming models.

Scalar and vectorized loop versions

Figure 1: Scalar and vectorized loop versions with Intel® Streaming SIMD Extensions, Intel® Advanced Vector Extensions, and Intel® Advanced Vector Extensions 512.
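As a hedged illustration of the kind of loop Figure 1 refers to (this snippet is an assumption for demonstration, not engine code), the standard #pragma omp simd directive asks a vectorizing compiler such as the Intel C++ Compiler to emit Intel® SSE, Intel® AVX, or Intel® AVX-512 instructions for the loop, processing 4, 8, or 16 single-precision floats per iteration depending on the target instruction set:

// Illustrative scale-and-add loop that a vectorizing compiler can turn into
// SSE, AVX, or AVX-512 code. Function and parameter names are placeholders.
#include <cstddef>

void scale_and_add(float* __restrict out,
                   const float* __restrict a,
                   const float* __restrict b,
                   float scale,
                   std::size_t count)
{
    // The OpenMP SIMD directive requests vectorization of this loop.
    #pragma omp simd
    for (std::size_t i = 0; i < count; ++i)
    {
        out[i] = a[i] * scale + b[i];
    }
}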

Since UE 4.12, Intel has continued to keep the code base up to date, and tests on the Infiltrator workload show significant frame-rate improvements.

Texture compression improvement

UE4 also launched with support for Intel’s fast texture compressor. ISPC stands for Intel® SPMD (single program, multiple data) program compiler; it allows developers to easily target multicore CPUs and new and future instruction sets through a code library. Before the ISPC texture compression library was integrated, ASTC (Adaptive Scalable Texture Compression), the newest and most advanced texture compression format, would often take minutes to compress each texture. On the Sun Temple* demo (part of the UE4 sample scenes pack), the time it took to compress all textures went from 68 minutes to 35 seconds, with better quality than the reference encoder used previously. This allows content developers to build their projects faster, saving hours per week of a typical developer’s time.

Optimizations for UE 4.19

Intel’s work specifically with UE 4.19 offers multiple benefits for developers. At the engine level, optimizations improve scaling mechanisms and tasking. Other work at the engine level ensures that the rendering process isn’t a bottleneck due to CPU utilization.

In addition, the many middleware systems employed by game developers will also benefit from optimizations. Physics, artificial intelligence, lighting, occlusion culling, virtual reality (VR) algorithms, vegetation, audio, and asynchronous computing all stand to benefit.

To help understand the benefits of the changes to the tasking system in 4.19, an overview of the UE threading model is useful.

UE4 threading model

Figure 2 represents time, going from left to right. The game thread runs ahead of everything else, while the render thread is one frame behind the game thread. Whatever is displayed thus runs two frames behind.

Game, render, audio threading model of Unreal Engine 4

Figure 2: Understanding the threading model of Unreal Engine 4.

Physics work is generated on the game thread and executed in parallel. Animation is also evaluated in parallel, a capability used to good effect in the recent VR title Robo Recall*.

The game thread, shown in Figure 3, handles updates for gameplay, animation, physics, networking, and most importantly, actor ticking.

Developers can control the order in which objects tick by using Tick Groups. Tick Groups don’t provide parallelism, but they do allow developers to control dependent behavior to better schedule parallel work. This is vital to ensure that any parallel work does not cause a game thread bottleneck later.

Game thread and related jobs illustration

Figure 3: Game thread and related jobs.
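As a brief, hedged sketch of how a developer selects a Tick Group in game code (the class and function names below are placeholders, not code from any project mentioned in this article), an actor can choose its tick group in its constructor and register a tick prerequisite so that it always ticks after another actor it depends on:

// Illustrative UE4 actor: tick after physics, and only after another actor
// it depends on has finished ticking. The class name is a placeholder.
#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "MyScoreboardActor.generated.h"

UCLASS()
class AMyScoreboardActor : public AActor
{
    GENERATED_BODY()

public:
    AMyScoreboardActor()
    {
        PrimaryActorTick.bCanEverTick = true;
        // Run this actor's Tick after the frame's physics simulation.
        PrimaryActorTick.TickGroup = TG_PostPhysics;
    }

    void DependOn(AActor* Other)
    {
        // Guarantee that Other ticks before this actor does.
        AddTickPrerequisiteActor(Other);
    }
};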

As shown below in Figure 4, the render thread handles generating render commands to send to the GPU. Basically, the scene is traversed, and then command buffers are generated to send to the GPU. The command buffer generation can be done in parallel to decrease the time it takes to generate commands for the whole scene and kick off work sooner to the GPU.

breaking draw calls into chunks

Figure 4: The render thread model relies on breaking draw calls into chunks.

Each frame is broken down into phases that are done one after another. Within each phase, the render thread can go wide to generate the command lists for that phase:

  • Depth prepass
  • Base pass
  • Translucency
  • Velocity

Breaking the frame into chunks enables farming them into worker tasks with a parallel command list that can be filled up with the results of those tasks. Those get serialized back and used to generate draw calls. The engine doesn’t join worker threads at the call site, but instead joins at sync points (end of phases), or at the point where they are used if fast enough.

Audio thread

The main audio thread is analogous to the render thread, and acts as the interface for the lower-level mixing functions by performing the following tasks:

  • Evaluating sound cue graphs
  • Building wave instances
  • Handling attenuation, and so on

The audio thread is the thread that all user-exposed APIs (such as Blueprints and Gameplay) interact with. The decoding and source-worker tasks decode the audio information, and also perform processing such as spatialization and head-related transfer function (HRTF) unpacking. (HRTF is vital for players in VR, as the algorithms allow users to detect differences in sound location and distance.)

The audio hardware thread is a single platform-dependent thread (for example, XAudio2* on Microsoft Windows*), which renders directly to the output hardware and consumes the mix. This isn’t created or managed by UE, but the optimization work will still impact thread usage.

There are two types of tasks—decoding and source worker.

  • Decoding: decodes a block of compressed source files. Uses double buffering to decode compressed audio as it's being played back.
  • Source Worker: performs the actual processing for sources, including sample rate conversion, spatialization (HRTF), and effects. The number of Source Workers is configurable in an INI file.
    • If you have four workers and 32 sources, each will mix eight sources.
    • The Source Worker is highly parallelizable, so you can increase the number if you have more CPU power.

Robo Recall was also the first title to ship with the new audio mixing and threading system in the Unreal Engine. In Robo Recall, for example, the head-related transfer function took up nearly half of the audio time.

CPU worker thread scaling

Prior to UE 4.19, the number of available worker threads on the task graph was limited and did not take Intel® Hyper-Threading Technology into account. This caused a situation on systems with more than six cores where entire cores would sit idle. Correctly creating the right number of worker threads on the task graph (UE’s internal work scheduler) allows content creators to scale visual-enhancing systems such as animation, cloth, destruction, and particles beyond what was possible before.

In UE 4.19, the number of worker threads on the task graph is calculated based on the user’s CPU, up to a current max of 22 per priority level:

// Use all logical processors except two when Intel® Hyper-Threading Technology
// is available; otherwise leave one physical core free for the engine's main threads.
if (NumberOfCoresIncludingHyperthreads > NumberOfCores)
{
    NumberOfThreads = NumberOfCoresIncludingHyperthreads - 2;
}
else
{
    NumberOfThreads = NumberOfCores - 1;
}

The first step in parallel work is to open the door to the possibility that a game can use all of the available cores. This is a fundamental issue to make scaling successful. With the changes in 4.19, content can now do so and take full advantage of enthusiast CPUs through systems such as cloth physics, environment destruction, CPU-based particles, and advanced 3D audio.

Hardware thread utilization

Figure 5: Unreal Engine 4.19 now has the opportunity to utilize all available hardware threads.

In the benchmarking example above, an Intel® Core™ i7-6950X processor system at 3.00 GHz is at full utilization when tested using a synthetic workload.

Destruction benefits

One benefit from better thread utilization in multicore systems is in destruction. Destruction systems use the task graph to simulate dynamic fracturing of meshes into smaller pieces. A typical destruction workload consists of a few seconds of extensive simulation, followed by a return to the baseline. Better CPUs with more cores can keep the pieces around longer, with more fracturing, which greatly enhances realism.

Rous believes there is more that developers can do with destruction and calls it a good target for improved realism with the proper content. “It’s also easy to scale-up destruction, by fracturing meshes more and removing fractured chunks after a longer length of time on a more powerful CPU,” he said. “Since destruction is done through the physics engine on worker threads, the CPU won’t become the rendering bottleneck until quite a few systems are going at once.”

Simulation of dynamic fracturing of meshes

Figure 6: Destruction systems simulate dynamic fracturing of meshes into small pieces.

Cloth System Optimization

Cloth systems are used to add realism to characters and game environments via a dynamic 3D mesh simulation system that responds to the player, wind, or other environmental factors. Typical cloth applications within a game include player capes or flags.

The more realistic the cloth system, the more immersive the gaming experience. Generally speaking, the more cloth systems enabled, the more realistic the scene.

Developers have long struggled with the problem of making cloth systems appear realistic. Without convincing cloth, characters are restricted to tight clothing, and any effect of wind blowing through clothing is lost. Modeling a cloth system, however, has long been a challenge.

Early attempts at cloth systems

According to Donald House at Texas A&M University, the first important computer graphics model for cloth simulation was presented by Jerry Weil in 1986. House and others presented an entire course on “Cloth and Clothing in Computer Graphics,” and described Weil’s work in detail. Weil developed “a purely geometric method for mimicking the drape of fabric suspended at constraint points,” House wrote. There were two phases in Weil’s simulation process. First, geometrically approximate the cloth surface with catenary curves, producing triangles of constraint points. Then, by applying an iterative relaxation process, the surface is smoothed by interpolating the original catenary intersection points. This static draping model could also represent dynamic behavior by applying the full approximation and relaxation process once, and then successively moving the constraint points slightly and reapplying the relaxation phase.
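For reference (a standard formula, not a detail quoted from Weil's paper), a catenary is the curve formed by a chain hanging under its own weight between two suspension points; in a local coordinate frame it can be written as

y(x) = a \cosh\left(\frac{x}{a}\right)

where the parameter a controls how tightly or loosely the curve sags between the two constraint points.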

Around the same time, continuum models emerged that used physically based approaches to cloth behavior modeling. These early models employed continuum representations, modeling cloth as an elastic sheet. The first work in this area is a 1987 master’s thesis by Carl R. Feynman, who superimposed a continuum elastic model on a grid representation. Due to issues with simulation mesh sizes, cloth modeling using continuum techniques has difficulty capturing the complex folding and buckling behavior of real cloth.

Particle models gain traction

Particle models gained relevance in 1992, when David Breen and Donald House developed a non-continuum interacting particle model for cloth drape, which “explicitly represents the micro-mechanical structure of cloth via an interacting particle system,” as House described it. He explained that their model is based on the observation that cloth is “best described as a mechanism of interacting mechanical parts rather than a substance, and derives its macro-scale dynamic properties from the micro-mechanical interaction between threads.” In 1994 it was shown how this model could be used to accurately reproduce the drape of specific materials, and the Breen/House model has been expanded from there. One of the most successful of these models was by Eberhard, Weber, and Strasser in 1996. They used a Lagrangian mechanics reformulation of the basic energy equations suggested in the Breen/House model, resulting in a system of ordinary differential equations from which dynamics could be calculated.
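The step from a Lagrangian reformulation to a system of ordinary differential equations is the standard one from classical mechanics (stated here for context, not quoted from the 1996 paper): with the Lagrangian L = T - V written in the particle coordinates q_i, the Euler-Lagrange equations

\frac{d}{dt}\left(\frac{\partial L}{\partial \dot{q}_i}\right) - \frac{\partial L}{\partial q_i} = 0

are second-order ordinary differential equations in the q_i that can be integrated numerically to animate the cloth.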

The dynamic mesh simulation system is the current popular model. It responds to the player, wind, or other environmental factors, and results in more realistic features such as player capes or flags.

The UE has undergone multiple upgrades to enhance cloth systems; for example, in version 4.16, APEX Cloth* was replaced with NVIDIA’s NvCloth* solver. This low-level clothing solver is responsible for the particle simulation that runs clothing and allows integrations to be lightweight and very extensible, because developers now have direct access to the data.

More triangles, better realism

In UE 4.19, Intel engineers worked with the UE team to optimize the cloth system further to improve throughput. Cloth simulations are treated like other physics objects and run on the task graph’s worker threads. This allows developers to scale content on multicore CPUs and avoid bottlenecks. With the changes, the number of cloth simulations usable in a scene has increased by approximately 30 percent.

Cloth is simulated in every frame, even if the player is not looking at that particular point; simulation results will determine if the cloth system shows up in a player’s view. Cloth simulation uses the CPU about the same amount from frame to frame, assuming more systems aren’t added. It’s easily predictable and developers can tune the amount they’re using to fit the available headroom.

Examples of cloth systems

Figure 7: Examples of cloth systems in the Content Examples project.

For the purposes of the graphs in this document, the cloth actors used have 8,192 simulated triangles per mesh, and were all within the viewport when the data was captured. All data was captured on an Intel® Core™ i7-7820HK processor.

 CPU Usage

Figure 8: Different CPU usages between versions of Unreal Engine 4, based on number of cloth systems in the scene.

 frames per second

Figure 9: Difference in frames per second between versions of Unreal Engine 4 based on number of cloth systems in the scene.

Enhanced CPU Particles

Particle systems have been used in computer graphics and video games since the very early days. They’re useful because motion is a central facet of real life, so modeling particles to create explosions, fireballs, cloud systems, and other events is crucial to developing full immersion.

High-quality features available to CPU particles include the following:

  • Light emission
  • Material parameter control
  • Attractor modules

Particles on multicore systems can be enhanced by using CPU systems in tandem with GPU ones. Such a system easily scales—developers can keep adding to the CPU workload until they run out of headroom. Engineers have found that pairing CPU particles with GPU particles can improve realism by adding light casting, allowing light to bounce off objects they run into. Each system has inherent limitations, so pairing them results in a system greater than the sum of their parts.

CPU particles emitting light

Figure 10: CPU Particles can easily scale based on available headroom.

Intel® VTune™ Amplifier Support

The Intel VTune Amplifier is an industry-standard tool to determine thread bottlenecks, sync points, and CPU hotspots. In UE 4.19, support for Intel VTune Amplifier ITT markers was added to the engine. This allows users to generate annotated CPU traces that give deep insight into what the engine is doing at all times.

ITT APIs have the following features:

  • Control application performance overhead based on the amount of traces that you collect.
  • Enable trace collection without recompiling your application.
  • Support applications in C/C++ and Fortran environments.
  • Support instrumentation for tracing application code.

Users can take advantage of this new functionality by launching Intel VTune Amplifier and running a UE workload through the UI with the -VTune switch. Once inside the workload, simply type Stat Namedevents on the console to begin outputting the ITT markers to the trace.

Intel VTune Amplifier trace in Unreal Engine 4.19

Figure 11: Example of annotated Intel VTune Amplifier trace in Unreal Engine 4.19.
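For developers who want to add their own annotations on top of the markers the engine emits, the ITT API can also be called directly from application code. The snippet below is a generic illustration of the instrumentation and tracing technology (ITT) API from ittnotify.h rather than the UE 4.19 integration itself; the domain and task names are placeholders. Link against the ittnotify library that ships with Intel VTune Amplifier to resolve these calls.

// Generic ITT instrumentation sketch (not engine code): mark a region of work
// so it appears as a named task on the Intel VTune Amplifier timeline.
#include <ittnotify.h>

static __itt_domain* GTraceDomain = __itt_domain_create("MyGame.Simulation");
static __itt_string_handle* GClothTask = __itt_string_handle_create("ClothUpdate");

void UpdateClothSystems()
{
    __itt_task_begin(GTraceDomain, __itt_null, __itt_null, GClothTask);

    // ... per-frame cloth simulation work goes here ...

    __itt_task_end(GTraceDomain);
}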

Conclusion

Improvements involved solving technical challenges at every layer: the engine, middleware, the game editor, and the game itself. Rather than working on a title-by-title basis, engine improvements benefit the whole Unreal developer ecosystem. The advances in 4.19 address CPU workload challenges throughout the ecosystem in the following areas:

  • More realistic destruction, thanks to more breakpoints per object.
  • More particles, leading to better animated objects such as vegetation, cloth, and dust particles.
  • More realistic background characters.
  • More cloth systems.
  • Improved particles (for example, particles that physically interact with characters, NPCs, and the environment).

As more end users migrate to powerful multicore systems, Intel plans to pursue a roadmap that will continue to take advantage of higher core counts. Any thread-bound systems or bottlenecked operations are squarely in the team’s crosshairs. Developers should be sure to download the latest version of the UE, engage at the Intel Developer Zone, and see for themselves.

Further Resources

Unreal* Engine 4 Optimization Guide

CPU Optimizations for Cloth Simulations

Setting up Destructive Meshes

CPU Scaling Sample

Intel® Graphics Performance Analyzers (Intel® GPA) 2018 R1 Release Notes


Thank you for choosing the Intel® Graphics Performance Analyzers (Intel® GPA), available as a standalone product and as part of Intel® System Studio.

Contents

Introduction
What's New
System Requirements and Supported Platforms
Installation Notes
Technical Support and Troubleshooting
Known Issues and Limitations
Legal Information

Introduction

Intel® GPA provides tools for graphics analysis and optimizations for making games and other graphics-intensive applications run even faster. The tools support platforms based on the latest generations of Intel® Core™ and Intel® Atom™ processor families, for applications developed for Windows*, Android*, Ubuntu*, or macOS*.

Intel® GPA provides a common and integrated user interface for collecting performance data. Using it, you can quickly see performance opportunities in your application, saving time and getting products to market faster.

For detailed information and assistance in using the product, refer to the following online resources:

  • Home Page - view detailed information about the tool, including links to training and support resources, as well as videos on the product to help you get started quickly.
  • Getting Started - get the main features overview and learn how to start using the tools on different host systems.
  • Training and Documentation - learn at your level with Getting Started guides, videos and tutorials.
  • Online Help for Windows* Host - get details on how to analyze Windows* and Android* applications from a Windows* system.
  • Online Help for macOS* Host - get details on how to analyze Android* or macOS* applications from a macOS* system.
  • Online Help for Ubuntu* Host - get details on how to analyze Android* or Ubuntu* applications from an Ubuntu* system.
  • Support Forum - report issues and get help with using Intel® GPA.

What's New

Intel® GPA 2018 R1 offers the following new features:

New Features for Analyzing All Graphics APIs

Graphics Frame Analyzer

  • API Log pane now contains a new Frame Statistic tab, and separate tabs for Resource History and Pixel History. The Resource History tab enables you to select a target resource, and in the Pixel History tab you can select pixel coordinates. 
  • API Log and Metrics can be exported now.
  • Input/Output Geometry viewer now provides additional information about the topology, primitive count, and bounding box.
  • Frame Overview pane shows full-frame FPS along with a GPU duration time.
  • Information about systems where a frame is captured and replayed is shown.

New Features for Analyzing Microsoft DirectX* Applications

Graphics Monitor

  • New User Interface is now available on Windows*.
  • Remote profiling of DirectX* 9 or DirectX* 10 frames is discontinued.

Graphics Frame Analyzer

  • New User Interface for DirectX* 11 frames. The following Legacy User Interface features are transferred to the new interface:
    • Render Target overdraw view
    • Shader replacement experiment allowing the user to import the HLSL shader code and view performance impacts on the entire frame
  • Default layout of D3D Buffers is now based on a specific buffer usage in a frame.
  • Samples count is shown as a parameter for 2D Multisample Textures or 2D Multisample Texture Arrays.
  • API call arguments, including structures, arrays, and enums, are correctly shown for DirectX* 11 frames.
  • API Log contains calls from the D3D11DeviceContext interface only.
  • List of bound shader resources (input elements, SRVs, UAVs, CBVs, Sampler, RTVs, DSV) is shown along with a shader code.
  • Target GPU adapter can be selected on multi-GPU machines for DirectX* 11 and DirectX* 12 frames.
  • Intel Gen Graphics Intermediate Shader Assembly (ISA) code is added for DirectX* 11 frames.
  • Input-Assembly layout is shown for DirectX* 11 and DirectX* 12 frames in the Geometry viewer.

New Features for Analyzing macOS Metal* Applications

Multi-Frame Analyzer

  • Ability to export the Metal source or LLVM disassembly codes for a selected shader.
  • Shader replacement experiment allowing the user to import a modified shader and view the performance impacts on the entire frame.

Many defect fixes and stability improvements

Known Issues

  • Full Intel GPA metrics are not supported on macOS* 10.13.4 for Skylake-based and Kaby Lake-based Mac Pro systems.  For full metric support, please do not upgrade to macOS* 10.13.4.
  • Metrics in the System Analyzer's system view are inaccurate for Intel® Graphics Driver for Windows* Version 15.65.4.4944. You can use Intel® Graphics Driver for Windows* Version 15.60.2.4901 instead.

System Requirements and Supported Platforms

The minimum system requirements are: 

  • Host Processor: Intel® Core™ Processor
  • Target Processor: See the list of supported Windows* and Android* devices below
  • System Memory: 8GB RAM
  • Video Memory: 512MB RAM
  • Minimum display resolution for client system: 1280x1024
  • Disk Space: 300MB for minimal product installation

Direct installation of Intel® GPA on 32-bit Windows* systems is not supported. However, if you need to analyze an application on a 32-bit Windows* target system, you can use the following workaround:

  1. Copy the 32-bit *.msi installer distributed with the 64-bit installation from your analysis system to the target system.
  2. Run the installer on the target system to install System Analyzer and Graphics Monitor.
  3. Start the Graphics Monitor and the target application on the 32-bit system and connect to it from the 64-bit host system.

For details, see the Running System Analyzer on a Windows* 32-bit System article.

The table below shows the platforms and applications supported by Intel® GPA 2018 R1. Each entry lists the target system (the system where your game runs), the host system (your development system where you run the analysis), and the supported target applications (types of applications running on the target system).

  • Target: Windows* 7 SP1/8/8.1/10; Host: Windows* 7 SP1/8/8.1/10; Target applications: Microsoft* DirectX* 9/9Ex, 10.0/10.1, 11.0/11.1/11.2/11.3
  • Target: Windows* 10; Host: Windows* 10; Target applications: Microsoft* DirectX* 12, 12.1
  • Target: Google* Android* 4.1, 4.2, 4.3, 4.4, 5.x, 6.0 (the specific version depends on the officially released OS for commercial versions of Android* phones and tablets; see the list of supported devices below); Host: Windows* 7 SP1/8/8.1/10, macOS* 10.11/10.12, or Ubuntu* 16.04; Target applications: OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, 3.2
    NOTE: Graphics Frame Analyzer does not currently support GPU metrics for the Intel® processor code-named Clover Trail+.
  • Target: Ubuntu* 16.04; Host: Ubuntu* 16.04; Target applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile
  • Target: macOS* 10.12 and 10.13; Host: macOS* 10.12 and 10.13; Target applications: OpenGL* 3.2, 3.3, 4.0, 4.1 Core Profile, and Metal* 1 and 2

Intel® GPA does not support the following Windows* configurations: All server editions, Windows* 8 RT, or Windows* 7 starter kit.

Supported Windows* Graphics Devices

Intel® GPA supports the following graphics devices as targets for analyzing Windows* workloads. All these targets have enhanced metric support:

Target graphics device and processor:

  • Intel® UHD Graphics 630: 8th generation Intel® Core™ processor
  • Intel® UHD Graphics 630: 7th generation Intel® Core™ processor
  • Intel® UHD Graphics 620: 7th generation Intel® Core™ processor
  • Intel® HD Graphics 620: 7th generation Intel® Core™ processor
  • Intel® HD Graphics 615: 7th generation Intel® Core™ m processor
  • Intel® HD Graphics 530: 6th generation Intel® Core™ processor
  • Intel® HD Graphics 515: 6th generation Intel® Core™ m processor
  • Iris® graphics 6100: 5th generation Intel® Core™ processor
  • Intel® HD Graphics 5500 and 6000: 5th generation Intel® Core™ processor
  • Intel® HD Graphics 5300: 5th generation Intel® Core™ m processor family
  • Iris® Pro graphics 5200: 4th generation Intel® Core™ processor
  • Iris® graphics 5100: 4th generation Intel® Core™ processor
  • Intel® HD Graphics 4200, 4400, 4600, and 5000: 4th generation Intel® Core™ processor
  • Intel® HD Graphics 2500 and 4000: 3rd generation Intel® Core™ processor
  • Intel® HD Graphics: Intel® Celeron® processor N3000, N3050, and N3150; Intel® Pentium® processor N3700

Although the tools may appear to work with other graphics devices, these devices are unsupported. Some features and metrics may not be available on unsupported platforms. If you run into an issue when using the tools with any supported configuration, please report it through the Support Forum.

Driver Requirements for Intel® HD Graphics

When running Intel® GPA on platforms with supported Intel® HD Graphics, the tools require the latest graphics drivers for proper operation. You may download and install the latest graphics drivers from http://downloadcenter.intel.com/.

Intel® GPA inspects your current driver version and notifies you if your driver is out-of-date.

Supported Devices Based on Intel® Atom™ Processor

Intel® GPA supports the following devices based on Intel® Atom™ processor:

Supported devices (processor model, GPU, Android* version, and supported tools):

  • Intel® Atom™ Z35XX; GPU: Imagination Technologies* PowerVR G6430; Android* 4.4 (KitKat), Android* 5.x (Lollipop); Supported tools: System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ Z36XXX/Z37XXX; GPU: Intel® HD Graphics; Android* 4.2.2 (Jelly Bean MR1), Android* 4.4 (KitKat), Android* 5.x (Lollipop); Supported tools: System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ Z25XX; GPU: Imagination Technologies* PowerVR SGX544MP2; Android* 4.2.2 (Jelly Bean MR1), Android* 4.4 (KitKat); Supported tools: System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]
  • Intel® Atom™ x7-Z8700, x5-Z8500, and x5-Z8300; GPU: Intel® HD Graphics; Android* 5.x (Lollipop), Android* 6.0 (Marshmallow); Supported tools: System Analyzer, Graphics Frame Analyzer, Trace Analyzer [Beta]

Supported ARM*-Based Devices

The following devices are supported with Intel® GPA:

Supported devices (model, GPU, and Android* version):

  • Samsung* Galaxy S5; GPU: Qualcomm* Adreno 330; Android* 5.0
  • Samsung* Galaxy Nexus (GT-i9500); GPU: Imagination Technologies* PowerVR SGX544; Android* 4.4
  • Samsung* Galaxy S4 Mini (GT-I9190); GPU: Qualcomm* Adreno 305; Android* 4.4
  • Samsung* Galaxy S III (GT-i9300); GPU: ARM* Mali 400MP; Android* 4.3
  • Google* Nexus 5; GPU: Qualcomm* Adreno 330; Android* 5.1
  • NVIDIA* Shield tablet; GPU: NVIDIA* Tegra* K1 processor; Android* 5.1

Your system configuration should satisfy the following requirements:

  • Your ARM*-based device is running Android* 4.1, 4.2, 4.3, 4.4, 5.0, 5.1, or 6.0
  • Your Android* application uses OpenGL* ES 1.0, 1.1, 2.0, 3.0, 3.1, or 3.2
  • Regardless of your ARM* system type, your application must be 32-bit

For support level details for ARM*-based devices, see this article.

Installation Notes

Installing Intel® GPA 

Download the Intel® GPA installer from the Intel® GPA Home Page.

Installing Intel® GPA on Windows* Target and Host Systems

To install the tools on Windows*, download the *.msi package from the Intel® GPA Home Page and run the installer file.

The following prerequisites should be installed before you run the installer:

  • Microsoft DirectX* Runtime June 2010
  • Microsoft .NET 4.0 (via redirection to an external web site for download and installation)

If you use the product in a host/target configuration, install Intel® GPA on both systems. For more information on the host/target configuration, refer to Best Practices.

For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on Ubuntu* Host System

To install Intel® GPA on Ubuntu*, download the .tar package, extract the files, and run the .deb installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device since the tools automatically install the necessary files on the target device when you run System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Installing Intel® GPA on macOS* Host System

To install the tools on macOS*, download the .zip package, unzip the files, and run the .pkg installer.

It is not necessary to explicitly install Intel® GPA on the Android* target device because the tools automatically install the necessary files on the target device when you run the System Analyzer. For details on how to set up an Android* device for analysis with Intel® GPA, see Configuring Target and Analysis Systems.

Technical Support and Troubleshooting

For technical support, including answers to questions not addressed in the installed product, visit the Support Forum.

Troubleshooting Android* Connection Problems

If the target device does not appear when the adb devices command is executed on the client system, do the following:

  1. Disconnect the device
  2. Execute $ adb kill-server
  3. Reconnect the device
  4. Run $ adb devices

If these steps do not work, try restarting the system and running $ adb devices again. Consult the product documentation for your device to see whether a custom USB driver needs to be installed.

Known Issues and Limitations

General

  • Your system must be connected to the internet while you are installing Intel® GPA.
  • Selecting all ergs might cause significant memory usage in Graphics Frame Analyzer.
  • Intel® GPA uses sophisticated techniques for analyzing graphics performance which may conflict with third-party performance analyzers. Therefore, ensure that other performance analyzers are disabled prior to running any of these tools. For third-party graphics, consult the vendor's website.
  • Intel® GPA does not support use of Remote Desktop Connection.
  • Graphics Frame Analyzer (DirectX* 9,10,11) runs best on systems with a minimum of 4GB of physical memory. Additionally, consider running the Graphics Frame Analyzer (DirectX* 9,10,11) in a networked configuration (the server is your target graphics device, and the client running the Graphics Frame Analyzer is a 64-bit OS with at least 8GB of memory).
  • On 64-bit operating systems with less than 8GB of memory, warning messages, parse errors, very long load times, or other issues may occur when loading a large or complex frame capture file.

Analyzing Android* Workloads

  • Graphics Frame Analyzer does not currently support viewing every available OpenGL/OpenGL ES* texture format.
  • Intel® GPA provides limited support for analyzing browser workloads on Android*. You can view metrics in the System Analyzer, but the tools do not support creating or viewing frame capture files or trace capture files for browser workloads. Attempting to create or view these files may result in incorrect results or program crashes.
  • Intel® GPA may fail to analyze OpenGL* multi-context games.

Analyzing Windows* Workloads

  • The Texture 2x2 experiment might work incorrectly for some DirectX* 12 workloads.
  • Intel® GPA may show offsets used in DirectX* 12 API call parameters in scientific format.
  • Render Target visualization experiments “Highlight” and “Hide” are applied to all Draw calls in a frame. As a result, some objects may disappear and/or be highlighted incorrectly.
  • Frame Analyzer may crash if the ScissorRect experiment is deselected. The application will go back to Frame File open view.
  • Downgrade from 17.2 to 17.1 might not be successful.
  • The Overdraw experiment for Render Targets with 16-bit and 32-bit Alpha channel is not currently supported.
  • To view Render Targets with 16-bit and 32-bit Alpha channel, you should disable Alpha channel in the Render Targets viewer.
  • To ensure accurate measurements on platforms based on Intel® HD Graphics, profile your application in the full-screen mode. If windowed mode is required, make sure only your application is running. Intel® GPA does not support profiling multiple applications simultaneously.
  • For best results when analyzing frame or trace capture files on the same system where you run your game, follow these steps:
    • Run your game and capture a frame or trace file.
    • Shut down your game and other non-essential applications.
    • Launch the Intel® GPA.
  • To run Intel® GPA on hybrid graphics solutions (a combination of Intel® Processor Graphics and third-party discrete graphics), you must first disable one of the graphics solutions.
  • Secure Boot, also known as Trusted Boot, is a security feature in Windows* 8 enabled in BIOS settings which can cause unpredictable behavior when the "Auto-detect launched applications" option is enabled in Graphics Monitor Preferences. Disable Secure Boot in the BIOS to use the auto-detection feature for analyzing application performance with Intel® GPA. The current version of the tools can now detect Secure Boot, and warns you of this situation.
  • To view the full metric set with the tools for Intel® Processor Graphics on systems with one or more third-party graphics device(s) and platforms based on Intel® HD Graphics, ensure that Intel is the preferred graphics processor. You can set this in the Control Panel application for the third-party hardware. Applications running under Graphics Monitor and a third-party device show GPU metrics on DirectX* 9 as initialized to 0 and on DirectX* 10/11 as unavailable.
  • When using the Intel® GPA, disable the screen saver and power management features on the target system running the Graphics Monitor — the Screen Saver interferes with the quality of the metrics data being collected. In addition, if the target system is locked (which may happen when a Screen Saver starts), the connection from the host system to the target system will be terminated.
  • Intel® GPA does not support frame capture or analysis for:
    • applications that execute on the Debug D3D runtime system
    • applications that use the Reference D3D Device
  • System Analyzer HUD may not operate properly when applications use copy-protection, anti-debugging mechanisms, or non-standard encrypted launching schemes.
  • Intel® GPA provides analysis functionality by inserting itself between your application and Microsoft DirectX*. Therefore, the tools may not work correctly with certain applications which themselves hook or intercept DirectX* APIs or interfaces.
  • Intel® GPA does not support Universal Windows Platform applications where the graphics API uses compositing techniques such as HTML5 or XAML interop. Only traditional DirectX* rendering is supported. To work around this limitation, port your application as a Desktop application, and then use the full Intel® GPA suite of tools.
  • In some cases, the Overview tab in Graphics Frame Analyzer (DirectX* 9,10,11) can present GPU Duration values higher than Frame Duration values measured during game run time. This could be a result of Graphics Frame Analyzer (DirectX* 9,10,11) playing the captured frame back in off-screen mode which can be slower than on-screen rendering done in the game.

    To make playback run on-screen, use this registry setting on the target system: HKEY_CURRENT_USER\Software\Intel\GPA\16.4\ForceOnScreenPlaybackForRemoteFA = 1 and connect to the target with Graphics Frame Analyzer (DirectX* 9,10,11) running on a separate host. If these requirements are met, the playback runs in on-screen mode on the target. If the frame was captured from the full-screen game, but playback renders it in a windowed mode, try pressing Alt+Enter on the target to switch playback to full-screen mode.

  • Frame capture using Graphics Monitor runs best on 64-bit operating systems with a minimum of 4GB of physical memory.
    On 32-bit operating systems (or 64-bit operating systems with <4GB of memory), out of memory or capture failed messages can occur.
  • Scenes that re-create resource views during multi-threaded rendering have limited support in the current Intel® GPA version, and might have issues with frame replays in Graphics Frame Analyzer.

*Other names and brands may be claimed as the property of others.

** Disclaimer: Intel disclaims all liability regarding rooting of devices. Users should consult the applicable laws and regulations and proceed with caution. Rooting may or may not void any warranty applicable to your devices.

Character Modeling


modeled character

Introduction

Character modeling is the process of creating a character within the 3D space of computer programs. The techniques for character modeling are essential for third- and first-person experiences within film, animation, games, and VR training programs. In this article, I explain how to design with intent, how to make a design model-ready, and the process of creating your model. In later lessons, we will continue to finish the model using retopologizing techniques.

characters within the 3D space

Design and Drawing

The first step to designing a character is to understand its purpose in the application or scene. For example, if the character is being created for a first-person training program, you may only need to model a pair of floating hands.

Additionally, for film, games, and VR, the character's design is key. The design must fit into the world and also visually convey the character's personality. If they have big, wide eyes, they're probably cartoony and cute. If they wear one sock higher than the other, they might be quirky or stressed. Let the design tell a story about what kind of person the character is.

Below I've provided a sample design of the character I will be modeling throughout the article. With it, I've provided a breakdown that explains how his design affects your perception of his character.

Simple Design Breakdown

  • Round shapes indicate that the character is nice and friendly; you want the audience to like this character.
  • Big eyes show youth and make the character cute; also very expressive.
  • Details like the propeller hat and the striped shirt indicate that he's fun and silly.

Model-Ready: Static vs Animated

design of the character model

Once you have a design, it's important to distinguish whether your character is static or animated. This will determine how you go about creating the blueprints for your character model. These blueprints are called orthographic drawings: front, side, and top drawings of your model. You may see these types of drawings used for 2D animation or concept art. However, orthographic drawings for 3D character models are different. Below I explain the different requirements for static and animated orthographic drawings.

Animated

An animated model must be set up properly for rigging. The following requirements are necessary for a character to be bound to a rig:

  • The drawings must be done in a T pose or A pose
  • They must have a slight bend at the knees and arms
  • Fingers and legs must be spread apart
  • They must have a blank expression

Skipping any of these steps will make it difficult to achieve clean results with rigging and animation. I've provided some example orthographic, T-pose drawings of the character I will be modeling.

T-pose drawings of the character

Static

A static model, like a statue or an action figure, holds the same pose permanently. Therefore, it doesn't need a rig; it just needs to be modeled in the pose and expression the design calls for. The only requirement for your orthographic views is that the drawings must represent the pose and expression of the finished character model from all orthographic angles.

static model

Notice that for both the animated and static drawings, the side and front views of the body line up with one another. This is important to ensure that the model will be proportionally correct when these blueprints guide you through the modeling process. To continue, save each orthographic view as its own .jpg or .png file.

orthographic view
Orthographic View

Now you're ready to continue on to the modeling section! Since the head tends to be the most difficult part, I've chosen to focus on it for the majority of the section. However, once you understand how to model the head, creating the rest of the character should come easily. The same techniques apply, and I will continue to guide you with step-by-step processes and images.

Modeling

Setup

Now that your orthographic drawings are done you can bring them into your 3D program of choice. To do so, you'll bring them in as image planes. As you can see by the images below, the drawings on the image planes line up accordingly with one another. This is essential. A little bit of difference is okay, but if they're far off, the image planes can warp the proportions of your character model. Once you have your planes in place, we are officially ready to begin modeling.

image planes

Tips Before You Start

The three keys to character modeling are symmetry, simplicity, and toggling.

  • Symmetry: Throughout each piece of the body, it's important for us to have symmetry to maintain proper functionality for animation.
  • Simplicity: Never start with a dense mesh. Starting with a low polygon count will allow you to easily shape the mesh. For instance, in the video I start with a cube that has three subdivisions across the depth, width, and height.
  • Toggling: It's important to toggle mesh-smoothing on and off. Often, messy geometry will appear clean while the mesh is smoothed. 

I do all three of these processes throughout the head modeling video. Watching it will help you understand how these techniques fit into the workflow. Now let's get started!

Modeling The Head

For modeling the head, we are going to go through four stages. These stages will apply to creating the head and the rest of the body.

  1. Low-Poly Stage: shape a low-poly primitive object (a cube, for instance) into the piece of the body you are creating.
  2. Pre-Planning Stage: increase the polygon count and continue to shape the mesh.
  3. Planning Stage: plan a space for the details, such as the facial features on a head model.
  4. Refinement Stage: tweak and add topology as you see fit so you are able to match your design.

the four stages for modeling the head

Stages one and two

To begin, I'm going to shape a low-poly cube into my character's head. As you watch the video, you'll notice that I use the Translate tool to shape the head, as well as the Insert Edgeloop tool and the Smooth button for further detailing. I like using Insert Edgeloop when I need more topology in a particular area. The Smooth button, on the other hand, helps when I want to increase the topology of the entire mesh while maintaining the smooth-mesh shape.

Stage three and four

Now that there's more topology, we can begin planning for the eyes, nose, and mouth. You'll be using your orthographic drawings to guide you on the placement for each of these. Again, it's important to follow along with the video so that you can see the process. From here, the steps that follow are:

  1. Plan/shape the vertices of your mesh for the facial features.
  2. Extrude to build a space for the eye sockets, mouth and nose.
  3. Continue to form shape without adding more topology.
  4. Slowly add or extrude polygons.
  5. Use sculpt tools or soft-selection to match the mesh and orthographic drawings as best as possible.
  6. Repeat steps three and four a few times until your topology matches your drawings.

planning the eyes, nose, and mouth

During this stage, I like using the Edgeslide tool, so that when I translate the vertices the head shape will not be altered. Next, you can move onto the refinement stage. After you've finished refining the facial features, you can begin to model the eyes.

Modeling the Eyes

The next step is to make eyeballs that fit inside the head, and for the sockets to fit around them. The process is as follows:

  1. Make a sphere.
  2. Move and uniformly scale the sphere to fit roughly inside the socket.
  3. Rotate the sphere 90 degrees so that the pole is facing outward.
  4. Adjust the socket as necessary so that it rests on the eyeball.
  5. Shape the iris, pupil, and cornea as demonstrated below.
  6. Select the new group and scale the group -1 across the x axis.

Follow the steps as guided with the images below.

modeling the eyes
Figure 1. Uniformly scale, and translate a sphere to roughly fit inside the socket.
Then rotate the sphere 90 degrees so that the sphere's pole is facing outward.
Figure 2. Adjust the socket to fit around the eye.

Duplicate the eye. The "eye" mesh we do not edit will be the cornea.

creating the iris and pupil
Figure 3. Pick a sphere, select the edges as shown, and scale.
Figure 4. Translate the edges back to fit inside the cornea. Now we've created the iris.
Figure 5. Select the inner faces, and extrude inward to create the pupil.

Group the eye pieces. Rename the group and then proceed to duplicate it.

group and scale eye piece
Figure 6. Select the new group and scale the group -1 across the x axis.

Now, the only things missing from the head are the eyelids, ears, and neck. However, we won't be doing those until we finish retopologizing our model. As for the hair and eyebrows, I typically like to create low-poly simple shapes.

Patching a Mistake

If it's your first time making a model, it's possible you ran into several complications throughout this process. Below I've provided some possible problems with their corresponding solutions.

  • My symmetry tool isn't working properly.
    • This is an indication of asymmetry. Go through the following steps to troubleshoot the problem.
    1. Check for and delete extra vertices and faces.
    2. Make a duplicate and hide or move the original. Next, you'll need to delete half of the duplicate's faces, ensure the vertices that cut down the middle of the mesh are in line with the axis of symmetry, and then use the mirror tool across the axis of symmetry.
    3. Delete the object's history.
  • My mesh is asymmetrical.
    • This sometimes happens when you move vertices after forgetting to turn symmetry back on.
    1. Make a duplicate and hide or move the original. Next, you'll need to delete half of the duplicate's faces, ensure the vertices that cut down the middle of the mesh are in line with the axis of symmetry, and then use mirror tool across the axis of symmetry.
  • I can't get my character's eyes to fit inside both socket and head.
    • This is likely the case for eyes that have an oval shape or are spread really far apart. For these instances you'll probably need to use a lattice deformer on your geometry. Animating a texture map is also a possible solution.
  • When I group and mirror the mesh, it doesn't mirror.

Arms and Edgeloop Placement

Next, I'm going to make another complex piece of geometry: the arm. Before I explain my process, it's important to understand the importance of edgeloop placement. Edgeloops not only allow you to add topology, but also allow the mesh to bend when it's rigged. At least three edgeloops are needed at joints such as the knuckles, elbows, shoulders, and knees.

Also, remember how we drew a slight bend in the character's arm? You'll need to model that bend. When the character is rigged, the joints will be placed along that bend; this helps the IK joints figure out which way to bend. However, if the joints are placed in a straight line, the joints could bend backwards, giving your character a broken arm or leg.

Modeling The Arms

My process for modeling the arms starts with the fingers and works backwards. I find that doing it this way makes the end mesh cleaner. Following this order, I've simplified the process into four stages:

  1. Finger Stage: model all fingers and thumb.
  2. Palm Stage: model the palm.
  3. Attaching Stage: attach the fingers and thumb to the hand.
  4. Arm Stage: extrude and shape the arm.

Now that you have a basic understanding of our goal, here are the detailed steps with images to show the process.

Stage one

add edge loops to model fingers

Figure 1. Make a low-poly cube to model a finger, toggle views to match your drawings.

add edge loops to model fingers

Figure 2. Add edgeloops at the knuckles, and refine.

duplicate and adjust mesh across fingers

Figure 3. Duplicate, tweak, and translate the finger model to create the other fingers.

model the thumb

Figure 4. Model the thumb from a low-poly cube. Refine the thumb. Toggle views to check their placement; then combine the fingers and thumb into one mesh.

Stage two

steps to create and shape the palm
Figure 6. Create a cube with the proper number of subdivisions to attach the fingers.
Figure 7. Delete every other edgeloop (for simplicity) and shape the palm.

Note: Doing it this way makes it easy to push the shape of the palm at a lower subdivision, and it ensures that there will be enough geometry to attach the fingers when we increase the topology.

Stage three

steps to shape the palm
Figure 8. Add back in the palm's topology.
Figure 9. Combine the palm and finger mesh.
Figure 10. Attach the fingers.

Note: I prefer using the Target Weld tool to attach the palm to the fingers.

usage of Target Weld tool
Figure 11. Clean the geometry.

Stage four

extrude the arm
Figure 12. Extrude the arm.

extrude the arm
Figure 13. Add edgeloops, and ensure that the mesh is hollow.

Once the arm is made, we can duplicate it onto the other side, like we did for the eyeballs. Here's a refresher of the steps:

  1. Duplicate and group the arm.
  2. Scale the group -1 across the x axis and ungroup the arm.

Modeling The Body

At this point, you've learned most of the techniques needed to finish your character model! The rest of the body follows steps similar to those we took to model the head and arms. If you follow along with the video and follow these steps, you'll be in good shape.

  1. Ask yourself, "What primitive mesh will work best for each one?"
    • For example: a cylinder works great for pant legs, but a cube could work better for a shoe.
  2. Create a vague plan.
    • For example: "I'm going to use the cylinder to create one pant leg, finish the left side of the pants, then use the mirror tool to finish the model."
  3. Move, scale and edit the low-poly, primitive mesh to match with the orthographic drawings.
  4. Slowly add or extrude polygons.
  5. Use sculpt tools or soft-selection to match the mesh and orthographic drawings as best as possible.
  6. Repeat steps four and five a few times until your topology matches your drawings.
  7. Mirror your model if needed!

Below are some example images I've provided for each of the remaining parts of the body.

shirt
Shirt

shorts
Shorts

legs
Leg(s)

shoes
Shoe(s)

Great! Your model is complete. However, before moving on to retopologizing, double-check the following:

  • Is your model symmetrical?
  • Do your knees and arms have a bend? (only applies if character will be rigged)
  • Have you modeled everything for this character?
  • Does your character relatively match your drawings?

If none of these questions raise concerns, then you are ready to move on to the following article on character retopology.

Resources

Thank you for continuing with me throughout this article-video combination. Since it's best to learn from multiple sources, I've listed below some resources that have helped me and my peers.

Helpful YouTube* Channels for Everything Related to 3D:

  • Maya* Learning Channel
  • Pixologic Learning Channel
  • Blender* Guru
  • Blender
  • James Taylor (MethodJTV*)

Other Character Modeling Resources:

  • Lynda.com*
  • Pluralsight*
  • AnimSchool*

Other Rigging Resources:

  • Rapid Rig* and Mixamo* (auto-rigging)
  • Pluralsight*
  • AnimSchool

Unreal Engine 4 Parallel Processing School of Fish


Nikolay Lazarev

Integrated Computer Solutions, Inc.

General Description of the Flocking Algorithm

The implemented flocking algorithm simulates the behavior of a school, or flock, of fish. The algorithm contains four basic behaviors:

  • Cohesion: Fish search for their neighbors in a radius defined as the Radius of Cohesion. The current positions of all neighbors are summed. The result is divided by the number of neighbors. Thus, the center of mass of the neighbors is obtained. This is the point to which the fish strive for cohesion. To determine the direction of movement of the fish, the current position of the fish is subtracted from the result obtained earlier, and then the resulting vector is normalized.
  • Separation: Fish search for their neighbors in a radius defined as the Separation Radius. To calculate the motion vector that steers an individual fish away from the school, the differences between the neighbors' positions and the fish's own position are summed. The result is divided by the number of neighbors, then multiplied by -1 and normalized so that the fish swims in the direction opposite its neighbors.
  • Alignment: Fish search for their neighbors in a radius defined as the Radius of Alignment. The current speeds of all neighbors are summed, then divided by the number of neighbors. The resulting vector is normalized.
  • Reversal: All of the fish can only swim in a given space, the boundaries of which can be specified. The moment a fish crosses a boundary must be identified. If a fish hits a boundary, then the direction of the fish is changed to the opposite vector (thereby keeping the fish within the defined space).

These four basic principles of behavior for each fish in a school are combined to calculate the total position, speed, and acceleration of each fish. In the proposed algorithm, weight coefficients were introduced to increase or decrease the influence of each of the three steering behaviors (cohesion, separation, and alignment). No weight coefficient was applied to the reversal behavior, because fish are never permitted to swim outside of the defined boundaries; reversal therefore has the highest priority. The algorithm also enforces limits on maximum speed and acceleration.

According to the algorithm described above, the parameters of each fish (position, velocity, and acceleration) are recalculated for each frame.

Source Code of the Flocking Algorithm with Comments

To calculate the state of fish in a school, double buffering is used. Fish states are stored in an array of size N x 2, where N is the number of fish, and 2 is the number of copies of states.
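
A minimal sketch of this double-buffered storage, using plain C++ stand-ins rather than the project's UE4 types (the Vec3 and FishState declarations below are illustrative, mirroring the fields of the shader's TInfo structure shown later), could look like this:

     #include <array>
     #include <utility>
     #include <vector>

     // Plain stand-in for the engine's 3D vector type.
     struct Vec3 { float x = 0.f, y = 0.f, z = 0.f; };

     // One fish state: instance id plus the three vectors updated every frame.
     struct FishState {
         int  instanceId = 0;
         Vec3 position, velocity, acceleration;
     };

     // agents[i][0] and agents[i][1] are the two copies of fish i's state (N x 2).
     std::vector<std::array<FishState, 2>> agents;

     int currentStatesIndex  = 0;  // copy being written this frame
     int previousStatesIndex = 1;  // copy read from the previous frame

     // Called once per frame, after all new states have been computed.
     void swapFishStatesIndexes() {
         std::swap(currentStatesIndex, previousStatesIndex);
     }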

The algorithm is implemented using two nested loops. The inner loop calculates the direction vectors for the three types of behavior (cohesion, separation, and alignment). The outer loop performs the final calculation of the new state of each fish, based on the vectors computed in the inner loop, the weight coefficients of each type of behavior, and the maximum values of speed and acceleration.

Outer loop: at each iteration, a new state for one fish is calculated. The lambda function receives the following arguments:

  • agents: Array of fish states
  • currentStatesIndex: Index of the array where the current states of each fish are stored
  • previousStatesIndex: Index of the array where the previous states of each fish are stored
  • kCoh: Weighting factor for cohesion behavior
  • kSep: Weighting factor for separation behavior
  • kAlign: Weighting factor for alignment behavior
  • rCohesion: Radius in which neighbors are sought for cohesion
  • rSeparation: Radius in which neighbors are sought for separation
  • rAlignment: Radius in which neighbors are sought for alignment
  • maxAccel: Maximum acceleration of fish
  • maxVel: Maximum speed of fish
  • mapSz: Boundaries of the area in which fish are allowed to move
  • DeltaTime: Elapsed time since the last calculation
  • isSingleThread: Parameter that indicates in which mode the loop will run

ParallelFor can be used in either of two modes, depending on the state of the isSingleThread Boolean variable:

     ParallelFor(cnt, [&agents, currentStatesIndex, previousStatesIndex, kCoh, kSep, kAlign, rCohesion, rSeparation, 
            rAlignment, maxAccel, maxVel, mapSz, DeltaTime, isSingleThread](int32 fishNum) {

Initializing directions with a zero vector to calculate each of the three behaviors:

     FVector cohesion(FVector::ZeroVector), separation(FVector::ZeroVector), alignment(FVector::ZeroVector);

Initializing neighbor counters for each type of behavior:

     int32 cohesionCnt = 0, separationCnt = 0, alignmentCnt = 0;

Internal nested loop. Calculates the direction vectors for the three types of behavior:

     for (int i = 0; i < cnt; i++) {

Each fish should ignore (not calculate) itself:

     if (i != fishNum) {

Calculate the distance between the position of the current fish and the position of each other fish in the array:

     float distance = FVector::Distance(agents[i][previousStatesIndex].position, agents[fishNum][previousStatesIndex].position);

If the distance is less than the cohesion radius:

     if (distance < rCohesion) {

Then the neighbor position is added to the cohesion vector:

     cohesion += agents[i][previousStatesIndex].position;

The value of the neighbor counter is increased:

     cohesionCnt++;
     }

If the distance is less than the separation radius:

     if (distance < rSeparation) {

The difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     separation += agents[i][previousStatesIndex].position - agents[fishNum][previousStatesIndex].position;

The value of the neighbor counter is increased:

     separationCnt++;
     }

If the distance is less than the radius of alignment:

     if (distance < rAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     alignment += agents[i][previousStatesIndex].velocity;

The value of the neighbor counter is increased:

     alignmentCnt++;
                      }
             }

If neighbors were found for cohesion:

     if (cohesionCnt != 0) {

Then the cohesion vector is divided by the number of neighbors and its own position is subtracted:

     cohesion /= cohesionCnt;
     cohesion -= agents[fishNum][previousStatesIndex].position;

The cohesion vector is normalized:

     cohesion.Normalize();
     }

If neighbors were found for separation:

     if (separationCnt != 0) {

The separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

            separation /= separationCnt;
            separation *= -1.f;

The separation vector is normalized:

              separation.Normalize();
     }

If neighbors were found for alignment:

     if (alignmentCnt != 0) {

The alignment vector is divided by the number of neighbors:

            alignment /= alignmentCnt;

The alignment vector is normalized:

            alignment.Normalize();
     }

Based on the weight coefficients of each of the possible types of behavior, a new acceleration vector is determined, limited by the value of the maximum acceleration:

agents[fishNum][currentStatesIndex].acceleration = (cohesion * kCoh + separation * kSep + alignment * kAlign).GetClampedToMaxSize(maxAccel);

The Z component of the acceleration vector is zeroed so that the steering behaviors do not accelerate the fish vertically:

   agents[fishNum][currentStatesIndex].acceleration.Z = 0;

To the previous velocity vector of the fish, the product of the new acceleration vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].velocity += agents[fishNum][currentStatesIndex].acceleration * DeltaTime;

The velocity vector is limited to the maximum value:

     agents[fishNum][currentStatesIndex].velocity =
                 agents[fishNum][currentStatesIndex].velocity.GetClampedToMaxSize(maxVel);

To the previous position of the fish, the product of the new velocity vector and the time elapsed since the last calculation is added:

     agents[fishNum][currentStatesIndex].position += agents[fishNum][currentStatesIndex].velocity * DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

agents[fishNum][currentStatesIndex].velocity = checkMapRange(mapSz,
               agents[fishNum][currentStatesIndex].position, agents[fishNum][currentStatesIndex].velocity);
               }, isSingleThread);
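
The body of checkMapRange(), called at the end of the loop above, is not listed in the article. A minimal sketch consistent with the described behavior, assuming mapSz is an FVector (UE4's 3D vector type) holding the half-extents of the allowed volume, might look like this:

     // Sketch only: if the new position lies outside the allowed volume along an
     // axis, the velocity along that axis is inverted so the fish turns back.
     // mapSz is assumed here to hold the half-extents of the allowed volume.
     FVector checkMapRange(const FVector& mapSz, const FVector& position, FVector velocity)
     {
         if (position.X > mapSz.X || position.X < -mapSz.X) { velocity.X *= -1.f; }
         if (position.Y > mapSz.Y || position.Y < -mapSz.Y) { velocity.Y *= -1.f; }
         if (position.Z > mapSz.Z || position.Z < -mapSz.Z) { velocity.Z *= -1.f; }
         return velocity;
     }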

For each fish, collisions with world-static objects, like underwater rocks, should be detected before the new states are applied:

     for (int i = 0; i < cnt; i++) {

To detect collisions between fish and world-static objects:

            FHitResult hit(ForceInit);
            if (collisionDetected(agents[i][previousStatesIndex].position, agents[i][currentStatesIndex].position, hit)) {

If a collision is detected, then the previously calculated position should be undone. The velocity vector should be changed to the opposite direction and the position recalculated:

                   agents[i][currentStatesIndex].position -= agents[i][currentStatesIndex].velocity * DeltaTime;
                   agents[i][currentStatesIndex].velocity *= -1.0;
                   agents[i][currentStatesIndex].position += agents[i][currentStatesIndex].velocity * DeltaTime;
            }
     }

Having calculated the new states of all fish, we apply the updated states and move each fish to its new position:

     for (int i = 0; i < cnt; i++) {
            FTransform transform;
            m_instancedStaticMeshComponent->GetInstanceTransform(agents[i][0].instanceId, transform);

Set the new position of the fish instance:

     transform.SetLocation(agents[i][0].position);

Turn the fish head forward in the direction of movement:

     FVector direction = agents[i][0].velocity; 
     direction.Normalize();
     transform.SetRotation(FRotationMatrix::MakeFromX(direction).Rotator().Add(0.f, -90.f, 0.f).Quaternion());

Update instance transform:

            m_instancedStaticMeshComponent->UpdateInstanceTransform(agents[i][0].instanceId, transform, false, false);
     }

Redraw all the fish:

     m_instancedStaticMeshComponent->ReleasePerInstanceRenderData();

     m_instancedStaticMeshComponent->MarkRenderStateDirty();

Swap the fish state indexes:

      swapFishStatesIndexes();

Complexity of the Algorithm: How Increasing the Number of Fish Affects Performance

Suppose that the number of fish participating in the algorithm is N. To determine the new state of each fish, the distance to every other fish must be calculated (not counting the additional operations for determining the direction vectors for the three types of behavior). The complexity of the algorithm is therefore O(N²). For example, 1,000 fish require 1,000,000 operations.

Figure 1: Computational operations for calculating the positions of all fish in a scene.

Compute Shader with Comments

Structure describing the state of each fish:

     struct TInfo{
              int instanceId;
              float3 position;
              float3 velocity;
              float3 acceleration;
     };

Function for calculating the distance between two vectors:

     float getDistance(float3 v1, float3 v2) {
              return sqrt((v2[0]-v1[0])*(v2[0]-v1[0]) + (v2[1]-v1[1])*(v2[1]-v1[1]) + (v2[2]-v1[2])*(v2[2]-v1[2]));
     }

     RWStructuredBuffer<TInfo> data;

     [numthreads(1, 128, 1)]
     void VS_test(uint3 ThreadId : SV_DispatchThreadID)
     {

Total number of fish:

     int fishCount = constants.fishCount;

This variable, created and initialized in C++, determines the number of fish calculated in each graphics processing unit (GPU) thread (by default: 1):

     int calculationsPerThread = constants.calculationsPerThread;

Loop for calculating fish states that must be computed in this thread:

     for (int iteration = 0; iteration < calculationsPerThread; iteration++) {

Thread identifier. Corresponds to the fish index in the state array:

     int currentThreadId = calculationsPerThread * ThreadId.y + iteration;

The current index is checked to ensure it does not exceed the total number of fish (this is possible, since more threads can be started than there are fish):

     if (currentThreadId >= fishCount)
            return;
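
The host-side dispatch size is not shown in the article. One sizing consistent with the note above that more threads may be started than there are fish (an assumption, not the article's code) is:

     // Hypothetical host-side sizing of the dispatch (not the article's code).
     int fishCount             = 10000;  // total number of fish
     int calculationsPerThread = 1;      // fish handled by each GPU thread
     int threadsPerGroupY      = 128;    // matches [numthreads(1, 128, 1)] above
     int fishPerGroup          = threadsPerGroupY * calculationsPerThread;
     int groupCountY           = (fishCount + fishPerGroup - 1) / fishPerGroup;  // ceiling division
     // Dispatching (1, groupCountY, 1) thread groups then covers every fish at least once,
     // which is why the bounds check above is needed.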

To calculate the state of fish, a single double-length array is used. The first N elements of this array are the new states of fish to be calculated; the second N elements are the older states of fish that were previously calculated.

Index of the current fish within the old-states half of the array:

    int currentId = fishCount + currentThreadId;

Copy of the structure of the current state of fish:

     TInfo currentState = data[currentThreadId + fishCount];

Copy of the structure of the new state of fish:

     TInfo newState = data[currentThreadId];

Initialize direction vectors for the three types of behavior:

     float3 steerCohesion = {0.0f, 0.0f, 0.0f};
     float3 steerSeparation = {0.0f, 0.0f, 0.0f};
     float3 steerAlignment = {0.0f, 0.0f, 0.0f};

Initialize neighbors counters for each type of behavior:

     float steerCohesionCnt = 0.0f;
     float steerSeparationCnt = 0.0f;
     float steerAlignmentCnt = 0.0f;

Based on the current state of each fish, direction vectors are calculated for each of the three types of behavior. The loop begins at the middle of the input array, where the older states are stored:

     for (int i = fishCount; i < 2 * fishCount; i++) {

Each fish should ignore (not calculate) itself:

     if (i != currentId) {

Calculate the distance between the position of the current fish and the position of each other fish in the array:

     float d = getDistance(data[i].position, currentState.position);

If the distance is less than the cohesion radius:

     if (d < constants.radiusCohesion) {

Then the neighbor’s position is added to the cohesion vector:

     steerCohesion += data[i].position;

And the counter of neighbors for cohesion is increased:

            steerCohesionCnt++;
     }

If the distance is less than the separation radius:

     if (d < constants.radiusSeparation) {

Then the difference between the position of the neighbor and the position of the current fish is added to the separation vector:

     steerSeparation += data[i].position - currentState.position;

The counter of the number of neighbors for separation increases:

            steerSeparationCnt++;
     }

If the distance is less than the alignment radius:

     if (d < constants.radiusAlignment) {

Then the velocity of the neighbor is added to the alignment vector:

     steerAlignment += data[i].velocity;

The counter of the number of neighbors for alignment increases:

                          steerAlignmentCnt++;
                   }
            }
     }

If neighbors were found for cohesion:

   if (steerCohesionCnt != 0) {

The cohesion vector is divided by the number of neighbors and its own position is subtracted:

     steerCohesion = (steerCohesion / steerCohesionCnt - currentState.position);

The cohesion vector is normalized:

            steerCohesion = normalize(steerCohesion);
     }

If neighbors were found for separation:

     if (steerSeparationCnt != 0) {

Then the separation vector is divided by the number of neighbors and multiplied by -1 to change the direction:

     steerSeparation = -1.f * (steerSeparation / steerSeparationCnt);

The separation vector is normalized:

            steerSeparation = normalize(steerSeparation);
     }

If neighbors were found for alignment:

     if (steerAlignmentCnt != 0) {

Then the alignment vector is divided by the number of neighbors:

     steerAlignment /= steerAlignmentCnt;

The alignment vector is normalized:

           steerAlignment = normalize(steerAlignment);
     }

Based on the weight coefficients of each of the three possible types of behaviors, a new acceleration vector is determined, limited by the value of the maximum acceleration:

     newState.acceleration = (steerCohesion * constants.kCohesion + steerSeparation * constants.kSeparation
            + steerAlignment * constants.kAlignment);
     newState.acceleration = clamp(newState.acceleration, -1.0f * constants.maxAcceleration,
            constants.maxAcceleration);

The Z component of the acceleration vector is zeroed so that the steering behaviors do not accelerate the fish vertically:

     newState.acceleration[2] = 0.0f;

To the previous velocity vector, the product of the new acceleration vector and the time elapsed since the last calculation is added. The velocity vector is limited to the maximum value:

     newState.velocity += newState.acceleration * variables.DeltaTime;
     newState.velocity = clamp(newState.velocity, -1.0f * constants.maxVelocity, constants.maxVelocity);

To the previous position of the fish, the product of the new velocity vector and the time elapsed since the last calculation is added:

     newState.position += newState.velocity * variables.DeltaTime;

The current fish is checked to be within the specified boundaries. If yes, the calculated speed and position values are saved. If the fish has moved beyond the boundaries of the region along one of the axes, then the value of the velocity vector along this axis is multiplied by -1 to change the direction of motion:

                   float3 newVelocity = newState.velocity;
                   if (newState.position[0] > constants.mapRangeX || newState.position[0] < -constants.mapRangeX) {
                          newVelocity[0] *= -1.f;
                   }    

                   if (newState.position[1] > constants.mapRangeY || newState.position[1] < -constants.mapRangeY) {
                          newVelocity[1] *= -1.f;
                   }
                   if (newState.position[2] > constants.mapRangeZ || newState.position[2] < -3000.f) {
                          newVelocity[2] *= -1.f;
                   }
                   newState.velocity = newVelocity;

                   data[currentThreadId] = newState;
            }
     }         

Table 1: Comparison of algorithms.

Fish  | CPU SINGLE (FPS) | CPU MULTI (FPS) | GPU MULTI (FPS) | Computing Operations
100   | 62 | 62 | 62 | 10000
500   | 62 | 62 | 62 | 250000
1000  | 62 | 62 | 62 | 1000000
1500  | 49 | 61 | 62 | 2250000
2000  | 28 | 55 | 62 | 4000000
2500  | 18 | 42 | 62 | 6250000
3000  | 14 | 30 | 62 | 9000000
3500  | 10 | 23 | 56 | 12250000
4000  | 8  | 20 | 53 | 16000000
4500  | 6  | 17 | 50 | 20250000
5000  | 5  | 14 | 47 | 25000000
5500  | 4  | 12 | 35 | 30250000
6000  | 3  | 10 | 31 | 36000000
6500  | 2  | 8  | 30 | 42250000
7000  | 2  | 7  | 29 | 49000000
7500  | 1  | 7  | 27 | 56250000
8000  | 1  | 6  | 24 | 64000000
8500  | 0  | 5  | 21 | 72250000
9000  | 0  | 5  | 20 | 81000000
9500  | 0  | 4  | 19 | 90250000
10000 | 0  | 3  | 18 | 100000000
10500 | 0  | 3  | 17 | 110250000
11000 | 0  | 2  | 15 | 121000000
11500 | 0  | 2  | 15 | 132250000
12000 | 0  | 1  | 14 | 144000000
13000 | 0  | 0  | 12 | 169000000
14000 | 0  | 0  | 11 | 196000000
15000 | 0  | 0  | 10 | 225000000
16000 | 0  | 0  | 9  | 256000000
17000 | 0  | 0  | 8  | 289000000
18000 | 0  | 0  | 3  | 324000000
19000 | 0  | 0  | 2  | 361000000
20000 | 0  | 0  | 1  | 400000000

Figure 2: Comparison of algorithms.

Laptop Hardware:
CPU – Intel® Core i7-3632QM processor 2.2 GHz with turbo boost up to 3.2 GHz
GPU - NVIDIA GeForce* GT 730M
RAM - 8 GB DDR3*

Detecting Diabetic Retinopathy Using Deep Learning on Intel® Architecture


Abstract

Diabetic retinopathy (DR) is one of the leading causes of preventable blindness, affecting people across the globe. Detecting it is a time-consuming and manual process. This experiment aims to automate preliminary DR detection based on a retinal image of a patient's eye. A TensorFlow*-based implementation uses convolutional neural networks to take a retinal image, analyze it, and learn the characteristics of an eye that shows signs of diabetic retinopathy in order to detect this condition. A simple transfer learning approach, with an Inception* v3 architecture model pre-trained on an ImageNet* dataset, was used to train and test on a retina dataset. The experiments were run on Intel® Xeon® Gold processor powered systems. The tests resulted in a training accuracy of about 83 percent and a test accuracy of approximately 77 percent (refer to Configurations).

Introduction

Diabetic retinopathy (DR) is one of the leading causes of preventable blindness. It affects up to 40 percent of diabetic patients, with nearly 100 million cases worldwide as of 2010. Currently, detecting DR is a time-consuming and manual process that requires a trained clinician to examine and evaluate digital color fundus photographs of the retina. By the time human readers submit their reviews, often a day or two later, the delayed results lead to lost follow-up, miscommunication, and delayed treatment. The objective of this experiment is to develop an automated method for DR screening. Automatically flagging eyes with DR for further evaluation and treatment by an ophthalmologist would enable timely and accurate diagnoses and help reduce the rate of vision loss.

Continued research in the deep learning space has resulted in many frameworks for solving the complex problems of image classification, detection, and segmentation. These frameworks have been optimized for the specific hardware on which they run, for better accuracy, reduced loss, and increased speed. Intel has optimized the TensorFlow* library for better performance on Intel® Xeon® Gold processors. This paper discusses training and inference for the DR detection model, built using the Inception* v3 architecture with the TensorFlow framework on Intel® processor powered clusters. A transfer learning approach was used by taking the weights for the Inception v3 architecture trained on an ImageNet* dataset and using those weights on a retina dataset to train, validate, and test.

Document Content

This section describes in detail the end-to-end steps, from choosing the environment, to running the tests on the trained DR detection model.

Choosing the Environment

Hardware 

The experiments were performed on an Intel Xeon Gold processor powered system with the configuration listed in the following table:

  • Architecture: x86_64
  • CPU op-mode(s): 32-bit, 64-bit
  • Byte order: Little-endian
  • CPU(s): 24
  • Core(s) per socket: 6
  • Socket(s): 2
  • CPU family: 6
  • Model: 85
  • Model name: Intel® Xeon® Gold 6128 processor @ 3.40 GHz
  • RAM: 92 GB

Table 1. Intel® Xeon® Gold processor configuration.

Software 

An Intel® optimized TensorFlow framework, along with the Intel® Distribution for Python*, was used as the software configuration.

  • TensorFlow*: 1.4.0 (Intel® optimized)
  • Python*: 3.6 (Intel optimized)

Table 2. On Intel® Xeon® Gold processor.

The listed software configurations are available for the hardware environments chosen, so no source build of TensorFlow was required.

Dataset

The dataset is a small, curated subset of images that was created from Kaggle's Diabetic Retinopathy Detection challenge’s train dataset. The dataset contains a large set of high-resolution retina images taken under a variety of imaging conditions. A left and right field is provided for every subject. Images are labeled with a subject ID as well as either left or right (for example, 1_left.jpeg is the left eye of patient ID 1). As the images are from different cameras, they may be of different quality in terms of exposure and focus sharpness. Also, some of the images are inverted. The data also has noise in both images and labels.

The presence of disease in each image is labeled as either 0 or 1, as follows:

        0: No Disease

        1: Disease

For this experiment, the dataset was split into a training set (90 percent of the files) and a test set (10 percent of the files).

Inception* v3 Architecture

The Inception v3 architecture was built with the intent to improve the utilization of computing resources inside a deep neural network. The main idea behind Inception v3 is to approximate a sparse structure with spatially repeated dense components, and to use dimension reduction (as in a network-in-network architecture) to keep the computational complexity in bounds, but only when required. The computational cost of Inception v3 is also much lower than that of other topologies such as AlexNet, VGGNet*, ResNet*, and so on. More information on Inception v3 is given in Rethinking the Inception Architecture for Computer Vision3. The Inception v3 architecture is shown in the following figure:

Inception* v3 model

Figure 1. Inception* v3 model3.

To accelerate the training process, the transfer learning technique was applied by using a pre-trained Inception v3 model on the ImageNet dataset. The pre-trained model has already learned from that data and stored the knowledge in the form of weights. These weights are used directly as initial weights and are readjusted when the model is retrained on the retina dataset. The pre-trained model was downloaded from the link in reference 4.

Execution Steps

This section describes the steps followed in the end-to-end process for training, validation, and testing the retinopathy detection model on Intel® architecture.

These steps include:

  1. Preparing input
  2. Model training
  3. Inference

Preparing Input

Image Directories

The dataset was downloaded from the Nomikxyz/retinopathy-dataset repository1.

  • The files were extracted and separated into different directories based on the DR types.
  • A total of 2063 images (in diseased and non-diseased folders) were separated and put into a different directory from the master list.
  • There were 1857 JPEG images of retinas for training, 206 images for testing, and a .CSV file listing the disease level for the training images.

Processing and Data Transformations

  • Images from the training and test datasets vary widely in resolution, aspect ratio, and color; they are cropped in various ways, and some are of very low quality or out of focus.
  • To help improve the results during training, the images are augmented through simple distortions like crops, scales, and flips.
  • Images were of varying sizes and were cropped to 299 pixels wide by 299 pixels high.

Model Training

Transfer learning is a technique that reduces the time taken to train from scratch by taking a fully-trained model for a set of categories like ImageNet and retrains from the existing weights for new classes. In the experiment, we retrained the final layer from scratch, while leaving all the others untouched. The following command was run that accesses the training images and trains the algorithm toward detecting diseased images.

The retrain.py was run on the retina dataset as follows:

python retrain.py \
  --bottleneck_dir=bottlenecks \
  --how_many_training_steps=300 \
  --model_dir=inception \
  --output_graph=retrained_graph.pb \
  --output_labels=retrained_labels.txt \
  --image_dir=<>

This script loads the pre-trained Inception v3 model, removes the old top layer, and trains a new top layer on the retina images. Although there was no retina class among the original ImageNet classes on which the full network was trained, transfer learning works because the lower layers have already learned to distinguish generic features (for example, edges or color blobs) that can be reused for other recognition tasks without any modification.

Retraining with Bottlenecks

TensorFlow computes all the bottleneck values as the first step in training. In this step, it analyzes all the images on disk and calculates the bottleneck values for each of them. Bottleneck is an informal term we often use for the last-but-one layer before the final output layer that actually does the classification. This penultimate layer has been trained to output a set of values that is good enough for the classifier to use, to distinguish between all the classes it has been asked to recognize. The reason our final layer retraining can work on new classes is that it turns out that the kind of information needed to distinguish between all of the 1,000 classes in ImageNet is often also useful to distinguish between new kinds of objects like retina, traffic signal, accidents, and so on.

The bottleneck values are then stored as they will be required for each iteration of training. The computation of these values is faster because TensorFlow takes the help of the existing pre-trained model to assist it with the process. As every image is reused multiple times during training, and calculating each bottleneck takes a significant amount of time, it speeds things up to cache these bottleneck values on disk so they do not have to be repeatedly recalculated, and the values are stored in the bottleneck directory.

Training

After the bottlenecks are complete, the actual training of the top layer of the network begins. During the run, the following outputs are generated showing the progress of algorithm training:

  • Training accuracy shows the percentage of the images used in the current training batch that were labeled with the correct class.
  • Validation accuracy is the precision (percentage of correctly labeled images) on a randomly selected group of images from a different set.
  • Cross entropy is a loss function that tells us how well the learning process is progressing.

Training was run on the 2063-image dataset with a batch size of 100 for 300 steps/iterations, and we observed a training accuracy of 83.0 percent (refer to Configurations).

Testing

We ran label_image.py against the trained model on the 206 test images with the following script and observed a testing accuracy of about 77.2 percent.

python -m scripts.label_image \
    --graph=tf_files/retrained_graph.pb  \
    --image=<>

Diseased versus Not probability

Figure 2. Diseased versus Not probability.

Conclusion

In this paper, we explained how training and testing for retinopathy detection was done using transfer learning, where the weights from an Inception v3 model trained on the ImageNet dataset were used. These weights were readjusted when the model was retrained in the Intel Xeon Gold processor-powered environment. The experiment can be extended by applying different optimization algorithms, changing learning rates, and varying input sizes so that the accuracy can be improved further.

About the Author

Lakshmi Bhavani Manda and Ajit Kumar Pookalangara are part of the Intel team working on artificial intelligence (AI) evangelization.

Configurations 

For performance reference under Abstract and Training sections:

        Hardware: refer to Hardware under Choosing the Environment

        Software: refer to Software under Choosing the Environment

        Test performed: executed on the remaining 10 percent of the images using the trained model

For more information, go to the Product Performance site.

References

1. For curated dataset:
https://github.com/Nomikxyz/retinopathy-dataset

2. TensorFlow for Poets tutorial:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets

3. Rethinking the Inception Architecture for Computer Vision:
https://arxiv.org/pdf/1512.00567v3.pdf

4. Dataset Link:
https://storage.googleapis.com/download.tensorflow.org/models/inception_dec_2015.zip

Related Resources

TensorFlow* Optimizations on Modern Intel® Architecture: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture

Build and Install TensorFlow* on Intel® Architecture: https://software.intel.com/en-us/articles/build-and-install-tensorflow-on-intel-architecture

Qlik Increases Big Data Analysis Speed on Intel® Xeon® Platinum 8168 processor


Qlik enables organizations to analyze disparate data sources using a visual interface.

Performance is essential to enable users to explore their data intuitively. Data is cached in memory. As users make new selections, everything based on the selection is recalculated and the visualization is updated. If a page takes longer than a second or two to update, users will lose their train of thought, and their patience.

Qlik worked with Intel to benchmark the performance of the new Intel® Xeon® Platinum 8168 processor and compare it to the previous-generation Intel® Xeon® processor E5-2699 v4 and the v3 generation before it. The test used an internal Qlik benchmark that performs over 80 selections at the same time, simulating user interactions. The data set comprised 1 billion records of sales data, representing different customers in different countries. The calculations involved processing the data while excluding a single week from the data presented. Reports included the sales by year, the top ten customers with their sales totals, and gross margins by product category shown using a treemap (a grid of proportionally sized boxes). The scenario stresses the processor's CPU and memory.

View complete Solution Brief (PDF)


API Without Secrets: Introduction to Vulkan* Part 7


Tutorial 7: Uniform Buffers — Using Buffers in Shaders

Go back to previous tutorial: Introduction to Vulkan Part 6 – Descriptor Sets 

This is the time to summarize knowledge presented so far and create a more typical rendering scenario. Here we will see an example that is still very simple, yet it reflects the most common way to display 3D geometry on screen. We will extend code from the previous tutorial by adding a transformation matrix to the shader uniform data. This way we can see how to use multiple different descriptors in a single descriptor set.

Of course, the knowledge presented here applies to many other use cases, as descriptor sets may contain multiple resources of the same or different types. Nothing stops us from creating a descriptor set with many storage buffers or sampled images. We can also mix them, as shown in this tutorial, where we use both a texture (combined image sampler) and a uniform buffer. We will see how to create a layout for such a descriptor set, how to create the descriptor set itself, and how to populate it with appropriate resources.

In a previous part of the tutorial we learned how to create images and use them as textures inside shaders. This knowledge is also used in this tutorial, but here we focus only on buffers, and learn how to use them as a source of uniform data. We also see how to prepare a projection matrix, how to copy it to a buffer, and how to access it inside shaders.

Creating a Uniform Buffer

In this example we want to use two types of uniform variables inside shaders: combined image sampler (sampler2D inside shader) and a uniform projection matrix (mat4). In Vulkan*, uniform variables other than opaque types like samplers cannot be declared in a global scope (as in OpenGL*); they must be accessed from within uniform buffers. We start by creating a buffer.

Buffers can be used for many different purposes: they can be a source of vertex data (vertex attributes); we can keep vertex indices in them so they are used as index buffers; they can contain shader uniform data; or we can store data in them from within shaders and use them as storage buffers. We can even keep formatted data inside buffers, access it through buffer views, and treat them as texel buffers (similar to OpenGL's buffer textures). For all the above purposes we use ordinary buffers, always created in the same way. It is the usage provided during buffer creation that defines how we can use a given buffer during its lifetime.
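
As a quick illustration of how those purposes map to creation parameters, the following snippet (illustrative only, not part of the tutorial's code) lists typical VkBufferUsageFlagBits values that would be passed in the usage field for each case:

#include <vulkan/vulkan.h>

// Illustrative only: typical usage flags for the purposes listed above.
// Flags can be combined with the bitwise OR operator when a buffer serves
// several purposes at once.
VkBufferUsageFlags vertex_usage  = VK_BUFFER_USAGE_VERTEX_BUFFER_BIT;
VkBufferUsageFlags index_usage   = VK_BUFFER_USAGE_INDEX_BUFFER_BIT;
VkBufferUsageFlags uniform_usage = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT;
VkBufferUsageFlags storage_usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT;
VkBufferUsageFlags texel_usage   = VK_BUFFER_USAGE_UNIFORM_TEXEL_BUFFER_BIT;
// A buffer that is filled through a staging copy additionally needs:
VkBufferUsageFlags copy_target   = VK_BUFFER_USAGE_TRANSFER_DST_BIT;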

We saw how to create a buffer in Introduction to Vulkan Part 4 – Vertex Attributes, so only source code is presented here without diving into specifics:

VkBufferCreateInfo buffer_create_info = {
    VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO, // VkStructureType      sType
    nullptr,                              // const void          *pNext
    0,                                    // VkBufferCreateFlags  flags
    buffer.Size,                          // VkDeviceSize         size
    usage,                                // VkBufferUsageFlags   usage
    VK_SHARING_MODE_EXCLUSIVE,            // VkSharingMode        sharingMode
    0,                                    // uint32_t             queueFamilyIndexCount
    nullptr                               // const uint32_t      *pQueueFamilyIndices
  };

  if( vkCreateBuffer( GetDevice(), &buffer_create_info, nullptr, &buffer.Handle ) != VK_SUCCESS ) {
    std::cout << "Could not create buffer!"<< std::endl;
    return false;
  }

  if( !AllocateBufferMemory( buffer.Handle, memoryProperty, &buffer.Memory ) ) {
    std::cout << "Could not allocate memory for a buffer!"<< std::endl;
    return false;
  }

  if( vkBindBufferMemory( GetDevice(), buffer.Handle, buffer.Memory, 0 ) != VK_SUCCESS ) {
    std::cout << "Could not bind memory to a buffer!"<< std::endl;
    return false;
  }

return true;

1. Tutorial07.cpp, function CreateBuffer()

We first create a buffer by defining its parameters in a variable of type VkBufferCreateInfo. Here we define the buffer's most important parameters — its size and usage. Next, we create a buffer by calling the vkCreateBuffer() function. After that, we need to allocate a memory object (or use a part of another, existing memory object) to bind it to the buffer through the vkBindBufferMemory() function call. Only after that can we use the buffer the way we want to in our application. Allocating a dedicated memory object is performed as follows:

VkMemoryRequirements buffer_memory_requirements;
vkGetBufferMemoryRequirements( GetDevice(), buffer, &buffer_memory_requirements );

VkPhysicalDeviceMemoryProperties memory_properties;
vkGetPhysicalDeviceMemoryProperties( GetPhysicalDevice(), &memory_properties );

for( uint32_t i = 0; i < memory_properties.memoryTypeCount; ++i ) {
  if( (buffer_memory_requirements.memoryTypeBits & (1 << i)) &&
    (memory_properties.memoryTypes[i].propertyFlags & property) ) {

    VkMemoryAllocateInfo memory_allocate_info = {
      VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO, // VkStructureType   sType
      nullptr,                                // const void       *pNext
      buffer_memory_requirements.size,        // VkDeviceSize      allocationSize
      i                                       // uint32_t          memoryTypeIndex
    };

    if( vkAllocateMemory( GetDevice(), &memory_allocate_info, nullptr, memory ) == VK_SUCCESS ) {
      return true;
    }
  }
}
return false;

2. Tutorial07.cpp, function AllocateBufferMemory()

To create a buffer that can be used as a source of shader uniform data, we need to create it with the VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT usage. But, depending on how we want to transfer data to it, we may need other usages as well. Here we want to use a buffer with device-local memory bound to it, because such memory may have better performance. But, depending on the hardware's architecture, it may not be possible to map such memory and copy data to it directly from the CPU. That's why we want to use a staging buffer through which data will be copied from the CPU to our uniform buffer. For that to work, our uniform buffer must also be created with the VK_BUFFER_USAGE_TRANSFER_DST_BIT usage, as it will be the target of a data copy operation. Below, we can see how our buffer is finally created:

Vulkan.UniformBuffer.Size = 16 * sizeof(float);
if( !CreateBuffer( VK_BUFFER_USAGE_TRANSFER_DST_BIT | VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, Vulkan.UniformBuffer ) ) {
  std::cout << "Could not create uniform buffer!"<< std::endl;
  return false;
}

if( !CopyUniformBufferData() ) {
  return false;
}

return true;

3. Tutorial07.cpp, function CreateUniformBuffer()

Copying Data to Buffers

The next thing is to upload appropriate data to our uniform buffer. In it we will store 16 elements of a 4 x 4 matrix. We are using an orthographic projection matrix but we can store any other type of data; we just need to remember that each uniform variable must be placed at an appropriate offset, counting from the beginning of a buffer's memory. Such an offset must be a multiple of a specific value. In other words, it must be aligned to a specific value or it must have a specific alignment. The alignment of each uniform variable depends on the variable's type, and the specification defines it as follows:

  • A scalar variable whose type has N bytes must be aligned to an address that is a multiple of N.
  • A two-component vector, with components of size N each, must be aligned to 2 N.
  • A three- or four-component vector, with components of size N each, has an alignment of 4 N.
  • An array's alignment is calculated as the alignment of its elements, rounded up to a multiple of 16.
  • A structure's alignment is calculated as the largest alignment of any of its members, rounded up to a multiple of 16.
  • A row-major matrix with C columns has an alignment equal to the alignment of a vector with C elements of the same type as the elements of the matrix.
  • A column-major matrix has an alignment equal to the alignment of the matrix column type.

The above rules are similar to those defined for the standard GLSL std140 layout, and we can apply them to Vulkan's uniform buffers as well. But we need to remember that placing data at inappropriate offsets will lead to incorrect values being fetched in shaders.
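
To make these rules more concrete, below is a small GLSL sketch (the block and member names are made up for illustration and are not part of this tutorial's shaders) annotated with the offsets that the above rules produce:

layout(set=0, binding=1) uniform ExampleBlock {
    float u_Scalar;   // offset  0 (4-byte scalar         -> 4-byte alignment)
    vec2  u_Offset2D; // offset  8 (2 x 4-byte components -> 8-byte alignment)
    vec3  u_Color;    // offset 16 (3-component vector    -> 16-byte alignment)
    mat4  u_Matrix;   // offset 32 (column-major mat4     -> 16-byte alignment)
};

When filling the buffer from the CPU, we must write each member at exactly these offsets; otherwise the shader will read garbage.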

For the sake of simplicity, our example has only one uniform variable, so it can be placed at the very beginning of a buffer. To transfer data to it, we will use a staging buffer — it is created with the VK_BUFFER_USAGE_TRANSFER_SRC_BIT usage and is backed by memory supporting the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT property, so we can map it. Below, we can see how data is copied to the staging buffer:

const std::array<float, 16> uniform_data = GetUniformBufferData();

void *staging_buffer_memory_pointer;
if( vkMapMemory( GetDevice(), Vulkan.StagingBuffer.Memory, 0, Vulkan.UniformBuffer.Size, 0, &staging_buffer_memory_pointer) != VK_SUCCESS ) {
    std::cout << "Could not map memory and upload data to a staging buffer!" << std::endl;
    return false;
}

memcpy( staging_buffer_memory_pointer, uniform_data.data(), Vulkan.UniformBuffer.Size );

VkMappedMemoryRange flush_range = {
    VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,  // VkStructureType  sType
    nullptr,                                // const void      *pNext
    Vulkan.StagingBuffer.Memory,            // VkDeviceMemory   memory
    0,                                      // VkDeviceSize     offset
    Vulkan.UniformBuffer.Size               // VkDeviceSize     size
};
vkFlushMappedMemoryRanges( GetDevice(), 1, &flush_range );

vkUnmapMemory( GetDevice(), Vulkan.StagingBuffer.Memory );

4. Tutorial07.cpp, function CopyUniformBufferData()

First, we prepare the projection matrix data. It is stored in a std::array, but we could keep it in any other type of variable. Next, we map the memory bound to the staging buffer. We need access to a memory range at least as big as the data we want to copy, so we also need to remember to create a staging buffer that is big enough to hold it. Next, we copy data to the staging buffer using an ordinary memcpy() call. Now, we must tell the driver which parts of the buffer's memory were changed; this operation is called flushing. After that, we unmap the staging buffer's memory. Keep in mind that frequent mapping and unmapping may impact the performance of our application. In Vulkan, resources can stay mapped for their whole lifetime without any penalty, so if we want to transfer data frequently through staging resources, we should map them only once and keep the acquired pointer for future use. Here we unmap the memory just to show how it is done.
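
As a rough sketch of that advice (not code from Tutorial07), persistent mapping boils down to mapping once, keeping the pointer, and only copying and flushing on each update:

// Map once, right after the staging buffer's memory is allocated and bound
void *staging_buffer_memory_pointer = nullptr;
if( vkMapMemory( GetDevice(), Vulkan.StagingBuffer.Memory, 0, VK_WHOLE_SIZE, 0, &staging_buffer_memory_pointer ) != VK_SUCCESS ) {
  return false;
}

// Each time new data must be transferred, just copy and flush — no re-mapping
const std::array<float, 16> uniform_data = GetUniformBufferData();
memcpy( staging_buffer_memory_pointer, uniform_data.data(), Vulkan.UniformBuffer.Size );

VkMappedMemoryRange flush_range = {
  VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE, // VkStructureType  sType
  nullptr,                               // const void      *pNext
  Vulkan.StagingBuffer.Memory,           // VkDeviceMemory   memory
  0,                                     // VkDeviceSize     offset
  Vulkan.UniformBuffer.Size              // VkDeviceSize     size
};
vkFlushMappedMemoryRanges( GetDevice(), 1, &flush_range );

// Unmap only once, when the staging buffer itself is destroyed
// vkUnmapMemory( GetDevice(), Vulkan.StagingBuffer.Memory );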

Now we need to transfer data from the staging buffer to our target, the uniform buffer. To do that, we need to prepare a command buffer in which we will record the appropriate operations, and then submit it so these operations are executed.

We start by taking any unused command buffer. It must be allocated from a pool created for a queue that supports transfer operations. The Vulkan specification requires that at least one general-purpose queue be available — a queue that supports graphics (rendering), compute, and transfer operations. In the case of Intel® hardware, there is only one queue family with one general-purpose queue, so we don't have this problem. Other hardware vendors may support other types of queue families, maybe even a queue family dedicated to data transfers. In that case, we should choose a queue from such a family.
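
If we wanted to support such hardware, a simple (sketch-only) way to look for a transfer-capable queue family, preferring a dedicated one, could look like this:

uint32_t queue_families_count = 0;
vkGetPhysicalDeviceQueueFamilyProperties( GetPhysicalDevice(), &queue_families_count, nullptr );

std::vector<VkQueueFamilyProperties> queue_family_properties( queue_families_count );
vkGetPhysicalDeviceQueueFamilyProperties( GetPhysicalDevice(), &queue_families_count, queue_family_properties.data() );

uint32_t transfer_queue_family_index = UINT32_MAX;
for( uint32_t i = 0; i < queue_families_count; ++i ) {
  // Note: graphics- and compute-capable families implicitly support transfer operations too
  if( queue_family_properties[i].queueFlags & VK_QUEUE_TRANSFER_BIT ) {
    // Prefer a family that does only transfers — it may map to a dedicated copy engine
    if( !(queue_family_properties[i].queueFlags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)) ) {
      transfer_queue_family_index = i;
      break;
    }
    if( transfer_queue_family_index == UINT32_MAX ) {
      transfer_queue_family_index = i;
    }
  }
}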

We start recording a command buffer by calling the vkBeginCommandBuffer() function. Next, we record the vkCmdCopyBuffer() command that performs the data transfer, where we tell it that we want to copy data from the very beginning of the staging buffer (0th offset) to the very beginning of our uniform buffer (also 0th offset). We also provide the size of data to be copied.

Next, we need to tell the driver that after the data transfer is performed, our whole uniform buffer will be used as, well, a uniform buffer. This is performed by placing a buffer memory barrier in which we tell it that, until now, we were transferring data to the buffer (VK_ACCESS_TRANSFER_WRITE_BIT), but from now on we will use it as a source of data for uniform variables (VK_ACCESS_UNIFORM_READ_BIT). The buffer memory barrier is placed using the vkCmdPipelineBarrier() function call. It occurs after the data transfer operation (VK_PIPELINE_STAGE_TRANSFER_BIT) but before the vertex shader execution, as we access our uniform variable inside the vertex shader (VK_PIPELINE_STAGE_VERTEX_SHADER_BIT).

Finally, we can end the command buffer and submit it to the queue. The whole process is presented in the code below:

// Prepare command buffer to copy data from staging buffer to a uniform buffer
VkCommandBuffer command_buffer = Vulkan.RenderingResources[0].CommandBuffer;

VkCommandBufferBeginInfo command_buffer_begin_info = {
  VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO, // VkStructureType              sType
  nullptr,                                     // const void                  *pNext
  VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT, // VkCommandBufferUsageFlags    flags
  nullptr                                      // const VkCommandBufferInheritanceInfo  *pInheritanceInfo
};

vkBeginCommandBuffer( command_buffer, &command_buffer_begin_info);

VkBufferCopy buffer_copy_info = {
  0,                                // VkDeviceSize       srcOffset
  0,                                // VkDeviceSize       dstOffset
  Vulkan.UniformBuffer.Size         // VkDeviceSize       size
};
vkCmdCopyBuffer( command_buffer, Vulkan.StagingBuffer.Handle, Vulkan.UniformBuffer.Handle, 1, &buffer_copy_info );

VkBufferMemoryBarrier buffer_memory_barrier = {
  VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER, // VkStructureType    sType;
  nullptr,                          // const void        *pNext
  VK_ACCESS_TRANSFER_WRITE_BIT,     // VkAccessFlags      srcAccessMask
  VK_ACCESS_UNIFORM_READ_BIT,       // VkAccessFlags      dstAccessMask
  VK_QUEUE_FAMILY_IGNORED,          // uint32_t           srcQueueFamilyIndex
  VK_QUEUE_FAMILY_IGNORED,          // uint32_t           dstQueueFamilyIndex
  Vulkan.UniformBuffer.Handle,      // VkBuffer           buffer
  0,                                // VkDeviceSize       offset
  VK_WHOLE_SIZE                     // VkDeviceSize       size
};
vkCmdPipelineBarrier( command_buffer, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_VERTEX_SHADER_BIT, 0, 0, nullptr, 1, &buffer_memory_barrier, 0, nullptr );

vkEndCommandBuffer( command_buffer );

// Submit command buffer and copy data from staging buffer to the uniform buffer
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO,    // VkStructureType    sType
  nullptr,                          // const void        *pNext
  0,                                // uint32_t           waitSemaphoreCount
  nullptr,                          // const VkSemaphore *pWaitSemaphores
  nullptr,                          // const VkPipelineStageFlags *pWaitDstStageMask;
  1,                                // uint32_t           commandBufferCount
  &command_buffer,                  // const VkCommandBuffer *pCommandBuffers
  0,                                // uint32_t           signalSemaphoreCount
  nullptr                           // const VkSemaphore *pSignalSemaphores
};

if( vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, VK_NULL_HANDLE ) != VK_SUCCESS ) {
  return false;
}

vkDeviceWaitIdle( GetDevice() );
return true;

5. Tutorial07.cpp, function CopyUniformBufferData()

In the code above we call the vkDeviceWaitIdle() function to make sure the data transfer operation is finished before we proceed. But in real-life situations we should perform more fine-grained synchronization by using semaphores and/or fences. Waiting for all GPU operations to finish may (and probably will) kill the performance of our application.
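
As a minimal sketch (not part of the tutorial's code), waiting on a fence tied to this single submission could look like this:

VkFenceCreateInfo fence_create_info = {
  VK_STRUCTURE_TYPE_FENCE_CREATE_INFO, // VkStructureType    sType
  nullptr,                             // const void        *pNext
  0                                    // VkFenceCreateFlags flags
};

VkFence copy_fence = VK_NULL_HANDLE;
if( vkCreateFence( GetDevice(), &fence_create_info, nullptr, &copy_fence ) != VK_SUCCESS ) {
  return false;
}

// Submit as before, but pass the fence so it gets signaled when the copy is done
if( vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, copy_fence ) != VK_SUCCESS ) {
  vkDestroyFence( GetDevice(), copy_fence, nullptr );
  return false;
}

// Wait only for this submission (with a one-second timeout), not for the whole device
vkWaitForFences( GetDevice(), 1, &copy_fence, VK_TRUE, 1000000000 );
vkDestroyFence( GetDevice(), copy_fence, nullptr );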

Preparing Descriptor Sets

Now we can prepare descriptor sets — the interface between our application and a pipeline through which we can provide resources used by shaders. We start by creating a descriptor set layout.

Creating Descriptor Set Layout

The most typical way that 3D geometry is rendered is by multiplying vertices by model, view, and projection matrices inside a vertex shader. These matrices may be accumulated in a model-view-projection matrix. We need to provide such a matrix to the vertex shader in a uniform variable. Usually we want our geometry to be textured; the fragment shader needs access to a texture — a combined image sampler. We can also use a separate sampled image and a sampler; using combined image samplers may have better performance on some platforms.

When we issue drawing commands, we want the vertex shader to have access to a uniform variable, and the fragment shader to a combined image sampler. These resources must be provided in a descriptor set. In order to allocate such a set we need to create an appropriate layout, which defines what types of resources are stored inside the descriptor sets.

std::vector<VkDescriptorSetLayoutBinding> layout_bindings = {
  {
    0,                                         // uint32_t           binding
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, // VkDescriptorType   descriptorType
    1,                                         // uint32_t           descriptorCount
    VK_SHADER_STAGE_FRAGMENT_BIT,              // VkShaderStageFlags stageFlags
    nullptr                                    // const VkSampler   *pImmutableSamplers
  },
  {
    1,                                         // uint32_t           binding
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         // VkDescriptorType   descriptorType
    1,                                         // uint32_t           descriptorCount
    VK_SHADER_STAGE_VERTEX_BIT,                // VkShaderStageFlags stageFlags
    nullptr                                    // const VkSampler   *pImmutableSamplers
  }
};

VkDescriptorSetLayoutCreateInfo descriptor_set_layout_create_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO, // VkStructureType  sType
  nullptr,                                             // const void      *pNext
  0,                                                   // VkDescriptorSetLayoutCreateFlags flags
  static_cast<uint32_t>(layout_bindings.size()),       // uint32_t         bindingCount
  layout_bindings.data()                               // const VkDescriptorSetLayoutBinding *pBindings
};

if( vkCreateDescriptorSetLayout( GetDevice(), &descriptor_set_layout_create_info, nullptr, &Vulkan.DescriptorSet.Layout ) != VK_SUCCESS ) {
  std::cout << "Could not create descriptor set layout!"<< std::endl;
  return false;
}

return true;

6. Tutorial07.cpp, function CreateDescriptorSetLayout()

Descriptor set layouts are created by specifying bindings. Each binding defines a separate entry in a descriptor set and has its own, unique index within that set. In the above code we define a layout for a descriptor set with two bindings. The first binding, with index 0, is for one combined image sampler accessed by a fragment shader. The second binding, with index 1, is for a uniform buffer accessed by a vertex shader. Both are single resources; they are not arrays. But we can also specify that a binding represents an array of resources by providing a value greater than 1 in the descriptorCount member of the VkDescriptorSetLayoutBinding structure.

Bindings are also used inside shaders. When we define uniform variables, we need to specify the same binding value as the one provided during layout creation:

layout( set=S, binding=B ) uniform <variable type> <variable name>;

Two things are worth mentioning. Bindings do not need to be consecutive. We can create a layout with three bindings occupying, for example, indices 2, 5, and 9. But unused slots may still use some memory, so we should keep bindings as close to 0 as possible.

We also specify which shader stages need access to which types of descriptors (which bindings). If we are not sure, we can provide more stages. For example, let's say we want to create several pipelines, all using descriptor sets with the same layout. In some of these pipelines, a uniform buffer will be accessed in a vertex shader, in others in a geometry shader, and in still others in both vertex and fragment shaders. For such a purpose we can create one layout in which we can specify that the uniform buffer will be accessed by vertex, geometry, and fragment shaders. But we should not provide unnecessary shader stages because, as usual, it may impact the performance (though this does not mean it will).
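
As a short, hypothetical illustration of such a shared binding, we simply OR the required stage bits together in the stageFlags member:

VkDescriptorSetLayoutBinding shared_uniform_binding = {
  1,                                 // uint32_t           binding
  VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER, // VkDescriptorType   descriptorType
  1,                                 // uint32_t           descriptorCount
  VK_SHADER_STAGE_VERTEX_BIT |       // VkShaderStageFlags stageFlags
  VK_SHADER_STAGE_GEOMETRY_BIT |
  VK_SHADER_STAGE_FRAGMENT_BIT,
  nullptr                            // const VkSampler   *pImmutableSamplers
};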

After specifying an array of bindings, we provide a pointer to it in a variable of type VkDescriptorSetLayoutCreateInfo. The pointer to this variable is provided in the vkCreateDescriptorSetLayout() function, which creates the actual layout. When we have it we can allocate a descriptor set. But first, we need a pool of memory from which the set can be allocated.

Creating a Descriptor Pool

When we want to create a descriptor pool, we need to know what types of resources will be defined in the descriptor sets allocated from it. We also need to specify not only the maximum number of resources of each type that can be stored in descriptor sets allocated from the pool, but also the maximum number of descriptor sets allocated from the pool. For example, we can prepare storage for one combined image sampler and one uniform buffer, but for two sets in total. This means that we can have two sets, one with a texture and one with a uniform buffer, or only one set with both a uniform buffer and a texture (in such a situation the second set stays empty, as the pool cannot provide either of these two resources for it).

In our example we need only one descriptor set, and we can see below how to create a descriptor pool for it:

std::vector<VkDescriptorPoolSize> pool_sizes = {
  {
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER,   // VkDescriptorType  type
    1                                            // uint32_t          descriptorCount
  },
  {
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,           // VkDescriptorType  type
    1                                            // uint32_t          descriptorCount
  }
};

VkDescriptorPoolCreateInfo descriptor_pool_create_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO, // VkStructureType     sType
  nullptr,                                       // const void         *pNext
  0,                                             // VkDescriptorPoolCreateFlags flags
  1,                                             // uint32_t            maxSets
  static_cast<uint32_t>(pool_sizes.size()),      // uint32_t            poolSizeCount
  pool_sizes.data()                              // const VkDescriptorPoolSize *pPoolSizes
};

if( vkCreateDescriptorPool( GetDevice(), &descriptor_pool_create_info, nullptr, &Vulkan.DescriptorSet.Pool ) != VK_SUCCESS ) {
  std::cout << "Could not create descriptor pool!"<< std::endl;
  return false;
}

return true;

7. Tutorial07.cpp, function CreateDescriptorPool()

Now, we are ready to allocate descriptor sets from the pool using the previously created layout.

Allocating Descriptor Sets

Descriptor set allocation is pretty straightforward. We just need a descriptor pool and a layout. We specify the number of descriptor sets to allocate and call the vkAllocateDescriptorSets() function like this:

VkDescriptorSetAllocateInfo descriptor_set_allocate_info = {
  VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO, // VkStructureType   sType
  nullptr,                                        // const void       *pNext
  Vulkan.DescriptorSet.Pool,                      // VkDescriptorPool  descriptorPool
  1,                                              // uint32_t          descriptorSetCount
  &Vulkan.DescriptorSet.Layout                    // const VkDescriptorSetLayout *pSetLayouts
};

if( vkAllocateDescriptorSets( GetDevice(), &descriptor_set_allocate_info, &Vulkan.DescriptorSet.Handle ) != VK_SUCCESS ) {
  std::cout << "Could not allocate descriptor set!"<< std::endl;
  return false;
}

return true;

8. Tutorial07.cpp, function AllocateDescriptorSet()

Updating Descriptor Sets

We have allocated a descriptor set. It is used to provide a texture and a uniform buffer to the pipeline so they can be used inside shaders. Now we must provide specific resources that will be used as descriptors. For the combined image sampler, we need two resources — an image, which can be sampled inside shaders (it must be created with the VK_IMAGE_USAGE_SAMPLED_BIT usage), and a sampler. These are two separate resources, but they are provided together to form a single, combined image sampler descriptor. For details about how to create these two resources, please refer to the Introduction to Vulkan Part 6 – Descriptor Sets. For the uniform buffer we will provide a buffer created earlier. To provide specific resources to a descriptor, we need to update a descriptor set. During updates we specify descriptor types, binding numbers, and counts in exactly the same way as we did during layout creation. These values must match. Apart from that, depending on the descriptor type, we also need to create variables of type:

  • VkDescriptorImageInfo, for samplers, sampled images, combined image samplers, and input attachments
  • VkDescriptorBufferInfo, for uniform and storage buffers and their dynamic variations
  • VkBufferView, for uniform and storage texel buffers

Through them, we provide handles of specific Vulkan resources that should be used for corresponding descriptors. All this is provided to the vkUpdateDescriptorSets() function, as we can see below:

VkDescriptorImageInfo image_info = {
  Vulkan.Image.Sampler,                    // VkSampler        sampler
  Vulkan.Image.View,                       // VkImageView      imageView
  VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL // VkImageLayout    imageLayout
};

VkDescriptorBufferInfo buffer_info = {
  Vulkan.UniformBuffer.Handle,             // VkBuffer         buffer
  0,                                       // VkDeviceSize     offset
  Vulkan.UniformBuffer.Size                // VkDeviceSize     range
};

std::vector<VkWriteDescriptorSet> descriptor_writes = {
  {
    VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,    // VkStructureType     sType
    nullptr,                                   // const void         *pNext
    Vulkan.DescriptorSet.Handle,               // VkDescriptorSet     dstSet
    0,                                         // uint32_t            dstBinding
    0,                                         // uint32_t            dstArrayElement
    1,                                         // uint32_t            descriptorCount
    VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER, // VkDescriptorType    descriptorType
    &image_info,                               // const VkDescriptorImageInfo  *pImageInfo
    nullptr,                                   // const VkDescriptorBufferInfo *pBufferInfo
    nullptr                                    // const VkBufferView *pTexelBufferView
  },
  {
    VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET,    // VkStructureType     sType
    nullptr,                                   // const void         *pNext
    Vulkan.DescriptorSet.Handle,               // VkDescriptorSet     dstSet
    1,                                         // uint32_t            dstBinding
    0,                                         // uint32_t            dstArrayElement
    1,                                         // uint32_t            descriptorCount
    VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER,         // VkDescriptorType    descriptorType
    nullptr,                                   // const VkDescriptorImageInfo  *pImageInfo
    &buffer_info,                              // const VkDescriptorBufferInfo *pBufferInfo
    nullptr                                    // const VkBufferView *pTexelBufferView
  }
};

vkUpdateDescriptorSets( GetDevice(), static_cast<uint32_t>(descriptor_writes.size()), &descriptor_writes[0], 0, nullptr );
return true;

9. Tutorial07.cpp, function UpdateDescriptorSet()

Now we have a valid descriptor set. We can bind it during command buffer recording. But, for that we need a pipeline object, which is created with an appropriate pipeline layout.

Preparing Drawing State

Created descriptor set layouts are required for two purposes:

  • Allocating descriptor sets from pools
  • Creating pipeline layout

The descriptor set layout specifies what types of resources the descriptor set contains. The pipeline layout specifies what types of resources can be accessed by a pipeline and its shaders. That's why before we can use a descriptor set during command buffer recording, we need to create a pipeline layout.

Creating a Pipeline Layout

The pipeline layout defines the resources that a given pipeline can access. These are divided into descriptors and push constants. To create a pipeline layout, we need to provide a list of descriptor set layouts, and a list of ranges of push constants.

Push constants provide a way to pass data to shaders easily and very, very quickly. Unfortunately, the amount of data is also very limited — the specification guarantees only 128 bytes of push constant data to be available to a pipeline at a given time. Hardware vendors may allow us to provide more, but it is still a very small amount compared to the usual descriptors, like uniform buffers.
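
For reference only — push constants are not used in this tutorial — providing them involves declaring a range when creating the pipeline layout and pushing the data while recording a command buffer. A hedged sketch:

VkPushConstantRange push_constant_range = {
  VK_SHADER_STAGE_VERTEX_BIT, // VkShaderStageFlags stageFlags
  0,                          // uint32_t           offset
  64                          // uint32_t           size (e.g. one mat4)
};
// The range would be passed through pushConstantRangeCount / pPushConstantRanges
// when creating the pipeline layout. Later, while recording a command buffer,
// the data is pushed directly, with no buffer or descriptor set involved:
const std::array<float, 16> matrix_data = GetUniformBufferData();
vkCmdPushConstants( command_buffer, Vulkan.PipelineLayout, VK_SHADER_STAGE_VERTEX_BIT, 0, 64, matrix_data.data() );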

In this example we don't use push constants ranges, so we only need to provide our descriptor set layout and call the vkCreatePipelineLayout() function. The code below does exactly that:

VkPipelineLayoutCreateInfo layout_create_info = {
  VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO, // VkStructureType              sType
  nullptr,                                       // const void                  *pNext
  0,                                             // VkPipelineLayoutCreateFlags  flags
  1,                                             // uint32_t                     setLayoutCount
  &Vulkan.DescriptorSet.Layout,                  // const VkDescriptorSetLayout *pSetLayouts
  0,                                             // uint32_t                     pushConstantRangeCount
  nullptr                                        // const VkPushConstantRange   *pPushConstantRanges
};

if( vkCreatePipelineLayout( GetDevice(), &layout_create_info, nullptr, &Vulkan.PipelineLayout ) != VK_SUCCESS ) {
  std::cout << "Could not create pipeline layout!"<< std::endl;
  return false;
}
return true;

10. Tutorial07.cpp, function CreatePipelineLayout()

Creating Shader Programs

Now we need a graphics pipeline. Pipeline creation is a very time-consuming process, from both the performance and code development perspective. I will skip the code and present only the GLSL source code of shaders.

The vertex shader used during drawing takes a vertex position and multiplies it by a projection matrix read from a uniform variable. This variable is stored inside a uniform buffer. The descriptor set, through which we provide our uniform buffer, is the first (and the only one in this case) in the list of descriptor sets specified during pipeline layout creation. So, when we record a command buffer, we can bind it to the 0th index. This is because indices to which we bind descriptor sets must match indices corresponding to descriptor set layouts that are provided during pipeline layout creation. The same set index must be specified inside shaders. The uniform buffer is represented by the second binding within that set (it has an index equal to 1), and the same binding number must also be specified. This is the whole vertex shader source code:

#version 450

layout(set=0, binding=1) uniform u_UniformBuffer {
    mat4 u_ProjectionMatrix;
};

layout(location = 0) in vec4 i_Position;
layout(location = 1) in vec2 i_Texcoord;

out gl_PerVertex
{
    vec4 gl_Position;
};

layout(location = 0) out vec2 v_Texcoord;

void main() {
    gl_Position = u_ProjectionMatrix * i_Position;
    v_Texcoord = i_Texcoord;
}

11. shader.vert, -

Inside the shader we also pass texture coordinates to a fragment shader. The fragment shader takes them and samples the combined image sampler. It is provided through the same descriptor set bound to index 0, but it is the first descriptor inside it, so in this case we specify 0 (zero) as the binding's value. Have a look at the full GLSL source code of the fragment shader:

#version 450

layout(set=0, binding=0) uniform sampler2D u_Texture;

layout(location = 0) in vec2 v_Texcoord;

layout(location = 0) out vec4 o_Color;

void main() {
  o_Color = texture( u_Texture, v_Texcoord );
}

12. shader.frag, -

The above two shaders need to be compiled to SPIR-V* before we can use them in our application. The core Vulkan specification allows only binary SPIR-V code to be used as a source of shader instructions. From the two binaries we can create two shader modules, one for each shader stage, and use them to create a graphics pipeline. The rest of the pipeline state remains unchanged.
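
Creating a shader module from such a binary is short. A sketch (GetBinaryFileContents() is assumed here to be a helper that returns the file's bytes; it is not shown in this part):

const std::vector<char> code = GetBinaryFileContents( "shader.vert.spv" );

VkShaderModuleCreateInfo shader_module_create_info = {
  VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO,     // VkStructureType            sType
  nullptr,                                         // const void                *pNext
  0,                                               // VkShaderModuleCreateFlags  flags
  code.size(),                                     // size_t                     codeSize
  reinterpret_cast<const uint32_t*>( code.data() ) // const uint32_t            *pCode
};

VkShaderModule shader_module = VK_NULL_HANDLE;
if( vkCreateShaderModule( GetDevice(), &shader_module_create_info, nullptr, &shader_module ) != VK_SUCCESS ) {
  return false;
}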

Binding Descriptor Sets

Let's assume we have all the other resources created and ready to be used to draw our geometry. We start recording a command buffer. Drawing commands can only be called inside render passes. Before we can draw any geometry, we need to set all the required states — first and foremost, we need to bind a graphics pipeline. Apart from that, if we are using a vertex buffer, we need to bind the appropriate buffer for this purpose. If we want to issue indexed drawing commands, we need to bind a buffer with vertex indices too. And when we are using shader resources like uniform buffers or textures, we need to bind descriptor sets. We do this by calling the vkCmdBindDescriptorSets() function, in which we need to provide not only the handle of our descriptor set, but also the handle of the pipeline layout (so we need to keep it). Only after that can we record drawing commands. These operations are presented in the code below:

vkCmdBeginRenderPass( command_buffer, &render_pass_begin_info, VK_SUBPASS_CONTENTS_INLINE );

vkCmdBindPipeline( command_buffer, VK_PIPELINE_BIND_POINT_GRAPHICS, Vulkan.GraphicsPipeline );

// ...

VkDeviceSize offset = 0;
vkCmdBindVertexBuffers( command_buffer, 0, 1, &Vulkan.VertexBuffer.Handle, &offset );

vkCmdBindDescriptorSets( command_buffer, VK_PIPELINE_BIND_POINT_GRAPHICS, Vulkan.PipelineLayout, 0, 1, &Vulkan.DescriptorSet.Handle, 0, nullptr );

vkCmdDraw( command_buffer, 4, 1, 0, 0 );

vkCmdEndRenderPass( command_buffer );

13. Tutorial07.cpp, function PrepareFrame()

Don't forget that a typical frame of animation requires us to acquire an image from a swapchain, record the command buffer (or more) as presented above, submit it to a queue, and present a previously acquired swapchain image so it gets displayed according to the present mode requested during swapchain creation.
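
A heavily simplified sketch of such a frame (error handling, fences, and the command buffer recording itself are omitted; the Swapchain, ImageAvailableSemaphore, and RenderingFinishedSemaphore handles are assumed to exist) could look like this:

uint32_t image_index = 0;
vkAcquireNextImageKHR( GetDevice(), Swapchain, UINT64_MAX, ImageAvailableSemaphore, VK_NULL_HANDLE, &image_index );

// ...record the command buffer for this image, as shown above...

VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit_info = {
  VK_STRUCTURE_TYPE_SUBMIT_INFO, // VkStructureType             sType
  nullptr,                       // const void                 *pNext
  1,                             // uint32_t                    waitSemaphoreCount
  &ImageAvailableSemaphore,      // const VkSemaphore          *pWaitSemaphores
  &wait_stage,                   // const VkPipelineStageFlags *pWaitDstStageMask
  1,                             // uint32_t                    commandBufferCount
  &command_buffer,               // const VkCommandBuffer      *pCommandBuffers
  1,                             // uint32_t                    signalSemaphoreCount
  &RenderingFinishedSemaphore    // const VkSemaphore          *pSignalSemaphores
};
vkQueueSubmit( GetGraphicsQueue().Handle, 1, &submit_info, VK_NULL_HANDLE );

VkPresentInfoKHR present_info = {
  VK_STRUCTURE_TYPE_PRESENT_INFO_KHR, // VkStructureType       sType
  nullptr,                            // const void           *pNext
  1,                                  // uint32_t              waitSemaphoreCount
  &RenderingFinishedSemaphore,        // const VkSemaphore    *pWaitSemaphores
  1,                                  // uint32_t              swapchainCount
  &Swapchain,                         // const VkSwapchainKHR *pSwapchains
  &image_index,                       // const uint32_t       *pImageIndices
  nullptr                             // VkResult             *pResults
};
vkQueuePresentKHR( GetGraphicsQueue().Handle, &present_info );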

Tutorial 7 Execution

Have a look at how the final image generated by the sample program should appear:

track with Intel logo

We still render a quad that has a texture applied to its surface. But the size of the quad should remain unchanged when we change the size of our application's window.

Cleaning Up

As usual, at the end of our application, we should perform a cleanup.

// ...

if( Vulkan.GraphicsPipeline != VK_NULL_HANDLE ) {
  vkDestroyPipeline( GetDevice(), Vulkan.GraphicsPipeline, nullptr );
  Vulkan.GraphicsPipeline = VK_NULL_HANDLE;
}

if( Vulkan.PipelineLayout != VK_NULL_HANDLE ) {
  vkDestroyPipelineLayout( GetDevice(), Vulkan.PipelineLayout, nullptr );
  Vulkan.PipelineLayout = VK_NULL_HANDLE;
}

// ...

if( Vulkan.DescriptorSet.Pool != VK_NULL_HANDLE ) {
  vkDestroyDescriptorPool( GetDevice(), Vulkan.DescriptorSet.Pool, nullptr );
  Vulkan.DescriptorSet.Pool = VK_NULL_HANDLE;
}

if( Vulkan.DescriptorSet.Layout != VK_NULL_HANDLE ) {
  vkDestroyDescriptorSetLayout( GetDevice(), Vulkan.DescriptorSet.Layout, nullptr );
  Vulkan.DescriptorSet.Layout = VK_NULL_HANDLE;
}

DestroyBuffer( Vulkan.UniformBuffer );

14. Tutorial07.cpp, function destructor

Most of the resources are destroyed, as usual. We do this in the order opposite to the order of their creation. Here, only the part of code relevant to our example is presented. The graphics pipeline is destroyed by calling the vkDestroyPipeline() function. To destroy its layout we need to call the vkDestroyPipelineLayout() function. We don't need to destroy all descriptor sets separately, because when we destroy a descriptor pool, all sets allocated from it also get destroyed. To destroy a descriptor pool we need to call the vkDestroyDescriptorPool() function. The descriptor set layout needs to be destroyed separately with the vkDestroyDescriptorSetLayout() function. After that, we can destroy the uniform buffer; it is done with the vkDestroyBuffer() function. But we can't forget to destroy the memory bound to it via the vkFreeMemory() function, as seen below:

if( buffer.Handle != VK_NULL_HANDLE ) {
  vkDestroyBuffer( GetDevice(), buffer.Handle, nullptr );
  buffer.Handle = VK_NULL_HANDLE;
}

if( buffer.Memory != VK_NULL_HANDLE ) {
  vkFreeMemory( GetDevice(), buffer.Memory, nullptr );
  buffer.Memory = VK_NULL_HANDLE;
}

15. Tutorial07.cpp, function DestroyBuffer()

Conclusion

In this part of the tutorial we extended the example from the previous part by adding the uniform buffer to a descriptor set. Uniform buffers are created as are all other buffers, but with the VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT usage specified during their creation. We also allocated dedicated memory and bound it to the buffer, and we uploaded projection matrix data to the buffer using a staging buffer.

Next, we prepared the descriptor set, starting with creating a descriptor set layout with one combined image sampler and one uniform buffer. Next, we created a descriptor pool big enough to contain these two types of descriptor resources, and we allocated a single descriptor set from it. After that, we updated the descriptor set with handles of a sampler, an image view of a sampled image, and the buffer created in this part of the tutorial.

The rest of the operations were similar to the ones we already know. The descriptor set layout was used during pipeline layout creation, which was then used when we bound the descriptor sets to a command buffer.

We have seen once again how to prepare shader code for both vertex and fragment shaders, and we learned how to access different types of descriptors provided through different bindings of the same descriptor set.

The next parts of the tutorial will be a bit different, as we will see and compare different approaches to managing multiple resources and handling various, more complicated, tasks.

Modular Concepts for Game and Virtual Reality Assets

In-game environment

A hot topic and trend of the games industry is modularity, or the process of organizing groups of assets into reusable, inter-linkable modules to form larger structures and environments. The core idea is reusing as much work as possible in order to save memory, improve load times, and streamline production. There are, however, drawbacks to overcome with these methods. Creating variation in surfaces is important so that repetition of modules is not noticeable, and a viewer feels immersed in a realistic environment.

One of the biggest issues with real-time environments is that we cannot do all the creation in-engine. As artists, we rely on a plethora of programs that all get consolidated into a workflow, with the final product reaching the engine. This is a challenge because it is important that the whole scene shares consistent, equal detail, rather than isolated pockets of micro detail where more time was spent. This requires a lot of rapid iteration and early testing. With current next-gen tools and engines, it is possible to add detail within the final scene by leveraging advanced materials/shaders to increase the visual quality across the scene and break up repetition.

Pros

  • Build large environments quickly
  • Memory efficient

Cons

  • Extra planning time required at the start
  • Can look repetitive, boxy or mechanically aligned; boring

In this article I aim to share my experiences learned in creating environment art for virtual reality (VR) games that can be applied to any 3D application or experience.

How To Think as a Designer: Basic Fundamentals

When it comes to creating modular assets, understanding how elements of architecture come together to form details and interesting spaces is just as important as knowing the level design strategies that convey importance to a user. Understanding how to simplify visuals into believable and reusable prefabs while working within design and hardware constraints is a balancing act that gets easier with practice and study. In addition, this plays into how we as artists look at reference and decide which assets to make first. I highly recommend aiming for the minimum amount of work, or the maximum amount of reusability, when beginning a scene. Iterative strategies such as this improve the overall quality without stretching the budget, and inform next steps.

Games Versus Commercial

The biggest difference between games and commercial applications is that in a game the art accommodates a player, whereas in commercial work the art accommodates a consumer. While a player is still a consumer, a game has rules and mechanics that must be emphasized in the layout and design of our spaces to accent what makes the game fun; poor level design begets a poor user experience. With a consumer, we must instead focus on an invitation to someplace comfortable and easy on the eyes. With VR, these lines are merging regardless of the application you are making, whether it's a game or an experience. Games typically require more lateral thinking and communication between artists, designers, and programmers; it is important to get feedback early so as not to cause problems down the line.

As we build our interlocking assets, we must always check out the pieces in the engine, seeing how simple untextured or textured models line up or don’t line up when trying to put pieces together to form simple structures. During these tests of assets the key things to check are:

Ease of use: Does the asset line up well with others? Is the pivot point of the mesh in the correct spot and snapped to the grid? Overall, is it not a hassle for you to use?

Repetition: Do we need new pieces to break up a kit that maybe is too small? If the viewer can easily see each piece and notice the prefab architecture we will have a hard time immersing people in a space.

Forms and shapes: Does everything have a unique silhouette and play with light well in the engine? Or does a flat surface break up well?

Composition: Is everything working together? Can an interesting image be composed?

Readability, color/detail, scale and the first two seconds: This is a later check after getting in some texture and base lighting for a scene. Make sure that pathing is visible or importance is conveyed right away. To get this right we may need more assets, or sometimes, less. This is all about visual balance and iterating quickly from rough geo and textures to final assets.

Learning how to think as a designer and understand how your work affects the whole process of development as well as the viewer experience is what ultimately allows you to make subjective decisions that can improve the design or limit cluttering of conveyance. Understanding why real-world places look and feel a certain way or how a shape suggests use of an object is the core of being an artist and a designer. I see a lot of technical breakdowns but understanding and studying these fundamentals through observing references will make you better, faster. Technology is just a tool. Modularity is a set of wrenches in the tool box. Building an airplane is easy if you know what you are doing.

Importance of Planning

Understanding the elements of design helps to make decisions when planning out our modular set and assessing its needs. As stated before, modularity requires a great deal of planning ahead of production. Reference should be the first place you start. Begin to break down images and pull out assets and ideas from real-world spaces. Good reference shows variety in a natural sense. Immersion is in the details.

The tool I use at this stage is Pinterest*, my go-to when it comes to finding reference. Use it in tandem with a free tool called PureRef to view the references and you have a very powerful combo. Trello* can help manage tasks based on images and show progress. These tools limit creative friction and indecisiveness, especially if screenshots are posted into Trello to keep track of what you did today, so that tomorrow you know where to pick up. Over time this is a real lifesaver for personal projects, keeping them going as you see how far you've come.

When working with a client, get as much reference as possible: pictures, 360-degree videos, similar spaces, and any important details. In some cases it may be good to do storyboards beforehand so that, as we work, we can think laterally toward the target goal and keep the client on the same page. This could be as simple as photos of the place with a green-screened character wearing a headset to imply what each scene is. Then move on to a blockout, or rough 3D layout, and then the final look pass. It is also worth considering sites like TurboSquid*, Cubebrush*, or the Unity* Asset Store, which can help cut production time on asset creation. Investing in Quixel Megascans* and Substance Source can really help get quality materials in early; this matters because purchased assets oftentimes ship with lower-quality textures.

General Process Overview

It is important to try different workflows and adopt others' workflows to see if one process speeds you up or slows you down. One thing I learned as an artist is that there is no one sacred way or secret sauce to being good. Everyone has a different process or tricks depending on how they work, what they are working on, and where they work. Here is a basic process you can use to see the general overview of what goes into a complete personal environment. You have to be flexible and iterative. Know when something is good enough and minimize creative friction by keeping tasks and files organized. Figure out a good naming convention for your assets early.

  • Gather reference: Learn your space. Map it out in your mind. What does it feel like?
  • Sort reference into tasks: Assets, materials, modular components.
  • Check 1: Attempt a basic 3D layout or blockout of primitive geometry. Sense of scale, space, and interconnectivity of assets is important.
  • Test early lighting: Adjust elements to suit ideal composition.
  • Once your meshes work well together and the scene is composed, improve the detail of the meshes to a final in game and begin unwrapping.
  • Check 2: Apply textures early, focusing on basic albedo, basic normal maps, and the overall readability with temp lighting. Note if your albedo is too dark.
  • Begin to follow your production schedule and continue to work up the detail. High poly creation, final texture passes; focus on larger or frequently used assets first as a baseline to always check if everything is cohesive.
  • Create supporting props to fill the scene with desired detail giving a sense of character, story, and scale.
  • Final lighting pass: Take screenshots, compare changes in external photo viewer, tweak until desired look is achieved.

The greatest advantage, in my opinion, of working modularly is that you are always aware of exactly how many assets you have to keep updated. In a professional setting this helps a lot if changes need to be made, either in layout or in optimization.

Texture Types

As we begin to look at the reference, we want to capture the atmosphere, space, and harmony. However, it's also crucial to analyze each bit of reference and see where every material is used, so as to notice where the same materials are applied. The goal should be as few materials as possible, to keep the work within a manageable scope and to reduce the overall loading of textures, which take up a great amount of memory at runtime. To grasp modular concepts, it's necessary to understand the three types of textures we will be working with.

Tileable: Textures that tile infinitely in all four directions; textures that tile only horizontally or vertically are known as trim sheets. Tileable textures are your frontline infantry and should make up most of your scene, as they are efficient and reusable. Oftentimes we can start here and model directly from these textures to get our first building blocks. We will cover that further along.

tilable texture for 3d object
Tileable texture

Tile notes and tips:

  • Substance Designer is incredible for making just about any tile texture or material imaginable. It excels at achieving hyperrealism without using scan data, and is completely nondestructive. Its node-based system allows for the ability to open parameters for use in engine, leverage random seeds for quick iterations, limitless customization to tweak on the fly, and happy accidents.
  • Priority number one for physically based rendering materials is the height and normal, then roughness, then albedo. This way, the initial read feels correct and the details line up, but we can also use/create ambient occlusion (AO), curvature, and normal map color channels to create masks from the height data in order to create the roughness and albedo.
  • Decide how the texture should tile and be aware of repetition! My rock texture is noticeably tiling, but this study example was to create a more unique piece that would blend between a less noticeably repetitive version of the same rock material. This is called having low noise; supplying areas of rest, but details that invite the viewer in.
  • Use the histogram! To make an albedo work in any lighting condition and look natural, your albedo should have a nice curve in the histogram with the middle landing around a mean luminance of 128. This number can be lower, as albedo information changes for naturally darker surfaces, and vice versa. Furthermore, a wide range of values for the color, or a softer curve, looks more natural. Check the value curves on textures at Textures.com to help give yourself proper value and color. It is worth mentioning that mid gray is a value of 186 RGB. I would further recommend using this value on all assets in a scene when tweaking early lighting tests to see how geometry/height maps add light versus shadow details.
  • Albedo and roughness are two sides of the same coin. Both interpret light to create value. Albedo is just the base coat, but if you look at a wall that has good specular breakup, you notice that areas with more reflectivity are brighter and inherit light color, while duller areas are darker and maintain more of the raw albedo color.
  • Learn how to create generators in Substance Designer to speed up your workflow, and reuse graphs to give yourself a good base. These can be shapes and patterns, edge damage, cracks, moisture damage, and so on. Please, accredit generators and graphs supplied by other artists that you may use in a Frankenstein graph.

Trim sheets:

trim sheet texture

asset with texture variants
Using Substance Smart materials to create interchangeable variants.

high poly trim sheet
Our trim sheet high polygon mesh:

Trim notes and tips:

  • Create a 1m by 1m plane as your template outline. Keep this to export as your low poly mesh as well as export with high poly to ensure no gaps between elements.
  • Snap elements to the grid! This keeps the conversion to scene scale accurate and the texel density consistent!
  • In order for trims to tile in the bake, the geometry has to go past the 1 x 1 plane.
  •  Don’t unwrap! This gets baked to a flat plane so we do not have to worry about geometry or unwrap these objects. The base plane is already unwrapped to 0-1, so we are good to go.
  • Floaters. We can make 3D objects to use as details that sit or float on top of other elements such as the small bolts in the bottom corner. Because the texture is baked flat, the bake won’t recognize the change in depth, as the normals should appear to blend seamlessly. This saves a lot of time as we don't need to model details into complicated meshes, but rather reuse these elements like screws, bolts, or any concave/convex shape to place where we want. This also makes it nondestructive if we are not happy with the bake result.
  • Save space for small, reusable details. Sometimes floaters don’t turn out too well if they are small and there are not enough pixels to support the normal. By putting details at the bottom, we can reuse these on the game mesh by putting small planes as floaters that have the details mapped to them.
  • Forty-five-degree bevels in the normal map help to smooth edges, and at this angle they are very reusable. Sunset Overdrive has a great breakdown of this technique. This isn't to say that each edge needs this, however; it is mostly for lower poly objects with hard edges.
  • Simplicity is often better. Have some unique elements, but low detailed pieces are far more reusable and less noticeable than, say, a trim with lots of floaters, so find a good balance.
  • Your trim can be composed of different materials such as metal trims, wood mouldings, or rubber strips all on one texture sheet. Baking out a material ID mask helps to make a sheet more versatile for saving memory. This is where really good planning helps.
  • It is also possible to create trims without geo in Substance Designer and Quixel* NDO Painter. Details can also be added to trim geo using alpha masks in Substance Painter.

Further creative examples:

example of a texture
Image of texture
This was for a mobile AR project. This method kept the texture size low and the detail high.

 

If trim sheets are made to the grid, they can be easily used interchangeably by adding edge loops to a plane and breaking off the components. This method is quick for prototyping since the UVs will be one to one to ensure tiling. If planned accordingly, one can also interchange textures if the textures share the same grid space. Check out this breakdown by Jacob Norris for details.

Here, the elements come together to form a basic corridor using one trim and one tileable.

Unique: A texture that focuses on one asset, like a prop. It utilizes baked maps from a high polygon model to derive unique, non-tiling detail. These should be the finishing touches on a scene, and ideally, similar props that are always featured together share a single texture atlas.

An exception to doing a unique asset first would be with a hero prop, which is a category for a unique asset of great significance to the overall scene. In this case getting a basic or final unique asset can come first.


Chess set uses one texture for all pieces. Each piece is baked from a high poly to a low poly and each surface face has unique detail.

Hybrid: Features the use of both tiling elements and unique elements on the same 0-1 texture. This could also be material blending on larger assets that use unique normals and masks to pump in tiling materials such as ornate stone walls or large rocks.

This hybrid example uses a unique normal map to drive the wear generators in Substance Painter to get masks that blend the rust and metals together. The red lines also indicate where I made cuts to form new geometry from the existing mesh. Yay, recycling!

The elements coming together to make an old sci fi warehouse kit. One trim (two color variants), one hybrid, and two tile textures.

Texel Density and Variation

One last thing to consider at an early stage is texel density, or how many pixels per unit our assets will use. This can be pretty lengthy for first timers, and I highly recommend this article by Leonardo Lezzi. For first-person experiences we would want 1K, or a 1024 x 1024 texture, per one meter — a texel density of 10.24 pixels per cm. We primarily want to use tileable textures in order to maximize visual detail on larger surfaces. An exception to these rules would be anything interactable that will be close to the player, such as a gun. I like to use Nightshade UV for Maya* when unwrapping; the tool has a built-in texel density feature to set or track the pixels per unit. Below is an example of a 3m x 3m wall with a texel density of 1K per meter.

VR needs at least 1K per meter, but as the headsets evolve, that value will likely increase to 2K per meter. This poses a lot of challenges for VR artists as the hardware of computers will still limit texture sizes for many users. VR headsets already render two images at a wider field of view, making higher frame rates and nice details tricky. In order to combat this, it is possible to create shaders in Unreal Engine* 4 (UE4) and Unity* that use lower resolution detail maps to create the illusion of nearly 8K. A short demonstration of this technique can be seen on Youtube: Detail Textures: Quick Tutorial for UE4 & Megascans.
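
The core of that trick is simply sampling a small, heavily tiled detail map on top of the base texture. Expressed in plain GLSL rather than a node-based material editor (the uniform names and tiling factor are made up), a fragment-shader sketch could look like this:

#version 450

layout(set=0, binding=0) uniform sampler2D u_BaseTexture;   // e.g. a 1K base color map
layout(set=0, binding=1) uniform sampler2D u_DetailTexture; // small detail map stored around mid gray

layout(location = 0) in vec2 v_Texcoord;
layout(location = 0) out vec4 o_Color;

void main() {
  vec3 base   = texture( u_BaseTexture, v_Texcoord ).rgb;
  // Tile the detail map many times across the same UVs; multiplying a mid-gray map
  // by 2 leaves the base color unchanged on average while adding high-frequency detail.
  vec3 detail = texture( u_DetailTexture, v_Texcoord * 16.0 ).rgb * 2.0;
  o_Color = vec4( base * detail, 1.0 );
}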

These shader setups are also critical to adding variation and breaking up surface tiling, either by using mask-based material blending or by using vertex colors to blend materials or colors. This topic is a bit of a rabbit hole, as there are so many different ways to achieve variation through shader techniques. Amplify Shader Editor for Unity is a great tool to allow for this style of AAA art development. Additionally, this great breakdown by senior environment artist Yannick Gombart demystifies some of these techniques.
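
As one small, hypothetical example of the vertex-color approach, a fragment shader can blend two tileable materials wherever a vertex color channel has been painted in:

#version 450

layout(set=0, binding=0) uniform sampler2D u_MaterialA; // e.g. clean plaster
layout(set=0, binding=1) uniform sampler2D u_MaterialB; // e.g. damaged plaster

layout(location = 0) in vec2 v_Texcoord;
layout(location = 1) in vec4 v_Color;   // vertex colors painted in the DCC tool or in-engine

layout(location = 0) out vec4 o_Color;

void main() {
  vec3 a = texture( u_MaterialA, v_Texcoord ).rgb;
  vec3 b = texture( u_MaterialB, v_Texcoord ).rgb;
  // The red vertex-color channel drives where the second material shows through
  o_Color = vec4( mix( a, b, v_Color.r ), 1.0 );
}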

The Grid

I remember that, at first, modularity was a difficult concept to grasp. Take a look at Tetris*. Tetris is simply a stacking game of pieces, or modules, which are paired together to create new, interlocking shapes that all fall onto a grid.

The grid is the guide to our modular blueprints, according to a set footprint or scale. The footprint informs the basis of our modular construction and usually depends on the type of game we are making. If we were in Dungeons & Dragons*, is the base tile size for our modular corridors 5 feet or 10 feet? How does that impact player movement and sense of space? In Tetris, it would be the base cube size.

Since we are discussing VR, we are talking about first-person games and applications; a good footprint would be three meters by three meters, or four by four. Always work in centimeters when modeling, so that would be 300 by 300. The metric system is easily divisible, making modules easy to break into round units, and it is what game engines use by default. When deciding on a footprint, keep in mind that first-person VR tends to make objects appear smaller than they actually are, so exaggerating shapes makes things feel more realistic or clear. To begin, we need to change our modeling application's grid to mimic our engine's grid so that they integrate seamlessly.

How To Set Up our Scale in Maya*

First, let’s ensure we are using centimeters in Maya.
Go to Window > Setting/Preferences > Preferences

Click on Settings and under Working Units check that Linear is set to Centimeter.

Now, let’s set up our grid.

Go to Display > Grid Options

Under Size set the following values.

  • Length/Width: 1,000 units. This is the overall grid size of our perspective view and not the size of each grid unit.
  • Grid lines every: 10 units (this controls grid unit lines as it does in Unity or UE4; if you set this value to 5, 10, 50, or 100 it will match the UE4 grid snaps. For Unity, I use the ProGrids plugin, which mirrors a more robust grid like Unreal's). Changing this value is what mirrors how assets snap and line up with each other in the engine.
  • Subdivisions: 1. This changes how many grid lines we have per unit. At 1 it will be every 10 units, but at 2, we change this to grid lines every 5 units. It’s a quicker way to divide the grid into different snapping values, by sliding the input, rather than inputting a unique number each time for the Grid lines every field.
  • Finally, create a cube 100 x 100 x 100 and export it as an .FBX file, in order to see if your asset matches the default cube size in the engine (usually 1 m by 1 m).
  • Useful hotkeys:
    • By holding X, you can snap an object to the grid as you translate.
    • Holding V snaps the selection to the vertices.
    • Pressing the Ins or the D key allows for moving an object’s pivot point. In tandem with holding X or V, you can get the pivot on the grid.
    • Holding J snaps the rotation.

Again, with whatever modeling software you use, the goal is for the grid to match that of the engine’s grid.

Here is a great walkthrough with images for grid setup, as well as a good site in general for good level design and environment art practices.

Bringing It All Together—The Blockout

Now that we are aware of our metrics and know what textures to look out for, we can go one of two ways. We can either make a quick texture set, usually a trim sheet and a tiling texture, to create modules from, or we can go straight into a modeling program. Either way, breaking down the reference into working units and materials reduces creative friction early by giving us a good understanding of the space and a sense of production scope.

As an example, here are some ways to look at a reference, in order to plan:

Some modeling programs have a distance tool to measure your units. I have measured a base man here to use as a scale reference for when I work in my scenes.

I can now overlay the humanoid reference and scale him accordingly to get the units for the reference. I have also color coded some of the texture callouts, highlighting essentials that I can use to plan a production schedule.

Here is another example of working from a photo reference. I used the park bench to estimate the building scale, sliced it into modules, and highlighted the trims.

Back to the room image. With our units in mind, I can now do a quick blockout in Maya using simple planes to get my scale. This establishes my footprint and serves as the base or root guideline for all of my high poly meshes and game assets.

From here I can create my materials and set of assets such as this:

As mentioned earlier, we need to check this in-engine constantly to see if our modules line up as intended. Work with units of 5, 10, 50, 100 cm. Keep in mind that working with meters ensures perfect tiling of textures from asset to asset (if our texel density is 1k per meter). It is important to export modules with good pivot points snapped on the grid with the front face facing the Z-direction. It is also good to establish naming conventions so it will be easy to look up each module in the engine.

In Unreal, for this project I assembled my level using the base kit. From here it is about polishing upwards by creating materials, high poly meshes, and unwrapping. This scene was made from various references and is a custom space. For this I drew a basic top-down map as a floor plan, then started with my wooden beams and railings, then filled in the other details to support them.

Keeping track of progress looks something like this in Trello:

From the left we have my reference images, which help inform space, props, and materials. Next I have my progress, showing each stage. Lastly, I create different columns for an asset catalog to keep track of each asset. Each card has a checklist and any images or notes I need to keep in mind. The color coding indicates the material type (tile, trim, hybrid, unique) for each card. From here it's just about polish.

Process Continued—Look Dev

It is important to note that when working within games it's ideal to be as fast as possible. Just because you know how to save time by using modular concepts does not mean your final result will look the way you want it to. There is a great deal of back and forth to do, and in order to mitigate risk, it is key to iterate. Get to a final early and do a look dev test such as this one:

This image is by no means a final environment. There are many changes I want to make and materials to change out. To get to this stage, I used basic primitives, some premade assets, and a mix of my own materials made in Substance Designer and materials downloaded from Substance Source. The goal is to sell a sense of atmosphere that informs an entire level. This scene was assembled in a little over a week, which is very fast for one artist!

Now, something I realized early when using modularity and with level design in general is that if you always stay on the grid, a scene can become very boring. You want to strike a balance between a nice, organic composition, and the ease of use the grid offers.

To attain this balance, I start my blockout in Oculus Medium*. Working in VR with a voxel-based sculpting program is incredibly freeing and makes it simple for an artist to previsualize entire scenes. Using PureRef and Pinterest, it is possible to make large images that can be loaded into Medium as reference, which makes early creation and inspiration seamless. Additionally, because you are working in VR, I find it easy to judge scale and can get a sense of space far faster than by working back and forth between the engine and Maya. Furthermore, it is easy to perceive a scene in Medium from any angle, including basic lighting and ambient occlusion. This makes for a powerful iteration tool that gets you to the essence of your scene faster.

Here is my decimated mesh in Maya. It doesn't look like much, but it gives me enough ideas for how I can approach the space early and more organically.

From here, I add the Oculus mesh to a layer to set as a wireframe, in order to reverse engineer it with modular pieces.

Next, I go wild with primitives, working around my footprint and getting things to snap together.

Now, I set up a camera in Maya and reorient assets to compose something more interesting. From here I save the file as an .MA, which Unity can read, in order to work quickly between Maya and Unity. The far wall with the weird growth was made using Oculus Medium and remeshed in ZBrush*. I wanted to capture a cool set piece with an organic cyberpunk fusion. The stairs were also made in Medium, and reflect my initial sculpt.

Here I have the Maya file imported in order to test lighting early.

Getting the scene filled out a bit more.

It's a little hard to tell, but I overlaid my original Oculus sculpt into this scene. I absolutely love this. This is where the initial payoff comes full circle. Now I can see with fresh eyes and notice a sense of rigidity that I must break up.

Now I have offset the wall to an angle and added some elements to create some more organic motion as eyes look through the scene. The key addition is the stairs, which have a unique workflow that I like using.

To easily get everything tiling perfectly I use two planes, shown on the left. I weld them all together and we have it tiling easily. Next, I add some bevels to remove hard edges. Lastly, I use Maya’s sculpt tool to add in noise to the geo, to break up the rigidity. This is the final stage of getting usable modules unwrapped. I prefer to use Nightshade UV, which makes things much easier than using the default tools.

I then export my scene in chunks as FBX files and begin assigning materials in the engine, finding values that work well with my lighting while also adjusting the lighting. My current step is the hardest part: polish. From here on it's on to the grind: high poly meshes, trims and tiles, and shaders to add variation control. Then, as a final stage, we go on to optimization to get this scene running in VR. Many of my look dev post effects won't be viable. Also, I have yet to have the right shaders come online for the final look and style, but that's okay, because I have the mood I'm looking for. When optimizing environments, the three major areas to address are:

  • Geo: Simplifying the geometry and edge flow.
  • Textures: Lowering the resolution; finding balance between lowering different maps such as roughness, masks, and ambient occlusion but keeping normal maps at set size.
  • Shaders: Keeping the nodes short and sweet. Blend only up to three full materials within one shader at a time and utilize masks and constants to do the rest.

Note: sometimes the cause of performance issues lies outside the environment art (such as characters, scripts, or post effects), so be sure to check everything else first. Optimization is another puzzle and is tricky to get right, but modular workflows help make it a bit easier, with fewer assets to optimize and the possibility of reworking areas if need be.

Last Words and Advice

Projects grind to a halt when they are not well organized, when one asset gets too much polish too early, or when there is not enough overall progress early on. It is a hard lesson learned. Do not feel the need to be good enough too soon; it's an iterative process. The hardest part about being an artist is knowing how to take feedback positively and build up self-worth when starting out. Sometimes it's grind, grind, grind, and it goes by all too quickly. Other times, problems abound and you are in a lull. In the end, have a goal. Know where you want to be. It's easy to get lost in your work and feel aimless, or to feel that you need to do more to get better.

The best advice I ever received was the only thing that truly matters is trying to enjoy the ride. Once you stop having fun a project can die quickly. Or worse, you could wake up when you are 30 and wonder where years of your life vanished. Sometimes you just need a little more time to get better at the individual steps. Learning how to use software is the same as with any tool. The more you use it, the faster you get. You pick up tricks along the way and ideally never stop learning. Create an environment or project that focuses on a goal or something you want to learn or improve on. Don't aspire to just make something cool. That is not a goal and you will likely never be finished.

Modularity is one of the core pillars of game development for environment artists. Once you understand the basic fundamentals, you can begin to break scenes down into more efficient and approachable methods for environment creation. The most valuable asset an artist has is the sense of community within art production. It really is a glass-door community made open and accessible, thanks to ArtStation*, Polycount, YouTube*, Twitch*, and other publishers of articles that focus on art production. Artists are finally in the limelight and you know who the rock stars are. Learn from these people. Follow them on ArtStation and aspire to meet your goals as you watch your peers meet theirs. Keep your head above water, stay inspired every step of the way, and strive to see things from different angles. The rest will come in time.

Useful Links

Troubleshooting Visual Studio Command Prompt Integration Issue

Issue Description

Nmake and ifort are not recognized from a command window; however, using Intel Fortran under Visual Studio works perfectly.

Troubleshooting

Follow the checklist below to troubleshoot Visual Studio command environment issues:

1. Verify whether ifort and nmake are installed correctly:

    For Visual Studio 2017, nmake is installed at:

    C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Tools\MSVC\14.10.25017\bin\HostX64\x64\nmake.exe

    Find this by running the commands below on a system where ifort or nmake is set up correctly:

> where nmake
> where ifort

    Also check whether the location is included in the PATH environment variable:

> echo %PATH%

2. If nmake can be found, verify if VS setup script runs properly.
    Start a cmd window, and run Visual Studio setup script manually:

> "C:\Program Files (x86)\Microsoft Visual Studio\2017 \Professional\VC\Auxiliary\Build\vcvars64.bat"

    The expected output looks like the following screenshot:

    [Screenshot: vscmd_setup.png]

3. If nmake cannot be found, your Visual Studio installation is incomplete. Try reinstalling Visual Studio; see the articles below for instructions:
 
4. If you got an error in step 2, such as:
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat"
  \Common was unexpected at this time.
try debugging the setup script by setting the VSCMD_DEBUG environment variable:
> set VSCMD_DEBUG=3
Then run the setup script again and redirect the output to a log file:
> "C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\VC\Auxiliary\Build\vcvars64.bat" > setup.log 2>&1

5. If you got the same error as above, there are reports of this issue in the Visual Studio Community forums.

    The solution is to remove all quotes from the PATH environment variable value.
 
6. If you got a different error, again capture the expected output from any system that runs the script correctly and compare it with your current output. This will help you locate the command in the setup script that triggers the error.
    You may also consider reporting the issue directly to the Visual Studio Community.
 

AI-Driven Test System Detects Bacteria In Water

Hands in water

“Clean water and health care and school and food and tin roofs and cement floors, all of these things should constitute a set of basics that people must have as birthrights.”1

– Paul Farmer, American Doctor, Anthropologist, Co-Founder,
Partners In Health

Challenge

Obtaining clean water is a critical problem for much of the world’s population. Testing and confirming a clean water source typically requires expensive test equipment and manual analysis of the results. For regions in the world in which access to clean water is a continuing problem, simpler test methods could dramatically help prevent disease and save lives.

Solution

To apply artificial intelligence (AI) techniques to evaluating the purity of water sources, Peter Ma, an Intel® Software Innovator, developed an effective system for identifying bacteria using pattern recognition and machine learning. This offline analysis is accomplished with a digital microscope connected to a laptop computer running the Ubuntu* operating system and the Intel® Movidius™ Neural Compute Stick. After analysis, contamination sites are marked on a map in real time.

Background and History

Peter Ma, a prolific contributor in the Intel® AI Academy program, regularly participates in hackathons and has won awards in a number of them. “I think everything started as a kid; I've always been intrigued by new technologies,” Peter said.

Winning the Move Your App! Developer Challenge in 2010, a contest hosted by TEDprize, led to a speaking appearance at TEDGlobal and reinforced Peter’s desire to use technology to improve human lives. The contest was based on a challenge by a celebrity chef and restaurateur, to tackle child obesity.

Over several years, Peter has established an active consulting business around his design and development skills. “I build prototypes for different clients,” Peter said, “ranging from Fortune 500 to small startups. Outside of my consulting gigs, I attend a lot of hackathons and build out my own ideas. I built the Clean Water AI specifically for World Virtual GovHack, targeting the Water Safety and Food Security challenge.”

Based in Dubai, United Arab Emirates, the GovTechPrize offers awards annually in several different categories to acknowledge technology solutions that target pressing global challenges. The World Virtual GovHack was added to the awards roster, framed as a global virtual hackathon, to encourage students and startups to tackle difficult challenges through experimentation with advanced technologies.

Peter Ma

Figure 1. Peter Ma demonstrates the clean water test system.

“We originally started to work on this December of 2017,” Peter said, “specifically for World Virtual GovHack. I won first place and was presented USD 200,000 by His Highness Mansoor of Dubai at the awards ceremony in February 2018. This makes it possible to take the project much further. We are currently in the prototyping stage and working on the next iteration of the prototype so it can be in one single IoT device. I think in the world of innovation, there is never completion—only improvements from your last iteration.”

Peter’s success rate at hackathons is impressive, inspiring other projects, including Doctor Hazel, Vehicle Rear Vision, and Anti-Snoozer. “I think I do well in most hackathons,” he said, “because I focus mostly on how technologies can better people's lives—rather than just what technologies can do.”

Notable Project Milestones

  • Started development work in December 2017.
  • Submitted the project in February 2018 and won first prize in the World Virtual GovHack, receiving USD 200,000 that will help fund the next phase of the Clean Water AI project.
  • Began work on a prototype for a new version of the test system that can be embedded in a self-contained Internet of Things device.
  • Garnered a first place finish in the SAP Spatial Hackathon by mapping out a water system in San Diego to demonstrate how water contamination can be stopped once it has started.
  • Slated to present a demo at the O’Reilly AI Conference in New York in April 2018.

Peter Ma receives the top GovTechPrize

Figure 2. Peter Ma receives the top GovTechPrize in the Water Safety and Food Security category.

Every minute a newborn dies from infection caused by lack of safe water and an unclean environment.2

– World Health Organization, 2017

Enabling Technologies

The Clean Water AI project benefited from access to the Intel® AI DevCloud, a free hosting platform made available to Intel AI Academy members. Powered by Intel® Xeon® Scalable processors, the platform is optimized for deep learning training and inference compute needs. Peter took advantage of Intel AI DevCloud to train the AI model and Intel Movidius Neural Compute Stick to perform water testing in real time. The Neural Compute Stick supports both the Caffe* and TensorFlow* frameworks, popular with deep learning developers.

The Intel® Movidius™ Software Development Kit also figured heavily in the development, providing a streamlined mechanism to profile, tune, and deploy the convolutional neural network capabilities on the Neural Compute Stick. Because the Clean Water AI test system must be able to perform real- time analysis and identify contaminants without access to the cloud, the self-contained, AI-optimized features of the Neural Compute Stick are essential to the operation of the test system. The Neural Compute Stick is a compact, fanless device, the size of a typical thumb drive, with fast USB 3.0 throughput, making it an effective way to deploy efficient deep learning capabilities at the edge of an Internet of Things network.

“Intel provides both hardware and software needs in artificial intelligence—from training through deployment. For startups, it is relatively inexpensive to build the prototype. The AI is being trained through Intel AI DevCloud for free; anyone can sign up. The Intel Movidius Neural Compute Stick costs about USD 79, and it allows the AI to run in real time.”

– Peter Ma, Software Innovator, Intel AI Academy

The neural compute stick acts as an inference accelerator with the added advantage that it does not require an Internet link to operate. All of the data needed by the neural network is stored locally, which makes the rapid, real-time operation possible. Any test system dependent on accessing data from a remote server is going to be burdened by availability of connections (particularly in rural areas where the testing is very important), as well as potential service disruption and lag time in performing analyses. For developers that need more inference performance for an intensive application, up to four compute sticks can be combined at once for a given solution.

Clean Water AI Test System

The Clean Water AI test system is composed of simple, inexpensive, off-the-shelf components:

  • A digital microscope, available at many sources for USD 100 or less
  • A modestly equipped computer running the Ubuntu operating system
  • An Intel Movidius Neural Compute Stick running machine learning and AI in real time

The entire test system can be constructed for well under USD 500, making it within reach of organizations that cannot usually afford expensive traditional test systems.

Figure 3 shows the basic test setup.

Microscope, laptop, and compute stick

Figure 3. The basic test system—microscope, laptop, and compute stick—can be assembled for less than USD 500.

The convolutional neural network at the heart of the test system determines the shape, color, density, and edges of the bacteria. Identification at this point is limited to Escherichia coli (E. coli) and the bacterium that causes cholera, but because different types of bacteria have distinctive shapes and physical characteristics, the range of identification can be extended to many different types. Project goals on the near horizon include distinguishing between good microbes and harmful bacteria, detection of substances such as minerals, and satisfying the certification requirements necessary in different geographies.

To refine the approach and sharpen the precision of identification, Peter continues to train the model. Currently, the confidence level for testing is above 95 percent, and as high as 99 percent, when distinguishing clean water from contaminated water, but this is likely to improve further as additional images are added to the system and more training is performed.

In a video demonstration of the Clean Water AI test system, Peter uses the microscope to first capture an image of clean water and then compares that with a sample showing contaminated water, as shown in Figure 4. The AI immediately detects the harmful bacteria and can flag the contamination on a map. All of these activities are carried out in real time.

Contaminated water

Figure 4. Screenshot of a sampling of contaminated water and the map indicating the location.

E. coli bacteria, shown in the rendering in Figure 5, are typically present in contaminated water and can be accurately identified by the AI according to shape and size.

For more information about Peter Ma's Clean Water AI project, visit https://devmesh.intel.com/projects/ai-bacteria-classification.

E. coli bacteria

Figure 5. A rendering of E. coli bacteria, one of the most common and dangerous water contaminants.

AI is Changing the Landscape of Business and Science

Through the design and development of specialized chips, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of artificial intelligence (AI) to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations and corporations to uncover and advance solutions that solve major challenges.

For example, an engagement with NASA focused on sifting through many images of the moon and identifying different features, such as craters. By using AI and compute techniques, NASA was able to achieve its results from this project in two weeks rather than a couple of years.

The Intel AI portfolio includes:

Xeon inside

Intel Xeon Scalable processors: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.

Logos

Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.

Movidius chip

Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.

For more information, visit this portfolio page: https://ai.intel.com/technology

Rome fountain

“At Intel we have a pretty pure motivation: we want to change the face of computing and increase the capabilities of humanity and change every industry out there. AI today is really a set of tools. It allows us to sift through data in much more scalable ways, scaling our intelligence up. We want our machines to personalize and change and adapt to the way we shop and the way we interact with others.

There are already vast changes taking place, which are happening under the hood. Intel has a broad portfolio of products for AI. We start with the Intel Xeon Scalable processor, which is a general-purpose computing platform that also has very efficient inference for deep learning.”

– Naveen Rao, Intel VP and GM, Artificial Intelligence Products Group

 

The Promise of AI

The possibilities of AI are only beginning to be recognized and exploited. Intel AI Academy works collaboratively with leaders in this field and talented software developers and system architects exploring new solutions that promise to reshape life in today’s world. We invite interested, passionate innovators to join us in this effort and become part of an exciting community to make contributions and take advanced technology in new directions for the benefit of the global community.

Join today: https://software.intel.com/ai/sign-up

“I see AI playing a major part in helping governments and non government organizations in the future,” Peter said, “especially in terms of monitoring resources, such as ensuring water safety. AI can reduce costs and provide more accurate continuous monitoring than current systems. An AI device for water safety typically requires very little maintenance, because it will be based on optical readings, rather than chemical based.”

“We have the ability to provide clean water for every man, woman and child on the Earth. What has been lacking is the collective will to accomplish this. What are we waiting for? This is the commitment we need to make to the world, now.”3

– Jean-Michel Cousteau

Resources

Clean Water AI

Intel AI Academy
Clean Water Project in Intel DevMesh
Build an Image Classifier on the Intel Movidius Neural Compute Stick
Getting the Most Out of IA with Caffe* Deep Learning Framework
Rapid Waterborne Pathogen Detection with Mobile Electronics
Intel Movidius Neural Compute Stick
GovTech Prize
Artificial Intelligence: Teaching Machines to Learn Like Humans
Intel Processors for Deep Learning Training

 

1 http://richiespicks.pbworks.com/w/page/65740422/MOUNTAINS%20BEYOND%20MOUNTAINS

2 http://www.who.int/mediacentre/factsheets/fs391/en/

3 http://www.architectsofpeace.org/architects-of-peace/jean-michel-cousteau?page=2

Merging MPI Intercommunicators

Sometimes you may have completely separate instances for an MPI job.  You can connect these separate instances using techniques such as MPI_Comm_accept/MPI_Comm_connect, which creates an intercommunicator.  But in more complex situations, you may have a significant number of separate intercommunicators, and want to send data between arbitrary ranks, or perform collective operations across all ranks, or other functions which are most efficiently handled over a single intracommunicator.

In order to handle this situation, the MPI standard includes a function call MPI_Intercomm_merge.  This function takes three arguments.  The first argument is the intercommunicator to be merged.  The second argument is a boolean argument indicating whether the ranks will be numbered in the high range (true) or the low range (false) in the resulting intracommunicator.  The third argument is a pointer to the new intracommunicator.  When you call MPI_Intercomm_merge, you must call it from every rank in both sides of the intercommunicator, and all ranks on a particular side must have the same high/low argument.  The two sides can have the same or different values.  If the same, the resulting rank order will be arbitrary.  If the two are different, you will end up with the ranks with the low (false) argument having lower rank numbers, and the ranks with the high (true) argument having higher rank numbers.  For example, if you have an intercommunicator with 2 ranks on side A and 3 ranks on side B, and you call MPI_Intercomm_merge with false on side A and true on side B, the side A ranks will have new ranks 0 and 1, and the side B ranks will have rank numbers 2, 3, and 4.
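
To make the ordering concrete, here is a minimal, self-contained sketch (not taken from the original text; it assumes a 5-rank job and builds the intercommunicator with MPI_Comm_split and MPI_Intercomm_create rather than accept/connect):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);   /* run with: mpirun -n 5 ./merge_demo */

    /* Side A = world ranks 0-1, side B = world ranks 2-4. */
    int side = (world_rank < 2) ? 0 : 1;

    MPI_Comm localcomm, intercomm, merged;
    MPI_Comm_split(MPI_COMM_WORLD, side, world_rank, &localcomm);

    /* Local leader is rank 0 of each side; the remote leader is given by its
       rank in the peer communicator (MPI_COMM_WORLD here). */
    int remote_leader = (side == 0) ? 2 : 0;
    MPI_Intercomm_create(localcomm, 0, MPI_COMM_WORLD, remote_leader, 99, &intercomm);

    /* Side A passes high = 0 (false), side B passes high = 1 (true), so the merged
       intracommunicator numbers side A as ranks 0-1 and side B as ranks 2-4. */
    MPI_Intercomm_merge(intercomm, side, &merged);

    int merged_rank;
    MPI_Comm_rank(merged, &merged_rank);
    printf("world rank %d (side %c) -> merged rank %d\n",
           world_rank, side ? 'B' : 'A', merged_rank);

    /* Collectives now span all five ranks, for example: */
    MPI_Barrier(merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&intercomm);
    MPI_Comm_free(&localcomm);
    MPI_Finalize();
    return 0;
}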

In a more complex situation, you may need to merge multiple intercommunicators.  This can be done in one of several ways, depending on how your ranks join the intercommunicator.  If you have separate ranks joining independently, you can merge them as each joins, and use the resulting intracommunicator as the base intracommunicator for the newly joining ranks.

MPI_Comm_accept(port, MPI_INFO_NULL, 0, localcomm, &intercomm[0]);
MPI_Intercomm_merge(intercomm[0], false, &localcomm);

This will update localcomm to include all ranks as they join.  You can also merge them after all have joined.  This will require multiple steps of creating new intercommunicators to merge, but can also lead to the same end result.

Once this is done, you can now use collectives across the new intracommunicator as if you had started all ranks under the same intracommunicator originally.

Getting Started with Parallel STL

Parallel STL is an implementation of the C++ standard library algorithms with support for execution policies, as specified in the working draft N4659 for the next version of the C++ standard, commonly called C++17. The implementation also supports the unsequenced execution policy specified in the ISO* C++ working group paper P0076R3.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms for Intel® processors. For sequential execution, it relies on an available implementation of the C++ standard library.

Parallel STL is available as a part of Intel® Parallel Studio XE and Intel® System Studio.

 

Prerequisites

To use Parallel STL, you must have the following software installed:

  • C++ compiler with:
    • Support for C++11
    • Support for OpenMP* 4.0 SIMD constructs
  • Intel® Threading Building Blocks (Intel® TBB) 2018

The latest version of the Intel® C++ Compiler is recommended for better performance of Parallel STL algorithms compared to previous compiler versions.

To build an application that uses Parallel STL on the command line, you need to set the environment variables for compilation and linkage. You can do this by calling suite-level environment scripts such as compilervars.{sh|csh|bat}, or you can set just the Parallel STL environment variables by running pstlvars.{sh|csh|bat} in <install_dir>/{linux|mac|windows}/pstl/bin.

<install_dir> is the installation directory; by default, it is:

For Linux* and macOS*:

  • For super-users:      /opt/intel/compilers_and_libraries_<version>
  • For ordinary users:  $HOME/intel/compilers_and_libraries_<version>

For Windows*:

  • <Program Files>\IntelSWTools\compilers_and_libraries_<version>

 

Using Parallel STL

Follow these steps to add Parallel STL to your application:

  1. Add the <install_dir>/pstl/include folder to the compiler include paths. You can do this by calling the pstlvars script.

  2. Add #include "pstl/execution" to your code. Then add a subset of the following set of lines, depending on the algorithms you intend to use:

    • #include "pstl/algorithm"
    • #include "pstl/numeric"
    • #include "pstl/memory"
  3. When using algorithms and execution policies, specify the namespaces std and std::execution, respectively. See the 'Examples' section below.
  4. For any of the implemented algorithms, pass one of the values seq, unseq, par or par_unseq as the first parameter in a call to the algorithm to specify the desired execution policy. The policies have the following meaning:

     

    Execution policy – Meaning

    seq – Sequential execution.
    unseq – Try to use SIMD. This policy requires that all functions provided are SIMD-safe.
    par – Use multithreading.
    par_unseq – Combined effect of unseq and par.

     

  5. Compile the code as C++11 (or later), using compiler options for vectorization:

    • For the Intel® C++ Compiler:
      • For Linux* and macOS*: -qopenmp-simd or -qopenmp
      • For Windows*: /Qopenmp-simd or /Qopenmp
    • For other compilers, find a switch that enables OpenMP* 4.0 SIMD constructs.

    To get good performance, specify the target platform. For the Intel C++ Compiler, some of the relevant options are:

    • For Linux* and macOS*: -xHOST, -xSSE4.1, -xCORE-AVX2, -xMIC-AVX512.
    • For Windows*: /QxHOST, /QxSSE4.1, /QxCORE-AVX2, /QxMIC-AVX512.
    If using a different compiler, see its documentation.

     

  6. Link with the Intel TBB dynamic library for parallelism. For the Intel C++ Compiler, use the options:

    • For Linux* and macOS*: -tbb
    • For Windows*: /Qtbb (optional, this should be handled by #pragma comment(lib, <libname>))

Version Macros

Macros related to versioning are described below. You should not redefine these macros.

PSTL_VERSION

Current Parallel STL version. The value is a decimal numeral of the form xyy where x is the major version number and yy is the minor version number.

PSTL_VERSION_MAJOR

PSTL_VERSION/100; that is, the major version number.

PSTL_VERSION_MINOR

PSTL_VERSION - PSTL_VERSION_MAJOR * 100; that is, the minor version number.
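
As a quick illustration (a sketch, not from the reference text above, and assuming the version macros become visible once any Parallel STL header is included), the macros can be checked at compile time or printed:

#include <cstdio>
#include "pstl/execution"   // assumed to pull in the PSTL_VERSION* macros

int main()
{
#ifdef PSTL_VERSION
    std::printf("Parallel STL version %d.%d (PSTL_VERSION = %d)\n",
                PSTL_VERSION_MAJOR, PSTL_VERSION_MINOR, PSTL_VERSION);
#else
    std::printf("Parallel STL version macros are not available\n");
#endif
    return 0;
}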

Macros

PSTL_USE_PARALLEL_POLICIES

This macro controls the use of parallel policies.

When set to 0, it disables the par and par_unseq policies, making their use a compilation error. This is recommended for code that only uses vectorization with the unseq policy, to avoid a dependency on the Intel® TBB runtime library.

When the macro is not defined (default) or evaluates to a non-zero value, all execution policies are enabled.
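
For example, a vectorization-only build might look like the following sketch (hypothetical code, not from this section; the macro could equally be defined through a compiler -D option):

// Disable par/par_unseq so the binary has no Intel TBB dependency;
// the macro must be defined before the Parallel STL headers are included.
#define PSTL_USE_PARALLEL_POLICIES 0

#include "pstl/execution"
#include "pstl/algorithm"

void scale(float* a, int n)
{
    // unseq (SIMD) still works; std::execution::par would now be a compilation error.
    std::transform(std::execution::unseq, a, a + n, a,
                   [](float x) { return x * 2.0f; });
}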

PSTL_USE_NONTEMPORAL_STORES

This macro enables the use of #pragma vector nontemporal in the algorithms std::copy, std::copy_n, std::fill, std::fill_n, std::generate, std::generate_n with the unseq policy. For further details about the pragma, see the User and Reference Guide for the Intel® C++ Compiler at https://software.intel.com/en-us/node/524559.

If the macro evaluates to a non-zero value, the use of #pragma vector nontemporal is enabled.

When the macro is not defined (default) or set to 0, the macro does nothing.

 

Examples

Example 1

The following code calls vectorized copy:

#include "pstl/execution"
#include "pstl/algorithm"
void foo(float* a, float* b, int n) {
    std::copy(std::execution::unseq, a, a+n, b);
}

Example 2

This example calls the parallelized version of fill_n:

#include <vector>
#include "pstl/execution"
#include "pstl/algorithm"

int main()
{
    std::vector<int> data(10000000);
    std::fill_n(std::execution::par_unseq, data.begin(), data.size(), -1);  // Fill the vector with -1

    return 0;
}

Implemented Algorithms

Parallel STL supports all of the aforementioned execution policies only for the algorithms listed in the following table. Adding a policy argument to any of the rest of the C++ standard library algorithms will result in sequential execution.

 

Algorithm – Algorithm page at cppreference.com

adjacent_find – http://en.cppreference.com/w/cpp/algorithm/adjacent_find
all_of – http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
any_of – http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
copy – http://en.cppreference.com/w/cpp/algorithm/copy
copy_if – http://en.cppreference.com/w/cpp/algorithm/copy
copy_n – http://en.cppreference.com/w/cpp/algorithm/copy_n
count – http://en.cppreference.com/w/cpp/algorithm/count
count_if – http://en.cppreference.com/w/cpp/algorithm/count
destroy – http://en.cppreference.com/w/cpp/memory/destroy
destroy_n – http://en.cppreference.com/w/cpp/memory/destroy_n
equal – http://en.cppreference.com/w/cpp/algorithm/equal
exclusive_scan – http://en.cppreference.com/w/cpp/algorithm/exclusive_scan
fill – http://en.cppreference.com/w/cpp/algorithm/fill
fill_n – http://en.cppreference.com/w/cpp/algorithm/fill_n
find – http://en.cppreference.com/w/cpp/algorithm/find
find_end – http://en.cppreference.com/w/cpp/algorithm/find_end
find_if – http://en.cppreference.com/w/cpp/algorithm/find
find_if_not – http://en.cppreference.com/w/cpp/algorithm/find
for_each – http://en.cppreference.com/w/cpp/algorithm/for_each
for_each_n – http://en.cppreference.com/w/cpp/algorithm/for_each_n
generate – http://en.cppreference.com/w/cpp/algorithm/generate
generate_n – http://en.cppreference.com/w/cpp/algorithm/generate_n
inclusive_scan – http://en.cppreference.com/w/cpp/algorithm/inclusive_scan
is_sorted – http://en.cppreference.com/w/cpp/algorithm/is_sorted
is_sorted_until – http://en.cppreference.com/w/cpp/algorithm/is_sorted_until
max_element – http://en.cppreference.com/w/cpp/algorithm/max_element
merge – http://en.cppreference.com/w/cpp/algorithm/merge
min_element – http://en.cppreference.com/w/cpp/algorithm/min_element
minmax_element – http://en.cppreference.com/w/cpp/algorithm/minmax_element
mismatch – http://en.cppreference.com/w/cpp/algorithm/mismatch
move – http://en.cppreference.com/w/cpp/algorithm/move
none_of – http://en.cppreference.com/w/cpp/algorithm/all_any_none_of
partition_copy – http://en.cppreference.com/w/cpp/algorithm/partition_copy
reduce – http://en.cppreference.com/w/cpp/algorithm/reduce
remove_copy – http://en.cppreference.com/w/cpp/algorithm/remove_copy
remove_copy_if – http://en.cppreference.com/w/cpp/algorithm/remove_copy
replace_copy – http://en.cppreference.com/w/cpp/algorithm/replace_copy
replace_copy_if – http://en.cppreference.com/w/cpp/algorithm/replace_copy
search – http://en.cppreference.com/w/cpp/algorithm/search
search_n – http://en.cppreference.com/w/cpp/algorithm/search_n
sort – http://en.cppreference.com/w/cpp/algorithm/sort
stable_sort – http://en.cppreference.com/w/cpp/algorithm/stable_sort
transform – http://en.cppreference.com/w/cpp/algorithm/transform
transform_exclusive_scan – http://en.cppreference.com/w/cpp/algorithm/transform_exclusive_scan
transform_inclusive_scan – http://en.cppreference.com/w/cpp/algorithm/transform_inclusive_scan
transform_reduce – http://en.cppreference.com/w/cpp/algorithm/transform_reduce
uninitialized_copy – http://en.cppreference.com/w/cpp/memory/uninitialized_copy
uninitialized_copy_n – http://en.cppreference.com/w/cpp/memory/uninitialized_copy_n
uninitialized_default_construct – http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct
uninitialized_default_construct_n – http://en.cppreference.com/w/cpp/memory/uninitialized_default_construct_n
uninitialized_fill – http://en.cppreference.com/w/cpp/memory/uninitialized_fill
uninitialized_fill_n – http://en.cppreference.com/w/cpp/memory/uninitialized_fill_n
uninitialized_move – http://en.cppreference.com/w/cpp/memory/uninitialized_move
uninitialized_move_n – http://en.cppreference.com/w/cpp/memory/uninitialized_move_n
uninitialized_value_construct – http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct
uninitialized_value_construct_n – http://en.cppreference.com/w/cpp/memory/uninitialized_value_construct_n
unique_copy – http://en.cppreference.com/w/cpp/algorithm/unique_copy

Known limitations

Parallel and vector execution is only supported for a subset of the aforementioned algorithms, and only when random access iterators are provided; otherwise, execution remains serial.

Legal Information

Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.
* Other names and brands may be claimed as the property of others.
© Intel Corporation

Using Intel® Compilers to Mitigate Speculative Execution Side-Channel Issues

Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at www.intel.com.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.

Intel provides these materials as-is, with no express or implied warranties.

Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Copyright © [2018], Intel Corporation. All rights reserved.

Introduction

Side channel methods are techniques that may allow an attacker to obtain secret or privileged information through observing the system that they would not normally be able to access, such as measuring microarchitectural properties about the system. For background information relevant to this article, refer to the overview in Intel Analysis of Speculative Execution Side Channels. This article describes Intel® C++ Compiler support and Intel® Fortran Compiler support for speculative execution side channel mitigations.

Mitigating Bounds Check Bypass (Spectre Variant 1)

Please read Intel's Analysis of Speculative Execution Side Channels for details, exploit conditions, and mitigations for the exploit known as bounds check bypass (Spectre variant 1).

One mitigation for Spectre variant 1 is through use of the LFENCE instruction. The LFENCE instruction does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. _mm_lfence() is a compiler intrinsic or assembler inline that issues an LFENCE instruction and also ensures that compiler optimizations do not move memory references across that boundary. Inserting an LFENCE between a bounds check condition and memory loads helps ensure that the loads do not occur until the bounds check condition has actually been completed.

The Intel C++ Compiler and Intel Fortran Compiler both allow programmers to insert LFENCE instructions, which can be used to help mitigate bounds check bypass (Spectre variant 1).

LFENCE in C/C++

You can insert LFENCE instructions in a C/C++ program as shown in the example below:

    #include <intrin.h>
    #pragma intrinsic(_mm_lfence)

    if (user_value >= LIMIT)
    {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    else
    {    
        _mm_lfence();	/* manually inserted by developer */
        x = table[user_value];
        node = entry[x];
    }

LFENCE in Fortran

You can insert an LFENCE instruction in Fortran applications as shown in the example below.
Implement the following subroutine, which calls _mm_lfence() intrinsics:

    interface 
        subroutine for_lfence() bind (C, name = "_mm_lfence") 
            !DIR$ attributes known_intrinsic, default :: for_lfence
        end subroutine for_lfence
    end interface
 
    if (untrusted_index_from_user .le. iarr1%length) then
        call for_lfence()
        ival = iarr1%data(untrusted_index_from_user)
        index2 = (IAND(ival,1)*z'100') + z'200'    
        if(index2 .le. iarr2%length) 
            ival2 = iarr2%data(index2)
    endif

The LFENCE intrinsic is supported in the following Intel compilers:

  • Intel C++ Compiler 8.0 and later for Windows*, Linux*, and macOS*
  • Intel Fortran Compiler 8.0 and later for Windows, Linux and macOS

 

Mitigating Branch Target Injection (Spectre Variant 2)

Intel's whitepaper on Retpoline: A Branch Target Injection Mitigation discusses the details, exploit conditions, and mitigations for the exploit known as branch target injection (Spectre variant 2). While there are a number of possible mitigation techniques for this side channel method, the mitigation technique described in that document is known as retpoline, which is a technique employed by the Intel® C++ and Fortran compilers.

The Intel C++ and Fortran compilers have command line options that can be used to help mitigate branch target injection (Spectre variant 2). These options replace all indirect branches (calls/jumps) with the retpoline code sequence. The thunk-inline option inserts a full retpoline sequence at each indirect branch that needs mitigation. The thunk-extern option reduces code size by sharing the retpoline sequence.

The compiler options implemented are:

  • -mindirect-branch=thunk-inline for Linux or macOS
  • -mindirect-branch=thunk-extern for Linux or macOS
  • /Qindirect-branch:thunk-inline for Windows
  • /Qindirect-branch:thunk-extern for Windows
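
For illustration only, a hypothetical translation unit built with the options above would have its indirect calls emitted as retpoline sequences (the driver names icpc and icl in the comments are assumptions based on the standard Intel compiler drivers; the options themselves are the ones listed above):

// Build examples (assumed drivers):
//   Linux*:   icpc -mindirect-branch=thunk-extern demo.cpp
//   Windows*: icl /Qindirect-branch:thunk-extern demo.cpp
#include <cstdio>

using handler_t = void (*)(int);

static void log_event(int code) { std::printf("event %d\n", code); }

int main()
{
    handler_t handler = log_event;  // function pointer: an indirect branch target
    handler(42);                    // this indirect call is replaced by a retpoline thunk
    return 0;
}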

The command line option is included in the following Intel compilers:

  • Intel® C++ Compiler 18.0 update 2 and later for Windows, Linux, and macOS
  • Intel® Fortran Compiler 18.0 update 2 and later for Windows, Linux, and macOS

Refer to the Intel Compilers - Supported compiler versions article for updates on the availability of mitigation options in supported Intel Compilers.

How to Obtain the Latest Intel® C++ Compiler and Intel® Fortran Compiler

The Intel® C++ Compiler is distributed as part of the Intel® Parallel Studio XE and Intel® System Studio tool suites. The Intel® Fortran Compiler is distributed as part of Intel® Parallel Studio XE 2018. You can download these from the Intel Registration Center. Intel® Parallel Studio XE 2018 update 2 or later and Intel® System Studio 2018 update 1 or later contain support for retpoline. Refer to the Intel Compilers - Supported compiler versions article for updates on the availability of mitigation options in supported Intel Compilers.

Conclusion and Further Reading

Visit the Intel Press Room for the latest updates regarding the Spectre and Meltdown issues, and Intel’s Side Channel Security Support website for additional software-specific, up-to-date information. You can find more detailed explanations of Speculative Execution Side-Channel Mitigations and Intel’s Mitigation Overview for Potential Side-Channel Cache Exploits in Linux* on our Side-Channel Security Support website.

Refer to our support site for support options if you experience any issues.

Intel continues to work on improving Intel software development products for the identified security issues. We will continue to revise this article with Intel® C++ Compiler and Intel Fortran Compiler product updates as they become available.


Why and How to Replace Perl Compatible Regular Expressions (PCRE) with Hyperscan

Introduction to PCRE and Hyperscan

Perl Compatible Regular Expressions (PCRE) is a widely used regular expression matching library written in the C language, inspired by the regular expression capabilities of the Perl programming language. Its syntax is much more powerful and flexible than that of many other regular expression libraries, such as Portable Operating System Interface for UNIX* (POSIX) regular expressions.

Hyperscan is a high performance, multi-pattern regular expression matching library developed by Intel, which supports the same syntax as PCRE. In this article, we will describe the API differences and provide a performance contrast between PCRE and Hyperscan, then show how to replace PCRE with Hyperscan in a typical scenario.

Functionality Comparison

PCRE supports only block mode compilation and matching, while Hyperscan supports both block and streaming mode. Streaming mode is more practical and flexible in real network scenarios.

PCRE supports only single pattern compilation and matching, while Hyperscan can support multiple patterns. In real scenarios, it is common that multiple patterns are applied, and Hyperscan can efficiently complete all work in one compilation and scan.

API Comparison

Both PCRE and Hyperscan interfaces have compile time and runtime phases. When replacing PCRE with Hyperscan, the basic idea is to replace the PCRE API with the Hyperscan API at both compile time and runtime.

Compile time API changes

With PCRE, we often use the following API to compile each pattern:

#include <pcre.h>
pcre *pcre_compile2(const char *pattern, int options, int *errorcodeptr, const char **errptr, int *erroffset, const unsigned char *tableptr); 

It is very easy to replace this API with the Hyperscan compile time API:

#include <hs_compile.h>
hs_error_t hs_compile(const char *expression, unsigned int flags,
               unsigned int mode, const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

Parameters:

expression – single pattern. Corresponding to “pattern” of pcre.
flags – flag of single pattern. Corresponding to “options” of pcre.
mode – mode selection. The mode corresponding to PCRE's behavior is HS_MODE_BLOCK.
platform – platform information.
db – Generated Hyperscan database. Corresponding to the return value of pcre_compile2.
error – return the compile error.

Return value is HS_SUCCESS(0) or error code.

Hyperscan also provides an API for multi-pattern compilation:

hs_error_t hs_compile_multi(const char *const *expressions,
               const unsigned int *flags, const unsigned int *ids,
               unsigned int elements, unsigned int mode,
               const hs_platform_info_t *platform,
               hs_database_t **db, hs_compile_error_t **error);

This API supports the compiling of several patterns together to generate one Hyperscan database. There are some differences in the parameters:

expressions – the array of patterns.
flags – the flag array of patterns.
ids – the id array of patterns.
elements – the number of patterns.

The others remain the same.
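
For example, a minimal multi-pattern compile might look like this sketch (illustrative only; the umbrella header name hs/hs.h and the pattern IDs are assumptions):

#include <hs/hs.h>

// Compile two patterns into a single block-mode database, or return NULL on error.
hs_database_t *build_db(void)
{
    const char *patterns[]     = { "foo[0-9]+", "bar(baz)?" };
    const unsigned int flags[] = { HS_FLAG_CASELESS, 0 };
    const unsigned int ids[]   = { 1001, 1002 };   // reported back to the match callback

    hs_database_t *db = NULL;
    hs_compile_error_t *err = NULL;
    if (hs_compile_multi(patterns, flags, ids, 2, HS_MODE_BLOCK,
                         NULL, &db, &err) != HS_SUCCESS) {
        hs_free_compile_error(err);
        return NULL;
    }
    return db;
}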

Run time API replacement

To build the compiled PCRE database, we often use the following API to scan the input data:

#include <pcre.h>
int pcre_exec(const pcre *code, const pcre_extra *extra, const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize); 

It is easy to replace it with the Hyperscan runtime API:

#include <hs_runtime.h>
hs_error_t hs_scan(const hs_database_t *db, const char *data,
               unsigned int length, unsigned int flags,
               hs_scratch_t *scratch, match_event_handler onEvent,
               void *context);

Parameters:

db – Hyperscan database. Corresponding to “code” of pcre.
data – input data. Corresponding to “subject” of pcre.
length – length of the input data. Corresponding to “length” of pcre.
flags – reserved options.
scratch – the space storing temporary state during runtime, allocated by hs_alloc_scratch().
onEvent – callback function at match, user-defined.
context – callback function parameter, user-defined.

Return value is HS_SUCCESS(0) or error code.
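
To show how the compile-time and runtime pieces fit together, here is a minimal block-mode sketch (not from the article; it assumes the umbrella header hs/hs.h and linking against libhs):

#include <hs/hs.h>
#include <cstdio>
#include <cstring>

// Match callback: called by hs_scan for every match; returning non-zero stops the scan.
static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *context)
{
    (void)from; (void)flags; (void)context;
    std::printf("pattern %u matched, ending at offset %llu\n", id, to);
    return 0;
}

int main()
{
    const char *pattern = "foo[0-9]+bar";
    const char *data    = "xxfoo123barxx";

    hs_database_t *db = NULL;
    hs_compile_error_t *compile_err = NULL;
    if (hs_compile(pattern, 0, HS_MODE_BLOCK, NULL, &db, &compile_err) != HS_SUCCESS) {
        std::fprintf(stderr, "compile failed: %s\n", compile_err->message);
        hs_free_compile_error(compile_err);
        return 1;
    }

    hs_scratch_t *scratch = NULL;
    if (hs_alloc_scratch(db, &scratch) != HS_SUCCESS) {
        hs_free_database(db);
        return 1;
    }

    hs_scan(db, data, (unsigned int)std::strlen(data), 0, scratch, on_match, NULL);

    hs_free_scratch(scratch);
    hs_free_database(db);
    return 0;
}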

Using prefilter mode

Hyperscan does not support the complete PCRE syntax; for example, it doesn't support back references and zero-width assertions. However, Hyperscan's performance advantage over PCRE makes it worthwhile to convert an unsupported pattern to a superset of it, which can be handled by Hyperscan's prefilter mode. For example:

Convert /foo(\d)+bar\1+baz/ to /foo(\d)+bar(\d)+baz/

At compile time, each pattern goes through the classifier first, which decides whether to use Hyperscan, prefilter mode, or PCRE. The patterns supported by Hyperscan and prefilter mode are multi-compiled together to generate one Hyperscan database; in addition, every pattern that uses prefilter mode or PCRE is compiled separately to generate the PCRE database:


Figure 1. Compile time.

At runtime, the input data is scanned against the Hyperscan database and each non-prefilter mode PCRE database. If Hyperscan finds a match, it should be reconfirmed with PCRE:


Figure 2. Runtime.

Please refer to API Reference: Constants for more API information.

Performance Comparison

To contrast Hyperscan and PCRE performance, we used the Hyperscan performance testing tool hsbench. We selected the following 15 regular expressions, which include both plain text strings and regular expression rules:

ID    Signature
1     Twain
2     (?i)Twain
3     [a-z]shing
4     Huck[a-zA-Z]+|Saw[a-zA-Z]+
5     \b\w+nn\b
6     [a-q][^u-z]{13}x
7     Tom|Sawyer|Huckleberry|Finn
8     (?i)Tom|Sawyer|Huckleberry|Finn
9     .{0,2}(Tom|Sawyer|Huckleberry|Finn)
10    .{2,4}(Tom|Sawyer|Huckleberry|Finn)
11    Tom.{10,25}river|river.{10,25}Tom
12    [a-zA-Z]+ing
13    \s[a-zA-Z]{0,12}ing\s
14    ([A-Za-z]awyer|[A-Za-z]inn)\s
15    ["'][^"']{0,30}[?!\.]["']

We ran the test on a single core of an Intel® Core™ i7-8700K processor at 3.70 GHz. We chose an e-book from The Entire Project Gutenberg Works of Mark Twain by Mark Twain, about 19 MB of text (18,905,427 bytes), as input, and then looped 200 times using hsbench. Time spent and throughput of PCRE (v8.41, just-in-time mode) and Hyperscan (v4.7.0) are as follows:

Corpus: mtent.txt. Total data: 18,905,427 bytes x 200.

ID       pcre_jit (s)    hs (s)     pcre_jit (Mbit/s)    hs (Mbit/s)
1        3.518           1.452      8,598.26             20,832.43
2        3.631           1.498      8,330.68             20,192.71
3        3.327           2.355      9,091.88             12,844.45
4        1.582           2.246      19,120.53            13,467.80
5        12.901          2.067      2,344.68             14,634.10
6        1.462           1.585      20,689.93            19,084.34
7        4.755           3.037      6,361.45             9,960.05
8        11.075          3.08       2,731.26             9,821.00
9        31.684          3.037      954.70               9,960.05
10       31.014          3.042      975.32               9,943.68
11       3.143           2.097      9,624.14             14,424.74
12       7.42            3.279      4,076.64             9,224.97
13       8.463           4.45       3,574.23             6,797.46
14       6.423           2.89       4,709.43             10,466.67
15       2.395           4.267      12,629.93            7,088.98
Total    132.793         40.382     227.79               749.06
Multi    132.793         13.4       227.79               2,257.36

The results show that Hyperscan has a performance advantage over PCRE for most of the rules tested. The highest throughput (see test 9) is 10.4 times greater using Hyperscan.

Multiple pattern matching test results also show the advantage of using Hyperscan. Multi-pattern matching is very common in practical use. Hyperscan can compile all the patterns simultaneously into one database which is scanned against the input corpus only once. The results above (see the rows labeled Total and Multi) show that it takes Hyperscan only 13.4 seconds to perform multiple pattern matching, and a total of 40.382 seconds when the 15 rules are scanned separately in single pattern scans. Because PCRE supports only single pattern compilation and scanning, each of the 15 rules must be compiled separately and scanned against the input corpus. Altogether, it takes PCRE 132.793 seconds to complete all scans. The throughput histogram is as follows:

Replacement Pseudo-code Samples

Assume that we have a pattern set and input corpus:

// patterns
const char *pats[];
unsigned flags[];

// input data
const char *buf = "blah...................blah";
size_t buf_len = strlen(buf);

When using PCRE, we may have this kind of implementation:

// pcre compile
for each pattern i
    pcres[i] = pcre_compile2(pats[i], flags[i], …);

// pcre runtime
for each pattern i
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern i match at ovector[1]

Now we’ll describe the details of replacing PCRE with Hyperscan.

In addition to a possible requirement for prefiltering mode, we also have to be careful about a pattern having variable width, which means that a pattern may consume a different amount of data to get matches. This is because Hyperscan reports all matches but PCRE only reports one match under greedy or ungreedy mode. We also need to reconfirm the match from a variable width pattern with PCRE. We may use the following function to check whether a pattern has variable width or not:

bool is_variable_width(re, flags) {
    hs_expr_info_t *info = NULL;
    ret = hs_expression_info(re, flags, &info, ...);
    if (ret == HS_SUCCESS) and info and (info->min_width == info->max_width)
        return false
    else
        return true
}

Here we show two different scenarios - single compile or multi-compile.

Single pattern compile

// try compile hs, prefilter mode and pcre compile
for each pattern i
    dbs[i].pcre = NULL;
    ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &dbs[i].hs, …);
    if ret == HS_SUCCESS
        hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        if pats[i] has variable width
            dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
    else
        dbs[i].pcre = pcre_compile2(pats[i], flags[i], …);
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &dbs[i].hs, …);
        if ret == HS_SUCCESS
            hs_alloc_scratch(dbs[i].hs, dbs[i].scratch);
        else
            dbs[i].hs = NULL;

// runtime
on_match(…, to, …, ctx) { // hs callback
    if !ctx->pcre // not prefilter mode pattern
        report pattern ctx->i match at to
        return
    
    // got a match from a prefilter mode or variable width pattern, need pcre confirm
    ret = pcre_exec(ctx->pcre, …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern ctx->i match at ovector[1]
    return
}

for each pattern i
    if dbs[i].hs
        ctx = pack dbs[i].pcre and i
        hs_scan(dbs[i].hs, buf, buf_len, 0, dbs[i].scratch, on_match, &ctx);
    else
        ret = pcre_exec(dbs[i].pcre, …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern i match at ovector[1]

// house clean
for each pattern i
    if dbs[i].hs
        hs_free_scratch(dbs[i].scratch);
        hs_free_database(dbs[i].hs);

Multi-pattern compile

// try hs, use prefilter mode and pcre compile if failed
for each pattern i
    ret = hs_compile(pats[i], flags[i], HS_MODE_BLOCK, …, &hs, …);
    if ret == HS_SUCCESS
        store pats[i] to hs_pats[]
        store flags[i] to hs_flags[]
        store ids[i] to hs_ids[]
        if pats[i] has variable width
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
    else
        ret = hs_compile(pats[i], flags[i] | HS_FLAG_PREFILTER, HS_MODE_BLOCK, …, &hs, …);
        if ret == HS_SUCCESS
            store pats[i] to hs_pats[]
            store flags[i] | HS_FLAG_PREFILTER to hs_flags[]
            store ids[i] to hs_ids[]
            id2pcre[ids[i]] = pcre_compile2(pats[i], flags[i], …);
        else
            pcres[n_pcre++] = pcre_compile2(pats[i], flags[i], …);

// hs multi compile
hs_compile_multi(hs_pats, hs_flags, hs_ids, n_hs, HS_MODE_BLOCK, …, &hs, …);
hs_alloc_scratch(hs, &scratch);

// hs runtime for multi compiled part
on_match(id, …, to, …, ctx) {
    if ctx[id] // got match from a prefilter mode or variable width pattern, need pcre confirm
        ret = pcre_exec(ctx[id], …, buf, buf_len, …, &ovector[0], …);
        if ret >= 0
            report pattern id match at ovector[1]
    else
        report pattern id match at to
    return
}

ctx = id2pcre; // user defined on_match context
hs_scan(hs, buf, buf_len, 0, scratch, on_match, ctx);

// pcre runtime for the rest
for each db i in pcres[]
    ret = pcre_exec(pcres[i], …, buf, buf_len, …, &ovector[0], …);
    if ret >= 0
        report pattern pcre_ids[i] match at ovector[1]

// house clean
hs_free_scratch(scratch);
hs_free_database(hs);

Summary

Hyperscan is a high-performance regular expression matching library that is very well suited to multi-pattern matching and is faster than PCRE. In this article, we showed how to replace PCRE with Hyperscan in a typical scenario. Beyond its performance advantage, Hyperscan offers another notable feature: streaming mode, which lets it match across input data that arrives split into separate blocks without buffering the whole stream. For these reasons, Hyperscan can replace existing regular expression matching engines in an even wider range of scenarios.
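
As a rough illustration of streaming mode, the following minimal C sketch (added here for illustration, not part of the original article; error handling is mostly omitted and the patterns, flags, and input blocks are placeholder assumptions) compiles a small pattern set in HS_MODE_STREAM and feeds the input in two pieces, so that a match spanning the block boundary is still reported:

#include <stdio.h>
#include <string.h>
#include <hs/hs.h>   // header path may be <hs.h> depending on the installation

static int on_match(unsigned int id, unsigned long long from,
                    unsigned long long to, unsigned int flags, void *ctx) {
    (void)from; (void)flags; (void)ctx;
    printf("pattern %u matched, ending at offset %llu\n", id, to);
    return 0; // returning non-zero would stop scanning
}

int main(void) {
    // placeholder patterns, flags, and ids
    const char *pats[]  = {"foo.*bar", "[0-9]{4}"};
    unsigned int flgs[] = {HS_FLAG_DOTALL, 0};
    unsigned int ids[]  = {1, 2};

    hs_database_t *db = NULL;
    hs_compile_error_t *err = NULL;
    if (hs_compile_multi(pats, flgs, ids, 2, HS_MODE_STREAM,
                         NULL, &db, &err) != HS_SUCCESS) {
        fprintf(stderr, "compile failed: %s\n", err->message);
        hs_free_compile_error(err);
        return 1;
    }

    hs_scratch_t *scratch = NULL;
    hs_alloc_scratch(db, &scratch);

    hs_stream_t *stream = NULL;
    hs_open_stream(db, 0, &stream);

    // the two blocks together contain "foo ... bar" across the boundary
    const char *block1 = "some data foo";
    const char *block2 = " and then bar, plus the year 2018";
    hs_scan_stream(stream, block1, (unsigned int)strlen(block1), 0,
                   scratch, on_match, NULL);
    hs_scan_stream(stream, block2, (unsigned int)strlen(block2), 0,
                   scratch, on_match, NULL);

    hs_close_stream(stream, scratch, on_match, NULL); // flushes any pending matches
    hs_free_scratch(scratch);
    hs_free_database(db);
    return 0;
}

The database and scratch space are reused across blocks; only the lightweight stream object carries the matching state between calls.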

Better Generative Modelling through Wasserstein GANs

The following research uses Intel® AI DevCloud, a cloud-hosted hardware and software platform available for developers, researchers and startups to learn, sandbox and get started on their Artificial Intelligence projects. This free cloud compute is available for Intel® AI Academy members.

Overview

The year 2017 was a period of scientific breakthroughs in deep learning, with the publication of numerous research papers. Every year seems like a big leap toward artificial general intelligence, or AGI.

One exciting development involves generative modelling and the use of Wasserstein GANs (Generative Adversarial Networks). An influential paper on the topic has significantly changed the approach to generative modelling, building on the original GAN paper that Ian Goodfellow published in 2014.

Why Wasserstein GANs are such a big deal:

  • With a Wasserstein GAN, you can train the critic (discriminator) to convergence. This removes the need to carefully balance generator updates with discriminator updates, which previously had to be tuned against each other by hand.
  • The original WGAN paper (Arjovsky et al.) proposed a new GAN training algorithm that works well on the commonly used GAN datasets.
  • Theory-justified papers often fail to deliver good empirical results, but the training algorithm in this paper is backed by theory that also explains why WGANs work so much better in practice.

Introduction

This paper differs from earlier work: the training algorithm is backed up by theory, and few examples exist where theory-justified papers gave good empirical results. The big thing about WGANs is that developers can train their discriminator to convergence, which was not possible earlier. Doing this eliminates the need to balance generator updates with discriminator updates.

What is Earth Mover's Distance?

When dealing with discrete probability distributions, the Wasserstein distance is also known as Earth mover's distance (EMD). Picture each distribution as a heap of earth: EMD is the minimum total amount of work it takes to transform one heap into the other, where work is the amount of earth moved multiplied by the distance it is moved. The two discrete probability distributions are usually denoted Pr and P(theta).

Pr comes from an unknown distribution, and the goal is to learn a P(theta) that approximates Pr.

Calculating the EMD is an optimization problem with infinitely many possible transport plans; the challenge is to find the optimal one.
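
As a small worked example (added here for illustration, not taken from the original article), consider two discrete distributions on the points {1, 2, 3}:

P = (0.5,\; 0.5,\; 0), \qquad Q = (0,\; 0.5,\; 0.5)

One transport plan moves 0.5 units of mass from point 1 to point 3 (distance 2); another moves 0.5 from point 1 to point 2 and 0.5 from point 2 to point 3 (distance 1 each). Both plans cost

\mathrm{EMD}(P, Q) = 0.5 \times 2 = 0.5 \times 1 + 0.5 \times 1 = 1.0

and no cheaper plan exists, so the EMD here is 1.0. In general, the EMD is the minimum cost over all valid transport plans.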

Calculation of EMD

One approach would be to directly learn the probability density function P(theta), that is, P(theta) would be some differentiable function optimized by maximum likelihood estimation. To do that, minimize the KL (Kullback–Leibler) divergence KL(Pr||P(theta)) and add random noise to P(theta) when training the model for maximum likelihood estimation. The noise ensures that the model distribution has support everywhere; otherwise, if a single real data point falls where P(theta) is zero, the KL divergence explodes.

Adversarial training also makes it hard to tell whether a model is actually making progress during training. It has been shown that GANs are related to actor-critic methods in reinforcement learning.

Kullback–Leibler and Jensen–Shannon Divergence

  1. KL (Kullback–Leibler) divergence measures how one probability distribution P diverges from a second expected probability distribution Q.

    D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}[\log p(x) - \log q(x)] = -H(p) - \mathbb{E}_{x \sim p}[\log q(x)] \tag{18}

    \min_q D_{KL}(p \,\|\, q) \;\Longleftrightarrow\; \max_q \mathbb{E}_{x \sim p}[\log q(x)] \tag{19}

    We drop −H(p) going from (18) to (19) because it is a constant with respect to q. Minimizing the LHS (left-hand side) is therefore the same as maximizing the expectation of log q(x) over the distribution p, which is maximizing the log-likelihood of the data.

    DKL achieves the minimum zero when p(x) == q(x) everywhere.

    It is noticeable from the formula that KL divergence is asymmetric. In cases where p(x) is close to zero but q(x) is significantly non-zero, q's contribution is largely disregarded. This can give misleading results when the intention is simply to measure the similarity between two equally important distributions.

  2. Jensen–Shannon Divergence is another measure of similarity between two probability distributions. JS (Jensen–Shannon) divergence is symmetric, relatively smoother, and bounded by [0, 1].

    Consider two Gaussian distributions: P with mean 0 and standard deviation 1, and Q with mean 1 and standard deviation 1, and label their average m = (p+q)/2. The KL divergence DKL is asymmetric, while the JS divergence DJS is symmetric.

    Gaussian distributions
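
    For reference, the standard definitions behind these two measures (restated here; p and q are the two distributions and m = (p+q)/2 is their average) are:

    D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

    D_{JS}(p \,\|\, q) = \tfrac{1}{2} D_{KL}(p \,\|\, m) + \tfrac{1}{2} D_{KL}(q \,\|\, m)

    D_JS is symmetric by construction and, with base-2 logarithms, bounded by [0, 1].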

Generative Adversarial Network (GAN)

GAN consists of two models:

  • A discriminator D estimates the probability of a given sample coming from the real dataset. It works as a critic and is optimized to tell the fake samples from the real ones.
  • A generator G outputs synthetic samples given a noise variable input z (z brings in potential output diversity). It is trained to capture the real data distribution so that its generated samples can be as real as possible; in other words, it tries to trick the discriminator into assigning them a high probability of being real.

GAN model
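
For reference, the two models are trained against each other with the original minimax objective from Goodfellow et al. (a standard result, restated here rather than quoted from this article), where p_r is the real data distribution and p_z the noise prior:

\min_G \max_D \; \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]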

Use Wasserstein Distance as GAN Loss Function

It is practically impossible to exhaust all the joint distributions in Π(pr,pg) to compute the infimum over γ ∈ Π(pr,pg) directly. Instead, the authors proposed a smart transformation of the formula based on the Kantorovich-Rubinstein duality:

Kantorovich-Rubinstein duality

One big problem is maintaining the K-Lipschitz continuity of fw during training to make everything work out. The paper presents a simple but practical trick: after each gradient update, clamp the weights w to a small window such as [−0.01, 0.01], resulting in a compact parameter space W; fw therefore has lower and upper bounds that preserve the Lipschitz continuity.

K-Lipschitz
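
Written out, the primal definition of the Wasserstein distance and the dual form used by WGAN are the standard ones below (restated for reference, not reproduced from the article's figures); f_w is the critic network with weights w restricted to the compact set W:

W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)} \mathbb{E}_{(x, y) \sim \gamma}[\,\lVert x - y \rVert\,]

W(p_r, p_g) = \frac{1}{K} \sup_{\lVert f \rVert_L \le K} \; \mathbb{E}_{x \sim p_r}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \;\approx\; \max_{w \in W} \; \mathbb{E}_{x \sim p_r}[f_w(x)] - \mathbb{E}_{z \sim p_z}[f_w(g_\theta(z))]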

Compared to the original GAN algorithm, the WGAN undertakes the following changes:

  • After every gradient update on the critic function, clamp its weights to a small fixed range, usually [−c, c].
  • Use a new loss function derived from the Wasserstein distance. The discriminator no longer acts as a direct real/fake critic but as a helper for estimating the Wasserstein metric between the real and generated data distributions.

Empirically, the authors recommend the RMSProp optimizer for the critic rather than a momentum-based optimizer such as Adam, which could cause instability in the model training.
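
Concretely, one training iteration from the WGAN paper can be sketched as follows (restated here for reference; α is the learning rate, c the clipping threshold, m the batch size, and the critic step is repeated n_critic times per generator step):

g_w \leftarrow \nabla_w \Big[ \tfrac{1}{m} \sum_{i=1}^{m} f_w(x^{(i)}) - \tfrac{1}{m} \sum_{i=1}^{m} f_w(g_\theta(z^{(i)})) \Big], \quad w \leftarrow w + \alpha \cdot \mathrm{RMSProp}(w, g_w), \quad w \leftarrow \mathrm{clip}(w, -c, c)

g_\theta \leftarrow -\nabla_\theta \, \tfrac{1}{m} \sum_{i=1}^{m} f_w(g_\theta(z^{(i)})), \quad \theta \leftarrow \theta - \alpha \cdot \mathrm{RMSProp}(\theta, g_\theta)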

Improved GAN Training

The following suggestions are proposed to help stabilize and improve the training of GANs.

  • Adding noise - Based on the discussion in the previous section, Pr and Pg are typically disjoint in a high-dimensional space, which is a cause of the vanishing-gradient problem. To synthetically “spread out” the distributions and create a higher chance of overlap between them, one solution is to add continuous noise to the inputs of the discriminator D.
  • One-sided label smoothing - When feeding the discriminator, instead of providing the labels as 1 and 0, this paper proposed using softened values such as 0.9 and 0.1. This reduces the discriminator's overconfidence and makes the network less vulnerable.

Wasserstein metric is proposed to replace JS divergence because it has a much smoother value space.

Overview of DCGAN

In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. By comparison, unsupervised learning with ConvNets has received less attention. Deep convolutional generative adversarial networks (DCGANs) impose certain architectural constraints and demonstrate a strong potential for unsupervised learning. Training on various image datasets shows convincing evidence that a deep convolutional adversarial pair learns a hierarchy of representations, from object parts to scenes, in both the generator and the discriminator. Additionally, the learned features can be used for novel tasks, demonstrating their applicability as general image representations.

DCGAN

Problem with GANs

  1. Hard to reach Nash equilibrium - The two neural networks (generator and discriminator) are trained simultaneously to find a Nash equilibrium, but each player updates its cost function independently, without considering the updates made by the other network. Simultaneous, independent updates like this cannot guarantee convergence, which is the stated objective.
  2. Vanishing gradient - When the discriminator works as required, D(x) approaches 1 when x belongs to Pr and approaches 0 otherwise. The loss function L then falls toward zero, leaving no gradient to update the model during training: as the discriminator gets better, the gradient vanishes quickly, tending to 0.
  3. Need for a better metric of distribution similarity - The loss function proposed in the vanilla GAN (by Goodfellow et al.) measures the JS divergence between the distributions Pr and P(theta). This metric fails to provide a meaningful value when the two distributions are disjoint.

Replacing JS divergence with the Wasserstein metric gives a much smoother value space.
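
The link between points 2 and 3 above is the standard result that, for a fixed generator, the optimal discriminator is D*(x) = pr(x) / (pr(x) + pg(x)); substituting it back (restated here as a well-known identity, not quoted from this article), the GAN loss becomes

L(G, D^*) = 2\, D_{JS}(p_r \,\|\, p_g) - 2 \log 2

so when pr and pg are disjoint, the JS divergence is stuck at the constant log 2 and supplies no useful gradient, whereas the Wasserstein distance still varies smoothly.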

Training a Generative Adversarial Network faces a major problem:

  • If the discriminator works too well, the gradient of the loss function tends toward zero. The loss then barely updates, so training becomes very slow or the model gets stuck.
  • If the discriminator behaves badly, the generator does not receive accurate feedback, and the loss function no longer reflects reality.

Evaluation Metric

GANs lack a good objective function that gives insight into the whole training process, so a good evaluation metric was needed. The Wasserstein distance seeks to address this problem.

Technologies Involved and Methodologies

GANs are difficult to train since convergence is an issue. Using Intel® AI DevCloud and implementing with TensorFlow* served to hasten the process. The first step was to determine the evaluation metric, followed by getting the generator and discriminator to work as required. Other steps included defining the Wasserstein Distance and making use of Residual Blocks in the generator and discriminator.

Steps and Development Process

Initially the project dealt with plain GANs, which are powerful models but suffer from training instability. Switching to DCGANs and then building the project with WGANs was intended to make progress toward stable GAN training. Images generated with DCGANs were not of good quality and failed to converge during training. The initial WGAN paper proposed optimizing the Wasserstein metric (a measure of the distance between two probability distributions) instead of the Jensen-Shannon divergence.

The reason the Wasserstein distance is better than the JS or KL divergence is that when two distributions are located on lower-dimensional manifolds without overlap, the Wasserstein distance can still provide a meaningful and smooth representation of the distance between them.

Additionally, there is almost no hyperparameter tuning.

Intel Development Tools Used

The project used a Jupyter* notebook on the Intel® AI DevCloud (powered by Intel® Xeon® Scalable processors) to write the code and for visualization. Information from the Intel® AI Academy forum was also used.

Few GANs Applications

These are just a few applications of GANs (to provide some ideas); they can be extended to do far more than we can list here. Many papers have made use of different GAN architectures; some applications are listed below:

  • Font generation with conditional GANs
  • Interactive image generation
  • Image editing
  • Human pose estimation
  • Synthetic data generation
  • Visual saliency prediction
  • Adversarial examples (defense vs attack)
  • Image blending
  • Super resolution
  • Image inpainting
  • Face aging

Code

The code can be found in this Github* repository.

Empirical Results

The paper (Arjovsky et al.) demonstrated the real difference between a GAN and a WGAN: a GAN discriminator and a Wasserstein GAN critic are both trained to optimality. In the following graph, blue depicts the real Gaussian distribution and green depicts the fake one, and the values of both networks are plotted. The red curve depicts the GAN discriminator output.

GAN discriminator output

Both the GAN and the WGAN identify which distribution is fake and which is real, but the GAN discriminator does so in a way that makes gradients vanish over this high-dimensional space. WGANs use weight clamping, which gives them an edge: the critic provides useful gradients at almost every point in space. The Wasserstein loss also seems to correlate well with image quality.

Join the Intel® AI Academy

Sign up for the Intel® AI Academy and access essential learning materials, community, tools, and technology to boost your AI development.

Apply to become an AI Student Ambassador and share your expertise with other student data scientists and developers.

Contact the author on Twitter* or Github*.

Start Amazon Web Services Greengrass* Core on the UP Squared* Development Board

Introduction

This guide shows the steps to start Amazon Web Services (AWS) Greengrass* core on Ubuntu* using the UP Squared* development board.

About the UP Squared* Board

Characterized by low power consumption and high performance, which is ideal for the Internet of Things (IoT), the UP Squared platform is the fastest x86 maker board based on the Apollo Lake platform from Intel. It is available with either the dual-core Intel® Celeron® N3350 processor or the quad-core Intel® Pentium® N4200 processor.

AWS Greengrass*

AWS Greengrass is software that extends AWS cloud capabilities to local devices, allowing them to collect and analyze data on the local devices. This reduces latency between the devices and data processing layer, and reduces storage and bandwidth costs involved with sending data to the cloud. The user can create AWS Lambda functions to enable Greengrass to keep data in sync, filter data for further analysis, and communicate with other devices securely.

Operating System Compatibility

The UP Squared board can run Ubilinux*, Ubuntu*, Windows® 10 IoT Core, Windows® 10, Yocto Project*, and Android* Marshmallow operating systems. For more information on UP Squared, visit this website.

Hardware Components

The main hardware component used in this project is the UP Squared development board.

Create AWS Greengrass* Group

An AWS Greengrass group is a collection of settings for AWS Greengrass core devices, and the devices that communicate with them. Let’s start by logging into the Amazon Web Services (AWS)* Management Console, opening AWS IoT console, choosing a region from the top right corner of the navigation bar, then selecting Greengrass.

On the Welcome to AWS Greengrass screen, choose Get Started.

Figure 1: AWS IoT Console

On the Set up your Greengrass group page, select Use easy creation to create an AWS Greengrass group.

Figure 2: Setting up AWS Greengrass Group

Choose a name for your Greengrass Group, then click Next.

Figure 3: Setting up AWS Greengrass Group: Name the Group

Use the default name for the AWS Greengrass core, then select Next.

Figure 4: Setting up AWS Greengrass Group: Name the Greengrass Core

Select Create group and Core on the Run a scripted easy Group creation page. 

Figure 5: Setting up AWS Greengrass Group: Create Group and Core

You should see the following page while the AWS Greengrass Group is being created.

Figure 6: Setting up AWS Greengrass Group: Creating Group and Core

When you see a certificate and public and private keys, you have successfully created the new Greengrass group. Click Download these resources as a tar.gz to download the certificate and private key for later use. Select x86_64 for the CPU architecture, and then click Download Greengrass to start the download.

Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key

Select Finish.

Figure 8: Setting up AWS Greengrass Group: Finished Successfully

Development Boards

Before you begin, make sure that the Ubuntu* operating system is installed on the UP Squared board. To ensure that Ubuntu is up to date and the dependent Ubuntu packages are installed, open a command prompt (terminal) and type the following:

sudo apt-get update

Install sqlite3 package by entering the following command in the terminal:

sudo apt-get install sqlite3

Create the Greengrass user and group account:

sudo adduser --system ggc_user
sudo addgroup --system ggc_group

Untar the Greengrass Core software that was downloaded in the “Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key” step earlier.

Download the cmake package by entering the following command in the terminal:

wget https://cmake.org/files/v3.8/cmake-3.8.0.tar.gz

Execute the following commands:

tar -xzvf cmake-3.8.0.tar.gz
cd cmake-3.8.0
./configure
make
sudo make install

Use the following commands to install OpenSSL:

wget https://www.openssl.org/source/openssl-1.0.2k.tar.gz
tar -xzvf openssl-1.0.2k.tar.gz
cd openssl-1.0.2k
./config --prefix=/usr
make
sudo make install
sudo ln -sf /usr/local/ssl/bin/openssl `which openssl`
openssl version -v

Enable Hardlinks and Softlinks Protection

Activate the hardlinks and softlinks protection to improve security on the device. Add the following two lines to /etc/sysctl.d/10-link-restrictions.conf.

fs.protected_hardlinks = 1
fs.protected_symlinks = 1

Reboot the UP Squared board and validate the system variables by running:

sudo sysctl -a | grep fs.protected

Install Greengrass Certificate and Key

Copy the certificate and private key files created in the “Figure 7: Setting up AWS Greengrass Group: Certificate and Private Key” step above to the UP Squared board as follows:

  • cloud.pem.crt: 4f7a73faa9-cert.pem.crt created above
  • cloud.pem.key: 4f7a73faa9-private.pem.key created above
  • root-ca-cert.pem: wget https://www.symantec.com/content/en/us/enterprise/verisign/roots/VeriSign-Class%203-Public-Primary-Certification-Authority-G5.pem -O root-ca-cert.pem

The ~/greengrass/certs folder should look like this:

Edit config.json

Open a command prompt (terminal) and navigate to ~/greengrass/config folder. Edit config.json as follows to configure the Greengrass Core:

{
    "coreThing": {
        "caPath": "root-ca-cert.pem",
        "certPath": "cloud.pem.crt",
        "keyPath": "cloud.pem.key",
        "thingArn": "arn:aws:iot:us-east-1:xxxxxxxxxxxx:thing/MyGreengrass1stGroup_Core",
        "iotHost": "yyyyyyyyyyyy.iot.us-east-1.amazonaws.com",
        "ggHost": "greengrass.iot.us-east-1.amazonaws.com",
	"keepAlive": 600
    },
    "runtime": {
        "cgroup": {
            "useSystemd": "yes"
        }
    },
    "system": {
        "shadowSyncTimeout": 120
    }
}
Note: The default value of shadowSyncTimeout is 1.
  • thingArn: Navigate to AWS IoT console, choose Manage on the left, and then select MyGreengrass1stGroup under Thing.

ThingARN should look like this:

  • iotHost: Navigate to AWS IoT console, the Endpoint is located under Settings on the bottom left corner of the AWS IoT console. 

Start AWS Greengrass* Core

Open a command prompt (terminal), navigate to the ~/greengrass/ggc/core folder, and start the Greengrass daemon:

cd ~/greengrass/ggc/core
sudo ./greengrassd start

When you see the message “Greengrass successfully started”, the Greengrass core has started successfully.

To confirm that the Greengrass core process is running, run the following command:

ps aux | grep greengrass

Summary

We have described how to start the Greengrass core on the UP Squared board. From here, there are several projects you can try to explore the potential of the UP Squared board. For example, you can create a Greengrass deployment, add a group of devices that can communicate with the local IoT endpoint, enable Lambda functions to filter data for further analysis, and more.

References

AWS Greengrass* Developer Guide: 
http://docs.aws.amazon.com/greengrass/latest/developerguide/what-is-gg.html
http://docs.aws.amazon.com/greengrass/latest/developerguide/gg-config.html

Up Squared:
http://www.up-board.org/upsquared

Amazon:
https://aws.amazon.com/kinesis/streams/getting-started

IoT References:
https://software.intel.com/en-us/iot/hardware/devkit

About the Author

Nancy Le is a software engineer at Intel Corporation in the Software and Services Group, working on the Intel Atom® processor and IoT scale enabling projects.

*Other names and brands may be claimed as the property of others.

More on UP Squared

Artificial Intelligence (AI) Helps with Skin Cancer Screening


"The long-term goal and true potential of AI is to replicate the complexity of human thinking at the macro level, and then surpass it to solve complex problems—problems both well-documented and currently unimaginable in nature."1

Challenge

Skin cancer has reached epidemic proportions in much of the world. A simple test is needed to perform initial screening on a wide scale to encourage individuals to seek treatment when necessary.

Solution

Doctor Hazel, a skin cancer screening service powered by artificial intelligence (AI) that operates in real time, relies on an extensive library of images to distinguish between skin cancer and benign lesions, making it easier for people to seek professional medical advice.

Background and History

Hackathons have proven to be a successful way to channel energy and technical expertise into solving very specific problems and generating bright, new ideas for applied technology. Such is the case for the genesis of Doctor Hazel, a noteworthy project from the TechCrunch Disrupt San Francisco 2017 hackathon, co-developed by Intel® Software Innovator Peter Ma and Mike Borozdin, VP of Engineering at Ethos Lending and cofounder of Doctor Hazel (see Figure 1).

Peter noted, "My cofounder and I had a very close mutual friend who died of cancer in his early 30s. That event triggered our desire to do something about curing cancer. After researching AI and cancer, we think we can actually do something— using AI effectively—to screen for skin cancer."

Peter Ma (left) and Mike Borozdin show screening techniques
Figure 1. Peter Ma (left) and Mike Borozdin show screening techniques.

With the purchase and aid of an inexpensive, high-powered endoscope camera to capture images, Peter and Mike launched into the creation of the Doctor Hazel website and presented the project at the TechCrunch hackathon to widespread acclaim. "Since we built the first prototype in September 2017," Peter said, "we've been covered on TechCrunch, in The Wall Street Journal, IQ by Intel, and many other outlets and publications. Given our experience, we are confident that we can handle the technical requirements; our biggest challenges are US Food and Drug Administration (FDA) approval and gathering additional classified images."

"For all startups," Peter said, "the ideas are the easiest and execution is the hard work. Most of the projects fail because they can't find the product market fit. I've built out hundreds of prototypes, but very few of them gained interest from anyone. When you show people the demo of Doctor Hazel, everyone wants to join the beta and help out. We are getting hundreds of inquiries every single week from people who want to donate data and try the service."

Notable project milestones

  • First introduction of the Doctor Hazel concept and prototype at the TechCrunch hackathon, September 2017.
  • Launch of the Doctor Hazel website to explain the project and solicit images and information from parties that want to help build the database.
  • Media coverage in a number of different outlets and publications, including The Wall Street Journal, TechCrunch, and IQ by Intel.
  • Demonstrations of the project capabilities at multiple venues, including the Global IoT DevFest II, November 7 and 8, 2017.

Peter Ma demonstrates the technology at Strata Data NY in 2017
Figure 2. Peter Ma demonstrates the technology at Strata Data NY in 2017.

Enabling Technologies

The hardware portion of the project came together easily. Using a high-power endoscope camera acquired from Amazon* for about USD 30, the team captured high resolution images of moles and skin lesions to compare with the images in the growing database. Peter and Mike took advantage of Intel® AI DevCloud to train the AI model. This Intel® Xeon® Scalable processor-powered platform is available to Intel® AI Academy members for free and supports several of the major AI frameworks, including TensorFlow* and Caffe*. To broaden the utility of this diagnostic tool, Doctor Hazel employs the Intel® Movidius™ Neural Compute Stick, which makes it possible to conduct screening in situations where no Internet access is immediately available.

"Intel provides both hardware and software needs in artificial intelligence," Peter said, "from training to deployment. As a startup, it's relatively inexpensive to build up the prototype. The Intel® Movidius™ Neural Compute Stick costs about USD 79 and it allows AI to run in real time. We used the Intel® Movidius™ Software Development Kit (SDK), which proved extremely useful for this project."

Contained in a USB form factor and powered by a low-power Intel® Movidius™ Vision Processing Unit (VPU), the Intel Movidius Neural Compute Stick excels at accelerating deep neural network processing using its self-contained inference engine. Developers have the option of initiating projects with a convolutional neural network model based on the Caffe or TensorFlow frameworks, using one of the multiple example networks. A toolkit then makes it possible to profile and tune the neural network, then compile a version for embedding with the Neural Compute Platform API. Visit this site for tips to start developing with the Intel Movidius Neural Compute Stick.

An extensive image database of suspected and validated skin cancer lesions is a primary requisite for improving machine learning and boosting recognition accuracy.

Thousands of images were downloaded from the International Skin Imaging Collaboration, the Skin Cancer Foundation, and the University of Iowa to seed the learning process initially. In assessing a sample, Doctor Hazel gauges 8,000 variables to detect whether an image sample is likely to be skin cancer, a mole, or a benign lesion.

The driving goal of the project is to provide a means for anyone to get skin cancer screening for free. To build the image database and collect a broader sampling of confirmed skin cancer images, the beta version of the Doctor Hazel site is soliciting input and data. In an interview with TechCrunch, Mike commented, "There's a huge problem in getting AI data for medicine, but amazing results are possible. The more people share, the more accurate the system becomes." The team is working to advance recognition rates past the 90 percent level, a goal that gets closer as the image database expands.

Eventually, the team is planning an app to accompany the platform, and plans are also being considered for a compact, inexpensive image-capturing device to use in screening. An underlying goal of the project is to permit individuals to have themselves tested easily, perhaps at a clinic or through a free center using the real-time test system, and then seek a dermatologist or medical professional if the results indicate a high probability of skin cancer. Doctors will no longer need to perform the initial screening, allowing them to focus on patients that show a greater need for treatment based on a positive indication of cancer (see Figure 3).

Doctor reaching for a dermascope to examine a patient’s skin lesion
Figure 3. Doctor reaching for a dermascope to examine a patient's skin lesion.

AI is opening innovative paths to medical advances

The use of AI in diagnostic medicine and treatment methods is creating new opportunities to enhance healthcare globally. Through the design and development of specialized chips, optimized software and frameworks, sponsored research, educational outreach, and industry partnerships, Intel is firmly committed to advancing the state of AI to solve difficult challenges in medicine, manufacturing, agriculture, scientific research, and other industry sectors. Intel works closely with government organizations, non-government organizations, and corporations to uncover and advance solutions that solve major challenges, while complying with governmental policies and mandates in force.

The Intel® AI portfolio includes:

Intel Xeon logo

Intel® Xeon® Scalable processor: Tackle AI challenges with a compute architecture optimized for a broad range of AI workloads, including deep learning.

Framework Optimization

Framework Optimization: Achieve faster training of deep neural networks on a robust scalable infrastructure.

Intel Movidius Myriad

Intel® Movidius™ Myriad™ Vision Processing Unit (VPU): Create and deploy on-device neural networks and computer vision applications.

For more information, visit this portfolio page: https://ai.intel.com/technology

For Intel® AI Academy members, the Intel AI DevCloud provides a cloud platform and framework for machine learning and deep learning training. Powered by Intel Xeon Scalable processors, the Intel AI DevCloud is available for up to 30 days of free remote access to support projects by academy members.

Join today: https://software.intel.com/ai/sign-up

"AI fundamentally will enable us to advance scientific method, which itself is a tool, a process that allows us to have repeatable, reproducible results. Now we need to incorporate more data into those inferences in order to drive the field forward. Gone are the days that a single person goes and looks at some data on their own and comes up with a breakthrough, sitting in a corner. Now it is all about bringing together multiple data sources, collaborating, and the tools are what makes that happen."2

– Naveen Rao, Intel VP and GM, Artificial Intelligence Products Group

Resources

Intel® AI Academy

Skin Cancer Project in Intel Developer Mesh

IQ by Intel article - Skin Cancer Detection Using Artificial Intelligence

Deep-learning Algorithm for Skin Cancer Research

Doctor Hazel Website

Doctor Hazel uses AI for Skin Cancer Research

Getting the Most out of AI Using the Caffe Deep Learning Framework

Intel® Distribution for Caffe*

Intel® Movidius™ Neural Compute Stick

Dermatologist-level classification of skin cancer

References

1. Carty, J., C. Rodarte, and N. Rao. "Artificial Intelligence in Pharma and Care Delivery", HealthXL. 2017

2. https://newsroom.intel.com/news/intel-accelerates-accessibility-ai-developer-cloud-computing-resources/

Intel® System Studio 2018 for FreeBSD* Release Notes

This page provides the Release Notes for Intel® VTune™ Amplifier 2018 component of Intel® System Studio 2018 for FreeBSD*. 

To get product updates, log in to the Intel® Software Development Products Registration Center.

For questions or technical support, visit Intel® Software Products Support.

You can register and download the Intel® System Studio 2018 package here.

Intel® VTune™ Amplifier 2018 for FreeBSD* Release Notes

Intel® VTune™ Amplifier 2018 provides an integrated performance analysis and tuning environment with a graphical user interface that helps you analyze code performance on systems with IA-32 or Intel® 64 architectures. It provides a target package for collecting data on the FreeBSD* system that is then displayed on a host system supporting the graphical interface, either via the remote capability or by manually copying the results to the host.

This document provides system requirements, issues and limitations, and legal information for both the host and target systems.

System requirements

For an explanation of architecture names, see https://software.intel.com/en-us/articles/intel-architecture-platform-terminology/

Host Processor requirements 

  • For general operations with user interface and all data collection except Hardware event-based sampling analysis:
    • A PC based on an IA-32 or Intel® 64 architecture processor supporting the Intel® Streaming SIMD Extensions 2 (Intel® SSE2) instructions (Intel® Pentium® 4 processor or later, or compatible non-Intel processor).
    • For the best experience, a multi-core or multi-processor system is recommended.
    • Because the VTune Amplifier requires specific knowledge of assembly-level instructions, its analysis may not operate correctly if a program contains non-Intel instructions. In this case, run the analysis with a target executable that contains only Intel® instructions. After you finish using the VTune Amplifier, you can use the assembler code or optimizing compiler options that provide the non-Intel instructions.
  • For Hardware event-based sampling analysis (EBS):
    • EBS analysis makes use of the on-chip Performance Monitoring Unit (PMU) and requires a genuine Intel® processor for collection. EBS analysis is supported on Intel® Pentium® M, Intel® Core™ microarchitecture and newer processors (for more precise details, see the list below).
    • EBS analysis is not supported on the Intel® Pentium® 4 processor family (Intel® NetBurst® MicroArchitecture) and non-Intel processors. However, the results collected with EBS can be analyzed using any system meeting the less restrictive general operation requirements.
  • The list of supported processors is constantly being extended. In general VTune Amplifier supports publicly launched Desktop, Mobile, Server and Embedded Processors listed at https://ark.intel.com/. For pre-release processor support please file a support request at Online Service Center (https://www.intel.com/supporttickets).

System memory requirements

At least 2GB of RAM

Disk Space Requirements

900MB free disk space required for all product features and all architectures

Software Requirements

For software requirements, please refer here.

Target FreeBSD* collection

For information on configuring the FreeBSD* collection and target setup please refer here.

What's new

Support for Latest Processors:

  • New Intel® processors including Intel® Xeon® Scalable Processor (code named Skylake-SP)

Issues and limitations

For information on issues and limitations please refer here.

Attributions

Attributions can be found here.

Disclaimer and Legal Information

Disclaimer and Legal information can be found here.

 
