Because GPUs are SIMD machines, exploiting CUDA's potential requires posing the problem in a SIMD manner. Computation that can be partitioned so that each thread computes one element independently is ideal for the GPU.
Some algorithms either cannot be written in parallel, or cannot be implemented on CUDA due to architectural constraints. In those cases, research is ongoing into alternative methods for performing those computations on the GPU.
In this section, some uses of CUDA programming inside the Wolfram Language are showcased. All the following examples use CUDAFunctionLoad, which allows you to load CUDA source, binaries, or libraries into the Wolfram Language.
The dot product of two vectors is a common operation in linear algebra. This implements a function that takes a set of vectors and gives the dot product of each.
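The data-parallel pattern behind this example can be sketched in plain C: each output element is the dot product of one vector pair and depends on no other element, so on the GPU each loop iteration below would become one CUDA thread. This is an illustrative sketch with made-up names, not the CUDALink kernel itself.

```c
#include <stddef.h>

/* Each output element i is the dot product of vectors a[i] and b[i],
   computable independently of every other element. On the GPU the outer
   loop body would be one thread; here it is an ordinary serial loop. */
void dot_products(const double *a, const double *b, double *out,
                  size_t numVectors, size_t dim) {
    for (size_t i = 0; i < numVectors; i++) {  /* one "thread" per pair */
        double sum = 0.0;
        for (size_t j = 0; j < dim; j++)
            sum += a[i * dim + j] * b[i * dim + j];
        out[i] = sum;
    }
}
```

Because the iterations share no state, the loop parallelizes trivially, which is exactly why this problem suits the GPU.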
This loads the function into the Wolfram Language.
Generate 50 random vectors.
This calls the function, returning the result.
The results agree with the Wolfram Language's output.
The convex hull is difficult to compute in parallel on the GPU, so this example takes the hybrid approach to GPU programming: computation that makes sense on the GPU is done there, and the rest is done on the CPU.
This convex hull implementation is a textbook implementation of Andrew's algorithm and is designed to be simple, not efficient.
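For reference, the serial core of Andrew's monotone-chain algorithm can be sketched in C. This is a minimal textbook version for orientation, not the hybrid Wolfram Language/CUDA implementation discussed below; all names are illustrative.

```c
#include <stdlib.h>

typedef struct { double x, y; } Point;

/* Cross product of (b-a) and (c-a): positive if c lies left of ray a->b. */
static double cross(Point a, Point b, Point c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

static int cmpPoint(const void *p, const void *q) {
    const Point *u = p, *v = q;
    if (u->x != v->x) return u->x < v->x ? -1 : 1;
    if (u->y != v->y) return u->y < v->y ? -1 : 1;
    return 0;
}

/* Andrew's monotone chain: sorts the points, builds the lower and upper
   chains, writes the hull into `hull` (capacity >= 2n), and returns the
   number of hull vertices. Simple, not efficient. */
int convexHull(Point *pts, int n, Point *hull) {
    qsort(pts, n, sizeof(Point), cmpPoint);
    int k = 0;
    for (int i = 0; i < n; i++) {                      /* lower chain */
        while (k >= 2 && cross(hull[k-2], hull[k-1], pts[i]) <= 0) k--;
        hull[k++] = pts[i];
    }
    for (int i = n - 2, lower = k + 1; i >= 0; i--) {  /* upper chain */
        while (k >= lower && cross(hull[k-2], hull[k-1], pts[i]) <= 0) k--;
        hull[k++] = pts[i];
    }
    return k - 1;  /* last point repeats the first */
}
```

The sort and the side classification are the parts that map naturally onto the GPU; the chain construction stays serial on the CPU.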
Define the LeftOf predicate in the Wolfram Language.
This defines a function that splits the point set to be either above or below a given line connecting the extreme points.
This loads the above function as a CUDA function.
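The split step above can be sketched in C: the sign of a cross product decides which side of the line through the extreme points each point falls on, and each point is classified independently (one thread per point on the GPU). The names here are illustrative, not the actual kernel's.

```c
typedef struct { double x, y; } Pt;

/* LeftOf predicate: is p strictly left of the directed line a -> b?
   Decided by the sign of the cross product (b-a) x (p-a). */
int leftOf(Pt a, Pt b, Pt p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x) > 0.0;
}

/* Split step: side[i] is 1 if pts[i] lies above the line a-b, else 0.
   Each iteration is independent, so on the GPU each point gets a thread. */
void partitionPoints(const Pt *pts, int n, Pt a, Pt b, int *side) {
    for (int i = 0; i < n; i++)
        side[i] = leftOf(a, b, pts[i]);
}
```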
Define a Wolfram Language function that takes two split point sets and finds the convex hull for them.
This calls the above function. Note that the list is sorted with CUDASort before being processed by partitionPts.
To test, create 20,000 uniformly distributed random points.
This computes the hull.
This visualizes the result. Lines are drawn between the hull points.
The above algorithm handles the extreme case where all or most points lie on the hull. This generates uniformly distributed points on the unit circle.
This computes the hull points.
This visualizes the hull.
The above is a prime example of combining the Wolfram Language with CUDA programming to parallelize parts of an algorithm that would otherwise be written entirely in serial code.
Random Number Generation
Many algorithms, ranging from ray tracing to PDE solving, require random numbers as input. Generating them is difficult on many-core systems, however, because every core starts with the same state as every other core. To avoid this, parallel random number generators usually seed themselves with entropy sources such as the time of day, but such calls are not available in CUDA.
The following section gives three classes of algorithms for generating random numbers. The first is uniform random number generators (both pseudorandom and quasi-random number generators are showcased). The second is generators that exploit the uniform distribution of a hashing function to generate random numbers. The third is normally distributed random number generators.
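The hashing-function class can be illustrated with a short C sketch. A well-mixing integer hash (the one below is commonly attributed to Thomas Wang and is a popular choice in GPU code) turns a (seed, thread index) pair into a value that behaves like an independent uniform sample, with no state shared between threads; this sidesteps the seeding problem described above. The helper names and the seed-mixing constant are illustrative assumptions.

```c
#include <stdint.h>

/* Wang's 32-bit integer hash: a stateless mixer with good avalanche
   behavior, often used as a per-thread random source on GPUs. */
uint32_t wangHash(uint32_t x) {
    x = (x ^ 61u) ^ (x >> 16);
    x *= 9u;
    x ^= x >> 4;
    x *= 0x27d4eb2du;
    x ^= x >> 15;
    return x;
}

/* Uniform value in [0,1) for a given seed and thread index: each thread
   hashes its own index, so no generator state is shared. */
double hashUniform(uint32_t seed, uint32_t threadId) {
    return wangHash(seed * 2654435761u + threadId) / 4294967296.0;
}
```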
Pseudorandom Number Generators
Pseudorandom number generators are deterministic algorithms that generate numbers that appear to be random. They rely on the seed value to generate further random numbers.
In this section, a simple linear congruential random number generator (the Park–Miller algorithm) is shown, along with a more complex Mersenne Twister.
The Park–Miller random number generator is defined by the following recurrence equation:

x_(k+1) = (a x_k) mod m
It can be implemented easily in the Wolfram Language.
Here, the common choices a = 7^5 = 16807 and m = 2^31 - 1 are used. Using NestList, you can generate a list of 1000 numbers and plot them.
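The same recurrence can be sketched in C; this is a minimal serial version with illustrative names, using 64-bit arithmetic so the product a·x does not overflow.

```c
#include <stdint.h>

/* Minimal Park-Miller generator: x_(k+1) = (a * x_k) mod m with the
   classic constants a = 16807 and m = 2^31 - 1. */
static uint64_t pmState = 1;

void parkMillerSeed(uint64_t seed) { pmState = seed; }

uint32_t parkMillerNext(void) {
    pmState = (16807ULL * pmState) % 2147483647ULL;
    return (uint32_t)pmState;
}
```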
Here is the timing to generate 10 million numbers.
An alternative is to use the Method option in SeedRandom; this can be used in the same manner as before.
Compared to the Wolfram Language implementation, this is around 300 times faster.
The CUDA implementation is similar to the one written in the Wolfram Language. The implementation is distributed along with CUDALink, and the location is shown below.
This loads the CUDA function into the Wolfram Language.
If you measure the timing, you will notice that it is twice as fast as the Wolfram Language's built-in method, and 600 times faster than the pure Wolfram Language implementation.
Timing a compiled version shows performance similar to CUDA's. This generates C code from a Compile statement with no integer overflow detection.
This finds the timing. Notice that there is little difference.
If you ignore the time it takes for memory allocation (and you can rightly do so in this case, since generated random numbers are usually reused on the GPU), you notice a 10× speed improvement.
The Mersenne Twister is based on a twisted generalized feedback shift register. Because the implementation is simple, it maps well to the GPU. The following file contains the implementation.
This loads the "MersenneTwister" function from the file.
Here, the Mersenne Twister input parameters are defined. The twister requires seed values. Those values can be computed offline and stored in a file, or they can be generated by the Wolfram Language. The latter is shown here.
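For orientation, here is a serial C sketch of the reference MT19937 algorithm (Matsumoto and Nishimura's constants) showing what seeding and extraction involve. The CUDALink kernel follows the same recurrence but runs many twisters in parallel, each with its own seed; the names here are illustrative, not CUDALink's.

```c
#include <stdint.h>

#define MT_N 624
#define MT_M 397

static uint32_t mt[MT_N];
static int mtIndex = MT_N;

/* Seed the 624-word state with the reference initializer. */
void mtSeed(uint32_t seed) {
    mt[0] = seed;
    for (int i = 1; i < MT_N; i++)
        mt[i] = 1812433253u * (mt[i-1] ^ (mt[i-1] >> 30)) + (uint32_t)i;
    mtIndex = MT_N;
}

uint32_t mtNext(void) {
    if (mtIndex >= MT_N) {   /* twist: regenerate the whole state block */
        for (int i = 0; i < MT_N; i++) {
            uint32_t y = (mt[i] & 0x80000000u) | (mt[(i+1) % MT_N] & 0x7fffffffu);
            mt[i] = mt[(i + MT_M) % MT_N] ^ (y >> 1);
            if (y & 1u) mt[i] ^= 0x9908b0dfu;
        }
        mtIndex = 0;
    }
    uint32_t y = mt[mtIndex++];
    y ^= y >> 11;                    /* tempering */
    y ^= (y << 7)  & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}
```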
This allocates the output memory. Since the output will be overwritten, there is no need to load memory from the Wolfram Language onto the GPU.
A Wolfram Language function can be written that takes the number of random numbers to be generated as input, performs the required allocations and setting of parameters, and returns the random output memory.
This generates a plot of the first 10,000 elements.
The following measures the time it takes for the random number generator to generate 100 million numbers.
This is on par with the Wolfram Language's random number generator timings.
Since random numbers typically seed other computations, a user may see a performance increase in the overall algorithm even if the Wolfram Language's timings are superior to the CUDA implementation's.
Quasi-Random Number Generators
This section describes quasi-random number generators. Unlike pseudorandom number generators, these sequences make no attempt to appear statistically random; they have an underlying structure that is sometimes useful in numerical methods. For instance, these sequences typically provide faster convergence in multidimensional Monte Carlo integration.
The Halton sequence generates quasi-random numbers that are uniform on the unit interval. While the code works in arbitrary dimensions, only the van der Corput sequence, its one-dimensional case, is discussed. This is adequate for comparison.
The resulting numbers of the Halton (or van der Corput) sequence are deterministic but have low discrepancy over the unit interval. Because they fill the space evenly, in some applications, such as Monte Carlo integration, they are preferred to pseudorandom number generators.
For a given integer n with base b, write the digit expansion:

n = d_0 + d_1 b + d_2 b^2 + ... + d_k b^k

The one-dimensional Halton (or van der Corput) value in base b is then:

h(n, b) = d_0/b + d_1/b^2 + ... + d_k/b^(k+1)

The sequence of length N is then written as:

h(1, b), h(2, b), ..., h(N, b)

Given a number in base-b representation d_k ... d_1 d_0, the van der Corput sequence mirrors the number across the decimal point, so that its sequence value is 0.d_0 d_1 ... d_k.
In the Wolfram Language, you can find the sequence using IntegerDigits.
Setting the base to 2, you can calculate the first 1000 elements in the sequence.
This plots the result; notice how it fills the space uniformly.
A property of low-discrepancy sequences is that each new element falls into the gaps left by the previous elements. This can be shown with Manipulate.
For the CUDA implementation, you have to implement your own version of IntegerDigits, but this is not difficult. First, load the implementation source code.
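The digit-extraction idea behind that replacement for IntegerDigits can be sketched in a few lines of C: peel off the base-b digits of n from the least-significant end and accumulate each into the next fractional slot, which mirrors the number across the decimal point exactly as in the definition above. This is an illustrative sketch, not the distributed source.

```c
/* Radical inverse: digit d_i of n in base b contributes d_i * b^-(i+1),
   so the loop below computes h(n, b) = 0.d_0 d_1 ... d_k in base b. */
double vanDerCorput(unsigned int n, unsigned int base) {
    double value = 0.0;
    double invBase = 1.0 / base;
    double factor = invBase;          /* weight of the current digit */
    while (n > 0) {
        value += (n % base) * factor; /* next digit, next fractional slot */
        n /= base;
        factor *= invBase;
    }
    return value;
}
```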
Random numbers have applications in many areas. Here, two main applications are presented: Monte Carlo integration (approximating pi and integrating an arbitrary function) and simulating Brownian motion.
The value of pi can be approximated using Monte Carlo integration. First, uniformly distributed random points are generated in the unit square. Then the number of points inside the first quadrant of the unit circle is counted and divided by the total number of points. This gives pi/4.
This implements reduction, counting the number of points in a unit circle.
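The steps above can be sketched serially in C. A simple Park-Miller generator supplies the uniform samples, the loop counts points inside the quarter circle (this count is the reduction that the CUDA kernel parallelizes), and the fraction is multiplied by 4 to recover pi. Names and the choice of generator are illustrative assumptions.

```c
#include <stdint.h>

/* Monte Carlo estimate of pi: sample points in the unit square, count
   those with x^2 + y^2 <= 1 (the quarter circle), multiply the fraction
   by 4. The count is the reduction step; here it is a serial loop. */
double estimatePi(uint32_t samples, uint64_t seed) {
    uint64_t state = seed;
    uint32_t inside = 0;
    for (uint32_t i = 0; i < samples; i++) {
        state = (16807ULL * state) % 2147483647ULL;  /* Park-Miller step */
        double x = (double)state / 2147483647.0;
        state = (16807ULL * state) % 2147483647ULL;
        double y = (double)state / 2147483647.0;
        if (x * x + y * y <= 1.0) inside++;
    }
    return 4.0 * inside / samples;
}
```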