CUDALink on Multiple Devices
The functional and list-oriented characteristics of the core Wolfram Language allow CUDALink to provide immediate built-in data parallelism, distributing computations across the available GPUs by running them on the Wolfram Language's parallel kernels.
Introduction
First, load the CUDALink application.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-nldpuu
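Loading the application is done with Needs:

    Needs["CUDALink`"]
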
This launches as many worker kernels as there are devices.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-i7imm
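Using $CUDADeviceCount (described below) to launch one worker kernel per device:

    LaunchKernels[$CUDADeviceCount]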

$CUDADeviceCount gives the number of CUDA devices on the system.
$CUDADeviceCount | number of CUDA devices on system |
This loads CUDALink on all worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-b0ffin
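In code:

    ParallelNeeds["CUDALink`"]
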
CUDALink relies on existing Wolfram Language parallel computing capabilities to run on multiple GPUs. Throughout this section the following functions will be used.
ParallelNeeds | load a package into all parallel subkernels |
DistributeDefinitions | distribute definitions needed for parallel computations |
ParallelEvaluate | evaluate the input expression on all available parallel kernels and return the list of results obtained |
This sets the $CUDADevice variable on all kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-cykvw0
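One way to do this, assuming one kernel was launched per device so that kernel IDs can serve as device indices:

    ParallelEvaluate[$CUDADevice = $KernelID]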

CUDALink Functions
High-level CUDALink functions, such as those for image processing, linear algebra, and fast Fourier transforms, can be used on different kernels like any other Wolfram Language function. The only difference is that the $CUDADevice variable is set to the device on which the computation is performed.
Here you set the image names to be taken from the "TestImage" collection of ExampleData.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-cfb2ma
Distribute the variable imgNames to the worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-ehiijj
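A sketch of these two steps; taking all names in the "TestImage" collection is an assumption:

    imgNames = ExampleData["TestImage"];
    DistributeDefinitions[imgNames]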

Perform CUDAErosion on images taken from ExampleData.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-e8zffx
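A plausible form of the call, assuming an erosion radius of 3:

    ParallelMap[CUDAErosion[ExampleData[#], 3] &, imgNames]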

Notice the 2x speed improvement. Since these images are small and the data must be transferred to the worker kernels, you do not get the full 4x speedup.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-k8viod
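The comparison can be made with AbsoluteTiming; the exact benchmark is an assumption:

    AbsoluteTiming[Map[CUDAErosion[ExampleData[#], 3] &, imgNames];]
    AbsoluteTiming[ParallelMap[CUDAErosion[ExampleData[#], 3] &, imgNames];]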

In other cases, the time spent transferring the data is small compared to the time spent in computation. Here, you allocate 2000 random integer vectors.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-lu78hw
Map CUDAFold over the vectors on each device.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-4j5u1
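A sketch of the allocation and the parallel fold; the vector length is arbitrary:

    vecs = RandomInteger[100, {2000, 100000}];
    ParallelMap[CUDAFold[Plus, 0, #] &, vecs]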

Notice that there is now a better speedup.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-qemk7
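Again comparing the sequential and parallel timings:

    AbsoluteTiming[Map[CUDAFold[Plus, 0, #] &, vecs];]
    AbsoluteTiming[ParallelMap[CUDAFold[Plus, 0, #] &, vecs];]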

CUDALink Programming
Since a CUDAFunction is optimized and local to one GPU, it cannot be shared with worker kernels using DistributeDefinitions. This section describes an alternative way of programming the GPU.
Add Two
This defines basic CUDA code that adds 2 to each element of a vector.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-tk2q1q
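The source is along these lines; the variable name code is an assumption:

    code = "
        __global__ void addTwo(mint * arry, mint len) {
            int index = threadIdx.x + blockIdx.x*blockDim.x;
            if (index < len)
                arry[index] += 2;
        }";
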
This loads the CUDAFunction. Notice the use of SetDelayed (:=) in the assignment; this allows DistributeDefinitions to distribute all dependent variables in the CUDAFunctionLoad call, so the function is reloaded on each worker kernel.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-o49o6j
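A load call consistent with the above; the block size of 256 is an assumption:

    cudaFun := CUDAFunctionLoad[code, "addTwo", {{_Integer}, _Integer}, 256]
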
This sets the input parameters.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-3q5m6
This distributes the definitions to the worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-ghszwv
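For example, with hypothetical parameter values (DistributeDefinitions pulls in code as a dependency of cudaFun):

    len = 2^20;
    vec = RandomInteger[100, len];
    DistributeDefinitions[cudaFun, vec, len]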

This runs the CUDAFunction on each worker kernel using different CUDA devices.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-maik9c

This gathers the results, showing the first 20 elements of each.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-jjmntx
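Together, the run and the gather might look like the following; cudaFun returns its output arguments in a list, hence the First:

    res = ParallelEvaluate[First[cudaFun[vec, len]]];
    Take[#, 20] & /@ res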

Mandelbrot Set
This is the same CUDA code defined in other sections of the CUDALink documentation.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-d3qll1
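The kernel has the following general shape; this is a simplified sketch, not the exact source from the other sections:

    mandelbrotSrc = "
        __global__ void mandelbrot(mint * set, float zoom, mint width, mint height) {
            int x = threadIdx.x + blockIdx.x*blockDim.x;
            int y = threadIdx.y + blockIdx.y*blockDim.y;
            if (x < width && y < height) {
                float cx = (x - width/2.0f)/zoom - 0.5f, cy = (y - height/2.0f)/zoom;
                float zx = 0.0f, zy = 0.0f, tmp;
                int ii;
                for (ii = 0; ii < 100 && zx*zx + zy*zy < 4.0f; ii++) {
                    tmp = zx*zx - zy*zy + cx;
                    zy = 2.0f*zx*zy + cy;
                    zx = tmp;
                }
                set[x + y*width] = (ii == 100) ? 1 : 0;
            }
        }";
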
Here, you load the CUDAFunction.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-dzfw9v

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-b1ynen
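A load call consistent with the sketch above; the argument types and block dimensions are assumptions:

    mandelbrot := CUDAFunctionLoad[mandelbrotSrc, "mandelbrot",
        {{_Integer, _, "Output"}, "Float", _Integer, _Integer}, {16, 16}]
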
This shares the variables with the worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-dafk38

This runs the CUDAFunction on each worker kernel, each with a different zoom level, returning the result as a "Bit" image.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-eu0fsr
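A sketch of the distribution and the per-kernel launch; the image size and zoom values are arbitrary:

    width = 512; height = 512;
    DistributeDefinitions[mandelbrotSrc, mandelbrot, width, height];
    ParallelEvaluate[
        Module[{mem = CUDAMemoryAllocate[Integer, {height, width}], img},
            mandelbrot[mem, 150.0*$KernelID, width, height, {width, height}];
            img = Image[CUDAMemoryGet[mem], "Bit"];
            CUDAMemoryUnload[mem];
            img
        ]
    ]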

Random Number Generators
The Mersenne Twister is implemented in the following file.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-d84o29
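The path typically points into the CUDALink layout; the exact file name is an assumption:

    srcFile = FileNameJoin[{$CUDALinkPath, "SupportFiles", "mersenneTwister.cu"}]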

This loads the function into the Wolfram Language.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-u3efc
This sets the input variables for the CUDAFunction.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-dscs93
This distributes the mersenneTwister function and input parameters.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-hda44b
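A sketch of these three steps; the kernel name, argument types, and parameter values are all assumptions about the contents of the source file:

    mersenneTwister := CUDAFunctionLoad[{srcFile}, "MersenneTwister_kernel",
        {{"Float", _, "Output"}, {_Integer, _, "Input"}, _Integer}, 256];
    nPerRng = 5000; nRng = 4096;
    DistributeDefinitions[srcFile, mersenneTwister, nPerRng, nRng]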

This allocates the seed values. Note that the seed generation needs to be performed on each worker kernel so that the random streams are not correlated. The output memory is then allocated, the computation is performed, and the result is visualized.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-hdppql
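A sketch of the per-kernel run, with Histogram standing in for the visualization; generating the seeds inside ParallelEvaluate ensures each kernel gets different values:

    ParallelEvaluate[
        Module[{seeds, out, res},
            seeds = RandomInteger[{0, 2^31 - 1}, nRng];
            out = CUDAMemoryAllocate["Float", nRng*nPerRng];
            mersenneTwister[out, seeds, nPerRng];
            res = Histogram[CUDAMemoryGet[out]];
            CUDAMemoryUnload[out];
            res
        ]
    ]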

Memory
CUDAMemory is tied to both the kernel and device where it is loaded and thus cannot be distributed among worker kernels.
Load memory in the master kernel.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-vqpro

Then distribute the definition.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-buxkwz
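For example, loading a small list and distributing the resulting handle (the data is arbitrary):

    mem = CUDAMemoryLoad[Range[10]]
    DistributeDefinitions[mem]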

Distributed CUDAMemory cannot be operated on by worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-evopll
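For example, reading the distributed handle on the workers fails, since it refers to memory owned by the master kernel:

    ParallelEvaluate[CUDAMemoryGet[mem]]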

To load memory onto the worker kernels, use ParallelEvaluate.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-jkpq84
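For example, loading separate memory on each worker kernel:

    ParallelEvaluate[mem = CUDAMemoryLoad[Range[10]]]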

Further operations can be performed on the memory using ParallelEvaluate.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-f3lfir
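For example, reading the memory back on each worker:

    ParallelEvaluate[CUDAMemoryGet[mem]]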

Bandwidth
In some cases, the amount of time spent transferring the data dwarfs the time spent in computation.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-jhrlxc
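A hypothetical allocation; the name bigVecs and the sizes are assumptions:

    bigVecs = RandomInteger[100, {16, 10^6}];
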
Since the parallel version needs to share the large list with worker kernels, it takes considerably longer than the sequential version.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-j7hdo0
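A sketch of the parallel timing; sending bigVecs to the workers dominates the cost:

    AbsoluteTiming[ParallelMap[CUDAFold[Plus, 0, #] &, bigVecs];]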

The sequential version is faster, since no data needs to be transferred to the worker kernels.

https://wolfram.com/xid/0k1j0pobq27zyibwarns7oq-dln2ha
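And the sequential version:

    AbsoluteTiming[Map[CUDAFold[Plus, 0, #] &, bigVecs];]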
