Simple CUDA Acceleration Example and Runtime Comparison

Published on Aug. 16, 2019, 7:26 p.m.

In this example we run SciPy's built-in Gaussian (normal) probability density function against a CUDA-vectorized version and compare their performance. Note that for a small number of points (<100000) the built-in function outperforms the vectorized one, because of the overhead needed to transfer the data to and from the GPU. Data management can help with this issue (see Data Management).

Regular function exec. time (sec)    Vectorized function exec. time (sec)    Exec. time gain (sec)    Improvement (%)
10.98                                1.74                                    9.24                     630.73
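
As a quick check of how the gain and improvement columns relate to the two timings, the arithmetic can be sketched as follows. This assumes the improvement column is the ratio of regular to vectorized runtime in percent, which is what the script in this post prints; the variable names are illustrative only. The rounded timings give roughly 631 %, so the 630.73 in the table presumably comes from the unrounded measurements.

```python
# Timings copied from the table above (seconds).
t_regular = 10.98
t_vectorized = 1.74

# Absolute gain: seconds saved by the vectorized version.
gain_sec = t_regular - t_vectorized

# Ratio of regular to vectorized runtime, expressed as a percentage
# (assumed to be the definition behind the "Improvement (%)" column).
ratio_pct = 100 * t_regular / t_vectorized

print(f"gain: {gain_sec:.2f} s")
print(f"ratio: {ratio_pct:.2f} %")
```
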
# Our inputs are too small: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.
# Our calculation is too simple: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.
# We copy the data to and from the GPU: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.
# Our data types are larger than necessary: Our example uses int64 when we probably don't need it. Scalar code using 32-bit and 64-bit data types runs at basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than on 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the dtype attribute or use the ndarray.astype() method to pick 32-bit types when you need them.
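
The point about data types can be illustrated on the CPU alone. The sketch below shows NumPy's 64-bit default for floating-point arrays and the two ways mentioned above of getting 32-bit data; the array size and variable names are arbitrary choices for illustration.

```python
import numpy as np

# NumPy creates 64-bit floats unless told otherwise.
x64 = np.random.uniform(-3, 3, size=1_000_000)
print(x64.dtype)

# Option 1: convert an existing array with astype().
x32 = x64.astype(np.float32)

# Option 2: request the type at creation time with dtype=.
zeros32 = np.zeros(1_000_000, dtype=np.float32)

# The 32-bit array holds the same number of elements in half the bytes,
# which also halves the amount of data moved to and from the GPU.
print(x64.nbytes, x32.nbytes)
```

The script in this post follows the first option: it calls astype(np.float32) on the input array and wraps the scalar arguments in np.float32.
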
import math
import scipy.stats
import time
import numpy as np
from numba import vectorize

SQRT_2PI = np.float32((2*math.pi)**0.5)

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def gaussian_pdf(data_list, mu, sigma):
    return math.exp(-0.5 * ((data_list - mu) / sigma)**2) / (sigma * SQRT_2PI)

if __name__ == '__main__':
    x = np.random.uniform(-3, 3, size=100000000).astype(np.float32)
    mean = np.float32(0.0)
    std = np.float32(1.0)
    norm_pdf = scipy.stats.norm

    t_start = time.time()
    norm_pdf.pdf(x, loc=mean, scale=std)
    t_end_reg = time.time() - t_start
    print(f'The execution of a built in function took : {t_end_reg} seconds')

    t_start = time.time()
    gaussian_pdf(x, mean, std)
    t_end_vec = time.time() - t_start
    print(f'The execution of a vectorized function took : {t_end_vec} seconds')

    # Report which version was faster and by how many seconds.
    if t_end_reg - t_end_vec > 0:
        print(f'The vectorized version took {t_end_reg - t_end_vec} sec less')
    else:
        print(f'The regular version took {t_end_vec - t_end_reg} sec less')
    # Ratio of regular to vectorized runtime, expressed as a percentage.
    print(f'The overall improvement of the vectorized version in runtime is : {np.round((100 * t_end_reg)/t_end_vec, 3)} %')
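
Since the script only measures time, it can be useful to also confirm that the formula inside gaussian_pdf produces the expected values. The following is a minimal CPU-only sketch, assuming a hypothetical helper gaussian_pdf_reference that applies the same formula with NumPy; it is not part of the original script and does not need a GPU.

```python
import math
import numpy as np

SQRT_2PI = np.float32(math.sqrt(2 * math.pi))

def gaussian_pdf_reference(x, mu, sigma):
    """CPU reference using the same formula as the CUDA kernel (hypothetical helper)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * SQRT_2PI)

x = np.array([-1.0, 0.0, 1.0], dtype=np.float32)
ref = gaussian_pdf_reference(x, np.float32(0.0), np.float32(1.0))

# Known values of the standard normal density at -1, 0 and 1.
expected = [0.24197072, 0.39894228, 0.24197072]

# Use a tolerance suited to float32 rather than exact equality.
print(np.allclose(ref, expected, rtol=1e-5))
```

On a machine with a CUDA device, the array returned by gaussian_pdf(x, mean, std) could be compared against such a reference in the same way, again with a float32-sized tolerance.
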
