clFFT Pre-callback – A Faster Way to Pre-Process Input Data

The math library group at AMD continuously looks for ways to improve and optimize its libraries. In his blog on the latest performance improvements in the clFFT library, Bragadeesh Natarajan described the speedups in clFFT 2.6.1 over previous versions, as well as its competitive performance against NVIDIA cuFFT. Continuing that trend, I am excited to share a new feature we are adding to clFFT v2.7, called pre-callback. The clFFT library is one component of the AMD Compute Library (ACL).

(Editor’s note: The Beta 2 release of the AMD Compute Libraries is now available in GitHub. Check out all of the new features and improvements in our ACL 1.0 Beta 2 release announcement blog here.)

Pre-callback Introduction

The pre-callback feature of clFFT gives you the ability to invoke user-provided OpenCL™ inline functions to pre-process data from within the FFT kernel. The inline OpenCL callback function is passed as a string to the library. It is then incorporated into the generated FFT kernel. This eliminates the need for an additional kernel launch to carry out the pre-processing tasks.

To get a better understanding of the pre-callback feature, let’s consider data format conversion as an example. In many image processing applications, the input data might be in a format that clFFT does not support. For example, it could be stored as 24-bit integers. Without the pre-callback feature, you would have to launch a separate pre-processing kernel to convert the values to 32-bit floating point before running the FFT kernel. With the new pre-callback feature of clFFT, you can instead fold the pre-processing logic into the FFT kernel, improving performance by avoiding the overhead of an additional kernel launch.

So let’s look at both approaches: a separate kernel for the format conversion pre-processing, and the same conversion using the clFFT pre-callback. Then we’ll look at a performance comparison. For brevity, the source code is kept high level. You can find the complete implementation on the clFFT GitHub page.

Pre-processing data without pre-callback

The workflow without pre-callbacks would be as follows.

Declare two buffers: one to store the original 24-bit data and the other to store the 32-bit data after conversion.

//24-bit input buffer
cl_mem in24bitfftbuffer = clCreateBuffer( … );
…
…
//Initialize 24-bit data
….
//32-bit input buffer
cl_mem in32bitfftbuffer = clCreateBuffer( … );

Write the format conversion pre-processing kernel. This kernel takes 24-bit data as input and converts it to 32-bit floating point.

const char* source =
"typedef unsigned char uint24_t[3]; \n"
"__kernel void convert24To32bit (__global void *input24bit, __global void *output32bit) \n"
"{ \n"
"    uint inoffset = get_global_id(0); \n"
"    __global uint24_t* inData = (__global uint24_t*) input24bit; \n"
"    float val = inData[inoffset][0] << 16 | inData[inoffset][1] << 8 | inData[inoffset][2]; \n"
"    *((__global float*)output32bit + inoffset) = val; \n"
"} \n";
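When checking the output of a conversion kernel like this, it can help to have a plain C reference on the host. The following is a minimal sketch, not part of clFFT or of the original sample: the function name convert24_to_float_ref is hypothetical, and it assumes the same byte order as the kernel (most significant byte first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same layout as the kernel's uint24_t: three bytes, most significant first. */
typedef unsigned char uint24_t[3];

/* Hypothetical host-side reference: convert n packed 24-bit values to float. */
static void convert24_to_float_ref(const uint24_t *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i) {
        uint32_t v = ((uint32_t)in[i][0] << 16) |
                     ((uint32_t)in[i][1] << 8)  |
                      (uint32_t)in[i][2];
        /* Every value below 2^24 is exactly representable as a 32-bit float. */
        out[i] = (float)v;
    }
}
```

Comparing the buffer read back from the device against this reference, element by element, is a quick way to confirm that the byte order and indexing in the kernel are what you intended.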

Compile the pre-processing kernel and launch it.

cl_program program = clCreateProgramWithSource( context, 1, &source, … );
//Build the program before creating the kernel
clBuildProgram( program, … );
cl_kernel kernel = clCreateKernel( program, "convert24To32bit", … );
…..
//Set Kernel arguments
…..
status = clEnqueueNDRangeKernel( …, kernel, 1, NULL, … );

Bake the clFFT Plan.

clfftBakePlan( … );

This API prepares the clFFT plan for execution. It generates OpenCL kernels based on the arguments passed and compiles them.

Pass the converted output in32bitfftbuffer to the clFFT Transform API.

clfftEnqueueTransform( plan_handle, CLFFT_FORWARD, 1, &commandQueue, 0, NULL, NULL, &in32bitfftbuffer, &outfftbuffer, … );

Read the FFT output outfftbuffer.

clEnqueueReadBuffer( commandQueue, outfftbuffer, CL_TRUE, 0, out_size_of_buffers, &output[ 0 ], 0, NULL, NULL );


Pre-processing data using clFFT pre-callback

Now, let’s take a look at how we can implement the format conversion pre-processing logic using clFFT pre-callbacks. The workflow would be as follows.

Declare a buffer to store the 24-bit input data. Notice the difference from the first step of the earlier approach, where we needed two buffers.

//24-bit input buffer
cl_mem infftbuffer = clCreateBuffer( … );
…
…
//Initialize 24-bit data
….

Write the pre-callback function as an inline OpenCL function and store it in a string.

//Precallback inline OpenCL function
const char* precallbackstr =
"typedef unsigned char uint24_t[3]; \n"
"float convert24To32bit(__global void* in, uint inoffset, __global void* userdata) \n"
"{ \n"
"    __global uint24_t* inData = (__global uint24_t*)in; \n"
"    float val = inData[inoffset][0] << 16 | inData[inoffset][1] << 8 | inData[inoffset][2]; \n"
"    return val; \n"
"} \n";

clFFT expects the user-provided pre-callback function to be of a specific prototype. I’ll cover some details on the expected function prototype later in this blog.

clfftSetPlanCallback(plan_handle, "convert24To32bit", precallbackstr, 0, PRECALLBACK, NULL, 0);

This call registers the callback with the plan and is essential to using the pre-callback feature. The library uses the arguments passed here, including the callback function string, to stitch the pre-callback code into the generated FFT kernel. The arguments for clfftSetPlanCallback are, in order:

clFFT plan handle
Name of the callback function: "convert24To32bit" in the call above
Callback function in string form: precallbackstr in the call above
Local memory size, if the callback function needs any (optional)
Type of callback. This is an enum; the currently supported value is ‘PRECALLBACK’
Supplementary user data, if any, used by the callback function
Number of user data buffers

Bake the clFFT Plan.

clfftBakePlan( … );

In this case, clFFT inserts the callback code into the main FFT kernel during the bake plan step and compiles it. If compilation fails because of a syntax error or an incompatible callback function prototype, the failure is reported to the user.

Execute FFT.

clfftEnqueueTransform( plan_handle, CLFFT_FORWARD, 1, &commandQueue, 0, NULL, NULL, &infftbuffer, &outfftbuffer, … );

Read the FFT output outfftbuffer.

clEnqueueReadBuffer( commandQueue, outfftbuffer, CL_TRUE, 0, out_size_of_buffers, &output[ 0 ], 0, NULL, NULL );

As you can see from these steps, all you have to do is pass the required pre-processing callback function wrapped in a string to the library. The task of invoking the callback is handled inside the library.

Performance Comparison

The following chart shows the speedup achieved using clFFT pre-callbacks.

Fig 1: The clFFT pre-callback feature results in better performance.

The chart shows the speedup (old time/new time) of the pre-callback version over the version that uses a separate pre-processing kernel. We ran Real to Complex transforms across a range of sizes, up to 128M elements. We observed a typical speedup of 1.5x to 1.6x using the pre-callback approach, with a couple of outliers. We ran the test on Ubuntu 14.04 LTS Linux64, with AMD FirePro™ driver version 14.502, running on an AMD FirePro™ W9100 Professional Graphics card, with an AMD FX™ 4300 CPU and 8GB RAM.
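Part of the benefit comes from reduced global memory traffic on the input side. The sketch below is a rough back-of-the-envelope model, not a measurement: it counts only the bytes read and written before the FFT itself starts, assumes 3 bytes per packed 24-bit sample and 4 bytes per float, and ignores caching, the FFT output, and any multi-kernel passes. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Input-side bytes moved when a separate conversion kernel is used. */
static size_t input_bytes_separate_kernel(size_t n)
{
    return n * 3   /* conversion kernel reads packed 24-bit samples */
         + n * 4   /* conversion kernel writes 32-bit floats        */
         + n * 4;  /* FFT kernel reads the 32-bit floats back       */
}

/* Input-side bytes moved when the conversion runs as a pre-callback. */
static size_t input_bytes_precallback(size_t n)
{
    return n * 3;  /* FFT kernel reads packed 24-bit samples directly */
}

/* Quick self-check of the arithmetic for 1024 samples. */
static void check_traffic_model(void)
{
    assert(input_bytes_separate_kernel(1024) == 11264);
    assert(input_bytes_precallback(1024) == 3072);
}
```

Under these assumptions the separate-kernel path moves 11 bytes per sample on the input side versus 3 bytes for the pre-callback path. This model is not meant to predict the measured speedup, which also depends on kernel launch overhead and on the FFT work itself.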

Pre-callback function prototype

clFFT expects the user-provided pre-callback function to be of a specific prototype, depending on the type of transform (real/complex) and on whether local memory is used. As an example, consider the pre-callback prototypes for a Real to Complex FFT:

// Precallback function without local memory usage
float <precallback_func> ( __global void *input, uint inoffset, __global void *userdata)

// Precallback function with local memory usage
float <precallback_func> ( __global void *input, uint inoffset, __global void *userdata, __local void *localmem)

The input parameter is the base pointer of the input buffer. The inoffset parameter is the index of the current element in the input buffer. The userdata pointer is useful for passing any supplementary data to the callback function (e.g. convolution filter data or any scalar value). The userdata can be of any custom data type or structure, in which case you have to declare the custom data type and include the declaration along with the callback function string. The format conversion example above does the same thing for its input type: uint24_t is declared before the callback function in the string. The localmem parameter is a pointer to local memory. This memory is allocated by the library based on the size you specify in the clfftSetPlanCallback API call, subject to local memory availability.
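To make the userdata parameter more concrete, here is a small sketch of a callback that converts a 24-bit sample and then scales it by a factor read from userdata. The function name scale24To32bit and the convention that userdata holds a single float are assumptions made for this illustration, not something clFFT prescribes. The two shim definitions at the top are only a host-side testing convenience: they let the same function body compile as plain C so the logic can be checked without a GPU. In a real callback string you would leave them out, since OpenCL C already provides __global and uint.

```c
#include <assert.h>

/* Host-side shims (testing convenience only; omit in the OpenCL string). */
#define __global
typedef unsigned int uint;

typedef unsigned char uint24_t[3];

/* Hypothetical pre-callback: convert a 24-bit sample to float and scale it
   by the single float stored in userdata. */
float scale24To32bit(__global void *in, uint inoffset, __global void *userdata)
{
    __global uint24_t *inData = (__global uint24_t *)in;
    float scale = *(__global float *)userdata;
    float val = (float)(inData[inoffset][0] << 16 |
                        inData[inoffset][1] << 8  |
                        inData[inoffset][2]);
    return val * scale;
}

/* Host-side check of the callback logic. */
static void check_scale_callback(void)
{
    unsigned char raw[2][3] = { {0, 0, 2}, {0, 1, 0} };
    float half = 0.5f;
    assert(scale24To32bit(raw, 0, &half) == 1.0f);    /* 2 * 0.5   */
    assert(scale24To32bit(raw, 1, &half) == 128.0f);  /* 256 * 0.5 */
}
```

On the device, the scale factor would live in a buffer handed to the library through the user data arguments of clfftSetPlanCallback.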

The complete list of compatible pre-callback function prototypes is available on the clFFT Library API documentation page. You must write the callback function in adherence to the expected prototype. clFFT treats the value returned by a pre-callback function as the new value of the input at the index given by the inoffset argument. clFFT may compute a given FFT with multiple kernels. However, it invokes the pre-callback function only once for each point in the input.
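One way to picture these semantics is a host-side model in which the transform never reads the input buffer directly, but always goes through the callback. The sketch below is only such a model and is not how clFFT is implemented: the names precallback_fn, convert24, passthrough and dc_and_nyquist are hypothetical, and as a stand-in for a full FFT it computes just two bins of a real DFT, the DC bin X[0] (the sum of the samples) and the Nyquist bin X[n/2] (the alternating sum), which need no trigonometry.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uint24_t[3];

/* Callback type for this host-side model (not a clFFT type). */
typedef float (*precallback_fn)(const void *input, unsigned int inoffset,
                                const void *userdata);

/* Pre-callback that converts a packed 24-bit sample to float. */
static float convert24(const void *input, unsigned int inoffset,
                       const void *userdata)
{
    const uint24_t *in = (const uint24_t *)input;
    (void)userdata;
    return (float)(in[inoffset][0] << 16 | in[inoffset][1] << 8 | in[inoffset][2]);
}

/* Identity callback for input that is already float. */
static float passthrough(const void *input, unsigned int inoffset,
                         const void *userdata)
{
    (void)userdata;
    return ((const float *)input)[inoffset];
}

/* Stand-in for the transform: reads each input point through cb exactly once
   and accumulates the DC and Nyquist bins of a real DFT of even length n. */
static void dc_and_nyquist(const void *input, size_t n, precallback_fn cb,
                           const void *userdata, float *dc, float *nyquist)
{
    float sum = 0.0f, alt = 0.0f;
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; ++i) {
        float x = cb(input, (unsigned int)i, userdata);  /* value the transform sees */
        sum += x;
        alt += (i % 2 == 0) ? x : -x;
    }
    *dc = sum;
    *nyquist = alt;
}
```

Calling dc_and_nyquist on the raw 24-bit buffer with convert24 gives the same result as calling it on a pre-converted float buffer with passthrough, which is the equivalence the pre-callback feature relies on.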

Conclusion

Many applications that use FFT need to pre-process data before performing the transform. The clFFT pre-callback feature gives you the ability to invoke user-provided callback functions to pre-process data directly within the FFT kernel. This avoids the overhead of a separate kernel launch. It also saves memory bandwidth, because the input pre-processing is folded into the FFT kernels, which improves overall application performance.


Pradeep Rao is a Senior Member of Technical Staff in the Developer Solutions Team at AMD. Links to third party sites, and references to third party trademarks, are provided for convenience and illustrative purposes only. Unless explicitly stated, AMD is not responsible for the contents of such links, and no third party endorsement of AMD or any of its products is implied.

OpenCL is a trademark of Apple Inc. used by permission by Khronos.