Overview of the system
A recent trend in high performance computing is
the use of graphics cards as powerful, low-cost coprocessors for the
CPU. This approach is called general-purpose computing on graphics
processing units (GPGPU).
The nVidia TESLA Cluster of the TU Berlin has 16 nodes with two Intel Xeon X5550 (2.66 GHz) processors each and 24 GB RAM.
The GPUs are the following:
- 14 nodes have two nVidia TESLA C1060 cards each. Each C1060 has 240 computing units which work in parallel. This results in a theoretical computing power per node of 1866 GFLOPS in single precision arithmetic or 156 GFLOPS in double precision arithmetic on the GPUs. Each card has 4096 MB RAM.
- 2 nodes have four nVidia GTX 295 cards each. This results in a theoretical computing power per node of 7152 GFLOPS in single precision arithmetic or 596 GFLOPS in double precision arithmetic on the GPUs. Each card has 1792 MB RAM.
All nodes have a 300 GB hard disk at 10,000 RPM and are connected via InfiniBand QDR.
The cluster can be accessed via the usual login
nodes (cluster.math.tu-berlin.de) or the head node of the GPU cluster
(cluster-g). The hardware of cluster-g is identical to the hardware of the GPU nodes. After loading the CUDA module you can call
$ gpurun program
and test a CUDA program for debugging purposes without the need to use the batch system. By calling
module add cuda-3.2
you can load the CUDA module and use the CUDA development environment.
Jobs are submitted with "qsub -l gpus=X ...", where X is the number of GPUs needed. On the C1060 nodes 2 GPU processors are available, while on the GTX 295 nodes 8 GPU processors are available.
To develop code for the GPUs, CUDA as well as OpenCL can be used.
Furthermore, there are libraries for specific tasks which can be used
directly from C or Fortran. The Portland compilers can automatically run
parts of the code on the GPU.
More on these options below.
CUDA
The CUDA API was developed by nVidia and simplifies the development of code for
the GPUs with C and Fortran.
The CUDA website  hosts useful documentation like
- CUDA Programming Guide 
- CUDA Best Practices Guide 
- CUDA Reference Manual 
The CUDA SDK contains examples from the following domains:
- Parallel bitonic sort
- Matrix multiplication
- Matrix transpose
- Performance profiling using timers
- Parallel prefix sum (scan) of large arrays
- Image convolution
- 1D DWT using Haar wavelet
- OpenGL and Direct3D graphics interoperation examples
- CUDA BLAS and FFT library usage examples
- CPU-GPU C- and C++-code integration
- Binomial Option Pricing
- Black-Scholes Option Pricing
- Monte-Carlo Option Pricing
- Parallel Mersenne Twister (random number generation)
- Parallel Histogram
- Image Denoising
- Sobel Edge Detection Filter
Thrust is a high-level C++ API written in STL style, which simplifies development with CUDA even further.
OpenCL
OpenCL is a platform-independent alternative to CUDA. It is still in an early stage of development.
A first version of the OpenCL development tools is available on the cluster by loading the module:
module add opencl-1.0
First steps with CUDA
The TESLA and GTX 295 cards have a large number of computing units which are capable of processing data in parallel. CUDA allows the programmer to write C functions, so-called kernels, which are executed in parallel in N so-called CUDA threads. To turn a C function into a kernel, it has to be annotated with the __global__ attribute. The CUDA compiler interprets this attribute and takes care of the execution on the GPU.
A thread ID serves to identify the threads. This allows you to partition the input vector or matrix and to assign these partitions to well-defined threads.
The input data are partitioned into a one- or two-dimensional grid. Each block of this grid is identified by one- or two-dimensional coordinates, the so-called block index, and is assigned to a processor core of the GPU. The data of a block are in turn partitioned among the threads of a one-, two- or three-dimensional thread block, where each coordinate corresponds to one thread. A block can contain up to 512 threads.
The memory model of CUDA distinguishes between host memory, that is, the normal RAM of the computer, and device memory, that is, the memory of the GPU. The main program copies input data from host memory to device memory and copies results back from device memory to host memory. For this purpose there exist very simple copy functions. Other memory types like texture memory, thread-local memory or thread-block-local memory will not be explained here; they are covered in the CUDA tutorial.
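In outline, the allocation and copy functions mentioned above are cudaMalloc, cudaMemcpy and cudaFree (the element count here is chosen for illustration):

```cuda
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;       /* number of elements, chosen for illustration */
    float host[1024];
    float *dev;

    cudaMalloc((void **)&dev, n * sizeof(float));                     /* device memory    */
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice); /* host -> device   */
    /* ... launch kernels working on dev ... */
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost); /* device -> host   */
    cudaFree(dev);
    return 0;
}
```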
The following example adds two Nx1 vectors. For this, the vector is partitioned into blocks with 256 threads each. Each thread has the task of adding the numbers of the vectors at one specific coordinate. The coordinate results from the block index and the thread index within the block.
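The example program is not reproduced in this page's source; a minimal sketch of such a program (vector length and sample values chosen here for illustration) could look as follows:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N 1024          /* vector length, chosen for illustration */
#define THREADS 256     /* threads per block, as described above  */

/* Kernel: each thread adds the vector elements at its global coordinate. */
__global__ void addvectors(const float *a, const float *b, float *c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* block index and thread index */
    if (i < N)
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[N], b[N], c[N];
    float *da, *db, *dc;
    int i;

    for (i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    /* allocate device memory and copy the input from host to device */
    cudaMalloc((void **)&da, N * sizeof(float));
    cudaMalloc((void **)&db, N * sizeof(float));
    cudaMalloc((void **)&dc, N * sizeof(float));
    cudaMemcpy(da, a, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, N * sizeof(float), cudaMemcpyHostToDevice);

    /* launch one thread per vector element, 256 threads per block */
    addvectors<<<(N + THREADS - 1) / THREADS, THREADS>>>(da, db, dc);

    /* copy the result back from device to host */
    cudaMemcpy(c, dc, N * sizeof(float), cudaMemcpyDeviceToHost);

    printf("c[10] = %f\n", c[10]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```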
The program is saved with the name ending ".cu" and compiled with "nvcc -o addvectors addvectors.cu". Then it can be run on the head node with "gpurun ./addvectors". The wrapper "gpurun" sets the necessary execution rights on the GPU.
Attention: On direct execution without "gpurun" the program might run without error message, but nothing will be computed!
You can submit the job to the batch system with "qsub -l gpus=1 ./addvectors".
For more details about the programming, see the CUDA Programming Guide.
With the Thrust abstraction the example above can be simplified as follows:
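The Thrust version is not reproduced in this page's source; a sketch of the same vector addition with Thrust (again with illustrative length and values) could look like this:

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <iostream>

int main(void)
{
    const int N = 1024;  /* vector length, chosen for illustration */

    thrust::host_vector<float> a(N), b(N);
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    /* copying a host_vector to a device_vector transfers the data to the GPU */
    thrust::device_vector<float> da = a, db = b, dc(N);

    /* c = a + b, executed element-wise in parallel on the GPU */
    thrust::transform(da.begin(), da.end(), db.begin(), dc.begin(),
                      thrust::plus<float>());

    /* reading an element of a device_vector copies it back to the host */
    std::cout << "c[10] = " << dc[10] << std::endl;
    return 0;
}
```

Note that memory allocation, the host/device copies and the block partitioning are all handled by Thrust.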
Thrust has many functions for searching, copying, reducing, sorting and transforming of vectors.
cuBLAS and cuFFT
There is an example of how to work very simply with the cuBLAS library for linear algebra. This library hides the usage of the GPUs transparently from the programmer. The cuBLAS reference and a website about linear algebra from nVidia have more details. cuBLAS is supported for C++ as well as Fortran 77.
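As a sketch of how a cuBLAS program might look, the following uses the legacy cuBLAS API of the CUDA 3.x era (compiled with "nvcc -o saxpy saxpy.cu -lcublas"; vector length and values are illustrative):

```cuda
#include <stdio.h>
#include <cublas.h>

int main(void)
{
    const int n = 8;            /* vector length, chosen for illustration */
    float x[8], y[8];
    float *dx, *dy;
    int i;

    for (i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = i; }

    cublasInit();                                 /* initialise the library */
    cublasAlloc(n, sizeof(float), (void **)&dx);  /* allocate device memory */
    cublasAlloc(n, sizeof(float), (void **)&dy);
    cublasSetVector(n, sizeof(float), x, 1, dx, 1);  /* host -> device copy */
    cublasSetVector(n, sizeof(float), y, 1, dy, 1);

    cublasSaxpy(n, 2.0f, dx, 1, dy, 1);           /* y = 2*x + y, on the GPU */

    cublasGetVector(n, sizeof(float), dy, 1, y, 1);  /* device -> host copy */
    for (i = 0; i < n; ++i)
        printf("%f ", y[i]);
    printf("\n");

    cublasFree(dx);
    cublasFree(dy);
    cublasShutdown();
    return 0;
}
```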
A similar abstraction is available for FFT.
The cuFFT reference has details on the usage.
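A minimal cuFFT sketch (compiled with "nvcc -o fft fft.cu -lcufft"; transform size and input are illustrative) could look like this:

```cuda
#include <stdio.h>
#include <cufft.h>
#include <cuda_runtime.h>

#define NX 256  /* transform size, chosen for illustration */

int main(void)
{
    cufftComplex h[NX];
    cufftComplex *d;
    cufftHandle plan;
    int i;

    for (i = 0; i < NX; ++i) { h[i].x = 1.0f; h[i].y = 0.0f; }

    cudaMalloc((void **)&d, NX * sizeof(cufftComplex));
    cudaMemcpy(d, h, NX * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    cufftPlan1d(&plan, NX, CUFFT_C2C, 1);     /* plan a 1D complex transform */
    cufftExecC2C(plan, d, d, CUFFT_FORWARD);  /* in-place forward FFT        */

    cudaMemcpy(h, d, NX * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    printf("h[0] = (%f, %f)\n", h[0].x, h[0].y);

    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}
```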
PGI GPU Accelerator
The Portland compilers
provide the option of annotating parts of the source code which are
then automatically run on the GPU.
The following example shows this:
#pragma acc region
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
Details are available on the websites of PGI.
- PGI Accelerator 
- Tutorial 1 
- Tutorial 2 
To use this you have to load
the PGI environment with "module add pgi-10.0". After that
the program can be compiled with "pgcc program.c
-ta=nvidia". For details see the tutorials.
- gpgpu.org 
News about the GPGPU development, tools etc.
- Tutorial for the usage of CUBLAS and CUFFT with C and Fortran