TU Berlin


Overview of the system

A recent trend in high-performance computing is the use of graphics cards as powerful, low-cost coprocessors to the CPU. This is called general-purpose computing on graphics processing units (GPGPU).

The nVidia TESLA Cluster of the TU Berlin has 16 nodes with two Intel Xeon X5550 (2.66 GHz) processors each and 24 GB RAM.

The GPUs are the following:

  • 14 nodes have two nVidia TESLA C1060 cards each. Each C1060 has 240 computing units which work in parallel. This results in a theoretical GPU computing power of 1866 GFLOPS per node in single precision arithmetic or 156 GFLOPS per node in double precision arithmetic. Each card has 4096 MB RAM.
  • 2 nodes have four nVidia GTX 295 cards each. This results in a theoretical GPU computing power of 7152 GFLOPS per node in single precision arithmetic or 596 GFLOPS per node in double precision arithmetic. Each card has 1792 MB RAM.

All nodes have a 300 GB hard disk running at 10,000 RPM and are connected via InfiniBand QDR.

Usage information

The cluster can be accessed via the usual login nodes (cluster.math.tu-berlin.de) or the head node of the GPU cluster (cluster-g.math.tu-berlin.de).

The hardware of cluster-g is identical to the hardware of the GPU nodes. After loading the CUDA module you can call
$ gpurun program
to test a CUDA program for debugging purposes without having to use the batch system. By calling

module add cuda-3.2

you can load the CUDA module and use the CUDA development environment.

Jobs are submitted with "qsub -l gpus=X ...", where X is the number of GPUs needed. On the C1060 nodes 2 GPU processors are available, while on the GTX 295 nodes 8 GPU processors are available.

Development tools

To develop code for the GPUs, CUDA as well as OpenCL can be used. Furthermore, there are libraries for specific tasks which can be used directly from C or Fortran. The Portland compilers can automatically run parts of the code on the GPU.
These options are described in more detail below.


The CUDA API was developed by nVidia and simplifies the development of code for the GPUs with C and Fortran.

The CUDA website hosts useful documentation.

The CUDA SDK contains examples from the following domains:

  • Parallel bitonic sort
  • Matrix multiplication
  • Matrix transpose
  • Performance profiling using timers
  • Parallel prefix sum (scan) of large arrays
  • Image convolution
  • 1D DWT using Haar wavelet
  • OpenGL and Direct3D graphics interoperation examples
  • CUDA BLAS and FFT library usage examples
  • CPU-GPU C- and C++-code integration
  • Binomial Option Pricing
  • Black-Scholes Option Pricing
  • Monte-Carlo Option Pricing
  • Parallel Mersenne Twister (random number generation)
  • Parallel Histogram
  • Image Denoising
  • Sobel Edge Detection Filter

Thrust is a high-level C++ API written in the style of the STL, which simplifies development with CUDA even further.


OpenCL is a platform-independent alternative to CUDA. It is still in development.
A first version of the OpenCL development tools is available on the cluster by loading the module:

module add opencl-1.0

First steps with CUDA

The TESLA and GTX 295 cards have a large number of computing units which can process data in parallel. CUDA allows the programmer to write C functions, so-called kernels, which are executed N times in parallel by N so-called CUDA threads. To turn a C function into a kernel, it has to be annotated with the __global__ attribute. The CUDA compiler interprets this attribute and takes care of the execution on the GPU.

A thread ID serves to identify the threads. This allows you to partition the input vector or input matrix and to assign these partitions to well defined threads.

The threads are grouped into one-, two- or three-dimensional thread blocks, and the blocks are arranged in a one- or two-dimensional grid. Each block of this grid is identified by one- or two-dimensional coordinates, the so-called block index, and is assigned to a multiprocessor of the GPU. Within a block, each thread is identified by its thread index and processes its own part of the input data. A block can contain up to 512 threads.

The memory model of CUDA distinguishes between host memory, that is the normal RAM of the computer, and device memory, that is the memory of the GPU. The main program copies input data from host memory to device memory and copies results back from device memory to host memory. Very simple copy functions exist for this purpose. Other memory types such as texture memory, thread-local memory or thread-block-local memory are not explained here; they are covered in the CUDA tutorial.

The following example adds two Nx1 vectors. To do so, the vectors are partitioned into blocks of 256 threads each. Each thread has the task of adding the elements of the two vectors at one specific coordinate. The coordinate is derived from the block index and the thread index within the block.



#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Device code (executed on TESLA card, AKA "kernel")
__global__ void VecAdd(float* A, float* B, float* C, int N) {
    // Global index of this thread within the whole grid
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Threads beyond the end of the vectors do nothing
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main() {
    int N = 10000;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A and h_B and result vector h_C in host memory
    float * h_A = (float *)malloc(size);
    float * h_B = (float *)malloc(size);
    float * h_C = (float *)malloc(size);
    int i;
    // Fill input vectors with content
    for (i=0; i<N; ++i)
        h_A[i] = h_B[i] = 1.0;
    // Allocate vectors in device memory
    float * d_A;
    cudaMalloc((void**) &d_A, size);
    float * d_B;
    cudaMalloc((void**) &d_B, size);
    float * d_C;
    cudaMalloc((void**) &d_C, size);
    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
        (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result from device memory to host memory
    // -> h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Print result
    for (i=0; i<N; ++i)
        printf("%d %f\n", i, h_C[i]);
    // Free host memory
    free(h_A);
    free(h_B);
    free(h_C);
    return 0;
}

The program is saved with the file extension ".cu" and compiled with "nvcc -o addvectors addvectors.cu". It can then be run on the head node with "gpurun ./addvectors". The wrapper "gpurun" sets the necessary execution rights on the GPU.

Attention: If the program is executed directly without "gpurun", it might run without an error message, but nothing will be computed!

You can submit the job to the batch system with "qsub -l gpus=1 ./addvectors".

For more details on programming, see the CUDA Programming Guide.


With the Thrust abstraction, the example above can be simplified as follows:


#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <cstdlib>
#include <cstdio>

int main(void)
{
    int N = 10000;
    thrust::host_vector<float> h_A(N);
    thrust::host_vector<float> h_B(N);
    thrust::host_vector<float> h_C(N);
    for (int i=0; i<N; ++i)
        h_A[i] = h_B[i] = 1.0;
    // transfer data to the device
    thrust::device_vector<float> d_A = h_A;
    thrust::device_vector<float> d_B = h_B;
    thrust::device_vector<float> d_C(N);
    // apply + operator
    thrust::transform(d_A.begin(), d_A.end(), // input 1
                      d_B.begin(),            // input 2
                      d_C.begin(),            // output
                      thrust::plus<float>()); // operation
    // transfer data back to host
    h_C = d_C;
    for (int i=0; i<N; ++i)
        printf("%d %f\n", i, h_C[i]);
    return 0;
}

Thrust has many functions for searching, copying, reducing, sorting and transforming vectors.

cuBLAS and cuFFT



There is an example showing how simple it is to work with the cuBLAS library for linear algebra. This library hides the use of the GPU from the programmer. The cuBLAS reference and nVidia's website on linear algebra have more details. cuBLAS can be used from C++ as well as Fortran 77.

A similar abstraction for FFT is available at


The cuFFT reference has details on the usage.

PGI GPU Accelerator

The Portland compilers provide the option of annotating parts of the source code, which are then automatically run on the GPU.

The following example shows this:
#pragma acc region
for( i = 0; i < n; ++i ) r[i] = a[i]*2.0f;
Details are available on the PGI website.

To use this, you have to load the PGI environment with "module add pgi-10.0". After that, the program can be compiled with "pgcc program.c -ta=nvidia". For details see the tutorials.


