Transmittance Computation

Transmittance computation is the most time-consuming part of the radiative transfer model. One way to implement it on a GPU is to divide the workload so that each thread computes the results for one level. Each CUDA thread would in effect compute a matrix-vector multiplication between coefficients and predictors; in BLAS terms, this is a batched SGEMV operation. Since there are only 101 levels, there would be only 101 threads. Compared to the more than 30,000 active threads available on a Tesla C1070, this is an extremely low number for the CUDA threading model. NVIDIA recommends that the number of threads per block be a multiple of 32 [30]. Thus, 101 threads could effectively use at most 4 of the 30 multiprocessors. Since this type of work division is not very efficient, we next describe in detail implementations that divide the work so that each thread computes the results for one channel.

3.6 GPU implementation using 6 CUDA kernels

We implement the radiance computation on a GPU using six kernels: five for transmittance computation and one for radiance computation. Our GPU implementation of the transmittance part of the radiative transfer model is illustrated in Figure 6. It shows what happens inside each of the 30 multiprocessors of a Tesla C1070 during the layer-to-space transmittance computation, and it also illustrates how shared memory is used to store the predictors for the effective layer optical depths. The most time-consuming part of the radiative transfer model is the dot products in the layer-to-space transmittance computation, as shown in Fig. 3. The dot product between the regression coefficients for predicting the effective layer optical depths, 'C', and the predictors for the effective layer optical depths, 'X', ...

[...]

...execution, pre-computed values, Sqp, are stored in shared memory. Sqp represents the square root of a slant-path gas layer amount. For each level, 101 threads take part in transferring the values of the predictors for the effective layer optical depths of the fixed gases, variable X, to shared memory, and the threads are then synchronized. After that, a dot product between the coefficients and the values of variable X is computed. This operation is the most time-consuming one in the kernel because of the large number of global memory accesses. The global memory access pattern for the coefficients used in the dot product is such that consecutive threads access consecutive memory addresses, which maximizes memory bandwidth utilization. A sketch of such a kernel is shown below. Similar kernels are used for the other transmittance components; for brevity's sake we do not present the other transmittance kernels in this paper.
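To make the per-channel structure concrete, the following is a minimal sketch of a fixed-gas layer-to-space transmittance kernel. It is not the paper's implementation: the predictor count NPRED, the coefficient layout (channel index fastest, so consecutive threads read consecutive addresses), the staging of one layer's predictors at a time into shared memory, the clamping of negative predicted depths, and the use of exp(-optical depth) to form the transmittance are all illustrative assumptions, and every name is hypothetical.

    // Minimal sketch of a per-channel fixed-gas transmittance kernel.
    // NPRED, the coefficient layout, and the exp(-od) accumulation are
    // assumptions for illustration, not taken from the paper.
    #include <cuda_runtime.h>

    #define NLEVELS 101   // number of atmospheric levels (from the paper)
    #define NPRED    11   // assumed number of predictors per layer

    // coeff : [NLEVELS][NPRED][nchan]  regression coefficients C, channel fastest
    // pred  : [NLEVELS][NPRED]         predictors X (profile-dependent, channel-independent)
    // tau   : [NLEVELS][nchan]         layer-to-space transmittance output
    __global__ void fixed_gas_transmittance(const float* __restrict__ coeff,
                                            const float* __restrict__ pred,
                                            float*       __restrict__ tau,
                                            int nchan)
    {
        __shared__ float x_sh[NPRED];                       // predictors X for the current layer

        int   chan   = blockIdx.x * blockDim.x + threadIdx.x; // one thread per channel
        float od_sum = 0.0f;                                   // accumulated optical depth to space

        for (int lev = 0; lev < NLEVELS; ++lev) {
            // Stage this layer's predictors into shared memory once per block,
            // then let every thread (channel) reuse them.
            if (threadIdx.x < NPRED)
                x_sh[threadIdx.x] = pred[lev * NPRED + threadIdx.x];
            __syncthreads();

            if (chan < nchan) {
                // Dot product C . X for this (layer, channel).  With the channel
                // index fastest in 'coeff', consecutive threads read consecutive
                // global memory addresses, so the loads coalesce.
                float od = 0.0f;
                for (int p = 0; p < NPRED; ++p)
                    od += coeff[(lev * NPRED + p) * nchan + chan] * x_sh[p];

                od_sum += fmaxf(od, 0.0f);                  // assumed: clamp negative predicted depths
                tau[lev * nchan + chan] = __expf(-od_sum);  // layer-to-space transmittance
            }
            __syncthreads();                                // keep x_sh valid until all threads are done
        }
    }

Launched with one thread per channel and a block size that is a multiple of 32, a kernel of this shape can keep far more threads active than the 101-thread per-level division discussed above, provided the instrument has enough channels.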
