About Me

I'm a math student and an independent app developer (see www.cubiccarrot.com).

E-mail: sven : cubiccarrot : com

ESA SOCIS Project Description

GitHub Code Page (ESTL)

GitHub Code Page (ESTL-OX)

List of tasks

Implement a 2D FFT-based image convolution plugin using OpenCL

The project may be subdivided into six tasks:

(A) Implement a basic Opticks plugin without convolution code

This plugin displays a dialog where the user can select two images, the image to convolve and the convolution kernel, and produces a new image (initially a copy of the original) that is returned to Opticks.  This separates the Opticks-related code from the convolution code.

(B) Investigate existing FFT OpenCL implementations and become familiar with OpenCL

The author of the OpenCL implementation I referred to in the #opticks channel points out that, due to the communication overhead between CPU and GPU, the FFTs in his experiments only run faster when the size exceeds 2^18.  Studying the existing implementations will show me how they exploit memory locality to ensure good performance, and will also familiarize me with OpenCL.

In addition to the implementation I referred to on the #opticks channel, I also found some code by Apple, which is based on two papers that seem worth reading thoroughly:
- Fitting FFT onto the G80 Architecture (Volkov and Kazian)
- High Performance Discrete Fourier Transforms on Graphics Processors (Govindaraju, Lloyd, et al.)

(C) Implement a basic 2D FFT convolution using CPU only and plug it into the plugin developed in (A)

This implementation serves as a reference for comparison of the performance of the OpenCL implementation and to test its correctness.  Even if my OpenCL implementation fails, or is too slow to be useful due to communication overhead, this will ensure a CPU-based 2D FFT convolution is available to Opticks.
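A minimal sketch of what such a CPU reference could look like, written in pure Python for readability rather than speed. It assumes a "full" linear convolution with zero padding to power-of-two sizes; the function names (`fft`, `fft2`, `convolve2d_fft`) are illustrative and not taken from the project code.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.
    The inverse transform is left unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft2(a, inverse=False):
    """2D FFT: transform every row, then every column."""
    rows = [fft(r, inverse) for r in a]
    cols = [fft(list(c), inverse) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def next_pow2(n):
    p = 1
    while p < n:
        p *= 2
    return p

def convolve2d_fft(image, kernel):
    """Full linear 2D convolution via the convolution theorem."""
    h = len(image) + len(kernel) - 1
    w = len(image[0]) + len(kernel[0]) - 1
    H, W = next_pow2(h), next_pow2(w)

    def pad(a):
        # Zero-pad to H x W so the circular convolution equals the linear one.
        return [[complex(a[i][j]) if i < len(a) and j < len(a[0]) else 0j
                 for j in range(W)] for i in range(H)]

    fa, fb = fft2(pad(image)), fft2(pad(kernel))
    prod = [[fa[i][j] * fb[i][j] for j in range(W)] for i in range(H)]
    inv = fft2(prod, inverse=True)
    # Apply the 1/(H*W) normalization and crop to the true output size.
    return [[inv[i][j].real / (H * W) for j in range(w)] for i in range(h)]
```

Convolving the 2x2 image `[[1, 2], [3, 4]]` with a 2x2 kernel of ones should reproduce the direct result `[[1, 3, 2], [4, 10, 6], [3, 7, 4]]` up to rounding, which is the kind of check the reference implementation is meant to provide for the OpenCL version.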

(D) Convert the CPU-based 2D FFT convolution code to use OpenCL by using FFTs that are implemented in OpenCL

To this end, an OpenCL-based FFT implementation will be obtained, either by using existing code (if the license of this code is compatible with Opticks) or by converting a CPU-based FFT.  Replacing the FFT in (C) with the OpenCL FFT yields a first OpenCL implementation.

(E) If time permits, consider alternate OpenCL implementations to attempt to further improve performance

The 2D FFT convolution permits parallelism in several ways.  In this part, I can investigate whether alternative OpenCL implementations further speed up the implementation of (D).
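One of these forms of parallelism can be illustrated as follows: a 2D transform splits into two passes of 1D transforms, and within each pass the transforms do not depend on one another. The sketch below uses a naive DFT as a stand-in for any 1D FFT and a thread pool as a stand-in for GPU work-groups; both choices are assumptions made for illustration, and Python threads will not show a real speedup here.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(x):
    """Naive O(n^2) DFT, standing in for any 1D FFT routine."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft2_parallel(a, workers=4):
    """2D transform as two passes of mutually independent 1D transforms."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Pass 1: one independent task per row.
        rows = list(pool.map(dft, a))
        # Pass 2: one independent task per column of the row-transformed data.
        cols = list(pool.map(dft, [list(c) for c in zip(*rows)]))
    # cols holds the result transposed; transpose back to row-major order.
    return [list(r) for r in zip(*cols)]
```

The only synchronization point is between the two passes; further parallelism is available inside each 1D transform and across the two input images, whose forward transforms are also independent.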

(F) Write documentation

Time for testing and validation is included in the relevant parts.

Since completing tasks (A) to (D) provides a 2D FFT-based convolution in OpenCL, these tasks, together with (F), form the core of the project; task (E) is highly desirable but optional.

Availability

Daily

Allocation in weeks

Week

Tasks in progress this week

Week 1 5-Aug-2011

(A)

Week 2 12-Aug-2011

(A),  (B) (together with studying for my math exams)

Week 3 19-Aug-2011

(B) (together with studying for my math exams)

Week 4 26-Aug-2011

(B) (together with studying for my math exams)

Week 5 2-Sep-2011

(C)

Week 6 9-Sep-2011

(C)

Week 7 16-Sep-2011

(D)

Week 8 23-Sep-2011

(D)

Week 9 30-Sep-2011

(D)

Week 10 7-Oct-2011

(E)

Week 11 14-Oct-2011

(E)

Week 12 21-Oct-2011

(E)

Week 13 28-Oct-2011

(F)

Week 14

Adjusted schedule

In week 7, a theoretical analysis and the resulting implementation strategy were presented.

To obtain an OpenCL FFT with reasonable performance, the implementation is decomposed into the following substeps:

  • (1) Create a contiguous OpenCL FFT using global memory as discussed in section 3.1.  This includes the communication through local memory to enable the last steps of the FFT.
  • (2) Improve the locality factor for the FFT by letting multiple threads work on a single FFT as discussed in section 3.2.
  • (3) Optimize the kernel to enable prefetching of reads from global memory and reduce the dependency constraints on the writes to global memory.
  • (4) Add support for batch FFTs
  • (5) Try to find a global memory transformation to avoid partition camping (if necessary)
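To make the structure behind these substeps concrete, here is a small CPU emulation of a staged radix-2 FFT over a batch of contiguous signals. It is only a sketch under stated assumptions: it is not the design from the week 7 analysis, the names `butterfly`, `batch_fft`, and `tid` are hypothetical, and `tid` merely mimics the index a GPU work-item would have.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def butterfly(data, base, stage, tid):
    """Work of one hypothetical work-item `tid` in one stage of one signal."""
    half = 1 << stage                    # distance between the two inputs
    group, pos = divmod(tid, half)
    i = base + group * 2 * half + pos
    j = i + half
    w = cmath.exp(-2j * cmath.pi * pos / (2 * half))   # twiddle factor
    a, b = data[i], data[j] * w
    data[i], data[j] = a + b, a - b

def batch_fft(data, n):
    """In-place FFT of len(data) // n contiguous signals of length n,
    where n is a power of two."""
    bits = n.bit_length() - 1
    batches = len(data) // n
    for b in range(batches):
        base = b * n
        seg = data[base:base + n]
        for i in range(n):               # bit-reversal permutation per signal
            data[base + bit_reverse(i, bits)] = seg[i]
    for stage in range(bits):            # stages run in order (a barrier on a GPU)
        for b in range(batches):         # signals in the batch are independent
            for tid in range(n // 2):    # butterflies in a stage are independent
                butterfly(data, b * n, stage, tid)
    return data
```

For example, `batch_fft([1, 0, 0, 0, 1, 1, 1, 1], 4)` transforms two length-4 signals at once and should give approximately `[1, 1, 1, 1, 4, 0, 0, 0]`. The two inner loops are what a GPU would execute in parallel, while the access pattern `i`, `j = i + half` is where locality and memory-layout concerns such as those in steps (2), (3), and (5) come into play.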

This plan is preliminary: from step (3) onward, it may change if timing results or further analysis of the substeps show that other optimizations are needed than originally planned.  Steps (1) and (2) appear necessary to build a basic skeleton with reasonable performance that can then be optimized further.  In week 7, I proposed the following schedule for the remaining weeks:

Week 8 23-Sep-2011

(1)

Week 9 30-Sep-2011

(1), (2)

Week 10 7-Oct-2011

(2)

Week 11 14-Oct-2011

(3), (4)

Week 12 21-Oct-2011

(5)

Week 13 28-Oct-2011

(F)


Week 14







