Today's realities require a change in computing: the new paradigm for achieving computing performance is parallel processing. Driven by data growth and compute-intensive workloads, and constrained by the difficulty of scaling out computing systems without increasing cost, size, or power, computing architecture has been shifting from serial computing, where one instruction executes after another, to a heterogeneous computing model that accelerates work through efficient parallel execution. This shift should be reflected in the architecture of modern acceleration hardware.
Over the last decade, FPGAs (reprogrammable logic ICs) have been successfully deployed in high-end computing electronics thanks to their capacity for parallel execution, which results from their massively parallel hardware structure. The FPGA has rapidly evolved from a relatively simple logic array used for "glue logic" into a complex IC comprising millions of logic elements, hundreds of memory blocks, thousands of digital signal processing (DSP) blocks, and interface (transceiver) units. This allows FPGAs to be used in advanced signal processing and computing electronics that require fast, intensive calculation: radio processing and telecommunication systems, video processing, image analysis and machine vision, and complex industrial automation systems.
However, developing with such an advanced IC demands extensive specialized training and a very high level of skill from the developer, because the FPGA design flow is highly specific, complex, and time-consuming; in complexity it is comparable to ASIC design. In our country, as around the world, there is a constant shortage of engineers able to design for FPGAs. In addition to the difficult design flow, which requires a detailed understanding of the FPGA hardware structure, FPGAs must be programmed in specialized hardware description languages (HDLs). It is estimated that fewer than 5% of programmers have HDL skills. Thus, on the one hand we have excellent massively parallel processing hardware, but on the other hand its use is limited by the high complexity of the design flow.
Today, a proven approach to the transition from serial to parallel program execution is to build a heterogeneous computing system consisting of a host machine, which runs the program in the standard way, and an accelerator capable of parallel execution.
Unfortunately, according to research conducted by the Ministry of Education, programming heterogeneous systems that consist of a main multi-core computer and an accelerator card based on programmable logic (FPGAs) is currently complicated. No coherent programming model (both high-level and low-level), with corresponding compiler and runtime support, has yet been offered that would allow efficient use of the computing resources of such hybrid systems. Studies in recent decades indicate that automatically deriving a parallel program from serial code remains unsolved in the general case even for a homogeneous architecture (see the study "Use of OpenCL standard for FPGA programming," ISPRAS, Ministry of Education).
EulerProject addresses this problem by developing and producing standardized, user-friendly, efficient parallel processing hardware with a coherent programming model. The model abstracts away the highly complex hardware architecture and decouples program writing from any specific architecture, so that the code can be reused on different hardware platforms.
The EulerProject hardware solution is based on the latest generations of Intel FPGAs, with computing power of several TFLOPS per FPGA depending on logic array density and the number of hardware DSP blocks. We target a heterogeneous computing system with a host and an accelerator, in which host code is offloaded and run in parallel on the FPGA accelerator board. The accelerator part is scalable, can embed multiple FPGAs, and can be designed in different form factors depending on the computing system requirements (single- or double-width PCIe, SoM, 1U/2U server). The host role is performed by a standard processor or, optionally, by an ARM processor (for embedded systems).
The OpenCL programming model allows the programmer to describe functions to be executed in parallel on the set of accelerators (cluster) available on a given machine. To select a specific accelerator, a context is created: a data structure that defines the required accelerator class (multi-core CPU, GPU, FPGA). A command queue is then created: a data structure through which operations are submitted for execution on a particular accelerator. The programming model is built around the concept of OpenCL kernels (compute kernels).
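The roles of the context and the command queue can be illustrated with a minimal sketch in plain Python. This is a conceptual model only, not the OpenCL API: the classes `Device`, `Context`, and `CommandQueue`, and the device names, are hypothetical stand-ins for the objects that real OpenCL creates with calls such as `clCreateContext` and `clCreateCommandQueue`. The sketch assumes an in-order queue.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-ins for OpenCL objects; these are NOT the OpenCL API.


@dataclass(frozen=True)
class Device:
    name: str
    device_class: str  # "CPU", "GPU" or "FPGA"


@dataclass
class Context:
    """Groups the devices of one accelerator class found on the machine."""
    device_class: str
    devices: List[Device]

    @classmethod
    def create(cls, available: List[Device], device_class: str) -> "Context":
        selected = [d for d in available if d.device_class == device_class]
        if not selected:
            raise LookupError(f"no {device_class} device available")
        return cls(device_class, selected)


@dataclass
class CommandQueue:
    """Carries operations to exactly one device of a context, in order."""
    context: Context
    device: Device
    pending: List[Callable[[], object]] = field(default_factory=list)

    def __post_init__(self):
        if self.device not in self.context.devices:
            raise ValueError("device does not belong to the context")

    def enqueue(self, operation: Callable[[], object]) -> None:
        self.pending.append(operation)

    def finish(self) -> list:
        # Execute everything submitted so far, in submission order.
        results = [operation() for operation in self.pending]
        self.pending.clear()
        return results


# Assumed machine: one host CPU and one FPGA card (illustrative names).
machine = [Device("host-cpu", "CPU"), Device("fpga-card-0", "FPGA")]
ctx = Context.create(machine, "FPGA")        # pick the accelerator class
queue = CommandQueue(ctx, ctx.devices[0])    # bind a queue to one device
queue.enqueue(lambda: "write buffer")
queue.enqueue(lambda: "run kernel")
results = queue.finish()
```

The point of the split is that the context answers "which class of accelerator", while the queue answers "which particular device, and in what order".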
A kernel is an acceleration function that is executed in parallel on the accelerator by a certain number of threads. The programmer defines kernels as functions in an extension of the C language. Creating a kernel is divided into several stages, each of which corresponds to a call to the appropriate OpenCL function on the host CPU. First, the source code of one or more kernels is converted into an OpenCL program; the resulting program is then compiled for execution on the accelerator. After compilation, the kernel function is located in the program by the name it has in the source code. OpenCL provides a mechanism for allocating buffers in accelerator memory and for exchanging data between CPU memory and accelerator memory. The desired kernel or kernels are then launched in parallel by calling OpenCL library functions on the host CPU. The accelerator that runs a particular kernel is determined by the command queue to which the execution is submitted.
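The stages above can be sketched in plain Python. This is an emulation under stated assumptions, not OpenCL code: `build_program`, `create_kernel`, and `launch` are hypothetical helpers that mirror the roles of the corresponding OpenCL host calls, the "device buffers" are ordinary lists, and the kernel is a Python function rather than OpenCL C. Python threads stand in for work-items only to show the execution model; they do not provide real hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model of the kernel stages; NOT the OpenCL API.


def vector_add(global_id, a, b, out):
    """Kernel body: each work-item handles the element at its own index."""
    out[global_id] = a[global_id] + b[global_id]


def build_program(kernel_functions):
    """Stages 1-2: collect the kernel sources into a 'compiled program'."""
    return {fn.__name__: fn for fn in kernel_functions}


def create_kernel(program, name):
    """Stage 3: locate the kernel in the built program by its name."""
    return program[name]


def launch(kernel, global_size, *args, workers=4):
    """Stage 5: run one kernel instance per work-item index."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so that worker exceptions propagate.
        list(pool.map(lambda gid: kernel(gid, *args), range(global_size)))


host_a = [1, 2, 3, 4]
host_b = [10, 20, 30, 40]

program = build_program([vector_add])
kernel = create_kernel(program, "vector_add")

# Stage 4: allocate 'device' buffers and copy the host data into them.
dev_a, dev_b = list(host_a), list(host_b)
dev_out = [0] * len(host_a)

launch(kernel, len(host_a), dev_a, dev_b, dev_out)

# Copy the result from 'device' memory back to the host.
host_out = list(dev_out)
```

Note that the kernel never loops over the data itself: the launch supplies the index, so the same kernel body serves any problem size.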
An FPGA (field-programmable gate array) is a high-density IC containing a variety of circuit elements; a modern FPGA contains logic gates, memories, digital signal processing (DSP) blocks, and transceiver (communication) units. The interconnections between elements are switched according to a configuration file loaded into the FPGA, so changing the configuration file changes the hardware design implemented on the chip. In contrast to a "frozen architecture", in which the interconnect between elements is fixed at manufacture, the connections in an FPGA can be reprogrammed repeatedly, which makes the electronic circuit versatile and reconfigurable. A reprogrammable integrated circuit offers the same design flexibility as software but is not limited by the number of processor cores. Unlike a processor, an FPGA can perform independent computations truly in parallel: different computing tasks run at the same time without affecting each other's speed. Each independent task runs on its own FPGA logic elements (or other blocks, such as DSP blocks) and works without interference from tasks executing in other parts of the FPGA. As a result, the performance of one independent section is not reduced even when newly added tasks are implemented on other elements of the same FPGA, which is very important for building parallel data processing systems.
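The difference between sharing a fixed number of cores and giving every task its own logic can be made concrete with a small, deliberately idealized timing model. Everything in it is an assumption made for illustration: the function names, the task durations, the block counts, and the simplification that tasks are independent and that clock frequency, memory bandwidth, and routing effects are ignored.

```python
import heapq

# Idealized model with assumed numbers; it ignores clock rates, memory
# bandwidth and routing, and treats all tasks as fully independent.


def cpu_finish_time(task_times, cores):
    """Tasks share a fixed number of cores; each goes to the core free first."""
    loads = [0.0] * cores
    heapq.heapify(loads)
    for duration in task_times:
        earliest_free = heapq.heappop(loads)
        heapq.heappush(loads, earliest_free + duration)
    return max(loads)


def fpga_finish_time(task_times, blocks_needed, blocks_available):
    """Each task gets its own logic region, so tasks do not delay each other."""
    if sum(blocks_needed) > blocks_available:
        raise ValueError("design does not fit in the FPGA")
    return max(task_times)


six_tasks = [2.0] * 6      # six tasks, 2 time units each (assumed)
seven_tasks = [2.0] * 7    # the same workload with one task added

cpu_six = cpu_finish_time(six_tasks, cores=2)
cpu_seven = cpu_finish_time(seven_tasks, cores=2)

fpga_six = fpga_finish_time(six_tasks, [100] * 6, blocks_available=1000)
fpga_seven = fpga_finish_time(seven_tasks, [100] * 7, blocks_available=1000)
```

In this model, adding the seventh task lengthens the two-core schedule from 6.0 to 8.0 time units, while the FPGA figure stays at 2.0 for as long as the design fits in the available blocks. Once the blocks run out, the model raises an error instead, which reflects the real constraint: FPGA parallelism is bounded by chip resources rather than by core count.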