byteLAKE's presentation from the PPAM 2019 conference.
Abstract:
The goal of this work is to adapt four CFD kernels to the Xilinx Alveo U250 FPGA: the first-order step of the non-linear iterative upwind advection MPDATA scheme (non-oscillatory forward-in-time), the divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme, the tridiagonal Thomas algorithm for vertical matrix inversion inside the preconditioner for the iterative solver, and the computation of the pseudovelocity for the second pass of the upwind algorithm in MPDATA. All the kernels use a 3-dimensional compute domain consisting of 7 to 11 arrays. Since all kernels belong to the group of memory-bound algorithms, our main challenge is to achieve the highest possible utilization of the global memory bandwidth. Our adaptation reduces the execution time by up to 4x.
Find out more at: www.byteLAKE.com/en/CFD
Footnote:
This presentation covers the non-AI version of byteLAKE's CFD kernels, highly optimized for the Alveo FPGA. Based on this research project and many others in the CFD space, we decided to shift the course of the CFD Suite product development and leverage AI to accelerate computations and enable new possibilities. Instead of adapting CFD solvers to accelerators, we use AI and work on a cross-platform solution. More on the latest: www.byteLAKE.com/en/CFDSuite.
-
Update for 2020: byteLAKE is currently developing CFD Suite, a collection of AI (Artificial Intelligence) models to accelerate and enable new features for CFD simulations. It is a cross-platform solution (not only for FPGAs). More: www.byteLAKE.com/en/CFDSuite.
4. Common FPGA applications
• Confirmed effectiveness
– Audio processing
– Image processing
– Cryptography
– Routers/switches/gateways software
– Digital displays
– Scientific instruments (amplifiers, radio astronomy, radars)
• Current challenges
– Machine learning
– Deep learning
– High Performance Computing (HPC)
5. FPGA access
• Test Drive in the Cloud
– Nimbix: High Performance Computing & Supercomputing Platform
– Other cloud providers, soon…
• Your own cluster
– RAM: 80 GB (16 GB for deployment only)
– Hard disk space: 100 GB
– OS: RedHat, CentOS, Ubuntu
– Xilinx Runtime – the driver for Alveo
– Deployment Shell – the communication layer physically implemented and flashed into the card
– Xilinx SDAccel IDE – the framework for development
6. Xilinx Alveo U250 FPGA
• Premiere: October 2, 2018
• Built on the Xilinx 16nm UltraScale™ architecture
Memory:
– Off-chip capacity: 64 GB
– Off-chip total bandwidth: 77 GB/s
– Internal SRAM capacity: 54 MB
– Internal SRAM total bandwidth: 38 TB/s
Power and thermal:
– Maximum total power: 225 W
– Thermal cooling: passive
Clocks:
– Kernel clock: 500 MHz
– Data clock: 300 MHz
7. Xilinx Alveo U250 FPGA
• The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
• The resources in the dynamic region are available for creating custom accelerators
(Diagram: the device floorplan with four super logic regions, SLR0–SLR3, each exposing a dynamic region and its own DDR channel; the static region hosts the shell.)
Resources:
– Look-Up Tables (LUTs): 1,341K
– Registers: 2,749K
– 36 Kb Block RAMs: 2,000
– 288 Kb UltraRAMs: 1,280
8. Is it good for you?
• Desired features of a data center
– Low price
– Low energy consumption
– High performance
– Technical support
– Reliability and fast service
• Important metrics
– Execution time [s]
– Data throughput of a simulation [MB/s]
– Power dissipation [W]
– Energy consumption [J]
• Typical questions
– How many cards are required to achieve the desired performance?
– How many cards can I handle within a given energy budget?
– What performance can be achieved within my energy budget?
– How do these results compare to a CPU-based solution?
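These metrics are linked: for a roughly constant load, Energy [J] ≈ Power [W] × Execution time [s]. As a worked example using the results reported later in this deck, 101 W × 11.4 s ≈ 1151 J for the FPGA versus 142 W × 18.0 s ≈ 2556 J for the CPUs.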
9. Real scientific scenario
• Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
• Advection algorithm: the method to predict changes in the transport of a substance (fluid) or quantity by bulk motion in time
– An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
– It is also the transport of energy by water or air
• Based on the upwind scheme
• 3D compute domain
• Dataset (9 arrays + scalar):
– 3 x velocity vectors
– 2 x forces (implosion, explosion)
– 2 x density vectors
– 2 x transported substance (in, out)
– t – time interval
• Configuration:
– Job settings (size, timestep)
– Boundary conditions (periodic, open)
– Data accuracy (double, single, half)
(Diagram: the compute domain is periodic in the X dimension and open elsewhere.)
13. Algorithm design
• The compute domain is divided into 4 sub-domains
• The host sends data to the FPGA global memory
• The host calls the kernel to execute it on the FPGA (the kernel is called many times)
• Each kernel call represents a single time step
• The FPGA sends the output array back to the host (a host-side sketch follows below)
(Diagram: the CPU-side compute domain is split into four sub-domains; the host migrates memory objects to the FPGA, calls the kernel N times, and copies the output buffer back.)
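A minimal host-side sketch of this flow in plain OpenCL (buffer, queue, and kernel names such as krnl, d_in, d_out are illustrative; error handling is omitted):

// Buffers backed by host memory; migration moves them to FPGA DDR.
cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR, in_bytes,  h_in,  &err);
cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, out_bytes, h_out, &err);

clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);

// Data sending: migrate the input arrays to the FPGA global memory.
clEnqueueMigrateMemObjects(queue, 1, &d_in, 0, 0, NULL, NULL);

// N x call: one kernel invocation per time step.
for (int t = 0; t < n_timesteps; ++t)
    clEnqueueTask(queue, krnl, 0, NULL, NULL);

// Data receiving: migrate the output array back to the host.
clEnqueueMigrateMemObjects(queue, 1, &d_out, CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);
clFinish(queue);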
14. Kernel processing
• The kernel is distributed across the 4 SLRs
• Each sub-domain is allocated in a different memory bank
• Data transfer occurs between neighboring memory banks
(Diagram: the kernel is split into Kernel_A–Kernel_D, placed in SLR0–SLR3, with each sub-domain mapped to Bank0–Bank3; a link-time sketch of such a mapping follows below.)
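A hedged sketch of how this placement could be expressed when linking with the SDAccel xocc tool (option spellings differ between tool versions, and the kernel, port, and instance names — mpdata, m_axi_gmem, mpdata_A..D — are illustrative assumptions, not byteLAKE's actual build):

# Four compute units of one kernel, one per SLR, each wired to its own DDR bank.
xocc --link \
  --nk mpdata:4:mpdata_A.mpdata_B.mpdata_C.mpdata_D \
  --slr mpdata_A:SLR0 --slr mpdata_B:SLR1 --slr mpdata_C:SLR2 --slr mpdata_D:SLR3 \
  --sp mpdata_A.m_axi_gmem:bank0 --sp mpdata_B.m_axi_gmem:bank1 \
  --sp mpdata_C.m_axi_gmem:bank2 --sp mpdata_D.m_axi_gmem:bank3 \
  mpdata.xo -o mpdata.xclbin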
15. Kernel communication with pipes
• A pipe stores data organized as a FIFO
• Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
• Pipes must be statically defined outside of all kernel functions
• Pipe names must be lower-case alphanumerics
• Xilinx extended OpenCL pipes by adding a blocking mode that allows users to synchronize kernels (see the sketch after the declaration below)
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
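A minimal producer/consumer sketch built on the declaration above, using the Xilinx blocking extensions write_pipe_block/read_pipe_block (kernel names and the int payload are illustrative):

__kernel void producer(__global const int *in, const int n) {
    for (int i = 0; i < n; ++i) {
        int v = in[i];
        write_pipe_block(p0, &v);   // blocks while the FIFO is full
    }
}

__kernel void consumer(__global int *out, const int n) {
    for (int i = 0; i < n; ++i) {
        int v;
        read_pipe_block(p0, &v);    // blocks until data arrives
        out[i] = v;
    }
}

Because both calls block, the two kernels stay synchronized without any round trip through the external memory.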
16. Memory queue
• Each array is transferred from the global memory to the fast BRAM memory
• To minimize the data traffic we use a memory queue across iterations, as sketched below
(Diagram: data streams from the global memory into BRAM through a sliding queue.)
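A hedged sketch of the queue for a 3-plane vertical stencil (the plane size TILE, the 3-point weights, and all names are illustrative, not the production MPDATA kernel):

#define TILE 1024
__kernel void queue_demo(__global const float *globMem,
                         __global float *out, const int NZ) {
    __local float planes[3][TILE];   // planes k-1, k, k+1 held in BRAM

    for (int j = 0; j < TILE; ++j) planes[1][j] = globMem[j];          // k = 0
    for (int j = 0; j < TILE; ++j) planes[2][j] = globMem[TILE + j];   // k = 1

    for (int k = 1; k < NZ - 1; ++k) {
        // shift the queue: every plane moves one slot down
        for (int j = 0; j < TILE; ++j) {
            planes[0][j] = planes[1][j];
            planes[1][j] = planes[2][j];
        }
        // only the newest plane is fetched from global memory
        __attribute__((xcl_pipeline_loop))
        for (int j = 0; j < TILE; ++j)
            planes[2][j] = globMem[(k + 1) * TILE + j];
        // stand-in 3-point stencil in place of the real computation
        __attribute__((xcl_pipeline_loop))
        for (int j = 0; j < TILE; ++j)
            out[k * TILE + j] = 0.25f * planes[0][j]
                              + 0.50f * planes[1][j]
                              + 0.25f * planes[2][j];
    }
}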
19. Memory access within a kernel
• 31 memory pins are available in the Alveo U250
– Each pointer to the global memory set as a kernel argument reserves one memory pin
– Each kernel reserves one memory pin
• Using 4 banks and 4 kernels, we can set up to 6 global pointers to the global memory per kernel (4 x (1 + 6) = 28 pins ≤ 31)
• To send all required arrays we need to pack them into larger buffers (different for input and output data), as sketched below
• All kernel ports use 512-bit data access to provide the highest memory throughput
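A hedged sketch of the packing idea: several logical arrays live at fixed offsets inside one buffer behind a single 512-bit port, so one pointer (one pin) serves them all (the array names, count, and ARR_LEN are illustrative):

#define ARR_LEN 4096   // float16 elements per logical array
__kernel void unpack_demo(__global const float16 *packedIn,
                          __global float16 *packedOut) {
    // logical views into the packed input buffer
    __global const float16 *velocity = packedIn;                // array 0
    __global const float16 *density  = packedIn + ARR_LEN;      // array 1
    __global const float16 *substIn  = packedIn + 2 * ARR_LEN;  // array 2

    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < ARR_LEN; ++i)   // stand-in arithmetic
        packedOut[i] = substIn[i] + velocity[i] * density[i];
}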
20. Burst memory access / vectorization
• Burst memory access
– Loop pipelining
– Port data width: 512 bits
– Data copies separated from the computation
– Vectorization
void copy(__global const float16 * __restrict globMem)
{
  float16 bram[tKM];   // on-chip buffer for one tile
  …
  // pipelined loop turns into a burst read of 512-bit words
  write_0: __attribute__((xcl_pipeline_loop))
  for(int kj=0; kj<tKM; ++kj)
  {
    bram[kj] = globMem[gIdx+kj];
  }
  …
}
(Diagram: execution timeline comparing traditional sequential iterations with pipelined ones.)
21. Stencil vectorization
• Shifting elements within a vector (the standard shuffle API is not supported)
__attribute__((always_inline))
inline float16 getM1(const float a, const float16 b) {
  const float *ptr2 = (const float *)&b;   // per-lane view of the input vector
  float16 out;
  float *o = (float *)&out;
  o[0] = a;   // last lane of the previous vector
  __attribute__((opencl_unroll_hint(15)))
  for(int i = 1; i < 16; ++i) {
    o[i] = ptr2[i-1];
  }
  return out; }
Scalar form: X[i] = Y[i-1]
Vectorized form: X[i] = getM1(Y[i-1][15], Y[i]);
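The mirror operation, X[i] = Y[i+1], would be symmetric; a sketch under the same assumptions as getM1 (the name and layout are ours, not from the slides):

__attribute__((always_inline))
inline float16 getP1(const float16 a, const float b) {
  const float *ptr = (const float *)&a;
  float16 out;
  float *o = (float *)&out;
  __attribute__((opencl_unroll_hint(15)))
  for(int i = 0; i < 15; ++i) {
    o[i] = ptr[i+1];
  }
  o[15] = b;   // first lane of the next vector
  return out; }

Usage, analogous to getM1: X[i] = getP1(Y[i], Y[i+1][0]);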
23. Regionalization
• Independent regions in the code should be explicitly separated
• It helps the compiler distribute the code among LUTs
• The separation can be done by adding brackets around independent code blocks, as in the skeleton below (a concrete sketch follows it)
{
//the first block of instructions
}
{
//the second block of instructions
}
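A hedged, concrete illustration in the same kernel style (the upwind-flux formulas and all names are invented for the example; the point is only the braces around independent work):

void fluxes(const float vx, const float vy,
            const float qL, const float qR,
            const float qB, const float qT,
            float *fx, float *fy)
{
  {   // first block: flux in the X direction, independent of the second
    *fx = 0.5f * (vx * (qR + qL) - fabs(vx) * (qR - qL));
  }
  {   // second block: flux in the Y direction
    *fy = 0.5f * (vy * (qT + qB) - fabs(vy) * (qT - qB));
  }
}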
24. CPU implementation
• Our CPU implementation utilizes two processors:
– Intel® Xeon® CPU E5-2695 v2, 2.40–3.20 GHz (2 x 12 cores)
• The code adaptation includes (sketched below):
– Utilization of all 24 cores
– Loop transformations
– Memory alignment
– Thread affinity
– Data locality within nested loops
– Compiler optimizations
• The final simulation throughput is 3.7 GB/s
• The power dissipation is 142 W
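A hedged sketch of how these ingredients combine in C with OpenMP (the loop body, names, and IDX layout are illustrative; thread affinity would be set via environment variables such as OMP_PROC_BIND):

#include <omp.h>
#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))   // contiguous in k

void step(const float *restrict in, float *restrict out,
          const int NX, const int NY, const int NZ)
{
    // 24 threads, static schedule for locality; the innermost loop vectorizes
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i < NX - 1; ++i)
        for (int j = 1; j < NY - 1; ++j)
            for (int k = 1; k < NZ - 1; ++k)
                out[IDX(i, j, k)] = 0.5f  *  in[IDX(i, j, k)]
                                  + 0.25f * (in[IDX(i, j, k - 1)]
                                  +          in[IDX(i, j, k + 1)]);
}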
26. Results

Metric            | FPGA   | 2xCPU  | Ratio (2xCPU / FPGA)
Exec. time [s]    | 11.4   | 18.0   | 1.6
Throughput [MB/s] | 5840.8 | 3699.2 | 0.6
Power [W]         | 101.0  | 142.0  | 1.4
Energy [J]        | 1151.4 | 2556.0 | 2.2

(Bar charts: Throughput [MB/s], the higher the better — FPGA 5840.8 vs. 2xCPU 3699.2; Energy [J], the lower the better — FPGA 1151.4 vs. 2xCPU 2556.0.)
27. byteLAKE's ecosystem of partners
Complete solutions for the CFD market:
➢ HPC system design, build-up and configuration
➢ HPC software applications development and optimization to make the most of the hardware
… and more
28. More at: byteLAKE.com/en/CFD
Accelerated CFD Kernels
• Compatible with geophysical models like EULAG
• CFD Kernels: Advection, Pseudovelocity, Divergence, Thomas algorithm
• Faster time to results and more efficient processing compared to CPU-only nodes
– 4x faster
– 80% lower energy consumption
– 6x better performance per Watt
About byteLAKE
• AI (highly optimized AI engines to analyze text, image, video, time series data)
• HPC (highly optimized apps and kernels for HPC architectures)
30. We build AI and HPC solutions, focusing on software.
• We use machine/deep learning to bring automation and optimize operations in businesses across various industries.
• We create highly optimized software for supercomputers.
• Our researchers hold PhD and DSc degrees.
byteLAKE
www.byteLAKE.com
Building solutions for real-life business problems