byteLAKE's presentation from the PPAM 2019 conference.
Abstract:
The goal of this work is to adapt four CFD kernels to the Xilinx Alveo U250 FPGA: the first-order step of the non-linear iterative upwind advection MPDATA scheme (non-oscillatory forward-in-time), the divergence part of the matrix-free linear operator formulation in the iterative Krylov scheme, the tridiagonal Thomas algorithm for vertical matrix inversion inside the preconditioner for the iterative solver, and the computation of the pseudovelocity for the second pass of the upwind algorithm in MPDATA. All the kernels use a 3-dimensional compute domain consisting of 7 to 11 arrays. Since all kernels belong to the group of memory-bound algorithms, our main challenge is to achieve the highest possible utilization of the global memory bandwidth. Our adaptation reduces the execution time by up to 4x.
Find out more at: www.byteLAKE.com/en/CFD
Footnote:
This presentation covers the non-AI version of byteLAKE's CFD kernels, highly optimized for the Alveo FPGA. Based on this research project and many others in the CFD space, we decided to shift the course of the CFD Suite product development and leverage AI to accelerate computations and enable new possibilities. Instead of adapting CFD solvers to accelerators, we use AI and work on a cross-platform solution. More on the latest: www.byteLAKE.com/en/CFDSuite.
-
Update for 2020: byteLAKE is currently developing CFD Suite, a collection of AI (Artificial Intelligence) models to accelerate and enable new features for CFD simulations. It is a cross-platform solution (not only for FPGAs). More: www.byteLAKE.com/en/CFDSuite.
4. Common FPGA applications
• Confirmed effectiveness
– Audio processing
– Image processing
– Cryptography
– Routers/switches/gateways software
– Digital displays
– Scientific instruments (amplifiers, radio astronomy, radars)
• Current challenges
– Machine learning
– Deep learning
– High Performance Computing (HPC)
5. FPGA access
• Test Drive in the Cloud
– Nimbix: High Performance Computing & Supercomputing Platform
– Other cloud providers, soon…
• Your own cluster
– RAM: 80 GB (16 GB for deployment only)
– Hard disk space: 100 GB
– OS: RedHat, CentOS, Ubuntu
– Xilinx Runtime – the driver for Alveo
– Deployment Shell – the communication layer physically implemented and flashed into the card
– Xilinx SDAccel IDE – the framework for development
6. Xilinx Alveo U250 FPGA
• Premiere: October 2, 2018
• Built on the Xilinx 16nm UltraScale™ architecture
Memory:
– Off-chip capacity: 64 GB
– Off-chip total bandwidth: 77 GB/s
– Internal SRAM capacity: 54 MB
– Internal SRAM total bandwidth: 38 TB/s
Power and thermal:
– Maximum total power: 225 W
– Thermal cooling: passive
Clocks:
– Kernel clock: 500 MHz
– Data clock: 300 MHz
7. Xilinx Alveo U250 FPGA
• The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
• The resources in the dynamic region are available for creating custom accelerators
(Diagram: the device floorplan with four super logic regions, SLR0–SLR3, each exposing a dynamic region and its own DDR channel; the static region hosts the shell.)
Resources:
– Look-Up Tables (LUTs): 1,341K
– Registers: 2,749K
– 36 Kb Block RAMs: 2,000
– 288 Kb UltraRAMs: 1,280
8. Is it good for you?
• Desired features of a data center
– Low price
– Low energy consumption
– High performance
– Technical support
– Reliability and fast service
• Important metrics
– Execution time [s]
– Data throughput of a simulation [MB/s]
– Power dissipation [W]
– Energy consumption [J]
• Typical questions
– How many cards are required to achieve the desired performance?
– How many cards can I handle within a given energy budget?
– What performance can be achieved within my energy budget?
– How do these results compare to a CPU-based solution?
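These metrics are linked: for a roughly constant load, Energy [J] ≈ Power [W] × Execution time [s]. As a worked example using the results reported later in this deck, 101 W × 11.4 s ≈ 1151 J for the FPGA versus 142 W × 18.0 s ≈ 2556 J for the CPUs.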
9. Real scientific scenario
• Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
• Advection algorithm: the method to predict changes in the transport of a substance (fluid) or quantity by bulk motion in time
– An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
– It is also the transport of energy by water or air
• Based on the upwind scheme
• 3D compute domain
• Dataset (9 arrays + scalar):
– 3 x velocity vectors
– 2 x forces (implosion, explosion)
– 2 x density vectors
– 2 x transported substance (in, out)
– t – time interval
• Configuration:
– Job settings (size, timestep)
– Boundary conditions (periodic, open)
– Data accuracy (double, single, half)
(Diagram: the compute domain is periodic in the X dimension and open elsewhere.)
13. Algorithm design
• The compute domain is divided into 4 sub-domains
• The host sends data to the FPGA global memory
• The host calls the kernel to execute it on the FPGA (the kernel is called many times)
• Each kernel call represents a single time step
• The FPGA sends the output array back to the host (a host-side sketch follows below)
(Diagram: the CPU-side compute domain is split into four sub-domains; the host migrates memory objects to the FPGA, calls the kernel N times, and copies the output buffer back.)
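A minimal host-side sketch of this flow in plain OpenCL (buffer, queue, and kernel names such as krnl, d_in, d_out are illustrative; error handling is omitted):

// Buffers backed by host memory; migration moves them to FPGA DDR.
cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY  | CL_MEM_USE_HOST_PTR, in_bytes,  h_in,  &err);
cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR, out_bytes, h_out, &err);

clSetKernelArg(krnl, 0, sizeof(cl_mem), &d_in);
clSetKernelArg(krnl, 1, sizeof(cl_mem), &d_out);

// Data sending: migrate the input arrays to the FPGA global memory.
clEnqueueMigrateMemObjects(queue, 1, &d_in, 0, 0, NULL, NULL);

// N x call: one kernel invocation per time step.
for (int t = 0; t < n_timesteps; ++t)
    clEnqueueTask(queue, krnl, 0, NULL, NULL);

// Data receiving: migrate the output array back to the host.
clEnqueueMigrateMemObjects(queue, 1, &d_out, CL_MIGRATE_MEM_OBJECT_HOST, 0, NULL, NULL);
clFinish(queue);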
14. Kernel processing
• The kernel is distributed across the 4 SLRs
• Each sub-domain is allocated in a different memory bank
• Data transfer occurs between neighboring memory banks
(Diagram: the kernel is split into Kernel_A–Kernel_D, placed in SLR0–SLR3, with each sub-domain mapped to Bank0–Bank3; a link-time sketch of such a mapping follows below.)
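A hedged sketch of how this placement could be expressed when linking with the SDAccel xocc tool (option spellings differ between tool versions, and the kernel, port, and instance names — mpdata, m_axi_gmem, mpdata_A..D — are illustrative assumptions, not byteLAKE's actual build):

# Four compute units of one kernel, one per SLR, each wired to its own DDR bank.
xocc --link \
  --nk mpdata:4:mpdata_A.mpdata_B.mpdata_C.mpdata_D \
  --slr mpdata_A:SLR0 --slr mpdata_B:SLR1 --slr mpdata_C:SLR2 --slr mpdata_D:SLR3 \
  --sp mpdata_A.m_axi_gmem:bank0 --sp mpdata_B.m_axi_gmem:bank1 \
  --sp mpdata_C.m_axi_gmem:bank2 --sp mpdata_D.m_axi_gmem:bank3 \
  mpdata.xo -o mpdata.xclbin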
15. Kernel communication with pipes
• A pipe stores data organized as a FIFO
• Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
• Pipes must be statically defined outside of all kernel functions
• Pipe names must be lower-case alphanumerics
• Xilinx extended OpenCL pipes by adding a blocking mode that allows users to synchronize kernels (see the sketch after the declaration below)
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
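A minimal producer/consumer sketch built on the declaration above, using the Xilinx blocking extensions write_pipe_block/read_pipe_block (kernel names and the int payload are illustrative):

__kernel void producer(__global const int *in, const int n) {
    for (int i = 0; i < n; ++i) {
        int v = in[i];
        write_pipe_block(p0, &v);   // blocks while the FIFO is full
    }
}

__kernel void consumer(__global int *out, const int n) {
    for (int i = 0; i < n; ++i) {
        int v;
        read_pipe_block(p0, &v);    // blocks until data arrives
        out[i] = v;
    }
}

Because both calls block, the two kernels stay synchronized without any round trip through the external memory.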
16. Memory queue
• Each array is transferred from the global memory to the fast BRAM memory
• To minimize the data traffic we use a memory queue across iterations, as sketched below
(Diagram: data streams from the global memory into BRAM through a sliding queue.)
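A hedged sketch of the queue for a 3-plane vertical stencil (the plane size TILE, the 3-point weights, and all names are illustrative, not the production MPDATA kernel):

#define TILE 1024
__kernel void queue_demo(__global const float *globMem,
                         __global float *out, const int NZ) {
    __local float planes[3][TILE];   // planes k-1, k, k+1 held in BRAM

    for (int j = 0; j < TILE; ++j) planes[1][j] = globMem[j];          // k = 0
    for (int j = 0; j < TILE; ++j) planes[2][j] = globMem[TILE + j];   // k = 1

    for (int k = 1; k < NZ - 1; ++k) {
        // shift the queue: every plane moves one slot down
        for (int j = 0; j < TILE; ++j) {
            planes[0][j] = planes[1][j];
            planes[1][j] = planes[2][j];
        }
        // only the newest plane is fetched from global memory
        __attribute__((xcl_pipeline_loop))
        for (int j = 0; j < TILE; ++j)
            planes[2][j] = globMem[(k + 1) * TILE + j];
        // stand-in 3-point stencil in place of the real computation
        __attribute__((xcl_pipeline_loop))
        for (int j = 0; j < TILE; ++j)
            out[k * TILE + j] = 0.25f * planes[0][j]
                              + 0.50f * planes[1][j]
                              + 0.25f * planes[2][j];
    }
}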
19. Memory access within a kernel
• 31 memory pins are available in the Alveo U250
– Each pointer to the global memory set as a kernel argument reserves one memory pin
– Each kernel reserves one memory pin
• Using 4 banks and 4 kernels, we can set up to 6 global pointers to the global memory per kernel (4 x (1 + 6) = 28 pins ≤ 31)
• To send all required arrays we need to pack them into larger buffers (different for input and output data), as sketched below
• All kernel ports use 512-bit data access to provide the highest memory throughput
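A hedged sketch of the packing idea: several logical arrays live at fixed offsets inside one buffer behind a single 512-bit port, so one pointer (one pin) serves them all (the array names, count, and ARR_LEN are illustrative):

#define ARR_LEN 4096   // float16 elements per logical array
__kernel void unpack_demo(__global const float16 *packedIn,
                          __global float16 *packedOut) {
    // logical views into the packed input buffer
    __global const float16 *velocity = packedIn;                // array 0
    __global const float16 *density  = packedIn + ARR_LEN;      // array 1
    __global const float16 *substIn  = packedIn + 2 * ARR_LEN;  // array 2

    __attribute__((xcl_pipeline_loop))
    for (int i = 0; i < ARR_LEN; ++i)   // stand-in arithmetic
        packedOut[i] = substIn[i] + velocity[i] * density[i];
}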
20. Burst memory access / vectorization
• Burst memory access
– Loop pipelining
– Port data width: 512 bits
– Data copies separated from the computation
– Vectorization
void copy(__global const float16 * __restrict globMem)
{
  float16 bram[tKM];   // on-chip buffer for one tile
  …
  // pipelined loop turns into a burst read of 512-bit words
  write_0: __attribute__((xcl_pipeline_loop))
  for(int kj=0; kj<tKM; ++kj)
  {
    bram[kj] = globMem[gIdx+kj];
  }
  …
}
(Diagram: execution timeline comparing traditional sequential iterations with pipelined ones.)
21. Stencil vectorization
• Shifting elements within a vector (the standard shuffle API is not supported)
__attribute__((always_inline))
inline float16 getM1(const float a, const float16 b) {
  const float *ptr2 = (const float *)&b;   // per-lane view of the input vector
  float16 out;
  float *o = (float *)&out;
  o[0] = a;   // last lane of the previous vector
  __attribute__((opencl_unroll_hint(15)))
  for(int i = 1; i < 16; ++i) {
    o[i] = ptr2[i-1];
  }
  return out; }
Scalar form: X[i] = Y[i-1]
Vectorized form: X[i] = getM1(Y[i-1][15], Y[i]);
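The mirror operation, X[i] = Y[i+1], would be symmetric; a sketch under the same assumptions as getM1 (the name and layout are ours, not from the slides):

__attribute__((always_inline))
inline float16 getP1(const float16 a, const float b) {
  const float *ptr = (const float *)&a;
  float16 out;
  float *o = (float *)&out;
  __attribute__((opencl_unroll_hint(15)))
  for(int i = 0; i < 15; ++i) {
    o[i] = ptr[i+1];
  }
  o[15] = b;   // first lane of the next vector
  return out; }

Usage, analogous to getM1: X[i] = getP1(Y[i], Y[i+1][0]);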
23. Regionalization
• Independent regions in the code should be explicitly separated
• It helps the compiler distribute the code among LUTs
• The separation can be done by adding brackets around independent code blocks, as in the skeleton below (a concrete sketch follows it)
{
//the first block of instructions
}
{
//the second block of instructions
}
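A hedged, concrete illustration in the same kernel style (the upwind-flux formulas and all names are invented for the example; the point is only the braces around independent work):

void fluxes(const float vx, const float vy,
            const float qL, const float qR,
            const float qB, const float qT,
            float *fx, float *fy)
{
  {   // first block: flux in the X direction, independent of the second
    *fx = 0.5f * (vx * (qR + qL) - fabs(vx) * (qR - qL));
  }
  {   // second block: flux in the Y direction
    *fy = 0.5f * (vy * (qT + qB) - fabs(vy) * (qT - qB));
  }
}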
24. CPU implementation
• Our CPU implementation utilizes two processors:
– Intel® Xeon® CPU E5-2695 v2, 2.40–3.20 GHz (2 x 12 cores)
• The code adaptation includes (sketched below):
– Utilization of all 24 cores
– Loop transformations
– Memory alignment
– Thread affinity
– Data locality within nested loops
– Compiler optimizations
• The final simulation throughput is 3.7 GB/s
• The power dissipation is 142 W
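A hedged sketch of how these ingredients combine in C with OpenMP (the loop body, names, and IDX layout are illustrative; thread affinity would be set via environment variables such as OMP_PROC_BIND):

#include <omp.h>
#define IDX(i, j, k) (((i) * NY + (j)) * NZ + (k))   // contiguous in k

void step(const float *restrict in, float *restrict out,
          const int NX, const int NY, const int NZ)
{
    // 24 threads, static schedule for locality; the innermost loop vectorizes
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i < NX - 1; ++i)
        for (int j = 1; j < NY - 1; ++j)
            for (int k = 1; k < NZ - 1; ++k)
                out[IDX(i, j, k)] = 0.5f  *  in[IDX(i, j, k)]
                                  + 0.25f * (in[IDX(i, j, k - 1)]
                                  +          in[IDX(i, j, k + 1)]);
}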
26. Results

Metric            | FPGA   | 2xCPU  | Ratio (2xCPU / FPGA)
Exec. time [s]    | 11.4   | 18.0   | 1.6
Throughput [MB/s] | 5840.8 | 3699.2 | 0.6
Power [W]         | 101.0  | 142.0  | 1.4
Energy [J]        | 1151.4 | 2556.0 | 2.2

(Bar charts: Throughput [MB/s], the higher the better — FPGA 5840.8 vs. 2xCPU 3699.2; Energy [J], the lower the better — FPGA 1151.4 vs. 2xCPU 2556.0.)
27. byteLAKE's ecosystem of partners
Complete solutions for the CFD market:
➢ HPC system design, build-up and configuration
➢ HPC software applications development and optimization to make the most of the hardware
… and more
28. More at: byteLAKE.com/en/CFD
Accelerated CFD Kernels
• Compatible with geophysical models like EULAG
• CFD Kernels: Advection, Pseudovelocity, Divergence, Thomas algorithm
• Faster time to results and more efficient processing compared to CPU-only nodes
– 4x faster
– 80% lower energy consumption
– 6x better performance per Watt
About byteLAKE
• AI (highly optimized AI engines to analyze text, image, video, time series data)
• HPC (highly optimized apps and kernels for HPC architectures)
30. We build AI and HPC solutions, focusing on software.
• We use machine/deep learning to bring automation and optimize operations in businesses across various industries.
• We create highly optimized software for supercomputers.
• Our researchers hold PhD and DSc degrees.
byteLAKE
www.byteLAKE.com
Building solutions for real-life business problems