DSc PhD Krzysztof ROJEK, byteLAKE’s CTO
PPAM 2019, Bialystok, Poland, September 8-11, 2019
CFD code adaptation to the FPGA architecture
Background
• Current trends in the FPGA market
• Common FPGA applications
• FPGA access
• Architecture of the Xilinx Alveo U250 FPGA
• Evaluation metrics
• Algorithm scenario
• Development of FPGA codes
• Algorithm design
• OpenCL kernel processing
• Memory queue
• Limitations of memory access
• Burst memory access
• Vectorization
• Code regionalization
• CPU implementation overview
• Performance and energy results
• Conclusion
Current trends in the FPGA market
Common FPGA applications
• Confirmed effectiveness
– Audio processing
– Image processing
– Cryptography
– Routers/switches/gateways software
– Digital displays
– Scientific instruments (amplifiers, radio astronomy, radars)
• Current challenges
– Machine learning
– Deep learning
– High Performance Computing (HPC)
FPGA access
• Test drive in the cloud
– Nimbix: High Performance Computing & Supercomputing Platform
– Other cloud providers, soon…
• Your own cluster
– RAM: 80 GB (16 GB for deployment only)
– Hard disk space: 100 GB
– OS: RedHat, CentOS, Ubuntu
– Xilinx Runtime: driver for Alveo
– Deployment Shell: the communication layer physically implemented and flashed into the card
– The Xilinx SDAccel IDE: framework for development
Xilinx Alveo U250 FPGA
• Premiere: October 02, 2018
• Built on the Xilinx 16nm UltraScale™ architecture
• Memory
– Off-chip memory capacity: 64 GB
– Off-chip total bandwidth: 77 GB/s
– Internal SRAM capacity: 54 MB
– Internal SRAM total bandwidth: 38 TB/s
• Power and thermal
– Maximum total power: 225 W
– Thermal cooling: passive
• Clocks
– KERNEL CLK: 500 MHz
– DATA CLK: 300 MHz
Xilinx Alveo U250 FPGA
• The deployment shell, which handles device bring-up and configuration over PCIe, is contained within the static region of the FPGA
• The resources in the dynamic region are available for creating custom accelerators
[Figure: SLR0 to SLR3, each containing a Dynamic Region, plus a Static Region and four DDR banks]
• Resources
– Look-Up Tables (LUTs): 1341 K
– Registers: 2749 K
– 36 Kb Block RAMs: 2000
– 288 Kb UltraRAMs: 1280
Is it good for you?
• Desired features of a data center
– Low price
– Low energy consumption
– High performance
– Technical support
– Reliability and fast service
• Important metrics
– Execution time [s]
– Data throughput of a simulation [MB/s]
– Power dissipation [W]
– Energy consumption [J]
• How many cards are required to achieve a desired performance?
• How many cards can I handle within a given energy budget?
• What performance can be achieved within my energy budget?
• How do these results compare to the CPU-based solution?
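The metrics above are linked: for a roughly constant power draw, energy is power multiplied by execution time. The sketch below is a minimal Python illustration under that assumption, using the FPGA and 2xCPU figures reported on the Results slide of this deck. The function names and the 2000 W budget are hypothetical and only serve the example.

```python
def energy_joules(power_w: float, exec_time_s: float) -> float:
    """Energy [J] = power [W] * time [s], assuming constant power draw."""
    return power_w * exec_time_s


def cards_within_power_budget(budget_w: float, card_power_w: float) -> int:
    """Number of cards that fit into a power budget (hypothetical helper)."""
    return int(budget_w // card_power_w)


fpga_energy = energy_joules(101.0, 11.4)  # FPGA: 101 W for 11.4 s
cpu_energy = energy_joules(142.0, 18.0)   # 2xCPU: 142 W for 18.0 s
print(round(fpga_energy, 1), round(cpu_energy, 1))

# Assumed budget of 2000 W, chosen only for illustration.
print(cards_within_power_budget(2000.0, 101.0))
```

Both products agree with the energy values listed in the results table (1151.4 J and 2556.0 J).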
Real scientific scenario
• Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
• Advection algorithm: the method to predict changes in transport of a substance (fluid) or quantity by bulk motion in time
– An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
– It is also the transport of energy by water or air
• Based on upwind scheme
• 3D compute domain
• Dataset (9 arrays + scalar):
– 3 x velocity vectors
– 2 x forces (implosion, explosion)
– 2 x density vectors
– 2 x transported substance (in, out)
– t: time interval
• Configuration:
– Job setting (size, timestep)
– Border conditions (periodic, open)
– Data accuracy (double, single, half)
[Figure: periodic domain in X dimension; open domain]
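To make the upwind idea concrete, here is a deliberately reduced Python sketch: one first-order upwind step in 1D with a constant velocity and a periodic border. This is an assumption-laden toy, not the 3D kernel from the talk, which also involves forces and densities that are not reproduced here. The name `upwind_step` and its parameters are illustrative.

```python
def upwind_step(q, u, dt, dx):
    """One first-order upwind step for dq/dt + u * dq/dx = 0 on a periodic 1D grid."""
    n = len(q)
    c = u * dt / dx  # Courant number; the scheme is stable for |c| <= 1
    out = [0.0] * n
    for i in range(n):
        if u >= 0:
            # Flow to the right: take the difference towards the left neighbor.
            out[i] = q[i] - c * (q[i] - q[(i - 1) % n])
        else:
            # Flow to the left: take the difference towards the right neighbor.
            out[i] = q[i] - c * (q[(i + 1) % n] - q[i])
    return out


pulse = [0.0, 1.0, 0.0, 0.0]
shifted = upwind_step(pulse, u=1.0, dt=1.0, dx=1.0)  # Courant number 1
smeared = upwind_step(pulse, u=1.0, dt=0.5, dx=1.0)  # Courant number 0.5
print(shifted, smeared)
```

With a Courant number of 1 the pulse moves exactly one cell downstream; with 0.5 it spreads over two cells while the total amount of the transported quantity stays the same.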
Development
• Config, makefile, and source
Algorithm design
• The compute domain is divided into 4 sub-domains
• Host sends data to the FPGA global memory
• Host calls the kernel to execute it on the FPGA (the kernel is called many times)
• Each kernel call represents a single time step
• FPGA sends the output array back to the host
[Figure: CPU and FPGA timelines: compute domain with four sub-domains; data sending and receiving (migrate memory objects); N x kernel call; kernel processing; copy buffer]
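The host-side flow listed above can be sketched as a plain Python model. It does not use any OpenCL or Xilinx API; `split_domain`, `run_simulation`, and the `add_one` stand-in kernel are hypothetical names, and the sketch assumes the domain length is divisible by four.

```python
def split_domain(domain, parts=4):
    """Split the compute domain into `parts` contiguous sub-domains."""
    size = len(domain) // parts
    return [domain[p * size:(p + 1) * size] for p in range(parts)]


def run_simulation(domain, time_steps, kernel):
    device = split_domain(domain)              # data sending: host -> device memory
    for _ in range(time_steps):                # N x kernel call, one call per time step
        device = kernel(device)
    return [x for sub in device for x in sub]  # data receiving: device -> host


def add_one(sub_domains):
    """Stand-in kernel: a real kernel would compute one advection time step."""
    return [[v + 1 for v in sub] for sub in sub_domains]


result = run_simulation(list(range(8)), time_steps=3, kernel=add_one)
print(result)
```

The input data crosses the host/device boundary once before the loop and the output once after it, which is why the kernel can be called many times without extra transfers.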
Kernel processing
• Kernel is distributed into 4 SLRs
• Each sub-domain is allocated in a different memory bank
• Data transfer occurs between neighboring memory banks
[Figure: the kernel split into Kernel_A (SLR0), Kernel_B (SLR1), Kernel_C (SLR2), Kernel_D (SLR3); memory banks Bank0 to Bank3, each holding one sub-domain]
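Why neighboring banks must exchange data can be shown with a small Python model: a stencil that reads the left neighbor needs one border ("halo") cell from the adjacent sub-domain. The two-point stencil, the periodic wrap-around, and all names below are assumptions made for illustration, not the kernel from the talk.

```python
def step_whole(q):
    """Reference: periodic two-point stencil on the undivided domain."""
    return [q[i - 1] + q[i] for i in range(len(q))]  # q[-1] wraps around


def step_split(banks):
    """Same stencil on sub-domains: each bank first receives one halo cell
    from its left neighbor (periodic), then computes locally."""
    halos = [banks[b - 1][-1] for b in range(len(banks))]  # neighbor transfer
    out = []
    for halo, sub in zip(halos, banks):
        padded = [halo] + sub
        out.append([padded[i] + padded[i + 1] for i in range(len(sub))])
    return out


q = list(range(8))
banks = [q[0:2], q[2:4], q[4:6], q[6:8]]  # four sub-domains, one per bank
flat = [x for sub in step_split(banks) for x in sub]
print(flat == step_whole(q))
```

The split computation matches the undivided one only because the halo cells are refreshed before every step.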
Kernels communication with pipes
• A pipe stores data organized as a FIFO
• Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
• Pipes must be statically defined outside of all kernel functions
• Pipes must be declared in lowercase alphanumerics
• Xilinx extended OpenCL pipes by adding a blocking mode that allows users to synchronize kernels
pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
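The blocking FIFO behavior described above can be imitated on a CPU with Python's standard `queue.Queue`. This is only an analogy for how a bounded, blocking pipe synchronizes a producer and a consumer; it is not how pipes are implemented on the FPGA. The depth of 512 mirrors the declaration above, and the function names are hypothetical.

```python
import queue
import threading

PIPE_DEPTH = 512  # mirrors xcl_reqd_pipe_depth(512)


def producer(pipe, values):
    for v in values:
        pipe.put(v)  # blocks while the FIFO is full


def consumer(pipe, count, out):
    for _ in range(count):
        out.append(pipe.get())  # blocks while the FIFO is empty


def stream(values):
    """Send `values` from one thread to another through a bounded FIFO."""
    pipe = queue.Queue(maxsize=PIPE_DEPTH)
    out = []
    t_prod = threading.Thread(target=producer, args=(pipe, values))
    t_cons = threading.Thread(target=consumer, args=(pipe, len(values), out))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return out


received = stream(list(range(2000)))
print(received[:3], len(received))
```

Even though 2000 values pass through a FIFO of depth 512, order is preserved and neither side needs explicit locking, because the blocking calls do the synchronization.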
Memory queue
• Each array is transferred from the global memory to the fast BRAM
• To minimize the data traffic we use a memory queue across iterations
[Figure: global memory to BRAM]
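A memory queue across iterations can be pictured as a sliding window: data already fetched into fast memory is reused, and each iteration loads only what is new. The Python sketch below assumes a window of three planes and uses a sum as a stand-in for the stencil; neither the window size nor the names come from the slides.

```python
from collections import deque


def sweep_with_queue(planes, window=3):
    """Slide a window of `window` planes through on-chip storage.

    Returns (results, number_of_global_memory_loads).
    """
    bram = deque(maxlen=window)  # the oldest plane is dropped automatically
    loads = 0
    results = []
    for plane in planes:
        bram.append(plane)       # exactly one global-memory load per iteration
        loads += 1
        if len(bram) == window:
            results.append(sum(bram))  # stand-in for the stencil computation
    return results, loads


results, loads = sweep_with_queue([1, 2, 3, 4, 5])
naive_loads = len(results) * 3  # reloading all three planes for every result
print(results, loads, naive_loads)
```

For five planes the queue needs five loads, whereas reloading the full window for every result would need nine.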
Memory access within a kernel
• 31 pins are available in the Alveo U250
– Each pointer to the global memory set as a kernel argument reserves one memory pin
– Each kernel reserves one memory pin
• Using 4 banks and 4 kernels we can set up to 6 global pointers to the global memory
• To send all required arrays we need to pack them into larger buffers (different for input and output data)
• All kernel ports require 512-bit data access to provide the highest memory access performance
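One reading that is consistent with the numbers above: four kernels reserve four pins, which leaves 27 pins, enough for six pointer arguments per kernel. The Python sketch below checks that arithmetic and shows a possible packing of several arrays into one buffer. The per-kernel interpretation, the packing layout, and all names are assumptions.

```python
TOTAL_PINS = 31   # memory pins available on the Alveo U250 (from the slide)
KERNELS = 4       # one kernel per SLR
PORT_BITS = 512   # data width of each kernel port
FLOAT_BITS = 32   # single precision


def max_pointers_per_kernel(total_pins, kernels):
    """Each kernel reserves one pin; each global pointer argument one more."""
    return (total_pins - kernels) // kernels


def pack_offsets(array_lengths):
    """Start offset of each array inside one packed buffer (hypothetical layout)."""
    offsets, pos = [], 0
    for n in array_lengths:
        offsets.append(pos)
        pos += n
    return offsets, pos


pointers = max_pointers_per_kernel(TOTAL_PINS, KERNELS)
floats_per_access = PORT_BITS // FLOAT_BITS
offsets, total = pack_offsets([100, 100, 100])
print(pointers, floats_per_access, offsets, total)
```

A 512-bit port moves 16 single-precision values per access, which matches the `float16` vector type used in the kernel code of this deck.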
Burst memory access/vectorization
• Burst memory access
– Loop pipelining
– Port data width: 512 bits
– Data copying separated from the computation
– Vectorization
void copy(__global const float16 * __restrict globMem)
{
    float16 bram[tKM];    // on-chip buffer; each float16 element is 512 bits wide
    …
    // Pipelined copy loop over consecutive addresses
    write_0: __attribute__((xcl_pipeline_loop))
    for (int kj = 0; kj < tKM; ++kj)
    {
        bram[kj] = globMem[gIdx + kj];
    }
    …
}
[Figure: execution time, traditional vs. pipelining]
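The benefit of loop pipelining can be estimated with the standard cycle-count model: without pipelining every iteration waits for the previous one to finish, while a pipelined loop starts a new iteration every initiation interval. The Python sketch below uses that textbook model; the latency of 4 cycles and the iteration count are assumed values, not measurements from the talk.

```python
def cycles_sequential(iterations, latency):
    """Each iteration starts only after the previous one has finished."""
    return iterations * latency


def cycles_pipelined(iterations, latency, initiation_interval=1):
    """A new iteration starts every `initiation_interval` cycles."""
    return latency + (iterations - 1) * initiation_interval


ITERATIONS = 1000  # assumed loop trip count
LATENCY = 4        # assumed cycles per iteration

print(cycles_sequential(ITERATIONS, LATENCY))
print(cycles_pipelined(ITERATIONS, LATENCY))
```

Under these assumptions the pipelined loop needs 1003 cycles instead of 4000, so the cost approaches one cycle per iteration for long loops.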
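Stencil vectorization
• Shifting elements within a vector (standard shuffle API is not supported)
// Assumption: realX in the original listing is float and VECS is 16
__attribute__((always_inline))
inline float16 getM1(const float a, const float16 b) {
    const float *ptr2 = (const float *)&b;   // view the input vector as scalars
    float16 out;
    float *o = (float *)&out;                // view the output vector as scalars
    o[0] = a;                                // element shifted in from the previous vector
    __attribute__((opencl_unroll_hint(15)))
    for (int i = 1; i < VECS; ++i) {
        o[i] = ptr2[i-1];                    // move every element one lane up
    }
    return out;
}
Scalar form: X[i] = Y[i-1]
Vector form: X[i] = getM1(Y[i-1][15], Y[i]);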
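The effect of `getM1` is easiest to see on plain lists. The Python model below treats each vector as a list of 16 values and checks that the vector form reproduces the scalar rule X[i] = Y[i-1] on the flattened data. The name `get_m1` and the test data are illustrative.

```python
VECS = 16  # lanes per vector, matching float16


def get_m1(a, b):
    """Shift vector `b` one lane up and insert scalar `a` into lane 0."""
    assert len(b) == VECS
    return [a] + b[:-1]


flat = list(range(48))  # three vectors of 16 elements
Y = [flat[v * VECS:(v + 1) * VECS] for v in range(3)]

# Vector form: X[v] = get_m1(last lane of Y[v-1], Y[v])
X1 = get_m1(Y[0][VECS - 1], Y[1])
X2 = get_m1(Y[1][VECS - 1], Y[2])
print(X1)
```

Each output vector holds the input shifted by one element, with the gap in lane 0 filled from the last lane of the previous vector.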
Memory ports
• Memory supports two accesses per single array, so a pipelined loop that reads the same array three times is split into two loops
Before (three reads of bramY per iteration):
calc_0: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramY[kj-off] + bramY[kj] + bramY[kj+off];
}
After (at most two accesses to each array per iteration):
calc_0: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramY[kj-off] + bramY[kj];
}
calc_1: __attribute__((xcl_pipeline_loop))
for (int kj = 0; kj < tKM; ++kj)
{
    bramX[kj] = bramX[kj] + bramY[kj+off];
}
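That the split version computes the same values can be checked with a short Python model. It restricts the index range so that every access stays inside the array, which is an assumption; the slides do not show how the borders are handled. Function names are illustrative.

```python
def one_loop(y, off):
    """Single loop: three reads of y per iteration."""
    n = len(y)
    return [y[k - off] + y[k] + y[k + off] for k in range(off, n - off)]


def two_loops(y, off):
    """Split version: no loop touches the same array more than twice."""
    n = len(y)
    # First loop: two reads of y, one write to x.
    x = [y[k - off] + y[k] for k in range(off, n - off)]
    # Second loop: one read and one write of x, one read of y.
    return [x[k - off] + y[k + off] for k in range(off, n - off)]


y = list(range(10))
print(one_loop(y, 2) == two_loops(y, 2))
```

The additions happen in the same order in both versions, so the results are identical; the cost is one extra pass over the output array.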
Regionalization
• Independent regions in the code should be explicitly separated
• This helps the compiler distribute the code among the LUTs
• The separation can be done by adding braces around independent code blocks
{
    //the first block of instructions
}
{
    //the second block of instructions
}
CPU implementation
• Our CPU implementation utilizes two processors:
– Intel® Xeon® CPU E5-2695 v2, 2.4–3.2 GHz (2 x 12 cores)
• The code adaptation includes:
– Utilization of all 24 cores
– Loop transformations
– Memory alignment
– Thread affinity
– Data locality within nested loops
– Compiler optimizations
• The final simulation throughput is 3.7 GB/s
• The power dissipation is 142 W
FPGA optimizations
Results
                     FPGA      2xCPU     Ratio 2xCPU/FPGA
Exec. time [s]       11.4      18.0      1.6
Throughput [MB/s]    5840.8    3699.2    0.6
Power [W]            101.0     142.0     1.4
Energy [J]           1151.4    2556.0    2.2
[Figure: Throughput [MB/s], the higher the better: FPGA 5840.8, 2xCPU 3699.2]
[Figure: Energy [J], the lower the better: FPGA 1151.4, 2xCPU 2556.0]
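The ratio column of the results table can be reproduced by dividing each 2xCPU value by the corresponding FPGA value. The short Python check below does that and also multiplies throughput by execution time to confirm that both platforms processed about the same amount of data. The dictionary layout and variable names are illustrative.

```python
# (FPGA value, 2xCPU value) taken from the results table
results = {
    "Exec. time [s]": (11.4, 18.0),
    "Throughput [MB/s]": (5840.8, 3699.2),
    "Power [W]": (101.0, 142.0),
    "Energy [J]": (1151.4, 2556.0),
}

# Ratio as used in the table: 2xCPU value divided by FPGA value.
ratios = {name: round(cpu / fpga, 1) for name, (fpga, cpu) in results.items()}

# Data volume [MB] = throughput [MB/s] * execution time [s]
fpga_volume_mb = 5840.8 * 11.4
cpu_volume_mb = 3699.2 * 18.0

print(ratios)
print(round(fpga_volume_mb), round(cpu_volume_mb))
```

A ratio above 1 favors the FPGA for time, power, and energy, while the throughput ratio below 1 says the same thing from the other direction, since higher throughput is better.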
byteLAKE’s ecosystem of partners
Complete solutions for CFD market
➢ HPC system design, build-up and configuration
➢ HPC software applications development and optimization to make the most of the hardware
… and more
More at: byteLAKE.com/en/CFD
Accelerated CFD Kernels
Compatible with geophysical models like EULAG
CFD Kernels: Advection, Pseudovelocity, Divergence, Thomas algorithm
• Faster time to results and more efficient processing compared to CPU-only nodes
• 4x faster
• 80% lower energy consumption
• 6x better performance per Watt
About byteLAKE
• AI (highly optimized AI engines to analyze text, image, video, time series data)
• HPC (highly optimized apps and kernels for HPC architectures)
Contact me: krojek@byteLAKE.com
We build AI and HPC solutions. Focusing on software.
We use machine/deep learning to bring automation and optimize operations in businesses across various industries.
We create highly optimized software for supercomputers.
Our researchers hold PhD and DSc degrees.
byteLAKE
www.byteLAKE.com
Building solutions for real-life business problems
CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

CloudStudio User manual (basic edition):comworks
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Recently uploaded (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

CFD acceleration with FPGA (byteLAKE's presentation from PPAM 2019)

  • 1. CFD code adaptation to the FPGA architecture
      DSc PhD Krzysztof ROJEK, byteLAKE’s CTO
      PPAM 2019, Bialystok, Poland, September 8-11, 2019
  • 2. Background
      • Current trends in the FPGA market
      • Common FPGA applications
      • FPGA access
      • Architecture of the Xilinx Alveo U250 FPGA
      • Evaluation metrics
      • Algorithm scenario
      • Development of FPGA codes
      • Algorithm design
      • OpenCL kernel processing
      • Memory queue
      • Limitations of memory access
      • Burst memory access
      • Vectorization
      • Code regionalization
      • CPU implementation overview
      • Performance and Energy results
      • Conclusion
  • 3. Current trends in the FPGA market
  • 4. Common FPGA applications
      • Confirmed effectiveness
          – Audio processing
          – Image processing
          – Cryptography
          – Routers/switches/gateways software
          – Digital displays
          – Scientific instruments (amplifiers, radio astronomy, radars)
      • Current challenges
          – Machine learning
          – Deep learning
          – High Performance Computing (HPC)
  • 5. FPGA access
      • Test drive in the cloud
          – Nimbix: High Performance Computing & Supercomputing Platform
          – Other cloud providers soon…
      • Your own cluster
          – RAM: 80 GB (16 GB for deployment only)
          – Hard disk space: 100 GB
          – OS: RedHat, CentOS, Ubuntu
          – Xilinx Runtime: the driver for Alveo
          – Deployment Shell: the communication layer physically implemented and flashed into the card
          – Xilinx SDAccel IDE: the development framework
  • 6. Xilinx Alveo U250 FPGA
      • Premiere: October 02, 2018
      • Built on the Xilinx 16nm UltraScale™ architecture
      • Memory
          – Off-chip memory capacity: 64 GB
          – Off-chip total bandwidth: 77 GB/s
          – Internal SRAM capacity: 54 MB
          – Internal SRAM total bandwidth: 38 TB/s
      • Power and thermal
          – Maximum total power: 225 W
          – Thermal cooling: passive
      • Clocks
          – KERNEL CLK: 500 MHz
          – DATA CLK: 300 MHz
  • 7. Xilinx Alveo U250 FPGA
      • The deployment shell that handles device bring-up and configuration over PCIe is contained within the static region of the FPGA
      • The resources in the dynamic region are available for creating custom accelerators
      • Resources
          – Look-up tables (LUTs): 1341 K
          – Registers: 2749 K
          – 36 Kb block RAMs: 2000
          – 288 Kb UltraRAMs: 1280
      [Figure labels: SLR0 to SLR3 dynamic regions, static region, four DDR banks]
  • 8. Is it good for you?
      • Desired features of a data center
          – Low price
          – Low energy consumption
          – High performance
          – Technical support
          – Reliability and fast service
      • Important metrics
          – Execution time [s]
          – Data throughput of a simulation [MB/s]
          – Power dissipation [W]
          – Energy consumption [J]
      • Questions to answer
          – How many cards are required to achieve a desired performance?
          – How many cards can I handle within a given energy budget?
          – What performance can be achieved within my energy budget?
          – How do these results compare with a CPU-based solution?
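The four metrics are linked by simple arithmetic: energy is power multiplied by execution time, and throughput is data volume divided by execution time. The sketch below shows how the sizing questions could be answered from per-card measurements. The function names are hypothetical, and the multi-card estimates assume ideal linear scaling, which the slides do not claim.

```c
#include <assert.h>

/* Energy [J] = average power [W] * execution time [s]. */
double energy_joules(double power_w, double time_s) {
    assert(power_w >= 0.0 && time_s >= 0.0);
    return power_w * time_s;
}

/* Throughput [MB/s] = processed data [MB] / execution time [s]. */
double throughput_mb_s(double data_mb, double time_s) {
    assert(time_s > 0.0);
    return data_mb / time_s;
}

/* Cards needed to reach a target throughput.
 * ASSUMPTION: throughput scales linearly with the number of cards. */
int cards_for_throughput(double target_mb_s, double per_card_mb_s) {
    assert(target_mb_s >= 0.0 && per_card_mb_s > 0.0);
    int n = (int)(target_mb_s / per_card_mb_s);   /* round down ... */
    if (n * per_card_mb_s < target_mb_s) {
        ++n;                                      /* ... then up if short */
    }
    return n;
}

/* Whole cards that fit into a power budget. */
int cards_within_power(double budget_w, double per_card_w) {
    assert(budget_w >= 0.0 && per_card_w > 0.0);
    return (int)(budget_w / per_card_w);
}
```

For example, `energy_joules(101.0, 11.4)` evaluates the FPGA power and execution time reported later in the deck, and `cards_for_throughput` can be fed any target of interest.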
  • 9. Real scientific scenario
      • Computational Fluid Dynamics (CFD) kernel with support for all industrial parameters and settings
      • Advection algorithm: predicts how a substance (fluid) or quantity is transported by bulk motion over time
          – An example of advection is the transport of pollutants or silt in a river by bulk water flow downstream
          – It is also the transport of energy by water or air
      • Based on the upwind scheme
      • 3D compute domain
      • Dataset (9 arrays + scalar):
          – 3 x velocity vectors
          – 2 x forces (implosion, explosion)
          – 2 x density vectors
          – 2 x transported substance (in, out)
          – t: time interval
      • Configuration:
          – Job setting (size, timestep)
          – Boundary conditions (periodic, open)
          – Data accuracy (double, single, half)
      [Figure labels: periodic domain in X dimension, open domain]
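As a reminder of what an upwind scheme does, here is a minimal 1D sketch with a periodic boundary. It is a textbook first-order upwind step, not the 3D kernel from the presentation; the function name and the use of a single Courant number `c = u*dt/dx` for the whole domain are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One explicit time step of first-order upwind advection in 1D with a
 * periodic boundary. c is the Courant number u*dt/dx; the scheme is
 * stable for |c| <= 1. The difference is always taken on the side the
 * flow comes from ("upwind"). */
void upwind_step_periodic(const float *in, float *out, size_t n, float c) {
    assert(n > 0 && c >= -1.0f && c <= 1.0f);
    for (size_t i = 0; i < n; ++i) {
        size_t left = (i + n - 1) % n;   /* periodic neighbour on the left  */
        size_t right = (i + 1) % n;      /* periodic neighbour on the right */
        if (c >= 0.0f) {
            out[i] = in[i] - c * (in[i] - in[left]);    /* flow from the left  */
        } else {
            out[i] = in[i] - c * (in[right] - in[i]);   /* flow from the right */
        }
    }
}
```

With `c = 1` the profile moves exactly one cell per step, which makes a convenient sanity check; an open boundary would replace the modulo indexing with explicit edge handling.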
  • 10. Development
      • Config, makefile, and source
  • 13. Algorithm design
      • The compute domain is divided into 4 sub-domains
      • The host sends data to the FPGA global memory
      • The host calls the kernel to execute it on the FPGA (the kernel is called many times)
      • Each kernel call represents a single time step
      • The FPGA sends the output array back to the host
      [Figure labels: CPU, FPGA, compute domain, four sub-domains, data sending, kernel call, kernel processing, data receiving, migrate memory objects, N x call, copy buffer]
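The slides do not say how the compute domain is cut into its four sub-domains. A minimal sketch, assuming a split into contiguous slabs along one dimension whose sizes differ by at most one plane, could look as follows; `Range` and `subdomain_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SUBDOMAINS 4   /* the presentation uses 4 sub-domains */

typedef struct {
    size_t start;   /* index of the first plane of the sub-domain */
    size_t count;   /* number of planes in the sub-domain         */
} Range;

/* Splits `total` planes into NUM_SUBDOMAINS contiguous ranges.
 * The first (total % NUM_SUBDOMAINS) ranges receive one extra plane. */
Range subdomain_range(size_t total, size_t idx) {
    assert(idx < NUM_SUBDOMAINS);
    size_t base = total / NUM_SUBDOMAINS;
    size_t rem = total % NUM_SUBDOMAINS;
    Range r;
    r.start = idx * base + (idx < rem ? idx : rem);
    r.count = base + (idx < rem ? 1 : 0);
    return r;
}
```

Each range would then be sent to the FPGA once, processed by one kernel call per time step, and read back at the end.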
  • 14. Kernel processing
      • The kernel is distributed across 4 SLRs
      • Each sub-domain is allocated in a different memory bank
      • Data transfer occurs between neighboring memory banks
      [Figure labels: SLR0 Kernel_A, SLR1 Kernel_B, SLR2 Kernel_C, SLR3 Kernel_D; Bank0 to Bank3, one sub-domain per bank]
  • 15. Kernels communication with pipes
      • A pipe stores data organized as a FIFO
      • Pipes can be used to stream data from one kernel to another inside the FPGA device without having to use the external memory
      • Pipes must be statically defined outside of all kernel functions
      • Pipe names must use lowercase alphanumeric characters
      • Xilinx extended OpenCL pipes with a blocking mode that allows users to synchronize kernels

          pipe int p0 __attribute__((xcl_reqd_pipe_depth(512)));
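The FIFO behaviour of a pipe can be illustrated with a small software model. This is a plain C ring buffer, not the Xilinx pipe implementation; the type and function names are hypothetical, and only non-blocking semantics are modelled (a full or empty pipe returns `false` instead of stalling the caller, as a blocking pipe would).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PIPE_DEPTH 512   /* mirrors xcl_reqd_pipe_depth(512) on the slide */

typedef struct {
    int data[PIPE_DEPTH];
    size_t head;    /* index of the oldest element   */
    size_t count;   /* number of elements in the pipe */
} Fifo;

void fifo_init(Fifo *f) {
    assert(f != NULL);
    f->head = 0;
    f->count = 0;
}

/* Non-blocking write: returns false when the pipe is full. */
bool fifo_write(Fifo *f, int value) {
    if (f->count == PIPE_DEPTH) {
        return false;
    }
    f->data[(f->head + f->count) % PIPE_DEPTH] = value;
    ++f->count;
    return true;
}

/* Non-blocking read: returns false when the pipe is empty. */
bool fifo_read(Fifo *f, int *value) {
    if (f->count == 0) {
        return false;
    }
    *value = f->data[f->head];
    f->head = (f->head + 1) % PIPE_DEPTH;
    --f->count;
    return true;
}
```

Values come out in the order they went in, which is what lets a producer kernel stream boundary data to a consumer kernel without a round trip through external memory.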
  • 16. Memory queue
      • Each array is transferred from the global memory to the fast BRAM memory
      • To minimize the data traffic we use a memory queue across iterations
      [Figure labels: global memory, BRAM]
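One plausible reading of the memory queue is a rotating window of planes held on chip: a stencil that needs planes k-1, k and k+1 fetches only the newest plane in each iteration and reuses the other two. The sketch below models that idea in plain C with a local array standing in for BRAM. The plane size, window length and function name are assumptions for illustration, not values from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PLANE 4    /* elements per plane; tiny on purpose        */
#define WINDOW 3   /* planes kept in the queue: k-1, k and k+1   */

/* Sums planes k-1, k and k+1 for every interior plane k. Plane p always
 * lives in slot p % WINDOW of the local buffer, so loading plane k+1
 * overwrites plane k-2, which is no longer needed.
 * Returns the number of plane loads from `global`. */
size_t stencil_with_queue(const float *global, float *out, size_t nplanes) {
    assert(nplanes >= WINDOW);
    float bram[WINDOW][PLANE];
    size_t loads = 0;

    /* Prime the queue with the first two planes. */
    for (size_t p = 0; p < WINDOW - 1; ++p) {
        memcpy(bram[p], global + p * PLANE, sizeof bram[p]);
        ++loads;
    }
    for (size_t k = 1; k + 1 < nplanes; ++k) {
        /* Fetch only the newest plane. */
        memcpy(bram[(k + 1) % WINDOW], global + (k + 1) * PLANE, sizeof bram[0]);
        ++loads;

        const float *lo = bram[(k - 1) % WINDOW];
        const float *mid = bram[k % WINDOW];
        const float *hi = bram[(k + 1) % WINDOW];
        for (size_t i = 0; i < PLANE; ++i) {
            out[k * PLANE + i] = lo[i] + mid[i] + hi[i];
        }
    }
    return loads;
}
```

With the queue every plane is loaded exactly once (`nplanes` loads); re-reading all three planes in each iteration would cost `3 * (nplanes - 2)` loads.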
  • 19. Memory access within a kernel
      • 31 pins are available in the Alveo U250
          – Each pointer to the global memory set as a kernel argument reserves one memory pin
          – Each kernel reserves one memory pin
      • Using 4 banks and 4 kernels we can set up to 6 pointers to the global memory
      • To send all required arrays we need to pack them into larger buffers (different for input and output data)
      • All kernel ports require 512-bit data access to reach the highest memory bandwidth
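The limit of 6 pointers is consistent with the following reading, which is an interpretation rather than something the slide states: each of the 4 kernels consumes one pin for itself plus one pin per global pointer, so 4 x (1 + 6) = 28 pins fit into 31, while 4 x (1 + 7) = 32 would not. The sketch below encodes that arithmetic and the offset calculation needed once several equally sized arrays are packed into one buffer; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of global-memory pointers per kernel such that
 * kernels * (1 + pointers) <= total_pins.
 * ASSUMPTION: pins are shared evenly between the kernels. */
int max_pointers_per_kernel(int total_pins, int kernels) {
    assert(total_pins > 0 && kernels > 0);
    int pins_per_kernel = total_pins / kernels;
    /* One pin of each kernel's share is taken by the kernel itself. */
    return pins_per_kernel > 0 ? pins_per_kernel - 1 : 0;
}

/* Element offset of array `index` inside a packed buffer that stores
 * equally sized arrays back to back. */
size_t packed_offset(size_t index, size_t elems_per_array) {
    return index * elems_per_array;
}
```

A kernel that receives one packed input buffer would read array `a` at element `packed_offset(a, n) + i` instead of through its own pointer.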
  • 20. Burst memory access/vectorization
      • Burst memory access
          – Loop pipelining
          – Port data width: 512 bits
          – Data copying separated from the computation
          – Vectorization

          void copy(__global const float16 * __restrict globMem) {
              float16 bram[tKM];
              …
              // Pipelined copy loop: each float16 element is one 512-bit access
              write_0: __attribute__((xcl_pipeline_loop))
              for (int kj = 0; kj < tKM; ++kj) {
                  bram[kj] = globMem[gIdx + kj];
              }
              …
          }

      [Figure: execution over time, traditional vs. pipelining]
  • 21. Stencil vectorization
      • Shifting elements within a vector (the standard shuffle API is not supported)

          // Returns b shifted by one lane, with a inserted at lane 0
          __attribute__((always_inline))
          inline float16 getM1(const float a, const float16 b) {
              const float *ptr2 = (const float *)&b;  // scalar view of the input vector
              float16 out;
              float *o = (float *)&out;               // scalar view of the result
              o[0] = a;
              __attribute__((opencl_unroll_hint(15)))
              for (int i = 1; i < VECS; ++i) {
                  o[i] = ptr2[i-1];
              }
              return out;
          }

      • Scalar form: X[i] = Y[i-1]
      • Vectorized form: X[i] = getM1(Y[i-1][15], Y[i]);
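The lane shift can be checked on the host with a plain C model, where a struct of 16 floats stands in for `float16`. The names `Vec16`, `shift_in` and `shift_array` are hypothetical; the model only reproduces the data movement, not the FPGA-specific unrolling.

```c
#include <assert.h>

#define VECS 16   /* lanes in a float16 */

typedef struct {
    float lane[VECS];
} Vec16;

/* Returns b shifted by one lane towards higher indices; `a` enters at lane 0. */
Vec16 shift_in(float a, Vec16 b) {
    Vec16 out;
    out.lane[0] = a;
    for (int i = 1; i < VECS; ++i) {
        out.lane[i] = b.lane[i - 1];
    }
    return out;
}

/* Vector form of x[j] = y[j-1] over nvec vectors. The element that would
 * precede y[0] is supplied as `first`; every other vector takes its lane 0
 * from the last lane of the previous vector. */
void shift_array(const Vec16 *y, Vec16 *x, int nvec, float first) {
    assert(nvec > 0);
    for (int v = 0; v < nvec; ++v) {
        float carry = (v == 0) ? first : y[v - 1].lane[VECS - 1];
        x[v] = shift_in(carry, y[v]);
    }
}
```

The carry across the vector boundary is what the slide's `Y[i-1][15]` argument provides.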
  • 22. Memory ports
      • Each array in BRAM supports at most two accesses at a time, so a loop that reads the same array three times per iteration is split into two loops
      • Before (three accesses to bramY per iteration):

          calc_0: __attribute__((xcl_pipeline_loop))
          for (int kj = 0; kj < tKM; ++kj) {
              bramX[kj] = bramY[kj-off] + bramY[kj] + bramY[kj+off];
          }

      • After (two accesses to each array per iteration):

          calc_0: __attribute__((xcl_pipeline_loop))
          for (int kj = 0; kj < tKM; ++kj) {
              bramX[kj] = bramY[kj-off] + bramY[kj];
          }
          calc_1: __attribute__((xcl_pipeline_loop))
          for (int kj = 0; kj < tKM; ++kj) {
              bramX[kj] = bramX[kj] + bramY[kj+off];
          }
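That the split loops compute the same values as the fused loop can be confirmed with a host-side model. This is plain C without the Xilinx attributes; the function names are hypothetical, and the loop bounds are restricted to the interior so that `k - off` and `k + off` stay inside the array, which the slide code leaves implicit.

```c
#include <assert.h>

/* Fused form: three reads of y per iteration. */
void stencil_fused(const float *y, float *x, int n, int off) {
    assert(off > 0 && n > 2 * off);
    for (int k = off; k < n - off; ++k) {
        x[k] = y[k - off] + y[k] + y[k + off];
    }
}

/* Split form: no loop touches the same array more than twice per iteration. */
void stencil_split(const float *y, float *x, int n, int off) {
    assert(off > 0 && n > 2 * off);
    for (int k = off; k < n - off; ++k) {
        x[k] = y[k - off] + y[k];      /* two reads of y             */
    }
    for (int k = off; k < n - off; ++k) {
        x[k] = x[k] + y[k + off];      /* one read of y, one of x    */
    }
}
```

Both versions add the terms in the same order, so the results match element by element; the cost of the split is one extra pass over `x`.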
  • 23. Regionalization
      • Independent regions in the code should be explicitly separated
      • This helps the compiler distribute the code among the LUTs
      • The separation can be done by adding braces around independent code blocks

          {
              //the first block of instructions
          }
          {
              //the second block of instructions
          }
  • 24. CPU implementation
      • Our CPU implementation utilizes two processors:
          – Intel® Xeon® CPU E5-2695 v2, 2.4 to 3.2 GHz (2 x 12 cores)
      • The code adaptation includes:
          – 24 cores utilization
          – Loop transformations
          – Memory alignment
          – Thread affinity
          – Data locality within nested loops
          – Compiler optimizations
      • The final simulation throughput is 3.7 GB/s
      • The power dissipation is 142 W
  • 26. Results

                            FPGA      2xCPU     Ratio CPU/FPGA
      Exec. time [s]        11.4      18.0      1.6
      Throughput [MB/s]     5840.8    3699.2    0.6
      Power [W]             101.0     142.0     1.4
      Energy [J]            1151.4    2556.0    2.2

      [Chart: Throughput [MB/s], the higher the better: FPGA 5840.8, 2xCPU 3699.2]
      [Chart: Energy [J], the lower the better: FPGA 1151.4, 2xCPU 2556.0]
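The ratio column can be reproduced by dividing each 2xCPU value by the matching FPGA value, for example 18.0 / 11.4 for execution time and 2556.0 / 1151.4 for energy. A small sketch of that arithmetic follows; the function names are hypothetical, and the energy-saving fraction is a derived figure that the slide itself does not quote.

```c
#include <assert.h>

/* Ratio of a CPU measurement to the matching FPGA measurement. */
double ratio_cpu_over_fpga(double cpu, double fpga) {
    assert(fpga > 0.0);
    return cpu / fpga;
}

/* Fraction of the CPU energy that the FPGA run saves: 1 - E_fpga / E_cpu. */
double energy_saving_fraction(double cpu_joules, double fpga_joules) {
    assert(cpu_joules > 0.0);
    return 1.0 - fpga_joules / cpu_joules;
}
```

For the time, power and energy rows a ratio above 1 favours the FPGA; for throughput, where higher is better, a ratio below 1 does.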
  • 27. byteLAKE’s ecosystem of partners
      • Complete solutions for the CFD market
          ➢ HPC system design, build-up and configuration
          ➢ HPC software applications development and optimization to make the most of the hardware
      • … and more
  • 28. Accelerated CFD Kernels
      • Compatible with geophysical models like EULAG
      • CFD kernels: Pseudovelocity, Divergence, Thomas algorithm, Advection
      • Faster time to results and more efficient processing compared to CPU-only nodes
          – 4x faster
          – 80% lower energy consumption
          – 6x better performance per Watt
      • More at: byteLAKE.com/en/CFD
      • About byteLAKE
          – AI (highly optimized AI engines to analyze text, image, video, time series data)
          – HPC (highly optimized apps and kernels for HPC architectures)
  • 30. byteLAKE: Building solutions for real-life business problems
      • We build AI and HPC solutions, focusing on software
      • We use machine/deep learning to bring automation and optimize operations in businesses across various industries
      • We create highly optimized software for supercomputers
      • Our researchers hold PhD and DSc degrees
      • www.byteLAKE.com