What are the Elements of High Performance Computing (HPC) Nowadays?

High-Performance Computing (HPC) in Data Science nowadays (Part 3)

Altimetrik Poland Tech Blog

9 min read · Nov 17, 2022

Series index:

1. What is Data Science nowadays?

2. Why should we think about hardware in Data Science?

3. What are the elements of High Performance Computing (HPC) nowadays?

4. Where is the synergy between HPC and Data Science? — publication coming soon

5. Project using big clusters or supercomputers — publication coming soon

Introduction

In this part, we present and describe the elements of High Performance Computing (HPC). The article focuses on the types of hardware accelerators widely used in Data Science, the kinds of electronic circuits involved, and the basics of their construction. We cover both standard, widely used circuits and more specialized ones that are only now becoming popular, and we discuss the advantages and disadvantages of each type of accelerator. We also try to connect this knowledge with the development of artificial intelligence methods, because the two topics are closely related: they drive each other’s progress.

Types of accelerators

We will look into 4 types of accelerators:

Graphics Processing Units (GPU)
Tensor Processing Units (TPU)
Cerebras
NVIDIA Grace

Progress in hardware and machine learning areas.
Source: (https://aws.amazon.com/ec2/instance-types/p4/)

The chart above shows the progress made over the years in hardware and in areas of machine learning, with hardware milestones such as the CPU, FPGA, GPU, TPU and NPU. The computational performance of these circuits has increased over time, but so have the computing power requirements of the algorithms. This is the synergy between algorithms and hardware: better algorithms need more computing power, which constantly drives the development of new computing architectures and accelerators. In this article, the focus is on the most important accelerators: the GPU and the TPU.
The GPU is a general-purpose computing accelerator. It is easy to program, and all major machine learning libraries support it. It is not the best choice in terms of computing performance per watt, but it lets you prototype solutions quickly. The TPU is more efficient, but it is more specialized hardware, designed for machine learning, and prototyping for it is more difficult than for the GPU. Its better performance-per-watt ratio matters in machine learning because we often need hundreds of GPUs or TPUs. Another direction of development is to integrate the CPU and GPU tightly in a single module, as in NVIDIA Grace.

1. Graphics Processing Units (GPU)

Nvidia Hopper Architecture.
Source: (https://www.purepc.pl/nvidia-hopper-gh100-diagram-zdradza-budowe-i-specyfikacje-gpu-nowej-generacji)

The most popular and important accelerator in the Data Science world is currently the GPU. The leading companies that produce this type of hardware are NVIDIA and AMD. In fact, the trends in Data Science are being set by NVIDIA hardware. Why are these accelerators so popular? In my opinion, it comes down to two things. First, they are fairly easy to program, thanks to the CUDA ecosystem. Second, they have relatively high computational performance, because they are “parallel computing hardware,” and this is important from a Data Science perspective.
So what is Compute Unified Device Architecture (CUDA)? CUDA was born out of an attempt to create a unified programming model, or architecture, for heterogeneous computing. The main goal was for the CPU and GPU to work together through a single programming interface. This was not easy, since they are completely different hardware architectures, but NVIDIA took on the challenge.
At this point, we can explain several CUDA concepts that are important in Data Science. Below I outline some of the most important ones.

CUDA cores

The CUDA core is the main component of any NVIDIA GPU and can be considered the heart of the system. Each CUDA core can run a programmable thread and perform simple operations such as addition and multiplication, or compute more complex mathematical functions. CUDA cores can be seen as similar to CPU cores, but each one is simpler and less powerful. So where does their advantage lie? A GPU has thousands of CUDA cores, while a CPU has far fewer cores. For example, the NVIDIA RTX 3090 graphics card has more than 10,000 CUDA cores.

Grid of Thread Blocks.
Source: (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

CUDA blocks and grids

CUDA threads are grouped into blocks, and all threads within a block execute the same instructions on a single Streaming Multiprocessor (explained further below). The programmer must divide the computation into blocks and threads. The blocks are in turn combined into grids. This division follows naturally from the CUDA C++ API.
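To make the block and grid division more concrete, here is a minimal sketch in plain Python (not CUDA) of how a problem of a given size can be split into blocks, and how a block index and a thread index map to one global element. The function names and the default of 256 threads per block are assumptions chosen for illustration; the index formula mirrors the common CUDA idiom `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch_config(n_elements, threads_per_block=256):
    """Return (blocks_per_grid, threads_per_block) covering n_elements.

    Uses ceiling division so that the last, possibly partial, block is included.
    The default of 256 threads per block is an illustrative assumption.
    """
    blocks_per_grid = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block


def global_thread_index(block_idx, block_dim, thread_idx):
    """Map a (block, thread) pair to a global element index in a 1D grid."""
    return block_idx * block_dim + thread_idx


blocks, threads = launch_config(1000)
print("blocks in grid:", blocks, "threads per block:", threads)
print("global index of thread 5 in block 3:", global_thread_index(3, threads, 5))
```

Because the grid usually covers slightly more threads than there are elements, real kernels typically check that the global index is within bounds before touching memory.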

CUDA kernels

Suppose we want to add two vectors. On a CPU, you can write the addition as a for loop, and every element of the resulting vector is computed by the same CPU thread (assuming no SIMD instructions are used). On the GPU, however, each CUDA thread produces only one entry of the result vector. A CUDA kernel is the way to tell a CUDA thread what calculation to perform: every thread runs the same kernel, but on different data. This computing paradigm is called Single Instruction, Multiple Threads (SIMT).
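The difference between the two approaches can be sketched in plain Python. This is only an emulation that assumes a 1D grid: the nested loops stand in for threads that a real GPU would run in parallel, and the names `add_kernel` and `vector_add_simt` are hypothetical, not part of any CUDA API.

```python
def vector_add_cpu(a, b):
    """CPU style: one thread walks over every element in a loop."""
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Kernel body: what a single thread does. It handles exactly one element."""
    i = block_idx * block_dim + thread_idx
    if i < len(out):  # threads past the end of the data do nothing
        out[i] = a[i] + b[i]


def vector_add_simt(a, b, threads_per_block=4):
    """Emulated launch: every (block, thread) pair runs the same kernel."""
    n = len(a)
    out = [0.0] * n
    blocks = (n + threads_per_block - 1) // threads_per_block
    # On a GPU these iterations would execute in parallel, not sequentially.
    for block_idx in range(blocks):
        for thread_idx in range(threads_per_block):
            add_kernel(block_idx, threads_per_block, thread_idx, a, b, out)
    return out


a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
print(vector_add_cpu(a, b))
print(vector_add_simt(a, b))
```

Both functions return the same result; the point is that the kernel contains no loop over the data, only the work of one thread.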

Streaming multiprocessors (SMs)

We now come to the most important part of the GPU. We already know what CUDA threads, blocks, grids and kernels are. Streaming Multiprocessors (SMs) are the second layer in the hardware hierarchy. An SM is an advanced processor inside the GPU that contains the hardware and software needed to perform computations on hundreds of CUDA threads. Modern GPUs contain dozens of SMs. For example, the RTX 3090 has 82 SMs. SMs that are physically located close to each other are further grouped into entities called Graphics Processing Clusters (GPCs).
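The SM count ties back to the CUDA core count quoted earlier. As a small worked example, assuming 128 FP32 CUDA cores per SM (the figure for the Ampere GA102 chip; this number is not stated in the article), the total can be computed as follows.

```python
SM_COUNT_RTX_3090 = 82   # from the text above
FP32_CORES_PER_SM = 128  # assumption: Ampere GA102 layout, not given in the article


def total_cuda_cores(sm_count, cores_per_sm):
    """Total CUDA cores = number of SMs times cores inside each SM."""
    return sm_count * cores_per_sm


print(total_cuda_cores(SM_COUNT_RTX_3090, FP32_CORES_PER_SM))
```

Under that assumption the product is consistent with the "more than 10,000 CUDA cores" mentioned for this card.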

Nvidia Hopper SM Architecture.
Source: (https://www.benchmark.pl/aktualnosci/nvidia-prezentuje-architekture-hopper-premiera-poteznych.html)

Tensor Cores

Tensor Cores were introduced with the Volta architecture. The technology is still being improved, and today we have the fourth generation of this circuit in the Hopper series of NVIDIA GPUs. In the early days of the Volta architecture, Tensor Cores could only perform calculations of the form D = A × B + C, where A, B, C and D are all 4×4 matrices. This type of computation is common in Deep Learning: the multiplication can be used to implement dense or convolutional layers, and the addition can be used to apply the bias. In Volta Tensor Cores, A and B must be FP16 matrices, but C and D can be either FP16 or FP32. The magic is that the hardware performs the multiplications and additions together, as fused multiply-add operations, in one clock cycle.
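The operation D = A × B + C on 4×4 matrices can be written out in a few lines of plain Python. This is a reference sketch of the arithmetic only: it says nothing about how the hardware schedules the work, it uses ordinary Python floats rather than FP16/FP32, and the function name is hypothetical.

```python
N = 4  # Volta Tensor Cores operate on 4x4 tiles


def tensor_core_mma(a, b, c):
    """Compute D = A x B + C for 4x4 matrices given as nested lists."""
    d = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = c[i][j]          # start from the accumulator (e.g. a bias)
            for k in range(N):
                acc += a[i][k] * b[k][j]  # one multiply-add per step
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
b = [[float(4 * i + j) for j in range(N)] for i in range(N)]
bias = [[1.0] * N for _ in range(N)]

for row in tensor_core_mma(identity, b, bias):
    print(row)
print("multiply-adds per tile:", N * N * N)
```

The three nested loops perform 4 × 4 × 4 multiply-add steps per tile, which is the work a Tensor Core is built to do in hardware instead of in a loop.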
In the Turing architecture, Tensor Cores were improved with support for the INT8 and INT4 types.
In the Ampere architecture, the third generation of Tensor Cores was introduced. It supports data types ranging from binary, INT4 and INT8 to FP16, TF32 and even FP64. Deep learning practitioners no longer need to use mixed precision training to benefit from Tensor Cores. This is good news, because mixed precision training is sometimes numerically unstable. With Ampere, TF32 throughput is up to 20x the FP32 throughput of Volta.
With the Hopper architecture, announced in 2022, we get the fourth generation of Tensor Cores. These should offer twice the performance of Ampere’s Tensor Cores at the same clock frequency. The expected matrix multiplication throughput is about 6x that of the A100.

Comparison between Volta and Ampere Tensor Cores.
Source: (https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf)

Transformer Engine (TE)

In my opinion, this is the most important feature of the Hopper architecture. Nowadays, transformer-based models are very popular and achieve the highest performance on many problems across Data Science, e.g. NLP, CV and speech. NVIDIA has created a dedicated ecosystem that accelerates transformer models. However, model sizes continue to grow exponentially, now reaching trillions of parameters. As a result, training time stretches to months because of the sheer volume of computation, which is impractical for business needs. Most AI floating-point calculations are performed using 16-bit “half” precision (FP16), 32-bit “single” precision (FP32) and, for specialized operations, 64-bit “double” precision (FP64).
The Transformer Engine combines software with custom NVIDIA Hopper Tensor Core technology and is designed to accelerate the training of models built from the dominant building block of modern AI: the transformer. These Tensor Cores can mix FP8 and FP16 formats to dramatically accelerate AI calculations for transformers. Tensor Core operations in FP8 have twice the throughput of 16-bit operations.
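The price of narrower formats is precision. The sketch below uses Python's standard `struct` module to round a value to FP32 and FP16 and compare it with the FP64 original. FP8 is not covered here, because `struct` has no 8-bit floating-point format; the helper names are hypothetical and this is an illustration of rounding, not of how the Transformer Engine chooses formats.

```python
import struct


def round_to(fmt, x):
    """Round a Python float (FP64) to the given IEEE 754 format and back."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def fp16(x):
    return round_to("<e", x)  # 'e' is IEEE 754 binary16 (half precision)


def fp32(x):
    return round_to("<f", x)  # 'f' is IEEE 754 binary32 (single precision)


x = 0.1
for name, value in (("FP64", x), ("FP32", fp32(x)), ("FP16", fp16(x))):
    print(f"{name}: {value!r}  abs error vs FP64: {abs(value - x):.3e}")

# FP16 has an 11-bit significand, so not every integer above 2048 is representable.
print("2049 stored in FP16:", fp16(2049.0))
```

Each halving of the bit width makes the rounding error noticeably larger, which is why mixing formats per layer, rather than using the narrowest one everywhere, is attractive.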

Hopper transformer engine.
Source: (https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf)

2. Tensor Processing Units (TPUs)

The TPU was proposed by Google. To speed up AI training, Google developed an Application-Specific Integrated Circuit (ASIC) known as the Tensor Processing Unit (TPU). But what is a Tensor Processing Unit, and how does it speed up AI workloads? TPUs are specialized for matrix and vector operations, which are essential for Deep Learning. The accelerator arranges processing elements (small DSPs with local memory) in a network, so that the elements can communicate with each other and pass data along. TPUs use high-bandwidth memory (HBM) and have scalar, vector and matrix units (MXUs) in each core. Each MXU performs 16K multiply-accumulate operations per cycle. MXU inputs and outputs are 32-bit floating-point values, while the multiplications themselves are performed at reduced bfloat16 precision. The cores execute user computations (XLA ops) independently of one another.
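The relationship between FP32 and bfloat16 is simple enough to emulate: bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa bits of an FP32 value. The sketch below does this by truncating the lower 16 bits. Note the assumption: real hardware typically rounds to nearest rather than truncating, and the function name is hypothetical.

```python
import struct


def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the upper 16 bits of the FP32 encoding.

    Assumption: simple truncation; actual hardware usually rounds to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # FP32 bit pattern
    bits &= 0xFFFF0000                                   # drop low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 3.14159, 1e30, 0.001):
    print(value, "->", to_bfloat16(value))
```

Because the exponent field is unchanged, very large and very small values keep their magnitude; only the number of significant digits drops.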
CPUs and GPUs are good for rapid prototyping and medium workloads. When we need more power, we should consider more specialized TPUs. How much faster are they? According to Google, TPUs have “15–30x higher performance and 30–80x higher performance per watt than today’s CPUs and GPUs.”

TPUv3 architecture.
Source: (https://www.hpctech.co.jp/catalog/gtc22-whitepaper-hopper_v1.01.pdf)

TPU Pods — Connected TPUs in single cluster.
Source: (https://insidehpc.com/2019/05/google-cloud-tpu-pods-speed-machine-learning/)

3. Cerebras

Many new architectures are currently emerging to accelerate machine learning, because the demand for computing continues to grow and more efficient systems are being sought. Major companies such as Intel, NVIDIA, Google and IBM, as well as the world’s top universities, are in the race.
The trend is to create more parallel hardware, as many machine learning algorithms can be parallelized. Designers are looking for ways to put as many computing cores and as much memory as possible on a single chip.
One of the largest chips created so far comes from Cerebras. It has 850,000 tensor computing cores and 40 GB of high-speed on-chip memory. The bandwidth of the on-chip fabric is about 220 Pb/s. Another important point is that this is a single-chip architecture, sometimes described as a cluster on a chip. It has many advantages, such as being easier to program than regular clusters.

Cerebras.
Source: https://www.cerebras.net/product-chip/

4. NVIDIA Grace

NVIDIA Grace is hardware that combines a CPU and a GPU in a single module (a “superchip”). This trend is known as heterogeneous systems. It is a fairly innovative approach, because usually the CPU and GPU are connected by a bus such as PCIe, which has a negative impact on data transfer performance. In Grace, everything sits in one module, with the CPU and GPU connected via the high-performance NVLink-C2C interconnect. Grace has up to 72 Arm cores and up to 512 GB of LPDDR5X memory. The GPU, meanwhile, is based on the Hopper architecture and has 96 GB of HBM3 memory. These chips can be combined into a cluster called the NVIDIA HGX Grace Hopper Superchip Platform, which can connect up to 256 of them. This gives us a full-fledged machine learning cluster.
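To get a feeling for why the CPU-to-GPU link matters, here is a back-of-the-envelope sketch. The bandwidth figures are assumptions, not values from this article: roughly 128 GB/s for a PCIe Gen5 x16 link and 900 GB/s for NVLink-C2C, both taken as peak numbers from public vendor materials. Real transfers will be slower than these ideal estimates.

```python
# Assumed peak bandwidths in GB/s (not stated in the article).
PCIE_GEN5_X16_GB_S = 128.0
NVLINK_C2C_GB_S = 900.0


def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Ideal time in milliseconds to move size_gb at the given peak bandwidth."""
    return size_gb / bandwidth_gb_per_s * 1000.0


size_gb = 96.0  # e.g. filling the 96 GB of GPU memory mentioned above
print(f"PCIe Gen5 x16: {transfer_time_ms(size_gb, PCIE_GEN5_X16_GB_S):.0f} ms")
print(f"NVLink-C2C:    {transfer_time_ms(size_gb, NVLINK_C2C_GB_S):.0f} ms")
```

Under these assumptions the tighter interconnect moves the same data roughly seven times faster, which is the practical argument for putting the CPU and GPU so close together.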

NVIDIA Grace.
Source: (https://www.nvidia.com/en-us/data-center/grace-cpu/)

NVIDIA Grace Hopper Superchip Logical Overview.
Source: (https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper)

Summary

What did we learn in this part of the series?

We learned that the development of hardware goes hand in hand with the development of artificial intelligence algorithms.
We realized that artificial intelligence algorithms have huge computational demands.
We became familiar with the most important types of hardware in modern artificial intelligence (GPU, TPU, combination of CPU and GPU).
We learned the basics of building some of the most important hardware architectures for artificial intelligence.
We learned what gives these hardware architectures an advantage over a standard CPU.
We became aware of what’s new in NVIDIA’s new cards from an artificial intelligence standpoint.

Stay tuned for the next parts of the series!

Words by Patryk Binkowski, Data Scientist at Altimetrik Poland

Copywriting by Kinga Kuśnierz, Content Writer at Altimetrik Poland

https://www.linkedin.com/in/kingakusnierz/