What is Data Science nowadays?
High-Performance Computing (HPC) in Data Science nowadays (Part 1)
I would like to announce a series of articles about High-Performance Computing in Data Science nowadays. This is a critical topic in Data Science, but unfortunately it is often overlooked, which is a big mistake. Many data scientists and engineers forget about hardware optimization and its impact on algorithm design. In this series of articles, I will try to draw your attention to this perspective.
Why did I choose this subject? I see a significant gap here, because few people combine High-Performance Computing (HPC) with Data Science (DS). We have HPC experts and DS experts separately, but not many people are able to combine these specializations. Connecting the two fields is essential, because it has a substantial impact on business operations.
This series will be divided into stages that will be published sequentially. So what will they be about? I will present the following topics:
1. What is Data Science nowadays?
This section will be about what we currently mean by Data Science. How does DS connect with other fields of science? What is specific to today’s data science projects? Why do we need such considerations at all? It will be a general introduction to the series.
2. Why should we think about hardware in Data Science?
This section will show why we should think about hardware. What are the benefits of using the correct hardware for our problem? I will tell you about it from a business and algorithmic perspective. Not much is said about it, but there are applications where hardware optimization will bring huge profits. We will take it into account.
3. What are the elements of High Performance Computing (HPC) nowadays?
In this section, we’ll go through the basics of hardware. You will get to know how hardware has developed over the years and how that development was connected with the development of artificial intelligence theory. This will be the first introduction to the intersection of Artificial Intelligence and hardware. We will also discuss generally available solutions as well as more specialized ones. This is an essential introduction to the possibilities we have for speeding up AI and reducing its cost.
4. Where is the synergy between HPC and Data Science?
This section will focus more on the intersection between HPC and DS. We will pay particular attention to clusters, supercomputers and grids. This is what gives many people a headache, because people think sequentially, while multicore computers work in parallel. After all, a person can do only one thing at a time. Thinking in parallel is not natural for us, but it is necessary for designing effective algorithms.
5. Project using big clusters or supercomputers
This section will present real uses of clusters and supercomputers in projects. I will show companies that lead the way in this field and cite a few examples that are happening now: Large Language Models (LLMs), self-driving cars, playing chess and Go, and even designing nuclear reactors.
6. How to use HPC in Data Science (libraries)?
This section will cover the libraries and engineering techniques used to build HPC software. We will go through some of the most important approaches and libraries, and I will present good practices for designing high-performance, scalable software. It isn’t an easy task, and it is often intellectually challenging.
High-Performance Computing (HPC) in Data Science nowadays
(Part 2) — What is Data Science nowadays?
In this part, I will look into what precisely Data Science is nowadays. This topic is very broad, because it can be viewed from many perspectives. One classical point of view is to divide Data Science by area: Natural Language Processing, Computer Vision, Reinforcement Learning, etc. Today I would like to talk about a slightly different perspective, one more connected with High-Performance Computing and the scale of problems. Much less is said about scale and the issues that come with it, so I concentrate on this problem in this series of articles.
Okay, so what is Data Science from my perspective? In sum, I can say:
- Mathematics, especially complicated numerical algorithms
- Huge data volume (BigData)
- Huge demand for computational power
- Hardware optimization
- Distributed systems
Does it look simple and inconspicuous? Nothing could be further from the truth. Behind these slogans hide many important and difficult areas of science and engineering. Please be a little patient and I will develop these points, but first let me leave you a question to ponder:
Do you think the points above apply in practice? Let’s look at the industries in which we use Data Science. I’ll show you an example table.
Where in this table can you find applications of the points mentioned above?
Let us take the search engine of an e-commerce shop, e.g. Amazon. How many products and customers do you think Amazon has? It is about 350 million products. And how many potential customers, that is, visitors? About 195 million per month. Imagine how many queries are sent to the search engine per minute. Do you see the scale? Do you wait long for the search results? Probably not! It is a simple problem, but how do you solve it at that scale?
Another example is Finance. Do you like to pay by card? I think so. Take a look from, e.g., Visa’s perspective. How many transactions do they handle per second across the whole world? Do you wait very long for payment acceptance? I suppose not. How is it possible?
For these and the other cases from the table, the answer is the synergy between software and hardware. This series of articles will look at such use cases and highlight how to ensure adequate performance.
We’ve covered some examples, so now let’s come back to the points and things related to them.
Mathematics, especially complicated numerical algorithms
This is common to every field of Data Science. Whether we work with Computer Vision (CV), NLP or RL, we must solve optimization problems on a huge scale. What is an optimization problem? It is the problem of finding the best solution among all feasible solutions. In DS, we optimize the model in the training stage. This is a very expensive process which needs a lot of mathematical computation. It often takes weeks or months on clusters or supercomputers; we will talk about this further. The current trend is to perform these costly calculations once, in the training phase. This is often called pre-training the model. Today’s deep learning is often a very big optimization problem, because we must find billions of model parameters that depend on each other.
These models are often only approximations of problem solutions, but if we train them on a huge number of possible cases, we get the illusion that the problem is solved optimally for each case, which in fact is not true. This is why models are pre-trained: the point is to reduce the cost at the inference stage. At this point, we slowly move on to the second point, described below.
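To make the idea of an optimization problem concrete, here is a minimal sketch of training as optimization. It fits a single parameter `w` of the toy model `y = w * x` with gradient descent. The data, the learning rate and the function names are my own illustrative assumptions; real deep learning models do the same kind of thing with billions of parameters.

```python
# Toy example: find the parameter w that minimizes the mean squared error
# of the model y = w * x. All numbers here are illustrative assumptions.

def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x on the data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_grad(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, w0=0.0, lr=0.05, steps=200):
    """Repeatedly move w against the gradient to reduce the loss."""
    w = w0
    for _ in range(steps):
        w -= lr * mse_grad(w, xs, ys)
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]  # generated by y = 3 * x
w_fit = gradient_descent(xs, ys)
print(f"fitted w = {w_fit:.4f}, loss = {mse_loss(w_fit, xs, ys):.6f}")
```

Every step needs one pass over the data to compute the gradient, which is why the cost grows with both the number of parameters and the size of the dataset.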
Huge data volume (BigData)
This point is strictly connected with the first one. What is the easiest way to get the optimal solution to a problem? Of course, the answer is an exhaustive search, which means looking at all possible cases and choosing the best one. What if that is impossible, because the solution space is enormous? In this case, we can only show the model a subset of this space and hope that it will notice some dependencies that we have not shown it. Do you think this reduces the problem a lot? Practice shows that it does not really, because this subset is often still huge. Datasets that are small today can become huge tomorrow. At this point, it is essential to consider the role of data volume in this process.
Massive datasets have been created recently. Big datasets are necessary for training large-scale machine learning models. I can cite a few examples. The first, LAION-5B, is an image-text pair dataset. It contains 5.85 billion image-text pairs and weighs about 240TB. The second is the language dataset prepared by BigScience: 1.5TB of clean text. Do you think that is a lot? How much computational power do you think we need to train a good model on such a dataset? For the first dataset, no one has even tried it yet; at this moment, only CLIP embeddings have been prepared.
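As a small illustration of why exhaustive search stops being an option, here is a sketch that brute-forces a tiny selection problem: pick items so that the total value is maximal and the total weight stays within a limit. The problem, the numbers and the function names are my own assumptions, chosen only to show how fast the number of cases grows.

```python
from itertools import product

def exhaustive_search(values, weights, capacity):
    """Try every include/exclude combination and keep the best feasible one."""
    best_value, best_choice = 0, tuple(0 for _ in values)
    for choice in product((0, 1), repeat=len(values)):
        weight = sum(w for w, c in zip(weights, choice) if c)
        value = sum(v for v, c in zip(values, choice) if c)
        if weight <= capacity and value > best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice

def search_space_size(n_items):
    """Number of include/exclude combinations for n_items items."""
    return 2 ** n_items

values = [60, 100, 120]
weights = [10, 20, 30]
best_value, best_choice = exhaustive_search(values, weights, capacity=50)
print(best_value, best_choice)

for n in (3, 20, 50):
    print(f"{n} items -> {search_space_size(n):,} cases to check")
```

With three items there are only eight cases, so checking all of them is trivial. With fifty items there are already more than a quadrillion, and real models face solution spaces far larger than that.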
Huge demand for computational power
Take an example dataset from the previous paragraph: the BigScience 1.5TB of clean text. What is this dataset used for? To train a big multilingual language model, tr11-176B. What is this model trained on? The answer is here:
This is the Jean Zay supercomputer. The Jean Zay computer was installed at IDRIS, the national computing centre for the CNRS (Centre national de la recherche scientifique), during the first half of 2019, with a peak performance of 15.9 Pflop/s (15.9 million billion floating-point operations per second). An extension to the supercomputer was installed during the summer of 2020, bringing the peak performance to 28 Pflop/s. The configuration of Jean Zay makes it possible to extend the classic usage modes of high-performance computing (HPC) to new usages in artificial intelligence (AI).
What is most important in it? Of course, the Graphics Processing Units (GPUs). It is equipped with 416 A100 80GB GPUs, 384 of which were used to train the model. Why use GPUs rather than Central Processing Units (CPUs)? The main reason is simple: for highly parallel numerical work, such as the matrix operations in deep learning, GPUs are much faster than CPUs. Many such computations can be moved from the CPU to the GPU to increase computing performance.
How long will this training take? Approximately three to four months. That alone is hard to imagine. How is it possible in practice? They use that big machine, and it still takes about four months. Can you imagine a situation where this training would take much more time? Yes, it is possible.
This point is closely related to the previous one. If we forget about hardware optimization, this training will not finish in four months. It is possible thanks to GPUs and good parallelization of tasks, and this is the whole magic. How do you do this correctly? Try to look at this system as 416 separate computers that communicate with each other. In this case, researchers used the Microsoft DeepSpeed library to distribute and parallelize computations. Below, I show some ideas for parallelizing and distributing computations.
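To get a feeling for these numbers, here is a rough back-of-the-envelope calculation. It uses the 384 GPUs and the three to four months mentioned above. The 30-day month and the idea of running everything on one GPU with perfect scaling are my simplifying assumptions; in reality a model of this size would not even fit into the memory of a single GPU.

```python
GPUS_USED = 384            # number of GPUs used for training, from the text
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30        # assumption: a rounded month
HOURS_PER_YEAR = 24 * 365

def gpu_hours(gpus, months):
    """Total GPU-hours when all GPUs run continuously for the given months."""
    return gpus * months * DAYS_PER_MONTH * HOURS_PER_DAY

def single_gpu_years(total_gpu_hours):
    """Years one GPU would need for the same work under ideal linear scaling."""
    return total_gpu_hours / HOURS_PER_YEAR

for months in (3, 4):
    hours = gpu_hours(GPUS_USED, months)
    print(f"{months} months on {GPUS_USED} GPUs = {hours:,} GPU-hours, "
          f"or roughly {single_gpu_years(hours):.0f} years on a single GPU")
```

Even under these optimistic assumptions, a single GPU would need on the order of a century, which is why distributing the work is not optional.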
Three parallelism dimensions are used to get a very high model throughput:
- Tensor Parallelism: 4 (each tensor is split up into multiple chunks processed separately and in parallel on different GPUs)
- Pipeline Parallelism: 12 (the model is split up vertically (layer-level) across multiple GPUs)
- Data Parallelism: 8 (the model is replicated multiple times, each replica being fed a slice of the data)
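The three numbers above multiply to exactly the 384 GPUs used for training. The sketch below checks that and shows one possible way to give every GPU a position in this three-dimensional grid. The ordering of the dimensions and the function names are my own assumptions for illustration; they are not taken from DeepSpeed or from the actual training setup.

```python
TENSOR_PARALLEL = 4     # each tensor is split across 4 GPUs
PIPELINE_PARALLEL = 12  # the layers are split into 12 pipeline stages
DATA_PARALLEL = 8       # the whole model is replicated 8 times

def total_gpus(tp, pp, dp):
    """Number of GPUs needed for the given parallelism degrees."""
    return tp * pp * dp

def rank_to_coords(rank, tp=TENSOR_PARALLEL, pp=PIPELINE_PARALLEL):
    """Map a flat GPU index to (data, pipeline, tensor) coordinates.

    Assumed layout: tensor index changes fastest, data index slowest.
    """
    tensor_idx = rank % tp
    pipeline_idx = (rank // tp) % pp
    data_idx = rank // (tp * pp)
    return data_idx, pipeline_idx, tensor_idx

gpus = total_gpus(TENSOR_PARALLEL, PIPELINE_PARALLEL, DATA_PARALLEL)
gpus_per_replica = TENSOR_PARALLEL * PIPELINE_PARALLEL
print(f"GPUs in total: {gpus}, GPUs per model replica: {gpus_per_replica}")
print("first GPU:", rank_to_coords(0))
print("last GPU:", rank_to_coords(gpus - 1))
```

In this layout, one copy of the model occupies 48 GPUs (4 x 12), and 8 such copies each work on their own slice of the data.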
The important word here is tensor: a generalization of the concept of a vector, a quantity (an array of numbers) whose properties remain identical regardless of the selected coordinate system. Simply put, it is a mathematical object that is able to store multidimensional data.
Put simply, distributed systems consist of many separate computers connected together. Clusters and supercomputers are just such systems. For now, that is all I will say on this topic.
This is just a preview of what the following parts of the series will cover. This article shows my point of view on what Data Science, or perhaps large-scale Data Science, is nowadays. More details will be discussed in the following articles.
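Here is a small sketch of what a tensor looks like in code and of the idea behind Tensor Parallelism from the list above. A weight matrix is split into four column blocks, each block is multiplied separately (as if on its own GPU), and the partial results are glued back together. It uses plain Python lists to stay self-contained; the helper names and the tiny matrices are my own assumptions, and real libraries do this on GPU tensors with communication between devices.

```python
def shape(tensor):
    """Return the dimensions of a nested-list tensor, e.g. a 2 x 4 matrix -> (2, 4)."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def split_columns(matrix, parts):
    """Split a matrix into `parts` column blocks of equal width."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix]
            for p in range(parts)]

def concat_columns(blocks):
    """Glue column blocks back together, row by row."""
    return [sum((block[i] for block in blocks), [])
            for i in range(len(blocks[0]))]

x = [[1, 2], [3, 4]]              # input, shape (2, 2)
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weights, shape (2, 4)

shards = split_columns(w, parts=4)              # one shard per simulated GPU
partial = [matmul(x, shard) for shard in shards]
combined = concat_columns(partial)

print("weight shape:", shape(w), "shard shape:", shape(shards[0]))
print("combined result:", combined)
print("same as unsplit product:", combined == matmul(x, w))
```

Each simulated GPU only ever holds a quarter of the weights, yet the combined result is identical to the unsplit product.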
What did we learn in this part of the series?
- The mathematical foundations on which modern Data Science is based.
- The intuition that the size of the data really does matter.
- The intuition that as the amount of data grows, the required computing power also grows.
- That hardware plays a key role in training contemporary models.
- That scaling a single compute unit infinitely is impossible, so we need to distribute the computation.
Stay tuned for the next parts of the series!
Words by Patryk Binkowski, Data Scientist at Altimetrik Poland
Copywriting by Kinga Kuśnierz, Content Writer at Altimetrik Poland