How Much Data Is Required for Machine Learning?

Eugene Dorfman

25 Mar 2022

01 Factors that influence the size of datasets you need

02 What is the optimal size of AI training data sets?

03 How to deal with the lack of data

04 Importance of quality data in healthcare projects

05 Need data for an ML project? We will get you covered!

If you ask any data scientist how much data is needed for machine learning, you’ll most probably get either “It depends” or “The more, the better.” And the thing is, both answers are correct.

It really depends on the type of project you’re working on, and it’s always a great idea to have as many relevant and reliable examples in the datasets as you can get to receive accurate results. But the question remains: how much is enough? And if there isn’t enough data, how can you deal with its lack?

Our experience with various projects involving artificial intelligence (AI) and machine learning (ML) has allowed us at Postindustria to work out effective ways to approach the data quantity issue. This is what we’ll talk about below.

Factors that influence the size of datasets you need

Every ML project has a set of specific factors that impact the size of the AI training data sets required for successful modeling. Here are the most essential ones.

The complexity of a model

Simply put, it’s the number of parameters that the algorithm should learn. The more features the model takes into account, and the larger and more variable its expected output, the more data you need to feed it. For example, say you want to train a model to predict housing prices. You are given a table where each row is a house, and the columns are the location, the neighborhood, the number of bedrooms, floors, bathrooms, etc., and the price. In this case, you train the model to predict prices based on how the variables in the columns change. And to learn how each additional input feature influences the output, you’ll need more data examples.

The complexity of the learning algorithm

More complex algorithms generally require a larger amount of data. If your project needs standard ML algorithms that use structured learning, a smaller amount of data will be enough. Even if you feed the algorithm more data than is sufficient, the results won’t improve drastically.

The situation is different when it comes to deep learning algorithms. Unlike traditional machine learning, deep learning doesn’t require feature engineering (i.e., constructing input values for the model to fit into) and is still able to learn the representation from raw data. Deep learning algorithms work without a predefined structure and figure out all the parameters themselves. In this case, you’ll need more data that is relevant for the algorithm-generated categories.

Labeling needs

Depending on how many labels the algorithm has to predict, you may need different amounts of input data. For example, if you want to sort pictures of cats from pictures of dogs, the algorithm needs to learn some representations internally, and to do so, it converts input data into these representations. But if it’s just finding images of squares and triangles, the representations that the algorithm has to learn are simpler, so the amount of data it’ll require is much smaller.

Acceptable error margin

The type of project you’re working on is another factor that impacts the amount of data you need, since different projects have different levels of tolerance for errors. For example, if your task is to predict the weather, the algorithm’s prediction may be off by some 10 or 20%. But when the algorithm should tell whether a patient has cancer or not, an error may cost the patient’s life. So you need more data to get more accurate results.

Input diversity

In some cases, algorithms should be taught to function in unpredictable situations. For example, when you develop an online virtual assistant, you naturally want it to understand what a visitor of a company’s website asks. But people don’t usually write perfectly correct sentences with standard requests. They may ask thousands of different questions, use different styles, make grammar mistakes, and so on. The more uncontrolled the environment is, the more data you need for your ML project.

Based on the factors above, you can define the size of data sets you need to achieve good algorithm performance and reliable results. Now let’s dive deeper and find an answer to our main question: how much data is required for machine learning?

What is the optimal size of AI training data sets?

When planning an ML project, many worry that they don’t have a lot of data, and the results won’t be as reliable as they could be. But only a few actually know how much data is “too little,” “too much,” or “enough.”

The most common way to define whether a data set is sufficient is to apply the 10 times rule. This rule means that the amount of input data (i.e., the number of examples) should be ten times the number of degrees of freedom a model has. Usually, degrees of freedom mean the parameters the model has to learn.

So, for example, if your algorithm distinguishes images of cats from images of dogs based on 1,000 parameters, you need 10,000 pictures to train the model.
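The arithmetic behind the rule is simple enough to put into a helper. The sketch below is only an illustration: the function name and the `factor` argument are our own, and the result is a rough starting point, not a guarantee of model quality.

```python
def min_examples_rule_of_ten(num_parameters: int, factor: int = 10) -> int:
    """Rough lower bound on training examples under the 10 times rule."""
    if num_parameters < 0:
        raise ValueError("num_parameters must be non-negative")
    return factor * num_parameters


# The cats-vs-dogs example from the text: 1,000 parameters.
print(min_examples_rule_of_ten(1_000))  # 10000
```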

Although the 10 times rule in machine learning is quite popular, it only works for small models. Larger models do not follow this rule, as the number of collected examples doesn’t necessarily reflect the actual amount of training data. In the image example above, we’d need to count not only the number of rows but the number of columns, too. The right approach would be to multiply the number of images by the size of each image and by the number of color channels.
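To make that multiplication concrete, here is a minimal sketch. The data set dimensions (10,000 RGB images of 64 x 64 pixels) are hypothetical and chosen only for illustration.

```python
def raw_input_values(num_images: int, width: int, height: int, channels: int) -> int:
    """Total number of raw values in an image data set:
    images x pixels per image x color channels."""
    return num_images * width * height * channels


# Hypothetical data set: 10,000 RGB images, 64 x 64 pixels each.
total = raw_input_values(num_images=10_000, width=64, height=64, channels=3)
print(total)  # 122880000
```

So the same 10,000 pictures amount to well over a hundred million input values, which is why counting examples alone can be misleading for larger models.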

You can use the 10 times rule as a rough estimate to get the project off the ground. But to figure out how much data is required to train a particular model within your specific project, you have to find a technical partner with relevant expertise and consult with them.

On top of that, you always should remember that the AI models don’t study the data but rather the relationships and patterns behind the data. So it’s not only quantity that will influence the results, but also quality.

But what can you do if the datasets are scarce? There are a few strategies to deal with this issue.

How to deal with the lack of data

Lack of data makes it impossible to establish the relations between the input and output data, thus causing what’s known as “underfitting.” If you lack input data, you can either create synthetic data sets, augment the existing ones, or apply the knowledge and data generated earlier to a similar problem. Let’s review each case in more detail below.

Data augmentation

Data augmentation is a process of expanding an input dataset by slightly changing the existing (original) examples. It’s widely used for image segmentation and classification. Typical image alteration techniques include cropping, rotation, zooming, flipping, and color modifications.
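As a rough illustration of these alterations, the sketch below applies flipping, rotation, and cropping to a tiny grayscale image stored as nested lists. This is a simplified stand-in: real projects would normally rely on an image library, and all function names here are our own.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]


def augment(img: Image) -> List[Image]:
    """Return the original plus two altered copies: three examples from one."""
    return [img, flip_horizontal(img), rotate_90_clockwise(img)]


original = [[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]

for variant in augment(original):
    print(variant)
```

Each altered copy keeps the label of the original image, which is what lets one labeled example turn into several.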

In general, data augmentation helps in solving the problem of limited data by scaling the available datasets. Besides image classification, it can be used in a number of other cases. For example, here’s how data augmentation works in natural language processing (NLP):

Back translation: translating the text from the original language into a target one and then from the target one back to the original
Easy data augmentation (EDA): replacing words with synonyms, random insertion, random swap, random deletion, and shuffling sentence order to receive new samples and exclude the duplicates
Contextualized word embeddings: training the algorithm to use the word in different contexts (e.g., when you need to understand whether the ‘mouse’ means an animal or a tool)
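Two of the EDA operations listed above, random swap and random deletion, can be sketched in a few lines of plain Python. This is a simplified illustration under our own assumptions: the function names, the deletion probability of 0.2, and the sample sentence are hypothetical, and synonym replacement and random insertion are left out because they need a thesaurus.

```python
import random
from typing import List


def random_swap(words: List[str], rng: random.Random) -> List[str]:
    """Return a copy with two randomly chosen positions swapped."""
    out = list(words)
    if len(out) >= 2:
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Drop each word with probability p, keeping at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment_sentence(sentence: str, n_variants: int = 4, seed: int = 0) -> List[str]:
    """Generate up to n_variants new sentences, without duplicates."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = set()
    for k in range(n_variants):
        if k % 2 == 0:
            new_words = random_swap(words, rng)
        else:
            new_words = random_deletion(words, 0.2, rng)
        variants.add(" ".join(new_words))
    variants.discard(sentence)  # a variant identical to the original adds nothing
    return sorted(variants)


for variant in augment_sentence("where can i track my order"):
    print(variant)
```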

Data augmentation adds more versatile data to the models, helps resolve class imbalance issues, and increases generalization ability. However, if the original dataset is biased, so will be the augmented data.

Synthetic data generation

Synthetic data generation in machine learning is sometimes considered a type of data augmentation, but these concepts are different. During augmentation, we change the qualities of existing data (e.g., blur or crop the image so we can have three images instead of one), while synthetic generation means creating new data with similar but not identical properties (e.g., creating new images of cats based on the previous images of cats).

During synthetic data generation, you can label the data right away and then generate it from the source, predicting exactly the data you’ll receive, which is useful when not much data is available. However, while working with the real data sets, you need to first collect the data and then label each example. This synthetic data generation approach is widely applied when developing AI-based healthcare and fintech solutions since real-life data in these industries is subject to strict privacy laws.
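The "label first, then generate" idea can be shown with a toy generator that draws small binary images of squares and triangles. It is only a sketch: the shapes, the 8 x 8 image size, and all names are assumptions made for illustration, and real synthetic pipelines are far more elaborate.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = shape pixel, 0 = background


def draw_square(size: int, side: int, top: int, left: int) -> Image:
    """Filled square of the given side on an empty size x size canvas."""
    img = [[0] * size for _ in range(size)]
    for r in range(top, top + side):
        for c in range(left, left + side):
            img[r][c] = 1
    return img


def draw_triangle(size: int, side: int, top: int, left: int) -> Image:
    """Right triangle: row k of the shape has k + 1 filled pixels."""
    img = [[0] * size for _ in range(size)]
    for k in range(side):
        for c in range(left, left + k + 1):
            img[top + k][c] = 1
    return img


def generate_dataset(n: int, size: int = 8, seed: int = 0) -> List[Tuple[Image, str]]:
    """Create n labeled images; the label is known before the image exists."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.choice(["square", "triangle"])  # label chosen first
        side = rng.randint(2, size // 2)
        top = rng.randint(0, size - side)
        left = rng.randint(0, size - side)
        draw = draw_square if label == "square" else draw_triangle
        data.append((draw(size, side, top, left), label))
    return data


image, label = generate_dataset(1)[0]
print(label)
for row in image:
    print("".join("#" if px else "." for px in row))
```

No manual labeling step is needed here, because every image is produced from its label rather than the other way around.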

At Postindustria, we also apply a synthetic data technique in ML. Our recent virtual jewelry try-on is a prime example of it. To develop a hand-tracking model that would work for various hand sizes, we’d need to get a sample of 50,000-100,000 hands. Since it would be unrealistic to get and label such a number of real images, we created them synthetically by drawing the images of different hands in various positions in a special visualization program. This gave us the necessary datasets for training the algorithm to track the hand and make the ring fit the width of the finger.

While synthetic data may be a great solution for many projects, it has its flaws.

Synthetic data vs real data issue

One of the problems with synthetic data is that it can lead to results that have little application in solving real-life problems once real-life variables step in. For example, if you develop a virtual makeup try-on using photos of people with one skin color and then generate more synthetic data based on the existing samples, the app won’t work well on other skin colors. The result? The clients won’t be satisfied with the feature, so the app will cut the number of potential buyers instead of growing it.

Another issue with having predominantly synthetic data is that it can produce biased outcomes. The bias can be inherited from the original sample or arise when other factors are overlooked. For example, if we take ten people with a certain health condition and create more data based on those cases to predict how many people out of 1,000 can develop the same condition, the generated data will be biased because the original sample is biased by its small size (ten).

Transfer learning

Transfer learning is another technique for solving the problem of limited data. This method is based on applying the knowledge gained when working on one task to a new, similar task. The idea of transfer learning is that you train a neural network on a particular data set and then use the lower ‘frozen’ layers as feature extractors. Then, the top layers are trained on other, more specific data sets. For example, say the model was trained to recognize photos of wild animals (e.g., lions, giraffes, bears, elephants, tigers). Next, it can extract features from further images to do more specific analysis and recognize animal species (e.g., it can be used to distinguish the photos of lions from those of tigers).

The transfer learning technique accelerates the training stage since it allows you to use the backbone network output as features in further stages. But it can be used only when the tasks are similar; otherwise, this method can affect the effectiveness of the model.
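The frozen-backbone idea can be illustrated with a deliberately tiny toy in plain Python. Everything here is an assumption made for the sake of the sketch: the "pretrained" weights are hand-picked stand-ins for layers learned on a source task, the new top layer is a simple logistic regression, and the six training points are invented. A real project would load an actual pretrained network in a deep learning framework instead.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical "pretrained" backbone weights. They stay frozen:
# nothing in the training loop below ever updates them.
FROZEN_WEIGHTS = ((1.0, -1.0), (0.5, 0.5))


def backbone(x: Sequence[float]) -> List[float]:
    """Frozen feature extractor: fixed linear layer followed by tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_head(xs, ys, epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Train only the new top layer on features from the frozen backbone."""
    feats = [backbone(x) for x in xs]  # computed once, since the backbone is fixed
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, ys):
            err = sigmoid(b + sum(wi * fi for wi, fi in zip(w, f))) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(x: Sequence[float], w: List[float], b: float) -> int:
    f = backbone(x)
    return int(sigmoid(b + sum(wi * fi for wi, fi in zip(w, f))) >= 0.5)


# Small target task (invented): label 1 when the first value exceeds the second.
xs = [(2, 0), (1, -1), (3, 1), (0, 2), (-1, 1), (1, 3)]
ys = [1, 1, 1, 0, 0, 0]
w, b = train_head(xs, ys)
print([predict(x, w, b) for x in xs])
```

Because the backbone features are computed once and only the small head is trained, six examples are enough here. That is the same reason transfer learning speeds up training when data is limited.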

Importance of quality data in healthcare projects

The availability of big data is one of the biggest drivers of ML advances, including in healthcare. The potential it brings to the domain is evidenced by some high-profile deals that closed over the past decade. In 2015, IBM purchased Merge, a company specializing in medical imaging software, for $1bn, acquiring huge amounts of medical imaging data. In 2018, the pharmaceutical giant Roche acquired Flatiron Health, a New York-based company focused on oncology, for $2bn to fuel data-driven personalized cancer care.

However, the availability of data itself is often not enough to successfully train an ML model for a medtech solution. The quality of data is of utmost importance in healthcare projects. Heterogeneous data types are a challenge for research in this field. Data from laboratory tests, medical images, vital signs, and genomics all come in different formats, making it difficult to apply ML algorithms to all the data at once.

Another issue is the limited accessibility of medical datasets. MIT, for instance, which is considered to be one of the pioneers in the field, claims to have the only substantially sized database of critical care health records that is publicly accessible. Its MIMIC database stores health data from over 40,000 critical care patients. The data include demographics, laboratory tests, vital signs collected by patient-worn monitors (blood pressure, oxygen saturation, heart rate), medications, imaging data, and notes written by clinicians. Another solid dataset is the Truven Health Analytics database, which contains data from 230 million patients collected over 40 years based on insurance claims. However, it’s not publicly available.

Another problem is the small amount of data available for some diseases. Identifying disease subtypes with AI requires a sufficient amount of data for each subtype to train ML models. In some cases, data are too scarce to train an algorithm. In these cases, scientists try to develop ML models that learn as much as possible from healthy patient data. We must use care, however, to make sure we don’t bias algorithms towards healthy patients.

Need data for an ML project? We will get you covered!

The size of AI training data sets is critical for machine learning projects. To define the optimal amount of data you need, you have to consider a lot of factors, including project type, algorithm and model complexity, error margin, and input diversity. You can also apply the 10 times rule, but it’s not always reliable when it comes to complex tasks.

If you conclude that the available data isn’t sufficient and it’s impossible or too costly to gather the required real-world data, try to apply one of the scaling techniques. It could be data augmentation, synthetic data generation, or transfer learning — depending on your project needs and budget.

Whatever option you choose, it’ll need the supervision of experienced data scientists; otherwise, you risk ending up with biased relationships between the input and output data. This is where we, at Postindustria, can help. Contact us, and let’s talk about your ML project!