
How Much Training Data is Required for Machine Learning?

Last Updated on May 23, 2019

The amount of data you need depends both on the complexity of your problem and on the complexity of your chosen algorithm.

This is a fact, but does not help you if you are at the pointy end of a machine learning project.

A common question I get asked is:

How much data do I need?

I cannot answer this question directly for you, or for anyone. But I can give you a handful of ways of thinking about this question.

In this post, I lay out a suite of methods that you can use to think about how much training data you need to apply machine learning to your problem.

My hope is that one or more of these methods may help you understand the difficulty of the question and how it is tightly coupled with the heart of the induction problem that you are trying to solve.

Let's dive into it.

Note: Do you have your own heuristic methods for deciding how much data is required for machine learning? Please share them in the comments.

How Much Training Data is Required for Machine Learning?
Photo by Seabamirum, some rights reserved.

Why Are You Asking This Question?

It is important to know why you are asking about the required size of the training dataset.

The answer may influence your next step.

For example:

  • Do you have too much data? Consider developing some learning curves to find out just how big a representative sample is (below). Or, consider using a big data framework in order to use all available data.
  • Do you have too little data? Consider confirming that you indeed have too little data. Consider collecting more data, or using data augmentation methods to artificially increase your sample size.
  • Have you not collected data yet? Consider collecting some data and evaluating whether it is enough. Or, if it is for a study or data collection is expensive, consider talking to a domain expert and a statistician.

More generally, you may have more pedestrian questions such as:

  • How many records should I export from the database?
  • How many samples are required to achieve a desired level of performance?
  • How large must the training set be to attain a sufficient estimate of model performance?
  • How much data is required to demonstrate that one model is better than another?
  • Should I use a train/test split or k-fold cross-validation?

It may be these latter questions that the suggestions in this post seek to address.

In practice, I answer this question myself using learning curves (see below), using resampling methods on small datasets (e.g. k-fold cross-validation and the bootstrap), and by adding confidence intervals to final results.
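For example, here is a minimal sketch of that practice, assuming scikit-learn and using a built-in dataset as a stand-in for your own small dataset: repeated k-fold cross-validation with a rough confidence interval on the estimated accuracy.

```python
# Repeated k-fold cross-validation on a small dataset with a rough
# confidence interval on the estimated accuracy. The dataset and model
# are placeholders for your own problem.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in for your own data
model = RandomForestClassifier(n_estimators=100, random_state=1)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)

# Approximate 95% interval: mean +/- 2 standard errors of the fold scores.
mean, sem = scores.mean(), scores.std() / len(scores) ** 0.5
print("Accuracy: %.3f (95%% CI roughly %.3f to %.3f)" % (mean, mean - 2 * sem, mean + 2 * sem))
```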

What is your reason for asking about the number of samples required for machine learning?
Please let me know in the comments.

So, how much data do you need?

1. It Depends; No One Can Tell You

No one can tell you how much data you need for your predictive modeling problem.

It is unknowable: an intractable problem that you must discover answers to through empirical investigation.

The amount of data required for machine learning depends on many factors, such as:

  • The complexity of the problem, nominally the unknown underlying function that best relates your input variables to the output variable.
  • The complexity of the learning algorithm, nominally the algorithm used to inductively learn the unknown underlying mapping function from specific examples.

This is our starting point.

And "it depends" is the answer that virtually practitioners volition give you the offset fourth dimension you ask.

2. Reason by Analogy

A lot of people have worked on a lot of applied machine learning problems before you.

Some of them have published their results.

Perhaps you can look at studies on problems similar to yours as an estimate for the amount of data that may be required.

Similarly, it is common to perform studies on how algorithm performance scales with dataset size. Perhaps such studies can inform you how much data you require to use a specific algorithm.

Possibly you can average over multiple studies.

Search for papers on Google, Google Scholar, and Arxiv.

3. Use Domain Expertise

You need a sample of data from your problem that is representative of the problem you are trying to solve.

In general, the examples must be independent and identically distributed.

Remember, in machine learning we are learning a function to map input data to output data. The mapping function learned will only be as good as the data you provide it from which to learn.

This means that there needs to be enough data to reasonably capture the relationships that may exist both between input features and between input features and output features.

Use your domain knowledge, or find a domain expert, and reason about the domain and the scale of data that may be required to reasonably capture the useful complexity in the problem.

4. Use a Statistical Heuristic

There are statistical heuristic methods available that allow you to calculate a suitable sample size.

Most of the heuristics I have seen have been for classification problems, as a function of the number of classes, input features, or model parameters. Some heuristics seem rigorous, others seem completely ad hoc.

Here are some examples you may consider:

  • Factor of the number of classes: There must be x independent examples for each class, where x could be tens, hundreds, or thousands (e.g. 5, 50, 500, 5000).
  • Factor of the number of input features: There must be x% more examples than there are input features, where x could be tens (e.g. 10).
  • Factor of the number of model parameters: There must be x independent examples for each parameter in the model, where x could be tens (e.g. 10).

They all look like ad hoc scaling factors to me.
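If you want to play with them anyway, here is a toy sketch that turns the three rules above into concrete numbers; the specific factors used are illustrative assumptions only, not recommendations.

```python
# Toy calculator for the three heuristics listed above. The default
# factors (50 per class, 10% more than features, 10 per parameter) are
# illustrative assumptions, not recommendations.
def heuristic_sample_sizes(n_classes, n_features, n_parameters,
                           per_class=50, feature_pct=10, per_parameter=10):
    return {
        "by number of classes": n_classes * per_class,
        "by number of features": int(n_features * (1 + feature_pct / 100.0)),
        "by number of parameters": n_parameters * per_parameter,
    }

# Example: a 3-class problem with 20 input features and ~200 model parameters.
print(heuristic_sample_sizes(n_classes=3, n_features=20, n_parameters=200))
```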

Have you used any of these heuristics?
How did it go? Let me know in the comments.

In theoretical work on this topic (not my area of expertise!), a classifier (e.g. k-nearest neighbors) is often contrasted against the optimal Bayesian decision rule, and the difficulty is characterized in the context of the curse of dimensionality; that is, there is an exponential increase in the difficulty of the problem as the number of input features is increased.

For example:

  • Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
  • Dimensionality and sample size considerations in pattern recognition practice, 1982

Findings suggest avoiding local methods (like k-nearest neighbors) for sparse samples from high dimensional problems (e.g. few samples and many input features).
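To get a feel for this effect, here is a small sketch, assuming scikit-learn, that holds the sample size fixed while growing the number of (mostly uninformative) input features; the accuracy of k-nearest neighbors will typically degrade as the dimensionality increases.

```python
# Fixed sample size, growing number of mostly uninformative input
# features; watch k-nearest neighbors accuracy fall away.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

for n_features in [5, 20, 100, 500]:
    X, y = make_classification(n_samples=200, n_features=n_features,
                               n_informative=5, n_redundant=0, random_state=1)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
    print("features=%3d  mean accuracy=%.3f" % (n_features, scores.mean()))
```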

For a kinder discussion of this topic, see:

  • Section 2.5 Local Methods in High Dimensions, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2008.

5. Nonlinear Algorithms Need More Data

The more powerful machine learning algorithms are often referred to as nonlinear algorithms.

By definition, they are able to learn complex nonlinear relationships between input and output features. You may very well be using these types of algorithms or intend to use them.

These algorithms are often more flexible and even nonparametric (they can figure out how many parameters are required to model your problem in addition to the values of those parameters). They are also high-variance, meaning predictions vary based on the specific data used to train them. This added flexibility and power comes at the cost of requiring more training data, often a lot more data.

In fact, some nonlinear algorithms, like deep learning methods, can continue to improve in skill as you give them more data.

If a linear algorithm achieves good performance with hundreds of examples per class, you may need thousands of examples per class for a nonlinear algorithm, like random forest or an artificial neural network.
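As a rough illustration, here is a sketch, assuming scikit-learn and a synthetic dataset, that compares a linear and a nonlinear model as the number of training examples grows; the dataset, models, and sizes are placeholders, not a benchmark.

```python
# Compare a linear and a nonlinear model as the training set grows.
# The synthetic dataset, models, and sizes are illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for n_samples in [100, 1000, 10000]:
    X, y = make_classification(n_samples=n_samples, n_features=20,
                               n_informative=10, random_state=1)
    for name, model in [("linear", LogisticRegression(max_iter=1000)),
                        ("nonlinear", RandomForestClassifier(n_estimators=100, random_state=1))]:
        score = cross_val_score(model, X, y, cv=5).mean()
        print("n=%6d  %-9s mean accuracy=%.3f" % (n_samples, name, score))
```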

6. Evaluate Dataset Size vs Model Skill

It is common when developing a new machine learning algorithm to demonstrate and even explain the performance of the algorithm in response to the amount of data or problem complexity.

These studies may or may not be performed and published by the author of the algorithm, and may or may not exist for the algorithms or problem types that you are working with.

I would suggest performing your own study with your available data and a single well-performing algorithm, such as random forest.

Design a study that evaluates model skill versus the size of the training dataset.

Plotting the result as a line plot with training dataset size on the x-axis and model skill on the y-axis will give you an idea of how the size of the data affects the skill of the model on your specific problem.

This graph is called a learning curve.
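Here is a minimal sketch of such a study, assuming scikit-learn and matplotlib, with a built-in dataset standing in for your own data and random forest as the single well-performing algorithm.

```python
# Learning curve: model skill (y-axis) versus training dataset size (x-axis).
# The dataset and model are stand-ins for your own problem.
import numpy as np
from matplotlib import pyplot
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = load_breast_cancer(return_X_y=True)  # replace with your own data
sizes, train_scores, test_scores = learning_curve(
    RandomForestClassifier(n_estimators=100, random_state=1), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

pyplot.plot(sizes, test_scores.mean(axis=1), marker="o")
pyplot.xlabel("Training dataset size")
pyplot.ylabel("Cross-validated accuracy")
pyplot.title("Learning curve")
pyplot.show()
```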

From this graph, you may be able to project the amount of data that is required to develop a skillful model, or perhaps how little data you actually need before hitting an inflection point of diminishing returns.

I highly recommend this approach in general in order to develop robust models in the context of a well-rounded understanding of the problem.

7. Naive Guesstimate

You need lots of data when applying machine learning algorithms.

Often, you need more data than you may reasonably require in classical statistics.

I often answer the question of how much data is required with the flippant response:

Go and use as much data as you can.

If pressed with the question, and with zero knowledge of the specifics of your problem, I would say something naive like:

  • You need thousands of examples.
  • No fewer than hundreds.
  • Ideally, tens or hundreds of thousands for "average" modeling problems.
  • Millions or tens-of-millions for "hard" problems like those tackled by deep learning.

Again, this is just more ad hoc guesstimating, but it's a starting point if you need it. So get started!

8. Get More Data (No Matter What!?)

Big data is often discussed along with machine learning, but you may not require big data to fit your predictive model.

Some problems require big data, all the data you have. For example, simple statistical machine translation:

  • The Unreasonable Effectiveness of Data (and Peter Norvig's talk)

If you are performing traditional predictive modeling, then there will likely be a point of diminishing returns in the training set size, and you should study your problems and your chosen model/s to see where that point is.

Keep in mind that machine learning is a process of induction. The model can only capture what it has seen. If your training data does not include edge cases, they will very likely not be supported by the model.

Don't Procrastinate; Get Started

Now, stop getting ready to model your problem, and model it.

Do not let the problem of the training set size stop you from getting started on your predictive modeling problem.

In many cases, I see this question as a reason to procrastinate.

Get all the data you can, use what you have, and see how effective models are on your problem.

Learn something, then take action to better understand what you have with further analysis, extend the data you have with augmentation, or gather more data from your domain.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

There is a lot of discussion around this question on Q&A sites like Quora, StackOverflow, and CrossValidated. Below are a few choice examples that may help.

  • How large a training set is needed?
  • Training set size for neural networks considering curse of dimensionality
  • How to decrease training set size?
  • Does increase in training set size help in increasing the accuracy perpetually or is there a saturation point?
  • How to choose the training, cross-validation, and test set sizes for small sample-size data?
  • How few training examples is too few when training a neural network?
  • What is the recommended minimum training dataset size to train a deep neural network?

I expect that there are some neat statistical studies on this question; here are a few I could find.

  • Sample size planning for classification models, 1991
  • Dimensionality and sample size considerations in pattern recognition practice, 1982
  • Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners, 1991
  • Predicting Sample Size Required for Classification Performance, 2012

Other related articles.

  • How much training data do you need?
  • Do We Need More Training Data?
  • The Unreasonable Effectiveness of Data (and Peter Norvig's talk)

If you know of more, please let me know in the comments below.

Summary

In this post, you discovered a suite of ways to think and reason about the problem of answering the common question:

How much training data do I need for machine learning?

Did any of these methods help?
Let me know in the comments below.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
Except, of course, the question of how much data you specifically need.
