As classification is the slightly less known little sister of prediction, I start with a small example. Classification is the prediction of a class label on the basis of several criteria or variables. Suppose the class labels are {child, adult} and we know a person reads Donald Duck. Then a reasonable classifier probably classifies this person as ‘child’. I admit this is certainly an imperfect classifier! To quantify how good a classifier is, we report its accuracy: an estimate of the proportion of correctly predicted labels.
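To make this concrete, here is a minimal sketch of the Donald Duck classifier and its accuracy. The little test set below is entirely made up for illustration:

```python
# Toy sketch of the Donald Duck classifier; the data are invented.

def classify(reads_donald_duck):
    """Predict 'child' if the person reads Donald Duck, else 'adult'."""
    return "child" if reads_donald_duck else "adult"

# Hypothetical test set: (reads_donald_duck, true_label).
people = [
    (True, "child"),
    (True, "adult"),   # an adult fan: the classifier errs here
    (False, "adult"),
    (False, "adult"),
    (True, "child"),
]

# Accuracy: the proportion of correctly predicted labels.
correct = sum(classify(reads) == label for reads, label in people)
accuracy = correct / len(people)
print(accuracy)  # 4 out of 5 correct: 0.8
```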

## Classifiers

Here are my classifiers for the two clans, Stats and AI, plus my own very subjective assessment of the accuracy of each classifier.

Thinkers vs Do-ers (80%) Statisticians are trained to start with a hypothesis, which requires thinking. Sometimes very long thinking, with the benefit of a tailored solution for the problem at hand. Practitioners may be annoyed by the wait, though, and turn to the ML community: a quick (and often reasonable) answer is guaranteed, provided their computers are fast enough.

Models vs Algorithms (90%) While both disciplines have mathy roots, statisticians stay closer to those roots by formulating their methods as mathematical models. ML has a dominant computer science component, preferring an algorithmic representation. Take your pick: formulas or recipes?

Linear vs Non-linear (95%) Possibly the most discriminating classifier. Statisticians are simple-minded people: let’s try the easiest model possible, the linear one, and derive all its properties. They know the model is wrong, but it may well fit the purpose when data is scarce. Only if it doesn’t suffice do they add non-linear terms or, more often, apply a non-linear transformation. Machine learners think the other way around: reality is often non-linear, so let’s use very flexible representations as a starting point. Then, apply data-driven techniques to prevent over-fitting, as this is a curse that may come with flexibility.
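The statistician’s trick of applying a non-linear transformation can be sketched as follows. The exponential data-generating model and all numbers are my own made-up illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 5, 50)
# Made-up curved data: y grows exponentially in x, with multiplicative noise.
y = np.exp(0.5 + 0.8 * x) * rng.lognormal(sigma=0.1, size=x.size)

# A straight line on the raw scale misses the curvature badly ...
slope_raw, intercept_raw = np.polyfit(x, y, deg=1)

# ... but after a log transformation the relation IS linear:
# log(y) = 0.5 + 0.8 * x + noise, so the simple linear model recovers it.
slope_log, intercept_log = np.polyfit(x, np.log(y), deg=1)
print(slope_log, intercept_log)  # close to the true 0.8 and 0.5
```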

Generalisation vs Optimisation (90%) Statisticians are trained to think about random errors, and build models that prevent these from impacting predictions. Hence, their learners usually generalise well: they are robust against unforeseen changes. The price to pay is that deviations from the model are interpreted as random errors, while these may also reflect modelling error. Machine learners are keen to minimise the latter, and hence focus on optimisation of complex, flexible learners. This often leads to fantastic performance in the small world encompassing the current samples. However, when one enters the great unknown (e.g. a different country, hospital or time) predictive performance may drop dramatically. Such a lack of generalisation is a serious problem for some machine learners. The Google Flu nowcast is a well-known example: a fancy learner accurately estimated the number of flu cases in the US from Google search entries, but failed miserably when applied one year later^{1}.

The generalisation-optimisation balancing act is intriguing. It seems the two clans are slowly moving towards each other on this matter: statisticians realise their models are sometimes too simple, so they adapt model components to be more flexible, while machine learners are developing many ideas to improve the generalisation of their learners.
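The small-world-versus-great-unknown point can be mimicked in a toy simulation (my own construction, not the blog’s flu example): fit a simple and a very flexible learner on one input range, then evaluate both on a shifted range:

```python
import numpy as np

rng = np.random.default_rng(1)
truth = lambda x: np.sin(2 * x)

# The "small world": training data on [0, 3].
x_train = np.linspace(0, 3, 40)
y_train = truth(x_train) + rng.normal(scale=0.2, size=x_train.size)

# The "great unknown": a shifted input range, unseen during training.
x_test = np.linspace(3, 4, 20)
y_test = truth(x_test)

simple = np.polyfit(x_train, y_train, deg=3)     # modest flexibility
flexible = np.polyfit(x_train, y_train, deg=12)  # very flexible learner

def mse(coefs, x, y):
    return np.mean((np.polyval(coefs, x) - y) ** 2)

# The flexible learner wins at home (optimisation) ...
print(mse(simple, x_train, y_train), mse(flexible, x_train, y_train))
# ... but tends to blow up outside the training range (generalisation).
print(mse(simple, x_test, y_test), mse(flexible, x_test, y_test))
```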

Sparse vs Dense (90%) Worth a blog in its own right, but let me try to keep it short. ‘Sparse’ means: only a few variables are truly relevant, whereas ‘dense’ means the opposite: many, if not all, variables are relevant, and possibly in multiple ways. Here, the two communities rigorously split: many statisticians have a sparsity fetish, whereas machine learners like superdense learners. A statistician shoots with bullets, and often misses. A machine learner shoots with lead shot, causing a lot of collateral damage.

Sparsity allows statisticians to prove mathematical theorems, which enable confidence statements on the selected variables. Nice for applications that are likely sparse (think of astronomical signals). Unfortunately, sparsity has become a panacea in statistics. I see many papers that apply sparse methods to settings that are intrinsically non-sparse, such as cancer genomics: complex diseases likely involve thousands of genes. At the other end of the spectrum, machine learners use an enormous number of parameters. Once they realise this, they smash these parameters with one hammer called regularisation. Not very subtle either. A good compromise might be to strive for parsimony, a light version of sparsity. Rather than imposing conditions on the number of relevant variables upfront, parsimony strives for a learner with as few variables as possible that predicts (almost) as well as a dense learner.
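The sparse-versus-dense contrast shows up nicely in the lasso and ridge penalties. Under the simplifying assumption of an orthonormal design, the lasso solution reduces to soft-thresholding of the least-squares coefficients, while ridge shrinks them all uniformly; the coefficient values below are invented for illustration:

```python
import numpy as np

# Invented least-squares coefficients, assuming an orthonormal design.
beta_ols = np.array([3.0, 0.5, -0.1, 2.0, 0.05])

def lasso(beta, lam):
    """Sparse: soft-thresholding sets small coefficients exactly to zero."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def ridge(beta, lam):
    """Dense: every coefficient is shrunk, but none becomes exactly zero."""
    return beta / (1.0 + lam)

lam = 0.2
print(lasso(beta_ols, lam))  # zeros appear: a sparse learner
print(ridge(beta_ols, lam))  # all shrunk, all kept: a dense learner
```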

## What should you use?

The short answer is: both. A reasonable criterion might be: use the simple, interpretable statistical learner unless the machine learner beats it by a prespecified margin on a relevant test set. Use a small margin or even reverse the burden of proof when you’re an avid machine learner who understands all the properties of your favourite learner. But whatever you prefer, be aware of the great unknown and keep your options open for a changing world.

P.S. The above list of classifiers is surely incomplete. Feel free to add your favourite as a comment to this blog, or to disagree with my subjective accuracy estimates.

^{1}For an in-depth discussion, see this article in The Conversation.

**Credits**

Main image: Clash of clans, Case_newton on pixabay.com
