For many years we have been able to watch top chef Gordon Ramsay on TV. Most of the time he comes to the rescue of ailing restaurants. He is infamous for his constant swearing, but in almost every episode he is very outspoken about just one thing: “Use fresh ingredients, not that f***ng canned or processed stuff!”
In data analysis we can learn from him. For us, too, it is extremely important to use fresh ingredients and not pre-processed ones. What do I mean by that? Just what it says: use data that retain as much of the original information as possible, just as fresh ingredients have more taste than processed ones.
Let me illustrate this with the variable ‘Age of the person’. Often, we use this variable in an already condensed state, collapsed into a few categories like ‘young’, ‘middle’ and ‘old’. That can be handy for crosstabulations or an analysis of variance, where a series of ten categories would not be very insightful. But why should we collect that variable in just these three categories? We would do better to capture as much information as possible while collecting the data.
So, ask for the age itself instead of an age category. We get even more precise information when we ask for the date of birth. Collapsing can always be done later, depending on the technique we are going to use. Then we can decide how many categories we want and what the optimal cut-off values will be. Of course, we have to consider privacy regulations: collecting information that is too detailed could make individual persons identifiable.
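To make the idea of "collapse later" concrete, here is a minimal sketch in Python. The respondents, the survey date and the cut-off values 30 and 60 are all invented for illustration, and the helper names `age_on` and `collapse` are my own, not part of any library.

```python
from bisect import bisect_right
from datetime import date


def age_on(birth: date, reference: date) -> int:
    """Age in completed years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (birth.month, birth.day)
    return reference.year - birth.year - (0 if had_birthday else 1)


def collapse(age: int, cutoffs: list[int], labels: list[str]) -> str:
    """Map an age to a label; cutoffs are the lower bounds of every category but the first."""
    return labels[bisect_right(cutoffs, age)]


# Hypothetical respondents and survey date, invented for illustration.
births = [date(1990, 5, 17), date(1958, 11, 2), date(2004, 1, 30)]
survey_date = date(2024, 6, 1)

# The 'fresh' variable: exact age derived from the date of birth.
ages = [age_on(b, survey_date) for b in births]

# The collapsed variable: decided only now, with cut-offs we are free to change.
groups = [collapse(a, [30, 60], ["young", "middle", "old"]) for a in ages]

print(ages)
print(groups)
```

Changing the cut-offs or the number of labels is a one-line change here, which is exactly what is impossible once only the category has been recorded.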
Why is this so important?
First, because pre-chosen cut-off points for categories can hide peculiarities of the distribution; maybe there is a maximum right at your cut-off point. Also, cut-off points may differ depending on the subject of the study and the techniques used. In the Netherlands, 16 years is important for education research: youngsters need to go to school at least part-time until that age. But for research on political parties 18 years is a better choice, as this is the age at which one gets the right to vote.
Collecting data ‘as fresh as possible’ also gives us the possibility of comparison with other research; we can then collapse our data into the specific categories that research used.
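As a small illustration of the two cut-offs mentioned above, the sketch below splits the same raw ages once at 16 and once at 18. The list of ages is invented, and `split_at` is a hypothetical helper written for this example.

```python
from collections import Counter

# Hypothetical raw ages, invented for illustration.
ages = [14, 15, 16, 17, 17, 18, 19, 22, 35]


def split_at(ages, cutoff):
    """Count how many ages fall below the cutoff and how many at or above it."""
    return Counter("below" if a < cutoff else "at_or_above" for a in ages)


education = split_at(ages, 16)  # compulsory-schooling boundary from the text
politics = split_at(ages, 18)   # voting-age boundary from the text

print(education)
print(politics)
```

Had the ages been recorded only as ‘under 16’ and ‘16 or older’, the second split could not be reconstructed: the respondents aged 16 and 17 would be indistinguishable from the voters.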
If this is so important, why is age so often recorded in coarse categories? This is an artefact of the old days, long ago, when data were stored on punched cards. These had 80 columns of 12 positions each. In most cases data processing was nothing more than feeding the cards into a so-called counting-sorter. You could set these machines to sort your cards into 12 bins according to the content of one single column. First you sorted them according to the column that contained the coding for the three categories of age, and then you sorted the content of each separate bin again, this time according to a further variable, like income. Voilà: you had your crosstabulation of Age vs. Income.
However clever it was, the process allowed sorting on only one column at a time. Had we recorded Age as integer numbers between 0 and 99, we would have needed two columns. Sorting first on the first of these columns, the bins would contain the ages 0-9, 10-19, etc. Sorting each of these ten stacks on the second digit would then produce 100 stacks with the ages 0, 1, 2, etc., each of which we could sort again against the variable Income. Apart from the process being cumbersome and error-prone, we would need two columns instead of one. And columns were a scarce commodity: a card had only 80 columns available to store all data, so you needed an economical coding or you had to limit the number of questions. Huge questionnaires solved this problem by using more cards for each person. But then you needed to duplicate all ‘background variables’ like Age, Sex, Education, Income, etc. on each separate card, as you could only make a crosstabulation with variables on the same card.
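The two-column procedure can be imitated in a few lines of Python. This is only a rough simulation under simplifying assumptions: a card is a short string, only the digit positions 0-9 are used rather than all 12, and `sort_on_column` is a hypothetical stand-in for one run through the machine.

```python
def sort_on_column(cards, column):
    """One pass of a simulated counting-sorter: distribute the cards over
    bins 0-9 according to the digit punched in a single column."""
    bins = {digit: [] for digit in range(10)}
    for card in cards:
        bins[int(card[column])].append(card)
    return bins


# Each card is a string whose columns 0 and 1 hold a two-digit age (invented data).
cards = ["34", "07", "39", "65", "30", "07"]

stacks = {}
# First pass: sort the whole deck on the tens digit.
for tens, tens_bin in sort_on_column(cards, 0).items():
    # Then one further pass per bin, this time on the units digit.
    for units, stack in sort_on_column(tens_bin, 1).items():
        if stack:
            stacks[10 * tens + units] = stack

print(stacks)
```

Even in this toy version the deck goes through the sorter eleven times (once for the tens digit, then once for each of the ten bins) before a single card has been set against Income, which gives a feel for why a one-column age code was so attractive.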
Collect fresh data
Do not think that this problem occurs only with a variable like Age. It is even more important to use ‘fresh’ data with variables like Price or Length. Imagine prices collapsed into a few Euro-based categories: you can never compare them with research that uses UK Pounds or US Dollars. And length in cm-based categories is hardly comparable with inch-based categories.
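A short sketch shows why the currency matters. The exchange rate, the band boundaries and the prices below are all made up for the example, and `price_band` is a hypothetical helper.

```python
from bisect import bisect_right

EUR_PER_GBP = 1.20  # hypothetical exchange rate, for illustration only


def price_band(price, bounds):
    """Index of the band a price falls in; bounds are the lower edges of bands 1, 2, ..."""
    return bisect_right(bounds, price)


bounds = [10, 50]                # bands: under 10, 10 to under 50, 50 and over
prices_eur = [9.0, 11.0, 55.0]   # hypothetical raw prices in euros

# With raw prices we can convert first and then apply another study's pound-based bands.
prices_gbp = [p / EUR_PER_GBP for p in prices_eur]

bands_eur = [price_band(p, bounds) for p in prices_eur]
bands_gbp = [price_band(p, bounds) for p in prices_gbp]

print(bands_eur)
print(bands_gbp)
```

The same three items land in different bands depending on the currency in which the boundaries are expressed. With the raw prices this is a simple conversion; with only the Euro band numbers recorded, the pound-based bands cannot be recovered.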
The techniques available some 60 years ago left us no choice but to record variables in a highly condensed form. But today there is absolutely no need for that, so we can follow the advice of the swearing chef and collect ‘fresh’ data with as much detail as we can get, in order to capture a maximum of information, just as fresh ingredients have a lot more flavour than processed ones!
Main image: Wikimedia Commons