And that’s OK!
(This is part 3 of a series – please read part 1 and 2 first)
There is a misconception that computers don’t make mistakes (which sits oddly alongside the “computer error” excuse humans use to explain their own mistakes!). And it is true that algorithms are exact – they will do precisely what they are instructed to do. 2+2 always equals 4.
However, machine learning does not give an absolute answer – it provides a prediction. It is an implementation of what is, essentially, a statistical model. The answers (or rather predictions) it returns for some input are statistically derived best predictions – not the absolute truth. The statistical model is calculated accurately – the computer cannot make an error there. It may just be that the model does not sufficiently explain the real-world problem it is trying to solve.
Let’s unpack that with an example. Suppose you have measured the weights and heights of 100 men and determined that the average mass in kilograms for men 1.8m tall is 80kg. You will have seen in your measurement activities that some men are heavier and some are lighter (but for the sake of this exercise all men are the same height). Now, you are taken to a room at the front of which is a curtain. You are told that behind the curtain are 10 men, all 1.8m tall and you are asked to determine their individual weight. What does the 1st man weigh? The 2nd?
Your best chance of being “mostly correct, on average” is to say each man weighs 80kg (the average for all men you have tested with height 1.8m). If this does not immediately make sense to you, spend some time thinking it through (or drop me an email), as this is an important point.
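To see why the average is the safest single guess, consider the error you accumulate across all ten men. A quick sketch in Python (the weights here are made up for illustration, not real measurements) shows that guessing the sample mean for everyone produces a lower total squared error than any other single guess:

```python
import random

random.seed(42)

# Hypothetical sample: 100 men, all 1.8m tall, weights scattered around 80kg.
weights = [random.gauss(80, 10) for _ in range(100)]
mean_weight = sum(weights) / len(weights)

def total_squared_error(guess, data):
    """Total squared error if we make the same guess for every man."""
    return sum((w - guess) ** 2 for w in data)

# No single guess beats the sample mean on total squared error.
for guess in (60.0, 70.0, mean_weight, 90.0, 100.0):
    print(f"guess {guess:6.1f} kg -> error {total_squared_error(guess, weights):9.0f}")
```

This is the statistical fact hiding behind the curtain example: with no information that distinguishes one man from another, the mean of your past measurements is the single value that is least wrong on average.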
If the curtain is now dropped away, and you get to see each man for the first time, you will immediately start thinking “Man number 1 looks about right, but man number 2 is heavier than normal, man number 3 looks like a power lifter – must be easily 120kg” and so on. Does this mean your approach of using past data to make future predictions was wrong?
No. You have used all the data you had at hand and applied it correctly to the unknown problem. You performed the task correctly – what was lacking was detail in your data set. You did not have enough information about each man in the sample set, or about the 10 new men, to make an accurate prediction. Those reading this who have studied statistics may want to speak about population means, standard deviations and confidence levels – and those would factor in. But for the purposes of this analogy I think the point is easier to make without additional concepts (feel free to comment below).
Clearly, our model can be improved if we add additional features to our body of knowledge. We can add Body Mass Index, Cholesterol Levels, Fitness and so on to each of the original 100 men sampled, and then label the 10 unknown men with the same characteristics. This will most likely give us different ranges (the greater the fitness level, the lower the mass, for example – though we know this is not true in all cases: a marathon runner weighs less than a body builder, but they may be equally fit). So we may want to add “maximum bench press” to the list of features.
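To make that concrete, here is a small sketch with synthetic data, in which an invented “max bench press” feature partly drives weight by construction. Fitting even the simplest one-feature model on top of the mean noticeably shrinks the prediction error compared with guessing the mean alone:

```python
import random

random.seed(0)
n = 100

# Synthetic data: "max bench press" (standardised) partly drives weight.
bench = [random.gauss(0, 1) for _ in range(n)]
weight = [80 + 5 * b + random.gauss(0, 3) for b in bench]

mean_w = sum(weight) / n
mean_b = sum(bench) / n

# One-feature least squares: slope = cov(bench, weight) / var(bench).
cov = sum((b - mean_b) * (w - mean_w) for b, w in zip(bench, weight)) / n
var = sum((b - mean_b) ** 2 for b in bench) / n
slope = cov / var

def mse(predictions, actual):
    """Mean squared error between predictions and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

mse_mean_only = mse([mean_w] * n, weight)
mse_with_feature = mse([mean_w + slope * (b - mean_b) for b in bench], weight)

print(f"mean-only MSE:    {mse_mean_only:.2f}")
print(f"with feature MSE: {mse_with_feature:.2f}")
```

Of course, the feature only helps here because we wired it into the data ourselves – which is exactly the luxury we do not have in the real world.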
It is here that things become interesting to data scientists. We have a balancing act between what to add and what not to add. Given infinite resources (money, time, people, computing power) we can add everything to a model and not concern ourselves with what to add. But we do not have infinite resources. We may not have BMI information and would need to run tests in a laboratory on all the men. This could take weeks and cost us more than we have available.
It can also be difficult to determine beforehand what is predictive and what is not. A predictive variable is one which helps us arrive at the correct prediction. It may turn out that cholesterol levels have no bearing on the weight of a person in all but the most extreme outliers. It may be, however, that cholesterol levels are highly predictive. It may be that there is a feature we have not recognised that would help us (and we may or may not have access to the data for it). Maybe the person’s income level is a predictor of mass?
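One cheap first check, before committing real resources, is to look at how strongly each candidate feature correlates with the thing we want to predict. Here is a sketch with synthetic data – one feature wired to weight by construction, and a stand-in “cholesterol” feature that is pure noise in this example:

```python
import random
import statistics

random.seed(1)
n = 100

bench = [random.gauss(0, 1) for _ in range(n)]        # predictive by construction
cholesterol = [random.gauss(5, 1) for _ in range(n)]  # pure noise in this sketch
weight = [80 + 5 * b + random.gauss(0, 3) for b in bench]

def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

print(f"bench vs weight:       {correlation(bench, weight):+.2f}")
print(f"cholesterol vs weight: {correlation(cholesterol, weight):+.2f}")
```

A caveat worth noting: correlation only catches straight-line relationships and says nothing about interactions between features – which is part of why it is so hard to know in advance what will turn out to be predictive.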
Arguing for the resources to collect data on additional features, without knowing beforehand what will be predictive and what will not, is a hard sell. We can spend months and a bucket of cash collecting cholesterol levels for each of our samples, only to find that it adds no value to the final prediction.
Statistics, as a discipline, is essentially pure – a negative result is still a result. If we are doing a study on mass for health research in a country, and we collect cholesterol levels for one hundred thousand people and determine that cholesterol has no impact on body mass, then that information is useful to health organisations. They now know that if they want to reduce obesity in a city they need not factor in cholesterol levels. But in our case of a machine learning model, the unused feature (and the “wasted” resources spent collecting the data) has a negative impact on the bottom line. And business wants to see the bottom line improve.
Models are not perfect – nor can they ever be. But given enough resources and commitment from business, and provided the additional features can be collected, predictive models can do an excellent job. It is finding the balance between expenses and return that remains a challenge – and we will look at that in a future article.