Why Models are Random

In machine learning applications one will often come across the term “random” (or in the jargon of ML, stochastic). What does it mean when we speak about random models (stochastic models) and does the “randomness” of models mean they are not reliable?

Let us begin with a simple analogy:

One day you wake up in the middle of a forest – trees reach high overhead, with a thick, viney canopy blocking out the sky. In every direction you see nothing but trees and vines. Panicking slightly, you realise you are lost. What to do?

In the distance you hear what you think is the sound of a highway – if you can get to that highway you will be rescued! The problem facing you is that the sound is very faint and with the breeze rustling the canopy, birds tweeting and the occasional howling wolf (!) you are not quite sure which way to go (your environment is “noisy” in the parlance of machine learning).

You listen intently and decide the highway is in a certain direction and you head off – after 20 steps you pause and re-evaluate. Can you hear any better from here? Which way is the highway now? You decide on a direction and head off – another 20 steps.

Stop. Evaluate. Take 20 steps. Repeat (perhaps hundreds of times).

Eventually you find yourself on the highway – right near mile marker 35. Well done!

Your path through the forest was random (stochastic), with each step taking you closer to your target on average but with some steps taking you away. Imagine in your mind’s eye the path you took, wandering across the forest floor in a “drunken stagger”. Sometimes (from the vantage point of an eagle, and with hindsight), you headed directly to the highway (so close!) and other times you moved away (oh no!). But, on average, you got closer and closer until you reached your goal.

If you were to repeat this process many, many times, each path would likely differ, and you would exit at mile marker 34, 35, 36 and so on. But in every case you would end up at the highway. This random walk, with an outcome in a narrow band of acceptable goals (reach the highway), is a stochastic process. We could repeat the exercise hundreds of times, carefully writing down each literal step we took, and then decide which run was quickest, took the fewest steps (or was most scenic!) and publish a guide, Lost in the Woods, for anyone in those exact circumstances. This would be a “model”.

Machine Learning models follow the same basic stochastic approach. They start off lost. They learn to predict an outcome by taking small steps (gradients) in the direction that leads toward their goal (a local or global minimum of the error), and when they reach a final solution, that solution is just one of many possible solutions, all “good enough” (getting to the highway solved your problem – it was not your aim to reach mile marker X in Y number of steps and Z minutes).

In short, the random nature of machine learning (stochastic gradient descent) is an approach for finding a local or global minimum, and so approximating the best solution, in a noisy environment. It is a fundamental concept underpinning machine learning and should not concern you – embrace it.
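To make the forest analogy concrete, here is a minimal sketch (my own toy example, not code from any real project) of stochastic gradient descent: we repeatedly take a small step against a noisy estimate of the slope of a simple bowl-shaped error curve, assuming, to echo the story, that the best answer sits at “mile marker 35”.

```python
import random

def noisy_gradient(x, noise=2.0):
    # True slope of the error curve (x - 35)**2 is 2*(x - 35); we add noise,
    # just as wind and birdsong muddle the sound of the highway.
    return 2 * (x - 35) + random.gauss(0, noise)

x = 0.0             # start "lost", far from the goal
step_size = 0.05    # how far we move each time we stop and re-evaluate
for _ in range(1000):
    x -= step_size * noisy_gradient(x)

print(f"Ended up near mile marker {x:.1f}")   # usually close to 35, rarely exactly on it
```

Run it a few times: you will usually end up very close to 35, but rarely in exactly the same spot – just like the many possible exits from the forest.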

 

Your model won’t explain everything

And that’s OK!

(This is part 3 of a series – please read parts 1 and 2 first)

There is a misconception that computers don’t make mistakes (which conflicts with the “computer error” excuse humans use to explain their own errors!). It is true, though, that algorithms are exact – they will do exactly what they are instructed to do. 2 + 2 always equals 4.

However, machine learning does not give an absolute answer – it provides a prediction. It is an implementation of what is, essentially, a statistical model. The answer (or rather prediction) it returns for a given input is a statistically derived best guess – not the absolute truth. The statistical model is calculated accurately – the computer cannot make an error there. It may simply be that the model does not sufficiently explain the real-world problem it is trying to solve.

Let’s unpack that with an example. Suppose you have measured the heights and weights of 100 men and determined that the average mass for men 1.8m tall is 80kg. You will have seen in your measurement activities that some men are heavier and some are lighter (but for the sake of this exercise all the men are the same height). Now, you are taken to a room at the front of which is a curtain. You are told that behind the curtain are 10 men, all 1.8m tall, and you are asked to determine their individual weights. What does the 1st man weigh? The 2nd?

Your best chance of being “mostly correct, on average” is to say each man weighs 80kg (the average for all men you have tested with height 1.8m). If this does not immediately make sense to you, spend some time thinking it through (or drop me an email), as this is an important point.
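If you want to convince yourself, here is a small sketch with made-up weights (purely illustrative): guessing the sample average gives a lower average squared error than any other single number you could offer.

```python
# Hypothetical weights (kg) of 10 men, all 1.8m tall -- invented for illustration.
weights = [72, 95, 78, 120, 80, 76, 84, 69, 88, 79]

def avg_squared_error(guess, actual):
    return sum((w - guess) ** 2 for w in actual) / len(actual)

mean_guess = sum(weights) / len(weights)
print(round(avg_squared_error(mean_guess, weights), 1))  # smallest error any single guess can achieve
print(round(avg_squared_error(70, weights), 1))          # guessing lower does worse
print(round(avg_squared_error(100, weights), 1))         # guessing higher does worse too
```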

If the curtain now drops away and you see each man for the first time, you will immediately start thinking “Man number 1 looks about right, but man number 2 is heavier than normal, and man number 3 looks like a power lifter – he must be easily 120kg”, and so on. Does this mean your approach of using past data to make future predictions was wrong?

No. You have used all the data you had at hand and applied it correctly to the unknown problem. You performed the task correctly – what was lacking was detail in your data set. You did not have enough information about each man in the sample set, or about the 10 new men, to make an accurate prediction. Those reading this who have studied statistics may want to speak about population means, standard deviations and confidence levels – and those would factor in. But for the purposes of this analogy I think the point is easier to make without additional concepts (feel free to comment below).

Clearly, our model can be improved if we add additional features to our body of knowledge. We can add Body Mass Index, cholesterol levels, fitness and so on to each of the original 100 men sampled, and then label the 10 unknown men with the same characteristics. This will most likely give us different ranges (for example, the greater the fitness level, the lower the mass; but we know this is not true in all cases: a marathon runner weighs less than a body builder, yet they may be equally fit). So we may want to add “maximum bench press” to the list of features.

It is here that things become interesting to data scientists. We have a balancing act between what to add and what not to add. Given infinite resources (money, time, people, computing power) we could add everything to a model and not concern ourselves with what to leave out. But we do not have infinite resources. We may not have BMI information and would need to run laboratory tests on all the men. This could take weeks and cost more than we have available.

It can also be difficult to determine beforehand what is predictive and what is not. A predictive variable is one which helps us reach the (correct) prediction. It may turn out that cholesterol levels have no bearing on the weight of a person in all but the most extreme outliers. It may be, however, that cholesterol levels are highly predictive. It may be that we have not recognised a certain feature that would help us (and we may or may not have access to the data for that feature). Maybe a person’s income level is a predictor of mass?

Arguing for the resources to collect data on additional features, without knowing beforehand what will be predictive and what will not, is a hard sell. We could spend months and a bucket of cash collecting cholesterol levels for each of our samples, only to find it adds no value to the final prediction.

Statistics is essentially pure. If we are doing a study on mass for health studies in a country and we collect cholesterol levels for one hundred thousand people and determine that cholesterol has no impact on body mass, then that information is useful to health organisations. They now know that if they want to reduce obesity in a city they need not factor in cholesterol levels. But in our case of a machine learning model, the unused feature (and the “wasted” resources to collect the data) has a negative impact on the bottom line. And business wants to see the bottom line improve.

Models are not perfect – nor can they ever be. But given enough resources and commitment from business, and provided the additional features can be collected, predictive models can do an excellent job. It is finding the balance between expenses and return that remains a challenge – and we will look at that in a future article.

 

Choosing data for a machine learning problem

(This is part 2 of a series – please read part 1 first)

How does one go about choosing data for a machine learning problem? What does it even mean to choose data? In this article we will discuss data types, what data to choose and why (paradoxically!) you should not choose anything!

If you have not read the previous blog in this series and you are new to the concepts of data science and machine learning, please read that first as it provides an introduction to key concepts we will use in this article.

Let us begin with a hypothetical example that is small enough, and relates well enough to our everyday human knowledge and experience, that it can serve as a learning aid.

The problem we will model is to determine what correlation (a connection between two or more things) exists between people’s height and their weight. We have an instinctive feeling for what this relationship is: the taller a person is, the more likely they are to weigh more. And we also know that this is not always true. We can see the problem in our mind’s eye, and we can see the challenges our assumptions create.

The example above is a simple linear regression model. When we collect data on the heights and weights of people we can create a scatter graph like the first one below. The numbers along the bottom form the X axis and in this case represent people’s heights. The numbers up the left-hand side form the Y axis and in this case represent people’s weights.

We can see that taller people in general weigh more than shorter people, but this does not always hold true. We can see one person who weighs around 100kg and has a height of 170cm. This is called an outlier.

[Figure: scatter plot of height (X axis) against weight (Y axis)]

 

If we now fit a linear regression line to the data, we will get the following graph.

[Figure: the same height/weight scatter plot with a fitted linear regression line]

The linear regression line tells us the expected weight for every height. So given a certain height, we can predict the weight of the person. As we can see from the graph this is not an exact science – in fact only three points lie exactly on the line. The distance between a point and where it should intercept (touch) the line is called the error. Clearly the error for our outlier is very large, while most other data points have a smaller error.

A very simple form of machine learning calculates the line that best fits the available data. The best-fit line is the one that minimises the total of all the errors. While this may sound complicated it is a fairly easy exercise. All we need to do (in essence, and skipping the maths and statistics) is measure how far each data point is from the line, square those distances and add them all up. For the curious, a common measure built this way is the Mean Squared Error.
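For readers who like to see the mechanics, here is a minimal sketch, using made-up height and weight numbers rather than the data behind the graphs above, of fitting a best-fit line with the classic least-squares formulas and reporting its Mean Squared Error.

```python
# Made-up height (cm) and weight (kg) data -- note the outlier at 170cm.
heights = [150, 160, 165, 170, 175, 180, 185, 190]
weights = [55, 62, 66, 100, 72, 80, 84, 90]

n = len(heights)
mean_h = sum(heights) / n
mean_w = sum(weights) / n

# Closed-form least-squares slope and intercept for a single feature
slope = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights)) \
        / sum((h - mean_h) ** 2 for h in heights)
intercept = mean_w - slope * mean_h

predictions = [intercept + slope * h for h in heights]
mse = sum((w - p) ** 2 for w, p in zip(weights, predictions)) / n

print(f"weight ≈ {intercept:.1f} + {slope:.2f} * height   (MSE = {mse:.1f})")
```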

Why do we have an outlier in the graph above, and what can we do to eliminate these or use them to improve our predictive capability? This is where machine learning triumphs! We can add additional features to our data sets and use these additional features to improve our prediction. Unfortunately, when we do this, we need to say good-bye to the graphs above and start using our imaginations. We will need to add (many) additional dimensions to our dataset. We can create a 3D graph using visual tricks, but we cannot produce 4D graphs or graphs with much higher dimensions.
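As a hedged illustration of what “adding a dimension” means in practice, the sketch below bolts a second, entirely hypothetical feature (weekly exercise hours) onto the same made-up data and fits a flat surface through the points instead of a line – no graph required.

```python
import numpy as np

# Same made-up people as before, plus a second, hypothetical feature:
# weekly exercise hours (invented numbers, purely for illustration).
heights = np.array([150, 160, 165, 170, 175, 180, 185, 190])
exercise = np.array([3, 4, 6, 0, 5, 4, 7, 5])
weights = np.array([55, 62, 66, 100, 72, 80, 84, 90])

# Design matrix: a column of ones (the intercept) plus one column per feature.
X = np.column_stack([np.ones(len(heights)), heights, exercise])
coeffs, *_ = np.linalg.lstsq(X, weights, rcond=None)

intercept, b_height, b_exercise = coeffs
print(f"weight ≈ {intercept:.1f} + {b_height:.2f}*height + {b_exercise:.2f}*exercise_hours")
```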

Let us begin by taking a walk through an imaginary city – we will pick a cosmopolitan city made up of a diverse range of cultures, ages, occupations and interests. We can choose New York or London, or any other large, vibrant city. What would you expect in this city? You will find health enthusiasts, comic book aficionados, vegans, fast food addicts, doctors, lawyers and personal trainers. You will have Americans, Swiss and Chinese. You will of course have males, females and transgender people. And you would expect each of these people to have a certain height (mostly a result of genetics) and a certain weight (a combination of genetics and lifestyle). Which of these characteristics (features) are important in determining the height/weight correlation? Spend some time in this imaginary place and see what you can determine. This is an important exercise and will assist you in future data selection processes.

In your exploration of the city above, did you make generalisations? Did you expect certain nationalities to be heavier on average? Did you perhaps expect a personal trainer to be fitter and leaner than an office worker? What effect on weight did you attribute to diet? With your vast (but fallible!) knowledge of humans, acquired over decades of life, you will have reached certain conclusions. If you expected an office worker to weigh more than a professional athlete, you may well be incorrect if the office worker runs ultra-marathons as a hobby. However, you may be thinking, how was I to know she runs ultra-marathons? That information was not made available to me! This is the same problem machine learning has – it can only use features that you have provided to it. And it may not always be obvious up-front what effect certain features will have on predictive capabilities.

The advantage of machine learning is that we need not determine which features to use. In fact, we should not! We should not predetermine which features are important, as this prevents the machine from learning novel relationships. We simply add everything we have to the model and allow it to find the relationships and dependencies for us! And machines have no issue with multi-dimensional analysis (other than computing power and memory). We can (and should) at a later stage remove some features – but we can investigate that in a later article.

(Some readers may wonder about the effects of multicollinearity etc but we can tackle that at a later stage)

The more features we have, the more we mitigate the risk of an algorithm suffering from incorrect (or limited) predictive ability owing to missing features. As we provide additional features to a model, so its predictive capabilities should improve. Since a machine learning algorithm is mathematically derived it does not make errors in its calculations – it makes the best predictions it can based on the information it has. It is the underlying data sets that may be insufficient to the task at hand.

One final point that needs mentioning is the importance of sample size. Sample size is the number of data points in the model. In our imaginary city it is the number of people we can see, and for whom we have detailed information. If we only have one person from Iceland, and she is a data scientist who enjoys reading Shakespeare on the weekends, it tells us very little about Icelanders, data scientists or people who enjoy Shakespeare. There simply is not enough information to determine whether she is the norm or an outlier. The more examples of “data scientists” we have, the more we can predict about the weight of a data scientist of a certain height. And similarly for Icelanders and Shakespeare enthusiasts. Our datasets need to contain a representative set of samples for as many groups as we can manage.
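Here is a tiny illustrative sketch (with an invented “true” population average) of why sample size matters: an average estimated from a single person could land almost anywhere, while an average taken over many people settles close to the true value.

```python
import random

random.seed(1)
true_mean, spread = 78, 12   # hypothetical "true" population weight statistics (kg)

for n in (1, 10, 100, 10000):
    sample = [random.gauss(true_mean, spread) for _ in range(n)]
    print(n, round(sum(sample) / n, 1))
# One person tells us almost nothing; thousands of people pin the average down.
```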

In conclusion, machine learning algorithms require as much data, with as many features, as possible to learn the best predictive models. It may not always be obvious what data will be useful and what data will not. What we need to know is what problem needs to be solved and what datasets we have available to us (or can acquire through credible external data partners). We can then allow the machine to learn the best fit for the problem area and provide us with predictions.

 

Demystifying AI for Non Data Scientists

I am often asked to explain complex technical concepts to people who are not technical. These people are often experts in their own field and are either generally curious as to the subject matter or require a non-technical grasp of the subject matter for decision making purposes.

This article is aimed at non-technical people, to assist in a basic understanding of machine learning and provide insight into the process. Data scientists may argue with some of the points or analogies I make, or feel that some parts are oversimplified.

I always find the best way to explain any subject is through an analogy, woven into a story for the more complex subjects. We build toward an understanding of the subject matter through an exploration of material the student already knows.

And we can apply this same concept to AI (or machine learning).

Before we get into the nuts and bolts of the topic, let us first take a simple analogy and work from there. If I were to program a robotic arm to make a cup of tea using conventional programming languages the task would be tedious and time consuming. Now, imagine explaining to a five year old the same task. You would begin by saying “Take a tea cup from the cupboard and place it next to the kettle. Put some water in the kettle and turn it on. When it is boiling, turn it off and pour some water into the tea cup.” The five year old may well succeed at this task, but when you analyse what you have instructed, you will realize the enormous amount of information the child had at her disposal that you did not need to explain. What is a tea cup? Where are they kept? What is water? How do I get water into the kettle? This is all domain knowledge the child has accumulated over a few years. Procedural programming languages cannot learn. The child has learned each step through trial and error over many attempts, and is able to build a new cohesive solution from these partial skills.

Taking one step closer to machine learning from the above analogy, show the child a photo of a common animal (dog, cat, horse, etc.) and ask what animal it is. You would likely be disappointed if they could not name it. But how does a child perform this feat? How do they know, instinctively, what animal it is? It could be a dog breed they have never before encountered, and yet most five year olds would recognise it immediately. What is it about a photo of a dog that says “this is a dog”? We humans just “know” things like “if it has big floppy ears it is more likely to be a dog than a cat”, and we take all these “known” facts, add them up instantly in our brain and say “dog!”. We have been able to take partial pieces of information and build them into one coherent result. These partial pieces of information are our first concept. In machine learning, the “partial pieces of information that allow us to decide on an answer” are called features.

Machines learn through a process of identifying which features are most important to a problem area and using those learned importances to make decisions. Going back to the previous example of animal identification, a feature such as “has a head” is easily dismissed by humans. ALL animals have a head (at least those that are alive!). But do they? A starfish does not – but we would still recognise it! If you have ever played “20 questions” you will understand that asking certain questions up front can lead you to a smaller subset of possibilities quickly, but can also cause you to go down the wrong track from the start. “Does it fly” is such a question. If the answer is “No” then we eliminate all birds and start thinking of land or sea animals. But the answer may have been “Ostrich”.

The machine learning algorithm has to decide which features are most important, and for which outcome. These “decisions” are called weights. Different combinations of features can be learned together, and these sets have their own weights. This allows us to overcome the “does it fly / ostrich” type of problem: “Does it fly, and is it less than 1.4m tall?”, for example.

Within the structure of a machine learning system we have a number of decision-making nodes called neurons. For the sake of simplicity we will only have one layer of neurons for our discussion, and in a future article we will add more. Each neuron is fed information about the problem domain (so each neuron gets access to all the features). Each neuron then determines the importance of the features it has been given and passes information to the output (the part that guesses the answer). If the neuron contributed to the correct answer, that is, its “instinct” was correct, it is rewarded through a reinforcement process that strengthens its decision; if it was wrong, it is told to try something else next time. This repeats for all the neurons. Once it has worked through one game of “20 questions” or “Guess the animal” it is given a second turn, and the process repeats. And it will repeat for every example we have. These examples are called samples.

And we can rerun the full set of samples multiple times (these reruns are called epochs, but we need not worry about that for now).

Over the course of many (many!) samples the machine learns which features are most likely to indicate that the animal is a dog / cat / horse etc. Some neurons will get excited when they see the pairing of “floppy ears” and “20 or more kgs”, or “retractable claws” and “slit eyes”. This excitement is called an activation. When neurons activate they send a signal to the output node, and given a specific set of activations, the output node will be able to guess what animal was shown.
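For readers comfortable glancing at a little code, here is a deliberately tiny sketch that ties features, weights and activations together: one layer of “neurons” learns to guess dog or cat from four made-up yes/no features. Everything in it (the features, the animals, the training numbers) is invented for illustration, and real systems are far larger and more sophisticated, but the moving parts are the same.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four invented yes/no features per animal:
# [floppy ears, 20 or more kgs, retractable claws, slit eyes]
X = np.array([[1, 1, 0, 0],   # dog
              [1, 0, 0, 0],   # dog
              [1, 1, 0, 1],   # dog
              [0, 0, 1, 1],   # cat
              [0, 0, 1, 0],   # cat
              [0, 1, 1, 1]])  # cat
y = np.array([1, 1, 1, 0, 0, 0])   # 1 = dog, 0 = cat

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

hidden_w = rng.normal(size=(4, 3))   # one layer of 3 neurons, each sees all 4 features
output_w = rng.normal(size=3)        # the output combines the neurons' activations

for epoch in range(5000):                      # each full pass over the samples is an epoch
    activations = sigmoid(X @ hidden_w)        # how "excited" each neuron is per sample
    guess = sigmoid(activations @ output_w)    # probability the output assigns to "dog"
    error = guess - y
    # Nudge every weight a little so the next guess is a little better (gradient descent)
    delta_out = error * guess * (1 - guess)
    grad_out = activations.T @ delta_out
    grad_hidden = X.T @ (np.outer(delta_out, output_w) * activations * (1 - activations))
    output_w -= 0.5 * grad_out
    hidden_w -= 0.5 * grad_hidden

print(np.round(sigmoid(sigmoid(X @ hidden_w) @ output_w), 2))  # should end up close to [1 1 1 0 0 0]
```

Each neuron holds a weight for every feature, its activation is just a number between 0 and 1, and training is nothing more than nudging the weights a little after every wrong guess.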

If you had played a game of “20 questions” but it was instead “2 questions”, you would think the game was rigged. This is because you realise that the more information you have, the better your guess will be. The same is true for machine learning. The more samples we have, and the more features per sample, the better the result. In order to distinguish which animal it is (or, in a business setting, which clients are likely to lapse, or not pay us) we need enough samples with enough important features. Note here that we need important features. And much like the “does it have a head?” analogy earlier, it is not always a good idea for us humans to decide what is a good or bad feature. It may well be that “hair colour”, for example, is a good indicator of whether or not a person will subscribe to our service! (Maybe our marketing campaigns are subconsciously influencing a certain segment?)

The last part of this mini article I want to touch on is the importance of good data. Imagine we are playing “20 questions” with a small child. We ask “does it fly?” and they say “yes”. So we guess “eagle”, “sparrow”, “dragonfly” and so on. When we lose, the child says “Ostrich! Haha”. The child just assumed that since it has wings, it flies. This is an example of how poor information can cause the system to learn the wrong things. In a business setting, if our call centre operators routinely misclassify some people as, say, old or young by the sound of their voice or the language they use, without asking for an age, our data would be incorrect. This incorrect data will then form features and weights in the system, and our results could be invalid. It is not that the machine made an error – it made the correct decisions based on the data it had.

In this article we discussed features, weights and activations, and described basic machine learning in the context of how humans learn. If you are left feeling “this is too simple”, then fear not. While the mechanics of the process are complex, involving statistics and calculus, the “how it does it in simple terms” is accurate. Understanding the basics of machine learning without the complexity of the mathematics is no harder than understanding how a child knows what animal she sees while knowing nothing about neuroscience!

 


Welcome

Welcome!

In this series of blog posts I will introduce some key concepts in technology that interest me in a gentle, non-technical manner.

My aim is to explain concepts in everyday terms. Learning begins with an intuitive feeling that we understand a topic, and this is achieved by viewing new material through the lens of what we already know. It is not important to know all the technical terms to begin to understand a topic, that can come later. The joy in learning is to see new ideas grow from existing knowledge.

I am the CIO of a group of Credit Bureaus operating out of South Africa, with clients worldwide. You can view our company profile at Bureau House Group.

You can contact me directly at alaincraven@gmail.com for specific questions and comments around my articles.

I am also contactable at alain@cpbonline.co.za if you want us to assist in your own data modelling / machine learning or analytical projects.

“You can know the name of a bird in all the languages of the world, but when you’re finished, you’ll know absolutely nothing whatever about the bird… So let’s look at the bird and see what it’s doing — that’s what counts.”
― Richard Feynman
