Whose Data Breach Is It Anyway?

Welcome to Whose Data Breach Is It Anyway?, the blog post where everything is worrying and traditional security does not matter. That’s right, traditional security is like going home and expecting no one from work to call you. (With apologies to Drew Carey.)

Before we can answer that question, let’s start with defining a few terms.

Traditionally, cyber security is about using firewalls, anti-virus and anti-malware tools, and relying on the operating system and software you use to keep you safe, to update itself and to alert you to issues. Sure, there is a LOT more that can be done, but the vast majority of companies and users rely on just these. To achieve this, all that organisations need to do is create a policy and a set of controls/tasks that deal with security updates – and practice this daily.

A more robust approach is to trust no one (what IT people call “zero trust” or “zero trust networks”). We should have “zero trust” in vendors to push updates, and in those updates installing correctly. We should have “zero trust” that all staff have correctly checked for updates (I will be the first to admit that I have missed updates before – no one is immune and everyone needs a second set of eyes!). Zero trust also means that we need to be careful about who has access to what, and for what purpose. This takes a lot of effort and requires periodic reviews of access levels. And, lastly, zero trust also means we need to be VERY careful about what we install – even if it’s from a known and respected software vendor (after all, zero trust means zero trust…).

Of course we cannot double-check updates for bad code from huge organizations like Microsoft – none of us have the resources at hand to do that. However, we should be aware that all updates and installations carry a degree of risk – supply chain attacks are on the increase (see, for example, Kaseya in July 2021).

Physical checks of user equipment to ensure patches are correctly installed are a must – non-IT staff may not be sufficiently trained to recognise issues and confirm patches are correct. Doing this is a cultural mind shift – it brings security concerns right to the desk of every staff member and makes security top of mind in the organisation. I won’t lie to you – this is a difficult transition. Business people (and I can understand their stance) put clients and business first. What we need to do is ensure that “business people” understand that “security first” is also “business first”. Data breaches are more and more common, and more and more data is being compromised in each breach. As of the end of September 2021, the number of data breaches for the year already exceeded the total for the entirety of 2020!

Now that we have discussed what should be done, and the limitations imposed on us in ensuring things are done regularly and without risk, let’s return to the topic of this blog – Whose Data Breach Is It Anyway? The simple answer is “the company that loses control of the data is responsible (or accountable) for the breach”. But if we simply accept the short answer then this blog post is over and my work here is done! None of what follows is a legal defense, nor is it advice of any sort – I simply want to explore a topic that has been burdening my mind for some time, so that it gets more attention.

Let’s examine some hypothetical / fictional scenarios.

Scenario 1 is when respected, pervasive and well-built software leads to an exploit. Here we delve into a fictional world where I want to explore “culpability”. What happens if a security flaw in our operating system, web browser or suite of office productivity tools – as yet unknown to the vendor – compromises our work computer, and data is leaked? Are we responsible for the data loss suffered in this event? A clear argument can be made that you are merely a victim here – the situation is akin to your brakes failing on a brand new vehicle. Are you responsible for an accident? For our hypothetical data breach, “yes”, GDPR / POPI will hold you responsible, but what could you have done to limit the data breach? Can you be accused of being negligent? The factors involved include:

  1. How much data was available on your device?
  2. Was the data actively being used, or was it old data you did not archive?
  3. Were all possible security measures up to date (patches, antivirus, antimalware)?
  4. Can you demonstrate your policies and controls for data retention and security updates?

Much like our motor vehicle accident example, culpability can be determined by the level of negligence and the amount of personal responsibility involved. Regardless of the defective braking system, if someone were DUI at the time of the accident they would find it difficult to get a sympathetic ear from the judge! Likewise, having sufficient personal accountability and sufficient additional layers of protection in place in the event of “defective software” would go a long way towards appeasing regulators in the event of a data breach.

Some actions we can take to limit data breaches would be:

  1. Limit the data you have access to at any point in time. Ensure that as far as possible you only ever have direct access to data and files you really need at that point. If you have access to 1000s of data files in real time, then so does any hacker or any malware that has access to your computer!
  2. Routinely archive and clean up data that is not in use – automated tools to do this exist. Regulators do not look kindly on data breaches that contain many years of “old data” that should have been archived!
  3. Update your computer every week – or more often. Check routinely that patches are all up to date by physically running a manual check. Do not rely on automated patches installing – check them (zero trust!).
  4. Have solid policies and controls in place to double check compliance and record which files have PII and to check that “old files” are archived securely.
  5. If you are in charge of your organisation’s cyber security or data security, ensure the above is done by everyone – check, check, check and record the evidence of having done that!

In this way – even if the breach is a result of a third party vulnerability (the tech equivalent of defective brakes) – you can demonstrate sound security hygiene, sound adherence to data retention regulations, and limit the amount of data a cyber criminal has access to.

So, back to “Whose Data Breach Is It Anyway?” and Scenario 2. At the beginning of July 2021 a software provider (Kaseya) was hacked and a vulnerability was packaged into their software (for the curious, this is a “supply chain attack”). This software was subsequently released and hundreds of companies were affected. Companies installing the new software update were victims here – they installed updates to software they had every right to expect to be safe and had no reason to suspect of being malicious. The hypothetical question here is: if significant amounts of PII had been leaked, who is responsible? Is it Kaseya or is it their client?

Would the situation change if the security incident were not as a result of the injection of malicious code but rather due to a software defect? Imagine a hypothetical company (we can call them ABC Software) that releases an update to their product that accidentally creates a vulnerability that hackers exploited. Again, much like Kaseya, the client had every right to expect the software to be safe and had no reason to suspect anything malicious. If PII were leaked in this event, who is responsible?

There are even a number of zero-click malware exploits in the wild (“zero-click” simply means that malware can be installed even if you do not click a link or open a malicious executable – your device simply needs to receive the malware; Pegasus, for example, recently hit the news almost world-wide). I am not going to answer my own hypothetical – this is a thought experiment in how we perceive data breaches and liability. However, once again, as in Scenario 1 above, the scale of the damage can be contained by limiting what can be breached in an attack.

So, regardless of “Whose Data Breach is it Anyway?” it’s important that we know:

  1. The organisation that breached the data is responsible
  2. Software vendors have an obligation to ensure their software is free of defects, that defects are corrected timeously and that their supply chain is secure. However, at least as of now, software vendors are not being held accountable to regulators for data breaches at their customers which result from their (the vendor’s) flaws.
  3. The best way to protect yourself is by limiting what can be breached, by deleting files you no longer need, encrypting what cannot be archived and / or deleted, and by continuously monitoring for data that should not be kept available.

In summary, when it comes to software product defects I will be interested to see whether society and the law adopt a “product liability” stance where manufacturers are either subject to “strict liability” or we have “joint and several liability” – especially where such a defect leads directly to a data breach. (If a lawyer out there wants to do a mini article on how that could work with software, I would happily link or post with full credit!)

Keep safe!

Cyber Security: An Asymmetrical War

This is the first of an on-going series where we will explore basic cyber security topics.

Cyber security is a war – a war that the good guys are losing and the bad guys are winning. Why is that?

First let’s look at the definition of “asymmetrical warfare” as I use it in this article. Asymmetrical warfare is when a war is fought where one side has an overwhelming advantage (money, resources, weaponry) and/or where the rules for the two sides differ. The “War on Terror” is asymmetrical for both reasons. The “good guys” have overwhelming weaponry at their disposal – but in almost every case are unable to use it! The “bad guys” on the other hand have less firepower but play by their own rules – international condemnation does not faze them.

In Cyber Warfare the “bad guys” do not need to play by any rules – they are a law unto themselves. They can bribe and blackmail staff within organisations to deploy malware or steal data – however organisations are restricted in terms of ensuring their staff remain honest (there are very few organisations that would be allowed to run routine polygraphs on their staff!).

Cyber criminals also only need to find a single vulnerability to exploit – and only once. Organisations have to expend enormous resources maintaining their defenses and need to repel every attack – every time. In other words, organisations have to contend with the asymmetry of resource allocation – we need “a man on every wall, every hour of every day” while a criminal only needs to look out for the one wall without a guard, for a single moment in time. Every James Bond film and John Wayne western uses this trope – the lone individual sneaking into a heavily guarded, fortified area. The femme fatale who distracts the guard with a wink and a smile while Bond sneaks behind him. It’s a well-worn trope for a reason – it works.

The first of our asymmetries in the war on cyber crime is budgets and objectives. This is an asymmetry of resources – more pointedly, of the allocation of limited resources towards cyber security. Most legitimate organisations attempt to create as much defense for as little spend as possible. This is in sharp contrast to cyber crime organisations, whose key objective is to breach systems for as much gain as possible. A simple example may make this clearer:

Company A is a mid-sized enterprise with an ecommerce website that collects large amounts of user data, including payment information. They are vulnerable to cyber threats and also hold data that a cyber criminal would be interested in stealing.

Hacker B is a cyber crime organisation that steals data and then encrypts systems for a ransom. They are well known and have been responsible for a number of high profile data breaches in the past 12 months.

Company A has a fixed budget for expenses (as do we all!) and views cyber security as just one expense item – and not one that generates revenue. Company A could, for example, expand their line of products, improve logistics to ship faster, and rebrand their web site to keep it fresh and appealing in an effort to attract new customers. Since cyber security has no immediate payoff, the bulk of the budget goes to expanding product lines, logistics improvements and web site rebranding, while cyber security receives far too little budget to provide adequate protection. Company A has stacked their budget towards meeting their objective – to be a market leader with the slickest web site and fastest shipping times. Cyber security meets no objective for them, and reduces the budget for items that do.

Meanwhile, Hacker B also has a number of expenses to budget for (remember that hacking groups are often fully fledged organisations run as for-profit companies). They need to budget for staff (more hackers) and office space. They often also have web sites (usually a forum on the Dark Web). They decide to spend the bulk of their budget on hackers – foregoing expensive web sites (their web sites are usually of a low quality – it is not important to them). They also have no real need for compliance and regulatory expenses like GDPR or POPI. For Hacker B the objective is simple: compromise as many systems as possible to drive up revenue. Spending money on cyber security (as the aggressor) is key to their objective, and they gladly spend it.

This brings us to our first asymmetry: Company A sees cyber security as an expense that is begrudgingly paid and that fails to meet the organisation’s objective, while Hacker B sees cyber security as an investment that generates revenue and fulfils the organisation’s objective. It is not too difficult to see that this places legitimate organisations – like Company A – who are defending their networks at a severe disadvantage.

The second asymmetry is in terms of “acceptable practices”. Corporations acting within the law are required to abide by regulations governing their market sector, they need to comply with HR requirements, and they need to be seen by their clients as ethical and fair in their dealings. And that is all as it should be – the “good guys” need to be good. However, the “bad guys” have none of those constraints. Let’s go again to a fabricated example:

Company A (our ecommerce site from earlier) employs 20 staff – mostly developers and sales people – at their Head Office, and operates shipping warehouses in 3 cities that each employ between 20 and 30 staff to fulfil orders – many of these are minimum wage earners, often employed for short periods only. In order for the organisation to protect itself from cyber threats and to meet regulatory requirements, it must ensure that each of these employees is trained in cyber security, physical security and information protection. Even the staff in the shipping department need to know the basics, as they deal with customer information on shipping labels, for example. In addition, each of these staff members is a possible vulnerability – they can be tricked with phishing scams, they can be bribed, and they can be blackmailed. This problem is more complex than it may at first appear. Imagine the logistical task of meeting staffing requirements for Black Friday, where a large number of part-time staff are employed for a short period of time. In many cases (but by no means all), shortcuts are taken in the employment process in deference to expediency. We can all hear the words “there is no point spending a long time training the temps as they won’t be here that long”. A tacit acknowledgment that cyber security incidents are only caused by long-term staff (tongue in cheek!)?

Hacker B on the other hand loves chaos. They thrive on shortcuts and business expediency, as these create opportunities for them. Cyber crime rocketed during the pandemic lockdown as companies struggled to link their work-from-home staff to corporate networks – this “new normal” was paid for with short-cuts and expediency. Severe financial losses borne by staff on short time, seasonal workers who suddenly had no jobs, and spouses of staff who lost their jobs created a flood of opportunity for hackers looking to bribe insiders. They could – with impunity – attempt to bribe staff to inject malware onto systems or to steal data. We are well advised to remember that “there are only nine meals between mankind and anarchy”.

This asymmetry is again in favor of the cyber criminals. Untrained staff are not aware of cyber threats and can be easily compromised. Temporary workers may not recognise their managers – especially in large organisations, and especially on a voice call – and can be tricked into performing actions that compromise an organisation. Lastly, over-indebted people are susceptible to bribery. Organisations are often hamstrung in their efforts to ensure staff remain honest.

The last asymmetry I want to discuss in this article is that of the sheer amount of surface area the “good guys” must protect versus the single failed control that the “bad guys” are looking for. This can be summed up as: The good guys need to defeat every attack, every time, every day. The bad guys just need to catch you on the one day you let your guard down. Let’s return to our Company A once again and see how asymmetrical the attack surface is.

Company A, as we know, runs an ecommerce web site. As such the web site is hosted on the public internet and anyone in the world can access it at any time of the day. This means that if they are operating out of London, their developers are asleep when Los Angeles is eating lunch. Company A would need to employ 24/7 staff to maintain systems, action outages and react to security events. This is difficult to achieve logistically and comes at an expense. Do you employ nightshift staff in London? Do you open an office in Chicago and employ developers, security specialists and managers? Is there sufficient justification to swing the Board to approve these expenses when the Board may see them as unnecessary? In addition, Company A has numerous systems scattered across the country – from Head Office, to distribution centers, to remote staff working at home. Their senior staff have work phones with 24/7 access to the corporate email server, and most middle and senior management have laptops that they travel with and carry home every day. Each of these opens another opportunity for a hacker, and requires additional resources (time and money) to maintain, update and protect. And to top it all off, it’s a bank holiday today and Microsoft has just released an urgent patch for a critical vulnerability in its operating system. Company A is stretched to its limits.

Meanwhile, over at Hacker B, the senior hacker has just arrived at work after a leisurely breakfast. He sits at his desk with a cup of coffee and launches an automated script that scans Company A’s infrastructure looking for a vulnerability. The entire process might take a few minutes. All the hacker is looking for is an opening. His colleague comes in with a doughnut and asks if he has seen the new Microsoft vulnerability yet. Only an hour later, they have already crafted an exploit targeting the new vulnerability and launched it against Company A. And they are in! All it took was one middle manager who had not yet updated his laptop because he “had an urgent meeting that could not wait”.

In the above scenario – fictitious but based in reality – Company A simply stood no chance. There was simply no time to patch what could be hundreds of machines across multiple time zones before the hackers had the chance to breach just one. All Company A’s cyber security team could do to prepare for this was build a defense-in-depth, limit the damage (something we will discuss in a later article) and pray for the best.

We are at war with cyber criminals – and they are winning. And they will continue to win while we fight an asymmetric war. We, and our business partners who are not in IT, need to keep educating ourselves every day about the threats and the measures that can be put in place to help level the playing field.

Deciding what to model

(This is part 4 of a series)

Deciding what to model and what not to model can be challenging. When presented with a new technology it is often difficult not to use it – we all enjoy new challenges and new toys!

One way is to first decide what it is we are trying to solve – what is the business challenge we need to fix or what is the opportunity we are presented with. In other words we are asking ourselves at this stage, “should we build a solution for a given opportunity or problem?” If the answer is “yes”, then we have the opportunity to create a solution.

Next we need to define our problem. What is the nature of the problem, can it be defined in terms of our outcomes, and how accurate should our predictions be? Let’s explore these:

  1. What is the nature of the problem? Is the problem translatable into a set of steps or is it nebulous? We require, at this stage, a problem with a set of concrete goals. We want to predict the weight of people given their heights. We want to know which of our customers might be interested in our new product range. We want to know which transactions are possibly fraudulent. And so on. Our problem is not of the nature: “how do we solve world hunger?”
  2. Once we have our prediction (weight, target customers, fraudulent transactions), how do we represent it? Is it a continuous value, as with weights, or a discrete data point, as with yes/no for fraudulent transactions?
  3. Lastly, what range of error would we be happy with? For weight we may be happy with ±5kg; for target customers we may be happy with 60% (for every 100 people we select, 60 will buy the new product); and for fraudulent transactions we may be happy with a 10% error rate (better to double-check a few more transactions than to lose money to fraud).
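To make the error-tolerance idea concrete, here is a minimal Python sketch (all numbers invented for illustration) that checks what fraction of a model’s predictions fall within the band we decided was acceptable:

```python
# Hypothetical tolerance check: do our weight predictions fall
# within the +-5kg band we decided was acceptable?
def within_tolerance(predictions, actuals, tolerance):
    """Fraction of predictions within +-tolerance of the actual value."""
    hits = sum(1 for p, a in zip(predictions, actuals) if abs(p - a) <= tolerance)
    return hits / len(predictions)

predicted = [78, 82, 91, 70, 85]  # model output (kg)
actual = [80, 80, 95, 72, 79]     # measured weights (kg)

print(within_tolerance(predicted, actual, 5))  # 0.8 -> 4 of 5 within +-5kg
```

If 80% within tolerance is not good enough for the business, that is a conversation to have before building the model, not after.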


The last part is vital – as we learned in the last post, we cannot be 100% accurate all the time; with difficult prediction problems we will have errors. We must decide how large an error is acceptable and on which side we will err. Remember that at this stage we are using a machine learning model to predict values in an uncertain area, or based on uncertain input data. If we had a clear, unambiguous environment we would not need a learning algorithm!

Why Models are Random

In machine learning applications one will often come across the term “random” (or in the jargon of ML, stochastic). What does it mean when we speak about random models (stochastic models) and does the “randomness” of models mean they are not reliable?

Let us begin with a simple analogy:

One day you wake up in the middle of a forest – trees reach high above you, with a thick, viney canopy blocking out the sky. In every direction you see nothing but trees and vines. Panicking slightly, you realise you are lost. What to do?

In the distance you hear what you think is the sound of a highway – if you can get to that highway you will be rescued! The problem facing you is that the sound is very faint and with the breeze rustling the canopy, birds tweeting and the occasional howling wolf (!) you are not quite sure which way to go (your environment is “noisy” in the parlance of machine learning).

You listen intently and decide the highway is in a certain direction and you head off – after 20 steps you pause and re-evaluate. Can you hear any better from here? Which way is the highway now? You decide on a direction and head off – another 20 steps.

Stop. Evaluate. Take 20 steps. Repeat (perhaps hundreds of times).

Eventually you find yourself on the highway – right near mile marker 35. Well done!

Your path through the forest was random (stochastic), with each step taking you closer to your target on average but with some steps taking you away. Imagine in your mind’s eye the path you took, wandering across the forest floor in a “drunken stagger”. Sometimes (from the vantage point of an eagle, and with hindsight), you headed directly to the highway (so close!) and other times you moved away (oh no!). But, on average, you got closer and closer until you reached your goal.

If you were to repeat this process many, many times, each path would likely differ, and you would exit at mile marker 34, 35, 36 and so on. But in all cases you would end up at the highway. This random walk, with an outcome in a narrow band of acceptable goals (reach the highway), is a stochastic process. We could repeat the exercise hundreds of times, carefully writing down each literal step we took, then decide which run was quickest, or took the fewest steps (or was most scenic!), and publish a guide, Lost in the Woods, for anyone in those exact circumstances. This would be a “model”.
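The forest walk can be simulated in a few lines of Python (the step size, noise level and distances are invented for illustration). Each run takes a different path, but all of them end up at the highway:

```python
import random

def walk_to_highway(start=0.0, highway=100.0, step=1.0, seed=None):
    """Noisy walk: each step heads towards the highway on average,
    but noise sometimes pushes us the wrong way."""
    rng = random.Random(seed)
    position, steps = start, 0
    while abs(position - highway) > step:
        direction = 1 if position < highway else -1
        # 70% of steps go the right way; 30% go the wrong way ("noisy" hearing)
        position += step * direction if rng.random() < 0.7 else -step * direction
        steps += 1
    return position, steps

for seed in (1, 2, 3):
    pos, n = walk_to_highway(seed=seed)
    print(f"run {seed}: reached the highway at {pos:.0f} after {n} steps")
```

Run it with different seeds: the paths (and step counts) differ, but every run lands within one step of the highway – a different “mile marker” each time.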

Machine Learning models follow the same basic stochastic approach. They start off lost. They try and learn to predict an outcome by taking small steps (gradients) in the direction that leads to their goal (finding local or global minima) and when they reach a final solution, that solution is just one of many possible solutions, all “good enough” (getting to the highway solved your problem – it was not your aim to reach mile marker X in Y number of steps and Z minutes).

In short, the random nature of machine learning (e.g. stochastic gradient descent) is an approach to finding local or global minima so as to approximate the best solution in a noisy, non-deterministic environment. It is a fundamental concept underpinning much of machine learning and should not concern you – embrace it.
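As a small sketch of the same idea in code, below we minimise a simple function, f(w) = (w − 3)², using gradients corrupted by random noise – a stand-in for stochastic gradient descent on real data. The learning rate, noise level and step count are arbitrary choices for illustration:

```python
import random

def noisy_sgd(lr=0.1, steps=200, seed=0):
    """Minimise f(w) = (w - 3)**2 using noisy gradient estimates.
    Each run ends near the minimum at w = 3, but rarely exactly on it."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3) + rng.gauss(0, 0.5)  # true gradient plus noise
        w -= lr * grad
    return w

print([round(noisy_sgd(seed=s), 2) for s in range(3)])  # three runs, all close to 3
```

Each seed gives a slightly different answer, and all of them are “good enough” – exactly the mile-marker situation from the forest.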


Your model won’t explain everything

And that’s OK!

(This is part 3 of a series – please read part 1 and 2 first)

There is a misconception that computers don’t make mistakes (which conflicts with the “computer error” excuse humans use to explain their own errors!). It is true that algorithms are deterministic – they will do exactly what they are instructed to do. 2+2 always equals 4.

However, machine learning does not give an absolute answer – it provides a prediction. It is an implementation of what is, essentially, a statistical model. The answer (or rather, prediction) it returns for some input is a statistically derived best estimate – not the absolute truth. The statistical model is calculated accurately – the computer cannot make an error there. It may just be that the model does not sufficiently explain the real-world problem it is trying to solve.

Let’s unpack that with an example. Suppose you have measured the weights and heights of 100 men and determined that the average mass in kilograms for men 1.8m tall is 80kg. You will have seen in your measurement activities that some men are heavier and some are lighter (but for the sake of this exercise all men are the same height). Now, you are taken to a room at the front of which is a curtain. You are told that behind the curtain are 10 men, all 1.8m tall and you are asked to determine their individual weight. What does the 1st man weigh? The 2nd?

Your best chance of being “mostly correct, on average” is to say each man weighs 80kg (the average for all men you have tested with height 1.8m). If this does not immediately make sense to you, spend some time thinking it through (or drop me an email), as this is an important point.
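For those who prefer to see it rather than derive it: among all constant guesses, the sample mean minimises the mean squared error. A tiny Python check, with invented weights:

```python
# Weights (kg) for a sample of men, all 1.8m tall (made-up numbers)
weights = [72, 95, 80, 78, 120, 76, 81, 84, 79, 77]
mean = sum(weights) / len(weights)

def mse(guess):
    """Mean squared error if we guess the same value for every man."""
    return sum((w - guess) ** 2 for w in weights) / len(weights)

print(round(mean, 1))                                        # 84.2
print(mse(mean) < mse(mean - 1), mse(mean) < mse(mean + 1))  # True True
```

Nudging the guess in either direction away from the mean always increases the average squared error, which is why “guess the average” is the best you can do with no other information.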

If the curtain is now dropped away, and you get to see each man for the first time, you will immediately start thinking “Man number 1 looks about right, but man number 2 is heavier than normal, man number 3 looks like a power lifter – must be easily 120kg” and so on. Does this make your ability to take past data and use it for future predictions wrong?

No. You have used all the data you had at hand and applied it correctly to the unknown problem. You performed the task correctly – what was lacking was detail in your data set. You did not have enough information about each man in the sample set, or about the 10 new men, to make an accurate prediction. Those readers who have studied statistics may want to speak about population means, standard deviations and confidence levels – and those would factor in. But for the purposes of this analogy I think the point is easier to explain without additional concepts (feel free to comment below).

Clearly, our model can be improved if we add additional features to our body of knowledge. We can add Body Mass Index, cholesterol levels, fitness and so on to each of the original 100 men sampled, and then label the 10 unknown men with the same characteristics. This will most likely give us different ranges (the greater the fitness level, the lower the mass, for example – but we know this is not true in all cases: a marathon runner weighs less than a body builder, yet they may be equally fit). So we may want to add “maximum bench press” to the list of features.

It is here that things become interesting to data scientists. We have a balancing act between what to add and what not to add. Given infinite resources (money, time, people, computing power) we could add everything to a model and not concern ourselves with what to include. But we do not have infinite resources. We may not have BMI information and would need to run laboratory tests on all the men. This could take weeks and cost more than we have available.

It can also be difficult to determine beforehand what is predictive and what is not. A predictive variable is one which helps us determine the (correct) prediction. It may turn out that cholesterol levels have no bearing on the weight of a person in all but the most extreme outliers. It may be, however, that cholesterol levels are highly predictive. It may be that we have not recognised a certain feature that would help us (and we may or may not have access to the data for that feature). Maybe the person’s income level is a predictor of mass?
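One cheap screening step (a sketch, not a substitute for proper feature selection) is to compute the correlation between each candidate feature and the target on a small pilot sample before paying to collect it at scale. All values below are made up for illustration:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

weight = [70, 85, 95, 60, 78, 90]             # target (kg)
bench_press = [60, 90, 110, 40, 70, 100]      # tracks weight closely
cholesterol = [5.1, 4.2, 5.0, 4.9, 4.4, 5.2]  # essentially unrelated

print(round(pearson(weight, bench_press), 2))  # strong positive correlation
print(round(pearson(weight, cholesterol), 2))  # near zero
```

Correlation misses non-linear relationships and interactions between features, so treat a low number as “probably not worth collecting first”, not as proof of irrelevance.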

Arguing for the resources to collect data on additional features, without knowing beforehand what will be predictive and what will not, is a hard sell. We can spend months and a bucket of cash collecting cholesterol levels for each of our samples, only for it to add no value to the final prediction.

Statistics is essentially pure. If we are doing a study on mass for health studies in a country and we collect cholesterol levels for one hundred thousand people and determine that cholesterol has no impact on body mass, then that information is useful to health organisations. They now know that if they want to reduce obesity in a city they need not factor in cholesterol levels. But in our case of a machine learning model, the unused feature (and the “wasted” resources to collect the data) has a negative impact on the bottom line. And business wants to see the bottom line improve.

Models are not perfect – nor can they ever be. But given enough resources and commitment from business, and provided the additional features can be collected, predictive models can do an excellent job. It is finding the balance between expenses and return that remains a challenge – and we will look at that in a future article.


Choosing data for a machine learning problem

(This is part 2 of a series – please read part 1 first)

How does one go about choosing data for a machine learning problem? What does it even mean to choose data? In this article we will discuss data types, what data to choose and why (paradoxically!) you should not choose anything!

If you have not read the previous blog in this series and you are new to the concepts of data science and machine learning, please read that first as it provides an introduction to key concepts we will use in this article.

Let us begin with a hypothetical example that is small enough and relates well enough to our everyday human knowledge and experience that it can serve as a learning aide.

The problem we will model is to determine what correlation (a connection between two or more things) exists between people’s height and their weight. We have an instinctive feeling for what this relationship is: the taller a person is, the more likely they are to weigh more. And we also know that this is not always true. We can see the problem in our mind’s eye, and we can see the challenges our assumptions create.
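Before collecting any real data, we can make the idea of correlation concrete. The sketch below (with invented height and weight figures, purely for illustration) computes the Pearson correlation coefficient, a number between -1 and 1 that measures how strongly two variables move together:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # How the two variables vary together...
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # ...scaled by how much each varies on its own
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical heights (cm) and weights (kg) -- made up for this example
heights = [150, 160, 165, 170, 175, 180, 185, 190]
weights = [52, 58, 63, 68, 72, 78, 84, 90]

print(round(pearson_r(heights, weights), 3))
```

A result close to 1 supports our instinct that height and weight rise together; a result near 0 would mean no linear relationship at all.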

The example above is a simple linear regression problem. When we collect data on the heights and weights of people we can create a scatter graph like the first one below. The numbers along the bottom are the X axis and in this case represent people’s heights. The numbers up the left-hand side are the Y axis and in this case represent people’s weights.

We can see that taller people in general weigh more than shorter people, but this does not always hold true. We can see one person who weighs around 100kg yet has a height of only 170cm. This is called an outlier.

[Figure: scatter plot of height (X axis) against weight (Y axis)]

 

If we now fit a linear regression line to the data, we will get the following graph.

[Figure: the same height/weight scatter plot with a fitted linear regression line]

The linear regression line tells us the expected weight for every height. So given a certain height, we can predict the weight of the person. As we can see from the graph this is not an exact science – in fact only three points lie exactly on the line. The vertical distance between a point and the line is called the error. Clearly the error for our outlier is very large, while most other data points have a lower error.

A very simple form of machine learning would calculate the line that best fits the available data. The best-fit line minimizes the combined size of all the errors. While this may sound complicated it is a fairly easy exercise. All we need to do (in essence, and skipping the math and statistics) is measure how far each data point is from the line, square those distances so that errors above and below the line do not cancel out, and add them up. For the curious, a common way to do this is called Mean Squared Error.
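For readers who want to see this concretely, here is a minimal sketch in Python. The height and weight figures are made up (and include one deliberate outlier), and we use the closed-form least-squares formula rather than an iterative learner:

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept

def mean_squared_error(xs, ys, slope, intercept):
    """Average of the squared distances between each point and the line."""
    return sum((y - (slope * x + intercept)) ** 2
               for x, y in zip(xs, ys)) / len(xs)

heights = [150, 160, 165, 170, 175, 180, 185, 190]  # cm, invented
weights = [52, 58, 63, 68, 100, 78, 84, 90]         # kg, one deliberate outlier

slope, intercept = fit_line(heights, weights)
print("slope:", round(slope, 2), "MSE:", round(mean_squared_error(heights, weights, slope, intercept), 1))
```

The outlier inflates the MSE considerably; by the definition of least squares, no other straight line through this data could produce a lower value.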

Why do we have an outlier in the graph above, and what can we do to eliminate these or use them to improve our predictive capability? This is where machine learning triumphs! We can add additional features to our data sets and use these additional features to improve our prediction. Unfortunately, when we do this, we need to say good-bye to the graphs above and start using our imaginations. We will need to add (many) additional dimensions to our dataset. We can create a 3D graph using visual tricks, but we cannot produce 4D graphs or graphs with much higher dimensions.

Let us begin by taking a walk through an imaginary city – we will pick a cosmopolitan city made up of a diverse range of cultures, ages, occupations and interests. We can choose New York or London, or any other large, vibrant city. What would you expect to find in this city? You will find health enthusiasts, comic book aficionados, vegans, fast food addicts, doctors, lawyers and personal trainers. You will have Americans, Swiss and Chinese. You will of course have people of every gender. And you would expect each of these people to have a certain height (mostly a result of genetics) and a certain weight (a combination of genetics and lifestyle). Which of these characteristics (features) are important in determining the height/weight correlation? Spend some time in this imaginary place and see what you can determine. This is an important exercise and will assist you in future data selection processes.

In your exploration of the city above did you make generalisations? Did you expect certain nationalities to be heavier on average? Did you perhaps expect a personal trainer to be fitter and leaner than an office worker? What effect on weight did you attribute to diet? With your vast (but fallible!) knowledge of humans acquired over decades of life you will have reached certain conclusions. If you expected an office worker to weigh more than a professional athlete, you may well be incorrect if the office worker runs ultra-marathons as a hobby. However, you may be thinking, how was I to know she runs ultra-marathons? That information was not made available to me! This is the same problem machine learning has – it can only use features that you have provided to it. And it may not always be obvious up-front what effect certain features will have on predictive capabilities.

The advantage of machine learning is that we need not determine the features to use. In fact, we should not! We should not predetermine what features are important as this limits the machine from learning novel relationships. We simply add everything we have to the model and allow it to find the relationships and dependencies for us! And machines do not have any issue with multi-dimensional analysis (other than computing power and memory). We can (and should) at a later stage remove some features – but we can investigate that in a later article.

(Some readers may wonder about the effects of multicollinearity, but we can tackle that at a later stage.)

The more features we have, the more we mitigate the risk of an algorithm suffering from incorrect (or limited) predictive ability owing to missing features. As we provide additional features to a model, its predictive capabilities should improve. Since a machine learning algorithm is mathematically derived it does not make errors in prediction – it makes the correct predictions based on the information it has. It is the underlying datasets that may be insufficient to the task at hand.

One final point that needs mentioning is the importance of sample size. Sample size is the number of data points in the model. In our imaginary city it is the number of people we can see, and for whom we have detailed information. If we only have one person from Iceland, and she is a data scientist who enjoys reading Shakespeare on the weekends, that tells us very little about Icelanders, data scientists or people who enjoy Shakespeare. There simply is not enough information to determine whether she is the norm or an outlier. The more examples of “data scientists” we have, the more we can predict about the weight of a data scientist of a certain height. And similarly for Icelanders and Shakespeare enthusiasts. Our datasets need to contain a representative set of samples for as many subjects as we can manage.
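We can simulate the effect of sample size. In this sketch the “true” average weight of our hypothetical population is fixed at 75kg (an assumption purely for illustration), and we compare how far a single sample and a sample of 500 stray from that truth:

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: weights drawn around a true mean of 75 kg
population = [random.gauss(75, 10) for _ in range(10_000)]

one_person = population[:1]     # like our single Icelandic data scientist
many_people = population[:500]  # a more representative sample

print("error with 1 sample:   ", abs(statistics.mean(one_person) - 75))
print("error with 500 samples:", abs(statistics.mean(many_people) - 75))
```

A single sample may land anywhere in the distribution, so its “average” can be badly wrong; the larger sample settles close to the true mean. This is exactly why one Icelander tells us almost nothing about Icelanders.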

In conclusion, machine learning algorithms require as much data, with as many features, as possible to learn the best predictive models. It may not always be obvious what data will be useful and what data will not. What we need to know is what problem needs to be solved and what datasets we have available to us (or can acquire through credible external data partners). We can then allow the machine to learn the best fit for the problem area and provide us with predictions.

 

Demystifying AI for Non-Data Scientists

I am often asked to explain complex technical concepts to people who are not technical. These people are often experts in their own field and are either generally curious as to the subject matter or require a non-technical grasp of the subject matter for decision making purposes.

This article is aimed at non-technical people to assist in a basic understanding of machine learning and provide insight into the process. Data scientists may argue some of the points or analogies I make, or feel that some parts are over-simplified.

I always find the best way to explain any subject is through an analogy, woven into a story for the more complex subjects. We build toward an understanding of the subject matter through an exploration of material the student already knows.

And we can apply this same concept to AI (or machine learning).

Before we get into the nuts and bolts of the topic, let us first take a simple analogy and work from there. If I were to program a robotic arm to make a cup of tea using conventional programming languages the task would be tedious and time-consuming. Now, imagine explaining the same task to a five-year-old. You would begin by saying “Take a tea cup from the cupboard and place it next to the kettle. Put some water in the kettle and turn it on. When it is boiling, turn it off and pour some water into the tea cup.” The five-year-old may well succeed at this task, but when you analyse what you have instructed, you will realize the enormous amount of information the child had at her disposal that you did not need to explain. What is a tea cup? Where are they kept? What is water? How do I get water into the kettle? This is all domain knowledge the child has accumulated over a few years. Procedural programming languages cannot learn. The child has learned each step through trial and error over many attempts, and is able to build a new cohesive solution from these partial skills.

Taking one step closer to machine learning from the above analogy, show the child a photo of a common animal (dog, cat, horse, etc.) and ask what animal it is. You would likely be surprised if they could not name it. But how does a child perform this feat? How do they know, instinctively, what animal it is? It could be a dog breed they have never before encountered, and yet most five-year-olds would recognise it immediately. What is it about a photo of a dog that says “this is a dog”? We humans just “know” things like “if it has big floppy ears it is more likely to be a dog than a cat”, and we take all these “known” facts and add them up instantly in our brain and say “dog!”. We have been able to take partial pieces of information and cohesively build them into one coherent result. These partial pieces of information are our first concept. In machine learning, the “partial pieces of information that allow us to decide on an answer” are called features.

Machines learn through a process of identifying which features are most important to a problem area and using those learned importances to make decisions. Going back to the previous example of animal identification, a feature such as “has a head” is easily dismissed by humans. ALL animals have a head (at least those that are alive!). But do they? A starfish does not – but we would still recognise it! If you have ever played “20 questions” you will understand that asking certain questions up front can lead you to a smaller subset of possibilities quickly, but can also cause you to go down the wrong track from the start. “Does it fly” is such a question. If the answer is “No” then we eliminate all birds and start thinking of land or sea animals. But the answer may have been “Ostrich”.

The machine learning algorithm has to decide which features are most important and for what outcome. These “decisions” are called weights. Different combinations of features can be learned together, and these combinations have their own weights. This allows us to overcome the “does it fly/ostrich” type of problem – “Does it fly, and does it stand less than 1.4m tall?” for example.

Within the structure of a machine learning system we have a number of decision-making nodes called neurons. For the sake of simplicity we will only have one layer of neurons for our discussion, and in a future article we will add more. Each neuron is fed information about the problem domain (so each neuron gets access to all the features). Each neuron then determines the importance of the features it has been given and provides information to the output (the part that guesses the answer). If the neuron contributed to the correct answer – that is, its “instinct” was correct – it is rewarded through a reinforcement process that strengthens its decision, and if it is wrong it is told to try something else next time. This repeats for all the neurons. Once it has worked through one game of “20 questions” or “Guess the animal” it is given a second turn, and the process repeats. And it will repeat for every example we have. These examples are called samples.

And we can rerun the full set of samples multiple times (these reruns are called epochs, but we need not worry about that for now).

Over the course of many (many!) samples the machine can learn which features are most likely to indicate that the animal is a dog / cat / horse etc. Some neurons will get excited when they see the pairing of “floppy ears” and “20 or more kgs”, or “retractable claws” and “slit eyes”. This excitement is called an activation. When neurons activate they send a signal to the output node, and given a specific set of activations, the output node will be able to guess which animal was shown.
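A toy version of this whole loop fits in a few lines of Python. Everything here is invented for illustration – the feature names, the samples and the “dog vs cat” labels – but the update rule is the classic perceptron rule, the simplest ancestor of the neurons described above:

```python
def step(z):
    """A crude activation: the neuron either fires (1) or stays quiet (0)."""
    return 1 if z > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Learn weights for a single neuron with the perceptron update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):  # each full pass over the samples is an epoch
        for features, label in zip(samples, labels):
            z = sum(w * f for w, f in zip(weights, features)) + bias
            error = label - step(z)  # 0 if the guess was right
            # Nudge the weights toward the correct answer
            weights = [w + lr * error * f for w, f in zip(weights, features)]
            bias += lr * error
    return weights, bias

# Each sample: [has floppy ears, weighs 20kg or more, has retractable claws]
samples = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 1]]
labels  = [1, 1, 1, 0, 0, 0]  # 1 = "dog", 0 = "cat" (toy labels)

w, b = train_perceptron(samples, labels)
preds = [step(sum(wi * f for wi, f in zip(w, feats)) + b) for feats in samples]
print(preds)  # → [1, 1, 1, 0, 0, 0]
```

In a real system there would be many neurons, continuous activations and far more features, but the loop is the same idea: guess, compare against the label, nudge the weights, repeat.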

If you had played a game of “20 questions” that was instead “2 questions”, you would think the game was rigged. This is because you realise that the more information you have, the better your guess will be. The same is true for machine learning. The more samples we have, and the more features per sample, the better the result. In order to distinguish which animal it is (or, in a business setting, which clients are likely to lapse or not pay us) we need enough samples with enough important features. Note here that we need important features. And much like the “does it have a head?” analogy earlier, it is not always a good idea for us humans to decide what makes a feature good or bad. It may well be that “hair colour”, as an example, is a good indicator of whether or not a person will subscribe to our service! (Maybe our marketing campaigns are subconsciously influencing a certain segment?)

The last part of this mini article I want to touch on is the importance of good data. Imagine we are playing “20 questions” with a small child. We ask “does it fly” and they say “yes”. So we guess “eagle”, “sparrow”, “dragonfly” etc. When we lose, the child says “Ostrich! Haha”. The child just assumed that since it has wings, it flies. This is an example of how poor information can cause the system to learn the wrong things. In a business setting, if our call centre operators routinely misclassify some people as, say, old or young by the sound of their voice or the nature of the language they use, without asking for an age, our data would be incorrect. This incorrect data will then form features and weights in a system, and our results could be invalid. It is not that the machine made an error – it made the correct decisions based on the data it had.

In this article we discussed features, weights and activations and described basic machine learning in the context of how humans learn. If you are left feeling “this is too simple” then fear not. While the mechanics of the process are complex, involving statistics and calculus, the “how it does it in simple terms” is accurate. It is no more difficult to understand the basics of machine learning and avoid the complexity of the mathematics than it is to understand how a child knows what animal she sees and yet know nothing about neuroscience!

 


Welcome

Welcome!

In this series of blog posts I will introduce some key concepts in technology that interest me in a gentle, non-technical manner.

My aim is to explain concepts in everyday terms. Learning begins with an intuitive feeling that we understand a topic, and this is achieved by viewing new material through the lens of what we already know. It is not important to know all the technical terms to begin to understand a topic, that can come later. The joy in learning is to see new ideas grow from existing knowledge.

I am the CIO of a group of Credit Bureaus operating out of South Africa with clients world-wide. You can view our company profile at Bureau House Group.

You can contact me directly at alaincraven@gmail.com for specific questions and comments around my articles.

I am also contactable at alain@cpbonline.co.za if you want us to assist in your own data modelling / machine learning or analytical projects.

“You can know the name of a bird in all the languages of the world, but when you’re finished, you’ll know absolutely nothing whatever about the bird… So let’s look at the bird and see what it’s doing — that’s what counts.”
― Richard Feynman
