A while back I fell into a debate on Twitter about whether OFSTED/RAISE Online stats were wrong. It was about something that’s at the heart of interpreting what’s really going on underneath data: how we can have any confidence that results didn’t just happen by luck.
I can’t say how many times I have watched school managers pick over vast grids of figures and attribute every tiny up or down to a decision they’ve made or not made. Each time, my statistician’s heart has broken a little (yes, even data monkeys have hearts).
It seems to be one of our idiosyncrasies that, when we see something presented as a number, it takes on the status of unquestionable fact. The number for these children is higher than the number for those children because of something I did. Fact.
But there is a hell of a lot that has to happen to get a reading out of a child’s brain and into your spreadsheet, and that means there is a hell of a lot that can go wrong, and it frequently does.
Sciences have learnt the hard way that measurement is a tricky business. Even when we do everything we possibly can to measure things accurately all sorts of unexpected stuff always gets in the way to mess up the figures.
This is such a common problem that it is standard practice across the sciences for researchers to show that their results are not likely to be due to the vagaries of measurement.
Scientists think measuring anything is such a big problem that they proudly assume results are always caused by bad measurement until they can be persuaded otherwise.
A main way researchers try to summit this mountain of skepticism is to try to figure out how flaky the measurement was and see if their results are bigger than this flakiness.
Researchers ask: ‘are my results bigger than the margin of error in my measurements?’ Imagine your results have gone up 1%. This is great; congratulate yourself. But then you find out the margin of error for that measurement was plus or minus 2%. That means flakiness in measurement alone could make the figure read 2% higher or 2% lower than the truth, so your 1% rise is comfortably within what luck can produce; the real change could be zero, or even a fall. Knowing the margin of error makes the result look like luck.
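To make that arithmetic concrete, here is a minimal sketch in Python; the numbers are the invented ones from the example above, not real school data:

```python
# A minimal sketch of the 'is my result bigger than the margin of error?'
# check, using the invented numbers from the example above.
observed_change = 1.0   # results appear to have gone up 1 percentage point
margin_of_error = 2.0   # but the measurement is only good to +/- 2 points

# The range of true changes consistent with the measurement:
low = observed_change - margin_of_error
high = observed_change + margin_of_error
print(f"The true change could be anywhere from {low}% to {high}%")

if abs(observed_change) > margin_of_error:
    print("Result exceeds the margin of error: worth taking seriously.")
else:
    print("Result is within the margin of error: could just be luck.")
```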
Sciences take margins of error so seriously they have developed a way of describing their results and the margin of error at the same time. It seems likely they came up with it because they got tired of having to ask for several statistics before deciding whether results were worth bothering with. The method they most generally use is ‘significance’ testing. There, I said it!
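To show what such a test actually looks like, here is a minimal sketch using an off-the-shelf t-test; the scores are invented, and this is only an illustration of the idea, not OFSTED’s actual method:

```python
# A minimal sketch of a significance test on invented score data.
# ttest_ind asks: is the difference between the group means big compared
# with the variation (the 'flakiness') we see inside the groups?
from scipy import stats

school_scores   = [54, 61, 58, 63, 49, 57, 60, 55, 62, 59]  # invented
national_scores = [52, 55, 50, 58, 47, 53, 56, 51, 54, 49]  # invented

t_stat, p_value = stats.ttest_ind(school_scores, national_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")

# A small p-value means a difference this big would rarely turn up
# through measurement luck alone.
if p_value < 0.05:
    print("Difference is unlikely to be luck: 'significant'.")
else:
    print("Difference could plausibly be measurement luck.")
```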
This is what the argument on Twitter was about: how can you, or OFSTED, be confident that the difference between your school and national is not due to bad measurement, i.e. that it is bigger than the margin of error? This issue has really big consequences, because if the differences are due to bad measurement and not anything you have done then OFSTED really cannot, should not, and must not hold you accountable for them. If the data’s flaky you should not hold yourself accountable for it, and you certainly should not hold your staff accountable for it.
The Twitter discussion seemed to be of interest to quite a few people, perhaps because it’s high-stakes stuff and could reflect poorly on OFSTED’s credibility. It may also have been of interest because significance tests are increasingly finding their way into education but are not well understood. Conscientious people want to know more about them and why they keep cropping up. This blog is an attempt to reproduce the argument to help people make up their own minds.
The argument went something like this:
One protagonist felt the way OFSTED used significance tests in RAISE Online was wrong. Their problem concerned the way significance tests figure out the margin of error. Significance tests do this by looking at results in a comparison (control) group and seeing how much scores within this comparison group tend to vary (how much bad measurement there is).
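As a rough illustration of that step, here is a minimal sketch; the comparison scores are invented, and the 1.96 multiplier is the usual 95% normal-approximation figure, not anything specific to RAISE:

```python
# A minimal sketch, on invented data, of turning the spread within a
# comparison group into a margin of error for its average.
import statistics

control_scores = [52, 55, 50, 58, 47, 53, 56, 51]  # invented comparison group

sd = statistics.stdev(control_scores)        # how much scores tend to vary
se = sd / (len(control_scores) ** 0.5)       # standard error of the mean
margin = 1.96 * se                           # approximate 95% margin of error

print(f"Scores vary by about {sd:.1f} points (standard deviation)")
print(f"So the group average is only good to about +/- {margin:.1f} points")
```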
To get your head around why they do this it’s necessary to look at the reasoning behind significance testing; unfortunately there is no other way.
Imagine you have invented the cure for a disease but no-one believes you. You will need to prove it. If you have really found a cure, your treatment will cure everyone all the time. However, the trouble with healthcare is that people are weirdly different, and a drug that cures most can harm quite a few. You can’t test your treatment on everyone, because that’s not practical, so you have to test a small number of people.
Trouble is, if you cure your small group you may only have proven that your treatment cures the small group you tested, when what you are really trying to do is find a cure that works for everybody.
So the real job facing you is to study a small group, but in such a way that you can reasonably say that if it works for your study group it will probably work for everybody. What you are aiming for is a study that allows you to reasonably extrapolate what you find in a few out to everybody. In other words, your study should allow you to draw an inference about how all people will react based on results from only a few. This is why significance tests are called ‘inferential’ statistics: they are designed to support conclusions about everyone, whom you cannot test directly, from the study of only a few. For example, we can never test a new drug on everyone before it’s released; we have to infer it’s safe based on studies of its effects on only a few people.
To show that your treatment always cures everyone, you will need to test your cure on a group that includes people very similar to all the people in the world. To show that it’s your treatment that is actually curing people, you will need to show that it cures people at a much higher rate than people get better naturally without it. To do that, you will need to compare the cure rate for the people you treat with that of others who are very similar in every way but are not receiving your treatment.
So you get hold of a small number of people: some to test your treatment on, and some to use as a comparison group you won’t treat.
Trouble is, when you recruit a small group of people there is a strong chance they won’t include examples of all the kinds of people that occur in the entire world. There’s a strong chance that by accident you recruit particularly healthy people, or ones with particularly bad habits. This can’t be avoided; providence is a fickle mistress, and that’s how the universe works. To get around this, and maximise your chances of finding a variety of people that accurately reflects the variety of everyone in the world, it’s recommended you recruit at random.
It’s very hard to recruit completely at random for all sorts of reasons; for example, it’s hard to recruit people living on the other side of the world, who speak another language and might be unwilling or unable to participate. However, the idea is to try your best to gather people as randomly as you can. This is because random has a magical quality: as recruitment becomes more random, the chances of each and every kind of person being recruited become more equal. If the chances of being recruited get close to equal, all the qualities and factors present in all the people in the entire world become more likely to occur in the small group you recruit. You are never going to be able to include all the very rare conditions and factors likely to affect a disease. However, this is where the real magic of random comes in: if you get anywhere close to random, the likelihood of your sample containing rare conditions and factors will approach the rate of their occurrence in the entire world. This means that without having to know anything about what affects the disease, you can ensure that everything that does will be represented in your study. The universe sorts it out for you. How cool is that?
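You can watch this happen in a minimal sketch; the million-person population and the 1-in-200 rare trait are invented purely for illustration:

```python
# A minimal sketch of the 'magic of random': as random samples grow, a
# rare trait shows up at roughly its true rate in the wider population.
import random

random.seed(0)
# Invented population of one million people, 1 in 200 of whom (0.5%)
# carry some rare trait.
population = ["rare" if i % 200 == 0 else "common" for i in range(1_000_000)]

for n in (100, 10_000, 500_000):
    sample = random.sample(population, n)
    rate = sample.count("rare") / n
    print(f"random sample of {n:>7,}: rare trait at {rate:.4%} (truth: 0.5000%)")
```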
Significance tests work best when the groups being analysed are drawn at random from the wider group you want to infer about. Otherwise your results, i.e. any significance you might find, might always just be the freak luck of having selected a weird group to study that is not like the wider group you want to know about. You want to cure everybody, not just those you happen to test.
Researchers don’t just recommend that an analysis selects those to be studied at random; they insist it is the principle that makes this form of analysis useful rather than misleading. Indeed, they maintain the technique is valid to the extent that those analysed are recruited at random and, conversely, that these tests become corrupted and misleading to the extent that they are not. These kinds of very strong guidelines or principles are known as ‘assumptions’, and a lot of what statisticians argue about is whether the assumptions have been met.
Now back to the argument. The problem one protagonist had with OFSTED’s use of significance tests in RAISE is that the group being analysed, all the schools in the OFSTED database, is not drawn at random from the wider group, because it is all schools. That is, the schools are not selected at random because, in a sense, they are not drawn or selected at all: they are the wider group. Consequently, in this protagonist’s view, this violates one of the main assumptions upon which significance tests are based, and therefore they should not be used.
Another protagonist agreed that the selection was very unusual, because it is at the very least a very large proportion of the wider group; indeed, they agreed it may even be the wider group. However, they felt that it didn’t matter even if it was.
They argued it did not matter because the whole reason statisticians want the selection to be random is so that it is a good representation of the wider group. What could be a better representation of the wider group than actually having the wider group?
Put another way, imagine you were being treated with a drug that had been tested on every single human being on the planet and found to heal everyone and harm no-one. Would you be worried that the statistical analysis of that study was misleading? You should have more faith in that drug than in any other, because it had been tested on every possible kind of condition and variety of constitution that there is.
The main question we have to answer is: do our results exceed the margin of error in our measurements? To answer that we need a good, clear sight of the margins of error, and we get that from knowing our data is a good reflection of the wider group we want to generalise about: everybody, all schools. We know the OFSTED database is a good reflection of all schools because it is damn nearly all of them. OFSTED’s use of significance tests is not inappropriate; it’s nearly ideal.
Beware of OFSTED bashers and beware of stats bashers. It’s great catharsis, a great way to release the pressure both exert on life in the front line of schools, but it’s low-hanging fruit to get support online for a rant against OFSTED. They are not perfect, a million miles from it. However, as far as I can see, from everything I’ve learned, OFSTED’s use of significance tests is appropriate. More importantly, interpreted carefully, sig tests would massively reduce the number of things you need to worry about, the things you berate your staff for, and the number of sticks OFSTED can use to beat us with.
This is how the BBC puts it (a short sketch checking the coin-toss arithmetic follows the list): http://www.bbc.co.uk/news/science-environment-18521327
Statistics of a ‘discovery’
- Particle physics has an accepted definition for a “discovery”: a five-sigma level of certainty
- The number of standard deviations, or sigmas, is a measure of how unlikely it is that an experimental result is simply down to chance, in the absence of a real effect
- Similarly, tossing a coin and getting a number of heads in a row may just be chance, rather than a sign of a “loaded” coin
- The “three sigma” level represents about the same likelihood of tossing nine heads in a row
- Five sigma, on the other hand, would correspond to tossing more than 21 in a row
- Unlikely results are more probable when several experiments are carried out at once – equivalent to several people flipping coins at the same time
- With independent confirmation by other experiments, five-sigma findings become accepted discoveries
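Here is a minimal sketch verifying those coin-toss comparisons; it assumes the usual one-sided reading of a k-sigma result as the upper tail of a normal distribution:

```python
# A minimal sketch checking the coin-toss comparisons above.
# P(n heads in a row with a fair coin) = 0.5 ** n, and a k-sigma result
# corresponds to the one-sided tail probability of a normal distribution.
from statistics import NormalDist

def tail(k_sigma: float) -> float:
    """One-sided probability of a result at least k sigma above the mean."""
    return 1 - NormalDist().cdf(k_sigma)

print(f"3 sigma tail: {tail(3):.2e}   vs  9 heads in a row: {0.5 ** 9:.2e}")
print(f"5 sigma tail: {tail(5):.2e}   vs 22 heads in a row: {0.5 ** 22:.2e}")
```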