The ‘S’ word: Are OFSTED RAISEonline stats wrong?

John F Brown / January 23, 2015

A while back I fell into a debate on Twitter about whether OFSTED/RAISEonline stats were wrong. It was about something that’s at the heart of interpreting what’s really going on underneath data: how we can have any confidence that results didn’t just happen by luck.

I can’t say how many times I have watched school managers pick over vast grids of figures and attribute every tiny up or down to a decision they’ve made or not made. Each time, my statistics heart has broken a little (yes, even data monkeys have hearts).

It seems to be one of our idiosyncrasies that when we see something presented as a number it takes on the status of unquestionable fact. The number for these children is higher than the number for those children because of something I did, fact.

But there is a hell of a lot that has to happen to get a reading out of a child’s brain and into your spreadsheet, and that means there is a hell of a lot that can go wrong, and frequently does.

Sciences have learnt the hard way that measurement is a tricky business. Even when we do everything we possibly can to measure things accurately all sorts of unexpected stuff always gets in the way to mess up the figures.

This is such a common problem that it is standard practice in all sciences that one of the main things researchers have to do is show that their results are not likely to be due to the vagaries of measurement.

Scientists think measuring anything is such a big problem that they proudly assume results are always caused by bad measurement until they can be persuaded otherwise.

A main way researchers try to summit this mountain of skepticism is to try to figure out how flaky the measurement was and see if their results are bigger than this flakiness.

Researchers ask ‘are my results bigger than the margin of error in my measurements?’ Imagine your results have gone up 1%. This is great, congratulate yourself. But you then find out the margin of error for that measurement was plus or minus 2%. This means flakiness in measurement alone could have moved your results up or down by as much as 2%, so the real change could be anything from a 1% fall to a 3% rise. Knowing the margin of error makes the result look like luck.
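For anyone who likes to see the arithmetic spelled out, here is a minimal Python sketch of that check. The function names are made up for illustration, and the figures (a 1% rise with a margin of plus or minus 2%) are simply the example above, not anything taken from RAISEonline.

```python
def plausible_range(observed_change, margin_of_error):
    """Return the (lowest, highest) real change consistent with the measurement."""
    return observed_change - margin_of_error, observed_change + margin_of_error


def exceeds_margin(observed_change, margin_of_error):
    """True only if the observed change is bigger than the measurement flakiness."""
    return abs(observed_change) > margin_of_error


# Illustrative figures only: results up 1 percentage point, margin of error 2 points.
low, high = plausible_range(1.0, 2.0)
print(f"Real change could be anywhere between {low}% and {high}%")
print("Bigger than the margin of error?", exceeds_margin(1.0, 2.0))
```

Because the plausible range includes zero (and a fall), a 1% rise on its own gives no grounds for congratulation or blame.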

Sciences take margins of error so seriously they have developed a way to describe their results whilst at the same time describing the margins of error. It seems likely they came up with it because they got tired of having to ask for several statistics before deciding whether results were worth bothering with. The method they most generally use is ‘significance’ testing. There, I said it!

This is what the argument on Twitter was about: how can you or OFSTED be confident that the difference between your school and national is not due to bad measurement, and is bigger than the margin of error? This issue has really big consequences, because if the differences are due to bad measurement and not anything you have done then OFSTED really cannot, should not, and must not hold you accountable for them. If the data’s flaky you should not hold yourself accountable for it and you certainly should not hold your staff accountable for it.

The Twitter discussion seemed to be of interest to quite a few people, perhaps because it’s high stakes stuff and could reflect poorly on OFSTED’s credibility. It may also have been of interest because significance tests are increasingly finding their way into education but are not well understood. Conscientious people want to know more about them and why they are increasingly cropping up. This blog is an attempt to reproduce the argument to help people make up their own minds.

The argument went something like this:

One protagonist felt the way OFSTED used significance tests in RAISEonline was wrong. Their problem concerned the way significance tests figure out the margin of error. Significance tests do this by looking at results in a comparison (control) group and seeing how much scores within this comparison group tend to vary (how much bad measurement there is).

To get your head around why they do this it’s necessary to look at the reasoning behind significance testing. Unfortunately there is no other way.

Imagine you have invented the cure for a disease but no-one believes you. You will need to prove it. If you have really found a cure, your treatment will cure everyone all the time. However, the trouble with healthcare is that people are weirdly different and a drug that cures most can harm quite a few. You can’t test your treatment on everyone because that’s not practical, so you have to test a small number of people.

Trouble is if you cure your small group you may have only proven that your treatment cures the small group you tested and what you are really trying to do is find a cure that works for everybody.

So the real job facing you is to study a small group but in such a way as it allows you to reasonably say that if it works for your study group it will probably work for everybody. So what you are aiming for is to make a study that allows you to reasonably extrapolate what you find from a few out to everybody. In other words your study should allow you to draw an inference about how all people will react based on results from only a few. This is why significance tests are called ‘Inferential’ statistics: they are designed to support conclusions about everyone you cannot test directly from the study of only a few. For example, we can never test a new drug on everyone before it’s released; we have to infer it’s safe based on studies of its effects on only a few people.

To show that your treatment always cures everyone you will need to test your cure on a group that includes people very similar to all the people in the world. To show that it’s your treatment that is actually curing people you will need to show that your treatment cures people at a much higher rate than people get better naturally without your treatment. To do that you will need to compare the cure rate for people you treat to others that are very similar in every way but are not receiving your treatment.

So you get hold of a small number of people: some to test your treatment on, and some to use as a comparison group who you won’t treat.

Trouble is when you recruit a small group of people there is a strong chance they won’t include examples just like all the kinds of people that occur in the entire world. There’s a strong chance that by accident you recruit particularly healthy people or ones with particularly bad habits. This can’t be avoided because providence is a fickle mistress, that’s how the universe works. To get around this and maximise your chances of finding a variety of people that accurately reflect the variety of everyone in the world it’s recommended you recruit at random.

It’s very hard to recruit completely at random for all sorts of reasons. For example, it’s hard to recruit people living on the other side of the world, who speak another language and might be unwilling or unable to participate. However, the idea is to try your best to gather people as randomly as you can. This is because random has a magical quality: as recruitment becomes more random, the chances of all and any kind of person being recruited become more equal. If the chances of being recruited get close to equal, all the qualities and factors present in all the people in the entire world become more likely to occur in the small group you recruit. You are never going to be able to include all the very rare conditions and factors likely to affect a disease. However, this is where the real magic of random comes in, because if you get anywhere close to random the likelihood of your sample containing rare conditions and factors will approach the rate of their occurrence in the entire world. This means that without you having to know anything about what affects the disease you can ensure all the things that do will be included in your study. The universe sorts it out for you. How cool is that?
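A small simulation can make that ‘magic of random’ visible. The sketch below is illustrative only: the population of 100,000 people, the 2% rate of a rare condition and the function name are all invented, and Python’s standard `random` module stands in for real recruitment.

```python
import random


def sample_rate(population, sample_size, rng):
    """Share of a randomly recruited sample that has the rare condition."""
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


# Hypothetical population: 100,000 people, 2% of whom have a rare condition (1 = has it).
population = [1] * 2_000 + [0] * 98_000
true_rate = sum(population) / len(population)

rng = random.Random(2015)  # fixed seed so the run is repeatable
for sample_size in (50, 500, 5_000, 50_000):
    rate = sample_rate(population, sample_size, rng)
    print(f"sample of {sample_size:>6}: {rate:.3%} (population: {true_rate:.1%})")
```

Small random samples can land well away from 2%, which is the ‘fickle mistress’ at work, while larger ones settle close to the population rate without anyone having to know in advance who has the condition.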

Significance tests are best when the groups being analysed are drawn at random from a wider group you want to infer about. This is because otherwise your results, i.e. any significance you might find, might always just be freak luck that you selected a weird group to study that are not like the wider group you want to know about. You want to cure everybody not just those you happen to test.

Researchers don’t just recommend that those to be studied are selected at random; they insist that this is the principle that makes this form of analysis useful and not misleading. Indeed, they maintain this technique is valid to the extent those analysed are recruited at random. Conversely, they argue these tests become corrupted and misleading to the extent they are not. These kinds of very strong guidelines or principles are known as ‘assumptions’, and a lot of what statisticians argue about is whether these assumptions have been met or not.

Now back to the argument. The problem one protagonist has with OFSTED’s use of significance tests in RAISE is that the group being analysed, all schools in the OFSTED database, is not drawn at random from the wider group, because it is all schools. That is, the schools are not selected at random because in a sense they are not drawn or selected at all: they are the wider group. Consequently, in one protagonist’s view, this violates one of the main assumptions upon which significance tests are based and therefore they should not be used.

Another protagonist agreed that the selection was very unusual, because it is at the very least a very large proportion of the wider group. Indeed, they agreed it may even be the wider group. However, they felt that it didn’t matter even if it was.

They argued it did not matter because the whole reason statisticians want the selection to be random is so that it is a good representation of the wider group. What could be a better representation of the wider group than actually having the wider group?

Put another way, imagine you were being treated using a drug that had been tested on every single human being on the planet and it was found to heal everyone and harm no-one. Would you be worried that the statistical analysis of that study was misleading? You should have more faith in that drug than any other because it had been tested on every possible kind of condition and variety of constitution that there is.

The main question we have to answer is: do our results exceed the margin of error in our measurements? To answer that we need a good clear sight of the margins of error, and we get that from knowing our data is a good reflection of the wider group we want to generalise about: everybody, all schools. We know the OFSTED database is a good reflection of all schools because it is damn nearly all of them. OFSTED’s use of significance tests is not inappropriate; it’s nearly ideal.
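To make ‘exceeding the margin of error’ concrete, here is a rough sketch of one textbook way of doing it: a z-test of a school’s average against the national average and spread. It is only an assumption that this resembles the calculation behind the reports; the function names, the 1.96 cut-off (the conventional 5% level) and every figure below are invented for illustration.

```python
import math


def z_score(school_mean, national_mean, national_sd, cohort_size):
    """How many standard errors the school's average sits from the national average."""
    standard_error = national_sd / math.sqrt(cohort_size)
    return (school_mean - national_mean) / standard_error


def is_significant(z, critical_value=1.96):
    """Conventional 5% test: is the gap bigger than chance variation would suggest?"""
    return abs(z) > critical_value


# Made-up figures: school average 28.0, national average 27.0, national spread 4.0.
for cohort_size in (30, 120):
    z = z_score(28.0, 27.0, 4.0, cohort_size)
    print(cohort_size, round(z, 2), is_significant(z))
# 30 pupils:  z = 1.37, not significant
# 120 pupils: z = 2.74, significant
```

The same one-point gap is within the margin of error for a cohort of 30 but outside it for a cohort of 120, because averages of larger groups bounce around less.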

Beware of OFSTED bashers and beware of stats bashers. It’s great catharsis, a great way to release the pressure both exert on life in the frontline of schools. But it’s low hanging fruit to get support online for a rant against OFSTED. They are not perfect, a million miles from it. However, as far as I can see, from everything I’ve learned, OFSTED’s use of significance tests is appropriate. More importantly, if they are interpreted carefully, sig tests would massively reduce the number of things you need to worry about and berate your staff for, and the number of sticks OFSTED can use to beat us with.

This is how the BBC puts it: http://www.bbc.co.uk/news/science-environment-18521327

Statistics of a ‘discovery’

Particle physics has an accepted definition for a “discovery”: a five-sigma level of certainty
The number of standard deviations, or sigmas, is a measure of how unlikely it is that an experimental result is simply down to chance, in the absence of a real effect
Similarly, tossing a coin and getting a number of heads in a row may just be chance, rather than a sign of a “loaded” coin
The “three sigma” level represents about the same likelihood of tossing nine heads in a row
Five sigma, on the other hand, would correspond to tossing more than 21 in a row
Unlikely results are more probable when several experiments are carried out at once – equivalent to several people flipping coins at the same time
With independent confirmation by other experiments, five-sigma findings become accepted discoveries
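The coin-toss comparisons can be checked with a few lines of Python. This sketch assumes a fair coin and reads ‘n sigma’ as the one-sided tail of a normal distribution; the function names are invented for illustration.

```python
import math


def one_sided_tail(sigmas):
    """Chance of a result at least this many sigmas above the mean, by luck alone."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2))


def equivalent_heads_in_a_row(sigmas):
    """Run of heads from a fair coin that is exactly as unlikely as the sigma level."""
    return -math.log2(one_sided_tail(sigmas))


for sigmas in (3, 5):
    p = one_sided_tail(sigmas)
    heads = equivalent_heads_in_a_row(sigmas)
    print(f"{sigmas} sigma: p = {p:.2e}, about {heads:.1f} heads in a row")
```

Three sigma comes out between nine and ten heads in a row, and five sigma between 21 and 22, which matches the ‘about nine’ and ‘more than 21’ in the BBC box.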

13 thoughts on “The ‘S’ word: Are OFSTED RAISEonline stats wrong?”

e=mc2andallthat says:

January 23, 2015 at 11:30 pm

Reblogged this on The Echo Chamber.

Mark Bennet says:

January 24, 2015 at 8:40 am

I’m not sure how this takes account of the variability in the individual measurements for a single school. If a cohort had been tested on another day, or using last year’s test, would the school results be the same or different? How different would they be? RAISE is used to draw conclusions about individual schools as well as system-wide conclusions, and it is the validity of those conclusions about single schools which I think is in question.

To put the question acutely, there are cases I know of where a group of five primary school pupils have all fallen one mark below a threshold. Most years there would have been two or three above, and in a good year they’d all hit the mark (and there is reporting bias here – no-one complains when that happens). Next year the cohort is different and most of the children are clear of the threshold one way or another. Such within-school variability can move a school up and down the RAISE scale by large amounts. Aggregating all schools in a database, similar schools can have very different outcomes within it. The large sample controls for the between-school variability – maybe there are twenty schools in the country with a similar cohort, and most of those will fall in the middle. But the results of some of the outliers will still be significantly affected by chance. One in twenty schools is expected to be outside a 5% significance bound.

The other issue about the within school variability (or error) is that without knowing it, it is impossible to tell whether apparent trends (especially over three years, and three data points) are simply random variations, or whether they reflect a real improvement or decline. Judging an individual school you really need error bars on the school data points.

The issue I have with RAISE is that it is not clear what kind of significance is being measured. The presentation invites over interpretation. Ofsted should be asking more often “were they lucky, or were they good?” and “were they unlucky, or are they as poor as they look?”

- educationresearcher says:
  
  January 24, 2015 at 10:28 am
  
  These accidental ups and downs, like one year group just happening to be brighter than another, are exactly what significance testing overcomes, because sig testing takes into account the ups and downs in all schools caused by these kinds of things, and all schools will suffer from these accidents of birth just as much as your school does.
  
  Concerning your second paragraph, if science is happy to say that odds of 1 in 20 indicate something meaningful then that’s good enough for me.
  
  OFSTED RAISEonline significance testing measures whether your school results are different from all schools in the UK, taking into account the margin of error in results.
  
  - Mark Bennet says:
    
    January 24, 2015 at 12:11 pm
    
    I am not talking or complaining about my own school, though I am a school governor. I want to raise a point of principle.
    
    RAISE has no way of knowing whether this year’s cohort was a risky cohort, with lots of borderline children, or not. Those children who appear near the borderline may be ‘middle’ children who have over or under-performed. It doesn’t sample in a way in which I can put error bars on a particular cohort, because the data doesn’t capture all the variability; only testing on different days with different materials would capture the kind of variability I am trying to address. This is evened out in the global picture because all the schools with similarly risky cohorts are included, and the variation in the whole population averages out. But I don’t know which those similarly risky schools are when I dig down to a school level, so I don’t know the variability for “a cohort like this”. There are some poor proxies for this which measure similarity in other dimensions.
    
    5% is a conventional marker of significance, but in interpreting it, we need to be aware that, sampling from any reasonable distribution, we would expect some schools to look significant by chance – and RAISE can’t tell us which schools they are.
    
    In a particular school – any one – how can you tell from the RAISE statistics whether the school has had a lucky year, or whether teaching and learning has improved? Addressing this as a governor, the data provide the question, not the answer.
    
Jack Marwood says:

January 24, 2015 at 4:45 pm

I’ve covered some basic concerns with the use of statistics in RAISEonline on my blog Icing on the Cake. Here is a particularly relevant passage from http://icingonthecakeblog.weebly.com/blog/raiseonline-is-contemptible-rubbish:

“What is RAISEonline testing? You need to be fairly sure that your data is independent and identically distributed to use Z-scores and the like. If you want to test whether the test results for a given school is statistically significant when compared to a national mean and standard deviation, as RAISEonline does, you are effectively testing a ‘school effect’. Is there something about this school which makes it different to a control sample, in RAISEonline’s case all the children contributing results for a given school year?

So what does make the school different? Is it the quality of the teaching and learning, as RAISEonline implicitly assumes? Is it a particular cohort’s teaching and learning? Is it the socio-economic background of the children? Is it their prior attainment? Is it their family income? Or is it a combination of these factors?

All of this begs the question, what is RAISEonline actually assessing?”

I also outlined some problems with RAISEonline’s use of statistics:

The Not Independent and Identically Distributed problem
The Primary Age problem
The Key Stage 1 Data Manipulation problem
The Loss of Definition Problem
The Missing Data Problem
The Misunderstanding Significance problem.

I’ve subsequently written about the Fine Grade problem (http://icingonthecakeblog.weebly.com/blog/another-fine-mess-fine-grades-are-wrong-and-misleading), which assumes a level of accuracy in Key Stage 2 testing which is entirely unjustified.

In addition to these problems, I’ve also written about the confounding problems of bad measurement and shadow education (http://icingonthecakeblog.weebly.com/blog/ofsteds-use-of-test-scores-to-judge-schools-is-ridiculous) and the problems with significance testing in general (http://icingonthecakeblog.weebly.com/blog/hammering-nails-in-raiseonlines-coffin).

All of this means that it simply isn’t possible to agree with your conclusion that ‘OFSTEDs use of significance tests is not inappropriate it’s nearly ideal. As far as I can see, from everything I’ve learned OFSTEDs use of significance tests is appropriate.’ Ofsted’s – and anyone else’s – entirely incorrect use of significance testing in RAISEonline is not appropriate in any way whatsoever.

By all means look at the numbers, but with an understanding that you are looking at a clustered population, not a sample of anything. A class or cohort represents nothing but itself, and any suggestion that it can be compared to any other class or cohort as if it were a sample of an independently and identically distributed wider population is simply wrong.

This isn’t having a go at Ofsted per se – and those whom I’ve spoken to at Ofsted seem to recognise the problems they have with Inspectors who make the kinds of assumptions you do. It’s simply an argument that the way in which data analysis has been developed in the last twenty years or so is built on flawed foundations.

educationresearcher says:

January 24, 2015 at 8:28 pm

Most readers won’t be familiar with the technical terms ‘independent, identically distributed or Z Scores’; until you spell them out in a way that allows others to weigh the arguments, you won’t persuade.

Your position seems to boil down to saying “I believe each class, cohort and school is so unique they cannot be compared to other classes, cohorts and schools”. Of course all children are special and unique, each class has its own character, and each school has slightly different social conditions determining its intake. No-one could reasonably suggest there are no differences between each and every school. But it seems reasonable that there is enough similarity between classes, cohorts and schools for analysis comparing them to be informative, helpful and indicative of something important.

I once supported two primary schools serving the same housing estate, located metres apart. Admission to both was determined by a lottery, yet one massively outperformed the other. Exactly the same social conditions, different outcomes.

- Jack Marwood says:
  
  January 27, 2015 at 4:49 pm
  
  Firstly, your specific point about the statistics used in RAISEonline: I’m aware that most readers “won’t be familiar with the technical terms ‘independent, identically distributed or Z Scores’; until you spell them out in a way that allows others to weigh the arguments, you won’t persuade.” That’s why I laid them out in my post http://icingonthecakeblog.weebly.com/blog/raiseonline-is-contemptible-rubbish, which I linked to in my comment.
  
  Since you have assumed that your readers won’t follow the link, here’s the relevant passage on independent, identically distributed variables (IID):
  
  “The Not Independent and Identically Distributed problem
  
  If you are testing the effect of a given fertiliser on a given species of plant, you can be fairly sure that each of the plants in a sample of plants is independent and identically distributed. This is complicated, but essentially, you should be able to swap any of your plants between sample groups before conducting your experiment, since any given plant should react to the experiment in the same way.
  
  In order to be able to assume that a sample cohort which has supplied test results for a school is independent and identically distributed, you should be certain that a completely different random group of children subject to the same teaching and learning would perform in exactly the same way.
  
  This seems entirely unlikely, since a given cohort in a given school will not be randomly selected from the entire population. The children are likely to be similar to each other in a statistically significant way – which could be socio-economic background, prior attainment, family income and so on. And that means that attainment levels of the cohort are not independent and identically distributed random variables.”
  
  As for z-scores; well, yes, they are complicated – which is why I explained them at great length in the post. I hope you and your readers have read it.
  
  Significance Testing has come under sustained attack for some time now, as academics have pulled it apart and exposed its many flaws. Even with its strict assumptions – completely ignored in school test score analysis – it has had its day*.
  
  And yes, my position does boil down to “I believe each class, cohort and school is so unique they cannot be compared to other classes, cohorts and schools” with one important corollary – which is that classes, cohorts and schools can clearly be compared but *not by using sampling theory and statistical significance*, which is what RAISEonline does and which you have defended in your blog. It is Not Even Wrong to compare schools in this way, as it completely clouds the judgement of any data-illiterate person reading a RAISEonline report, which clearly includes a large number of people.
  
  To spell out my position: It is simply not reasonable to use flawed tests of statistical significance with clustered population data which is badly measured, ineptly collated, often incorrect, clearly incomplete and derived from the efforts of pupils who are emphatically not IID. You may not like that, but the mathematics behind significance tests requires various basic conditions to have been met, and this simply doesn’t happen in RAISEonline. I am dismayed that the government has chosen to do this, and surprised that you would defend this misuse of statistics.
  
  As for your unverifiable story about two anonymous schools, you shouldn’t need me to remind you that anecdote isn’t data, but I will. There could be any number of reasons for the effects you have observed, but tests of statistical significance are neither correct nor relevant. By all means compare schools – some clearly get better results than others for all sorts of complicated reasons – but we shouldn’t misuse tests of statistical significance to do so.
  
  *Here are some academic papers to confirm this:
  
  Click to access CEMWeb037%20The%20Case%20Against%20Statistical%20Significance%20Testing.pdf
  
  eprints.bham.ac.uk/…/Gorard_2010_Oxford_Review_of_Education.pdf
  http://www.methodspace.com/forum/attachment/download?id=2289984%3AUploadedFile%3A271610
  
  Click to access gill99.pdf
  
  And a whole list of papers compiled by Professor Robert Coe http://community.dur.ac.uk/r.j.coe/teaching/critsig.htm
  
  - educationresearcher says:
    
    January 31, 2015 at 12:46 pm
    
    You’re very much in the minority thinking schools can’t be compared. Of course that does not mean you are wrong, but I think it does mean you would need some very, very clear evidence that most schools are not like most other schools.
    
    There is a big chunk of randomness determining which school a child goes to; the biggest factor is catchment area.
    
    I went to my primary school because in 1972 my dad got a job in Warwickshire and moved from Stevenage to a village near Leamington Spa. It could have been any village near Leamington Spa. It just so happened there was a house for sale at the time at the right price that they liked in that village. Of course socio-economics played a role; the ratio of private housing to council housing in that village was probably higher than nationally, but I guess not by much. There was only one primary so every child in the village went there.
    
    There was very little determining whether I went to that primary or a hundred others. The intake of the schools was determined by the housing and this was very similar right across the region. For your argument to hold there would need to be systemic difference between schools, in Warwickshire at least there just wasn’t any difference.
    
    I do agree some schools are unlike most because of the unique arrangement of the communities they serve. I would be in favour of excluding these from significance testing, because in these cases you are not comparing like with like. This is the clustered sampling problem you talk about. But let’s not throw the baby out with the bath water; many/most schools are very much like many others.
    The Education Endowment Foundation has recently released a website for comparing schools. It has adopted the London Challenge idea of Families of Schools: 40 schools grouped by socio-economic factors. I would prefer significance testing to be done comparing schools with others in the same family.
    
    Concerning the citations you list criticizing significance testing: firstly, just because a few people have offered an opinion it doesn’t mean the argument has been settled. It is nonsense to represent this handful as majority opinion in statistics or research science. Surely you must accept the majority opinion in the sciences is in favour of Bayesian parametric statistics that take the margin of error into account. Secondly, as @ollieorange2 points out, there is this weird clique in education research with an agenda to reinvent educational statistics in a way researchers in other fields just don’t recognise.
    https://ollieorange2.wordpress.com/2014/07/27/significance-testing-part-1/
    
jackmarwoodiotc says:

January 31, 2015 at 3:50 pm

Hmm. This is getting somewhat circular. But here goes:

As I said, ‘By all means compare schools – some clearly get better results than others for all sorts of complicated reasons – but we shouldn’t misuse tests of statistical significance to do so.’ So I’m not sure why you would say ‘You’re very much in the minority thinking schools can’t be compared.’ I have said clearly that they can be compared, but not using tests of statistical significance.

Anecdotes about school choice are interesting, but aren’t really relevant for a discussion about RAISE’s use of significance testing. RAISE doesn’t just compare schools in Warwickshire, it compares them to schools in rural Northumberland, post-industrial Hartlepool, inner city London, farthest Cornwall and all points in between. Whatever your experience might tell you, there *are* systematic differences between school intakes, for which RAISE makes no concession.

We could get into all kind of discussions about current thinking on statistics. I’ll simply offer the following link: http://www.stats.org.uk/statistical-inference/Gill.pdf. Yes, he’s not a trained mathematician, but he references those who are. Here’s an extensive quote from it:

“Led in the social sciences by psychology, many are challenging the basic tenets of the way that nearly all social scientists are trained to develop and test empirical hypotheses. It has been described as a “strangle-hold” (Rozeboom 1960), “deeply flawed or else ill-used by researchers” (Serlin and Lapsley 1993), “a terrible mistake, basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology” (Meehl 1978), “an instance of the kind of essential mindlessness in the conduct of research” (Bakan 1960), “badly misused for a long time” (Cohen 1994), and that it has “systematically retarded the growth of cumulative knowledge” (Schmidt 1996). Or even more bluntly: “The significance test as it is currently used in the social sciences just does not work.” (Hunter 1997)
Statisticians have long been aware of the limitations of null hypothesis significance testing as currently practiced in political science research. Jeffreys (1961) observed that using p-values as decision criteria is backward in its reasoning: “a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.”

Another common criticism notes that this interpretation of hypothesis testing confuses inference and decision making since it “does not allow for the costs of possible wrong actions to be taken into account in any precise way” (Barnett 1973). The perspective of many statisticians toward null hypothesis significance testing is typified by the statement: “a P-value of 0.05 essentially does not provide any evidence against the null hypothesis” (Berger, Boukai, and Wang 1997), and the observation that the null versus research hypothesis is really an “artificial dichotomy” (Gelman et.al. 1995). Berger and Sellke (1987) show that evidence against the null given by correctly interpreting the posterior distribution or corresponding likelihood function “can differ by an order of magnitude.”

The basic problem with the null hypothesis significance test in political science is that it often does not tell political scientists what they think it is telling them. Most of the problems discussed here are interpretive in that they highlight misconceptions about the results of the procedure. From the current presentation of null hypothesis significance testing in published work it is very easy to confuse statistical significance with theoretical or substantive importance.”

He is talking about political science in the main, but the points are relevant to education.

Clearly we disagree on this. I’ve said all I want to say on the subject. This is your blog, and you are entitled to the last word. My final word is that the answer to your question is yes, OFSTED RAISEonline stats are wrong when they use tests of significance which they should not use.

PS Ollie Orange’s principal beef is with Effect Sizes, not statistical significance as used in RAISE, and the blog you link to simply dismisses those who argue against the use of significance testing rather than arguing *for* it in any way.

- jackmarwoodiotc says:
  
  July 28, 2015 at 1:53 pm
  
  I know you’ve read (and contributed to) http://www.educationdatalab.org.uk/Blog/July-2015/Significance-tests-for-school-performance-indicato.aspx#.Vbd2IdB-iNP and it’s certainly a useful addition to this ongoing debate. As you will see, Dave Thomson agrees with my view that schools are unique, and says that “school cohorts are not simple, random samples. The 2-way process through which pupils and schools select each other is not random. Nor can we treat them as effectively random by controlling for observable pupil characteristics.”
  
  We differ in that whilst Dave accepts that, “despite being part of the educational landscape for ten years, significance tests are not widely understood by many users,” he thinks that, “significance tests (are) a helpful first step in the process of understanding a school’s performance, not the conclusion.” I don’t agree and, as I will continue to argue, OFSTED’s RAISEonline stats are wrong when they use tests of significance which they should not use.
  
  - John F Brown says:
    
    July 28, 2015 at 5:23 pm
    
    The advice/training and literature about sig test from their introduction by FFT and adoption by RAISE has always been the same and exactly as described by Dave Thomson, so I am not clear why this ‘light touch’ interpretation is news to you.
    You don’t explain why you disagree with Dave Thomson’s conclusions about what to do with sig tests, which are shared by all the analysts in FFT, DfE and OFSTED.
    
    I wonder and hope that perhaps the discussion could move on from ‘what do sig tests show/are they useful’ to ‘how do we properly ensure interpretation of sig tests is appropriate’.
    
Simon Carson says:

July 28, 2015 at 7:22 pm

I have followed the discussion over from educationdatalab…

Perhaps my position is somewhere in the middle of both Jack’s and John’s. The use of the blue and green SIG-/SIG+ indicators in RAISE is a cause for concern. It draws attention where it may not be warranted. Statistical significance is not necessarily significant: one needs to know the size of the effect; a p-value on its own tells you little.

However, a confidence interval plotted on an appropriate chart is, I would suggest, useful. From the chart, one can see just how big the actual effect is, and just how significant it is. Plotting the underlying data in the form of a scatter graph also helps in looking for areas for further investigation. For example, plotting value added scores, with confidence intervals, alongside a scatter graph of expected versus actual KS4 points scores gives, for me, a useful picture of performance. It allows one to begin to look for any areas of strength and weakness, especially if the analysis is extended to significant sub-groups of students, and this is exactly what I do for our school data.
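
To make the point about intervals concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the value added scores are made up, the names `mean_with_ci` and `describe` are hypothetical, and the interval is a simple normal approximation (mean plus or minus 1.96 standard errors), which is not necessarily how RAISE calculates anything and is crude for a group as small as five pupils.

```python
import math
import statistics


def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) for an approximate 95% confidence interval.

    Assumption: normal approximation, mean +/- z * standard error.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, mean - z * standard_error, mean + z * standard_error


def describe(label, scores):
    """Print the effect size and interval, and say whether it would be flagged."""
    mean, lower, upper = mean_with_ci(scores)
    # A SIG+/SIG- style flag only reports whether the interval excludes zero.
    flagged = lower > 0 or upper < 0
    print(f"{label}: mean={mean:+.2f}, "
          f"95% CI=({lower:+.2f}, {upper:+.2f}), flagged={flagged}")
    return mean, lower, upper, flagged


# Made-up value added scores, for illustration only.
spread = [-3, -1, 0, 1, 3]
large_cohort = [0.5 + d for d in spread] * 60   # 300 pupils, small effect
small_group = [4.0 + 3 * d for d in spread]     # 5 pupils, larger effect

big = describe("large cohort", large_cohort)
small = describe("small group", small_group)
```

With these invented numbers the large cohort is flagged even though its average is only half a point above zero, while the small group is not flagged even though its average is four points above zero. The flag alone hides both the size of the effect and the width of the interval, which is what the chart shows.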

So, I would not want to see significance testing banished from RAISE. I would, perhaps, be glad to say goodbye to blue and green highlighting, however, in exchange for more informative and detailed analyses of the underlying data.

jackmarwoodiotc says:

July 29, 2015 at 3:27 pm

John,

I’m interested in your comment that “The advice/training and literature about sig test from their introduction by FFT and adoption by RAISE has always been the same and exactly as described by Dave Thomson, so I am not clear why this ‘light touch’ interpretation is news to you.”

So sig tests were first introduced by FFT, then adopted by RAISE? Are there any links to this history (or personal experience) you could share? I’d be interested to read them.

Jack
