The use (and misuse) of PISA in guiding policy reform: the case of Spain

ABSTRACT In 2013 Spain introduced a series of educational reforms explicitly inspired by the Programme for International Student Assessment (PISA) 2012 results. These reforms were mainly implemented in secondary education – based upon the assumption that this is where Spain's educational problems lie. This paper questions this assumption by attempting to identify the point where Spanish children fall behind young people in other developed countries. Specifically, by drawing data from multiple international assessments, we are able to explore how cross-national differences in reading skills change as children age. Consideration is given to both the average level of achievement and the evolution of educational inequalities. Our conclusion is that policy-makers have focused their efforts on the wrong part of the education system; educational achievement is low in Spain (and educational inequalities large) long before children enter secondary school. This study therefore serves as a note of caution against simplistic interpretation of the PISA rankings.


Introduction
Since its return to democracy, a change in the colour of Spain's governing party has generally meant a new set of educational reforms. The latest is the Organic Act for the Improvement of Quality in Education (LOMCE), approved by the conservative government shortly after the release of the Programme for International Student Assessment (PISA) 2012 results. These reforms have been designed to tackle what the Spanish Ministry of Education (2013a) believes are the key weaknesses of Spain's education system: high rates of school failure, early school dropout, 1 the low status of vocational education, a lack of external evaluations, low levels of school autonomy and the generally low academic performance of students. It is the last of these which is perhaps the ruling government's greatest concern. As in other countries (Bulle 2011; Pons 2011), this concern has been driven, at least in part, by the relatively modest performance of Spanish students in three major international assessments: the Trends in International Mathematics and Science Study (TIMSS), the Progress in International Reading Literacy Study (PIRLS) and, most notably, PISA. It is thus clear that the low performance of Spanish children in PISA has had a significant impact upon policy-makers in this country. Indeed, it is to their credit that they have taken the results of such assessments so seriously, and are passionate in their desire to introduce educational reforms. However, although international assessments can be a useful tool for comparative education, a naïve use of them by policy-makers can be misleading, as this paper will show.
Our contention is that the Spanish government has focused its reforms upon the wrong part of the education system, owing to a simplistic interpretation of the PISA data. The main components of the LOMCE reforms are:
(1) Raising the level of autonomy of schools, increasing the importance of school principals.
(2) Introducing external evaluations of students at the end of the primary (year 6) and lower secondary (year 10) levels. 4 These evaluations are intended to provide information to families and schools.
(3) Simplifying the curriculum, putting more weight on instrumental competencies, ICT and foreign languages.
(4) Making tracks more flexible and avoiding dead-ends in the education system. To this end, tracking between the academic and vocational paths is brought forward by one year (from age 16 to age 15).
The vast majority of the above are focused upon changes to lower secondary education. But is this really when Spain's educational problems emerge? Or are low levels of academic achievement, and large educational inequalities, already apparent much earlier in young people's lives? Unfortunately, despite the changes already underway in Spain, there is very little robust evidence on this important issue. This study therefore aims to fill this gap in the literature by investigating how Spanish children's reading skills develop between the ages of 9/10 and 15/16, relative to children in a selection of other countries. Specifically, we address the following three research questions: (1) At what point in the schooling system does Spain fall behind other countries in terms of average reading achievement? Do other countries improve relative to Spain in secondary school, or is the achievement gap already stark by the end of primary school and then simply maintained? How does the gender gap evolve during this period? (2) How does the distribution of academic achievement change in Spain between the end of primary school (age 9/10) and the end of secondary school (age 15/16)? Do educational inequalities grow, shrink or remain the same? (3) Is the socio-economic status (SES) gradient in children's reading skills large or small in Spain relative to other developed countries? Do these inequalities grow or narrow during secondary school, and does Spain differ significantly from other countries in this respect?
Although this exercise should ideally have been conducted before the approval of the 2013 Education Act, this ex post analysis nevertheless reveals how well founded the 'Wert Act' educational reforms are. It therefore provides an illustration of how not to use international assessments (in this case PISA) in designing changes to national education systems, 5 and, at the same time, of how comparative education approaches can be useful for implementing country-level reforms. The remainder of the paper is structured as follows. Section 2 describes the PIRLS and PISA databases and our empirical methodology. Section 3 presents results, focusing upon how Spain's relative performance on important international reading tests changes between the end of primary school and the end of secondary school. Section 4 presents conclusions and a policy discussion.

Methodology and databases
The aim of this study is to investigate Spain's relative performance in international reading tests at ages 10 and 16. Ideally, longitudinal data would be available to track children's progress over time. Unfortunately, such data are not collected in Spain, nor in several other important comparator countries. Consequently, we follow an alternative strategy pursued by Goodman, Sibieta, and Washbrook (2009) and Jerrim and Choi (2014). Specifically, we treat PIRLS 2006 and PISA 2012 as repeated cross-sectional data, with children aged 9/10 (4th year of primary school) in the former and 15/16 (3rd or 4th year of compulsory secondary education) in the latter. To maximise comparability, we retain only those countries that participated in both the PIRLS 2006 and PISA 2012 studies. Moreover, we only retain children born in either 1996 or 1997. 6 This leaves a total of 25 education systems 7 for whom we investigate change in relative reading test scores as children age. 8 Although PIRLS and PISA both collect nationally representative samples, with similar survey designs and response rates 9 (see Mullis et al. 2007; OECD 2011 for further information), raw test scores cannot be directly compared across the two surveys (Jerrim 2013). First, the two surveys use different item-response theory models to scale the test score data (see Brown et al. 2007). Second, there are some subtle conceptual differences in the skills the two tests measure, with PIRLS focused upon 'curriculum-based' measures of literacy, while PISA measures children's ability to use their skills in 'real-life' situations. Finally, the two studies contain different sets of countries (e.g. 41 countries participated in PIRLS 2006 compared to 65 in PISA 2012) with test scores then scaled to a mean of 500 and standard deviation of 100 within each of the respective surveys. Consequently, a score of 500 in PIRLS is not equivalent to a score of 500 in PISA.
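The sample restrictions described above (countries present in both studies, pupils born in 1996 or 1997, and, per note 6, exclusion of any country where more than half the sample was born outside those years) can be sketched as follows. This is an illustration only: the record layout `(survey, country, birth_year)`, the survey labels and the function name are hypothetical, not the variable names used in the PIRLS or PISA files, and we assume the "more than half" rule is applied within each survey.

```python
from collections import defaultdict

# Hypothetical identifiers: the real PIRLS and PISA files use their own codes.
SURVEYS = {"PIRLS2006", "PISA2012"}
BIRTH_YEARS = {1996, 1997}


def restrict_sample(records, max_outside_share=0.5):
    """Apply the sample restrictions to (survey, country, birth_year) records.

    A country is kept only if, in BOTH surveys, no more than
    `max_outside_share` of its sample was born outside BIRTH_YEARS;
    within kept countries, only pupils born in 1996 or 1997 are retained.
    """
    counts = defaultdict(lambda: [0, 0])  # (survey, country) -> [in window, total]
    for survey, country, year in records:
        tally = counts[(survey, country)]
        tally[1] += 1
        if year in BIRTH_YEARS:
            tally[0] += 1

    passed = defaultdict(set)  # country -> surveys in which it meets the threshold
    for (survey, country), (inside, total) in counts.items():
        if (total - inside) / total <= max_outside_share:
            passed[country].add(survey)

    keep = {country for country, surveys in passed.items() if surveys >= SURVEYS}
    return [r for r in records if r[1] in keep and r[2] in BIRTH_YEARS]
```

Calling `restrict_sample(records, max_outside_share=0.25)` would correspond to the stricter sensitivity analysis mentioned in note 6.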
We deal with this issue by converting all test score data into international Z-scores, following the lead of Brown et al. (2007). In other words, we normalize test scores separately for each survey at the student level, so that they have a mean of 0 and a standard deviation of 1 across all 25 countries included in our sample. This has important implications for the interpretation of results. Specifically, we are unable to comment upon how Spanish children's reading test scores change in absolute terms as they age. Rather, we can only consider relative differences between Spain and other countries, and how these relative differences change between the end of primary (PIRLS 2006) and the end of secondary (PISA 2012) school. It is important for readers to bear this in mind when interpreting our results.
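The international Z-score transformation can be illustrated with a short sketch. It assumes one score per pupil (note 11 states that only the first plausible value is used) and a single weight per pupil; which weight is used when pooling the 25 countries is not stated here, so the `weights` argument is an assumption (for example, weights rescaled so that each country contributes equally would give every country the same influence on the pooled mean and standard deviation). The function name is hypothetical.

```python
import math


def international_z(scores, weights):
    """Standardise one survey's scores over the pooled multi-country sample.

    Returns scores with weighted mean 0 and weighted SD 1, using the
    population (divide-by-total-weight) form of the variance.
    """
    total_w = sum(weights)
    mean = sum(w * s for s, w in zip(scores, weights)) / total_w
    var = sum(w * (s - mean) ** 2 for s, w in zip(scores, weights)) / total_w
    sd = math.sqrt(var)
    return [(s - mean) / sd for s in scores]


# Applied separately to each survey, so a Z-score of 0 means "the pooled
# 25-country average" in PIRLS 2006 and in PISA 2012 respectively.
# pirls_z = international_z(pirls_scores, pirls_weights)
# pisa_z = international_z(pisa_scores, pisa_weights)
```

Because each survey is standardised against its own pooled distribution, a country's Z-score is only meaningful relative to the other countries in the same survey, which is why only relative change can be assessed.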
Our analysis begins by considering how average reading test scores (converted into the Z-score metric) compare across countries at ages 9/10 and 15/16. This is followed by a consideration of how the distribution of children's reading scores changes as children age. We then turn to the issue of socio-economic inequalities, estimated using the following Ordinary Least Squares regression model: 10
$$A_{ijk} = \alpha + \beta_1 \mathrm{Sex}_i + \beta_2 I_i + \boldsymbol{\gamma}'\mathrm{SES}_i + \varepsilon_{ijk} \quad \forall k \qquad (1)$$
where $A_{ijk}$ is the performance of pupil $i$ in school $j$ on the PIRLS or PISA reading test (in terms of Z-scores); $\mathrm{Sex}_i$ the pupil's gender (0 = boys, 1 = girls); $I_i$ the immigrant status (0 = native, 1 = immigrant); $\mathrm{SES}_i$ a set of dummy variables reflecting parental occupation; $\alpha$ a constant; and $\varepsilon_{ijk}$ the error term. The notation $\forall k$ indicates that the model is estimated separately for each country $k$. 11 The parameter of interest from (1) is $\boldsymbol{\gamma}$, the association between children's socio-economic background and performance on the reading test.
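A minimal sketch of how model (1) might be fitted for one country, assuming NumPy is available and that sampling weights enter through weighted least squares. The group labels, argument names and function name are hypothetical; with "elementary" as the omitted category, the `skilled_white` coefficient is the skilled white collar versus elementary gap of the kind reported in Table 5. Only point estimates are returned; clustered standard errors would be obtained separately, as described in the next paragraphs.

```python
import numpy as np

# Hypothetical labels for the four father's-occupation groups; the first
# ("elementary") is the omitted reference category.
SES_GROUPS = ["elementary", "semi_blue", "semi_white", "skilled_white"]


def fit_country(z, girl, immigrant, ses, weights):
    """Weighted least-squares fit of model (1) for a single country k.

    z: reading Z-scores; girl, immigrant: 0/1 indicators; ses: group labels;
    weights: student sampling weights. Returns point estimates only.
    """
    z = np.asarray(z, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    ses_dummies = np.array([[1.0 if s == g else 0.0 for g in SES_GROUPS[1:]] for s in ses])
    X = np.column_stack([np.ones(len(z)), girl, immigrant, ses_dummies])
    # Scaling each row by sqrt(weight) turns weighted LS into ordinary LS.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    names = ["const", "girl", "immigrant"] + SES_GROUPS[1:]
    return dict(zip(names, beta))
```

Running the function once per country mirrors the $\forall k$ in (1): each country gets its own set of coefficients rather than a pooled model with country interactions.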
We estimate model (1) twice: once using father's occupation to measure SES (divided into four groups: elementary, semi-skilled blue collar, semi-skilled white collar and skilled white collar workers) and once using the number of books at home (Wößmann 2008; Evans et al. 2010; Hanushek and Wößmann 2011; Jerrim and Choi 2014). 12 Both measures have strengths and limitations. Although father's occupation is a widely accepted measure of SES in sociological research, and is reliably reported in international surveys (Jerrim and Micklewright 2014), such information is missing for up to half the sample in PIRLS 2006 for some countries. In contrast, missing data for books in the home are rare (less than 5% in most countries), and books in the home is a frequently used proxy for SES in international comparative research (see Schütz, Ursprung, and Wößmann 2008). Concerns have been raised, however, regarding the accuracy of this measure and whether the number of books is really a robust measure of social stratification (Jerrim 2012; Jerrim and Micklewright 2014).
We handle this difficulty as follows. First, we estimate model (1) using father's occupation, with multiple imputation by chained equations used to account for missing data (in terms of observable characteristics). 13 Then model (1) is re-estimated using books in the home to measure SES rather than father's occupation. Our interest is whether the same broad pattern of results holds whichever family background measure is used. For instance, do we consistently find that socio-economic inequality in reading achievement is greater in Spain than in other countries? And is there consistent evidence that the SES gradient grows, shrinks or stays the same in Spain as children move from the end of primary school to the end of secondary school?
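After multiple imputation, model (1) is estimated once per imputed dataset and the results must be combined. The text does not state how this is done, so the sketch below assumes the standard approach, Rubin's rules: the pooled estimate is the mean across imputations, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The function name and the number of imputations are illustrative assumptions.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine one coefficient across m imputed datasets using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)                      # pooled point estimate
    within = mean(se ** 2 for se in std_errors)  # average within-imputation variance
    between = variance(estimates)                # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between       # total variance
    return q_bar, math.sqrt(total)
```

The between-imputation term is what distinguishes this from simply averaging standard errors: it widens the confidence interval to reflect uncertainty about the missing father's occupation values.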
The clustering of pupils within schools is accounted for throughout the analysis by Huber-White adjustments, by bootstrapping by cluster (using 50 replications) or by applying the Jackknife (PIRLS) or Balanced Repeated Replication (PISA) weights. Final student senate weights are also applied to correct estimates for non-response and to scale national samples up to population estimates. Standard errors for differences between countries and between surveys are calculated using a two-sample t-test assuming independence between samples.
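Of the options above, the cluster bootstrap is the easiest to sketch. The key point is that whole schools, not individual pupils, are resampled with replacement, so that within-school correlation is preserved in every replicate. This is an unweighted illustration under stated assumptions: the `(school_id, value)` input layout and the function name are hypothetical, and sampling weights are ignored for brevity.

```python
import random
import statistics
from collections import defaultdict


def cluster_bootstrap_se(rows, estimator, reps=50, seed=1):
    """Bootstrap standard error that resamples whole schools (clusters).

    rows: list of (school_id, value) pairs.
    estimator: function mapping a list of values to a single number.
    """
    by_school = defaultdict(list)
    for school, value in rows:
        by_school[school].append(value)
    schools = list(by_school)
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sample = []
        # Draw as many schools as in the original sample, with replacement.
        for school in rng.choices(schools, k=len(schools)):
            sample.extend(by_school[school])
        draws.append(estimator(sample))
    return statistics.stdev(draws)
```

For example, `cluster_bootstrap_se(rows, statistics.mean)` gives a clustered standard error for a country's mean Z-score with 50 replications.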

Average reading scores
Cross-country differences in average reading test scores (converted into the Z-score metric) are presented in Table 1. The first point of note is that, at both age 9/10 and age 15/16, Spain falls below the international median. Specifically, in both surveys, it is ranked 19th out of the 25 countries included. Moreover, there is little change in the average Z-score for Spain between the two studies; it stands at -0.071 standard deviations at age 9/10 and -0.079 at age 15/16. 14 This highlights two important points. First, even by age 9/10, Spanish children's reading proficiency is behind that of most other countries included in our analysis. For instance, average reading achievement in Spain is already 0.34 standard deviations lower than in the United States, 0.41 standard deviations lower than in Italy and more than half a standard deviation behind the top performer (Hong Kong). Second, there is little evidence that the gap in relative performance between Spain and other countries either shrinks or grows during secondary school. On the one hand, this suggests that Spanish secondary schools are unable to compensate for the comparatively poor reading skills children have developed during their first 10 years of life. On the other, it is clearly not during secondary school that Spain's educational problems start to emerge. This finding has important policy (and political) implications: the 'blame' for Spain's poor performance in PISA should not be directed at secondary schools. Rather, Spain's educational problems seem to emerge much earlier in children's lives, and the secondary education system then struggles to reverse them.
Table 1. Average test scores in reading competency between ages 9/10 and 15/16 (international Z-scores).
This point is further emphasized in the last column of Table 1, which illustrates the change in average Z-scores between ages 9/10 and 15/16 across the selected countries. In total, seven jurisdictions saw significantly more improvement than Spain, including Norway, Poland and Taiwan. This was balanced by six countries that declined significantly relative to Spain, including several major OECD countries such as Austria, Italy, the Netherlands and the United States. Hence, one can actually make a case for Spain's secondary schools being superior to those in several other European and North American countries (in that children make, on average, more progress). This serves as a valuable lesson to policy-makers (particularly those in Spain): disappointing performance in PISA does not necessarily mean that secondary schools are 'failing' or that this part of the education system is the root cause of a country's educational problems. Table 1 also highlights some other interesting findings. Notably, countries performing well above the international average at the end of primary school generally managed to maintain their strong performance to the end of secondary school (Italy and Slovakia are notable exceptions). The same is also true at the other extreme, with countries performing poorly at primary school also tending to perform poorly at secondary school. Norway and Poland are two examples of low performing countries at age 9/10 which had improved significantly by age 15/16. Their experiences may be particularly relevant for understanding features of secondary school systems that enable children to make strong progress (though some caution is required here, due to the possibility of statistical artefacts such as 'regression to the mean'; see Jerrim and Vignoles 2013). Nevertheless, these results seem to stress the importance of the early stages of education and the difficulty of overcoming large initial achievement gaps.
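Whether a country improved "significantly more" than Spain amounts to a difference-in-differences of mean Z-scores. The sketch below assumes, in line with the independence assumption of the two-sample t-test described in Section 2, that all four samples (two countries, two surveys) are independent, so their sampling variances simply add. The function name and tuple layout are hypothetical; of the inputs, only Spain's means (-0.071 and -0.079) appear in the text, and the standard errors would come from the replication methods described earlier.

```python
import math


def change_relative_to_reference(country, reference):
    """Difference-in-differences of mean Z-scores between ages 9/10 and 15/16.

    Each argument is ((mean_age10, se_age10), (mean_age16, se_age16)).
    Assumes all four samples are independent, so variances add.
    """
    (c10, c10_se), (c16, c16_se) = country
    (r10, r10_se), (r16, r16_se) = reference
    diff = (c16 - c10) - (r16 - r10)
    se = math.sqrt(c10_se ** 2 + c16_se ** 2 + r10_se ** 2 + r16_se ** 2)
    return diff, se, diff / se  # estimate, standard error, t statistic
```

A country's change would be judged significantly different from Spain's at the 5% level when the absolute t statistic exceeds roughly 1.96.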
In other words, once a country falls behind in the educational achievement race, it is difficult to then catch up. This should be particularly worrying for policy-makers in Spain, given both this country's poor performance in PIRLS, and the fact that the 2013 LOMCE educational act introduced very few changes at the primary and pre-primary school levels. We believe this to be a grave mistake, driven by policy-makers' naive use of the international educational achievement rankings.
We conclude this subsection by analysing differences in progress by gender. Previous research has consistently shown that, in almost every OECD country, girls outperform boys in international reading assessments (OECD 2010, 16). The unique contribution of Table 2 is in considering whether the gender gap in relative reading test scores shrinks or grows during secondary education, and how this varies across countries. Interestingly, in almost every economy the 'change' coefficient is positive: not only do girls outperform boys in the international reading assessments, they also make significantly more progress during secondary school. Moreover, in most countries this cannot simply be attributed to sampling variation; the change is statistically significant in 21 out of the 25 countries considered (the exceptions are England, Scotland, Indonesia and Nova Scotia). This includes Spain, where the gender gap increases from 0.03 standard deviations at age 9/10 to 0.29 standard deviations at age 15/16. This is an important finding; it suggests that it is indeed during secondary school that the gender gap in reading skills emerges in Spain. Hence, to the extent that Spanish policy-makers should be looking at reforms to the secondary education system, one of the most fruitful targets may be to reduce the gender gap in reading achievement by ensuring that boys' reading skills keep pace with those of their female peers.

Inequality in educational outcomes
We now turn to inequality in children's educational outcomes, along with change in the reading performance of the highest and lowest achievers. To begin, we use the standard deviation of children's test scores as our preferred measure of educational inequality. 15 Results can be found in Figure 1. The length of the bars illustrates the standard deviation at age 15/16, with triangles providing analogous figures at age 9/10. The most unequal countries at the end of primary school are Israel, Qatar, England and Scotland, with the greatest equality found in the Netherlands, Flemish-Belgium and Hong Kong. The standard deviation for Spain at age 9/10 (0.813) is around the international average, with educational inequalities standing out as neither particularly large nor particularly small. There is a modest increase of 0.086 in the standard deviation of reading test scores in Spain between ages 9/10 and 15/16. Yet similar increases are observed in other countries. Consequently, educational inequality in Spain remains around the international average even at the end of secondary school. Thus, neither the magnitude nor the change in educational inequality stands out as particularly pronounced in Spain relative to other countries.
To gain further insight into this issue, Tables 3 and 4 consider change in the 10th (P10) and the 90th (P90) percentiles of the reading test distribution across the two studies. The former can be interpreted as the performance of the lowest achievers in a country, while the latter refers to the highest achievers. Unsurprisingly, countries that saw an increase in mean performance also tended to see an increase in P10 and P90. As Figure 2 shows, there was a modest but statistically significant increase in the 90th percentile in Spain between ages 9/10 and 15/16 (from 0.92 to 1.03), while the opposite holds true for the 10th percentile (from -1.12 to -1.26). This is an important finding: it suggests that already high-achieving Spanish children saw a relative improvement in their reading scores (compared to children in other countries), while children who were low achieving in primary school fell further behind. 16 Consequently, if action is to be taken in Spanish secondary education, it should be targeted at the country's lowest performing schools and pupils.
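The percentile comparison can be illustrated in two parts. The weighted percentile function below uses one common convention (the smallest score whose cumulative weight share reaches the target); other interpolation rules exist, and which one underlies Tables 3 and 4 is not stated, so this is an assumption, as is the function name. The second part is plain arithmetic on Spain's reported P10 and P90 values: the P90 to P10 range widens from about 2.04 to about 2.29 standard deviations, consistent with the modest rise in Spain's standard deviation reported above.

```python
def weighted_percentile(values, weights, q):
    """Smallest value whose cumulative weight share reaches q (0 < q <= 1)."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cum = 0.0
    for value, weight in pairs:
        cum += weight
        if cum / total >= q:
            return value
    return pairs[-1][0]


# Spain's reported percentiles (international Z-scores) from the text.
p10 = {"9/10": -1.12, "15/16": -1.26}
p90 = {"9/10": 0.92, "15/16": 1.03}
gap = {age: p90[age] - p10[age] for age in p10}  # about 2.04 at 9/10 and 2.29 at 15/16
```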

Inequality of educational opportunity
To conclude, we turn to socio-economic differences in educational achievement. Table 5 measures the socio-economic gradient as the difference in test scores between children whose father works in a skilled white collar occupation and those whose father works in an elementary occupation. The robustness of these results is considered in Table A1 of the appendix, where the socio-economic gradient is alternatively measured as the difference in test scores between children living in homes with more than 200 books and those with 25 books or less (as noted in Section 2, books in the home is a frequently used proxy for socio-economic status in cross-national research).
Table 5. Socio-economic differences in the reading competency between ages 9/10 and 15/16 (international Z-scores): father's occupation.
Results in Table 5 illustrate that there exists a sizeable socio-economic gradient in Spanish children's reading skills at age 9/10 (0.59 standard deviations). There is a slight reduction in this gap by age 15/16 (to 0.48 standard deviations), but this change does not reach statistical significance at conventional thresholds. A similar finding holds across most of the selected countries, with a significant increase observed in only three (the Netherlands, Flemish-Belgium and Taiwan) and a decrease in just one (Scotland). These results therefore strongly suggest that inequality of educational opportunity in Spain is largely generated before the age of 9/10. However, some caution is needed here, as our analysis using books in the home produces a somewhat different result (see the appendix). In particular, a significant increase in the association between this SES measure and reading scores is observed in most countries, including Spain. Specifically, the difference in test scores between the lowest (less than 25 books) and highest (more than 200 books) socio-economic groups increases from 0.63 (age 9/10) to 0.94 (age 15/16) standard deviations. This is of broadly similar magnitude to the increase observed in most other countries. What, therefore, do we conclude from these results? First, there seems to be robust evidence that SES inequality in Spain does not appreciably decline between the end of primary and the end of secondary school. Rather, inequalities in educational opportunities are either maintained or increased, with somewhat different results depending upon which SES measure one chooses to use. Second, both Table 5 and the appendix suggest that SES inequalities in Spain change by no more and no less than in most other countries. Finally, in Spain, as in many other countries, socio-economic differences in educational attainment are large and require urgent policy action to reduce them.
Despite the LOMCE reforms noting the importance of this last point, few details are provided on how such a reduction in SES achievement gradients might be achieved. We believe that our evidence suggests Spanish policy-makers should target their interventions early in young people's lives (i.e. before secondary school). In particular, both Table 5 and the appendix illustrate how, once SES inequalities in educational attainment emerge, they are very difficult to reverse.

Discussion and conclusions
Reducing school failure and increasing the 'quality' of education were among the main objectives of Spain's latest educational reforms. The Ministry of Education has acknowledged that these reforms were inspired by Spain's poor performance in international assessments, and by the subsequent recommendations for improvement made by international organizations. The aim of this article was to scrutinize Spain's performance in these educational assessments in more detail, in order to provide a more nuanced view of this country's educational problems. Our focus has been whether Spain's disappointing performance in important international reading assessments really emerges during secondary education, or whether the country already lags behind others towards the end of primary school. We not only considered performance on average, but also changes in the distribution of reading achievement and the evolution of educational inequalities between ages 9/10 and 15/16. Our four key findings can be summarized as follows. First, the gap in average reading test scores between Spain and other countries is just as stark at age 9/10 as it is at age 15/16. In other words, Spain's poor performance on international reading assessments seems to be generated in primary (and pre-primary) education, and does not appreciably worsen (or improve) during secondary school. This is consistent with the work of Mena, Enguita, and Gómez (2010), who describe how low primary school performance can harm children's educational expectations, self-concept and engagement in school, with slow progress and early school dropout the result. Thus, improving the poor reading skills of primary school children seems to be critical if Spain is to significantly improve its position in the PISA achievement rankings.
Second, although there is little change in mean reading test scores between ages 9/10 and 15/16, this masks some interesting changes to the distribution of reading achievement. In particular, whereas the reading skills of Spain's lowest achieving children decline (relative to other countries) during secondary school, the reading skills of its top performers actually improve. In other words, there is a small increase in educational inequality, with the least able children falling further behind the average and the more able moving further ahead. This has important implications for Spanish policy-makers: improving basic skills amongst the country's lowest performing pupils, in both primary and secondary school, may be an effective way to simultaneously reduce educational inequality and improve average levels of achievement.
Third, our results have highlighted the socio-economic differences in educational achievement that exist in the Spanish educational system. Such inequality is established early in young people's lives, and then either maintained or exacerbated during secondary education. Consequently, our evidence suggests that once social inequalities in educational attainment have emerged, they become very difficult to reverse. This again points towards early action, long before children reach secondary school.
Finally, we provide empirical evidence on the usefulness of some of the existing international assessments for policy-making. Indeed, we show that it is precisely the comparative nature of PIRLS and PISA that enables us to provide guidance at the national level. We reach this conclusion, however, by taking as our starting point the Spanish case, which is itself an example of the misuse of international assessments in policy-making. This strategy has allowed us to draw out the limitations and risks of simplistic approaches to cross-national studies such as PISA.
One must of course recognize the limitations of this paper and stress the need for further work. First, this study would ideally have been conducted using longitudinal data, following exactly the same group of pupils over time. Unfortunately, cross-nationally comparable data of this nature do not yet exist, leading us to take the alternative 'repeated cross-section' approach instead. Nevertheless, this study has illustrated one of many interesting questions such data could address, and highlighted the need for international assessments like PISA to begin to track the progress of children over time. Second, our results are based upon observing young people at two time points: age 9/10 and age 15/16. This limits our ability to identify the exact point when Spanish children fall behind their peers in other countries (in terms of their reading skills). For instance, we do not know how Spain compares to other countries at the approximate point of school entry (e.g. age 5/6), and thus whether educational problems actually emerge in this country even before compulsory schooling has begun. Finally, the focus of this study has been children's reading skills. We are unable to comment upon whether similar patterns are likely to hold for other cognitive (or indeed non-cognitive) domains, including science and mathematics. For example, Spain only started to participate in TIMSS in 2011, 17 meaning an investigation of children's performance in these domains over time is not currently possible. Nevertheless, this may be a fruitful direction for future research once further data become available (e.g. results from PISA 2015).
Despite these limitations, we believe this paper has the potential to make an important contribution to contemporary education policy in Spain. Although it is not clear from international achievement rankings such as PISA, Spain's major educational problems emerge long before children enter secondary school. Yet, owing to their naïve interpretation of such rankings, Spain's politicians have nevertheless decided to concentrate the recent LOMCE reforms at the secondary education level. Although analysing the impact of these reforms is beyond the scope of this paper, we believe that they have been designed and developed on a rocky foundation. Indeed, despite containing a number of well-meaning and potentially sensible measures, the LOMCE reforms are, in our view, unlikely to get to the heart of Spain's under-achievement, which occurs much earlier in the schooling system. Much more emphasis should have been given to primary and pre-school education when these reforms were being designed. As such, our study uncovers the paradox of LOMCE: international assessments such as PISA have been used to justify its existence (a clear case of the so-called tyranny of numbers; Ball 2015), yet the measures being introduced would have benefitted immensely from a more nuanced use of those same assessments. This study therefore acts as an important warning to policy-makers in other countries. International assessments like PISA may have some role to play in directing education reforms and encouraging policy change. Yet their naïve use (and misuse) by policy-makers may lead to a waste of resources, with sub-optimal changes being made to the education system.

Notes
1. According to the Ministry of Education, Culture and Sports (2013b), during 2010/2011, 33% of 16-year-old students had not completed compulsory education. Moreover, early school dropout stood at around 25%. This was well above the 15% target, and higher than in any other European Union country.
2. The previous 2006 Education Act (LOE) included the following generic statement: 'Some recent international assessments have clearly revealed it is possible to combine quality and equity and should not be considered opposing objectives.'
3. The LOMCE is popularly referred to in Spain as the 'Wert Act'.
4. Compulsory education in Spain begins at age six and comprises six years of primary education and four years of lower secondary education. Nevertheless, school enrolment rates at age three are over 95%.
5. That is, using Bieber and Martens' (2011) terminology, we assess a real case of the role of PISA as a 'Soft Power' in education.
6. Any country where more than half the sample was born outside these years has also been excluded from our analysis. Sensitivity analyses using a lower threshold (25%) have also been performed, with the main conclusions unaltered (results available upon request).
7. We refer to these education systems throughout the article as countries. The reader should, however, bear in mind that the units compared also include smaller administrative units such as, for example, five Canadian provinces.
8. In PIRLS 2006, Iceland and Norway also assessed their year 5 students. However, to maintain comparability with the other countries, we work with their year 4 pupils. Given the decentralised nature of the Spanish educational system, an analysis by Autonomous Community would have been relevant. However, the information provided by PIRLS does not allow analyses to be performed for Spain at the regional level.
9. PIRLS 2006 and PISA 2012 response rates after replacement are available in Martin, Mullis, and Kennedy (2007, 126) and OECD (2014, 271), respectively.
10. This specification follows Schütz, Ursprung, and Wößmann (2008), Wößmann (2008) and Jerrim and Choi (2014).
11. We use the first plausible value only, in both PIRLS and PISA, throughout the analysis. As OECD (2010, 129) notes, 'analyzing one plausible value instead of five plausible values provides unbiased population estimates'.
12. Jerrim and Choi (2014) provide an extensive review of analyses which have used this variable with international assessments.
13. Precise details of the imputation model used are available from the authors upon request. We have also conducted a 'complete case' analysis, which found little substantive difference from the results presented.
14. Note that two very low performing countries (Indonesia and Qatar) are included in the analysis. This explains why the average score for Spain is close to zero, despite being well behind most other OECD countries.
15. See Ferreira and Gignoux (2014) for a discussion of educational inequality measures and the validity of the standard deviation.
16. As shown in Table 1, these two effects largely cancel one another out, meaning there was little change in mean scores for Spain between ages 9/10 and 15/16.
17. A Spanish region, the Basque Country, participated in previous TIMSS waves as a benchmarking participant.

Disclosure statement
No potential conflict of interest was reported by the authors.

Funding
This work has been supported by the Spanish Ministry of Economy and Competitiveness [grant number EDU2013-42480-R].