How Robust Are Cross-Country Comparisons of PISA Scores to the Scaling Model Used?

The Programme for International Student Assessment (PISA) is an important international study of 15-year-olds' knowledge and skills. New results are released every 3 years and have a substantial impact upon education policy. Yet, despite its influence, the methodology underpinning PISA has received significant criticism. Much of this criticism has focused upon the psychometric scaling model used to create the proficiency scores. The aim of this article is therefore to investigate the robustness of cross-country comparisons of PISA scores to subtle changes in the underlying scaling model. This includes the specification of the item-response model, whether the difficulty and discrimination of items are allowed to vary across countries (item-by-country interactions), and how test questions not reached by pupils are treated. Our key finding is that these technical choices make little substantive difference to the overall country-level results.


Introduction

The Programme for International Student Assessment (PISA) is an important international study of 15-year-olds' knowledge and skills. Conducted by the Organisation for Economic Co-operation and Development (OECD) every 3 years, the results are now widely anticipated by academics, journalists, and public policymakers alike. Results from PISA have led to reforms of education systems across the world, including curriculum changes in Norway (Baird et al., 2011), reforms of national assessments in Japan and the Slovak Republic (Breakspear, 2012), alterations to the number of teaching hours in Iceland (Wagemaker, 2011), and the complete reform of the general education act in Spain. It has consequently been described as "the world's most important exam" (BBC, 2013), with Andreas Schleicher (the OECD director who leads the PISA study) having been described as "the most important man in education" by high-ranking policy officials (Gove, 2013).
However, having established such an influential reputation, PISA and other international studies are coming under ever greater scrutiny. One particular line of criticism has been about how students' test scores are produced; the scaling methodology that lies behind the production of PISA's so-called "plausible values." Rather than simply adding up the number of correct responses students give to the test questions, the PISA study uses a complex item-response theory (IRT) model to produce estimates of students' latent ability in each subject area. However, rather than producing one single ability estimate, multiple possible values are derived for each child. This series of values is known in the psychometric literature as plausible values, and captures the uncertainty we have surrounding students' latent ability. The intuition for using this complex approach is that it is impossible to thoroughly examine students in multiple different subjects (science, reading, mathematics, problem solving) within the confines of a 2-hour test. Consequently, participants only take a random subsample of test questions, with the IRT model used to equate performance across different versions of the test, and plausible values designed to reflect the uncertainty in the results. Further details regarding the PISA test design are provided below.
Various authors have described how this process is opaque, with many of the potentially important technical details not fully understood outside of a narrow range of highly specialized psychometricians (Brown & Micklewright, 2004; Goldstein, 2017), which may also have implications for how these data then get used. Others have suggested that the particular item-response model used in PISA until 2015 is overly simplistic and does not fit the data well (Kreiner & Christensen, 2014). Particular criticism has been reserved for PISA's use of the Rasch model (Fernandez-Cano, 2016), which some consider to be less sophisticated than the three-parameter item-response model used in other large-scale international assessments such as the Trends in International Mathematics and Science Study (TIMSS). This has consequently led to various different opinions emerging, ranging from whether the methodology behind PISA is sufficiently transparent (Spiegelhalter, 2013; Goldstein, 2017) through to whether this study is actually fit for purpose (Stewart, 2013a, 2013b).
A key question that therefore emerges from this literature is: How much do the technicalities around the PISA scaling model actually matter? Not only in terms of national averages (upon which the "international rankings" are based), but also in terms of other distributional statistics of importance, such as cross-country comparisons of high and low achievers, measures of educational inequality, and the gender gap in students' performance.
Such issues have taken on particular importance since the publication of the PISA 2015 results, when a number of technical changes were made to the construction of the PISA scale scores (plausible values). These included the following:1

- The introduction of item-by-country interactions. A limited number of item-by-country interactions were included in the PISA scaling model for the first time. In other words, in PISA 2015 there were some country-specific item parameters, allowing some items to be freely estimated by country. This meant some questions were treated as harder to answer correctly in some countries than in others (e.g., some questions are now treated as "harder" to answer correctly in England than in Scotland). The decision of where to allow item-by-country interactions was based upon item-fit statistics used to detect differential item functioning,2 and was thus purely statistical. See OECD (2016, pp. 150-152) for further details. Such interactions were not used in PISA between 2000 and 2012.3
- The use of a two-parameter model. In PISA 2015, questions were not only allowed to vary in terms of their difficulty, but also their "discrimination" (i.e., how well each question is thought to measure students' reading/science/maths skills). This was not the case in PISA 2000 to 2012, when the discrimination parameter for each question was fixed to one (i.e., it was assumed that each reading/science/maths question measured reading/science/maths skills equally well).
- Items that are "not reached" no longer contribute to the proficiency scores. As a timed assessment, not all students manage to reach the end of the test. In PISA 2000 to 2012, these "not-reached" items were treated as incorrect responses when creating the scale scores.4 This changed in PISA 2015, with "not-reached" items treated as missing data; hence they do not contribute to each student's estimated latent ability.
- Changes to how the item parameters are estimated. In PISA 2015, data from the 2006 through to the 2015 rounds were used in the calibration of the item parameters.5 This was different from the procedure used in previous PISA waves, when only data from the current round were used in the item-calibration process.6 Consequently, item parameters (e.g., item difficulty) differ less between PISA 2015 and previous waves.
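To make the distinction between the Rasch and two-parameter models concrete, the item response function can be sketched as below. This is an illustration only: the `irt_prob` helper and the parameter values are ours, not actual PISA item parameters.

```python
import math

def irt_prob(theta, difficulty, discrimination=1.0):
    """Probability of a correct response under a two-parameter logistic
    (2PL) IRT model. Fixing discrimination at 1.0 gives the Rasch model."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))

# Hypothetical item with difficulty 0.5 on the logit scale
p_rasch = irt_prob(theta=1.0, difficulty=0.5)                    # Rasch
p_2pl = irt_prob(theta=1.0, difficulty=0.5, discrimination=1.4)  # 2PL
```

Under the 2PL, a higher discrimination makes the probability of success rise more steeply around the item's difficulty, so a strongly discriminating item separates students just above the difficulty from those just below it more sharply than a Rasch item does.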
Yet, despite this collection of potentially important technical changes, little easily digestible information has been provided to consumers of the PISA data as to the likely impact they had upon cross-country comparisons. Indeed, more generally, little previous work has considered how technical changes made to the underlying scaling model affect international comparisons of students' achievements. 7 For instance, does using a two-parameter item-response model produce different cross-country comparisons than using a Rasch model? If "not-reached" items are treated as incorrect rather than as missing data, does this alter our view on which countries have the greatest levels of educational inequality (e.g., the gap between the highest and lowest achievers)? And does the inclusion of item-by-country interactions mean that cross-national differences in PISA scores become more or less pronounced? Currently, little independent information is available to consumers of the PISA results.
The aim of this article is therefore to make this important contribution to the existing evidence base. Focusing upon the results for science, the major domain in PISA 2015, we illustrate how cross-country comparisons of key distributional statistics change once specific technical aspects of the PISA scaling model are altered. This includes a consideration of all the major changes made to the scaling model in PISA 2015, as outlined above. To preview our key findings, we discover that relative differences between countries are generally unaffected by the scaling model used. This holds true not only on average, but also for key statistics frequently used to describe the distribution of students' achievement, as well as covariation with key demographic characteristics. We consequently conclude that most of the headline findings from PISA do not seem to be particularly sensitive to the scaling model used.

Data and Methods
The data we use are drawn from PISA 2015. Although a total of 72 countries and economies participated, we restrict our attention to the 35 members of the OECD. The focus of this article is therefore the robustness of the PISA results within rich, developed countries. In each country, a two-stage sample design was used, with schools selected as the primary sampling unit and students then randomly selected from within. A total of around 150 schools and approximately 5,500 pupils participated within each OECD country. Response rates, after the inclusion of "replacement schools," were around 90% in most countries at both the pupil and school level.
PISA employs a complex test design. In 2015, the study included 184 questions in science, 81 questions in mathematics, 103 questions in reading, and 117 in collaborative problem solving. It is, of course, impossible to expect all students to provide an answer to each of these questions within the space of a 2-hour test. Test questions from the different subject areas were divided into subject-specific clusters, which were then organized to create around 66 different test forms. Participating students were then randomly assigned one of these forms to complete. Consequently, although all students answered 1 hour's worth of science questions, only around 40% of students answered any questions in reading, 40% any questions in mathematics, and 30% any questions in collaborative problem solving (OECD, 2016, p. 40). An item-response model is then applied, incorporating how students responded to each test question they were assigned plus information from the background questionnaire, to estimate a distribution of students' latent achievement in each subject area. In other words, rather than producing a single "test score" for each child, this item-response model produces a range of possible values. "Plausible values" are then created by the survey organizers, which are essentially random draws from each child's estimated latent achievement distribution. Further details with respect to the PISA test design can be found in OECD (2016, chapter 2) and the item-response methodology in OECD (2016, chapters 9 and 12). Within this article, we make use of the publicly available item-level data and item parameters provided by the OECD to broadly replicate the methodology used to generate the PISA plausible values in science. 8 Specifically, we fit a multidimensional item response model to students' item-response data, constraining the item parameters to the values published in the PISA 2015 technical report (OECD, 2016).
Following the OECD's methodology, this model allows for students' latent science, reading, and mathematics abilities to be correlated, via the inclusion of correlated error terms within the measurement model. Consequently, scale scores are produced for each student in each subject area, even in those subjects where they have not answered any test questions.
A simplified summary of the model we estimate is presented in Figure 1. We estimate this model separately for each language group within each country, generating for each pupil their Expected A Posteriori (EAP) proficiency estimates in each subject along with their standard errors (as a measure of uncertainty). We then draw 10 random values for each student from a normal distribution in order to generate our plausible values (PVs). The mean of this normal distribution is set, for each student, to their EAP achievement estimate, with the standard deviation of the distribution set to their EAP standard error. Finally, we standardize these values across the OECD countries, so that they have the same mean and variance as the "official" PVs. Our focus within this article is therefore the relative performance of countries against one another. In other words, does making a particular change to the PISA scaling procedure advantage any one country compared to another?
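The plausible-value and rescaling steps just described can be sketched as below. This is a simplified illustration only: the helper names, the EAP value, its standard error, and the target mean and standard deviation are all hypothetical.

```python
import random

def draw_plausible_values(eap, se, n_pv=10, rng=None):
    """Draw plausible values for one student from a normal distribution
    centred on their EAP estimate, with SD equal to the EAP standard error."""
    rng = rng or random.Random(0)
    return [rng.gauss(eap, se) for _ in range(n_pv)]

def standardize(values, target_mean, target_sd):
    """Rescale a set of scores to a target mean and standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return [target_mean + target_sd * (v - m) / sd for v in values]

# One hypothetical student: EAP estimate 0.3 logits, standard error 0.45
pvs = draw_plausible_values(eap=0.3, se=0.45)
# Rescale to a hypothetical OECD-style mean and SD
scaled = standardize(pvs, target_mean=493.0, target_sd=94.0)
```

In practice the standardization is applied across the pooled OECD sample rather than within one student's draws, but the arithmetic is the same.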
Note that the OECD does not report "official" EAP values in the international PISA database; they only include plausible values. 9 However, as plausible values contain measurement error (they are random draws), correlations between our PVs and the OECD's "official" PVs will be attenuated. In other words, if we were to compare the correlation between our PVs and the OECD's PVs, this would underestimate how well we have managed to reproduce the PISA scaling methodology at the individual pupil level. To overcome this issue, we create proxy "official" EAP estimates by averaging the 10 "official" PVs in the international PISA database. We then correlate our "replicated" EAPs to these "official" EAPs to consider how closely we have managed to replicate the OECD's scaling procedure.
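The proxy-EAP construction described above can be sketched as follows (our own illustration; the helper names are hypothetical):

```python
def proxy_eap(pv_sets):
    """Approximate each student's EAP by averaging their plausible values,
    which averages out the random-draw measurement error."""
    return [sum(pvs) / len(pvs) for pvs in pv_sets]

def pearson(x, y):
    """Pearson correlation, written out for transparency."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Each inner list holds one student's "official" plausible values
official_eaps = proxy_eap([[0.9, 1.1], [0.1, 0.3], [-0.6, -0.4]])
# r = pearson(replicated_eaps, official_eaps) would then gauge the replication
```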
Although we largely follow the methodological approach of the OECD in generating the PISA plausible values, it is important that we document a handful of areas where there are some subtle differences. First, in the OECD model, all the data collected in the background questionnaire have a direct role in the generation of the PISA plausible values. Specifically, an enormous principal components analysis is conducted upon all the background variables, with the derived components then included in the model as direct effects upon students' science, reading, and mathematics achievement. 10 In contrast, Figure 1 illustrates how we have only included gender as a direct background regressor in our model. 11 Second, while we have included three subjects in our multidimensional IRT model (science, reading, and mathematics), the OECD version includes financial literacy and collaborative problem solving (for those countries that participated in these national options) as well. 12 Third, whereas we have estimated separate models for each language group within all nations, the OECD did this in only a handful of countries (Belgium, Canada, and Israel; see OECD, 2016, chapter 9, p. 67). 13 Fourth, all of our models have been estimated using Stata (a well-known and widely used statistics package; StataCorp, College Station, TX), while the "official" scale scores were produced by the Educational Testing Service (ETS) using their own specialized software (DGROUP). Finally, we have used maximum likelihood procedures to estimate the model underlying our replication of the PISA proficiency scores. The OECD, in contrast, used the Laplace approximation (see OECD, 2016, chapter 9).

FIGURE 2. Note. The OECD EAP estimate is approximated by taking the average of the 10 plausible values in science for each student. The Pearson correlation is .9557 and the Spearman correlation is .9585. Graph presented based upon a random sample of 5,000 students from the countries analyzed.
Given these differences, how closely has our procedure replicated the "official" PISA proficiency scores? We consider this at both the individual pupil and country levels, focusing upon the results for science (our subject of interest). Figure 2 and Table 1 provide results for the former, illustrating the correlation between our EAP science proficiency estimates and the analogous "official" values calculated directly from the public-use PISA database. 14 Figure 2 illustrates how the correlation between our science EAP estimates and the average of the OECD's 10 plausible values is very high (r = .96) when looking at students drawn from across all countries. Table 1 then extends this result to illustrate that it also holds within each individual country of interest. In other words, despite the handful of subtle differences between our scaling model and the scaling model used by the OECD, we nevertheless closely replicate students' proficiency estimates in science, as reported in the international database.
Using our replicated plausible values, are we also able to successfully reproduce the official PISA country-level results? Figure 3 provides the answer to this question for mean scores and other key statistics (10th percentile, 90th percentile, and the standard deviation; see Appendix). The correlation we find is even stronger (approximately .99), with our country means typically differing from those produced using the "official" PISA plausible values by just a couple of test points. Together, the above demonstrates how we have managed to closely reproduce the official PISA science scores.
Winter 2018 © 2018 by the National Council on Measurement in Education 31

FIGURE 3. Correlation between our estimate of the mean EAP science score and the OECD mean EAP science score at the country level. Note. The Pearson correlation is .994 and the Spearman correlation is .986. Figure can be cross-referenced with the statistic in the top left-hand corner of the corresponding table.

Our replicated values will therefore serve as a robust baseline against which to measure change, once we have made some technical alterations to the underlying scaling model.
In the following section, we illustrate how cross-country comparisons change after making a number of alterations to the PISA 2015 scaling model. First, PISA 2015 allowed for a limited number of item-by-country interactions. This means that the difficulty and discrimination parameters were allowed to be higher or lower in some countries than in others (usually due to concerns over poor model fit). Although the number of such interactions used in PISA 2015 was small, their inclusion in the scaling model is somewhat of a contentious issue. It has been suggested, for instance, that this may "smooth out" important and interesting differences between countries (Goldstein, 2017) and could jeopardize cross-national comparability. Likewise, on a conceptual level, it seems difficult to justify why some questions should be treated as harder in Scotland than in England (for example), as the PISA 2015 scaling model posits. We hence begin by investigating whether excluding such interactions from the PISA scaling model would lead to an appreciable change in the results.

The second change we make is to the parameterization of the underlying IRT model. Specifically, a two-parameter model was used in PISA 2015; something that was seen as a significant departure from past waves of PISA, when a Rasch model was used. Table 2 provides some descriptive information on the distribution of the discrimination parameters used in PISA 2015, illustrating how the average value was typically just over the value of one used in the Rasch model. In the following section, we consider how the PISA 2015 results would look (in terms of relative differences between countries) if a Rasch model had been used instead.
We return to our scaling model and constrain all the discrimination parameters to one, thereby assuming each science question measures students' science skills equally well.

Third, as in previous cycles of PISA, there were some nontrivial changes to the estimated item parameters between PISA 2015 and previous cycles. Not only was the discrimination parameter allowed to vary (see Table 2), but the item difficulties also changed. For instance, PISA 2015 used different difficulty parameters than PISA 2006 in science, even for the same items (as did previous waves of PISA). But does altering the item parameters used in the scaling model really make any difference to the results? We consider how the PISA 2015 results would change if the 2006 item parameters were used instead (we use the parameters from 2006 as this was the only other time science was the major PISA domain). Specifically, this implies that we constrain all discrimination parameters to one (i.e., we fit a Rasch model) and use the 2006 item-difficulty values (instead of the 2015 values) where they are available. This is possible only for trend items, and not for the new science questions introduced in PISA 2015 (where we continue to use the 2015 item parameters). The purpose of this particular exercise is to demonstrate whether using a different set of item-parameter estimates leads to substantial changes in the cross-national pattern of results.
Fourth, the PISA 2015 scoring procedure treated "not-reached" questions as missing data, and hence such items do not make any contribution to students' proficiency scores. Within our analysis, we illustrate how cross-country comparisons change if these not-reached questions are treated as incorrect responses instead (as per the PISA 2000 to 2012 approach). Table A2 in the appendix provides an overview of the percentage of questions classified as "not reached" by country and subject, illustrating that this is typically very low (less than 2% of questions not reached). Although it is therefore unlikely that altering the treatment of not-reached items in the PISA scaling model had an impact upon average scores, it may have had an influence upon some other statistics of interest (e.g., percentage of low achievers, inequality in educational achievement). One of our primary interests will hence be how this change influences international comparisons of low performance (e.g., the 10th and 25th percentiles) and measures of educational inequality (e.g., the standard deviation, socioeconomic gaps), under the assumption that lower-achieving and disadvantaged students are most likely to fail to complete the test within the time limit (Bridgeman, McBride, & Monaghan, 2004).
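The two treatments of not-reached items can be sketched as below. This is a simplified illustration; the response code "r" and the helper name are our own hypothetical conventions, not the actual PISA data codes.

```python
NOT_REACHED = "r"  # hypothetical code marking an item the student never saw

def recode_not_reached(responses, treat_as_incorrect):
    """Either score not-reached items as wrong (the PISA 2000-2012 approach)
    or drop them from scoring entirely (the PISA 2015 approach)."""
    if treat_as_incorrect:
        return [0 if r == NOT_REACHED else r for r in responses]
    return [r for r in responses if r != NOT_REACHED]

booklet = [1, 0, 1, NOT_REACHED, NOT_REACHED]
scored_2012_style = recode_not_reached(booklet, treat_as_incorrect=True)
scored_2015_style = recode_not_reached(booklet, treat_as_incorrect=False)
```

Under the first treatment a student who runs out of time accumulates extra zeros, pulling their proficiency estimate down; under the second, their estimate rests only on the items they actually saw.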
Finally, we ask: What is the cumulative impact of making all the changes outlined above? In other words, how would the relative position of countries change when multiple alterations are made to the scaling model?
To summarize the consistency of results across the different models, we use the Spearman rank correlation. This measures the direction and strength of the association between two ranked variables, and thus illustrates how the rank ordering of countries changes when the various alterations to the PISA scaling model are made. Country average scores and country rankings are also provided to illustrate how the alterations of the scaling model influence these particular statistics.

Figure 4 illustrates the correlation between our original replicated country-average science scores described in the Data and Methods section (x-axis) and our alternative estimates when the item-by-country interactions have been excluded from the scaling model (y-axis). This is complemented by the first column of Table 3, which illustrates the analogous strength of the cross-country correlations for various distributional statistics (10th, 25th, 50th, 75th, and 90th percentiles, mean, and standard deviation). 15 The clear message is that whether item-by-country interactions are included or excluded from the scaling model makes essentially no difference to the substantive results. The correlation between the two sets of estimates is extremely high for all the country-level descriptive statistics considered, with all the Pearson coefficients sitting above .99. Hence there is no evidence that the inclusion of item-by-country interactions in the PISA scaling model has provided a particular advantage (or disadvantage) for any of the countries we consider. Table 4 illustrates how this translates into changes in mean PISA science scores across countries. Following on from the previous results, Tables 3 and 4 further illustrate how the removal of item-by-country interactions barely leads to any change in the results. For instance, even in countries where the movement is most extreme, the average science score changes by just three or four test points (e.g., Ireland and Switzerland).
Likewise, the standard deviation varies by less than a single PISA test point in most countries if item-by-country interactions are excluded. Consequently, Table 4 helps to further illustrate how this technical feature of the PISA scaling model has almost no impact upon the substantive results.

Applying a Rasch Model
What happens to cross-country comparisons in PISA 2015 if item discrimination is no longer allowed to vary, and a Rasch model is fitted instead? To begin, Table 5 provides a comparison of model fit between our Rasch and two-parameter models, based upon the Akaike Information Criterion (AIC). The AIC is a statistic that is commonly used to decide between two competing models, and trades off parsimony (number of estimated parameters) against how closely the model aligns with the empirical data. It is therefore a measure of relative fit, used to judge one model against another, with preference given to the model generating the lower AIC value. 16 Table 5 reveals that, in most countries, the AIC is lower for the two-parameter model than the Rasch model. In other words, we find evidence that the two-parameter model introduced in PISA 2015 is typically an improvement over the Rasch model used in PISA 2000 to 2012 in terms of model fit.
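The AIC calculation itself is straightforward; a minimal sketch follows. The log-likelihoods and parameter counts below are hypothetical round numbers, not the values reported in Table 5.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*logL; lower is better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: the 2PL estimates an extra discrimination parameter
# per item, so it carries a larger penalty term but a higher log-likelihood.
aic_rasch = aic(log_likelihood=-12000.0, n_params=100)  # 24200.0
aic_2pl = aic(log_likelihood=-11800.0, n_params=200)    # 24000.0
# Here the 2PL's improved fit more than offsets its extra parameters,
# so it would be preferred on AIC grounds.
```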
But has this improved fit to the data led to a substantive change in the country-level results? The second column of Tables 3 and 4 provides the answer and again illustrates how international comparisons of various descriptive statistics are largely unaffected by this choice. For instance, the mean, standard deviation, and selected achievement percentiles are all virtually identical regardless of the approach used (the Spearman rank correlations are all approximately .99). Hence, despite PISA having received a great deal of criticism for its historical use of the Rasch model, we find little evidence that moving to a more complex two-parameter item-response model has any meaningful impact upon cross-country comparisons of the results.

Using the 2006 Item Parameters (Rather than 2015)
As well as allowing the discrimination parameter to vary, the item-difficulty parameters used in PISA 2015 also differed from previous rounds. But how much impact does using different IRT item parameters really have upon the results? The third column of Table 3, where we have used the 2006 values of the item parameters in the scaling model rather than the 2015 values, provides insight into this issue. 17 Consistent with the findings presented in the subsections above, altering the item parameters used in the scaling model leads to only trivial changes to the estimates. In particular, note how the Spearman correlations reported are consistently very strong (approximately .99) for each of the distributional statistics considered. Moreover, for most countries, the average score and rank position presented in Table 4 are broadly stable. Consequently, the exact value of the item parameters used in the scaling model (and whether a Rasch or two-parameter IRT model is used) has a trivial impact upon the substantive conclusions reached.

Treating Not-Reached Items as Incorrect
In line with the findings presented thus far, altering how not-reached items are treated has a trivial impact upon cross-national comparisons of students' achievement. Importantly, this is not only true on average (mean scores) but also for comparisons of the lowest achievers, as measured by the 10th and 25th percentiles of the science distribution. Specifically, the fourth column of Table 3 illustrates how the cross-country correlations reported are all consistently above .99, with almost no substantive change to countries' positions in the international rankings in Table 4. We consequently conclude that this particular analytic choice has almost no impact upon the results.

The Combined Effect
The final column of Table 3 provides the correlations between (a) our initial replication of key country-level statistics and (b) alternative country-level estimates once all the changes made to the scaling model covered in the subsections above have been taken into account. Given the results presented thus far, it is perhaps unsurprising that the correlation coefficients all remain extremely high (around .99). Likewise, the country average science scores and rankings remain very similar between the first and last columns of Table 4. In other words, even when a raft of changes are made to the scaling model, the same cross-national pattern of results continues to be found. Consequently, this provides yet more evidence of how cross-country comparisons made within a given PISA cycle are robust to the choice of the scaling model used.

Do Similar Findings Hold for Other Subject Areas?
All of the estimates presented thus far relate to the results in science, the major domain in PISA 2015. Do we find similarly strong correlations for the minor domains (reading and mathematics)? Table 6 provides a summary of our results for these two subjects based upon the Spearman rank correlation. This is supplemented by Tables 7 and 8, which illustrate how average scores and country rankings change as the various alterations to the PISA scaling model are made. Consistent with our findings for science, we find little change to the cross-country pattern of results when changes are made to the scaling model. The correlations we find remain extremely high across the various distributional statistics considered, though they are slightly lower than the analogous results for science. This is likely to be due to reading and mathematics being "minor domains" in PISA 2015, with students answering fewer questions on these topics, and hence the specification of the scaling model having a slightly more important role. Nevertheless, the results we have presented for science throughout this section do generally seem to hold in other subject areas as well.

Conclusions
In this article we have investigated whether the precise specification of the PISA scaling model really makes a substantial difference to cross-national comparisons of educational achievement. Our results provide a clear and consistent message. Even when multiple alterations are made to the scaling model, this has only a trivial impact upon cross-country comparisons within a given PISA cycle. This holds true across a range of key statistics (mean, standard deviation, gender differences) and the different PISA domains (science, reading, and mathematics).

There are two potential ways of interpreting these findings. First, there is a view within parts of the psychometric community that the scaling model used in previous rounds of PISA was flawed, particularly with respect to the use of the Rasch model (Kreiner & Christensen, 2014). Yet, given that we have shown that cross-country comparisons do not really change when a more complex methodology is used, it was perhaps good enough, and some of the media reports questioning this aspect of the study have been overblown. Alternatively, one might conclude that the new methodology introduced in PISA 2015 is equally as flawed as the methodology used before, given that it does not produce substantially different results. Our own view is closer to the former: we believe our investigations illustrate how the key results from PISA (at least as far as the psychometric scaling model is concerned) seem to be relatively robust to the technical choices made. Nevertheless, we believe further investigations in the spirit of those conducted within this article should be welcomed by the OECD and the scientific community to further justify the chosen psychometric approach.
These findings should, of course, be interpreted in light of the limitations of this article and the need for further research. First, this article has focused solely upon relative differences between OECD countries within a single PISA cycle. We have not considered how the scaling approach influences absolute measures of students' performance, such as changes in a country's PISA scores over time, or results for middle- and low-income countries. Although clearly a topic of great importance, this is beyond the scope of this article and remains an important area for future research. Second, we have focused upon a particular set of changes to the scaling model, motivated by the fact that these technical details have altered across the PISA cycles (most notably in 2015). Although these changes are quite extensive from a psychometric perspective, and include much-debated issues in the technical literature (e.g., the impact of shifting from a Rasch to a two-parameter model), we obviously cannot rule out the possibility that other changes may affect the results (e.g., if PISA were to move to a three-parameter IRT model instead).
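For readers less familiar with this literature, the modelling choices at issue can be summarized by the standard textbook forms of the one-, two-, and three-parameter logistic IRT models for a correct response to item $i$ by a student with ability $\theta$ (with difficulty $b_i$, discrimination $a_i$, and pseudo-guessing parameter $c_i$). These are generic forms rather than the exact PISA parameterization, which also handles polytomous items:

```latex
% Rasch / 1PL: all items share a common discrimination.
P_i(\theta) = \frac{\exp(\theta - b_i)}{1 + \exp(\theta - b_i)}

% 2PL: each item has its own discrimination a_i.
P_i(\theta) = \frac{\exp\{a_i(\theta - b_i)\}}{1 + \exp\{a_i(\theta - b_i)\}}

% 3PL: adds a lower asymptote c_i, allowing for guessing.
P_i(\theta) = c_i + (1 - c_i)\,\frac{\exp\{a_i(\theta - b_i)\}}{1 + \exp\{a_i(\theta - b_i)\}}
```

Each model nests the one above it: fixing $c_i = 0$ recovers the 2PL, and additionally fixing $a_i = 1$ recovers the Rasch model, which is why the shift between them can be framed as relaxing constraints on the same underlying likelihood.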
Despite these limitations, we believe this article has made an important contribution to ongoing debates about PISA and other large-scale assessments. Although there are clearly important limitations to such studies, our analysis suggests that some of the criticisms made of the scaling methodology are unjustified. Although the methodology is complex, and not widely understood outside a highly specialized psychometric field, the scaling model can be closely replicated using information freely available in the public domain. More importantly, cross-country comparisons seem to be largely unaffected by the precise specification of the scaling model used. By completing this independent investigation, we hope that this robustness will be accurately reflected in media reports of future PISA results, and that there will be a greater appreciation among sceptics that international comparisons seem quite robust to departures from the official OECD scaling approach.

Notes

1. A further important change to PISA in 2015, not covered within this article, is the introduction of computer-based assessment.
2. Poorly fitting items were identified using two criteria: (a) a root mean square deviation > .12 and (b) a mean deviation > .12 or < -.12.
3. However, in these earlier cycles, some items were deleted if they did not fit the chosen IRT model sufficiently well across a large number of countries.
4. Note that "not-reached" items are different from "not-answered" items. The former are questions students have not seen because they essentially ran out of testing time; the latter are questions students have seen (and thus attempted) but to which they have not provided a response.
5. The motivation for basing the item-parameter estimates upon the pooled 2006-2015 data was that this would maximize sample sizes at the item level and lead to greater stability in the item-parameter estimates.
As a similar approach will also be used by PISA moving forward, it should also mean that there are no sudden large changes in item parameters across different PISA cycles.
6. A related difference is that, in PISA 2000 to 2012, only a subset of pupils in each country was used in the item-parameter calibration process. Specifically, the survey organizers randomly selected 500 students from each OECD country to form an international subsample upon which the item-parameter estimates were based.
7. One important exception is Brown, Micklewright, Schnepf, and Waldmann (2007). Using TIMSS 1995 data, they considered how the change from a one- to a three-parameter item-response model affected cross-country comparisons. They found "cross-country patterns of central tendency to be robust to the choice of [item-response] model. But the same is not true for dispersion, for which model choice can have a big effect." They therefore advised that "survey reports should include an analysis of the sensitivity of basic results to model choice", though this suggestion has yet to be taken up.
8. The item-level PISA data are available from http://www.oecd.org/pisa/data/2015database/. International item parameters are available from http://www.oecd.org/pisa/data/2015-technical-report/. Information on item-by-country interactions was provided to the authors by the OECD.
9. EAPs and their standard errors reflect the mean and standard deviation of each child's latent proficiency distribution in a subject. PVs, on the other hand, are random draws from each child's latent proficiency distribution.
10. The principal components analysis is performed separately in each country, with the number of components retained sufficient to explain around 80% of the common variance in the background data. In Figure 1, these direct effects would be represented by additional squares with arrows pointing towards the circular latent achievement variables.
11. The inclusion of additional background variables led to convergence issues in the maximum likelihood estimation in a number of countries, while in others it increased estimation time to prohibitive levels.
12. We have excluded these additional domains from our model because (i) the data were not publicly available at the time of writing and (ii) they would require the inclusion of several additional latent correlations, increasing the complexity of the model and hence estimation times and convergence issues.
13. For the other countries with more than one language group, the OECD ran a single model, though this did include item-by-country interactions in the measurement model.
14. Note that the correlation between the average of the first five PVs and the average of the last five PVs is .983. We take this as approximately the maximum correlation achievable, given the random error within the PVs.
15. Note that, throughout this section, we use our replicated plausible values. (We produced EAP estimates only for the purpose of the previous section, where we investigated how well our replication worked.)
16. We have also estimated the Bayesian Information Criterion (BIC) for the two models in each country, with the same substantive conclusions reached.
17. Note that the use of the 2006 parameters implies that a Rasch model is fitted (i.e., we set all discrimination parameters to one).
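The distinction between EAP estimates and plausible values drawn in note 9 can be sketched in a few lines. This is a simplification that assumes a child's posterior proficiency distribution is normal with a known mean and standard deviation; the illustrative values are not actual PISA quantities.

```python
import random
import statistics

# Assume a child's latent proficiency posterior is Normal(mu, sigma).
# Both values here are purely illustrative.
mu, sigma = 500.0, 30.0

# EAP estimate: the posterior mean; its standard error is the posterior SD.
eap, eap_se = mu, sigma

# Plausible values: independent random draws from the same posterior.
random.seed(1)
pvs = [random.gauss(mu, sigma) for _ in range(10)]

# The average of many PVs approaches the EAP, but each individual PV
# retains the posterior's full spread. This is why variance statistics
# computed from PVs do not shrink towards the mean the way EAP-based
# estimates of population variance would.
print(round(statistics.mean(pvs), 1))
```

The practical upshot for secondary analysts is that distributional statistics (standard deviations, percentiles) should be computed from plausible values rather than from point estimates such as EAPs.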