Ordinal Factor Analysis of Graded-Preference Questionnaire Data

We introduce a new comparative response format, suitable for assessing personality and similar constructs. In this “graded-block” format, items measuring different constructs are first organized in blocks of two or more; then, pairs are formed from items within blocks. The pairs are presented one at a time so that respondents can express the extent of their preference for one item or the other using several graded categories. We model such data using confirmatory factor analysis (CFA) for ordinal outcomes. We derive Fisher information matrices for the graded pairs, and supply R code to enable computation of standard errors of trait scores. An empirical example illustrates the approach in low-stakes personality assessments and shows that similar results are obtained when using graded blocks of size 3 and a standard Likert format. However, graded-block designs might be superior when insufficient differentiation between items is expected (due to acquiescence, halo, or social desirability).

The most common method to measure personality traits, personal values, and similar constructs is using Likert-type items (or ratings). However, when this method is used, respondents might endorse all items regardless of their valence (so-called acquiescence) or trait allocation (cognitive bias of exaggerated coherence, or halo effect). In applications where these effects are common, the validity of inferences is threatened. In such applications, the use of comparative judgments (i.e., asking respondents about their preferences for one or another item) is an attractive alternative because comparisons between items facilitate better differentiation and calibration, thus reducing halo effects (Kahneman, 2011). Also, when forced to compare items, one cannot agree with all of them indiscriminately, thus alleviating acquiescent responding (Cheung & Chan, 2002).
Preferences can be expressed as choices among two items, and as rankings or partial rankings among three or more items. Data collected by this method represent binary choices involved for each pair of items within a set. Simple choice, however, is not the only way of expressing preferences. We might want to obtain quantitative information about the relative merits of items within the set. For example, we might ask respondents to distribute a fixed number of points (say, 100) between the items, resulting in so-called compositional data (Brown, 2016b). Or, we might ask respondents to indicate how much they prefer Item A to Item B using a number of ordered categories, such as "much more-a little more-a little less-much less." Each subsequent category represents diminishing preference for Item A and increasing preference for Item B. Such graded-preference format is the focus of this article.
Why would we consider collecting graded preferences, if binary preferences have already proven themselves an attractive alternative to ratings, particularly for their resistance to response biases? We believe this extension is desirable for at least two reasons. First, test takers often criticize forced-choice formats for the perceived "lack of choice" when presented with items that either all apply to them or none apply; as one test taker put it, "Responding correctly was impossible because it forced a choice between equally ranked options" (Bartram & Brown, 2003, p. 9). Allowing the test takers to indicate the extent of their preference could increase their engagement and the face validity of the questionnaire. Second, scores derived from forced-choice responses (representing binary choices among pairs of items) generally have lower reliability than scores obtained from Likert ratings of the same items. It is easy to see when considering a simple choice between two items, A and B, which can result in only one of two possible outcomes: Either A is preferred to B or otherwise. Clearly, such a binary variable contains less information than Likert ratings of the same two items using, say, five ordered categories. More information can be obtained per item in forced-choice tasks when items are combined in larger blocks (Brown & Maydeu-Olivares, 2011b); however, blocks of four items still yield lower reliability than the 5-point Likert scales (Brown & Maydeu-Olivares, 2013). As a result, more items are needed in general in forced-choice questionnaires to reach the same precision of measurement as their Likert-scale counterparts. The additional information obtained from every comparison by asking participants to quantify the preferences could help solve this problem.
How would we score a questionnaire composed of graded-preference items? The simple summative schemes, where preference for one item adds points to that item while decreasing by the same amount the points awarded to the other item, will result in ipsative scores. Ipsative, or relative-to-self scores, are problematic for interpersonal comparisons and preclude application of standard psychometric analyses (Brown & Maydeu-Olivares, 2013; Clemans, 1966; Closs, 1996; Dunlap & Cornwell, 1994). However, recent advances in modeling forced-choice data (Brown & Maydeu-Olivares, 2011b; Maydeu-Olivares & Brown, 2010), which have enabled proper scoring of personal attributes without artefacts of ipsative data, have not yet been extended to graded comparisons. This article aims to fill this gap.
The objectives of this article are as follows. The first objective is to introduce a response format for gathering measurements on latent attributes using graded comparisons. We refer to this new format as graded blocks. In graded-block designs, individuals are presented with pairs of items and are asked to indicate the extent to which they prefer one item to the other (or the extent to which one item describes their personality or attitudes better than the other item) using a graded scale. The second objective is to propose a model suitable for such data. Such a model needs to take into account (a) the ordinal and comparative nature of the data, (b) dependencies when the same item is administered in more than one pair, and (c) potential intransitivity of responses to pairs involving the same items (an individual might prefer A to B, and B to C, but not prefer A to C). The third objective is to provide the item and test information functions suitable for the proposed model. Armed with such a model, researchers could analyze existing graded-preference data, design optimal graded-block questionnaires, or infer the expected properties of their questionnaires before data are gathered.
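To make the ipsativity of simple summative scoring concrete, the following minimal Python sketch may help. The scoring rule (centering a 1 to 5 code at 3, crediting the first item and debiting the second by the same amount), the function name, and the two response patterns are illustrative assumptions and are not taken from the literature cited above.

```python
def summative_scores(responses, n_items):
    """Naive summative scoring of graded pairs (illustrative assumption).

    responses: list of (i, k, y) with y coded 1..5, where 5 means item i is
    preferred "much more" and 1 means item k is preferred "much more".
    """
    scores = [0] * n_items
    for i, k, y in responses:
        points = y - 3        # centre the code: +2, +1, 0, -1, -2
        scores[i] += points   # credit the first item ...
        scores[k] -= points   # ... and debit the second by the same amount
    return scores


# Two hypothetical respondents answering the three pairs of one block.
person_a = [(0, 1, 5), (0, 2, 4), (1, 2, 2)]
person_b = [(0, 1, 1), (0, 2, 3), (1, 2, 5)]

scores_a = summative_scores(person_a, 3)
scores_b = summative_scores(person_b, 3)
print(scores_a, sum(scores_a))
print(scores_b, sum(scores_b))
```

Whatever the responses, the item scores of each person sum to the same constant (zero here), so the scores carry only within-person information. This is the ipsative property that the model-based scoring is meant to avoid.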
The remainder of this article is organized as follows. First, we describe the graded-block design. In a nutshell, items are first organized into blocks of n items (the block size n can be 2, 3, 4, etc.). All possible pairs are drawn from items within each block. Then, the resulting pairs of items are administered using a graded scale. Next, we describe a model suitable for these data. The model is based on Thurstone's (1927) law of comparative judgment, where utilities of items under comparison are linked to graded preference decisions via a threshold process to accommodate ordinal data. We show that the proposed model is an ordinal factor analysis model with specific constraints, and it can be estimated using standard software such as Mplus (Muthén & Muthén, 2016). Because ordinal factor analysis models belong to the general class of item-response theory (IRT) models, in technical appendices, we provide the item and test information functions. Our derivation takes into account the inherent multidimensionality of responses when items measuring different attributes are compared, and the fact that it is impossible to estimate the latent traits separately in such designs. We provide R functions to compute the item and test information, allowing computation of standard errors for estimated scores, and reliability estimates. To illustrate the graded-preference model, we provide an empirical example, in which the Five-Factor markers (Goldberg, 1992) are measured using two alternative response formats: standard Likert ratings and graded blocks. We conclude with a general discussion and a set of recommendations for applied researchers.

THE GRADED-BLOCK DESIGN
In forced-choice questionnaires, items are uniquely assigned to blocks of size n, and respondents are asked to provide a ranking or a partial ranking of the items within the blocks. In a graded-block design, items to be compared with each other are still drawn from within blocks, but they are presented as pairs to enable graded comparisons. For each pair, respondents are asked to express the extent of their preference for one item or the other using several graded categories. For instance, they might prefer Item A "much more" or "slightly more" than Item B, be ambivalent about Items A and B, or prefer Item B "slightly more" or "much more" than Item A.

[Example graded comparison: Item A versus Item B, rated on a graded scale with categories such as "Much more"]

OFA OF GRADED PREFERENCES

[...] across the questionnaire to minimize the carry-over effect. Importantly, the model for such designs needs to take into account these patterns of within-block dependencies arising from the repeated item use. The reason we might want to draw paired comparisons from blocks of three or more items is to increase the amount of information obtained per item. Indeed, when pairs are drawn from blocks of size n = 2, the questionnaire has half the number of tasks of a standard Likert-type questionnaire in which items are presented one at a time. When pairs are drawn from blocks of size n = 3, there are ñ = 3 pairs arising from each block, and the questionnaire has the same number of tasks as a standard rating task. When pairs are drawn from blocks of size n = 4, there are ñ = 6 pairs arising from each block and the questionnaire contains more tasks than a standard rating task, and therefore may gather more information per item than a questionnaire created from smaller blocks.
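The task counts discussed above follow from the number of pairs per block, ñ = n(n − 1)/2. A small Python sketch, assuming a hypothetical questionnaire of 60 items split into equal blocks (the function names are ours, for illustration only):

```python
from itertools import combinations


def pairs_in_block(items):
    """All pairs that can be formed from the items of one block."""
    return list(combinations(items, 2))


def n_tasks(n_items, block_size):
    """Number of graded comparisons when items are split into equal blocks."""
    n_blocks = n_items // block_size
    pairs_per_block = block_size * (block_size - 1) // 2
    return n_blocks * pairs_per_block


print(pairs_in_block(["A", "B", "C"]))
for n in (2, 3, 4):
    print(n, n_tasks(60, n))   # 60 Likert items would give 60 rating tasks
```

With 60 items, blocks of two, three, and four yield 30, 60, and 90 graded comparisons, respectively, which matches the comparison with a 60-task rating questionnaire.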
To code the graded preferences appropriately, we will always consider the degree of preference for the first item in the pair {i, k}, item i, arbitrarily using descending integers, for example:

$$
y_{j\{i,k\}} =
\begin{cases}
5, & \text{if } i \text{ is preferred ``much more'' than } k \\
4, & \text{if } i \text{ is preferred ``slightly more'' than } k \\
3, & \text{if } i \text{ and } k \text{ are ``about the same''} \\
2, & \text{if } k \text{ is preferred ``slightly more'' than } i \\
1, & \text{if } k \text{ is preferred ``much more'' than } i
\end{cases}
\tag{1}
$$

Responses coded in this way are the observed outcomes in graded-preference analysis. It is easy to see that the observed outcomes are ordinal variables.

MODELING GRADED-PREFERENCE QUESTIONNAIRE DATA
To model graded preferences when items are presented in pairs, we use the law of comparative judgment (Thurstone, 1927), which attributes preference decisions to the relative utilities (or psychological values) of items under comparison. Thus, person j prefers item i to item k if his or her utility for item i (t_ji) is greater than the utility for k (t_jk). Therefore, the unobserved difference of utilities,

$$
y^{*}_{j\{i,k\}} = t_{ji} - t_{jk}, \tag{2}
$$

is the fundamental quantity in the analysis, which determines the observed preference decision y_j{i,k} via a threshold process (Böckenholt & Dillon, 1997; Maydeu-Olivares, 2002):

$$
y_{j\{i,k\}} = c \quad \text{if} \quad \tau_{\{i,k\},c-1} \le y^{*}_{j\{i,k\}} < \tau_{\{i,k\},c}, \qquad c = 1, \dots, C, \tag{3}
$$

with $\tau_{\{i,k\},0} = -\infty$ and $\tau_{\{i,k\},C} = +\infty$. According to this threshold process, person j selects one of C graded options depending on the size of the latent difference y*_j{i,k} and a set of C − 1 thresholds. However, when graded paired comparisons are drawn from blocks of three or more items (n ≥ 3), respondents need not be consistent in their pairwise preferences, possibly yielding intransitive patterns of preference. That is, they might prefer item i to item k, item k to item l, but not prefer item i to l. Intransitive pairwise preferences can be accommodated by adding an error term to the difference of utility judgments (Maydeu-Olivares & Böckenholt, 2005; Takane, 1989):

$$
y^{*}_{j\{i,k\}} = t_{ji} - t_{jk} + e_{j\{i,k\}}. \tag{4}
$$

The next section describes the distributional assumptions for the unobserved utilities and intransitivity error terms that are necessary to model graded preferences.
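The threshold process can be illustrated with a short Python sketch. It assumes that category c is chosen when the latent difference falls at or above the (c − 1)th threshold and below the cth; the utilities, thresholds, and error value are hypothetical numbers chosen only to show how a pairwise error term can produce an intransitive pattern.

```python
from bisect import bisect_right


def graded_response(t_i, t_k, thresholds, error=0.0):
    """Map a latent utility difference to one of C ordered categories.

    thresholds: increasing list of C - 1 cut points.
    Returns an integer from 1 (strong preference for item k)
    to C (strong preference for item i).
    """
    y_star = t_i - t_k + error
    # Count the thresholds at or below y_star; the category is that count + 1.
    return bisect_right(thresholds, y_star) + 1


taus = [-1.5, -0.5, 0.5, 1.5]          # hypothetical thresholds, C = 5
t = {"A": 2.0, "B": 1.0, "C": 0.0}     # hypothetical utilities of one person

a_vs_b = graded_response(t["A"], t["B"], taus)
b_vs_c = graded_response(t["B"], t["C"], taus)
a_vs_c_transitive = graded_response(t["A"], t["C"], taus)
a_vs_c_with_error = graded_response(t["A"], t["C"], taus, error=-2.6)
print(a_vs_b, b_vs_c, a_vs_c_transitive, a_vs_c_with_error)
```

Without error, the person prefers A to B, B to C, and A to C. With a large negative error on the third pair, the same utilities produce a slight preference for C over A, which is an intransitive pattern.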

Ordinal Factor Model for Graded-Block Preferences
Consider a questionnaire containing b blocks of n ≥ 2 items, in which items are presented in pairs using a graded scale. Because ñ = n(n − 1)/2 item pairs can be obtained from each block, there are bñ ordinal responses for each respondent.
In matrix form, the model can be written as follows. Let y be a bñ vector of observed ordinal variables, which are related to the corresponding latent utility differences y* via the threshold process (Equation 3). The bñ vector of latent utility differences y* given by Equation 4 is

$$
\mathbf{y}^{*} = \mathbf{A}\mathbf{t} + \mathbf{e}, \tag{5}
$$

where t is a bn vector of item utilities, A is a bñ × bn block-diagonal design matrix of contrasts, and e is a bñ vector of pairwise intransitivity errors needed when block size n ≥ 3 (these are zero when block size n = 2 because there cannot be any intransitivity in a single pair). The errors e are assumed to have mean zero and to be uncorrelated with the utilities. They are also assumed uncorrelated with each other so that their covariance matrix Ω² is diagonal. The block-diagonal matrix A contains contrasts of utilities arising from each block. For n = 2, the diagonal blocks contrast the first item in a pair with the second; and for n = 3 and n = 4, respectively, the contrasts are pairwise:

$$
\mathbf{A}_{2} = \begin{pmatrix} 1 & -1 \end{pmatrix}, \quad
\mathbf{A}_{3} = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}, \quad
\mathbf{A}_{4} = \begin{pmatrix} 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \\ 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \\ 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & -1 \end{pmatrix}. \tag{6}
$$

Because questionnaires are designed to measure some personal attributes (latent traits), we assume that the item utilities depend linearly on a set of d common factors η representing the attributes, and the unique factors ε,

$$
\mathbf{t} = \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\varepsilon}, \tag{7}
$$

where Λ is a bn × d matrix of the factor loadings. The factor analysis model assumes that the common and unique factors have mean zero and they are uncorrelated. The unique factors are assumed uncorrelated so that their covariance matrix Ψ² is diagonal. The common factors could be correlated among themselves, with covariance matrix Φ.
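The block-diagonal design matrix A can be generated mechanically. The sketch below is one way to do so in plain Python, assuming pairs within a block are ordered {1, 2}, {1, 3}, ..., {n − 1, n}; the function names are hypothetical and are not part of the supplied R or Mplus materials.

```python
from itertools import combinations


def block_contrasts(n):
    """Contrast matrix for one block of n items: one row per pair {i, k}."""
    rows = []
    for i, k in combinations(range(n), 2):
        row = [0] * n
        row[i], row[k] = 1, -1   # first item enters positively, second negatively
        rows.append(row)
    return rows


def design_matrix(n_blocks, n):
    """Block-diagonal matrix A holding n_blocks copies of the block contrasts."""
    block = block_contrasts(n)
    n_pairs = len(block)
    A = [[0] * (n_blocks * n) for _ in range(n_blocks * n_pairs)]
    for b in range(n_blocks):
        for r, row in enumerate(block):
            for c, value in enumerate(row):
                A[b * n_pairs + r][b * n + c] = value
    return A


A3 = block_contrasts(3)
A = design_matrix(20, 3)      # 20 blocks of three items
print(A3)
print(len(A), len(A[0]))
```

For 20 blocks of three items the result is a 60 × 60 matrix in which every row contains exactly one 1 and one −1.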
Putting together the first-order structure (Equation 5) and the second-order structure (Equation 7),

$$
\mathbf{y}^{*} = \mathbf{A}(\boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\varepsilon}) + \mathbf{e}. \tag{8}
$$

Assuming that the common and unique factors, as well as the pairwise intransitivity errors, are normally distributed, the latent utility differences y* are also normally distributed. Then, their mean is zero and their covariance matrix is

$$
\boldsymbol{\Sigma}_{y^{*}} = \mathbf{A}(\boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}^{2})\mathbf{A}' + \boldsymbol{\Omega}^{2}. \tag{9}
$$

The model just described is an extension of the Thurstonian factor model for polytomous data (Maydeu-Olivares, 2002) to items presented in more than one block. It is also an extension of the Thurstonian IRT model designed for forced-choice blocks (Brown & Maydeu-Olivares, 2011b) to ordinal data with possibly intransitive preferences.
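As a numerical illustration of the covariance structure of the latent utility differences, the following Python sketch computes it for a single block of three items measuring three different traits. All parameter values are hypothetical, and the small matrix helpers are written out only to keep the example free of external libraries.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def add(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


A = [[1, -1, 0], [1, 0, -1], [0, 1, -1]]     # contrasts for one block, n = 3
Lambda = diag([0.8, 0.7, 0.9])                # each item loads on its own trait
Phi = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]]
Psi2 = diag([0.5, 0.6, 1.0])                  # last uniqueness fixed to one
Omega2 = diag([0.25, 0.25, 0.25])             # equal intransitivity variances

# Covariance of the utilities, then of the utility differences.
utility_cov = add(matmul(matmul(Lambda, Phi), transpose(Lambda)), Psi2)
Sigma = add(matmul(matmul(A, utility_cov), transpose(A)), Omega2)

for row in Sigma:
    print([round(v, 3) for v in row])
```

The off-diagonal elements are nonzero because pairs drawn from the same block share items; this is the within-block dependency that the model has to represent.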

Model Estimation
We recognize Equation 9 as the covariance structure of a second-order factor analysis model where A, the matrix of fixed contrasts, represents the first-order factor loadings of the pairwise outcomes on their respective utilities, and Λ represents the second-order factor loadings of the utilities on their respective personal attributes. Because the latent utility differences y* are assumed to be normally distributed and the observed variables are ordinal, the model is akin to an ordinal (second-order) factor analysis and it could be estimated from polychoric correlations. Importantly, when items are presented one at a time, as in standard Likert-type questionnaires, A is an identity matrix and the model reduces to the standard ordinal factor analysis model.
To enable estimation of the covariance structure (Equation 9) from ordinal data, the latent utility differences y* are standardized using z* = Dy*, where $\mathbf{D} = [\mathrm{diag}(\boldsymbol{\Sigma}_{y^{*}})]^{-1/2}$ is a diagonal matrix with the reciprocals of the standard deviations of y* in the diagonal, and $\boldsymbol{\Sigma}_{y^{*}}$ is the covariance matrix in Equation 9 (Maydeu-Olivares, 2002; Maydeu-Olivares & Böckenholt, 2005). Therefore, the standardized latent difference responses z* are multivariate normal with mean zero and correlation matrix

$$
\mathbf{P}_{z^{*}} = \mathbf{D}\boldsymbol{\Sigma}_{y^{*}}\mathbf{D} = \mathbf{D}\left[\mathbf{A}(\boldsymbol{\Lambda}\boldsymbol{\Phi}\boldsymbol{\Lambda}' + \boldsymbol{\Psi}^{2})\mathbf{A}' + \boldsymbol{\Omega}^{2}\right]\mathbf{D}. \tag{10}
$$

If we organize the thresholds in Equation 3 in a bñ × (C − 1) matrix τ, then the thresholds relating the standardized latent utility differences z* to the observed ordinal variables y are

$$
\boldsymbol{\alpha} = \mathbf{D}\boldsymbol{\tau}. \tag{11}
$$

First, the sample thresholds α̂ and polychoric correlations P̂ are estimated. Then the model parameters are estimated from these sample statistics using unweighted or diagonally weighted least squares. This can be accomplished using standard software such as Mplus (Muthén & Muthén, 2016). When using this program, researchers only need to specify the first-order structure (Equation 5) and the second-order structure (Equation 7), as Mplus automatically implements the constraints (Equations 10 and 11). Writing Mplus code can be tedious when block size is greater than two, as many utility contrasts (matrix A) have to be specified. An Excel macro that automates writing the full code, including the necessary identification constraints described here, is available from the first author's Web page.
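The standardization step can be sketched in a few lines of Python. The covariance matrix and thresholds below are hypothetical values for one block of three pairs (unit-variance, uncorrelated utilities plus an intransitivity variance of 0.25); the code only illustrates the rescaling and does not reproduce how Mplus implements it internally.

```python
import math

# Hypothetical covariance matrix of the latent utility differences, one block.
Sigma = [[2.25, 1.0, -1.0],
         [1.0, 2.25, 1.0],
         [-1.0, 1.0, 2.25]]

# Hypothetical thresholds on the unstandardized scale, one row per pair.
tau = [[-1.5, -0.5, 0.5, 1.5] for _ in Sigma]

n = len(Sigma)
# Diagonal of D: reciprocals of the standard deviations of y*.
d = [1.0 / math.sqrt(Sigma[p][p]) for p in range(n)]

# Model-implied correlations of the standardized differences z*.
P = [[d[p] * Sigma[p][q] * d[q] for q in range(n)] for p in range(n)]

# Thresholds on the standardized scale.
alpha = [[d[p] * t for t in tau[p]] for p in range(n)]

print([[round(v, 3) for v in row] for row in P])
print([round(v, 3) for v in alpha[0]])
```

These standardized quantities are what the sample polychoric correlations and sample thresholds estimate.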

Estimable Parameters and Identification
Although items measuring different personal attributes are often combined in blocks, most questionnaires are constructed so that each item measures only one attribute (utility factor loadings Λ forming "independent clusters"; McDonald, 1999). We provide identification conditions for this case. As in any other factor analysis model, we begin by setting the metrics for the common factors by setting their variances to one so that Φ is a correlation matrix. However, due to the categorical nature of the data, the metrics of the unique factors need to be set as well. To do so, in blocks of size n ≥ 3, it suffices to set the uniqueness (i.e., variance of the unique factor) of just one item per block to an arbitrary constant. It is usual to set the uniqueness of the last (or the first) item in each block to one. These are the constraints needed to identify the elements of Ψ².
The diagonal elements of Ω² capturing the degree of intransitivity in pairwise comparisons can be freely estimated. However, in this case the model has a large number of parameters and might be nearly nonidentified in applications (the standard errors for some parameters might be poorly estimated).

To reduce the number of parameters, Maydeu-Olivares and Böckenholt (2005) suggested setting all intransitivity variances equal; that is, Ω² = ω²I.
A special case arises when the block size is n = 2. In this case, Ω² = 0 as there can be no intransitivity. Also, the two items' unique variances cannot be identified independently, so we set Ψ² = I.
A further special case arises when exactly two attributes (d = 2) are measured using multidimensional pairs (n = 2). Because each pairwise ordinal outcome loads on both factors, this is essentially an exploratory factor model, and additional identification constraints need to be imposed on some factor loadings (Brown & Maydeu-Olivares, 2012).

Person Score Estimation
After the model parameters have been estimated, factor scores for each person can be estimated using maximum likelihood or, alternatively, Bayesian estimation with a multivariate normal prior with covariance matrix Φ. Either the mean of the posterior distribution can be estimated (expected a posteriori or EAP), or the mode (maximum a posteriori or MAP). The former can be used in applications with one to three measured attributes; the latter is recommended in applications with many measured attributes. The software we use to fit the graded-preference model, Mplus, conveniently provides MAP scores. When blocks are of size n = 2, factor scores cannot be estimated using the ordinal factor model with covariance structure (Equation 9) because Ω² = 0 (responses cannot be intransitive). In this case, the second-order factor structure (Equation 9) needs to be reparameterized as a first-order structure by using

$$
\breve{\boldsymbol{\Lambda}} = \mathbf{A}\boldsymbol{\Lambda}, \qquad \breve{\boldsymbol{\varepsilon}} = \mathbf{A}\boldsymbol{\varepsilon}, \tag{12}
$$

resulting in the Thurstonian IRT model for ordinal data

$$
\mathbf{y}^{*} = \breve{\boldsymbol{\Lambda}}\boldsymbol{\eta} + \breve{\boldsymbol{\varepsilon}}. \tag{13}
$$

Information and standard errors
In questionnaires measuring personal attributes, it is of interest to evaluate the amount of information that every graded comparison contributes to the measurement of the attributes, and the amount of information that the questionnaire provides as a whole. Because the graded blocks are typically designed to compare items measuring different attributes, the outcomes of comparisons are multidimensional by design, even when items under comparison are unidimensional. Because test developers typically employ balanced designs in which numbers of comparisons between items measuring different attributes are approximately equal, any subset of graded comparisons indicating a particular attribute will also indicate other attributes. In such inseparable designs, no single attribute can be estimated without estimating the whole model. Inevitably then, the measurement errors of all attributes are correlated (and likely highly correlated); therefore, not only their variances (as reciprocals of test information functions) but also their covariances must be considered (McDonald, 1999). To complicate things further, the outcomes of graded pairs arising from blocks of size n ≥ 3 indicate not only the common factors (i.e., attributes), but also the unique factors (i.e., utility errors). Because some of the unique factors and common factors are indicated by the same graded pairs, their measurement errors are also correlated. In this situation, covariances of measurement errors for all the independent variables defining the latent space (the common factors and the unique factors) must be considered.
In Appendix A, we provide the item characteristic functions for graded-preference models, which are necessary for computation of item information. In Appendix B, we provide a complete solution for computing information and standard errors for graded-preference questionnaire data, which also applies to binary preferences (i.e., forced choice). Past solutions for computing information in forced-choice questionnaires (Brown & Maydeu-Olivares, 2011b; Maydeu-Olivares & Brown, 2010) were incomplete as they only partially accounted for multidimensionality by controlling for relationships between traits using directional information. Moreover, they did not take into account the correlated measurement error in questionnaires using multidimensional comparisons, and in unidimensional blocks of three or more items. The solution proposed in Appendix B computes the item and test information functions as Fisher information matrices, fully accounting for the inherent multidimensionality in the data, and can be applied to both graded and binary comparative designs. To enable implementation of this solution in practice, as an online supplement to this article, we provide R functions for computing item and test information from the model parameters and MAP scores estimated in Mplus, as well as sample R code for estimating standard errors (SEs) for these scores.

Reliability
Although the availability of SEs for the estimated trait scores of each person is an advantage for individual diagnostics, summarizing the precision of measurement of the questionnaire for a range of trait values might also be of interest. However, whereas in unidimensional IRT models a curve depicting the test information function (or the SE function) is a good summary, in the inherently multidimensional Thurstonian models with nonseparable designs, trait information could be conditional on all other measured traits (and on some utility errors when the block size is n ≥ 3). In this case, instead of exact functions, sample-based scatter plots of SEs against the trait of interest, such as the one illustrated in Figure 2b, can be helpful.

BROWN AND MAYDEU-OLIVARES

Another common method of summarizing SEs is the empirical reliability index, which is the ratio of true score variance to the sum of true and error variance estimated in a sample. As suggested in Du Toit (2003) for Bayesian EAP or MAP scores, which are regressed estimates of latent traits with a shrunken distribution, the true score variance is best estimated directly from the variance of the EAP or MAP score, say, var(η̂_MAP), which is conveniently printed in Mplus output. The error variance is the mean of the squared SEs estimated for the sample (e.g., using the supplied R code), yielding

$$
\hat{\rho} = \frac{\mathrm{var}(\hat{\eta}_{MAP})}{\mathrm{var}(\hat{\eta}_{MAP}) + \overline{SE^{2}(\hat{\eta}_{MAP})}}. \tag{14}
$$

EMPIRICAL EXAMPLE: MEASURING THE FIVE FACTORS OF PERSONALITY USING GRADED PREFERENCES

Participants and Materials
A total of 595 undergraduate psychology students from the University of Barcelona completed a questionnaire measuring the five factors of personality online in return for a comprehensive feedback report. The sample was 71.4% female, with average age of 22.8 years (SD = 7.9). For this study, we modified the Spanish version of the Forced-Choice Five Factor Markers questionnaire (FCFFM; Brown & Maydeu-Olivares, 2011a) with respect to the response format only. The FCFFM consists of 60 items selected from the International Personality Item Pool, more specifically from the subset measuring the Five-Factor markers (Goldberg, 1992). Each factor is measured with 12 items. The items are organized in b = 20 blocks of three items, with the restriction that within a block no two items measure the same factor. We presented the items from each block as ñ = 3 separate paired comparisons, and respondents had to indicate their preference for the item on the left or on the right using five graded options: much more, a little more, equal, a little more, or much more. To counteract the carry-over effect in paired comparisons with repeated items, we randomized the presentation of pairs, so that the pairs from the same block did not appear sequentially. In total, respondents were presented with b × ñ = 60 graded paired comparisons.
After completing the graded preferences, participants were presented with the same 60 items using a standard Likert format, in which they rated the items according to the extent to which they represented their personality using a 5-point rating scale (very well; quite well; sometimes well/ sometimes badly; quite badly; very badly).

Likert format
A confirmatory factor model with five latent correlated factors illustrated in Figure 1a was fitted to 60 observed item ratings coded from 5 (very well) to 1 (very badly). An ordinal factor analysis model was fitted to these data. Every one of the five factors was indicated by 12 items and no item was loading on more than one factor. Thus, one factor loading and four thresholds were estimated per item. In total, this model estimated 60 loadings, 4 × 60 = 240 thresholds, and 10 interfactor correlations. This model is equivalent to a five-dimensional Samejima's (1969) normal ogive graded response model.
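For intuition, the category probabilities implied by an ordinal factor model for a single Likert item can be sketched as follows. This is a minimal illustration, assuming a standardized latent response with unit variance (so that the residual standard deviation is the square root of 1 − λ²); the function name and the parameter values are hypothetical and are not estimates from this study.

```python
from statistics import NormalDist


def category_probabilities(eta, loading, thresholds):
    """Probabilities of categories 1..C for one ordinal item, given trait eta.

    Assumes y* = loading * eta + residual, with var(y*) = 1 in the population,
    and y = c when y* lies between the (c - 1)th and cth thresholds.
    """
    residual_sd = (1.0 - loading ** 2) ** 0.5
    cdf = NormalDist().cdf
    cumulative = [0.0]
    cumulative += [cdf((tau - loading * eta) / residual_sd) for tau in thresholds]
    cumulative += [1.0]
    return [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]


taus = [-1.5, -0.5, 0.5, 1.5]     # four thresholds for five categories
for eta in (-1.0, 0.0, 1.0):
    probs = category_probabilities(eta, loading=0.7, thresholds=taus)
    print(eta, [round(p, 3) for p in probs])
```

As the trait level increases, probability mass shifts toward the higher categories, which is the behavior of a normal ogive graded response function.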

Graded-block format
The ordinal factor model for graded-block preferences illustrated in Figure 1b was fitted to the 60 observed outcomes of paired graded preferences, coded from 5 (much more preference for first item in the pair) to 1 (much more preference for second item). Because the observed variables were results of comparisons of two items, each ordinal outcome was linked to two latent utilities of items under comparison: the first utility positively influencing the outcome, and the second negatively, with the effects fixed to unity as per contrast matrix A₃ in Equation 6. The utilities, in turn, were indicators of five latent correlated factors (the five factors of personality). The same factorial structure as in the model for Likert ratings was applied to the utility variables: Each factor was measured by 12 utilities and no utility was loading on more than one factor. Thus, one factor loading (pertaining to the item utility) was estimated per item and four thresholds were estimated per graded pairwise outcome. Because every block of three items was presented as three paired comparisons, transitivity of preferences could not be guaranteed (as it would be in rankings), necessitating an error term for every observed preference outcome. Because it is reasonable to assume an approximately equal degree of intransitivity in all paired comparisons (Maydeu-Olivares & Böckenholt, 2005), all 60 variances of the pairwise errors e (the diagonal elements of Ω²) were constrained equal. To set the metric of the unique factors, we fixed the uniqueness of the last item in each block to one (thus fixing 20 of the 60 diagonal elements of Ψ²). In total, this model estimated 60 loadings, 4 × 60 = 240 thresholds, 10 interfactor correlations, 60 − 20 = 40 uniquenesses, and one intransitivity variance parameter common to all pairs.
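The parameter counts above can be tallied in a few lines of Python. The degrees-of-freedom calculation is a sketch under the assumption that the thresholds are saturated, so that the model is tested against the 60 × 59 / 2 polychoric correlations only; variable names are ours.

```python
n_items = 60          # item utilities
n_pairs = 60          # observed graded comparisons (20 blocks x 3 pairs)
n_blocks = 20
n_factors = 5
n_categories = 5

loadings = n_items
thresholds = (n_categories - 1) * n_pairs
factor_correlations = n_factors * (n_factors - 1) // 2
uniquenesses = n_items - n_blocks        # one uniqueness fixed per block
intransitivity = 1                       # single variance common to all pairs

# Parameters of the correlation structure (thresholds excluded).
structural_graded = loadings + factor_correlations + uniquenesses + intransitivity
structural_likert = loadings + factor_correlations

polychorics = n_pairs * (n_pairs - 1) // 2
df_graded = polychorics - structural_graded
df_likert = polychorics - structural_likert

print(thresholds, structural_graded, df_graded, df_likert)
```

Under this assumption the tally gives 1,659 degrees of freedom for the graded-block model and 1,700 for the Likert model, in line with the chi-square tests reported in the results.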

Estimation
Both the Likert and graded-block models were estimated from polychoric correlations in Mplus 7.2, using the unweighted least squares estimator with robust standard errors (denoted ULSMV). To assess goodness of fit, we considered the chi-square statistic (χ²), and the root mean square error of approximation (RMSEA) with values less than .06 indicating good fit (Hu & Bentler, 1999). Recently, it has been suggested to reverse the role of the null and alternative hypotheses when assessing model fit. This is termed a test of not-close fit (MacCallum, Browne, & Sugawara, 1996) and equivalence testing (Yuan, Chan, Marcoulides, & Bentler, 2016), where significant results provide strong support for good fit. With this approach, claims can be made regarding an upper bound on the size of misspecification (T-size) as measured by the RMSEA; specifically, the upper limit of the 90% RMSEA confidence interval (CI) printed by Mplus corresponds to 95% confidence in the maximum size of misspecification (Yuan et al., 2016). In addition to these statistical tests of model fit, we also considered a direct measure of discrepancy between the observed and model-implied polychoric correlations, the standardized root mean square residual (SRMR), with values less than .08 indicating good fit (Hu & Bentler, 1999).

Person scores and their standard errors
Mplus produced two sets of MAP scores on the five factors for each participant: one based on the Likert responses, and the other based on the graded-block responses. For the graded-block responses, Mplus produced not only the trait scores (second-order factors) but also the utility scores (first-order factors). At the time of writing, Mplus does not compute SEs for MAP scores. SEs for MAP scores for Likert and graded-block formats using respective multivariate normal priors were computed using R functions supplied with this article according to the formulas provided in the Appendices. (Note that the supplied R functions can also be used to compute SEs of MAP scores in the multidimensional ordinal model for Likert items, as it is a special case of our graded-preference model when n = 1 and the contrast matrix A is set to an identity matrix.)
We estimated the empirical reliabilities of the Five-Factor scores measured in the Likert and graded-block models using Equation 14, with the error variance of the MAP scores estimated by squaring and averaging the respective SEs across the whole sample. All these steps are included in the sample R code supplied with this article.
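The reliability computation described here (score variance divided by the sum of score variance and mean squared SE) can be sketched in Python as follows. This is not the supplied R code; the function name is hypothetical, the use of the population variance formula is an assumption, and the scores and SEs are made-up numbers.

```python
from statistics import mean, pvariance


def empirical_reliability(map_scores, standard_errors):
    """Ratio of score variance to score variance plus mean squared SE."""
    true_var = pvariance(map_scores)                       # variance of MAP scores
    error_var = mean([se ** 2 for se in standard_errors])  # mean squared SE
    return true_var / (true_var + error_var)


# Hypothetical MAP scores and SEs for four respondents on one trait.
scores = [-1.0, 0.0, 1.0, 2.0]
ses = [0.5, 0.5, 0.5, 0.5]

reliability = empirical_reliability(scores, ses)
print(round(reliability, 3))
```

In an application, the MAP scores would come from the Mplus output and the SEs from the information functions, one pair of vectors per trait.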

Model fit and parameter estimates
The ordinal factor model applied to the Likert ratings yielded χ 2 (1,700) = 5,239 p < .001, a poor fit according to the SRMR = .092, and a barely acceptable approximate fit according to the RMSEA = .059 (90% CI for RMSEA [.057, .061]). Under the equivalence testing framework, we can be 95% confident that the population RMSEA is no more than .061. Exploring potential reasons for misfit, we examined the model's modification indexes (MIs). Only five MIs exceeded 100; all of them pertained to cross-loadings. For example, the largest MI (χ 2 = 197) was for the item "I am always prepared" ("Siempre estoy preparado"), which was designed to measure Conscientiousness, suggesting a cross-loading on Openness. Judging that allowing the suggested cross-loadings would not radically change the model fit or interpretation, we retained the original model. The factor loadings of all the Likert items on the personality factors were in the expected directions and statistically significant. The model-based correlations of the five personality traits for Likert data are given in Table 1, above the diagonal.
The second-order ordinal factor model applied to the graded-block comparisons yielded χ²(1,659) = 3,874, p < .001, a good fit according to SRMR = .072 and RMSEA = .047 (90% CI [.045, .049]). Under the equivalence testing framework, we can be 95% confident that the population RMSEA is no more than .049. The a priori model appeared to fit the graded-block comparisons better than the counterpart model fit the Likert ratings. The factor loadings of all the first-order utilities on the second-order personality factors were in the expected directions and statistically significant. The model-based correlations between the five personality dimensions in the graded-preference model are given in Table 1 (below the diagonal).
Table 1 shows that the correlations yielded by the Likert and graded-preference models were largely similar; however, the small differences were systematic. The correlations in the Likert model were always stronger (except the Agreeableness-Neuroticism correlation, which was weaker in the third decimal place for the Likert data, a clearly negligible exception to this trend). If we reverse the direction of the trait Neuroticism, presenting it as Emotional Stability, all intertrait correlations become positive, yielding an average correlation of .195 in the Likert model and .137 in the graded-preference model. Interestingly, all intercorrelations except those involving Agreeableness are uniformly larger by about .09 in the Likert model. The correlations involving Agreeableness are very close in the two models.

Standard errors and reliability of factor scores
The SEs and reliabilities of the MAP Five-Factor scores in the Likert and graded-preference models are summarized in Table 2. For comparison, coefficients alpha for sum scores obtained from the Likert items are also provided. Table 2 shows that the MAP scores in both formats were highly reliable, with reliabilities in the range of .8 to .9; all scores were slightly more reliable when the Likert format was used (differences in reliabilities of around .05). Given the same number of observed variables and the same number of graded categories in both response formats, the slightly more reliable scores with ratings are to be expected because each rating loaded on one factor only, hence providing independent contributions to the reduction of measurement error.

Convergent validity of the factor scores
Table 1 note. The monomethod heterotrait latent correlations from the Likert model are above the diagonal, and those from the graded-preference model are below the diagonal. *Correlations significant at the .05 level, two-tailed. **Correlations significant at the .01 level, two-tailed.
MAP estimated scores from the Likert and graded-preference measurement models were used to explore the relationships between corresponding personality constructs (heteromethod monotrait correlations), which are given in Table 2. The estimated trait scores for the same construct correlated highly, and the correlations were similar in magnitude to the respective reliability coefficients. The correlation coefficients corrected for unreliability (using the empirical reliability coefficients) are provided in parentheses after the observed value. Except for the trait Agreeableness, for which the corrected correlation was .937, the rest of the traits correlated nearly perfectly, suggesting that the same psychological constructs were measured regardless of the response format.
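The correction for unreliability mentioned above is the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. A minimal Python sketch follows; the function name and the input values are hypothetical and are not taken from Table 2.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for unreliability of both measures.

    Classical formula: r_corrected = r_xy / sqrt(rel_x * rel_y).
    """
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical values: an observed heteromethod monotrait correlation of .80
# between scores with empirical reliabilities of .88 and .84.
r_corrected = disattenuate(0.80, 0.88, 0.84)
```

When the observed correlation is as large as the geometric mean of the two reliabilities, the corrected value reaches 1, which is the sense in which correlations "similar in magnitude to their respective reliability coefficients" indicate nearly perfect agreement between the constructs.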

CONCLUSIONS AND DISCUSSION
This article introduces an ordinal factor analysis model of graded preferences among pairs of items, where the extent of preference for one or another item can be quantified in terms of ordered categories such as much more, slightly more, about the same, and so on. Questionnaires using graded comparisons can be used to assess personality traits, motivations, attitudes, and similar constructs. Items designed to measure different constructs can be combined to create multidimensional graded pairs. Pairs can be formed by simply splitting a pool of items into blocks of two items, in which case no graded pairs have overlapping content. However, the pool of items can also be split into blocks of three or more items from which all possible pairs are then drawn (we call this graded-block design). In this latter case, graded pairs drawn from the same block have overlapping content with known patterns of dependence. The model we propose for these data is equivalent to an extension of the Thurstonian IRT model to ordered categorical outcome data. The new contribution of this article beyond extending the family of Thurstonian factor and IRT models is the complete solution to the item and test information functions, which are now computed as Fisher information matrices and can be applied to both binary and graded comparative designs.
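The graded-block design described above can be illustrated with a short Python sketch that splits an item pool into blocks and draws all possible pairs within each block. The function name and item labels are hypothetical. For simplicity, the sketch blocks consecutive items, whereas an actual design would assign items measuring different constructs to each block.

```python
from itertools import combinations


def graded_pairs(items, block_size):
    """Split items into consecutive blocks and return all within-block pairs.

    A block of n items yields n * (n - 1) / 2 pairs.
    """
    if block_size < 2 or len(items) % block_size != 0:
        raise ValueError("item pool must split evenly into blocks of size >= 2")
    blocks = [items[i:i + block_size] for i in range(0, len(items), block_size)]
    return [pair for block in blocks for pair in combinations(block, 2)]


# Hypothetical pool of 60 items, as in a 60-item questionnaire.
items = [f"item{i:02d}" for i in range(1, 61)]
pairs_n2 = graded_pairs(items, 2)   # 30 blocks x 1 pair, no shared items
pairs_n3 = graded_pairs(items, 3)   # 20 blocks x 3 pairs, items shared within blocks
```

With blocks of three, 60 items produce 60 graded pairs, so the number of observed variables equals the number of items, and every item appears in two pairs drawn from its block.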
We believe that when used in the right context, grading of preferences can be superior to both Likert ratings and binary rankings (forced choice). Graded preferences could replace Likert ratings when finer differentiation between judgments is needed, for instance in organizational appraisals where halo effects are common and affect the validity of inferences (Bartram, 2007; Brown, Inceoglu, & Lin, 2017); or in settings where respondents might acquiesce. Graded preferences could also replace forced-choice rankings when the test reliability needs to be increased without increasing the number of item pairs administered. Indeed, given a fixed number of items, and all other factors held constant, the use of a graded scale over a binary scale is known to increase the amount of information the test provides (Maydeu-Olivares, Kramp, García-Forero, Gallardo-Pujol, & Coffman, 2009).

Graded Preferences Versus Likert Ratings
To illustrate the potential advantages and disadvantages of graded preferences as compared to Likert ratings, consider our empirical example where we compared measurement of the five factors of personality using the two formats. Both designs had the same number of items (60), the same number of observed variables (60), and the same number of graded options per observed outcome variable (5). As can be seen in Table 2, the empirical reliabilities were over .8 in both formats, with ratings still slightly outperforming graded preferences (the loss in reliability for each scale was around .05). This small loss was due to the multidimensionality inherent in comparative response formats. As explained in the section on information, because every graded pair contributes to the measurement of more than one common factor (and more than one unique factor), the measurement errors are correlated. To account for this, we evaluate the information contribution of every pair to the measurement of all the relevant common and unique factors using the Fisher item information matrix, a procedure common in computerized adaptive testing (CAT) applications using multidimensional IRT models. When the measurement errors are correlated, the standard errors of the trait scores are generally larger than in the counterpart Likert questionnaires with factorially pure items, and reliabilities are consequently smaller.
However, the slight loss of information in graded pairs compared to Likert ratings might well be outweighed by potential benefits in reducing unwanted effects such as acquiescence, halo, or socially desirable responding. Comparing the intertrait correlations in the Likert and graded-preference models, we noted that the Likert ratings yielded a slightly stronger positive manifold of correlations among the five personality traits (with Neuroticism reversed to represent Emotional Stability). At the item level, the average model-based correlation between utilities (suitably reversed to measure the desirable poles) was again greater in the Likert model (.168) than in the graded-preference model (.144). It appears that the Likert items elicited utility judgments that were slightly less differentiated than the judgments elicited by the graded pairs. Specifically, the Likert ratings of items indicating the desirable poles of personality traits were more similar to each other, and so were the ratings of items indicating the undesirable poles. This similarity in ratings could not be attributed to acquiescence because the correlations were adjusted for item polarities. We believe the more likely reason for less differentiated utility judgments in the Likert version of the FCFFM was socially desirable responding. The lack of fit and the required cross-loadings in the Likert model also point to an additional source of common variance in the ratings, which our a priori model did not take into account. It is outside the scope of this article to examine alternative models for Likert ratings, but models exist that incorporate biases as latent "method" variables acting at either the item level as in the random intercept model (Maydeu-Olivares & Coffman, 2006), or at the response category level as in the scoring functions approach (Falk & Cai, 2016). Such models could be used in future research to explore the source and extent of response biases.
Table 2 note. L = Likert; GP = graded preferences. Observed correlations between the estimated factor scores in the two measurement models are shown; these correlations corrected for unreliability of both measures are in parentheses.
We believe that although detectable, response distortions were small in the empirical study presented here because, by providing participants with personalized feedback reports, we tried to ensure sufficient motivation not to engage in acquiescence and inattentive responding on the one hand (Meade & Craig, 2012), and to present a true picture of themselves without managing impressions on the other hand. The high degree of similarity between the results obtained from absolute and comparative response formats corroborates findings reported in similar low-stakes conditions, for instance in a validation study reported by Brown and Maydeu-Olivares (2013). However, this degree of similarity is by no means guaranteed, and is actually unusual in medium- or high-stakes assessments (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006; Brown et al., 2017; Schmit & Ryan, 1993). Comparing the internal and external validities of scores derived from graded preferences to both Likert ratings and rankings in such contexts would be a good topic for further research.

Graded Preferences Versus Binary Preferences (Forced Choice)
To illustrate potential advantages of graded preferences in comparison to binary preferences, which we did not collect in the empirical example presented here, we collapsed the first three and the last two categories in our graded-pairs data to emulate binary-pairs data. The resulting empirical reliabilities, computed in the way described in this article but based on two response categories, were ρ_N = .780, ρ_E = .811, ρ_O = .711, ρ_A = .731, and ρ_C = .755. Comparing these estimates to their counterparts in Table 2, we can see that the binary choice yielded a reliability loss of between .07 and .10 compared to the graded preferences. This degree of information loss is greater than the loss we observed in using graded comparisons instead of Likert ratings.
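The recoding used to emulate binary pairs can be sketched as follows. This is a minimal Python illustration that assumes the five graded categories are coded 1 to 5 in presentation order. The function name and the 0/1 labels for the collapsed outcomes are hypothetical choices rather than the coding of the supplied R code.

```python
def collapse_to_binary(response, n_first=3, n_categories=5):
    """Collapse a graded response into a binary outcome.

    ASSUMPTION: categories are coded 1..n_categories; the first `n_first`
    categories map to 0 and the remaining categories map to 1.
    """
    if not 1 <= response <= n_categories:
        raise ValueError("response outside the category range")
    return 0 if response <= n_first else 1


# Hypothetical responses of one person to six graded pairs.
graded = [1, 3, 4, 5, 2, 4]
binary = [collapse_to_binary(r) for r in graded]
```

The collapsed data retain only the split between the first three and the last two categories, so all grading information within each group is discarded, which is consistent with the lower reliabilities reported above.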
Although the information increase is undoubtedly an advantage of graded over binary preferences, the use of ordinal categories to grade one's preferences could potentially open the door to response biases we typically associate with Likert scales. In theory, idiosyncratic uses of the response categories are possible in the graded-preference format; for example, preferring the extreme categories or the middle categories regardless of the item content. However, these styles would influence the judgments of utility differences rather than utilities themselves. Whether this type of distortion will prove problematic in certain contexts (e.g., cross-cultural research notoriously vulnerable to systematic differences in response styles) and how it will compare to the Likert scales remains to be seen and is also a good topic for future studies.
To conclude, when designed well and used in the right context, graded preferences can be an attractive alternative to either Likert ratings or rankings. They can have the benefits of rankings in differentiating well between responses, and the benefits of ratings in allowing respondents to express the extent of preference, thus increasing information and measurement precision. In this article, we provided tools for fitting factor analysis models to graded preference data, estimating person scores that are free of problems of ipsative data, and assessing the measurement precision of these scores. Equipped with these tools, researchers and test developers can evaluate the performance of various questionnaire designs and select the best one for their required assessment context. We are looking forward to new developments in this area. . . .
When block size n ≥ 3, the latent factor space F in Equation B.1 includes d common factors η representing the attributes and bn unique factors ε representing the utility errors.
Denoting by z_c the category-dependent argument of the category probability in Equation A.4, the partial derivative with respect to any common factor η_a is

Rank of Fisher Test Information Matrix
Whereas the Fisher item information matrix always has rank 1 (Mulder & Van Der Linden, 2009), the maximum likelihood test information matrix for block size n = 2, which is the sum of the item information matrices described by Equation B.3, generally has full rank d and is therefore invertible. This is because the matrix AΛ is of full rank d, unless the discrimination parameters of the test items all have the same proportional relationship (Brown, 2016a). Adding the posterior information matrix Φ^{-1} preserves the full rank. However, in blocks of size n ≥ 3, the maximum likelihood test information matrix, which is the sum of the item information matrices (Equation B.7), is not of full rank. This is the result of the reduced column rank of the blocks (Equation 6) in the contrast matrix A (Maydeu-Olivares, 1999), which determines the bottom-right block A_{\{i,k\}} A_{\{i,k\}}^T of the Fisher information matrix. For instance, matrix A_3 in Equation 6 has rank 2 rather than 3 (the number of columns, also the number of utilities). Because the contrast matrices are identical for every block, the sum of all item information matrices I^{n=3} has a reduced rank and is not invertible, and the SEs of the maximum likelihood factor scores cannot be computed using this method. However, the posterior test information matrices for any block size, I_P^{n=2} and I_P^{n≥3}, are generally of full rank; therefore, they can be inverted to compute the SEs of MAP scores.
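The rank deficiency can be checked numerically. The Python sketch below assumes that A_3 in Equation 6 (not reproduced here) has the standard form of a pairwise contrast matrix for three utilities, with rows for the pairs {1, 2}, {1, 3}, and {2, 3}. It also uses the unweighted cross-product of A_3 and an identity matrix as simplified stand-ins for the utility part of the test information and for the prior contribution, respectively. All function names are hypothetical.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank by Gauss-Jordan elimination in exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue                      # no pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


def cross_product(a):
    """Return A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


# ASSUMED standard contrast matrix for a block of three utilities.
A3 = [[1, -1, 0],
      [1, 0, -1],
      [0, 1, -1]]

info_ml = cross_product(A3)               # stand-in for the ML information block
info_map = [[v + (1 if i == j else 0) for j, v in enumerate(row)]
            for i, row in enumerate(info_ml)]   # identity added as a stand-in prior term

rank_A3 = matrix_rank(A3)
rank_ml = matrix_rank(info_ml)
rank_map = matrix_rank(info_map)
```

In this simplified setting, A_3 and its cross-product both have rank 2, so the cross-product cannot be inverted, whereas adding a positive definite term restores full rank 3, mirroring the argument for the posterior test information matrices.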