Rethinking the Interpretation of Item Discrimination and Factor Loadings



Educational and Psychological Measurement

Educ Psychol Meas. 2019 Dec; 79(6): 1103–1132.

Published online 2019 May 6. doi:10.1177/0013164419843164

PMCID: PMC6777065

PMID: 31619841

Pascal Jordan and Martin Spiess


Abstract

Factor loadings and item discrimination parameters play a key role in scale construction. A multitude of heuristics regarding their interpretation are hardwired into practice—for example, neglecting low loadings and assigning items to exactly one scale. We challenge the common sense interpretation of these parameters by providing counterexamples and general results which altogether cast doubt on our understanding of these parameters. In particular, we highlight the counterintuitive way in which the best prediction of a test taker’s latent ability depends on the factor loadings. As a consequence, we emphasize that practitioners need to shift their focus from interpreting item discrimination parameters by their relative loading to an interpretation which incorporates the structure of the model-based latent ability estimate.

Keywords: factor loadings, item discrimination, person parameter, paradox, multidimensionality

Introduction

Discrimination vectors are part of nearly every item response theory (IRT) model used in practice. Models (see, e.g., Reckase, 2009) like the multidimensional two-parameter logistic model (M2PL), the multidimensional graded response model (MGRM), the multidimensional generalized partial credit model (MPCM), or the classical factor analysis (FA) model represent major classes of models, wherein item discrimination vectors $a_i$ ($a_i \in \mathbb{R}^p$, $i=1,\dots,k$) form a central part to gain understanding of the measurement properties of the resulting scale.

The information provided by those vectors can be, and is often, used to divide the item pool into several approximately unidimensional scales, wherein each scale is supposed to measure a different construct of interest. Due to the fact that the estimated item discrimination vectors and probably also their true values differ from simple structure (Thurstone, 1947), the practitioner is forced to apply some rule of thumb, which neglects low values of item discrimination and focuses interpretation only on those dimensions on which the item’s discrimination is large (Kline, 1994). This is also reflected in tables in journals reporting the results of factor analysis, wherein quite often (standardized) loadings below 0.3 (this cutoff also frequently appears in best practice recommendations, see Costello, 2009) are blanked—owing to the belief that there is no substantial difference in statistical inference when treating the corresponding loadings as zeros—see also Chapter 10 of Comrey and Lee (1992) for an argument based on the bivariate factor–item correlation.

According to a rule of thumb, items are assigned to a common scale. An item $i$ corresponding to scale $j$ is characterized by high values of $a_{ij}/|a_i|$, whereas $a_{il}/|a_i|$, that is, the relative loading on another dimension $l$, is comparably small. On the other hand, an item $i$ with equal discrimination along all latent dimensions ($a_{ij} \approx a_{il}$) is treated as though it provides equal information with respect to the different constructs. Depending on the specific aim of test construction, such an item may be eliminated from the item pool, because it cannot be assigned to a single scale and because approximate simple structure is deemed to be the goal of item selection (Cattell, 1952, 1966; Thurstone, 1947), or retained, if the test construction allows for (true) multidimensionality. After completion of the process of scale construction, the meaning of a latent dimension is based on the item content of those items which are part of the scale. Thus, there is a second stage, namely the labeling of latent dimensions, wherein the item discrimination values contribute valuable information. It is therefore safe to say that the item discrimination vectors have a significant impact on the result of scale construction in a multitude of stages (e.g., item selection, inferring the meaning of latent constructs, and also some weak forms of factor validation—see Kline, 1994). As they are also of fundamental importance when it comes to using the scale for diagnostic purposes (e.g., in adaptive testing), a closer look at their actual meaning becomes mandatory.

This article focuses on the interpretation of item discrimination vectors in item response theory and factor analysis models (using the terms “discrimination vector” and “vector of factor loadings” interchangeably throughout the article). We will make a case for a reconsideration of the meaning of item discrimination by showing that common sense interpretation, in conjunction with the use of commonly applied heuristics, is at odds with various statistical estimation and invariance principles. More specifically, it will be demonstrated that a correct interpretation of item discrimination needs to use all the information provided by the whole set of item discrimination parameters. In contrast, attempts to facilitate interpretation by basing it on single entries do not cover important aspects of the relationship between item responses and latent abilities (or, more precisely, their estimates). As a consequence of the erroneousness of commonly applied heuristics (i.e., heuristics which try to interpret an item’s discrimination solely by judging the item’s discrimination vector), even the somewhat intuitive method of labeling the latent dimensions (Thurstone, 1931) will be called into question. All these aspects will be illustrated by focusing the discussion on the diagnostic part of the IRT model. That is, we analyze the dependency of latent ability estimates on the set of item parameters (we do not treat the dependence of the item associations on the item parameters). Further, multidimensionality will play a key role. In fact, it is not an overstatement to say that multidimensionality accounts for almost all cases where common sense interpretations fail. Therefore, after introducing the notation and the precise modeling framework in the “The Modeling Framework” section, we will skip the “benevolent” unidimensional case (the role of item discrimination in that case is treated elsewhere) and concentrate henceforth on the multidimensional case.

A crucial part in the understanding of the difficulties that multidimensionality imposes on the proper interpretation of item discrimination can be highlighted via a discussion of paradoxical classifications. By the latter we refer to an effect which was discovered by Hooker, Finkelman, and Schwartzman (2009). The authors showed that estimates of latent abilities can decrease due to a correct answer despite the fact that all latent abilities contribute positively to the solving of an item. Although their findings were primarily discussed with respect to the issue of test fairness, they also cast doubt on our way of interpreting item parameters. In fact, the paradox is so closely tied to the discussion of item discrimination that a substantial amount of cases, wherein commonly applied heuristics fail, can be subsumed under the issue of paradoxical scoring. Note that while Hooker et al. (2009) focused on test fairness and did not explicitly treat the interpretation of item discrimination, we will focus on outlining the consequences of the paradoxical scoring effect for the latter. We will treat all counterintuitive results referring to the interpretation of the sign of item discrimination within this context (in the “Paradoxical Effects in Item Response Models” section) and reserve a separate section (the “The Impact of Item Discrimination on Scoring” section) for counterintuitive results concerning erroneous notions regarding the magnitude of item discrimination, wherein we highlight hitherto undiscussed challenges to the interpretation of item discrimination. After introducing the paradox to the reader who is not familiar with this line of research, we will discuss the relation between item discrimination and the inference of a subject’s latent ability (the “Paradoxical Effects in Item Response Models” section). In sharp contrast to the unidimensional case, it will be shown that positive item discrimination does not necessarily imply that solving an item increases the estimate of the latent ability. Although this result can be at odds with common notions of fairness, it does not by any means imply that the underlying statistical model (or the estimates resulting from the model) is flawed (van der Linden, 2012). This will be highlighted, and the subsequent analysis will in addition show that our understanding of test fairness and its reliance on the information provided by the item discrimination can be challenged by the principle of rotational invariance.

In the “The Impact of Item Discrimination on Scoring” section, we directly address the issue which was raised at the beginning of this introduction—namely the question as to whether commonly applied heuristics of scale construction and interpreting item discrimination can be justified. By providing a counterexample which highlights eight different flaws of commonly accepted rules of thumb, we not only hope to cast doubt on the use of these rules but also point to the necessity of taking into consideration the behavior of the person parameter estimate when interpreting IRT model output properly.

Finally, the potential impact of this suggestion, that is, taking into account the behavior of the estimates of the latent proficiencies rather than relying on heuristics (which only seem to work for unidimensional models) when labeling latent dimensions, is demonstrated via a real data example (the “Real Data Example” section). The concluding remarks (the “Conclusion” section) sum up potential pitfalls, provide an alternative—potentially provoking—method to interpret item discrimination, and end with a critique of the currently established usage of IRT models in practice.

The Modeling Framework

The response of the test taker to the $i$th item is denoted as $U_i$, where $U_i$ is treated as a random variable (Holland, 1990). Depending on the type of items, this response can take values in (a) the set $\{0,1\}$ (binary IRT model); (b) the set $\{0,1,\dots,L\}$ (ordinal IRT model); or (c) the set of real numbers (IRT model with continuous outcome, see, e.g., Samejima, 1973). As is common when using MIRT models for diagnostic purposes, item parameters are assumed to be known—a reasonable simplification if the test items have been calibrated with a sufficiently large sample of test takers from the population. Thus, the (only) parameter of interest—namely the vector of latent proficiencies of the test taker—will be abbreviated as $\theta=(\theta_1,\theta_2,\dots,\theta_p)$, wherein the subscript $p$ indicates the dimensionality of the corresponding IRT model. This parameter is related to the responses via the function $P(U_1=u_1,\dots,U_k=u_k\,|\,\theta)$, which can be decomposed according to the local independence assumption (Holland, 1981) as follows:

$$P(U_1=u_1,\dots,U_k=u_k\,|\,\theta)=\prod_{i=1}^{k}P(U_i=u_i\,|\,\theta).$$

On one hand, this expression describes the probability that the examinee shows a certain response pattern as a function of his or her latent abilities. On the other hand, it also serves as the starting point for the inference of his or her abilities. More specifically, given response pattern $u=(u_1,u_2,\dots,u_k)$, we infer $\theta$ by searching for the set of latent abilities where the likelihood function $L_u(\theta)=\prod_{i=1}^{k}P(U_i=u_i\,|\,\theta)$ achieves its maximum (usually unrestricted over $\mathbb{R}^p$).

The models that will be of interest to us all share the additional property that the probabilities $P(U_i=u_i\,|\,\theta)$ are constant on certain hyperplanes in the latent proficiency domain and that the item discrimination parameter $a_i \in \mathbb{R}^p$ can be identified with a normal to the hyperplane (Reckase, 2009).

For example, the multidimensional two-parameter logistic model for binary responses specifies $P(U_i=1\,|\,\theta)$ as

$$P(U_i=1\,|\,\theta)=\frac{\exp(a_i^T\theta-\beta_i)}{1+\exp(a_i^T\theta-\beta_i)}$$

(1)

and this is a constant function on the set $\{\theta\,|\,a_i^T\theta=c\}$. Likewise, the MGRM, a multidimensional generalization of Samejima’s (1969) graded response model for ordinal responses, specifies for $l\in\mathbb{N}$, $0<l<L$ and latent thresholds $\tau_{i,l}>\tau_{i,l-1}$:

$$P(U_i=l\,|\,\theta)=\Phi(a_i^T\theta+\tau_{i,l})-\Phi(a_i^T\theta+\tau_{i,l-1}),$$

(2)

which again is constant on the set $\{\theta\,|\,a_i^T\theta=c\}$. Finally, the factor analysis model specifies

$$P(U_i\in(a,b)\,|\,\theta)=\Phi(a_i^T\theta+b)-\Phi(a_i^T\theta+a).$$

(3)

Despite the fact that the usual purpose of the factor analysis model is dimensionality reduction, with regard to the issue of person parameter estimation it can also be thought of as an IRT model with continuous responses. The precise formulation of this is developed in Appendix A (see also Samejima, 1973). Here we only emphasize the following: In all cases, Equations (1)-(3), the loglikelihood can be written as

$$\sum_{i} g_i(a_i^T\theta)$$

(4)

for some set of real-valued functions $g_i$ ($i=1,\dots,k$), which are twice continuously differentiable with second derivative $g_i''<0$ throughout $\mathbb{R}$ (see, e.g., Hooker et al., 2009). And this common part of the three most prominent models (for binary, ordinal, and continuous responses) is the core of the theoretical background underlying the subsequent discussion. We will usually use the FA model for the purpose of demonstrating effects and discussing item discrimination. However, the reader should be aware that most of the issues we discuss (exceptions will be highlighted whenever necessary) pertain to any of the above-mentioned models.
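
To make the scoring procedure concrete, here is a minimal sketch (not taken from the article) of person parameter estimation in a two-dimensional M2PL; the loadings, intercepts, and responses are illustrative assumptions only.

    ## Minimal sketch: ML estimation of theta in a two-dimensional M2PL
    ## with known (made-up) item parameters.
    a <- matrix(c(1.0, 0.2,
                  0.8, 0.6,
                  0.1, 1.1), ncol = 2, byrow = TRUE)  # discrimination vectors a_i
    beta <- c(0.0, 0.5, -0.3)                          # item intercepts beta_i
    u <- c(1, 0, 1)                                    # observed binary responses

    ## Negative loglikelihood; each summand is a function of a_i' theta,
    ## matching the form given in Equation (4)
    negloglik <- function(theta) {
      eta <- drop(a %*% theta) - beta
      -sum(u * eta - log(1 + exp(eta)))
    }

    theta_hat <- optim(c(0, 0), negloglik, method = "BFGS")$par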

Subsequently, we will study the dependence of estimates of $\theta$, the vector of latent proficiencies of a single test taker, on the set of item discrimination parameters $(a_i)_{i=1,\dots,k}$. Toward this aim, we first provide a description of the phenomenon of paradoxical results in multidimensional item response models, which has been studied in depth by various authors (Hooker, 2010; Hooker et al., 2009; van der Linden, 2012) and which serves as a starting point for highlighting discrepancies between common sense interpretations of item discrimination and the actual behavior of estimates of latent abilities. The subsequent “Positivity Implies Paradoxical Movements” subsection summarizes the known core result on paradoxical scoring, while new approaches and results are given in the “Fairness and Negative Item Discrimination” subsection, the “Flip-Flop Patterns” subsection, and the “The Impact of Item Discrimination on Scoring” section.

Paradoxical Effects in Item Response Models

In many practical cases of test construction, it makes perfect sense to assume that all latent dimensions contribute positively to the solving of an item. For example, in a two-dimensional scale consisting of physics items it is quite reasonable to postulate that an increase of the value in any latent dimension (say: math ability and mental rotation) does not decrease the probability of solving an item. Stated in terms of conditions on the parameters of the underlying IRT model, one could assume that the entries of each item discrimination vector $a_i$ are nonnegative. Given a multidimensional scale consisting of items of this type, Hooker et al. (2009) showed the existence of a paradoxical effect when using the test for diagnostic purposes. In essence, the paradox points to the fact that an increase in an item score, that is, a better performance of the person, does not result in an increase of the values of all estimated latent proficiencies but lowers at least one of those estimates. Moreover, this result can be established irrespective of the measurement scale of the item responses. Thus, changing an item score (binary, ordinal, or continuous) of a specific item toward higher values lowers the estimate of (at least) one latent proficiency. Although this paradoxical effect has important consequences with respect to the fairness of classification decisions (Hooker et al., 2009), we shall not dwell on this aspect but concentrate on its meaning regarding a proper interpretation of item discrimination parameters—a topic that, in contrast to the discussion on fairness and accurate ability estimation, has received only little attention. But before we scrutinize this relation, we have to define the precise meaning of the paradox in terms of the previously outlined modeling framework: Given any of the models (1)-(3), the latent abilities are inferred by maximizing a loglikelihood of the type:

$$l_u(\theta)=\sum_{i=1}^{k} l_{i,u_i}(a_i^T\theta),$$

(5)

wherein the item discrimination vectors $a_i$ are nonnegative and $l_{i,u_i}$ is a twice continuously differentiable function mapping $\mathbb{R}$ into $\mathbb{R}$, just as described in the previous section, that is, including the M2PL, the MGRM, and the FA model. The assumed negativity of the second derivative, that is, $l_{i,u_i}''<0$, also implies, in conjunction with an assumption of a full column rank matrix of item discrimination vectors, the uniqueness of the maximum likelihood estimate (MLE) via standard results of convex analysis (Chap. 27 in Rockafellar, 1970).

Denoting the point in $\theta$-space where the maximum of Equation (5) is achieved as $\hat{\theta}(u)$, we can set up the following criterion of test fairness (with “$\geq$” interpreted as an entrywise operation):

$$u_1 \geq u_2 \Rightarrow \hat{\theta}(u_1) \geq \hat{\theta}(u_2).$$

(6)

This criterion, first introduced by Hooker et al. (2009), has some intuitive appeal: If Subject 1 outperformed Subject 2, that is, if Subject 1 obtained on every item at least the item score of Subject 2, then this ordering should be transferred to the latent ability estimates. That is, $\hat{\theta}_j(u_1) \geq \hat{\theta}_j(u_2)$ should hold for all $j=1,\dots,p$. For if there were a single latent dimension $j$ with $\hat{\theta}_j(u_1) < \hat{\theta}_j(u_2)$, then any classification scheme which emphasizes the $j$th dimension will favor the underperformer. That is, setting up this criterion prevents subsequent classification procedures, for example, via multiple hurdle rules (Segall, 2000), from causing questionable results.
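
For scoring rules that are linear in the responses, such as the least-squares scoring $\hat{\theta}(u)=Su$ of the factor analysis model used below, criterion (6) admits a simple check: writing $u_1=u_2+d$ with $d\geq 0$ gives $\hat{\theta}(u_1)-\hat{\theta}(u_2)=Sd$, which is nonnegative for every such $d$ exactly when $S$ has no negative entries. A one-line helper in R (our own illustration, not code from the article):

    ## Criterion (6) for a linear scoring rule theta_hat(u) = S %*% u holds
    ## for all ordered response pairs iff S is entrywise nonnegative.
    satisfies_fairness <- function(S) all(S >= 0)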

The introduction of the fairness criterion (6) serves two purposes: First, the precise meaning of the occurrence of a paradoxical result is fixed with the statement of the criterion: We speak of a paradox if and only if there are two ordered responses $u_1 \geq u_2$ which at the same time satisfy $\hat{\theta}_j(u_1) < \hat{\theta}_j(u_2)$ for some $j$. Second, and more important, this fairness criterion will be used to highlight flaws in our intuitive understanding of item discrimination. More specifically, the following three statements will subsequently be illustrated and discussed:

  a. Positivity of the entries of each item discrimination vector implies that each dimension contributes positively to the solving of any item, but at the same time it also implies the violation of the fairness criterion when the test is used for diagnostic purposes (paradoxical result).

  b. The possibility to rotate the latent dimensions challenges both the fairness criterion and our interpretation of the sign of item discrimination.

  c. The question as to which items cause paradoxical results is a complicated function of the whole set of item discrimination parameters. An item $i$ which causes paradoxical results when embedded in a specific set of items need not cause paradoxical results when embedded in another set of items.

We will first turn to (a) and discuss the relation between positivity of item discrimination and the fairness criterion (6).

Positivity Implies Paradoxical Movements

First note that positivity of the entries of $a_i$ just states that an increase in any latent ability $\theta_j$ shifts the responses toward higher item scores (in a probabilistic sense). That is, we expect higher item scores to indicate higher proficiencies. It is crucial, however, to understand that the statement refers to the probabilistic relation between $\theta$ and a single item $U_i$.

In contrast to this expectation, which rests on the interpretation of the relation between $\theta$ and $u$ on a single item, the inference with respect to $\theta$ from the joint responses $u$ to all the items shows paradoxical results. More specifically, the paradox is inherent in any truly multidimensional scale of type (4) with nonnegative item discrimination. These MIRT models are commonly referred to as linearly compensatory models (Reckase, 2009; van der Linden, 2012).

Within the class of linearly compensatory models, the paradoxical effect has been firmly established (Hooker et al., 2009). In particular, for the factor analysis model it was shown that, under the assumptions that

  i. The matrix of factor loadings $A$ consists of nonnegative numbers

  ii. The precision parameters $\xi_i := \sigma_i^{-2}$ are positive (“no Heywood case”)

  iii. The matrix $A$ (with rows $a_i$, $i=1,\dots,k$) has full column rank $p>1$

the paradox is certain to occur, unless $A$ has simple structure. That is, in every factor analysis model satisfying (i), (ii), (iii) and which is not of simple structure, there is an item $j$ and a latent dimension $\theta_l$, such that increasing (decreasing) $u_j$ decreases (increases) the estimate $\hat{\theta}_l$ for $\theta_l$. This holds for both the weighted least squares estimator (MLE under normality) and the ordinary least squares estimator.

Note further that the statement is sharp in the following sense: Even if a test is essentially unidimensional or essentially of simple structure, that is, if there are dominant loadings on a single dimension or if cross-loadings are arbitrarily small, the paradoxical scoring effect will occur. That is, even in a scale of approximate unidimensionality/simple structure, there is always at least one item causing paradoxical results. Hence, the qualitative statement is strict. With respect to the quantification of the effect, one could however apply a continuity argument and thereby establish that the size of the movement in the paradoxically scored dimension(s) is small compared with the movement in the other dimension(s), provided the cross-loadings are small enough. However, the mathematical meaning of “small enough” does not necessarily align with our own intuition and judgment of what constitute small loadings and has to be checked analytically for the application at hand.

To sum up: Our expectation by inspection of the factor loadings (and also by relying on simple sumscores when scoring the subscales) might be that better performances are always rewarded, yet the behavior of the best statistical estimator does not agree with our expectation. This provides the first observation which calls our intuitive understanding into question. As will be demonstrated by a discussion of (b) and (c), we can even go further.

Remark

The formulas underlying the MLE-based scoring do not use any information regarding factor correlations, because the MLE approach is based on the conditional distribution of the item responses given the unknown factor scores. This conditional distribution (and therefore also the implied scoring rule) only depends on the matrix of factor loadings and the variances of the uniqueness terms. If, on the other hand, one adopts a method of factor score estimation which incorporates the knowledge on factor correlations (i.e., a Bayesian approach; Thomson factor scores), then paradoxical effects still occur within this altered framework, although their dependence on the model parameters is a more complicated function. In particular, simple structure is neither a necessary nor a sufficient condition to avoid paradoxical scoring in this framework. To give the reader an impression of the complications which may arise within this framework: In a two-dimensional setting, strong positive correlation can prevent paradoxical scoring, whereas (strong) pairwise positive correlations in a three-dimensional setting do not necessarily prevent paradoxical scoring. For details on this topic the reader is referred to Hooker (2010).

Fairness and Negative Item Discrimination

So far, a kind of incompatibility between the fairness criterion (6) and our intuitive judgment of positive item discrimination has been highlighted: Starting off with positive item discrimination, one necessarily arrives at an unfair estimate. In this section, we turn to elaborate point (b) and try the reverse: That is, we try to set up a fair classification scheme, using the given test structure, and look at the underlying pattern of item discrimination.

Suppose we are given a two-dimensional FA model ($\theta_1$: math ability, $\theta_2$: mental rotation) with item discrimination vectors $a_1^T=(1,0)$, $a_2^T=(1,1)$, $a_3^T=(0,1)$. As the test is not of simple structure, we already know that a paradox will occur. However, in this case we can be more specific and compute the (least-squares) estimate (Bartlett, 1938) $\hat{\theta}(u)$ explicitly ($c>0$):

$$\hat{\theta}(u)=cMu, \qquad M=\begin{pmatrix} 2 & 1 & -1 \\ -1 & 1 & 2 \end{pmatrix}.$$

(7)

The observation that each row of the matrix $M$ contains a negative entry implies that inference with respect to the original dimensions will result in paradoxical effects. Apart from a positive multiple $c$, the equations to estimate the two latent variables become

$$\hat{\theta}_1=2u_1+u_2-u_3$$

$$\hat{\theta}_2=-u_1+u_2+2u_3.$$

From this it is apparent that better performances on the third (first) item are detrimental with respect to the scoring on the first (second) dimension. However, by defining new composite dimensions $\psi_1:=\theta_1+\theta_2$ and $\psi_2:=2\theta_1+\theta_2$, it can also be seen that the MLEs for the composite dimensions—they are one-to-one functions of the original dimensions and therefore, the corresponding MLE property directly transfers to them—are given by

$$\hat{\psi}_1=\hat{\theta}_1+\hat{\theta}_2=u_1+2u_2+u_3$$

$$\hat{\psi}_2=2\hat{\theta}_1+\hat{\theta}_2=3u_1+3u_2,$$

and hence do not share paradoxical scoring effects. Thus, if our inference is based on an equally weighted composite ($\psi_1$) of the math and the rotation ability, then receiving a higher item score will always be beneficial for the individual. The same holds if the weight of the math dimension doubles the weight of the mental rotation dimension.
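
These computations are easy to verify numerically; the following R lines (our own check of the example) reproduce $M$ and the composite scorings:

    ## Least-squares scoring matrix for A with rows (1,0), (1,1), (0,1)
    A <- rbind(c(1, 0),
               c(1, 1),
               c(0, 1))
    M <- solve(crossprod(A)) %*% t(A)
    3 * M            # rows (2, 1, -1) and (-1, 1, 2), i.e., c = 1/3

    ## Scorings of psi1 = theta1 + theta2 and psi2 = 2*theta1 + theta2
    W <- rbind(c(1, 1),
               c(2, 1))
    3 * (W %*% M)    # rows (1, 2, 1) and (3, 3, 0)
    all(W %*% M >= 0)   # TRUE: fair in the sense of criterion (6)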

Now, because of the fact that the composites (or more precisely: their corresponding vectors of weights $(1,1)$, $(2,1)$) are linearly independent and thus form a basis, we can look at the matrix of factor loadings with respect to the newly chosen basis provided by the composite dimensions.

The new dimensions are given by

$$\psi:=\Lambda\theta, \qquad \Lambda=\begin{pmatrix} 1 & 1 \\ 2 & 1 \end{pmatrix},$$

and the new matrix of factor loadings with respect to the new constructs $\psi_1,\psi_2$ can be computed as follows:

$$B=A\Lambda^{-1}=\begin{pmatrix} -1 & 1 \\ 1 & 0 \\ 2 & -1 \end{pmatrix}.$$

Although we have arrived at a fair classification scheme (as was indicated above), the item discrimination vectors do not reflect this properly. For example, the first dimension $\psi_1$ contributes negatively to the solving of the first item, and thus one would expect that lower scores on that item should be rewarded. In fact, one could pose the question whether a researcher, who does not know about the initial structure, would label our estimates as fair. Moreover, practitioners could be tempted to score the first item negatively with regard to the first dimension $\psi_1$ when using simple sign heuristics. In sharp contrast to this, we know that the MLE (under normality) scores the first item positively. Note further the implication for the labeling of the first dimension: A practitioner would probably label the first latent dimension such that high item scores on the first item indicate low values of the latent construct. The MLE scoring behaves differently, in that high item scores are indicative of high values of the latent construct.
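
A two-line check in R (again our own illustration) makes the mixed sign pattern of the rotated loadings explicit:

    ## Rotated loading matrix B = A %*% solve(Lambda) from the example
    A <- rbind(c(1, 0), c(1, 1), c(0, 1))
    Lambda <- rbind(c(1, 1),
                    c(2, 1))
    B <- A %*% solve(Lambda)
    B   # rows (-1, 1), (1, 0), (2, -1): mixed signs despite fair scoring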

Is the example just given in any way artificial? No; in fact, we can always assume the existence of a full set of linearly independent composites, and a subsequent “rotation” of the original item discrimination vectors will result in at least one discrimination vector with a negative entry (proofs are available on request).

Therefore, there is a general discrepancy between our comprehension of fairness and the use of simple sign heuristics on the loading matrix: Each composite $\psi_1,\psi_2$ clearly fits into the common notion of fairness, yet if we view them in pairs, the sign pattern of item discrimination does not show that each dimension contributes positively to the score on each item.

Thus, in addition to the paradoxical movements in the presence of positive item discrimination (highlighted in the previous section), it was shown that negative item discrimination can lead to increasing estimates by changing the corresponding item score toward higher values. This once again runs counter to the interpretation of the relation between a single item and the latent ability: Of course, negativity of some entry of $a_i$ implies a decreasing chance of obtaining high item scores when the latent ability increases (assuming fixed values for the other dimensions). However, this does not prevent the (best) statistical estimate from increasing when the respective item score is increased. Furthermore, a coherent interpretation of item discrimination becomes even more difficult when viewed with respect to the fairness criterion: In fact, it was shown that the only item discrimination pattern which seems compatible with the fairness criterion (6) is an item discrimination pattern with “mixed signs,” that is, a pattern wherein negative as well as positive entries occur.

Flip-Flop Patterns

Before we leave this section, it seems worthwhile to point at yet another counterintuitive result (proven as a new result on paradoxical scoring in a more general setting in Appendix B and depicted in Figure 1) in the context of the example given by (7): From the remarks in the “Paradoxical Effects in Item Response Models” section, it is clear that the second item is responsible for the paradoxical result; that is, if we eliminated the second item, then a test of simple structure would result, which furthermore would exclude the occurrence of paradoxical movements. But surprisingly, the second item—though causing the proneness of the test to paradoxical results—is scored positively with respect to both latent dimensions (referring to the original dimensions), as the second column of (7) shows. Ironically, the items which exhibit paradoxical scorings are precisely the items which guarantee fair classifications when embedded in a different test structure (i.e., a test structure after removal of the second item and addition of some other items of similar type). Further, the mere fact that the second item is scored positively with respect to both dimensions can change when additional items are added. For instance, adding a single item with item discrimination $a_4:=(1,3)^T$ leads to the new estimate


Figure 1.

Changes of the sign of the scoring of the second item ($a_2=(1,1)$) with regard to the second dimension when additional item vectors (depicted in red) are added.

$$\hat{\theta}(u)=M_2u, \qquad M_2=\begin{pmatrix} 0.65 & 0.41 & -0.24 & -0.06 \\ -0.24 & -0.06 & 0.18 & 0.29 \end{pmatrix},$$

wherein the second item is now scored negatively with respect to the second dimension. This change in sign can be repeated if “appropriately” chosen new items are added to the test. For example, adding a further item with discrimination $a_5^T=(2,1)$ results in the following new estimates, wherein the second item is now scored positively:

$$\hat{\theta}(u)=M_3u, \qquad M_3=\begin{pmatrix} 0.25 & 0.13 & -0.13 & -0.13 & 0.38 \\ -0.13 & 0.02 & 0.15 & 0.31 & -0.10 \end{pmatrix}$$

The result of this induced “flip-flop pattern” (see also Figure 1)—which to the best of our knowledge has not been discussed previously—is worth stating: Increases in the item score are rewarded when the test is administered with test length $k$—yet adding a single item to the test can change this, so that decreasing item scores are rewarded. Hence, regarding the behavior of the MLE, the interpretation of an item’s discrimination parameter is not meaningful unless the whole test structure is taken into consideration. In particular, the discrimination parameter of an item is meaningless without further information about the specific embedding in a set of other items. We will further elaborate on this crucial observation in the conclusion.
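
The alternation is easy to reproduce; the following R lines (our own check) track the scoring of the second item on the second dimension across the three nested test structures:

    ## Sign of the scoring of item 2 (a2 = (1,1)) on dimension 2
    scoring <- function(A) solve(crossprod(A)) %*% t(A)
    A3 <- rbind(c(1, 0), c(1, 1), c(0, 1))
    A4 <- rbind(A3, c(1, 3))   # add a4 = (1, 3)
    A5 <- rbind(A4, c(2, 1))   # add a5 = (2, 1)
    sapply(list(A3, A4, A5), function(A) scoring(A)[2, 2])
    ## approx. 0.33, -0.06, 0.02: positive -> negative -> positive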

The Impact of Item Discrimination on Scoring

Many of the pitfalls which arise when interpreting item discrimination can be subsumed under the previously described phenomenon of paradoxical scoring effects. In essence, discrepancies with respect to the sign of the scoring are at the core of these very counterintuitive results. In this section, we broaden the scope and treat pitfalls regarding the (expected) magnitude of the scoring. The treatment will highlight new paradoxical effects with respect to the magnitude and will thus depict further challenges in terms of a proper interpretation of item discrimination.

We will once again use the classical factor analysis model to illustrate these points. The reader should however be aware that qualitatively similar results are provided by the M2PL or MGRM model. We only skip these models as the estimates do not possess closed form solutions and are therefore less suitable for a simple illustration of the core notions underlying the pitfalls. The interested (or perhaps skeptical) reader is referred to Appendix C, wherein we provide R code demonstrating some of the presented pitfalls with regard to these classes of models. Moreover, we invite the reader who has doubts concerning the generalizability of our results to use the additional simulation program, also provided in Appendix C, to create other matrices of factor loadings and apply a similar approach, that is, contrasting the actual behavior of estimates with the expectations derived from common sense heuristics. We did not come up with the subsequent counterexample by manipulation of matrix entries.

Suppose now that we are given a factor analysis model with the corresponding matrixof factor loadings:

$$A:=\begin{pmatrix} 1 & 1 & 3 \\ 5 & 3 & 1 \\ 2 & 2 & 4 \\ 1 & 1 & 1 \end{pmatrix}.$$

A direct computation of the MLE (for simplicity we assume equal measurement variance across the items, so that the MLE under normality is identical to the least-squares estimator) gives (see Chap. 9 of Mardia et al., 1979; or Chap. 2 of Lee, 2007):

$$C:=(A^TA)^{-1}A^T=\begin{pmatrix} 1.00 & 0.50 & -0.50 & -1.50 \\ -1.83 & -0.50 & 0.83 & 2.68 \\ 0.50 & 0.00 & 0.00 & -0.50 \end{pmatrix}.$$

An entry $c_{ij}$ of the matrix $C$ corresponds to the change of the estimate referring to the $i$th latent dimension induced by a one-unit increase in the $j$th item score. We can use this matrix to cast doubt on the general applicability of various commonly applied heuristics. For example, from an inspection of the entries of the loading matrix $A$ one would be tempted to say that the first item measures predominantly the third dimension. However, how could this line of reasoning be defended in the presence of the item’s strong impact on the estimates for dimensions with minor loadings (depicted in the first column of the matrix $C$)?

We may go further: If we follow common rules, then we might say that the third item measures the third dimension to a larger extent than it contributes information about the other dimensions. Nevertheless, the entry $c_{33}$ shows that increasing the item score does not change the estimate of the third dimension at all. Since a practitioner who uses commonly applied heuristics would probably not only score this item positively for the inference of the third dimension but would furthermore also label the third dimension according to the content of the third item, a bizarre result would arise—as the response of the test taker to this item is irrelevant (in the “correct” multidimensional setting when using the MLE) for inferring his or her ability in $\theta_3$-space—at least given the fixed test length (as was already noted toward the end of the “Flip-Flop Patterns” subsection, this scoring can change when test length changes).

A similarly striking example is given by the last item. Though one might expect that the item contributes equal information across the different dimensions, the corresponding changes in the estimate show a surprising amount of variation. An increase of the fourth item’s score results in a decrease (i.e., a paradoxical effect) of the estimates corresponding to the first and third dimensions, while increasing (i.e., no paradoxical effect) the estimate of the second dimension. Moreover, there are quite large differences in the absolute value of the implied changes. In fact, the variation of those absolute changes is larger than in the context of any other item.
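
The matrix $C$ can be reproduced with a few lines of R (our own check of the example):

    ## Scoring matrix for the 4x3 counterexample
    A <- rbind(c(1, 1, 3),
               c(5, 3, 1),
               c(2, 2, 4),
               c(1, 1, 1))
    C <- solve(crossprod(A)) %*% t(A)
    round(C, 2)   # entries match the matrix C above up to rounding in the last digit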

We limit the discussion to the above-mentioned discrepancies. However, each of the following expectations can be refuted by this example:

  1. A change of the first item score should affect the estimate of the third dimension more severely than the other dimensions.

    Fact: The impact on the second dimension is almost four times as large as the impact on the third dimension (−1.83 vs. 0.50).

  2. A change of the first item score should affect the estimate of the first and the second dimensions equally, as there are no differences in discrimination.

    Fact: The impact on the second dimension is almost twice as large in magnitude as the impact on the first dimension (−1.83 vs. 1.00). Moreover, even the sign of the scoring differs.

  3. A change of the second item score should have greatest impact on the estimate for the first dimension and least impact on the third dimension.

    Fact: The estimates corresponding to the first and the second dimensions are affected equally in magnitude (0.50 vs. −0.50).

  4. A change of the third item score should affect the estimate of the first and the second dimensions equally, as there is no difference in discrimination.

    Fact: The impact on the dimensions differs (−0.50 vs. 0.83). Moreover, the sign of the scoring differs.

  5. A change of the third item score should affect the estimate of the third dimension more severely than the other dimensions.

    Fact: The estimate does not change.

  6. A change of the fourth item score should affect all estimates equally.

    Fact: The change induced on the second dimension is approximately five times as large as the change induced on the third dimension (2.68 vs. −0.50). Additionally, two estimates show paradoxical scoring effects (−1.50, −0.50). Only the second dimension is not affected paradoxically.

  7. The second item should have greater impact overall than the last item, as the lengths of the corresponding vectors of factor loadings differ substantially ($|a_2|/|a_4|=\sqrt{35}/\sqrt{3}\approx 3.4$).

    Fact: Overall, the last item score affects the estimates of the dimensions more severely than the score of the second item.

  8. With respect to their absolute loading on the third dimension, the items rank as follows: 3, 1, 2/4.

    Fact: Only items 1 and 4 show any scoring effects with respect to the estimate of the third dimension.

In all the aforementioned examples, the actual scoring of the items differed from our expectation. This implies that classifications based on commonly applied rules of thumb differ from classifications based on a statistically coherent approach (i.e., scoring according to a multidimensional model). However, not only may the scoring and subsequent classifications differ but, even more importantly, the labeling of latent dimensions can be at odds with the actual behavior of the MLE. We will now demonstrate in a real data example the potential mislabeling which might arise when using heuristics and neglecting properties of the person parameter estimate.

Real Data Example

The SPS-J—the German adaptation of the Reynolds Adolescent Adjustment Screening Inventory of Reynolds (2001)—measures four dimensions (anxiety, antisocial behavior, problems with self-esteem, problems with the control of negative emotions [anger]), which may be used for the screening/diagnosis of adolescent adjustment problems. At first sight, the given labeling seems consistent with item content and the structure of the loading matrix. However, taking a look at the behavior of the MLE reveals the following: An increase in the item score of the first item decreases the estimate for the third latent dimension (problems with self-esteem). As the first item asks about the person’s drug consumption (with higher scores indicating more severe levels of drug consumption), we may conclude that a person might have received a more favorable diagnosis with respect to his or her self-esteem if his or her level of drug consumption were higher. Likewise, increasing the item score of an item which asks about feelings of depression increases the estimate corresponding to the self-esteem dimension. Of course, this seems inappropriate and calls for a reconsideration of the labeling of the latent dimensions.

Finally, we note that there are also some items where the corresponding behavior of the MLE—although not necessarily in strong contradiction to the labeling of the dimensions—indicates questionable classification decisions. For example, the mere fact that increasing the item score on an item asking about the refusal of doing homework assignments increases the test taker’s estimate on the self-esteem dimension does not pose a major contradiction to the labeling of the dimensions. However, with respect to subsequent classification issues there may still remain a problem: A subject receiving a negative classification with respect to his or her self-esteem could have avoided this by simply reporting more problems with the fulfillment of homework assignments. Of course, the actual number of test takers who could have gained a positive classification by this item score change (on only one item) depends on further factors and thus could be very low. However, one should be aware that regardless of the actual number of cases concerned by this change of classification, the more “philosophical” question as to whether it is acceptable to receive better classifications by reporting maladapted (and seemingly unrelated) behavior still remains. This last observation is, however, more closely tied to a discussion of fairness of estimates. As we are primarily concerned with the topic of item discrimination, we refer the reader to the general literature on the paradoxical effect (e.g., Hooker et al., 2009; van der Linden, 2012) for a more detailed discussion of the relevance of addressing the fairness of latent ability estimates.

Conclusion

In previous sections, the reader should have seen ample evidence to call his or her beliefs about the diagnostic consequences of the factor loading matrix into question. By “diagnostic” we thereby, of course, refer to the topic of person parameter estimation. To recap some of the main objections to our intuitive understanding of item discrimination, it was shown that

  1. Higher item scores in a test composed of items with nonnegative discrimination values, that is, where each dimension contributes positively to the solving of an item, are not generally rewarded; that is, some dimensions may exhibit lower estimates after the change (Hooker et al., 2009).

  2. Ensuring fair classification within a test originally composed of items with nonnegative loadings through proper weighting leads to a new test structure, wherein some dimensions contribute negatively to the solving of an item. As a matter of fact, multidimensional fair classification schemes in the sense of definition (6) are only compatible with mixed sign item discrimination.

  3. The direction an item is scored depends on the presence/absence of other items in the scale. Taken to its extreme, arbitrarily long “flip-flop” patterns can result, wherein positive and negative scoring of an item alternates when additional items are added successively.

  4. A change of an item score can have a more significant impact on the estimates of dimensions with low associated loading than on dimensions for which the item is a good indicator (according to the usual rule of thumb).

  5. Estimates of latent dimensions need not change at all by manipulation of some underlying item score, even if the item shows a remarkably high loading on a particular dimension.

  6. An item with equal discrimination along the various latent dimensions is not necessarily scored similarly with respect to the estimation of the underlying dimensions. For some dimensions paradoxical results can occur, whereas estimates of other dimensions do not share those paradoxical movements. Moreover, the magnitude of those changes can vary substantially—despite the presence of equal loadings.

  7. The length of the item discrimination vector is in general not a good indicator of its overall impact on scoring.

  8. The labeling of latent dimensions can be at odds with the actual behavior of the person parameter estimate.

Clarifying remark with regard to (7): This surprising result—see also Appendix C for a more systematic treatment of this topic via simulation—refers to the behavior of the point estimate for the person parameter. It should, however, be noted that this by no means implies that the magnitude of the item discrimination vector is not a useful indicator of item quality in terms of interval estimation, as it determines the size of uncertainty (via volumes of confidence regions). The latter can be seen by parametrizing the discrimination of an item via $a(\lambda):=\lambda a$ and computing the derivative of the determinant of the cross-product matrix. This shows that the determinant is an increasing function of $\lambda$. Thus, the volume of confidence regions shrinks whenever the length of a discrimination vector is increased.
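
One way to see the monotonicity (a short derivation of our own, writing $A_{-i}$ for the matrix of the remaining items’ discrimination vectors and assuming $A_{-i}$ has full column rank) is via the matrix determinant lemma:

$$\det\big(A_{-i}^T A_{-i} + \lambda^2 a_i a_i^T\big) = \det\big(A_{-i}^T A_{-i}\big)\,\big(1 + \lambda^2\, a_i^T (A_{-i}^T A_{-i})^{-1} a_i\big),$$

and since $(A_{-i}^T A_{-i})^{-1}$ is positive definite, the right-hand side is strictly increasing in $\lambda$; the volume of the usual confidence ellipsoid is proportional to the inverse square root of this determinant and therefore shrinks.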

Even though the above list should call the use of the commonly accepted rules of interpreting discrimination into question, one may wonder if most of the stated results rely on some pathologically chosen test structure. To refute this objection, we first want to point out that (1) can be proven to occur under very mild regularity conditions in every truly multidimensional IRT model (see Hooker et al., 2009) and that (2) can also be proven to occur in a general setting, that is, whenever computing weighted linear combinations to ensure fairness is possible.

Second, although the results stated in (3) to (8) were gained through a counterexample, we did not come up with this example through manipulations of the entries of the matrix. In fact, we have prepared a small simulation in Appendix C, which should convince the reader that the above points are not a specific property of the matrix we chose.

This simulation also shows something which goes beyond the previously presented points: namely, that an item with zero loading on a particular dimension has a remarkable influence on the estimate of that particular dimension. Figure 2 below highlights this point. For a three-dimensional test composed of $k=10$ items, it depicts, for an item with zero loading on the first dimension, the number of cases (among 1000 randomly chosen test structures, see Appendix C for details) in which the item’s influence on the first dimension was the $i$th lowest among the 10 items. Although one might expect a substantial amount of cases wherein the item’s contribution is the lowest among all the items of the test, the figure indicates that almost the contrary is true, as in most cases the item’s rank is 8, 9, or 10 (thus third largest impact, second largest impact, or even largest impact).


Figure 2.

Number of times (among 1,000 randomly sampled matrices of factor loadings of size $10\times 3$) the item with zero loading on the first dimension had the $i$th lowest impact, that is, rank $i$, on the estimate of the first dimension.

As a result of these observations, it is reasonable to advise the practitioner or researcher not to rely on common sense interpretation (especially in high-dimensional settings, wherein interpretation becomes increasingly more difficult) but to incorporate the behavior of the MLE as a function of the item scores in his or her judgment (see Appendix D for a description of how to access the scoring matrix in R). Beyond the factor analysis model this will of course be much more time-consuming than an inspection of the matrix of discrimination values, as the estimates do not have closed form solutions. A closer look at the behavior of the MLE might reveal that items which—at first glance—are judged to provide only a modest amount of information are in fact much more involved in the determination of the ability estimate for the particular dimension in question than common sense inspection of their discrimination may suggest. Sometimes this approach can even lead to a reconsideration of the labeling of latent dimensions, as demonstrated in the real data example. Thus, incorporating the behavior of the MLE can (a) challenge the labeling of latent dimensions, (b) help in gaining understanding of item discrimination in high dimensional settings, and (c) question the way of grouping items into subscales (in order to avoid truly multidimensional scoring) and the subsequently applied crude method of scoring (simple sum scores).

Note that this also extends to the judgment of factorial validity. For instance, in an analysis (see Chap. 7 of Kline, 1994) of the construct of authoritarian personality, a five-factor solution was fitted. The items consisted of established scales measuring other constructs such as, for example, neuroticism, guilt feelings, and intelligence. In interpreting the authoritarian factor, it is stated that

Finally, it should be noted that Eysenck’s L scale, measuring social desirability and Cattell’s B scale, intelligence, do not load the authoritarian factor. This destroys the objection to the authoritarian scales that they are influenced by intelligence and by the response set of social desirability. (Kline, 1994, p. 107)

This example clearly demonstrates that in general far-reaching conclusions are derived from an inspection of the loadings. In this case, it is concluded that intelligence and social desirability do not compromise the measurement of the authoritarian factor due to negligible loadings on that factor. However, it is quite possible that the scoring of the authoritarian factor is influenced heavily by the intelligence score and the social desirability score. In fact, we provide some evidence in Appendix C via simulation that in general there is no clear relation between the size of the loading and the impact of the item on the scoring for that particular dimension.

The most controversial observation when taking the behavior of the person parameter estimate into consideration refers to the way of determining the measurement properties of the items: The conventional way of labeling dimensions is based on the directed relation $\theta \rightarrow U_i$. That is, by inspection of the item discrimination parameters one establishes relationships of the form “a one unit change on the latent dimension $\theta_j$ changes the odds of solving the item by an amount of $x$.” Within this setting it is meaningful to state that loadings close to zero (or zero) imply that changes of the corresponding latent dimension do not have any impact on the probability of solving the item. (Although this applies only when fixing the value on other dimensions!) However, as was demonstrated in previous sections, the simplicity of this interpretation does not transfer to the actual scoring of the items, and the commonly applied rationale, as for example stated as

The usual procedures followed in factor interpretation are very simple. Those data variables with high factor loadings are considered to be “like” the factor in some sense and those with zero or near-zero loadings are treated as being “not like” the factor, whatever it is. Those variables that are “like” the factor, that is, have high loadings on the factor, are examined to find out what they have in common that could be the basis for the factor that has emerged. (p. 240)

should be called into question. Now, suppose that in contrast to this, we base our labeling of dimensions on the behavior of the MLE. That is, we use the reverse-directed relationship $U_i \rightarrow \theta$ (or, more precisely, the relationship $U_i \rightarrow \hat{\theta}$) to determine the measurement properties of the items. Or, stated another way: We simply adopt the method a “blind” (“blind” to our model) observer would use to infer the measurement properties of the items, namely by examining the way in which item score changes affect the scoring: An item $j$ is indicative of dimension $l$ if the person parameter estimate of the $l$th dimension varies with changes of the corresponding item score. When taking this approach, the measurement properties of an item depend on the presence or absence of other items. An example of this odd result was already given in the “Flip-Flop Patterns” subsection. The sign of the scoring of an item can alternate when further items are added and in fact will surely alternate if an appropriate additional discrimination vector is selected; see the proof in Appendix B. The measurement properties of an item depend on the particular embedding of the item in a set of remaining items. Strictly speaking, within this approach there is no such thing as an inherent quantity the item is measuring but merely a situation-specific (i.e., not invariant) meaning of the item. We are fully aware that the notion of noninvariant item properties runs counter to the usual comprehension of scale development. However, we note that such an approach is fully consistent with the way a test taker would infer the meaning of the items. That is, by looking at the impact of an item score change on the estimates corresponding to the various dimensions (without reference to any latent model). Whether one agrees with the second approach or not, one should be aware of major discrepancies which can arise between subjective expectations deduced from the $\theta \rightarrow U_i$ relation and the actual behavior of estimates. In particular, the common approach of deeming items as inappropriate if their corresponding loading is low should be called into question.

Although some readers might find the actual scoring of the person parameter counterintuitive and perhaps even questionable, we emphasize the following: The underlying IRT/FA model is identical to a linear regression in the case of factor analysis (see, e.g., Chap. 9 of Mardia et al., 1979) or a logistic regression model in the case of the M2PL! Therefore, it is striking to see that in one domain (“regression”) the usage of the MLE is well established, whereas in the other domains (“IRT/test scoring”) alternative measures (simple sumscores) are prevailing. From the viewpoint of a statistician and from the formal equality of the models, it is rather bizarre that measures like simple sumscores are advocated: Has there ever been a proposal to infer a regression coefficient in the context of linear regression analysis by the following method? Compute a new design matrix, wherein only one nonzero entry within each row is allowed (the analogue to forcing items into subscales and neglecting cross loadings), and then infer the regression coefficient of some predictor variable by simply adding up the values of the dependent variable for all those observations with corresponding nonzero entries in the respective column of the design matrix. Of course, it would be quite bizarre to see such a method in the context of an ordinary regression. Yet it is accepted and even favored within the practical usage of the IRT framework.

Appendix A

The Factor Analysis Model as an Item Response Theory Model for Continuous Responses

We hereby briefly sketch the embedding of the factor analysis (FA) model within the item response theory (IRT) framework. For details, we refer to Lee (2007) and Samejima (1973).

Note first that the FA equation:

$$u=A\theta+\varepsilon$$

when supplemented with the usual normality assumptions implies

$$u\,|\,\theta \sim N\!\big(A\theta,\ \Sigma:=\operatorname{diag}\big((\sigma_i^2)_{i=1,\dots,k}\big)\big),$$

wherein the diagonal structure of the conditional covariance matrix follows from the mutual independence of the measurement error terms, $\operatorname{Cov}(\varepsilon_i,\varepsilon_j)=\delta_{ij}\sigma_i^2$ ($\delta_{ij}$ denoting Kronecker’s delta), and the independence of $\theta$ and $\varepsilon$. Thus, given a factor analysis model with known item parameters (i.e., estimated from previous test calibration), the conditional density of the continuous responses is (up to a factor independent of $(u,\theta)$)

$$f(u_1,\dots,u_k\,|\,\theta)=\exp\!\Big(-\tfrac{1}{2}(u-A\theta)^T\Sigma^{-1}(u-A\theta)\Big)=\prod_{i=1}^{k}\exp\!\Big(-\tfrac{1}{2\sigma_i^2}(u_i-a_i^T\theta)^2\Big),$$

(A1)

with $A$ denoting the $k\times p$ matrix of factor loadings ($i$th row equals $a_i^T$) and $\sigma_i^2$ denoting the measurement error variance corresponding to the $i$th item. An inspection of (A1) shows that local independence between the items conditional on the latent abilities holds and that the resulting likelihood resembles the form given in (4) (e.g., each term is constant on a certain hyperplane). The only major change which has to be accounted for is the substitution of the probability function $P(U_1=u_1,\dots,U_k=u_k\,|\,\theta)$ by a corresponding density function.
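
For completeness, maximizing the logarithm of (A1) in $\theta$ amounts to a weighted least squares problem (a standard step, spelled out here for the reader’s convenience):

$$\hat{\theta}(u)=\arg\min_{\theta}\,(u-A\theta)^T\Sigma^{-1}(u-A\theta)=(A^T\Sigma^{-1}A)^{-1}A^T\Sigma^{-1}u,$$

which reduces to the ordinary least squares scoring $(A^TA)^{-1}A^Tu$ used in the main text whenever all $\sigma_i^2$ are equal.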

Remark: Note that the whole multidimensional graded response model (MGRM) can be deduced from an underlying FA model (see Lee, 2007) by postulating a “hidden” continuous measurement following the FA model in conjunction with thresholds on the latent continuum which determine the type of response that is actually observed. Thus, when it comes down to estimating the person parameter, the FA model is just the “first stage part” of an MGRM model.

Appendix B

Sign Reversals

In this section, we provide, within the context of the ordinary least-squares estimator, a theoretical motivation for the phenomenon described in the "Flip-Flop Patterns" subsection (restricting ourselves to the case $p = 2$). That is, we show that given a positive scoring of an item $j$ ($a_j > 0$) with respect to the first dimension, it is always possible to find a new item such that the prolonged test, that is, the scale including the newly defined item, scores the $j$th item negatively with regard to the first dimension. This establishes the existence of sign switches for arbitrary tests, generalizing the result of the "Flip-Flop Patterns" subsection. The proof also illuminates a specific mechanism for inducing this flip-flop pattern: Choose an additional item with sufficiently precise measurement properties (in terms of the remark on interval estimation given in the conclusion), that is, an item whose discrimination vector has sufficiently high magnitude and whose angle with the first coordinate axis is sufficiently small; then the sign of the scoring will change. The details of this informal description are given in the following proof.

Proof

Let $A$ denote the $(k \times 2)$ loading matrix of full column rank, and let $j \in \{1, \ldots, k\}$ be an item index such that the $j$th item is scored positively with regard to the first dimension. Note that the estimate of $\theta$ is given by

$\hat{\theta}(u) = (A^T A)^{-1} A^T u.$

Therefore, a unit increase of the response vector in the $j$th component changes the estimate by an amount of

$\hat{\theta}(u + e_j) - \hat{\theta}(u) = (A^T A)^{-1} A^T (u + e_j) - (A^T A)^{-1} A^T u = (A^T A)^{-1} a_j,$

wherein $a_j$ is the $j$th row of the loading matrix (written as a column vector).

Therefore, the fact that the $j$th item is scored positively with regard to the first dimension means that

$e_1^T (A^T A)^{-1} a_j > 0$

holds.

The first row of the inverse of the cross-product matrix is, up to a positive multiple (all following equality signs have to be interpreted as "up to a positive multiple"), given by

$e_1^T (A^T A)^{-1} = \left(\sum_i a_{i,2}^2,\; -\sum_i a_{i,1} a_{i,2}\right).$

Hence, the given condition of positive scoring boils down to the inequality

$e_1^T (A^T A)^{-1} a_j = a_{j,1} \cdot \sum_i a_{i,2}^2 - a_{j,2} \cdot \sum_i a_{i,1} a_{i,2} > 0. \quad (B1)$

We now add an item with loading vector $a_{k+1}(c, \lambda) := (c\lambda, \lambda)$ ($\lambda > 0$, $c > 0$). Then the scoring with regard to the new test will be given by

$\hat{\theta}(u) = (B^T B)^{-1} B^T u,$

wherein $B = B(c, \lambda)$ denotes the $(k+1) \times 2$ loading matrix whose first $k$ rows equal the rows of the matrix $A$ and whose last row equals $a_{k+1}(c, \lambda)$. The same argument as before shows that increasing the response vector in the $j$th component changes the first component of the estimate by an amount of

$e_1^T (B^T B)^{-1} a_j.$

The first row of the inverse of the new cross-product matrix is, up to a positive multiple, given by

$e_1^T (B^T B)^{-1} = \left(\sum_i a_{i,2}^2 + \lambda^2,\; -\sum_i a_{i,1} a_{i,2} - c\lambda^2\right).$

Multiplying this with $a_j$ yields the expression

$e_1^T (B^T B)^{-1} a_j = a_{j,1} \cdot \left(\sum_i a_{i,2}^2 + \lambda^2\right) - a_{j,2} \cdot \left(\sum_i a_{i,1} a_{i,2} + c\lambda^2\right).$

By letting $c \to \infty$, it can be seen that eventually the inequality

$a_{j,1} \cdot \left(\sum_i a_{i,2}^2 + \lambda^2\right) - a_{j,2} \cdot \left(\sum_i a_{i,1} a_{i,2} + c\lambda^2\right) < 0$

will hold, since $a_{j,2} > 0$ by the assumption $a_j > 0$. This establishes the claim.
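The following numerical illustration of the proof (our own sketch; the loading matrix is an arbitrary example meeting the assumptions) shows the sign reversal induced by an additional item with loading vector $(c\lambda, \lambda)$ for large $c$:

A <- matrix(c(0.8, 0.2,
              0.6, 0.4,
              0.5, 0.7), ncol = 2, byrow = TRUE)
scoring <- function(L) solve(t(L) %*% L) %*% t(L)
scoring(A)[1, 1]                     # positive: item 1 scored positively on dimension 1

lambda <- 1; cc <- 50                # large c
B <- rbind(A, c(cc*lambda, lambda))  # prolonged test with the additional item
scoring(B)[1, 1]                     # negative: the sign has flipped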

Appendix C

R-Code for Simulation

In this section, a sketch of an R simulation is presented, showing that the counterintuitive results obtained in the "The Impact of Item Discrimination on Scoring" section are not a special property of the chosen matrix. In particular, the second part of the simulation (Case 2 below) adds the following key observation: An item with zero loading on a particular dimension can nevertheless have a remarkable impact on the estimate for that dimension.

Finally, a short sketch of a corresponding simulation in the M2PL model is given. Here we use a longer test ($k = 30$) for the demonstration in order to lower the number of cases in which the MLE for the person parameter does not exist.

set.seed(67983)

### FA model

### 1. Scoring a vector with equal loading across the dimensions ###

scoring <- NULL
z <- c(1, 1, 1)/2   # item with equal loading on all three dimensions
k <- 10
for (i in 1:1000) {
  a <- runif((k - 1)*3, 0, 1)              # random loadings for the remaining items
  a <- c(a, z)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  estimate <- solve(t(A) %*% A) %*% t(A)   # least-squares scoring matrix
  scoring <- c(scoring, estimate[, k])     # scoring of the equal-loading item
}
scoring <- matrix(scoring, ncol = 3, byrow = TRUE)

# number of times the three scoring weights do (not) share a common sign
signrev <- apply(scoring, 1, function(x){ all(x <= 0) | all(x >= 0) })
table(signrev)

# highest quotient of the magnitudes and highest difference in magnitude
quot_magnitude <- apply(scoring, 1, function(x)
  { x[which.max(abs(x))]/x[which.min(abs(x))] })
mean(quot_magnitude > 4)

diff_magnitude <- apply(scoring, 1, function(x)
  { x[which.max(abs(x))] - x[which.min(abs(x))] })
plot(density(diff_magnitude, kernel = "gaussian"))

### 2. Scoring a vector with zero loading on a particular dimension ###

ranking <- NULL
k <- 10
for (i in 1:1000) {
  a <- runif(k*3, 0, 1)
  a[1] <- 0                                # first item: zero loading on dimension 1
  A <- matrix(a, ncol = 3, byrow = TRUE)
  estimate <- solve(t(A) %*% A) %*% t(A)
  r1 <- rank(abs(estimate[1, ]))           # rank of each item's impact on dimension 1
  ranking <- c(ranking, r1[1])             # rank of the zero-loading item
}
table(ranking)

### 3. Magnitude of scoring in dependence on the
### length of the discrimination vector ###

ranking <- NULL
k <- 10
for (i in 1:1000) {
  a <- runif(k*3, 0, 1)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  len <- apply(A, 1, function(x){ sum(x^2) })   # squared length of each loading vector
  index <- which.max(len)                       # item with the longest vector
  estimate <- solve(t(A) %*% A) %*% t(A)
  r1 <- rank(apply(estimate, 2, function(x){ sum(x^2) }))   # magnitude of each item's scoring
  ranking <- c(ranking, r1[index])
}
table(ranking)

### 4. Item impact largest on dimension with highest loading?
### And lowest for dimension with smallest loading?

hit <- NULL
hit_low <- NULL
k <- 10
for (i in 1:1000) {
  a <- runif(k*3, 0, 1)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  highest <- apply(A, 1, which.max)    # per item: dimension with highest loading
  lowest <- apply(A, 1, which.min)     # per item: dimension with smallest loading
  estimate <- solve(t(A) %*% A) %*% t(A)
  impact_max <- apply(abs(estimate), 2, which.max)   # per item: dimension most affected
  impact_min <- apply(abs(estimate), 2, which.min)   # per item: dimension least affected
  hit <- c(hit, mean(impact_max == highest))
  hit_low <- c(hit_low, mean(impact_min == lowest))
}
mean(hit)
mean(hit_low)

### 5. Item with equal loading for two dimensions: differences in scoring?

diff <- numeric(1000)
sign <- numeric(1000)
k <- 10
for (i in 1:1000) {
  a <- runif((k - 1)*3, 0, 1)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  A <- rbind(A, c(0.3, 0.3, 0.9))      # last item: equal loading on dimensions 1 and 2
  estimate <- solve(t(A) %*% A) %*% t(A)
  diffs <- apply(estimate, 2, function(x){ abs(x[1] - x[2]) })
  diff[i] <- rank(diffs)[k]            # rank of the equal-loading item's scoring difference
  sign[i] <- (estimate[1, k]*estimate[2, k]) > 0   # same sign on both dimensions?
}
table(diff)
mean(sign)

### M2PL model

### 6. Scoring a vector with equal loading across the dimensions ###

scoring <- NULL
z <- c(1, 1, 1)/2
k <- 30
for (i in 1:1000) {
  a <- runif((k - 1)*3, 0, 1)
  a <- c(a, z)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  response <- c(ifelse(runif(k - 1, 0, 1) < 0.5, 1, 0), 0)   # last item initially incorrect
  estimate <- glm(response ~ A - 1, family = binomial(link = "logit"))$coef
  response[k] <- 1                                           # switch last item to correct
  estimate2 <- glm(response ~ A - 1, family = binomial(link = "logit"))$coef
  scoring <- c(scoring, estimate2 - estimate)                # change of the MLE
}
scoring <- matrix(scoring, ncol = 3, byrow = TRUE)

# number of times the three changes do (not) share a common sign
signrev <- apply(scoring, 1, function(x){ all(x <= 0) | all(x >= 0) })
table(signrev)

### 7. Scoring a vector with zero loading on a particular dimension ###

rank_scoring <- NULL
diff <- NULL
ind_paradox <- NULL
k <- 30
for (i in 1:1000) {
  a <- runif(k*3, 0, 1)
  a[1] <- 0                              # first item: zero loading on dimension 1
  A <- matrix(a, ncol = 3, byrow = TRUE)
  A <- diag(1/apply(A, 1, function(x){ sqrt(sum(x^2)) })) %*% A   # normalize rows to unit length
  response <- ifelse(runif(k, 0, 1) < 0.5, 1, 0)
  response1 <- response
  response2 <- response
  response1[1] <- 0
  response2[1] <- 1
  est1 <- glm(response1 ~ A - 1, family = binomial(link = "logit"))$coef
  est2 <- glm(response2 ~ A - 1, family = binomial(link = "logit"))$coef
  diff <- c(diff, est2 - est1)
  ind_paradox <- c(ind_paradox, which((est2 - est1) < 0))   # dimensions with paradoxical (negative) change
  rank_scoring <- c(rank_scoring, rank(abs(est2 - est1)))
}
rank_scoring <- matrix(rank_scoring, ncol = 3, byrow = TRUE)
table(rank_scoring[, 1])
table(ind_paradox)

### 8. Magnitude of scoring and its dependence on the
### length of the discrimination vector ###

scoring <- NULL
k <- 30
for (i in 1:1000) {
  a <- runif(k*3, 0, 1)
  A <- matrix(a, ncol = 3, byrow = TRUE)
  len <- apply(A, 1, function(x){ sum(x^2) })
  ord <- order(len)
  A <- A[ord, ]                          # sort items by squared vector length
  response <- ifelse(runif(k, 0, 1) < 0.5, 1, 0)
  diff <- NULL
  for (l in 1:k) {
    response1 <- response
    response2 <- response
    response1[l] <- 0
    response2[l] <- 1
    est1 <- glm(response1 ~ A - 1, family = binomial(link = "logit"))$coef
    est2 <- glm(response2 ~ A - 1, family = binomial(link = "logit"))$coef
    diff <- c(diff, sum(abs(est2 - est1)))   # total impact of item l on the estimate
  }
  scoring <- c(scoring, rank(diff))
}
scoring <- matrix(scoring, ncol = k, byrow = TRUE)

Appendix D

Accessing the Scoring in R

As suggested in the "Conclusion" section and demonstrated in the "Real-Data Example" section, information regarding the scoring can provide valuable insights when checking the labeling of latent dimensions or when interpreting factor analysis output. Yet when fitting FA models in R, the scoring scheme is usually not an integral part of the standard output and must be obtained separately. In this appendix, we demonstrate how the scoring pattern can be accessed within commonly applied model-fitting routines. For illustrative purposes, we highlight the computation of the scoring matrix within an exploratory factor analysis framework (based on the function factanal) and within a confirmatory framework (based on lavaan; Rosseel, 2012). We use the mental ability testing data of Holzinger and Swineford (1939) to illustrate the computations. More specifically, we restrict ourselves to the data corresponding to the math and the memory ability domains and fit (1) a two-dimensional exploratory factor analysis model and (2) a three-dimensional confirmatory bifactor model to the data. Note that this serves only to illustrate the computation of the scoring scheme; we do not intend to deduce claims about these mental abilities from the subsequent analysis (nor do we check the model fit).

# Load data and create a subset Y containing only math- and memory-
# domain related data.
library(MBESS)
data(HS.data)
HS.data <- HS.data[HS.data$school != "Pasteur", ]
mem <- c("wordr", "numberr", "figurer", "object", "numberf", "figurew")
math <- c("deduct", "numeric", "problemr", "series", "arithmet")
Y_mem <- HS.data[, colnames(HS.data) %in% mem]
Y_math <- HS.data[, colnames(HS.data) %in% math]
Y <- cbind(Y_mem, Y_math)

# EFA with function factanal()
fa_model <- factanal(Y, factors = 2)
L <- fa_model$loadings[1:11, 1:2]    # get factor loadings
U <- diag(fa_model$uniquenesses)     # uniquenesses as a diagonal matrix
W <- solve(U)                        # weight matrix = inverse of U
round(solve(t(L) %*% W %*% L) %*% t(L) %*% W, 2)   # MLE/WLS factor scoring matrix

# CFA (bifactor model) with lavaan:
library(lavaan)
HS.model <- 'f1 =~ deduct + numeric + problemr + series + arithmet +
                   wordr + numberr + figurer + object + numberf + figurew
             f2 =~ wordr + numberr + figurer + object + numberf + figurew
             f3 =~ deduct + numeric + problemr + series + arithmet'

# for identifiability use orthogonal factors
mem_math <- cfa(HS.model, data = Y, std.lv = TRUE, orthogonal = TRUE)

# Get relevant quantities via lavInspect()
estimates <- lavInspect(mem_math, what = "est")
L <- estimates$lambda   # get factor loadings
U <- estimates$theta    # residual variance matrix (diagonal here)
W <- solve(U)           # weight matrix = inverse of U
round(solve(t(L) %*% W %*% L) %*% t(L) %*% W, 2)   # MLE/WLS factor scoring matrix

Notes

1. We use the term latent proficiency only for brevity. In fact, any latent construct can be substituted for the term latent proficiency without altering the discussion.

2. Cases wherein the scale is decomposable into separate unidimensional scales are excluded from this.

3. This follows from the fact that the MLE of a newly defined parameter vector $\psi := \Lambda\theta$ is given by the corresponding evaluation of the transformation at the MLE of the original parameter, that is, $\hat{\psi} = \Lambda\hat{\theta}$.

4. The matrix of factor loadings (with nonnegative entries) can be found in Hampel and Petermann (2006). Our derivations are based on a computation of the least squares estimator for this loading matrix.

5. Using a classical separation argument from convex analysis (see Theorems 21.1 and 21.3 of Rockafellar, 1970), the existence of these "fair" composites for the FA model can be established.

Footnotes

Declaration of Conflicting Interests: The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding: The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs: Pascal Jordan https://orcid.org/0000-0001-6394-076X

Martin Spiess https://orcid.org/0000-0003-1855-643X

References

  • Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., & Sanford, R. N. (1950). The authoritarian personality. New York, NY: Harper.
  • Bartlett, M. S. (1938). Methods of estimating mental factors. Nature, 141, 609-610.
  • Cattell, R. B. (1952). Factor analysis. New York, NY: Harper.
  • Cattell, R. B. (1966). Handbook of multivariate experimental psychology. Chicago, IL: Rand McNally.
  • Comrey, A. L., & Lee, H. B. (1992). A first course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
  • Costello, A. B. (2009). Getting the most from your analysis. Pan-Pacific Management Review, 12, 131-146.
  • Draper, N. R., & Smith, H. (1998). Applied regression analysis (3rd ed.). New York, NY: Wiley.
  • Hampel, P., & Petermann, F. (2006). Fragebogen zum Screening psychischer Störungen im Jugendalter (SPS-J). Zeitschrift für Klinische Psychologie und Psychotherapie, 35, 204-214.
  • Holland, P. W. (1981). When are item response models consistent with observed data? Psychometrika, 46, 79-92.
  • Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577-601.
  • Holzinger, K. J., & Swineford, F. (1939). A study in factor analysis: The stability of a bi-factor solution (Supplementary Educational Monographs No. 48). Chicago, IL: University of Chicago.
  • Hooker, G. (2010). On separable tests, correlated priors, and paradoxical results in multidimensional item response theory. Psychometrika, 75, 694-707.
  • Hooker, G., & Finkelman, M. (2010). Paradoxical results and item bundles. Psychometrika, 75, 249-271.
  • Hooker, G., Finkelman, M., & Schwartzman, A. (2009). Paradoxical results in multidimensional item response theory. Psychometrika, 74, 419-442.
  • Jordan, P., & Spiess, M. (2012). Generalizations of paradoxical results in multidimensional item response theory. Psychometrika, 77, 127-152.
  • Kline, P. (1994). An easy guide to factor analysis. London, England: Routledge.
  • Lee, S. Y. (2007). Structural equation modeling: A Bayesian approach (Vol. 711). New York, NY: Wiley.
  • Lord, F. M., Novick, M. R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
  • Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London, England: Academic Press.
  • Mulder, J., & van der Linden, W. J. (2009). Multidimensional adaptive testing with optimal design criteria for item selection. Psychometrika, 74, 273-296.
  • Reckase, M. (2009). Multidimensional item response theory. New York, NY: Springer.
  • Reynolds, W. M. (2001). Reynolds Adolescent Adjustment Screening Inventory—RAASI. Odessa, FL: Psychological Assessment Resources.
  • Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
  • Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48. doi:10.18637/jss.v048.i02
  • Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometrika Monograph Supplement No. 17). Psychometrika, 34(4, Pt. 2), 100.
  • Samejima, F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203-219.
  • Searle, S. R., Casella, G., & McCulloch, C. E. (2006). Variance components. New York, NY: Wiley.
  • Segall, D. O. (2000). Principles of multidimensional adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 27-52). Dordrecht, Netherlands: Kluwer Academic.
  • Thurstone, L. L. (1931). Multiple factor analysis. Psychological Review, 38, 406-427.
  • Thurstone, L. L. (1947). Multiple factor analysis. Chicago, IL: University of Chicago Press.
  • van der Linden, W. J. (2012). On compensation in multidimensional response modeling. Psychometrika, 77, 21-30.
  • van Rijn, P. W., & Rijmen, F. (2012). A note on explaining away and paradoxical results in multidimensional item response theory (ETS Research Report RR-12-13). Retrieved from https://www.ets.org/Media/Research/pdf/RR-12-13.pdf

