Comparing “visual” effect size indices for single-case designs

Effect size indices are indispensable for carrying out meta-analyses and can also be seen as an alternative for making decisions about the effectiveness of a treatment in an individual applied study. The desirable features of procedures for quantifying the magnitude of an intervention effect include educational/clinical meaningfulness, ease of calculation, insensitivity to autocorrelation, and low false alarm and miss rates. Three effect size indices related to visual analysis are compared according to these criteria. The comparison is made by means of data sets with known parameters: degree of serial dependence, presence or absence of general trend, and changes in level and/or in slope. The percent of nonoverlapping data showed the highest discrimination between data sets with and without intervention effect. When autocorrelation or trend is present, the percentage of data points exceeding the median may be a better option for quantifying the effectiveness of a psychological treatment.

Single-case designs present problems both for the data analysis of a specific study and for the quantitative integration of different studies. Replicating across subjects and settings in order to obtain evidence on the strength of the intervention is useful only when summary measures are available for use in meta-analyses.
The difficulties in the analysis of single-case designs are related to the small number of observations usually available (Huitema, 1985) and to the serial dependence between the measurements obtained from the same experimental unit (Busk & Marascuilo, 1988; Matyas & Greenwood, 1991, 1997; Parker, 2006). Whether statistically significant or not, autocorrelation has been claimed to affect the analytical techniques employed (Busk & Marascuilo, 1988; Sharpley & Alavosius, 1988; Suen, 1987; Suen & Ary, 1987). Evidence indicates that serial dependence alters the performance of procedures as diverse as ANOVA (Toothaker, Banz, Noble, Camp, & Davis, 1983), the split-middle method (Crosbie, 1987), and randomization tests (Gorman & Allison, 1997; Sierra, Solanas, & Quera, 2005). On the other hand, a p-value is not sufficient for determining the effectiveness of a treatment in an individual study, given the disadvantages of this indicator (Cohen, 1990, 1994; Kirk, 1996; Rosnow & Rosenthal, 1989; Wilkinson & The Task Force on Statistical Inference, 1999). Clinical, educational, and social researchers need more meaningful information than that provided by statistical significance. Visual analysis, as an alternative, is more subjective and does not allow quantification. Moreover, it has been found to be distorted by the presence of serial dependence (Jones, Weinrott, & Vaught, 1978; Matyas & Greenwood, 1990). Effect size is an objective measurement that can be used to quantify the relationship between the treatment and the behavior of interest.
In contrast with p-values, effect size indices are useful for documenting results for subsequent meta-analysis and power analysis (Parker & Hagan-Burke, 2007b). The following advantages of effect size have been stated: a) it is not systematically affected by sample size (Parker & Brossart, 2003); b) it focuses on the strength of association between the independent and the dependent variables, instead of centering on the null hypothesis (Kromrey & Foster-Johnson, 1996); c) it allows treatments to be compared (Parker & Hagan-Burke, 2007b); and d) it is possible to construct confidence intervals about the effect size (Kirk, 1996).
The most widely known effect size indices, based on standardized mean differences (e.g., Cohen's d, Hedges' g, Glass' Δ) and measures of association (e.g., $\eta^2$, $\omega^2$, $R^2$), were not developed for single-case designs but rather for designs involving group comparisons and, thus, focus only on the average levels of behavior in the different conditions. Nonetheless, there are also procedures conceived for N = 1 designs, some of them based on regression analysis and others closely related to visual analysis. It is possible to convert some effect size indices into others (Friedman, 1982), which allows the comparison of meta-analyses using different measures. The bibliographic search we performed suggests that visually based indices are applied more often in meta-analyses (e.g., Bellini, Peters, Benner, & Hopf, 2007; Mathur, Kavale, Quinn, Forness, & Rutherford, 1998; Scruggs & Mastropieri, 1994; Scruggs, Mastropieri, Forness, & Kavale, 1988) than regression-based methods (Allison, Faith, & Franklin, 1995; Skiba, Casey, & Center, 1986). This could be due to the advantages of visual indices, such as ease of calculation and greater interpretability from a clinical and educational perspective.

Regression-based effect size indices
The regression-based procedures incorporate predictor variables in order to model changes in level and in slope and also try to control for extraneous variables such as trends. The following procedures are among the most studied in the scientific literature:
1) Gorsuch's (1983) trend analysis includes time as a covariate and eliminates its influence prior to testing for change in level.
2) White, Rusch, Kazdin, and Hartmann's (1989) d, taking into consideration the correction presented in Faith, Allison, and Gorman (1997), compares two predicted values: the last treatment phase point according to the baseline phase regression equation and the last treatment phase point as predicted by the treatment phase regression equation. The model also takes into account the possible relation between time and the measured behavior.
3) Center, Skiba, and Casey's (1985-1986) model, in contrast with the abovementioned procedures, can account for changes in both level and slope while controlling for the presence of trend. The stated limitations of this procedure are that it yields more than one index of effect magnitude and that it cannot produce a negative d.
4) Allison and Gorman's (1993) model aims to improve on the previous technique by estimating trend solely from the baseline phase and by making the sign of the effect size index (negative or positive) correspond to the type of treatment effect (reducing or increasing the behavior of interest, respectively). A shortcoming of the model is the possible overestimation of effect size.
Common drawbacks of the regression-based procedures are their parametric assumptions; there is also evidence that, despite their conceptual appropriateness, those models do not perform as well as simpler indices (Manolov & Solanas, 2008).

Visual effect size indices
These effect size indices are based on a criterion employed in visual analysis to decide on the effectiveness of a treatment: the amount of overlap between the data points pertaining to the baseline and treatment phases. Their attractiveness to applied researchers is related to their ease of calculation and to the fact that visual inspection is still the most commonly applied single-case data analysis technique (Parker, Cryer, & Byrns, 2006). Some of the procedures proposed for use in psychological studies are:
1) Scruggs, Mastropieri, and Casto's (1987) percent of nonoverlapping data (hereinafter, PND). PND is based on the proportion of treatment phase measurements greater than the highest baseline phase data point. It has been criticized for ignoring all phase A data points except one, which is why the following two indices were proposed.
2) Ma's (2006) percentage of data points exceeding the median (hereinafter, PEM). PEM was proposed to correct some of the potential drawbacks of PND, such as its sensitivity to floor or ceiling effects, while maintaining its advantages. As its name suggests, this index computes the percentage of treatment measurements greater than the baseline phase median.
3) Parker, Hagan-Burke, and Vannest's (2007) percentage of all nonoverlapping data (hereinafter, PAND). PAND was introduced as an alternative to PND for larger data sets. It takes into account all data points and counts the minimum number of measurements that need to be removed in order to obtain series with no overlap. The ratio between the remaining data points and the series' length is the basis of the index. The authors also suggest that the index can be converted into a Phi effect size index or an improvement rate difference.
The objective of the present study was to extend the scientific literature (e.g., Parker & Hagan-Burke, 2007a) by assessing the performance of the three effect size measures for AB designs in the presence of different degrees of autocorrelation. We aimed to explore which index discriminates best between distinct data patterns; an additional purpose was to evaluate the influence of series' length, following Campbell's (2004) suggestions. As the estimation and hypothesis testing of serial dependence from real data can be problematic (Huitema & McKean, 1991; Matyas & Greenwood, 1991), we decided to test the effect size procedures with data constructed with known parameters (i.e., serial dependence, trend, level change, slope change), a method that has already been applied in single-case effect size studies (Manolov & Solanas, 2008; Parker & Brossart, 2003).

Design selection
The study focused on AB designs with several series' lengths (N) and phase lengths ($n_A$ and $n_B$), short enough to be feasible in applied settings where the temporal cost has to be taken into consideration. We chose the following values in order to cover a range of possible "short series":

Data generation
For each series' length we generated data sets with different patterns, defined by the presence or absence of general trend, change in level and/or in slope.
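The statistical model used was suggested by Huitema and McKean (2000, 2007):
$$y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \beta_3 SC_t + \varepsilon_t$$
where:
$y_t$: value of the dependent variable at moment t;
$\beta_0$: intercept;
$\beta_1$: coefficient associated with general trend;
$\beta_2$: coefficient associated with level change;
$\beta_3$: coefficient associated with slope change;
$T_t$: value of the time variable at moment t (takes values from 1 to N);
$D_t$: dummy variable for level change, set to 0 for phase A and to 1 for phase B;
$SC_t$: value of the slope change variable, computed as $[T_t - (n_A + 1)] \cdot D_t$, so that it is equal to 0 for phase A and takes values from 0 to $(n_B - 1)$ for phase B;
$\varepsilon_t$: error term.
The error term ($\varepsilon_t$) was generated following a first-order autoregressive model:
$$\varepsilon_t = \phi_1 \varepsilon_{t-1} + u_t$$
The values of serial dependence ($\phi_1$) ranged from -.9 to .9 in steps of .1. The $u_t$ term represents white noise at moment t, generated following N(0, 1), and $\varepsilon_1 = u_1$.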
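The data generation model can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Fortran implementation used in the study: the function names (`systematic_part`, `ar1_errors`, `generate_ab_series`) are hypothetical, Python's standard `random` module stands in for the original random number generator, and the default burn-in of 50 values follows the description given later in this section.

```python
import random

def systematic_part(n_a, n_b, b0=0.0, b1=0.0, b2=0.0, b3=0.0):
    """Deterministic part of the model: b0 + b1*T_t + b2*D_t + b3*SC_t."""
    values = []
    for t in range(1, n_a + n_b + 1):   # T_t runs from 1 to N
        d = 1 if t > n_a else 0         # D_t: 0 in phase A, 1 in phase B
        sc = (t - (n_a + 1)) * d        # SC_t: 0 in phase A, 0..n_b-1 in phase B
        values.append(b0 + b1 * t + b2 * d + b3 * sc)
    return values

def ar1_errors(n, phi, rng, burn_in=50):
    """AR(1) errors e_t = phi*e_{t-1} + u_t with u_t ~ N(0, 1).

    Starting from previous = 0 makes the first error equal to the first
    white-noise value (e_1 = u_1). The first `burn_in` values are discarded.
    """
    errors = []
    previous = 0.0
    for _ in range(burn_in + n):
        previous = phi * previous + rng.gauss(0.0, 1.0)
        errors.append(previous)
    return errors[burn_in:]

def generate_ab_series(n_a, n_b, phi, rng, b0=0.0, b1=0.0, b2=0.0, b3=0.0):
    """One simulated AB series: systematic part plus autocorrelated error."""
    signal = systematic_part(n_a, n_b, b0, b1, b2, b3)
    noise = ar1_errors(n_a + n_b, phi, rng)
    return [s + e for s, e in zip(signal, noise)]

# Example: 5 + 5 observations, phi_1 = .3, level change beta_2 = .3.
rng = random.Random(2024)   # arbitrary seed, chosen only for reproducibility
series = generate_ab_series(5, 5, phi=0.3, rng=rng, b2=0.3)
print([round(v, 2) for v in series])
```

Keeping the systematic part separate from the error process makes it easy to check the design variables on their own before any noise is added.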
The value of the intercept parameter $\beta_0$ was set to zero, as it does not affect effect size calculation. In order to ensure the adequacy of the comparison between experimental conditions, we chose the values of $\beta_1$, $\beta_2$, and $\beta_3$ so that they produce comparable mean differences between the two phases. We first set the $\beta_2$ parameter, as the level change remains constant throughout the whole intervention phase. Afterwards, we set the values of $\beta_1$ and $\beta_3$ so that they lead to the same mean difference between phases. Those steps were initially carried out for the shortest series (i.e., $n_A = n_B = 5$) in order to explore whether longer series imply better discrimination of data patterns. We tested several values of $\beta_2$ (from .1 to .6 in steps of .1) in all experimental conditions, seeking its most appropriate value. We found that for $\beta_2 = .1$ the values of PND were all too low, while for $\beta_2 = .6$ PEM was close to reaching its maximum value. To avoid floor and ceiling effects (see Figure 1), which make it impossible to discriminate between patterns, we decided to set $\beta_2$ to .3.

INSERT FIGURE 1 ABOUT HERE
The use of $\beta_2 \neq 0$ implies that $\bar{y}_B - \bar{y}_A = \beta_2$ if the other parameters are set to zero. The value of $\beta_3$ that leads to the same mean difference was derived from this equality, taking $\beta_2 = .3$.
Finally, in order to guarantee suitable simulated data, the 50 values preceding each simulated data series were discarded in order to reduce artificial effects (Greenwood & Matyas, 1990) and to avoid dependence between successive data series (Huitema, McKean, & McKnight, 1999).
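The matching of coefficients described above can be worked out from the model itself. What follows is a reconstruction, not necessarily the authors' exact expressions: it assumes that the coefficients are matched on the expected difference between phase means, with the zero-mean error term dropping out, and that the values refer to the shortest design.

```latex
% Expected difference between phase means when only one coefficient is nonzero.
\begin{align*}
\text{level change only:}\quad
  E[\bar{y}_B - \bar{y}_A] &= \beta_2 \\
\text{slope change only:}\quad
  E[\bar{y}_B - \bar{y}_A] &= \beta_3 \cdot \frac{1}{n_B}\sum_{k=0}^{n_B-1} k
   = \beta_3\,\frac{n_B - 1}{2} \\
\text{trend only:}\quad
  E[\bar{y}_B - \bar{y}_A] &= \beta_1\left[\left(n_A + \frac{n_B + 1}{2}\right) - \frac{n_A + 1}{2}\right]
   = \beta_1\,\frac{N}{2}
\end{align*}
% Equating the last two expressions to beta_2:
\begin{align*}
\beta_3 = \frac{2\beta_2}{n_B - 1}, \qquad \beta_1 = \frac{2\beta_2}{N}
\end{align*}
% For n_A = n_B = 5 (N = 10) and beta_2 = .3:
\begin{align*}
\beta_3 = \frac{0.6}{4} = 0.15, \qquad \beta_1 = \frac{0.6}{10} = 0.06
\end{align*}
```

Under this reading, a slope change of .15 per measurement or a general trend of .06 per measurement produces the same expected phase-mean difference as a level change of .3 in a 5 + 5 design.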

Analysis
Before the steps needed to compute the three effect size indices included in the present study are described in detail, a fictitious data set is presented as an example. Consider a psychological study applying Parent-Child Interaction Therapy (for an in-depth description see Borrego, Anhalt, Terao, Vargas, & Urquiza, 2006) in which the number of praises a parent directs to a child is recorded on five days prior to treatment introduction and on five days during the intervention. The data gathered using the AB design structure (4, 5, 3, 6, and 3 praises during baseline and 7, 5, 8, 9, and 7 praises during the treatment phase) can be represented graphically as shown in Figure 2. In the following section, each of the procedures is applied to this data set in order to illustrate its calculation.

INSERT FIGURE 2 ABOUT HERE
We calculated the effect size for each experimental condition using the following indices:
Percent of nonoverlapping data:
1) Identify the highest measurement in phase A. In the example it is 6 praises, corresponding to baseline day 4.
2) Count the phase B data points that exceed the value identified in the previous step. The measurements corresponding to days 6, 8, 9, and 10 are greater than 6, so 4 values exceed phase A's highest value.
3) Divide the value obtained in step 2 by the number of observations in phase B. The number of phase B observations is 5 and the result of the division is 4/5 = 0.8.
4) Multiply the value obtained in step 3 by 100 in order to convert the proportion into a percentage. The percentage obtained for the example is 0.8*100 = 80%.
Percentage of all nonoverlapping data:
1) Identify the highest measurement in phase A. As obtained above, this value is 6.
2) Determine the minimal number of data points that must be eliminated in order to have no inter-phase overlap. If the measurement corresponding to day 7 (i.e., 5 praises) is eliminated, then phase A and phase B do not overlap: all remaining phase B data points are greater than all phase A measurements.
3) Divide the value obtained in step 2 by the total number of observations. A single value has to be eliminated, so the division is 1/10 = 0.1.
4) Multiply the value obtained in step 3 by 100 in order to convert the proportion into a percentage: 0.1*100 = 10%.
5) Subtract the value obtained in step 4 from 100. The percentage of all nonoverlapping data is equal to 100 - 10 = 90%.
Percentage of data points exceeding the median:
1) Calculate the median of phase A. In the example, the sorted baseline measurements are 3, 3, 4, 5, and 6 and, therefore, the phase A median is equal to 4.
2) Count the phase B data points that exceed the value identified in the previous step. All data points from the treatment phase are greater than 4, so the value obtained is 5 (equal to $n_B$).
3) Divide the value obtained in step 2 by the number of observations in phase B. The division to be made is 5/5 = 1.
4) Multiply the value obtained in step 3 by 100 in order to convert the proportion into a percentage. In the example presented, the percentage of data points exceeding the median is, thus, 1*100 = 100%.
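The three calculations above can be reproduced with a short Python sketch applied to the fictitious praise counts. The function names are hypothetical, and the code assumes an intervention intended to increase the behavior and strict inequalities (a phase B value equal to a phase A value counts as overlap). The PAND helper searches over cut points, which is one straightforward way of finding the minimal number of removals and not necessarily the procedure used by the original authors.

```python
from statistics import median

def pnd(phase_a, phase_b):
    """Percent of phase B points above the highest phase A point."""
    highest_a = max(phase_a)
    exceeding = sum(1 for y in phase_b if y > highest_a)
    return 100.0 * exceeding / len(phase_b)

def pem(phase_a, phase_b):
    """Percent of phase B points above the phase A median."""
    med = median(phase_a)
    exceeding = sum(1 for y in phase_b if y > med)
    return 100.0 * exceeding / len(phase_b)

def min_removals_for_no_overlap(phase_a, phase_b):
    """Smallest number of points to drop so that every remaining phase B
    value is greater than every remaining phase A value."""
    best = min(len(phase_a), len(phase_b))  # dropping a whole phase always works
    for cut in set(phase_a) | set(phase_b):
        # Keep phase A points at or below the cut and phase B points above it.
        removed = (sum(1 for y in phase_a if y > cut)
                   + sum(1 for y in phase_b if y <= cut))
        best = min(best, removed)
    return best

def pand(phase_a, phase_b):
    """100 minus the percentage of points that must be removed."""
    removed = min_removals_for_no_overlap(phase_a, phase_b)
    total = len(phase_a) + len(phase_b)
    return 100.0 - 100.0 * removed / total

baseline = [4, 5, 3, 6, 3]    # days 1-5
treatment = [7, 5, 8, 9, 7]   # days 6-10
print(pnd(baseline, treatment), pand(baseline, treatment), pem(baseline, treatment))
# prints: 80.0 90.0 100.0
```

For an intervention meant to reduce the behavior, the comparisons would be reversed (lowest baseline point, values below the median).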

Simulation
The specific steps that were implemented in the Fortran programs (one for each of the six series' lengths) were the following:
1) Systematic selection of each of the 19 degrees of serial dependence.
4) Generate an array with 50 + N data following a normal distribution with mean zero and unit standard deviation by means of the NAG fl90 mathematical-statistical libraries (specifically, the external subroutines nag_rand_seed_set and nag_rand_normal).
6) Assign the following N numbers to the array $u_t$.
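The overall logic of the simulation (select a degree of serial dependence, generate 50 + N normal deviates, discard the first 50, build the series, compute the index, and average over replications) can be sketched as follows. This is a deliberately reduced illustration, not the study's Fortran code: it uses Python's `random` module instead of the NAG routines, only PND, only a level change, and 2,000 replications per condition instead of the 100,000 used in the study. All function names are hypothetical.

```python
import random

def ar1_series(n, phi, rng, burn_in=50):
    """N(0, 1) white noise filtered through e_t = phi*e_{t-1} + u_t.
    The first `burn_in` values are generated and then discarded."""
    previous, out = 0.0, []
    for _ in range(burn_in + n):
        previous = phi * previous + rng.gauss(0.0, 1.0)
        out.append(previous)
    return out[burn_in:]

def pnd(phase_a, phase_b):
    """Percent of phase B points above the highest phase A point."""
    highest_a = max(phase_a)
    return 100.0 * sum(1 for y in phase_b if y > highest_a) / len(phase_b)

def mean_pnd(n_a, n_b, phi, level_change, replications, seed):
    """Average PND over many simulated AB series for one condition."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        errors = ar1_series(n_a + n_b, phi, rng)
        phase_a = errors[:n_a]
        phase_b = [e + level_change for e in errors[n_a:]]  # beta_2 * D_t
        total += pnd(phase_a, phase_b)
    return total / replications

# Three illustrative degrees of serial dependence (the study used 19).
for phi in (-0.6, 0.0, 0.6):
    no_effect = mean_pnd(5, 5, phi, level_change=0.0, replications=2000, seed=1)
    effect = mean_pnd(5, 5, phi, level_change=0.3, replications=2000, seed=1)
    print(phi, round(no_effect, 1), round(effect, 1))
```

Using the same seed for the two conditions means that they share the same noise, so any difference between the averages is attributable to the simulated level change.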

Results
This section is organized according to the objectives of the study: to explore the effect of autocorrelation, to compare data patterns discrimination, and to assess the importance of series' length.

Autocorrelation effect
In order to quantify the degree to which autocorrelation distorts the effect size estimates, we divided the estimates obtained for $\phi_1 \neq 0$ by the one obtained for $\phi_1 = 0$. We performed those calculations for the case in which no effect or trend was simulated, to avoid confounding variables. A ratio equal to 1 indicates no influence of serial dependence. Ratios lower than 1 imply an underestimation of the effect size associated with autocorrelation, while values greater than 1 entail overestimation. As Table 1 shows, PEM yields practically the same values regardless of the degree of serial dependence. For PND and PAND, greater negative or positive autocorrelation is generally associated with higher effect size estimates, PND being the more affected of the two indices. Figure 3 shows an example of those findings.

INSERT TABLE 1 ABOUT HERE
INSERT FIGURE 3 ABOUT HERE
When a treatment effect was simulated in the data, PEM proved to be sensitive to the presence of autocorrelation: positive as well as negative serial dependence leads to lower effect size estimates (see Figure 3 for an example).
For PND and PAND, the type of relationship between autocorrelation and effect size depends on the type of effect in the data. When the intervention involves a level change, both positive and negative $\phi_1$ lead to overestimation of the effect size.
When the treatment effect is expressed as a slope change, it is underestimated if PND or PAND is used. Figure 4 illustrates these tendencies.

Data pattern discrimination
The comparison of data pattern discrimination was carried out by constructing graphs combining the three procedures for computing the magnitude of effect with the six series' lengths. In each of these 3 * 6 = 18 graphs we placed the data patterns on the abscissa and the effect size index (i.e., percentage) on the ordinate, superimposing several autocorrelation levels.
We consider that an effect size index should detect powerful treatments (i.e., yield the highest effect size estimates for them), such as those represented by changes in slope and in level in the same direction. The indices should also respond with high estimates when either a change in level or a change in slope is present. On the other hand, when the intervention is not effective, the effect size index ought to yield low (ideally zero) percentages. Additionally, a perfect index would not be sensitive to a general trend, which has no relation to the introduction of a psychological treatment.
The visual inspection carried out following those criteria suggests that PND and PEM approximate the ideal discrimination pattern. Nonetheless, there is one relevant discrepancy between those two indices that stems from the way they are calculated: PND yields smaller effect size estimates than PEM. PAND seems to be more deficient, as it yields more similar estimates for data sets with and without treatment effects. An example of those findings can be seen in Figure 5, which is constructed for $\phi_1 = .3$, as it represents a level of serial dependence likely to be found in behavioral data (Parker, 2006), although the abovementioned tendencies are common to all $\phi_1$ values studied. All of the indices tested share a common drawback: they are affected by the presence of trend in the data, which leads to overestimating effect size. As expected, complex patterns are associated with greater effect size estimates for all indices.
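One way to see why PND gives smaller values than PEM even when no effect is present is to look at their expected values for pure white noise. This is a supplementary derivation, not taken from the study. It assumes independent, identically and continuously distributed observations (so $\phi_1 = 0$, no trend, no effect, no ties) and, for PEM, an odd number of baseline points.

```latex
% Under exchangeability, a phase B point is equally likely to occupy each of
% the n_A + 1 rank positions relative to the n_A baseline points.
\begin{align*}
E[\mathrm{PND}] &= 100 \cdot P\bigl(y_B > \max(y_{A,1},\dots,y_{A,n_A})\bigr)
                 = \frac{100}{n_A + 1} \\
E[\mathrm{PEM}] &= 100 \cdot P\bigl(y_B > \operatorname{median}(y_A)\bigr)
                 = 100 \cdot \frac{(n_A + 1)/2}{n_A + 1} = 50
\end{align*}
% Example: n_A = 5 gives E[PND] = 100/6 \approx 16.7 and E[PEM] = 50.
```

Under these assumptions neither index is expected to be zero for an ineffective intervention, and the expected PND decreases as the baseline gets longer while the expected PEM does not.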

INSERT FIGURE 6 ABOUT HERE
To complement the analyses performed, we divided the effect size estimates for series with effect and/or trend present by the estimate for data with no effect or trend simulated. These calculations were carried out for each of the three indices and for all series' lengths. Ratios equal to 1 indicate that the same estimates are obtained in the presence and in the absence of an effect. Values greater than 1 imply that the effect or the extraneous variable is associated with greater effect size estimates than white noise data. As Table 2 shows, PND is the procedure that differentiates the most between presence and absence of intervention effect. However, it is also the procedure most affected by trend. PAND distinguishes less between data patterns, except for data series with $n_A = 5$ and $n_B = 15$, where its performance is practically equivalent to PEM's.

Series' length effect
In order to explore how the performance of the indices varies as one of the phases (or both) becomes longer, we divided the effect size estimates obtained for the longer designs by the ones obtained for the shortest one ($n_A = n_B = 5$). Ratios equal to 1 suggest that phase length does not influence the performance of the procedures. Values greater or smaller than 1 imply higher or lower effect size estimates, respectively, in comparison to data sets with 10 measurements. According to Table 3, increasing series' length leads to a better differentiation between the data patterns. As the example in Figure 6 shows, the improvement is expressed basically as lower false alarm rates (i.e., lower percentages when there is no treatment effect) and as higher sensitivity to synergistic slope and level changes. Those results highlight the importance of having more measurements of the experimental unit in order to obtain a more precise picture of the evolution of its behavior. In accordance with the data simulation method followed, in longer series changes in slope yielded higher effect size estimates than changes in level.

INSERT TABLE 3 ABOUT HERE
INSERT FIGURE 7 ABOUT HERE
The performance of PAND improves for designs with unbalanced phase lengths. As Figure 7 illustrates, for such designs the distinction between data patterns is more pronounced, implying lower effect size estimates for white noise and trend. In contrast, for PND the presence of trend is more problematic in designs with unequal phase lengths. PEM is the procedure least affected by the number of data points in the series.

Discussion
In the current investigation we aimed to continue the search for the most appropriate procedure for quantifying treatment effectiveness and summarizing results from single-case designs. The performance of the effect size indices was tested by means of data patterns generated to represent the likely features of real data (i.e., few observations per phase, serially dependent measurements). Among the desirable features of those indices are: a) detecting changes in behavior due to the introduction of an intervention, that is, low miss (Type II error) rates; b) producing low, ideally null, effect size estimates in the absence of a treatment effect, that is, low false alarm (Type I error) rates; c) being insensitive to extraneous variables such as general trend; and d) remaining unaffected by autocorrelation.
Taking the first two criteria into consideration simultaneously, we can point to PND as the best performer, as it produces the lowest effect size estimates in the presence of white noise alone. Moreover, among the three procedures tested, it presents the highest relative differentiation between effective and ineffective interventions. PEM also discriminates well between patterns, being more sensitive but less specific than PND. PAND is the index that performs least satisfactorily when baseline and treatment phases have approximately the same number of observations. A positive characteristic of all three indices studied is that they discriminate between data patterns even when series consist of only ten data points.
As regards autocorrelation, PEM is the least affected procedure in the absence of an effect and is conservatively biased by both positive and negative serial dependence in the presence of a treatment effect. Applied researchers should keep in mind that both overestimation and underestimation of an existing treatment effect are possible when PND and PAND are used, depending on the degree of autocorrelation and on the type of effect (change in slope or in level). Of those two indices, PND is the one whose effect size estimates are more distorted by serial dependence.
A shortcoming of the indices is the distorting impact of trend in the data, which makes visual inspection necessary prior to applying any of the three procedures. PAND was the least affected index, while PND was the most affected one.
In conclusion, what recommendation can be given to applied researchers?
To begin with, they ought to keep in mind what each index represents in order to interpret it correctly. In this sense, we consider that the meaning of PND and PEM is more straightforward than the information given by PAND. In terms of computational accessibility, all three indices can easily be calculated, especially PND. It should be noted that whenever the intervention is intended to reduce rather than to increase the behavior measured, the computation of the indices can be adjusted to the needs of the applied researcher. A potential advantage of PAND is the possibility of deriving from it a conventional effect size index, like Pearson's Phi (Parker et al., 2007).
Nonetheless, mathematical-statistical calculations beyond the computation of the percentage itself may make the index less attractive to applied researchers.
Applied researchers can be advised to use PND in data sets with no autocorrelation or trend, as it is the procedure that best distinguishes between presence and absence of intervention effect.When there is a high outlier in the baseline phase and the objective of the intervention is to increase the behavior of interest, the use of PND cannot be advised as it would lead to an underestimation of the treatment effect.In cases when the behavioral measurements present general trend or are likely to be sequentially related, PEM ought to be the effect size index chosen.PAND approximates PEM's performance only when the baseline phase is considerably shorter than the treatment phase.
In any case, professionals should not follow the same criteria for labeling the treatment as "effective" when using different procedures (e.g., 70%-90% "effective", 50%-70% "questionable", in Scruggs and Mastropieri, 1998). This is because some of the indices (PEM and PAND) yield systematically higher effect size estimates than others (PND). Whatever index is used, visual inspection should not be replaced as a source of supplementary information (Parker et al., 2006).
As regards meta-analysis of single-case data, applied psychologists ought to be cautious when integrating information from studies using different numbers of measurement times, since these may imply different degrees of distortion due to autocorrelation and general trend. That is, the effect size estimates obtained from studies with a specific N may not have the same precision and the same insensitivity to extraneous variables as the estimates obtained for other series and/or phase lengths. This difficulty, however, applies not only to effect size procedures based on visual analysis, but also to those based on regression or standardized mean difference (Manolov & Solanas, 2008).
A limitation of the present investigation is that only two-phase designs were studied. However, as Busse, Kratochwill, and Elliott (1995) claim, results for AB designs can also be useful for multiple-baseline designs.
Future research may center on calibrating the data generation procedure with the most appropriate values (i.e., $\beta_1$, $\beta_2$, and $\beta_3$) for simulating treatment effects in order to improve real data modeling. In addition, it is necessary to obtain evidence on the performance of the effect size indices in designs consisting of more than two phases.
Tables
Table 1. Distortion due to autocorrelation when no trend or effect is present in the data. The values represent the ratio of the estimate for $\phi_1 \neq 0$ to the estimate for $\phi_1 = 0$.
where: $y_t$: the value of the dependent variable at moment t; $\beta_0$: intercept; $\beta_1$: coefficient associated with general trend; $\beta_2$: coefficient associated with level change; $\beta_3$: coefficient associated with slope change;
3) Divide the value obtained in step 2 by the number of observations in phase B. The number of phase B observations is 5 and the result of the division is 4/5 = 0.8.
4) Multiply the value obtained in step 3 by 100 in order to convert the proportion into a percentage. The percentage obtained for the example is 0.8*100 = 80%.
) Systematic selection of the ($\beta_1$, $\beta_2$, and $\beta_3$) parameters for data generation, leading to 8 different data patterns: autoregressive model (i.e., no effect or trend); trend; level change; slope change; trend and level change; trend and slope change; level and slope change; trend, level and slope change.

Figure 1. Influence of the simulation parameters β on the effect size indices.

Figure 3. Autocorrelation effect on the effect size indices when no effect or
10) Obtain the dummy treatment variable array $D_t$, where $D_t = 0$ for phase A and $D_t = 1$ for phase B.
16) Average the obtained percentages from the 100,000 replications of each experimental condition.

Table 2. Detection of data patterns in comparison to the case of no effect or trend simulated in independent series.

Table 3. Influence of series' length on pattern detection for independent series: comparison to $n_A = n_B = 5$.