Weighting strategies in the meta-analysis of single-case studies

One of the research aims of single-case designs is to establish the evidence base of interventions in areas such as psychology and special education, in conjunction with the aim of improving the well-being of the participants in the studies. The scientific criteria for solid evidence focus on the internal and external validity of the studies, and for both types of validity, replicating studies and integrating the results of these replications (i.e., meta-analyzing them) is crucial. In the present study, we deal with one aspect of meta-analysis: the weighting strategy used when computing an average effect size across studies. Several weighting strategies suggested for single-case designs are discussed and compared in the context of both simulated and real-life data. The results indicated no major differences between the strategies; thus, we consider it important to choose weights with a sound statistical and methodological basis, with scientific parsimony as another relevant criterion. More empirical research and conceptual discussion are warranted regarding the optimal weighting strategy in single-case designs, alongside investigation of the optimal effect size measure for these types of designs.

The evidence-based movement has been salient for several years in a variety of disciplines, including psychology (APA Presidential Task Force on Evidence-Based Practice, 2006), medicine (Sackett, Rosenberg, Gray, Haynes, & Richardson, 1996), and special education. In this context, single-case designs (SCDs) have been considered one of the viable options for obtaining evidence that will serve as support for interventions and practices (Schlosser, 2009). Accordingly, randomized single-case trials have been included in the new version of the classification elaborated by the Oxford Centre for Evidence-Based Medicine regarding the methodologies providing solid evidence (Howick et al., 2011). Thus, it is clear that one of the ways of improving methodological rigor and scientific credibility is by incorporating randomization into the design, given the importance of demonstrating causal relations (Lane & Carter, 2013). Demonstrating cause-effect relations is central to SCDs, provided that they are "experimental" in essence (Kratochwill et al., 2013; Sidman, 1960), and, apart from using random assignment of conditions to measurement times, it is also favored by replication of the behavioral change contiguous with the change in conditions (Kratochwill et al., 2013; Wolery, 2013). On the other hand, replication is also related to generalization (Sidman, 1960), which benefits from research synthesis and meta-analysis. In that sense, the evidence-based movement has also paid attention to the meta-analytical integration of replications or studies on the same topic (Beretvas & Chung, 2008b; Jenson, Clark, Kircher, & Kristjansson, 2007). The quantitative integration is deemed especially useful when moderator variables are included in the meta-analyses (Burns, 2012; Wolery, 2013).
Finally, it has been stressed that meta-analysis and the assessment of internal and external validity should not be considered separately (Burns, 2012), given that the assessment of the methodological quality of a study is an essential part of the process of carrying out research syntheses (Cooper, 2010; Littell, Corcoran, & Pillai, 2008; What Works Clearinghouse, 2008), for instance, using the methodological quality scale for SCDs (Tate et al., 2013) or the Study DIAD (Valentine & Cooper, 2008) as more general tools.
Despite the current prominence of hierarchical linear models (Gage & Lewis, 2012; Owens & Ferron, 2012), more research and debate are needed regarding the optimal way in which research synthesis ought to take place in the context of SCDs (Lane & Carter, 2013; Maggin & Chafouleas, 2013). The present study represents an effort to discuss and obtain evidence regarding the meta-analysis of single-case studies; its focus is on weighting strategies rather than on the effect size measures that summarize the results. In that sense, it should be stressed that we do not advocate here for or against specific procedures for SCD data analysis. We consider that, while the debate on the optimal analytical techniques is still ongoing, the methodological and statistical progress in SCDs will benefit from parallel research on the meta-analysis of SCD data. That is, it seems reasonable to try to solve the issue of how to combine the effect sizes from multiple studies while also dealing with the question of which effect size measure is optimal, especially given that meta-analyses of SCD data are already taking place.

Study aims
The purpose of the present study was to extend existing research on the meta-analysis of single-case data, focusing on weighting strategies. After discussing the different weights suggested, a comparison is performed to explore whether the choice of a weighting strategy is critical. One of the weighting strategies studied is a proposal made here, based on considering baseline length and variability together.
The comparison was carried out in two different contexts. We used data with known characteristics (i.e., simulation) in order to study the influence of baseline and series length, data variability, serial dependence, and trend. Simulation has already been used to compare weighting strategies in the context of group designs (e.g., Marín-Martínez & Sánchez-Meca, 2010) and in SCDs (e.g., Van den Noortgate & Onghena, 2003a). Additionally, we applied the weighting strategies to real data sets already meta-analyzed in a previously published study (Burns, Zaslofsky, Kanive, & Parker, 2012).

Weighting strategies
Weighting the individual studies' effect sizes is an inherent part of meta-analysis. When choosing a weighting strategy, two aspects need to be taken into account: its underlying rationale and its performance. Regarding the former, in group designs the inverse of the variance of the effect size index is considered the optimal weight (Hedges & Olkin, 1985; Whitlock, 2005), given that the variance quantifies the precision of the summary measure and is thus related to the confidence that a researcher can have in the effect size value obtained. However, the choice of an effect size index is not as straightforward in SCDs as it is in group designs. Moreover, the variance has not been derived for all effect size indices (see Hedges, Pustejovsky, & Shadish, 2012, for an example of the complexities related to deriving the variance of a standardized mean difference). Finally, deriving the variance of the effect size index involves assumptions, such as those mentioned in the Data Analysis subsection for the indices included in this study. Although inverse-variance weighting has been recommended in the SCD context as well (Beretvas & Chung, 2008b), more discussion is necessary on whether the same weighting strategy should be considered optimal there.
Other suggested weighting strategies relate to the degree to which a summary measure is representative of the real level of behavior. On the one hand, greater data variability means that a summary measure represents all the data less well; accordingly, the inverse of data variability has been suggested as a possible weight. On the other hand, when a summary measure is obtained from a longer series, the researcher can be more confident that the data gathered represent the actual (change in) behavior well and that the effects are not only temporary. Accordingly, Horner and Kratochwill (2012) mentioned the possibility of using series length as a weight, although its appropriateness is not beyond doubt (Shadish, Rindskopf, & Hedges, 2008). For instance, multiple probe designs (unlike multiple baseline designs) are specifically intended to produce fewer baseline phase measurements when the preintervention level is stable or in the specific case of zero frequency of the behavior to be learned (Gast & Ledford, 2010). In multiple probe designs, the aim is to reduce the unethical withholding of a potentially helpful intervention. Moreover, the intervention phase measurements are continuous only until a criterion is reached. Thus, studies using this design structure might be (unfairly) penalized (i.e., treated as quantitatively less important) by weighting strategies based on baseline or series length.
Another possible weight related to the amount of information available is the number of participants in a study, suggested by Kratochwill et al. (2013) and used, for instance, by Burns (2012). Nonetheless, its proponents state that there is no "strong statistical justification" (p. 24) for its use. Finally, using unweighted averages has also been considered (Kratochwill et al., 2013) and appears to be a common practice (Schlosser, Lee, & Wendt, 2008).
The proposal we make here is that, when considering the importance of data variability and the number of measurements available, the focus should be on the baseline, consistent with the attention paid to it by applied researchers and methodologists. In SCDs, this phase is used for gathering information on the initial situation and is necessary for establishing a criterion against which the effectiveness of a treatment is evaluated. On the one hand, longer baselines show more clearly what the preintervention level of behavior is, and this level (including any existing trends) can be projected with a greater degree of confidence into the treatment phases and compared with the actual measurements. Baseline length is explicitly mentioned in several SCD appraisal tools (Wendt & Miller, 2012), with a minimum of five measurements required for a study to receive a high score in the standards elaborated by the What Works Clearinghouse team and in the methodological quality scale for SCDs (Tate et al., 2013).
On the other hand, baseline stability is critical for any further assessment of intervention effectiveness (Kazdin, 2001; Smith, 2012), given that consistent responding is key to predicting how the behavior would continue in the absence of intervention. Finally, the focus on the baseline, rather than on the whole series, is warranted: If the data series are considered as a whole, any potential effect will introduce variability, since the preintervention and the postintervention measurements will not share the same level or trend. Thus, whole-series variability is not an appropriate weight, given that it is confounded with intervention effectiveness. Besides the justification of the weight chosen, it is relevant to explore the effect of using different weights when integrating SCD studies, and this is dealt with in the remainder of the article.
A comparison of weighting strategies: simulation study

Data generation: design
The simulation study presented here is based on multiple baseline designs (MBDs) for three reasons. First, previous reviews (Hammond & Gast, 2010; Shadish & Sullivan, 2011; Smith, 2012) suggest that this is the SCD structure used most frequently in published studies (around 50 % in the former two and 69 % in the latter). Second, in the meta-analysis carried out by Burns et al. (2012) (and rerun here), most of the studies included in the quantitative integration are MBDs. Third, MBDs meet the replication criteria suggested by Kratochwill et al. (2013) for designs allowing solid scientific evidence to be obtained. Subsequent quantifications are based on the idea that comparisons should be made between adjacent phases (Gast & Spriggs, 2010), that is, within each of the three tiers simulated, with averages then obtained across tiers.

Data generation: model and data features
Data were generated via Monte Carlo methods using the following model, presented by Huitema and McKean (2000) and used previously in other SCD simulation studies (e.g., Beretvas & Chung, 2008b; Ferron & Sentovich, 2002; Ugille et al., 2012):

y_t = β_0 + β_1 T_t + β_2 D_t + β_3 (T_t − n_A − 1) D_t + ε_t.

The following variables are used in the model: T refers to time, taking the values 1, 2, . . . , n_A + n_B (where the latter are the phase lengths); D is a dummy variable reflecting the phase (0 for baseline and 1 for intervention) and used for modeling level change, whereas the interaction between D and T models slope change. In this model, serial dependence can be specified via a first-order autoregressive model for the error term, ε_t = φ_1 ε_{t−1} + u_t, with φ_1 being set to 0 (independent data), .3, or .6, and u_t being a normally distributed random disturbance. These autocorrelation values cover those reported by Shadish and Sullivan (2011) for the 531 MBD studies they reviewed: A random effects meta-analytic mean of these autocorrelations was .145, which, when corrected for bias, was equal to .320. In order to cover a greater range of possibilities, in some conditions the degree of autocorrelation was homogeneous for the whole series, whereas in others there was nonzero autocorrelation only for the baseline data (see Fig. 1 for a graphical representation of the experimental conditions of the simulation study).
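As an illustration of the generation model just described, one MBD tier can be simulated in a few lines. This is a minimal sketch rather than the authors' actual R code: The function name, the default parameter values (taken from the 0-100 metric described below), and the centered form of the slope-change term are assumptions for illustration.

```python
import numpy as np

def simulate_tier(n_a=10, n_b=10, beta=(40, 0, 26, 3), phi=0.3, sd_u=7, seed=None):
    """Sketch of the Huitema-McKean generation model for one MBD tier.
    beta = (beta0, beta1, beta2, beta3); phi is the AR(1) parameter."""
    rng = np.random.default_rng(seed)
    n = n_a + n_b
    t = np.arange(1, n + 1)                # time T = 1, ..., n_A + n_B
    d = (t > n_a).astype(float)            # dummy D: 0 baseline, 1 intervention
    # AR(1) errors: e_t = phi * e_{t-1} + u_t
    u = rng.normal(0.0, sd_u, n)
    e = np.empty(n)
    e[0] = u[0]
    for i in range(1, n):
        e[i] = phi * e[i - 1] + u[i]
    b0, b1, b2, b3 = beta
    # slope-change term centered at the start of the intervention phase
    y = b0 + b1 * t + b2 * d + b3 * (t - n_a - 1) * d + e
    return t, d, y
```

Setting `sd_u=0` makes the series deterministic, which is a convenient way to check that the level-change and slope-change parameters behave as intended before adding noise.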
Regarding the remaining simulation parameters (β_0, β_1, β_2, and β_3), we wanted their selection to be based on the characteristics of real behavioral data, rather than selecting completely arbitrary values. Therefore, we focused on the studies included in the Burns et al. (2012) meta-analysis. Nevertheless, we are aware that any selection of parameters is necessarily limited. In order to make the simulation study match real situations more closely, we chose to include two different metrics, one representing the percentage of time intervals on task (as in Beck, Burns, & Lau, 2009), a metric varying from 0 to 100, and another representing the number of digits correct per minute (ranging up to 30 in Burns, 2005). On the basis of the data in these two studies, we also chose the baseline level β_0 (set to 40 and 7, respectively) and the standard deviation of the random normal disturbance u_t with zero mean (set to 7 and 3, respectively). The level change parameter β_2 was set to 26 and 11 for the percentage and the count metrics, respectively, on the basis of the effects found in the abovementioned studies. The slope change parameter β_3 was set to 1 for the 0-30 metric, approximately equal to the difference in slopes in the Burns (2005) data, whereas for the 0-100 metric it was set to 3 in order to represent roughly the ratio between the scales (100:30 ≈ 3:1).
Finally, baseline trend (β_1) was set to 0 in the reference condition. In the conditions with change in slope, β_1 was set to 1 for the 0-30 metric, given that in the only MBD tier of the Burns (2005) study in which there was some indication of baseline trend (for student 2), the ordinary least squares slope coefficient was equal to 1.1; analogously, β_1 was set to 3 for the 0-100 metric. Table 1 contains these simulation parameters, as well as the standardized change-in-level (β_2) and change-in-slope (β_3) effects for the different conditions. Standardizing shows that the effect sizes for the two metrics are very similar, for both change in level and change in slope. For slope change, Table 1 includes the corresponding mean difference between phases: Since β_3 represents the increment between two successive points in the treatment phase, the average change between phases can be expressed as ∑_{i=0}^{n_B−1} i β_3 / n_B, where n_B is either 5 or 10.
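The average between-phase change implied by a given slope change can be verified with a small helper (the function name is ours, for illustration only):

```python
def mean_phase_shift(beta3, n_b):
    """Average level change implied by a slope change of beta3 over n_b
    treatment-phase points: sum_{i=0}^{n_b - 1} i * beta3 / n_b."""
    return sum(i * beta3 for i in range(n_b)) / n_b

# For beta3 = 1 and n_b = 5: (0 + 1 + 2 + 3 + 4) / 5 = 2.0
print(mean_phase_shift(1, 5))   # 2.0
print(mean_phase_shift(1, 10))  # 4.5
```

Equivalently, the sum has the closed form β_3 (n_B − 1)/2, so doubling the treatment phase length roughly doubles the implied mean shift.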

Data generation: phase lengths
Using the model presented above, 10 three-tier MBD data sets (k = 10) were simulated for each iteration and later integrated quantitatively. In previous simulation studies related to single-case meta-analysis (Owens & Ferron, 2012; Ugille et al., 2012), k = 10 was also one of the conditions studied. However, given that, in those studies, the estimation of effects was the objective, k was more relevant there than in the present study, where weighting strategies are being compared.
The basic MBD data set, used as a reference, contained 20 measurements (n_A = n_B = 10) in each tier, following two pieces of evidence. On the one hand, Shadish and Sullivan (2011) reported that both the median and the modal number of data points in the SCD studies included in their review were 20. On the other hand, Smith (2012) reported a mean of 10.4 baseline data points in MBDs, which is consistent with the Shadish and Sullivan finding that 54.7 % of the SCDs had five or more points in the first baseline.
Each generation of 10 studies and posterior meta-analytical integration was iterated 1,000 times using R (R Core Team, 2013), and thus, 1,000 weighted averages were obtained for each weighting strategy and each experimental condition (i.e., for each combination of phase lengths, type of effect, data variability, degree of serial dependence, and trend).

Data generation: additional conditions for studying the effect of data variability and phase length
In the simulation study, we wanted to explore the effect of data variability and phase lengths as potentially important factors for the weighting strategies (see Fig. 1). In order to study how more variability or more data points affect the weighted average, it was necessary to set different effect sizes in the different studies being integrated.2 We decided that half of the k = 10 studies should have the effects previously presented (β_2 = 11 and β_3 = 1 for the 0-30 metric, β_2 = 26 and β_3 = 3 for the 0-100 metric), whereas for the other half, the effects were multiplied by the arbitrarily chosen value of 1.5 (thus, β_2 = 16.5 and β_3 = 1.5 for the 0-30 metric, β_2 = 39 and β_3 = 4.5 for the 0-100 metric). The effects and their standardized versions are available in Table 1.3 In order to study the effect of data variability, we doubled the standard deviation of the random normal disturbance u_t to 6 (for the 0-30 metric) and to 14 (for the 0-100 metric) for the five studies with larger effects. Thus, we expected the weighted average to decrease. It should be stressed that, with the simulation parameters specified in this way, the simulated data were expected to be generally within the range of possible values for both metrics.

2 Otherwise, it would not be possible to study the effect of these two data features. Consider the following example, with two studies being integrated and the raw mean difference in both being equal to 11. If the first study is given weight 2 (due to twice as many data points) and the second study is given weight 1, the weighted average is still (2 × 11 + 1 × 11)/3 = 11, the same as the unweighted average. Therefore, it is necessary to have different magnitudes of effect in order to explore to what extent the weighted average moves closer to the effect size of the study given greater weight.

3 The standardized values in Table 1 are computed, on the one hand, considering the variability in the reference condition and, on the other hand, for the conditions with greater variability.
To study the effect of phase lengths, we halved the number of data points in the baseline (n_A = 5) or in the whole MBD tier (n_A = n_B = 10) for the studies with larger effects, expecting once again a reduction in the weighted average. Note that the multiplication factor was the same as when studying the effect of data variability, given that the aim was to be able to compare the changes in the weighted averages as a result of the smaller-effect-size studies containing more measurements or presenting lower variability.
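The logic underlying this design choice (equal effects make all weighting schemes coincide, whereas unequal effects pull the weighted average toward the study given greater weight) can be checked directly. The function below is a generic weighted mean, not code from the study:

```python
def weighted_mean(effects, weights):
    """Weighted average: sum(w_i * ES_i) / sum(w_i)."""
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)

# Equal effect sizes: the weights cannot matter.
print(weighted_mean([11, 11], [2, 1]))    # 11.0
# Unequal effect sizes: the average moves toward the heavier study.
print(weighted_mean([11, 16.5], [2, 1]))  # about 12.83, vs. unweighted 13.75
```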

Data analysis: effect size measures
Our choice of effect size measures to include in the present study was based on two criteria: knowledge of the expression of the index variance (under certain assumptions) and actual use in SCDs. Given the considerable lack of consensus on the most appropriate effect size measure (Burns, 2012; Kratochwill et al., 2013; Smith, 2012), we are aware that any choice of an analytical technique can be criticized; in the following, we explain our choice for this particular study, although we do not claim that the measures included here are always the most appropriate ones. In the review of single-case meta-analyses performed by Beretvas and Chung (2008b), the percentage of nonoverlapping data (PND; Scruggs, Mastropieri, & Casto, 1987) and the standardized mean difference were the most frequently used procedures for meta-analyzing single-case data. Taking this into account, we chose two effect size measures for inclusion.

Note (Table 1). SD, standard deviation (reference values equal to 7 and 3 for the 0-100 and 0-30 metrics, respectively; greater SD equal to 14 and 6 for the 0-100 and 0-30 metrics, respectively); MD, mean difference between phases; β_0, initial (baseline) level; β_1, general trend not related to the intervention; β_2, change in level for the reference condition and for the condition with larger effect size; β_3, change in slope for the reference condition and for the condition with larger effect size.
First, for the nonoverlap measure, we chose the nonoverlap of all pairs (NAP; Parker & Vannest, 2009) rather than the PND, despite the fact that the PND has a long history of use, its quantifications have been validated against researchers' judgments of which interventions are effective (Scruggs & Mastropieri, 2013), and it agrees with visual analysis in the absence of an effect (Wolery, Busick, Reichow, & Barton, 2010). The reasons for preferring the NAP are the following: (1) It does not depend on a single extreme baseline measure; (2) in simulation studies, the NAP has been shown to perform well in the presence of autocorrelation (Manolov, Solanas, Sierra, & Evans, 2011), in contrast with the PND (Manolov, Solanas, & Leiva, 2010); (3) the NAP and the PND show similar distributions of typical values, according to the review by Parker, Vannest, and Davis (2011) using real behavioral data; and (4) the critical reason for selecting the NAP was that the PND does not have a known sampling distribution (Parker et al., 2011), which makes it impossible to use the most widely accepted weight for group-design studies, whereas an expression for the variance of the NAP does exist. The NAP is obtained as the percentage of pairwise comparisons for which the result is an improvement after the intervention (e.g., the intervention measurement is greater than the baseline measurement when the aim is to increase behavior). It is equivalent to an indicator called the probability of superiority (Grissom, 1994), which is related to the common language effect size (McGraw & Wong, 1992).
Grissom and Kim (2001) provided a formula for estimating the variance of the probability of superiority, which is also applicable to the NAP. Note that the probability of superiority was originally intended to compare two independent samples, in the same way as the Mann-Whitney U test; extending this logic to SCDs entails assuming that the data are independent and that the variances are equal. The reader should consider whether these assumptions are plausible. The NAP has been used in single-case meta-analyses (e.g., Burns et al., 2012; Petersen-Brown, Karich, & Symons, 2012).
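A minimal computation of the NAP as the proportion of improving baseline-intervention pairs might look as follows. The function name is ours, and counting ties as half a point follows our reading of Parker and Vannest (2009):

```python
def nap(baseline, treatment, increase_desired=True):
    """Nonoverlap of all pairs: share of (baseline, treatment) pairs showing
    improvement; ties count as half (following Parker & Vannest, 2009)."""
    if not increase_desired:            # for behaviors meant to decrease
        baseline, treatment = treatment, baseline
    pairs = [(a, b) for a in baseline for b in treatment]
    wins = sum(1.0 for a, b in pairs if b > a)
    ties = sum(1.0 for a, b in pairs if b == a)
    return (wins + 0.5 * ties) / len(pairs)
```

For example, complete nonoverlap (every treatment point above every baseline point) yields 1.0, and complete overlap yields 0.5, the chance level for the probability of superiority.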
Second, regarding the standardized mean difference index, according to Beretvas and Chung (2008b), the most commonly applied version4 was the one using the standard deviation of the baseline measurements (s_A) in the denominator, which in group designs comparing a treatment mean (X̄_B) and a control group mean (X̄_A) would be Glass's Δ (Glass, McGaw, & Smith, 1981). The index is thus defined as Δ = (X̄_B − X̄_A)/s_A, and its variance is given by Rosenthal (1994) as (n_A + n_B)/(n_A n_B) + Δ²/[2(n_A − 1)]. Note that Δ was originally used to compare two independent groups and is based on the assumption that the sampling distribution of Δ tends asymptotically to normality; thus, this formula is only an approximation. Moreover, although Δ is a standardized measure of the average difference between phases, its application to SCD data does not yield a measure comparable to the d statistic obtained in studies based on group designs (see Hedges et al., 2012, for a more complete explanation). This is also a reason for not using Cohen's benchmarks for interpreting the index's values (Beretvas & Chung, 2008a, b). Once again, we stress that we do not advocate the use of this measure for quantifying intervention effectiveness in all SCD data.
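Under the assumptions just listed, Δ and its approximate variance can be sketched as follows. The function names are ours; `stdev` is the sample standard deviation, and the variance expression is the standard Rosenthal (1994) approximation for Glass's Δ with the baseline playing the control-group role:

```python
from statistics import mean, stdev

def glass_delta(baseline, treatment):
    """Glass's Delta: phase mean difference standardized by the baseline SD."""
    return (mean(treatment) - mean(baseline)) / stdev(baseline)

def glass_delta_var(delta, n_a, n_b):
    """Approximate sampling variance of Delta (Rosenthal, 1994):
    (n_a + n_b)/(n_a * n_b) + delta**2 / (2 * (n_a - 1))."""
    return (n_a + n_b) / (n_a * n_b) + delta ** 2 / (2 * (n_a - 1))
```

Note how the second term grows with Δ² and shrinks with the baseline phase length, so large effects estimated from short baselines receive little inverse-variance weight.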
Three aspects should be considered with regard to these two effect size measures. First, the fact that the first measure is expressed as a percentage of nonoverlap and the second is standardized implies that they can be applied to data measured in different metrics (which is the case for both the simulated and the real data used here). Second, the expressions for the variances of these indices do not take into account the fact that single-case data may be autocorrelated; thus, (1) they should be used with caution when applied to real data, for which it is difficult to estimate autocorrelation precisely (Huitema & McKean, 1991; Solanas, Manolov, & Sierra, 2010), and (2) it is of interest to explore the effect of serial dependence on the weighted averages when the inverse of the indices' variance is used as a weight.
The third noteworthy aspect is related to situations in which the data do not show stability. It has to be mentioned that neither the NAP nor Δ is suitable for data that present a baseline trend not related to the intervention, as was pointed out by Parker, Vannest, Davis, and Sauber (2011) and Beretvas and Chung (2008b), respectively. This is why we did not apply these indices to conditions with β_1 ≠ 0. In fact, there are several methods for dealing with trend (e.g., Allison & Gorman, 1993; Maggin, Swaminathan, et al., 2011; Manolov & Solanas, 2009; Parker, Vannest, & Davis, 2012). However, modeling trend is not an easy issue, given that it is necessary to consider aspects such as phase length (Van den Noortgate & Onghena, 2003b) and reasonable limits within which data can be projected (Parker, Vannest, Davis, & Sauber, 2011). Moreover, the issue of baseline trend is probably more critical for the effect size indices than for the weighting strategies used to assign quantitative "importance" to these indices.
Another aspect related to the effect size measures and the lack of data stability is that the NAP and Δ are not specifically designed to quantify changes in slope. Therefore, a different type of summary measure was computed here for this specific situation: the difference between the standardized ordinary least squares slope coefficients estimated separately for the treatment phase and for the baseline phase (with T as the predictor in both cases). This third summary measure can be defined as β_diff = β̂_B − β̂_A, where β̂_A and β̂_B denote the standardized slopes of the baseline and treatment phases, respectively. The NAP, Δ, and β_diff were computed for each generated data set. The quantifications of the ten studies (i = 1, 2, . . . , 10) were then integrated via a weighted average, ∑_{i=1}^{10} w_i ES_i / ∑_{i=1}^{10} w_i, where ES_i is the effect size and w_i the weight of study i, based on one of the five strategies studied here.

4 However, note that in the review by Maggin, O'Keefe, and Johnson (2011), this measure was used in only 19 % of SSED meta-analyses.
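A sketch of β_diff and of the weighted integration step follows. The original study describes the standardization only verbally, so the familiar standardization of an OLS slope, b · s_T / s_y (which equals the Pearson correlation between time and score), is an assumption here, as are the function names:

```python
from statistics import mean, stdev

def std_slope(y):
    """Standardized OLS slope of y on time t = 1..n: b * s_t / s_y,
    i.e., the usual standardized regression coefficient."""
    n = len(y)
    t = list(range(1, n + 1))
    tm, ym = mean(t), mean(y)
    b = (sum((ti - tm) * (yi - ym) for ti, yi in zip(t, y))
         / sum((ti - tm) ** 2 for ti in t))
    return b * stdev(t) / stdev(y)

def beta_diff(baseline, treatment):
    """Difference between phase-wise standardized slopes (treatment minus baseline)."""
    return std_slope(treatment) - std_slope(baseline)

def weighted_average(effect_sizes, weights):
    """Meta-analytic combination: sum(w_i * ES_i) / sum(w_i)."""
    return sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
```

Because each phase is standardized separately, a perfectly linear phase yields a standardized slope of 1 regardless of the raw metric, which is what makes β_diff comparable across the 0-30 and 0-100 scales.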

Data analysis: weighting strategies
The weighting strategies included here were the inverse of the variance of the effect size indices, series length, baseline length, baseline variability, and a proposal based on both baseline length and variability. The variability of the whole series was not included as a weight, because it might be confounded with an intervention effect, given that a mean shift and a change in slope both entail greater scatter. Another possible weight not included here is the number of participants: It is not strongly supported even by its proponents, and it raises further questions, such as what weight should be used when there is only one participant in the study (e.g., when an ABAB design is used) and whether, in MBDs across behaviors or settings, the number of tiers should also be used as a weight.

It is important to distinguish between the weighting strategies that involve computing a measure of variability. On the one hand, the classical option is based on the effect size index variance (that is, the variance of its sampling distribution). In this case, the weight is the inverse of this variance, so that a greater weight corresponds to greater precision of the effect size estimate. On the other hand, the variability of the data (and not of the summary measure) can be considered, here focusing on the baseline phase. In this case, the weight is the inverse of the coefficient of variation of the baseline measurements; the coefficient of variation is used to eliminate the influence of the measurement units. In this way, studies with more stable data contribute more to the average effect size.
Regarding series and baseline phase lengths, the weights are n and n_A, respectively, giving greater numerical importance to studies in which more measurements are available. The proposal presented here is based on both baseline length and data variability, given that the two aspects are related and should not be assessed separately: Longer baselines are desirable because they provide more information about, and confidence in, the actual initial situation, but even shorter baselines might be sufficiently informative if the data are stable. The weight in the proposal was defined as n_A + 1/CV(A), a direct function of baseline length and an inverse function of baseline data variability measured in terms of the coefficient of variation (a nondimensional measure that makes data expressed in different units comparable). The proposal is well aligned with Kratochwill et al.'s (2010) suggestion that the first step in assessing the usefulness of single-case data for providing scientific evidence is to check whether the baseline pattern "has sufficiently consistent level and variability." Moreover, the same authors state that "[h]ighly variable data may require a longer phase to establish stability" (p. 19).
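The five strategies can be collected for a single study as follows. This is a sketch: The dictionary keys, the function names, and the use of the absolute mean in the coefficient of variation are our choices, not specifications from the study.

```python
from statistics import mean, stdev

def cv(data):
    """Coefficient of variation: SD relative to the (absolute) mean."""
    return stdev(data) / abs(mean(data))

def study_weights(baseline, treatment, es_variance):
    """The five weighting strategies compared in the study, for one study's data."""
    n_a, n = len(baseline), len(baseline) + len(treatment)
    return {
        "inverse ES variance": 1.0 / es_variance,
        "series length": n,
        "baseline length": n_a,
        "baseline stability": 1.0 / cv(baseline),   # 1 / CV(A)
        "proposal": n_a + 1.0 / cv(baseline),       # n_A + 1/CV(A)
    }
```

With a five-point baseline whose mean is 12 and whose SD is 2, for instance, the stability weight is 6 and the proposal weight is 5 + 6 = 11, so stable data can compensate for a shorter baseline.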

Results
The main numerical results are presented in Table 2 for the NAP and Table 3 for Δ, for conditions in which level change was simulated, and in Table 4 for β_diff, for conditions including slope change. In the following sections, the results are presented in relation to each data feature whose effect was studied via simulation.

Reference condition
The reference condition included MBD data series with 10 measurements in the two phases of each tier, with no autocorrelation or trend, and with variability equal across all studies. The weighted averages were very similar; the only difference was the Δ value observed for the weight based on baseline data variability (and, thus, also present in the proposal). Hence, the choice of a weighting strategy does not seem critical. Next, we explore whether specific data features have a differential influence on any of these strategies.

Effect of phase lengths
For the NAP and β_diff, there were practically no differences between the weighting strategies. For the NAP, there was no difference with respect to the reference condition. For Δ, the pattern of results was more complex: The unweighted average was close to the inverse-variance result only when the whole large-effect-size series were shorter. However, when only the baseline phases were shorter, the results for the Δ variance as a weight were closer to those for n_A. Nonetheless, whether the index variance is an optimal weight should be discussed, given the issues related to its derivation. For both types of conditions studied, the values for the proposal were in the middle of the ranges observed and, thus, represent less extreme quantifications of the average effect size.

Effect of data variability
Greater data variability reduced the weighted averages for all three effect size indices, although for the NAP this reduction was only slight. The results obtained with the different weighting strategies were considerably similar; the only noteworthy differences were observed for Δ when baseline variability was used as a weight. Once again, the results for the proposal were less extreme than all of the other weighted averages.

Effect of serial dependence
The presence of positive autocorrelation in the data reduced the weighted averages obtained, although this was not as marked for the NAP. In general, φ 1 = .6 leads to underestimating the effect size when it is computed via Δ or β diff , and when a larger proportion of the data is autocorrelated (i.e., both phases of a tier, in both large- and small-effect-size studies), this underestimation is more pronounced. In any case, what is central to the comparison of the weighting strategies is that for all three effect size measures, the results were very similar.
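Serially dependent series of the kind examined in these conditions are commonly generated from a first-order autoregressive model; a minimal sketch follows (the function name and parameters are illustrative, not the exact generation procedure used in the present simulations):

```python
import random

def ar1_series(n, phi, sd=1.0, seed=None):
    """Generate n measurements from an AR(1) process: each value is
    phi times the previous value plus fresh Gaussian error, producing
    lag-one autocorrelation of approximately phi."""
    rng = random.Random(seed)
    y = [rng.gauss(0, sd)]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0, sd))
    return y
```

Under such a model, successive errors are no longer independent, which is what distorts the sampling variability assumed by the Δ and β diff formulae.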

Effect of trend
When an improving baseline trend is present in the data and a procedure is not specifically designed to deal with it, this data feature can affect the quantification of the effect size, as shown once again here. For the NAP and for Δ, such a trend leads to overestimating the effect size, given that the initial improvement (and its projection into the treatment phase) is not controlled for; the results for β diff differ because an already positive slope means that the change in slope after the intervention is compared with steeper (not stable) baseline data. However, given that the present work is focused on weighting strategies and not on the performance of the effect size indices, it is important to explore whether this distortion in the estimates is similar across weights or not. In the experimental conditions studied here, the similarity is notable. Once again, there were no major differences among the weighting strategies.
A comparison of weighting strategies: real data meta-analysis

Characteristics of the meta-analysis
The meta-analysis presented here is based on the meta-analysis carried out by Burns et al. (2012), 5 which integrated 10 studies (k = 10; the articles marked with an asterisk in the reference list were those included in the meta-analysis). However, the present reanalysis is not a direct replication of the Burns et al. study, given that we did not use median NAP values or convert NAP to Pearson's phi. Most of the studies included in the meta-analysis used multiple baseline designs and focused on an intervention called incremental rehearsal, which is used for several teaching purposes (e.g., words, mathematics) both for children with and for those without disabilities.
Dealing with dependence of outcomes
More than one outcome can be computed for most of the single-case studies included in the meta-analysis, and it does not seem appropriate to treat each outcome as independent (Beretvas & Chung, 2008b). Here, we chose to average the effect sizes within a study, which is one of the options used in group-design meta-analysis (Borenstein, Hedges, Higgins, & Rothstein, 2009). However, it is also possible to choose one of the several effect sizes reported per study according to a substantive criterion or at random (Lipsey & Wilson, 2001). Another issue that requires consideration is how weights are computed in order to have a single weight per study accompanying the corresponding effect size measure. Borenstein et al. (2009) discussed the possibility of calculating the variance of an average of effect sizes within a study. However, their formulae require knowing or, at least, assuming plausible values for the correlations between the different study outcomes. Given that we did not want to make an assumption with no basis, we chose to obtain the average of the weights for each outcome in order to have a single weight per study. This approach has been deemed a conservative solution (Borenstein et al., 2009).
For instance, for multiple baseline designs (e.g., Burns, 2005) or multiple probe designs (e.g., Codding, Archer, & Connell, 2010), there is one outcome for each baseline. In such cases, it has been suggested that an effect size should be computed for each baseline before computing the average across baselines; Burns et al. (2012) also computed the NAP for each baseline and then aggregated them. For designs with multiple treatments (e.g., Burns, 2007), the optimal practice is not clear, but comparing each treatment with the immediately preceding baseline seems to be the logical choice. However, given that in the Burns (2007) study there was only one baseline (the design can be designated as ACBC) and considering the possibility of sequence effects, we chose to include only the comparison of this baseline with the first intervention. For the Volpe, Mulé, Briesch, Joseph, and Burns (2011) study, each measurement obtained under the incremental rehearsal condition was compared with the corresponding measurement under the traditional drill and practice condition, which was considered the reference, although it is not strictly speaking a baseline condition.
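The within-study aggregation just described (one effect size per baseline, then a simple average of both the effect sizes and their weights to yield a single pair of values per study) can be sketched as follows; the function name is hypothetical:

```python
def aggregate_study(effect_sizes, weights):
    """Average the per-baseline effect sizes and their weights so that
    each study contributes exactly one (effect size, weight) pair to
    the across-study integration."""
    k = len(effect_sizes)
    return sum(effect_sizes) / k, sum(weights) / k
```

For example, a three-tier MBD study with per-baseline effect sizes and weights `aggregate_study([0.8, 1.0, 1.2], [10, 20, 30])` enters the meta-analysis as a single effect size of 1.0 with weight 20.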

Results
The effect sizes and the different weights for each of the 10 studies are presented in Table 5. Some aspects of the results should be commented upon before discussing the weighted averages across studies. For the Bunn, Burns, Hoffman, and Newman (2005) study, a perfectly stable baseline (i.e., a complete lack of variability) precluded computing β diff , as well as Δ, its variance, and the weight related to baseline variability. Additionally, given that only 10 studies were integrated, an extreme effect size in any of them and/or a measure with an extremely high weight may have affected the results of the weighted average across studies. For instance, the rather unfavorable results for incremental rehearsal in the Volpe, Mulé, et al. (2011) study potentially decreased the weighted average, especially for the weighting strategies based on baseline or series length and for the NAP variance. Another example of a study whose results are potentially influential was conducted by Matchett and Burns (2009). In the present meta-analysis, the effect size for the Matchett and Burns study was given greater weight under both the baseline variability and the proposal weighting strategies, given that their data showed very low relative dispersion (e.g., the values for the first tier ranged between 47 and 50). The influence of the Matchett and Burns study on the average effect size is especially salient for β diff . The values and weights in Table 5 were used to obtain the mean effect sizes for the 10 studies according to each weighting strategy; the unweighted average was also computed. The results obtained following the quantitative integration of the studies are presented in Fig. 2. For both the NAP and Δ, the proposal's results were close to the unweighted average. In contrast, the NAP variance result was closer to that obtained when n A was used as a weight, and the Δ variance result was more similar to the series length weight.
However, the weighted average using baseline variability as a weight yielded a somewhat different result. The latter finding is especially salient for β diff , due to the influence of the Matchett and Burns (2009) study.
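The weighted averages summarized in Fig. 2 follow the standard form, in which each study's effect size contributes in proportion to its weight, and equal weights recover the unweighted mean; a minimal sketch with illustrative values:

```python
def weighted_mean(effects, weights):
    """Weighted average effect size across studies for a given
    weighting strategy: sum of weight-scaled effects over the sum
    of the weights."""
    return sum(e * w for e, w in zip(effects, weights)) / sum(weights)
```

For instance, `weighted_mean([1.0, 3.0], [1, 3])` gives 2.5, whereas equal weights would give the unweighted mean of 2.0; this is how a single heavily weighted study, such as Matchett and Burns (2009) under the variability-based strategies, can pull the average.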

Results and implications
The present study is, to the best of our knowledge, the first one based simultaneously on simulation and real data comparing several weighting strategies in the context of SCD meta-analysis. The results obtained here are restricted to the experimental conditions studied, and more extensive research and discussion are required. However, various aspects of this work will fuel further discussion and testing with published data or via simulation. First, the issue of whether weighting is necessary when an average effect size summarizing the results of several studies is obtained should be considered. On substantive grounds, it seems logical to treat the outcome of a study as numerically more important (i.e., contributing to a greater extent) when this outcome is based on a larger amount of data and/or on a clear data pattern (i.e., with less unexplained variability). On empirical grounds, on the basis of the results presented here, there is not enough evidence that weighting yields markedly different results. An implication of these findings (which should be considered taking into account the limitations discussed below) is that series length alone may not be a critical feature for giving more or less weight to the results. In that sense, multiple probe designs, characterized by a reduced number of measurements, need not be treated as providing less evidence. However, note that the length of the phases is also considered in the expressions for approximating the variance of the indices included in this study.
Second, for the cases in which certain differences are observed in the weighted averages, it is important to establish the gold standard, so that a result can be judged as more or less desirable. In that sense, whether the variance of the effect size measure is that gold standard and whether it can be derived for single-case data, considering potential serial dependence and/or a baseline trend, should be debated. Even in the context of simulation data, it is not easy to determine which results show the best match with the simulation parameters, given that the question is "what are the optimal weights?" and, thus, "how different from an unweighted average should a weighted average be?" Third, we consider that the discussion on the theoretically most appropriate weight (i.e., the one that has the most solid statistical justification in the context of SCD data) can take place in parallel with empirical testing, carried out with real or simulated data. With the results presented here, the door for a substantive discussion appears to remain wide open, given that no major differences were obtained across the weighting strategies.
Fourth, some methodological implications of the results should be mentioned, taking into account the limitations discussed below. First, it might not be necessary to derive the sampling distribution of an effect-size index analytically (e.g., Hedges et al., 2012) or via simulation (e.g., Manolov & Solanas, 2012) in order to be able to obtain its variance and then use it as a weighting factor. Regarding the variance of standardized mean difference measures such as Δ, it has been claimed that the presence of serial dependence in the data makes the sampling distribution unknown and, thus, the formulae for the variances might not be correct (Beretvas & Chung, 2008b), which is one of the reasons for the current developments in the field by Hedges and colleagues.
This being said, we consider that until more evidence is available, two approaches seem to be logically and empirically supported. The first approach consists of using the weighting strategy whose underlying statistical foundations are more solid: the index variance. The work of Hedges et al. (2012) is an important step in this direction toward having available measurements and weights appropriate for SCDs, avoiding the need to make assumptions about the data so that they would fit the measures and weights used in group-design studies. Using a weight based on widely accepted statistical theory can be useful for enhancing the scientific credibility of meta-analyses of SCD data. Nonetheless, issues such as estimating autocorrelation (so that it can be accounted for) still need to be solved, whereas future developments more closely related to the d-statistic are also expected (Shadish, Hedges, et al., 2013).
The second approach consists of simplifying the weighting strategy to using either baseline length only or baseline length and variability, two widely available and relevant pieces of information. The main reason for such an option would be the lack of difference in performance (considering the limitations of the current evidence), as compared with the index variance weight. That is, following this approach would be based on the principle of scientific parsimony (also known as Occam's razor), according to which a simpler solution might be useful until it is demonstrated to be inferior. We consider that, subject to further testing and discussion, this approach is well aligned with the requirement of being "scientifically sound yet practical" (Schlosser & Sigafoos, 2008, p. 118). The first option would be to use only baseline phase length as a weight, given that it actually is a special case of the variance estimate presented by Hedges and colleagues (2012, Equation 5): It is the case in which autocorrelation is not taken into account and the focus is put solely on the baseline phase. Regarding the assumption of no autocorrelation, it might be justifiable considering the autocorrelations reported by Shadish and Sullivan (2011): the bias-corrected values ranged from −.010 for alternating treatment designs to .320 for MBD. The second option in the context of this parsimonious approach would be to use baseline length and the inverse of baseline data variability as a weight. The rationale for such a weight would be to avoid excessively penalizing multiple probe designs in which few preintervention measurements are obtained but show stability. Choosing either of the two approaches can be a question of further debate.
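The two parsimonious options can be sketched as follows; operationalizing the "inverse of baseline data variability" via the coefficient of variation is our assumption here, and the function names are illustrative:

```python
import statistics

def weight_length(baseline):
    """First parsimonious option: weight equals the baseline phase
    length n_A alone."""
    return len(baseline)

def weight_length_and_stability(baseline):
    """Second parsimonious option: baseline length combined with the
    inverse of relative baseline variability, so that a short but
    stable baseline (as in multiple probe designs) is not overly
    penalized."""
    cv = statistics.stdev(baseline) / statistics.mean(baseline)
    return len(baseline) / cv
```

Under the second option, for instance, a three-point baseline with low relative dispersion receives a larger weight than its length alone would grant, whereas the first option depends only on the number of preintervention measurements.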
Fifth, we would like to encourage applied researchers not only to publish their raw data in a graphical format, but also to compute primary summary measures such as means, medians, and standard deviations for each phase, given that this information is useful for computing the weights that are necessary for meta-analysis. This would help avoid any lack of precision due to imperfect data-retrieval procedures. Meta-analysis and the identification of the conditions under which interventions are useful would also benefit from reporting details about the participants, the settings, the procedures, and the operative definitions of the main study variables (Maggin & Chafouleas, 2013).
Finally, researchers carrying out meta-analyses are encouraged to report both an unweighted average and a weighted average based on the strategy they consider optimal. In that way, each meta-analysis would serve as evidence based on real data regarding the impact of using weighting in meta-analytical integrations. Furthermore, each meta-analysis not only would contribute to substantive knowledge, but also would give added value in terms of the methodological discussion on how to perform research synthesis in SCDs.

Limitations and future research
The results of the present study are limited to the weighting strategies and the effect size measures included. Regarding the  limitations of the meta-analysis of published data, we should mention the relatively small number of studies included and the inability to calculate variances due to flat baselines. The outlying weights due to lower baseline variability in some data sets can also be seen as a limitation. However, perfectly stable measurements can be obtained in behavioral studies (e.g., Costigan & Light, 2010), especially when the desired effect is to eliminate the behavior studied (e.g., Friedman & Luiselli, 2008) or, regarding the baseline phase, when the initial level is zero (e.g., Drager et al., 2006). The data meta-analyzed also reflect the fact that, in some cases but not in others, there might be lower baseline variability (e.g., for one of the behaviors of only 1 of the 4 participants, studied by Dolezal, Weber, Evavold, Wylie, & McLaughlin, 2007).
Regarding limitations specific to the simulation study, it focused only on MBDs, and it is not clear whether the results would have been different if a variety of design structures had been simulated for the data sets to be integrated; for instance, in the Burns (2012) meta-analysis, not all studies followed an MBD. Although this is the most common design structure, there are other designs that can provide strong evidence for intervention effectiveness according to the criteria presented by Kratochwill and colleagues (2010) and Tate et al. (2013), such as ABAB (used in 21 % of the empirical studies according to the Hammond & Gast, 2010, review; 17 % in Shadish & Sullivan, 2011; and 8 % in Smith, 2012) and alternating treatment designs (used in 8 % of the studies in the Shadish & Sullivan, 2011, review, plus 10 % as a combination of MBD and ATD; in Smith's, 2012, review, alternating and simultaneous treatment designs represented 6 % of the studies). Moreover, a restricted set of phase lengths was studied, and the data were generated on the basis of a continuous (normal) model, as is common in single-case simulation studies; but in many cases, the behavior of interest in real single-case studies is measured on a discrete ratio scale (e.g., frequency of occurrence). Additionally, more extreme conditions (e.g., greater degrees of heteroscedasticity) could have been studied, but we decided to constrain the simulation data to realistic values, obtained in the published studies. Finally, the meta-analysis of real-life data was carried out using only 10 studies, and thus, the generalization of the findings requires further field testing.
Apart from empirical comparisons between the procedures, we consider that a more thorough discussion of which is the most appropriate weight from a conceptual perspective is required. Additionally, more discussion is necessary on how to proceed with dependent outcomes within studies in order to obtain a single effect size per study, before carrying out any integration across studies.