Quantifying differences between conditions in single-case designs: Possible analysis and meta-analysis.

The current paper is a call for and illustration of a way of closing the gap between basic research and professional practice in the field of neurorehabilitation. Methodologically, single-case experimental designs and the guidelines created regarding their conduct are highlighted. Statistically, we review two data analytical options, namely (a) indices quantifying the difference between pairs of conditions in the same metric as the target behavior and (b) a formal statistical procedure offering a standardized overall quantification. The paper provides guidance in the analysis and suggests free software in order to illustrate, in the context of data from behavioral interventions with children with developmental disorders, that informative analyses are feasible. We also show how the results of individual studies can be made eligible for meta-analyses, which are useful for establishing the evidence basis of interventions. Nevertheless, we also point at decisions that need to be made during the process of data analysis.


Introduction
Research and professional practice need to keep in touch if the former is to be really useful for the latter and if professionals want their everyday work to contribute to constructing scientific knowledge. We consider that there are two gaps that need to be bridged between research and practice, namely (a) in terms of how data are gathered and (b) in terms of how data are analyzed. In the following, we first briefly discuss some guidelines for collecting data in a rigorous way by means of single-case experimental designs (SCEDs). Afterwards, we focus on the main objective of this study: the use and interpretation of standardized and raw average difference indices. These indices are presented in the context of other options for data analysis, whose important strengths and requirements are also mentioned. In order to discuss the indices in more detail, within-study analysis and across-studies meta-analysis are carried out and commented on. The analyses presented illustrate some of the challenges faced by applied researchers, and indications are offered about how to begin coping with these challenges.
Closing the gap between research and practice: Gathering data
One of the ways in which an intervention can be implemented and its effect assessed via a methodologically rigorous procedure is through SCEDs. Such designs entail recording a behavior of interest repeatedly, before and after an intervention is introduced. SCEDs have already been suggested (Graham et al., 2012) and actually used (Perdices & Tate, 2009) in research in rehabilitation. In that respect, McMillan (2013) mentions one aspect that may boost the use of SCEDs in this domain (the possibility of tailor-made interventions) as well as one condition for including single-case studies as a tool for identifying effective interventions: their good quality. In terms of applicability, SCEDs can be considered well suited for translating practice into research. Actually, the AB design is similar to the natural process of an initial assessment followed by a change in the conditions and continued measurement of the same behavior of interest (Rabin, 1981). Moreover, SCEDs allow studying low-prevalence problems and dispersed populations and focusing on individual clients, with the possibility to modify the intervention according to the client's responses (Edgington, 1983) or even to terminate any harmful interventions (Johnston & Pennypacker, 2009).
Nevertheless, research designs require more than a single switch from evaluation to intervention, given that AB designs are not considered sufficient for demonstrating intervention effectiveness (Tate et al., 2013). Such designs do not offer guarantees for ruling out alternative explanations for behavioral change (e.g., history), which is one of the three main criteria for causality, together with the need for the cause to precede and covary with the effect (Shadish et al., 2002). For dealing with this issue, at least three attempts to assess whether a change in the conditions is associated with a change in the target behavior are required. This requirement can be met, for instance, using multiple-baseline designs (MBD), which replicate the AB sequence across different participants, behaviors or settings (sometimes generally referred to as "tiers"), with the intervention for each AB comparison being introduced at a different moment in time. Other recommended, but less frequently used (Shadish & Sullivan, 2011), design structures include ABAB (within-case replication of the introduction and withdrawal of the intervention) and alternating treatment designs with a faster and more frequent change in the conditions (Kratochwill et al., 2010). Apart from choosing an appropriate design structure, choosing the moments of change in phase at random can further help rule out alternative explanations and boost scientific credibility (Kratochwill & Levin, 2010). In order to make its implementation feasible, the random assignment of the intervention start point can be made after the baseline data have stabilized (Heyvaert & Onghena, 2014).
In order to meet McMillan's (2013) requirement for "good quality" single-case research (p. 793), specific proposals have already been made for assessing quality. For instance, Horner et al. (2005) have proposed a set of quality indicators for the special education field, including requirements for a detailed description of participants and settings, precise definition of dependent and independent variables, as well as indicating how internal, external, and social validity can be improved. With a similar aim, Reichow et al. (2008) developed a set of quality indicators for evidence-based practice in autism. Fortunately, the criteria converge with the ones present in Horner et al. (2005). One of the main differences is that procedural fidelity (Ledford & Gast, 2014) is separated from the definition of the independent variable (as in the latter case it would be restricted to "treatment fidelity"). Another distinction is that visual inspection is added as a desired means of analysis (with statistical analysis present as a quality indicator for group design studies). A review applying the indicators proposed by Reichow et al. (2008) and Kratochwill et al. (2010) for assessing the quality of behavioral interventions in autism spectrum disorder (Camargo et al., 2014) suggested that most of the criteria were met by most of the studies included, although some aspects, such as design strength and treatment fidelity, were only met without reservations by approximately half of the studies, pointing at aspects that still need improvement.
Several specific proposals have been put forward for assessing the quality of a SCED (Smith, 2012). One of them is the methodological quality RoBiNT scale (Tate et al., 2013), useful for assessing research that has already been conducted and reported, and also for guiding the decision-making process while the research is still on-going. On the other hand, the soon-to-be-available guidelines from the SCRIBE project (Tate et al., 2014) can improve not only the reporting of SCED studies, but also how these studies are carried out, thanks to the specific aspects that these guidelines focus on.
Closing the gap between research and practice: Analyzing the data
The second gap that needs to be bridged refers to the distance between the analytical proposals made over the last decade and the actual SCED data analysis practices. Despite the evidence from the beginning of the century that visual analysis is still the most frequently used analytical method (Parker & Brossart, 2003), there is already evidence from the neurorehabilitation domain suggesting that statistical analysis is also being frequently used (Perdices & Tate, 2009). Nevertheless, the variety of current developments summarized in several special issues (e.g., Barker et al., 2013; Burns, 2012; Manolov et al., 2014a; Shadish, 2014) still needs to make its way into the public space, so that applied researchers learn about the existence, use, and interpretation of these analytical techniques.
Regarding data analysis, the What Works Clearinghouse (WWC) standards (Kratochwill et al., 2010) stress the usefulness of metrics, such as proportions or rates (when available), as well as the convenience of regression-based estimators and a comparison between those and nonparametric indices. Using the standards as a basis, the RoBiNT scale (Tate et al., 2013) highlights as appropriate either systematic visual analysis, or visual analysis complemented with quasi-statistical techniques, or the justified use of statistical procedures. Accordingly, we here illustrate the joint use of visual analysis and two descriptive or quasi-statistical procedures (see Footnote 1): mean phase difference (MPD) (Manolov & Solanas, 2013, with the modification from Manolov & Rochat, 2015) and slope and level change (SLC) (Solanas et al., 2010). We also use a proper statistical procedure, the d-statistic (Hedges et al., 2012, 2013). Our choice of analyses is based on the idea that these analytical techniques are likely to be correctly used and interpreted by applied researchers and on the fact that they offer complementary information: (a) in terms of the metrics: in the measurement units of the target behavior (MPD and SLC) and in standardized terms (d-statistic); (b) in terms of the data aspects taken into account: the d-statistic deals with autocorrelation and can express the results when trend is not controlled for (if the professional is not sure about the presence or the stability of trend), whereas the other two techniques automatically control for linear trend, but not for autocorrelation; and (c) in terms of the object of the quantifications: MPD and SLC provide separate values for each AB comparison, whereas the d-statistic yields an overall quantification across the replications present in the study. Nevertheless, the reader should be alerted that these are not the only possibilities for SCED data analysis and several other options will be commented on in the Discussion.
Closing the analytical gap is possible if practitioners become familiar with the alternative methods (and when each is most useful), but it also requires software implementations that make their application sufficiently easy. Following previous developments in SPSS (Shadish & Marso, 2013), SAS, or R (Brossart et al., 2014; Bulté & Onghena, 2012; Shadish et al., 2014), we refer (in the appendix) to code in R (R Core Team, 2015) created for performing the analyses presented here.
Footnote 1: We use the term "quasi-statistical" here given that it is employed in the methodological quality scale developed by Tate et al. (2013). This term refers to procedures that offer quantitative summaries of the data (i.e., a statistical description), but lack the possibility to obtain inferential results (e.g., confidence intervals) on the basis of standard errors that quantify the uncertainty in the point estimate. Thus, such quasi-statistical procedures do not meet the requirement of reporting confidence intervals about the effect size measures (Wilkinson & The Task Force on Statistical Inference, 1999), although the correctness of the standard errors and confidence intervals of inferential statistical techniques is subject to the fulfilment of their underlying assumptions.
In the following sections, we illustrate, in the context of published neurorehabilitation data, the information that the techniques chosen provide, as well as the challenges they present. We hope that the illustration and accompanying discussion help to convince applied researchers that it is possible to carry out sound analysis (according to the criteria by Tate et al., 2013) and make the results of a single study eligible for meta-analysis that can help build the evidence basis of interventions. We also hope to encourage researchers to pay close attention to all aspects of the data.

Data selection
Given that the aim of the current article was to illustrate the application of recent analytical developments to a real neurorehabilitation study, any such study would be appropriate, as the analytical techniques need to be applicable to all kinds of situations, if we are to consider them useful. Therefore, in September 2014, we simply carried out a hand search of the recent articles of Developmental Neurorehabilitation, looking for a study using a SCED, without further restrictions. The most recent study that we identified was one carried out by Ninci et al. (2013) and it aimed to improve, via a behavioral intervention, the eye contact between a therapist and a 4-year-old boy (called Felix) with pervasive developmental disorder-not otherwise specified. This study was conducted following a multiple-baseline design, which appears to be illustrative of the higher relative frequency of these design structures (Shadish & Sullivan, 2011; Smith, 2012). The study was presented by its authors as a replication of previous research (Foxx, 1977), which is why the latter was also selected. Foxx's (1977) study is described as a continuation and extension of a previous study by Foxx and Azrin (1973), also using overcorrection as a behavioral intervention, and this is the reason for selecting this study. Specifically, here, we focus on Study 2 reported by Foxx and Azrin (1973), aiming to reduce different kinds of self-stimulation in relatively long time periods. We decided to include this third study as well in our meta-analytical integration on the basis of the similarity in the type of intervention and participant characteristics (not on the basis of the target behavior, which is different) in order to illustrate (a) how to proceed when some studies aim to reduce the behavior of interest and others to increase it, and (b) that both the quasi-statistical quantifications and the statistical procedure used here can handle replicated ABAB data, but that they do it differently.
Finally, note that the three studies do not constitute a random or a representative sample of the research in the domain of behavioral interventions for children with developmental disorders; we rather chose somewhat related studies that illustrate the possibilities and challenges of meta-analysis.
In the Ninci et al. (2013) study, there are three tiers (i.e., three replications of the AB structure), one for each of three therapists, with independent and prompted eye contact as target behaviors. For all three replications, the study takes place in a childhood playroom and toys are used as reinforcers, which makes the study potentially more ecologically valid. Given the staggered introduction of the treatment (the intervention starts after 4 baseline measurement occasions for Therapist 1, 7 for Therapist 2, and 9 for Therapist 3; with $n_B$ being equal to 12, 9, and 6, respectively), Ninci et al.'s (2013) study uses a MBD.
In the Foxx (1977) study, there are also three tiers, one per participant (called Mike, Wilma, and Doug, who are 8, 8, and 6 years old, respectively), with two therapists also being present and active in the setting. Foxx (1977) describes the design as simultaneous treatment combined with changing criterion. The main interest is the effect of a functional movement training (i.e., an overcorrection). The condition with overcorrection also includes edibles and praise as reinforcers, whereas the comparison condition only includes the latter two aspects. For the current analysis, we will consider this comparison condition as a baseline, although it does include treatment, but not the treatment of interest (overcorrection). Therefore, for Therapist A, the phase lengths are as follows: $n_A = 4$ and $n_B = 17$ for Mike, $n_A = 23$ and $n_B = 6$ for Wilma, and $n_A = 4$ and $n_B = 20$ for Doug; for Therapist B, $n_A = 21$ and $n_B = 7$ for Mike, $n_A = 7$ and $n_B = 22$ for Wilma, and no intervention (and no comparison possible) for Doug. More information regarding the data selected for the analyses is presented below.

Data analysis
Within-study analysis
According to our view on how SCED data should be treated, it is necessary to always take into consideration three aspects: substantive or clinical significance, the graphical representation of the data, and the numerical summaries that can be obtained from them. For deciding on practical significance, we consider that practitioners are best suited to assess the presence and magnitude of improvement in the client, according to their knowledge of this person and his/her situation, as well as according to their professional experience.
Regarding visual analysis, the majority of evidence (e.g., Danov & Symons, 2008; Ottenbacher, 1990; Ximenes et al., 2009; but see Kahng et al., 2010 for an exception) indicates that visual analysts may not agree frequently enough. Therefore, complementing naked-eye analysis with visual aids (see Footnote 2) is a reasonable practice (Fisher et al., 2003). Moreover, visual aids can be considered part of the process of systematic visual analyses currently required by the existing standards (Kratochwill et al., 2010) and methodological quality assessment tools (Tate et al., 2013). Specifically, it is important to initially focus on the baseline and assess whether it presents certain stability or any improving or deteriorating trend. As a visual aid helpful in this process, the split-middle trend can be mentioned (Miller, 1985). The fitted split-middle trend would indicate, although less precisely than other options such as running medians (Tukey, 1977), whether the baseline is stable or not. Furthermore, one of the steps of systematic visual analysis requires comparing the projection of the baseline data with the actually obtained measurements during the following intervention phase, as when using the MPD. Finally, it has been suggested that data variability around the baseline trend (regardless of whether it is present or flat) needs to be considered via a stability envelope (Gast & Spriggs, 2010). We propose using 1.5 times the baseline phase interquartile range as a measure of variability for constructing this envelope, an option closely related to exploratory data analysis (Tukey, 1977). In the absence of trend, the stability envelope or the tools of statistical process control (Callahan & Barisa, 2005) may be appropriate.
Footnote 2: See Bulté and Onghena (2012) for a discussion of visual aids such as lines presenting phase means or medians, trend lines fitted to each phase, range lines representing the amount of data variability and also for software tools.
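To make the visual aid concrete, the following is a minimal sketch in Python (not the R code referred to in the appendix) of a split-middle trend projected into the intervention phase with an interquartile-range-based envelope. Several details are assumptions of this sketch rather than part of the procedure as published: the function names are hypothetical, the split-middle line is the simplified version passing through the median point of each baseline half, the envelope extends 1.5 times the interquartile range on each side of the projected trend, and quartiles are computed with the "inclusive" method of Python's statistics module.

```python
from statistics import median, quantiles


def split_middle_trend(baseline):
    """Return (slope, intercept) of a simplified split-middle line.

    Time is indexed 0, 1, 2, ... With an odd number of points, the middle
    point is left out of both halves (an assumption of this sketch).
    """
    n = len(baseline)
    if n < 2:
        raise ValueError("at least two baseline measurements are needed")
    half = n // 2
    # Median time point and median value of each half of the baseline
    x1, y1 = median(range(0, half)), median(baseline[:half])
    x2, y2 = median(range(n - half, n)), median(baseline[n - half:])
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept


def stability_envelope(baseline, n_intervention, k=1.5):
    """Project the baseline trend and return (lower, upper) bounds per occasion."""
    slope, intercept = split_middle_trend(baseline)
    q1, _, q3 = quantiles(baseline, n=4, method="inclusive")
    half_width = k * (q3 - q1)  # assumption: +/- k * IQR around the trend
    start = len(baseline)
    bounds = []
    for t in range(start, start + n_intervention):
        projected = intercept + slope * t
        bounds.append((projected - half_width, projected + half_width))
    return bounds


def outside_envelope(baseline, intervention):
    """Flag intervention measurements falling outside the projected envelope."""
    bounds = stability_envelope(baseline, len(intervention))
    return [not (low <= y <= high) for y, (low, high) in zip(intervention, bounds)]
```

With a made-up baseline of 2, 4, 6, 8, the fitted line has slope 2, and an intervention measurement of 20 on the first occasion falls outside the envelope, whereas a measurement of 10 on the second occasion does not. As stressed later for the real data, a point outside the envelope is not necessarily an improvement, and the projection can reach impossible values.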
Regarding the quantitative analysis, we propose and illustrate a combination of raw indices, expressed in the same measurement units as the data themselves, and standardized indices, as a classical statistical approach. Raw indices can be considered useful for helping applied researchers decide on the practical relevance of the change (i.e., in relation to the clinical significance mentioned earlier) as they are more directly interpretable, whereas standardized indices favor the comparison and integration of results from outcomes on different metrics in the same study or in different studies (Cumming, 2012).
MPD and SLC both estimate linear baseline trend as the average increase from one measurement to the next one. (In case there is no linear baseline trend, this is reflected in the estimate and no correction is performed.) MPD projects the estimated baseline trend into the intervention phase and compares it to the actually obtained data, a comparison suggested as part of systematic visual analysis. Thus, the raw mean difference that MPD quantifies is between projected and actual intervention phase measurements. Regarding SLC, the linear baseline trend estimated is removed from the data, being subtracted from all the (baseline and intervention phase) measurements according to their position in the data series. Afterward, using the detrended data, the trend still present in the intervention phase is estimated as the average increase from one measurement to the next one. This latter quantification represents change in slope: the average difference in the increase/decrease rate, per measurement occasion, in the intervention phase as compared to the baseline phase. After removing the intervention phase trend from the intervention data, the net level change is computed as the difference between the average of the detrended baseline phase data and the average of the doubly detrended intervention phase data. The joint use of these procedures answers Beretvas and Chung's (2008) call for the separate estimation of different effects and Swaminathan et al.'s (2014) emphasis on the need for the quantification of overall effect. The quantifications of both MPD and SLC can be standardized by dividing them by the standard deviation of the baseline phase measurements (Manolov & Rochat, 2015).
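The verbal description above can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the R code from the appendix: the function names are hypothetical, the projected baseline line in `mpd` is anchored at the baseline mean located at the middle baseline position (the published versions of MPD may anchor it differently), positions within the intervention phase in `slc` are counted from 0, and the level change is reported as intervention minus baseline.

```python
from statistics import mean


def mean_step(values):
    """Average increase from one measurement to the next one."""
    return mean(b - a for a, b in zip(values, values[1:]))


def mpd(baseline, intervention):
    """Mean difference between actual and projected intervention data."""
    trend = mean_step(baseline)
    # Assumption: the trend line passes through the baseline mean at the
    # middle baseline position (time indexed from 0).
    mid_t = (len(baseline) - 1) / 2
    level = mean(baseline)
    start = len(baseline)
    projected = [level + trend * (start + i - mid_t)
                 for i in range(len(intervention))]
    return mean(y - p for y, p in zip(intervention, projected))


def slc(baseline, intervention):
    """Return (slope change, net level change) after removing baseline trend."""
    n_a = len(baseline)
    trend_a = mean_step(baseline)
    # Remove baseline trend from all data according to position in the series
    det_a = [y - trend_a * t for t, y in enumerate(baseline)]
    det_b = [y - trend_a * (n_a + i) for i, y in enumerate(intervention)]
    # Trend still present in the intervention phase = change in slope
    slope_change = mean_step(det_b)
    # Assumption: intervention positions are counted from 0
    det_b2 = [y - slope_change * i for i, y in enumerate(det_b)]
    level_change = mean(det_b2) - mean(det_a)
    return slope_change, level_change
```

For a made-up series with baseline 1, 2, 3, 4 and intervention 10, 11, 12, 13, the baseline trend is 1 per occasion, the projected values are 5, 6, 7, 8, and both MPD and the SLC level change equal 5, with a slope change of 0. Dividing these values by the baseline standard deviation would give the standardized versions mentioned above.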
The general idea underlying the d-statistic by Hedges et al. (2012, 2013) is that the average difference between the baseline and intervention conditions (i.e., the raw measure of change) is divided by an estimate of the data variability. This latter estimate takes into account the within-case and between-case variances and autocorrelation. Moreover, the standardized mean difference estimate is corrected for small sample bias. The idea is relatively straightforward, but understanding the computations involved in the several formulae presented in Hedges et al. (2012, 2013) requires more advanced statistical knowledge. Given that both within-case and between-case variances are taken into account, the index is comparable to standardized mean differences obtained from group-design studies. Moreover, the standard error of the index allows constructing confidence intervals as well as using the inverse variance as weight in meta-analysis. In case trend is deemed to be present in the data, detrending is necessary before using the d-statistic, as it is not incorporated automatically in the procedure. The statistical model underlying the index assumes that (a) the change in level is constant across cases; (b) within-case residuals and between-case variation do not change over time and are normally distributed; and (c) within-case errors follow a first-order autoregressive process. The normality assumption can be tested using the relatively more powerful Shapiro-Wilk test (Razali & Wah, 2011), but Shadish et al. (2014) highlight that unbiased estimates of effect are obtained even in the absence of normality. The d-statistic index requires at least three cases.
In that sense, although it is common for SCED studies to include more than one participant (Shadish & Sullivan, 2011), this requirement still excludes part of the studies (for instance, the ones that use a single participant following an ABAB design and allowing for a within-case replication to demonstrate the experimental effect; Kratochwill et al., 2010).
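As a compact restatement of the general idea behind the d-statistic described above, the quantity being estimated can be written as follows. The notation is ours and is only a sketch: $\mu_A$ and $\mu_B$ denote the average levels in the baseline and intervention conditions, $\sigma^2$ the within-case variance, and $\tau^2$ the between-case variance. The actual estimator additionally accounts for autocorrelation and includes the small-sample bias correction, for which the formulae in Hedges et al. (2012, 2013) should be consulted.

```latex
\[
  \delta = \frac{\mu_B - \mu_A}{\sqrt{\sigma^2 + \tau^2}}
\]
```

Because the denominator includes the between-case variance, the resulting value is expressed on the same scale as a standardized mean difference from a between-groups design.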

Across-studies analysis
When performing a meta-analysis, it is necessary to deal with any possible dependence in the outcomes (especially when several outcomes are obtained from the same study), because it is assumed that the outcomes combined are independent (see Cheung & Chan, 2004, for a general overview and a proposal for taking dependence into account). One option is to avoid such dependence by using a single effect size per study (Lipsey & Wilson, 2001): averaging the effect sizes in a study or picking one of them at random or for a substantive reason (Borenstein et al., 2009). Another option is to model the dependence. With the advent and adaptation of multilevel models to SCED data, it is possible to use all outcomes obtained within a study and to take into account the nested structure of the data (effects within studies) and the dependencies that arise from it (Van den Noortgate & Onghena, 2003). Multilevel models would thus avoid the need for making (sometimes) arbitrary decisions about how to obtain a single effect per study and have also been shown to yield appropriate standard errors and interval estimates of the effects, even without the need to know in advance the amount of dependence, as assumed by multivariate meta-analytical models (Van den Noortgate, López-López, Marín-Martínez, & Sánchez-Meca, 2013). The option followed here, as we are not using multilevel models, is to obtain a single quantification of effect per study.
The d-statistic directly yields a single quantification for a MBD or a replicated $(AB)^k$ design (including replicated ABAB) and thus for a study. In contrast, MPD and SLC were initially proposed for comparing only a pair of phases, as in an AB design. This difference illustrates a distinction between some analytical techniques that handle complex design structures more directly (see Moeyaert et al., 2014, for multilevel models and Levin et al., 2012, for randomization tests) and other analytical techniques such as nonoverlap indices for which several proposals have been made regarding their application to design structures more complex than AB: Ross and Begeny (2014) compare techniques only in MBD data sets for which there is a single AB for each tier; Parker et al. (2011) use only the initial AB comparison from all design structures; and Olive and Smith (2005) recommend comparing the initial baseline to the final intervention condition.
In order to obtain a single MPD or SLC effect size per study, the weighted average of the quantifications for each AB comparison is computed, using the number of observations within a comparison as a weight. Focusing on the AB comparisons is consistent with Scruggs and Mastropieri's (1998) recommendation to perform only comparisons that maintain the A-B sequence and with Parker and Vannest's (2012) caution regarding a possible incomplete return to baseline levels in the withdrawal phase of an ABAB design, pointing at the possibility to omit the $B_1 A_2$ comparison for the calculation. At the within-study level, the weight for comparison j is computed as $w_j = n_A + n_B$ and the effect size is the weighted average $\sum_j w_j ES_j / \sum_j w_j$, where $ES_j$ denotes the MPD or SLC value for comparison j. Despite the fact that these weights do not capture the influence of all possible nuisance parameters (e.g., autocorrelation, intraclass correlation; Hedges et al., 2013), their use has been suggested by Shadish et al. (2008) and Kratochwill et al. (2010) when the variance of the estimator is unknown. Our decisions for obtaining a single effect size per study are explained next. Regarding the Foxx and Azrin (1973) study, the d-statistic can be computed for three replicated ABAB designs (Barbara, Wilma, and Tricia), but not for Mike, for whom only AB data are available; another option would have been to use the initial AB for all four participants. Thus, for MPD and SLC, we also omitted the data for Mike to ensure comparability and obtained the weighted average of all six AB comparisons (two per participant; see Footnote 3). The Foxx and Azrin (1973) data are primarily included for the meta-analytical purpose without paying in-depth attention to data patterns, as the Study 2 data suggest clear effects (high baseline self-stimulation reduced to 0 during the intervention phase) even when inspected only visually.
Regarding the Foxx (1977) study, the recommendation for using the d-statistic when there are at least three replications of AB sequences in a MBD made us focus on the data for Therapist A and exclude the data for Therapist B. We focus on the same data for the MPD and SLC to make the results comparable, obtaining the weighted average for the three participants. Regarding the Ninci et al. (2013) study, we focus only on prompted eye contact, as it is the type of behavior studied in Foxx (1977) and these are also the data that are more interesting for our illustrative purposes, as they allow pointing at situations in which the visual aids should be used with caution. Nevertheless, independent eye contact is also relevant for the substantive purpose of the study, as it is likely to be the ultimate goal for this target behavior (i.e., that the child becomes autonomous), although eye contact can be considered a mere prerequisite for teaching other more complex behaviors such as speech. Finally, a reasonable doubt can be raised regarding whether the results of the studies by Foxx and Azrin (1973) and Foxx (1977) are completely independent, given that they share one participant (Wilma; Mike's data from 1973 are not included in the meta-analysis).
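The within-study weighted average used for MPD and SLC, with the number of measurements in each AB comparison as the weight, can be sketched as follows in Python. The function names and the tuple layout are hypothetical, and the effect values in the usage note are invented for illustration only; they are not results from the studies.

```python
def weighted_average(effects, weights):
    """Weighted average of effect quantifications."""
    if not effects or len(effects) != len(weights):
        raise ValueError("effects and weights must be non-empty and equally long")
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def study_effect(comparisons):
    """Single effect per study from (effect, n_a, n_b) tuples.

    Each tuple describes one AB comparison; its weight is n_a + n_b.
    """
    effects = [effect for effect, _, _ in comparisons]
    weights = [n_a + n_b for _, n_a, n_b in comparisons]
    return weighted_average(effects, weights)
```

For instance, with the Therapist A phase lengths reported for Foxx (1977), namely (4, 17), (23, 6), and (4, 20), the three comparisons would receive weights of 21, 29, and 24, so that a call such as `study_effect([(10, 4, 17), (20, 23, 6), (30, 4, 20)])` (with invented effect values of 10, 20, and 30) gives the longest series the largest influence on the study-level value.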
Once a single effect size per study is obtained, it is also important to assign a weight to this effect size according to the amount of information available. The d-statistic, being based on solid statistical theory, allows using the inverse of the index variance as a weight, as is common in the meta-analysis of between-group studies. For MPD and SLC, this option is not available and the weight was proposed to be a function of the number of measurements available in the study and the inverse of the variability of the outcomes (Manolov & Rochat, 2015). Specifically, for each study k, the weight combines the series lengths $n_{Ai} + n_{Bi}$ of its AB comparisons with the inverse of the coefficient of variation of its effects, $1/CV'_k$. The idea of incorporating the within-study variability of effects is related to Hershberger et al.'s (1999) proposal for using what they call a "replication effect" quantification as a moderator. It has to be noted that this weight has not been derived analytically and, thus, it is not as statistically solid as an inverse variance weight. In case a researcher considers that it is not necessary or justified to include the information about the variability of effects and also considers that the impact of $1/CV'_k$ is likely to be too small to be relevant (see Footnote 4), Manolov and Rochat (2015) proposed and illustrated using $n_{Ai} + n_{Bi}$ only as a weight. Finally, note that we have referred to MPD and SLC so far as raw indices expressed in the same metric as the target behavior. The three studies reviewed here all use percentages (trials with eye contact in Ninci et al., 2013, and Foxx, 1977, and time samples with self-stimulation in Foxx and Azrin, 1973) and it is thus possible to integrate their results without any transformation. However, given that it is not likely that all studies included in a meta-analysis use the same measurement units, we will also illustrate how to apply the standardized versions of MPD and SLC; for a percentage version, see Manolov and Rochat (2015).
Footnote 3: Note that neither the MPD nor the SLC, in this application, takes into account the fact that the six AB comparisons belong to three (rather than six) participants. Such nesting is taken into account by the d-statistic and would also be taken into account by a multilevel model. The same results for MPD and SLC that we obtained would have been obtained via the following steps: (1) obtain a weighted average per participant, with the weight being the number of measurements per AB comparison; (2) obtain a weighted average per study out of the effects per participant, with the weight being the number of measurements per participant; (3) obtain the weighted average across studies, using series length and the inverse of the coefficient of variation, as explained.
Footnote 4: A preliminary study that we carried out showed that another possible weight P
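For the d-statistic, the inverse-variance weighting mentioned above corresponds to the usual weighted average in meta-analysis. The sketch below, in Python, assumes a common-effect (fixed-effect) combination, which is our simplifying choice for illustration; the function name is hypothetical and the values in the usage note are invented.

```python
from math import sqrt


def inverse_variance_pool(effects, variances):
    """Combine study-level effects using inverse-variance weights.

    Returns the pooled effect and its standard error under a
    common-effect model (an assumption of this sketch).
    """
    if not effects or len(effects) != len(variances):
        raise ValueError("effects and variances must be non-empty and equally long")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total
    standard_error = sqrt(1.0 / total)
    return pooled, standard_error
```

For two invented effects of 1.0 and 3.0 with equal variances of 0.5, the pooled value is 2.0 with a standard error of 0.5; if the first variance were smaller, the pooled value would move toward 1.0, which is how more precise studies gain influence.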

Analysis of individual studies
Visual analysis
Figure 1 contains the Ninci et al. (2013) data for prompted eye contact with added visual aids in the form of a split-middle trend estimated from and fitted to the baseline and projected into the intervention phase. Recall that this projection is made as an interval of values, defined according to the variability (1.5 times the interquartile range) of the baseline data. The improvement of the target behavior is evident, as no treatment phase measurements enter into the interval of values expected in case the intervention was ineffective. These data illustrate that the visual aids are to be interpreted with common sense. First, for Therapist 1, no data are included in the trend stability envelope, but at the end of the series, there is actually a deterioration as compared to what is expected if the baseline trend progressed unchanged and, thus, measurements outside the predicted interval are not necessarily equivalent to improvement. Second, for Therapist 3, the projection includes impossible negative values and, therefore, appears to be an inappropriate reference. For the Foxx (1977) data as gathered by Therapist A (Figure 2), the intervention phase behaviors are also increased with respect to what is expected in case baseline trends are maintained. The visual aids show that even for Mike and Wilma, for whom there is greater variability in the baseline phase and, therefore, less certainty in the exact values expected, the measurements obtained are outside the intervals that would suggest no change in the behavior. However, the short and variable baseline phase for Mike presents the challenge of deciding whether trend can be estimated with sufficient precision from such data. In summary, the behavioral change here seems clearer than for Ninci et al. (2013), given that the effect is sustained (rather than temporary) for all three replications.

Quasi-statistical and statistical analyses
Both Foxx (1977) and Ninci et al. (2013) focus on average levels per condition and ranges of values as the only quantifications on which they base their assessment of intervention effectiveness, although Foxx (1977) does mention a change in the behavior with time when commenting on the data sets.
However, we will see that they use other indicators of effectiveness not related to statistics. Therefore, taking the numerical and substantive evidence together, it can be argued that the analyses performed by Foxx and Ninci et al. are sufficient for their (within-study) purpose of demonstrating the effectiveness of the behavioral intervention. Still, it has to be stressed that the data analysis method used in both studies is very similar, despite more than 30 years of distance and despite the existence of new and promising analytical techniques. We focus on such techniques here as they build on the basic information of the difference in means and make it possible to compute effect size indices, which are necessary for quantitative integrations useful for establishing the evidence basis of interventions (Jenson et al., 2007).
As far as the numerical analysis is concerned, the raw MPD and SLC values are presented in Table I, whereas their standardized versions can be found in Table II.
Figure 2. Foxx (1977) data on the three participants, as gathered by Therapist A. The continuous line represents the split-middle trend fitted to the baseline data; the discontinuous line represents the interquartile range-based envelope around the projected split-middle trend.
For the Ninci et al. (2013) data on prompted eye contact, MPD suggests that, for all three replications, the actual intervention phase measurements are, on average, higher than the ones expected if the baseline trend had continued. For Therapist 1, the average difference is only 7%, which agrees with the visual representation, suggesting a crossing between the projected baseline trend and the actual intervention phase trend. For the remaining two therapists, the average difference between phases is around 30%, which also agrees with the visual impression of a clearer effect.
SLC provides more detailed information in its two estimates, with the slope change estimate suggesting for all three replications that the intervention phase trend is deteriorating with respect to the baseline trend. This was clearly noted for Therapist 1. For Therapist 2, the baseline is flat, whereas the intervention phase shows, on average, a decreasing trend: according to the estimate, it decreases, on average, by 7.88% per measurement occasion. For Therapist 3, the baseline trend is decreasing, but in the intervention phase, this decrease is even more pronounced. According to the results of the quantifications of change in slope, it would appear that the intervention led to worse effects. Nevertheless, the net change in level, after eliminating all linear trends, is highly positive: more than 50% average increase between conditions. This information quantifies the visual impression that there is a clear change in level, but that the effect of the intervention is not progressive (it does not continue improving with time) and is not even maintained at the same level (as the negative estimates for the slope change suggest).
Table I. Raw quantifications obtained for each of the data sets from Foxx and Azrin (1973), Foxx (1977) and Ninci et al. (2013)
Note. Prompted refers to the data gathered by Ninci et al. (2013) on prompted eye contact, with the digits indicating the therapist intervening and collecting the data. Foxx_A refers to Therapist A intervening and collecting the data in the study carried out by Foxx (1977), with the names specified afterward indicating the participant being studied. For the Foxx and Azrin (1973) data, the names refer to the participants, whereas the digits refer to the first or second AB comparison in the respective ABAB design. MPD: Mean phase difference. Weight equal to n_A + n_B for the AB comparisons. Overall weight equal to the number of measurements in the study plus 1/CV_k for the overall effect for the study, where CV is the coefficient of variation computed for each type of quantification.
The d-statistic summarizes the information about all three replications in a raw mean difference equal to 37.51% and a standardized mean difference corrected for small sample bias equal to 2.71 (standard error ≈ 0.60). Other pieces of information used by the d-statistic and provided as output are the autocorrelation estimate of 0.33 (incidentally, very similar to the average autocorrelation for multiple-baseline design studies reported in the review by Shadish and Sullivan, 2011: 0.32, indicating that the data are not independent) and an intraclass correlation equal to 0 (i.e., all the variation in observations is within therapists, not between therapists). On the one hand, the raw mean difference (37.51%) is greater than the weighted average for MPD (24.57%), probably related to the fact that MPD controls for trend, which is relevant for the data for Therapist 1.
On the other hand, the standardized value of the d-statistic (2.71) is smaller than the weighted average of the standardized MPD values (3.75), probably related to the fact that the latter is standardized according to the (relatively smaller) baseline variability, whereas the former takes into account the variability in all observations.
For the Therapist A data from the Foxx (1977) study, the MPD yields very high quantifications: for two of the participants, the increase in prompted eye contact is greater than 100%. This result can be explained by taking into account the fact that baseline trend is not fitted via the split-middle method (as shown in Figure 1), but as the average increase or decrease in successive measurements. This method leads to a negative trend being estimated for all three cases (Figure 3) and, if projected, this trend "predicts" negative percentages for the intervention phase. We have included this graph and these results to alert applied researchers using procedures controlling for trend, as short and variable baselines like Mike's may lead to such opposed estimates of trend.
Figure 3. Graphical representation of the Foxx (1977) data on the three participants, as gathered by Therapist A. The continuous line represents the trend fitted to the baseline data, as done in the Mean phase difference procedure: trend is the average decrease (in this case) between two successive baseline measurements.
Taking into account that trend is estimated in SLC in the same way as in MPD, it is not surprising that in all cases a positive change in slope is found (approximately 4% increase per measurement occasion during the intervention for the three participants). Moreover, there is a large net change in level, which is clearer for Wilma (87.66%), for whom the baseline phase measurements are lower than for the other two participants.
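The way a short, variable baseline can push the MPD projection below zero can be illustrated with a small Python sketch. It is not the authors' R implementation. The trend is taken, as described above, as the average change between successive baseline measurements. Two details are assumptions made for illustration: the projected line is anchored at the first baseline measurement, and the standardized version divides by the sample standard deviation of the baseline. The data are hypothetical.

```python
from statistics import mean, stdev


def baseline_trend(baseline):
    """Average change between successive baseline measurements."""
    diffs = [later - earlier for earlier, later in zip(baseline, baseline[1:])]
    return mean(diffs)


def mean_phase_difference(baseline, intervention):
    """Return (raw MPD, projected values for the intervention phase).

    Assumption: the trend line starts at the first baseline measurement.
    """
    slope = baseline_trend(baseline)
    n_a = len(baseline)
    projected = [baseline[0] + slope * (n_a - 1 + i)
                 for i in range(1, len(intervention) + 1)]
    differences = [y - p for y, p in zip(intervention, projected)]
    return mean(differences), projected


def standardized_mpd(baseline, intervention):
    """Raw MPD divided by the baseline standard deviation (assumed: sample SD)."""
    raw, _ = mean_phase_difference(baseline, intervention)
    return raw / stdev(baseline)


# Hypothetical percentages with a short, variable, decreasing baseline.
baseline = [40.0, 20.0, 30.0, 10.0]
intervention = [60.0, 70.0, 80.0, 90.0]
raw, projected = mean_phase_difference(baseline, intervention)
print(raw, projected, standardized_mpd(baseline, intervention))
```

In this hypothetical series the projected values quickly become negative, so the raw difference exceeds every observed intervention measurement, and a small baseline standard deviation inflates the standardized value further.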
The d-statistic summarizes the information about all three replications in a raw mean difference equal to 63.25% and a standardized mean difference corrected for small sample bias equal to 4.17 (standard error ≈ 0.86). Other pieces of information used by the d-statistic and provided as output are the autocorrelation estimate of 0.45 (once again suggesting that autocorrelation should be taken into account) and an intraclass correlation equal to 0. In this case, the raw value of the d-statistic (63.25%) is smaller than the weighted average for MPD (94.74%) and the standardized value (4.17) is also smaller than the standardized MPD (12.05). The results for MPD are influenced by the following: (a) the projection of the baseline trends into very low (or even impossibly negative) intervention phase values, which are then compared to the actual high intervention measurements; and (b) the low variability in the baseline for Doug (i.e., a very small denominator), which contributes to a very large average standardized difference.
Regarding the Foxx and Azrin (1973) data, due to space limitations, we will not review in detail the results presented in Tables I and II. We only mention that the raw d-statistic is equal to −67.63%, the standardized one to −4.42 (standard error ≈ 1.01), autocorrelation = 0.56, and intraclass correlation = 0.32, suggesting some variation across cases. Note that, given that the aim of the study was to reduce the target behavior (self-stimulation) and the intervention was effective, practically all quantifications have negative signs. These signs had to be reversed prior to carrying out the meta-analytical integration of results, so that a positive outcome always means that the treatment was effective in improving the outcome.

Assessment of practical significance
Obtaining evidence on the clinical significance of any behavioral change is a crucial part of data analysis. In the current section, we review the indicators of clinical significance used by Foxx (1977) and Ninci et al. (2013) and make some suggestions for additional assessment. First, the design used by Foxx already helps ensure practical significance, as the criterion for "adequate performance" changes according to the improvements observed in the participant (i.e., a glance is required initially and at least a 2-second eye contact in the end). The criterion established by Ninci et al. (2013) plays the same role: it requires that the prompts are provided consistently until there is eye contact in at least 80% of the opportunities, and the intervention phase for Therapist 3 was not terminated until the participant reached 70% independent eye contact.
Second, another planned aspect of the studies was the generalization training in Foxx (1977) and the maintenance measures obtained by Ninci et al. (2013) one and three months postintervention. In the Foxx study, the generalization training took place in a more natural setting, such as the day care program, and led to all children reaching 90% eye contact and to the fading out of the reinforcers (edibles and praise).
Third, Foxx (1977) reports that after several intervention sessions certain behaviors incompatible with attending to the teacher (e.g., bouncing on the chairs, pushing the table) stopped occurring. This is another indicator of the effectiveness of the program applied, as incompatible behaviors can be seen both as a tool (when they are reinforced in order to replace problematic conduct) and as a nuisance (when they stand in the way of the desired target behaviors).
Finally, for the maintenance measures, it could be useful to relate them to normative measures, such as the ones used in the initial assessment of the 4-year-old Felix, which place him in the 0-18 months group according to social interaction skills. It would be interesting to check the age equivalence of his behavior one and three months after the intervention. Another possibility is to evaluate the degree of overlap of the maintenance measure(s) with the baseline data (maintenance performance should be better) and with the intervention phase data (performance should be similar). Such a comparison could be performed using the Nonoverlap of all pairs (Parker & Vannest, 2009) or even the same d-statistic. The application of MPD and SLC is less clear here, as they take baseline trend into account and thus the requirement for comparing only adjacent phases (Gast & Spriggs, 2010) seems crucial.

Quantitative integration of several studies
As explained earlier, MPD and SLC quantify AB-comparisons, which afterwards need to be averaged in order to obtain a single effect size per study; the results of this process, using the number of measurements in each AB-comparison as a weight, are available in Tables I and II. Once a single effect size per study is available, a weight can be assigned to this effect size. For MPD and SLC, the number of measurements in the study and the inverse of the variability of the outcomes are used as elements of the weight. If we apply the formula for the weight to the standardized MPD outcomes for the Ninci et al. and the Foxx data, we observe that the relative variation of outcomes is approximately equal (59%) and thus the whole difference in weights is due to the number of measurements available:
The modified forest plot representing the MPD effect sizes expressed in the original metric (i.e., percentages) can be seen in Figure 4. We refer to this graphical representation as a modified forest plot, given that the intervals for the effects do not represent confidence intervals (the standard error of MPD is not known), but rather the range of the outcomes within the study for study effects and the range of effects across studies for the weighted average. The size of the square boxes still represents the weight of the study effect size, but this weight is not based on the inverse variance, but rather on the formula for w_k presented previously. Finally, we have chosen to order the studies according to their effect sizes in ascending order, which can also be done in a traditional forest plot, if the order is not chronological or alphabetical. From Figure 4, it can be seen that the weighted average difference between the projected baseline trend and the actual intervention phase measurements (82%) is closer to the results of the studies by Foxx (1977) and Foxx and Azrin (1973), the ones for which the effect is greater and for which the data series are longer. Despite the within-study variability of effects, there is clearly an effect of the behavioral intervention for children with developmental disabilities.
The same interpretation can be given to the results of the standardized MPD index presented in Figure 5. In this case, the weighted average (10.48) indicates that the overall difference between the actually obtained intervention data and the prediction made on the basis of the baseline trend is roughly ten times the variability of the baseline data.
For meta-analyzing the effect sizes obtained via the d-statistic (Figure 6), we used a random effects model, because we assume, as is commonly done, that the variability in the effects observed is due to both random error and true variation (e.g., in this case, due to the fact that the target behavior in the Foxx and Azrin, 1973, study is different) and given that random effects models allow making inferences to similar studies that vary in several characteristics beyond the exact people participating. The weighted average d once again suggests the effectiveness of the interventions tested in the three studies, with the overall difference between the measurements obtained in the conditions with and without behavioral intervention being equal to 3.57 standard deviations (which here take into account the variation in the observations within and between replications). The 95% confidence interval is [2.41, 4.74], indicating (a) the statistical significance (at the .05 level) of the weighted average, as the value of 0 is not included in the interval; and (b) the relatively low precision of the estimate due to the small number of studies being integrated. The estimated variance of the true effect sizes is τ² = 0.42, with the proportion of true heterogeneity out of the total variability observed being rather small, I² = 39.23%, that is, between the 25% and 50% cutoffs for small and medium heterogeneity. (Similar meta-analyses can be obtained for the d-statistic; Shadish et al., 2014).
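The logic of the random effects integration can be sketched with the standardized d values and standard errors reported in the text (2.71, 4.17, and the sign-reversed 4.42, with standard errors of about 0.60, 0.86, and 1.01). The Python fragment below is not the Across studies_d.R code. It assumes the DerSimonian-Laird moment estimator of the between-study variance and a normal-approximation confidence interval, neither of which is stated here as the method actually used. Because the inputs are rounded and the estimator may differ, the output should be close to the values reported above but need not reproduce them exactly.

```python
from math import sqrt


def random_effects(effects, std_errors, z=1.96):
    """Random effects pooling with the DerSimonian-Laird estimator
    (an assumption; other estimators of tau^2 exist)."""
    variances = [se ** 2 for se in std_errors]
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * d for wi, d in zip(w, effects)) / sum(w)
    q = sum(wi * (d - fixed) ** 2 for wi, d in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                         # between-study variance
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * d for wi, d in zip(w_star, effects)) / sum(w_star)
    se = sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci": (pooled - z * se, pooled + z * se),
        "tau2": tau2,
        "i2": i2,
    }


# Values as reported in the text; the Foxx and Azrin (1973) sign is reversed
# so that positive values always indicate improvement.
d_values = [2.71, 4.17, 4.42]
standard_errors = [0.60, 0.86, 1.01]
print(random_effects(d_values, standard_errors))
```

The study with the smallest standard error receives the largest weight, but adding τ² to every sampling variance makes the weights more similar than under a fixed effect model.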
Although both MPD and the d-statistic indicate a large effect of the behavioral interventions, there is a difference in the magnitude. Part of this difference can be attributed to the fact that there appear to be deteriorating trends for most of the data sets (even for Foxx and Azrin; figure not included here), and thus, the MPD values become larger. In case of improving trends, MPD is expected to provide lower values than the d-statistic (if data are not detrended prior to using the latter). The difference is potentially also due to how standardizing is carried out.

Discussion
In the current paper, we argue for closing the gap between methodological and statistical (basic) research and the studies that professionals carry out every day, so that this applied research can contribute to establishing the evidence basis of treatments. We decided to base the analytical options discussed here on the analytical practices already taking place, paying special attention to the visual representation of the data and averages for the conditions being compared. For that purpose, we chose to illustrate procedures that can help visual inspection and that quantify average differences. These procedures go beyond the mere comparison of means, as they allow (a) projecting baseline trend (or level in case data present no trend) and comparing it to the actually obtained intervention phase data (MPD); (b) controlling for trend and quantifying change in slope and change in level separately (SLC), which is especially relevant in case these two effects are not in the same direction, as for the Ninci et al. (2013) data for Therapist 1; (c) obtaining the difference in comparable standardized terms, taking into account autocorrelation, and constructing confidence intervals on the basis of strong statistical theory (d-statistic); (d) carrying out meta-analysis (MPD, SLC, and d-statistic, with the latter being equivalent to classical statistical procedures). Finally, we chose these procedures as MPD and SLC offer very specific quantifications for each AB-comparison, whereas the d-statistic provides an overall estimate of effect considering several features of the data.
Our choice of procedures can also be related to the characteristics of the data. Practically all the data sets from the three studies show 0% overlap, and nonoverlap indices are not especially useful for quantifying the magnitude of the difference between conditions when there is complete nonoverlap. For instance, the otherwise recommended Nonoverlap of all pairs (Parker & Vannest, 2009) would have yielded the value of 100% nonoverlap, without further distinction of the different magnitudes of effect. Figures 1, 2, and 3 also suggest that controlling for trend (e.g., via Tau-U; Parker et al., 2011) would probably not have made a difference.
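The ceiling of nonoverlap indices can be shown with a minimal Python sketch of the Nonoverlap of all pairs, computed here as the proportion of baseline-intervention pairs in which the intervention measurement is better, with ties counted as half. The function name, the `increase_is_improvement` parameter, and the data are hypothetical and chosen only for illustration.

```python
def nonoverlap_of_all_pairs(baseline, intervention, increase_is_improvement=True):
    """Proportion of (baseline, intervention) pairs showing improvement;
    tied pairs count as 0.5."""
    score = 0.0
    for a in baseline:
        for b in intervention:
            diff = b - a if increase_is_improvement else a - b
            if diff > 0:
                score += 1.0
            elif diff == 0:
                score += 0.5
    return score / (len(baseline) * len(intervention))


# Two hypothetical data sets sharing the same baseline.
baseline = [10.0, 15.0, 12.0]
modest_change = [20.0, 22.0, 25.0]
large_change = [80.0, 85.0, 90.0]
print(nonoverlap_of_all_pairs(baseline, modest_change))
print(nonoverlap_of_all_pairs(baseline, large_change))
```

Both hypothetical interventions lie entirely above the baseline, so the index is the same for both, although the average difference between conditions is about 10 percentage points in one case and more than 70 in the other.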

Other options for SCED data analysis
Apart from taking the data features into account and discarding nonoverlap indices, our choice of procedures was also based on the idea of highlighting practical procedures, although these are not necessarily the only ones appropriate.
First, for maintaining practicality, we did not focus on regression models (Swaminathan et al., 2014) and multilevel models, or the proposal of Pustejovsky et al. (2014) for an effect size index based on multilevel models. All these analytical options require that the researcher makes decisions on what aspects of the data are to be modeled (e.g., the kinds of effects expected, that is, changes in level or in slope; the relevance of the variation in effects across cases; the way to proceed with autocorrelation; the potential need for standardizing the data in case different measurement units are used; Van den Noortgate & Onghena, 2008). The use of such models is advised under supervision from an experienced analyst. Nonetheless, the supervision and/or training pays off if one is willing to model different data features according to the characteristics of the data at hand (e.g., presence of trend or variability in the effects across the cases) or according to a more general theoretical or empirical background (e.g., the presence of autocorrelation, curvilinear trends). Moreover, as stated previously, multilevel models can handle several outcomes per study.
Second, we also did not focus on simple procedures offering information in comparable units (percentages, not standard deviations) such as the Mean baseline reduction (Campbell, 2004) or the percentage reduction data (Wendt, 2009), in order to avoid forcing the researcher to decide whether to use all the data or only the last three measurements, respectively, given that such a choice might sometimes be based on which results better match the research hypothesis rather than on an a priori substantive justification.
Third, we could not use a randomization test (Heyvaert & Onghena, 2014) for the current data, given that no random assignment of conditions to measurement occasions had taken place when gathering the data; random assignment is a requirement for the validity of the procedure (Edgington, 1980) and is also necessary for its adequate performance (Ferron et al., 2003). If random assignment had taken place, a randomization test could provide information in terms of statistical significance and offer the researcher the possibility of choosing the effect size index to be used as a test statistic, although software implementations such as the SCDA plug-in for R (Bulté & Onghena, 2012) include only a limited set of mean difference test statistics.
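For a design in which the start of the intervention had been chosen at random, the logic of a randomization test could be sketched as follows. This Python fragment is not the SCDA plug-in and does not apply to the data analyzed in this paper, which were gathered without random assignment. It assumes a hypothetical AB design whose intervention start point was drawn at random from a predefined set of admissible start points, and it uses the difference in phase means as the test statistic, which can be swapped for another index through the `statistic` argument. All names and data are illustrative.

```python
from statistics import mean


def mean_difference(series, start):
    """Intervention mean minus baseline mean when the intervention
    begins at index `start` (0-based)."""
    return mean(series[start:]) - mean(series[:start])


def randomization_test(series, actual_start, admissible_starts,
                       statistic=mean_difference):
    """One-sided p-value: share of admissible start points whose statistic
    is at least as large as the one actually observed."""
    if actual_start not in admissible_starts:
        raise ValueError("the actual start must be one of the admissible starts")
    observed = statistic(series, actual_start)
    reference = [statistic(series, s) for s in admissible_starts]
    p_value = sum(1 for r in reference if r >= observed) / len(reference)
    return observed, p_value


# Hypothetical series; the intervention was assumed to start at index 6,
# drawn at random from the admissible start points 4 to 8.
series = [20.0, 25.0, 22.0, 24.0, 21.0, 23.0,
          60.0, 65.0, 70.0, 68.0, 72.0, 66.0]
observed, p_value = randomization_test(series, 6, [4, 5, 6, 7, 8])
print(observed, p_value)
```

The smallest attainable p-value equals one divided by the number of admissible assignments, so the randomization scheme has to be planned before data collection if conventional significance levels are to be reachable.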

Cautions necessary when analyzing data
Despite our desire to make data analysis easier for the reader, one of the things to be learned from the analyses presented here is that there are still decisions to be made. First, one decision is whether to use a procedure that controls for trend (like MPD and SLC) or not (like the d-statistic) and, in case trend is to be controlled, what method to use for estimating it: regression analysis, the split-middle method common in visual analysis, the method used in MPD and SLC, the method used in Tau-U, or the trisplit discussed and promoted by Parker et al. (2014). For the Foxx (1977) data, we saw that in some cases different procedures can lead to very different estimates of trend. In case trend is taken into consideration, the researcher has to decide whether its control or projection is reasonable or out of bounds, as for the Ninci et al. (2013) data gathered by Therapist 3. In order to detect such situations, it is critical to interpret the quantitative analysis guided by the visual inspection of the data. Even when the data are visually inspected, the analyst cannot focus only on the visual aids, but must also attend to the scale of the ordinate, which indicates the values predicted by projecting the trend. Thus, we recommend an in-depth visual inspection of the graph, given the number of data features it can inform about (Parker et al., 2006) and in order to assess how well baseline trend is estimated and fitted. The SCDA plug-in for R described in Bulté and Onghena (2012) includes several options for estimating trend that can help find the one that approximates the data best. Second, the user has to know the data well enough to decide whether an overall quantification (d-statistic) is sufficiently informative or whether it is necessary to distinguish the changes in slope and in level (SLC); another option is to compute both.
Another lesson illustrated is about the standardized quantifications obtained. We saw that the values are far away from Cohen's (1992) benchmark for a large effect (0.8). In this context, it is necessary to stress that Cohen himself proposed the benchmarks tentatively, until further evidence is available. These results also illustrate the generally accepted opinion that Cohen's interpretative benchmarks are not suitable for SCED data (Parker et al., 2005), which has led the US Institute of Education Sciences (2014) to state that one of its priorities is to establish alternative guidelines for these designs.

Limitations and future research
This paper presents certain limitations, apart from the already highlighted fact that not all possible (and promising) SCED analytical techniques were illustrated. First, the study does not offer a formal comparison of the performance of the three techniques, given that a limited set of studies was used for illustrating some challenges that researchers may have to face. Second, the study focused on the quantifications and to a lesser extent on visual analysis. The discussion of clinical importance is left to the professionals, who are better equipped to use substantive criteria than we are. Third, quality indicators were not applied, given that the focus of this already extensive paper was analytical. Nevertheless, it could have been interesting to explore whether methodological and reporting improvements have taken place from the initial study to its replication more than 30 years later. In any case, professionals considering the use of SCED are encouraged to get acquainted with the methodological quality indicators (Horner et al., 2005; Kratochwill et al., 2010; Reichow et al., 2008; Tate et al., 2013), as these indicators are also relevant to the field of neuropsychological rehabilitation.
Finally, we urge methodologists and statisticians to explain the developments that they have worked on in a way that would make them understandable and attractive to applied researchers. We also advocate for incorporating these developments in easy-to-use software (such as the one included in the Appendix) accompanied by explanations of the quantifications obtained. We hope that the current paper serves as an example of such an effort to bring these developments closer to their intended users and that these users will try to keep their data analytical knowledge up to date.

Meta-analysis of several studies
First, the meta-analysis via MPD and SLC can be performed using the R code available at https://www.dropbox.com/s/wtboruzughbjg19/Across%20studies.R. Second, the meta-analysis via the d-statistic, as presented here, can be carried out using this R code: https://www.dropbox.com/s/41gc9mrrt3jw93u/Across%20studies_d.R. Shadish et al. (2014) offer further R code for performing meta-analyses with this index.
Shadish et al. (2014) explain the use of their code, as mentioned above. For the remaining pieces of code mentioned in this Appendix, there is a step-by-step tutorial called "Single-case data analysis: Software resources for applied researchers" available from https://www.researchgate.net/profile/Rumen_Manolov and https://ub.academia.edu/RumenManolov. This tutorial offers (a) an initial introduction to R and R-Commander; (b) an explanation of the way in which data should be organized in order to apply the analysis; (c) a visually guided list of actions that are required from the user so that the code can be downloaded and executed; and (d) a short guide on the interpretation of the results obtained, with the corresponding reference to the original articles presenting each analytical technique.

The results obtained here
In order to obtain the results presented in this paper, no further specific code was created or adapted. Therefore, the interested reader can replicate the analysis using the code mentioned above and following the indications of the tutorial. What is specific is the data set used, especially given that the data were retrieved from the graphs of the articles by Foxx and Azrin (1973), Foxx (1977), and Ninci et al. (2013). Therefore, we offer an Excel file (available at https://www.dropbox.com/s/ybvdhf4q2u3q73q/FoxxNinci.Data.xlsx?dl=0 and as online supplementary material) with all the data used and the different ways of organizing them, according to the procedure used. For the analysis and meta-analysis using MPD, SLC, and the d-statistic, we recommend that, in order to replicate the analysis, the reader save each Excel worksheet separately as a tab-delimited text file and then load this text file when performing the analysis. For obtaining the graphical representation of the data and the fitted split-middle trend and its projection, it is necessary to modify the corresponding R code (https://www.dropbox.com/s/5z9p5362bwlbj7d/ProjectTrend.R), introducing the values from the "Measurements" column after "score <- c(" and the length of the baseline phase after "n_a <-". There is also a worksheet for obtaining the weights for MPD and SLC for the data analyzed in the current article.