Assessing Consistency of Effects when Applying Multilevel Models to Single-Case Data

Abstract
In the context of single-case experimental designs, replication is crucial. On the one hand, the replication of the basic effect within a study is necessary for demonstrating experimental control. On the other hand, replication across studies is required for establishing the generality of the intervention effect. Moreover, the "replicability crisis" presents a more general context further emphasizing the need for assessing consistency in replications. In the current text, we focus on replication of effects within a study, and we specifically discuss the consistency of effects. Our proposal for assessing the consistency of effects refers to one of the promising data analytical techniques, multilevel models, also known as hierarchical linear models or mixed effects models. One option is to check, for each case in a multiple-baseline design, whether the confidence interval for the individual treatment effect excludes zero. This is relevant for assessing whether the effect is replicated as being non-null. However, we consider that it is more relevant and informative to assess, for each case, whether the confidence interval for the random effects includes zero (i.e., whether the fixed effect estimate is a plausible value for each individual effect). This is relevant for assessing whether the effect is consistent in size, with the additional requirement that the fixed effect itself is different from zero. The proposal for assessing consistency is illustrated with real data and is implemented in free user-friendly software.


Assessing Consistency of Effects when Applying Multilevel Models to Single-Case Data
Single-case experimental designs (SCEDs) are research designs that entail one or several individuals being studied longitudinally, with multiple measurements taken under different conditions manipulated by the researcher. SCEDs offer the possibility to carry out methodologically rigorous studies for gathering evidence on the effect of interventions (Barlow et al., 2009). SCEDs have been recognized as useful in a variety of contexts such as special education, neuropsychological rehabilitation (Tate & Perdices, 2019), sport psychology (Barker et al., 2011), and biomedicine (Janosky et al., 2009). The field has experienced developments in terms of assessing methodological quality (Ganz & Ayres, 2018), data analysis (Kratochwill & Levin, 2014) and meta-analysis (Maggin et al., 2017), as well as reporting (Tate et al., 2016). Nevertheless, several challenges still remain, such as choosing among many data analytical options (Manolov & Moeyaert, 2017), and discussing the importance of randomization (Kratochwill & Levin, 2010; Ledford, 2018) and replication (Lanovaz et al., 2019).
The aim of the current study is to propose a way of assessing consistency in data features and consistency of effects when performing a multilevel analysis of single-case data. Given that the assessment of consistency is based on the need for replication in single-case research, we first discuss the concepts of replication and consistency, highlighting their relevance and recent salience. Afterwards, we provide a rationale for focusing on multilevel models as a data analytical technique.

Replication and Consistency
A recent special issue of Perspectives on Behavior Science (Hantula, 2019) focused on the "replicability crisis" in psychology, and how behavior analysts can thoughtfully proceed in their use of SCEDs. In the SCED context, within-study replication is relevant for internal validity, although it is only one of several aspects to consider (Ganz & Ayres, 2018; Perdices et al., 2019; Wendt & Miller, 2012). Specifically, the iterative manipulation of the independent variable and the subsequent changes observed in the dependent variable increase the confidence that these changes are not due to external factors (Horner et al., 2005), such as history and maturation (Petursdottir & Carr, 2018). In order to document experimental control, the covariation between changes in the behavioral pattern and the introduction (and withdrawal) of the intervention is to be observed at least three times, with more specific recommendations available according to the SCED used (What Works Clearinghouse, 2020).
Two kinds of replication can be distinguished within a SCED study. "Direct replication" or "within-subject replication" takes place in a reversal/withdrawal design, a multiple-baseline design, or an alternating treatments design (Horner et al., 2005; Tincani & Travers, 2019). Additionally, "systematic replication" or "inter-subject replication" can be achieved within a study (e.g., replication of a reversal/withdrawal or an alternating treatments design across participants; replication across settings of a multiple-baseline design across participants) or across studies (Horner et al., 2005; Kennedy, 2005).
When dealing with direct replication, one of the relevant concepts is consistency (Ledford, 2018). Although consistency has been highlighted especially in the context of visual analysis (What Works Clearinghouse, 2020), there have also been recent proposals for its quantification (Tanious, De, Michiels, et al., 2019; Tanious, Manolov, et al., 2019). Specifically, both visually and quantitatively, two types of consistency can be distinguished: consistency of measurements from similar phases and consistency of effects (e.g., when comparing data points from adjacent phases).
We consider that it is necessary to distinguish between a successful replication of an effect and a successful and consistent replication. Whether a "basic effect" (Horner & Odom, 2014) is present is an assessment that is usually performed visually, dealing with several data features, such as level, trend, variability, overlap, and immediacy (Ledford et al., 2019; Maggin et al., 2018; What Works Clearinghouse, 2020). Subsequently, several attempts to replicate the basic effect take place and an evaluation is performed regarding whether the replication was successful (i.e., whether a functional relation or experimental control is documented). However, suppose we proceed quantitatively and focus the quantification on the immediate effect, because trends are not expected: the difference between the mean of the last three baseline data points and the mean of the first three intervention phase data points could be computed (Horner & Kratochwill, 2012; Michiels & Onghena, 2019a). On the one hand, if the immediate effects for each participant in a study are all greater than zero (or than a minimally relevant effect), this would be indicative of a successful replication, in case there are no other data features (e.g., trend, variability) that suggest the contrary. On the other hand, if the values of the immediate effect are similar (e.g., there are small deviations from the average effect, which is greater than zero or than a minimally relevant effect), this would be indicative of a successful replication with a consistent immediate effect. In the following text, we focus on multilevel models and we first discuss a definition for successful replication, before presenting our main proposal for a definition of a successful and consistent replication.
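As a minimal illustration of this quantification (the function, data values, and variable names below are ours, purely for demonstration), the immediate effect could be computed in R as follows:

immediate_effect <- function(baseline, intervention) {
  # Difference between the mean of the first three intervention points
  # and the mean of the last three baseline points
  mean(head(intervention, 3)) - mean(tail(baseline, 3))
}

# Hypothetical three-participant study targeting a reduction in behavior
effects <- c(immediate_effect(c(7, 6, 8, 7), c(3, 2, 2, 1)),
             immediate_effect(c(6, 7, 7, 6), c(2, 3, 1, 2)),
             immediate_effect(c(8, 8, 7, 8), c(3, 3, 2, 2)))
all(effects < 0)  # all replications show an effect in the predicted direction?
sd(effects)       # a small spread is indicative of a consistent effect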

Focus on Multilevel Modeling
Multilevel models are one of the promising analytical alternatives for SCED data analysis (Van den Noortgate & Onghena, 2007) and they have been recommended in domains such as education (Dedrick et al., 2009), experimental psychology (DeHart & Kaplan, 2019), and aphasiology (Wiley & Rapp, 2019). Multilevel models were chosen as the focus of the current text, as they are applicable to different SCEDs and enable taking multiple data features into account (Shadish et al., 2013). For instance, unlike nonoverlap measures, multilevel models take autocorrelation into account (Baek & Ferron, 2013). Moreover, unlike the between-case standardized mean difference and the log response ratio (Pustejovsky, 2018), they do not assume absence of trend or require detrending. Finally, multilevel models do not preclude using visual analysis (Davis et al., 2013).
The focus of the current text is on the evidence obtained in a single study, using a SCED. This initial clarification is important for two reasons. On the one hand, replication in the SCED context can refer both to repeated demonstrations of a basic effect (e.g., a difference between two adjacent phases) in the same study (Ninci, 2019) and to the replication of effects across studies, in relation to the way in which a practice can be established as being "evidence-based" (Jenson et al., 2007; Schlosser, 2009). On the other hand, multilevel models, which are the focus of the current text, have a noteworthy application for meta-analysis (Moeyaert, 2019; Van den Noortgate & Onghena, 2003a, 2003b). In the current text, we focus on within-study replication and the use of multilevel models as in studies using multiple-baseline designs (Ferron et al., 2009). At the within-study level, the multilevel model usually includes two levels, whereas at the across-studies level, it usually includes at least three levels, although several variations are possible.
In the next section we discuss several possible ways in which consistency of results could be assessed when using multilevel models. Afterwards, we make a proposal and illustrate it with real data.

A Ratio of Effects to No Effects
The "replicability crisis" has been linked to the misuse and abuse of null hypothesis testing (Branch, 2019) and to the fact that p-values do not inform about the likelihood to replicate the effect observed in a given sample (Killeen, 2019). As stated previously, in the SCED context, the presence or absence of a basic effect is usually determined by visual analysis rather than by means of statistical tests (Maggin et al., 2018), and this effect has to be replicated several times within the same study (What Works Clearinghouse, 2020). For the most commonly used designs multiple-baseline and reversal/withdrawal (Shadish & Sullivan, 2011)the requirement is for three replications. However, the recommendation of three demonstrations of a basic effect (for direct replications), just like the requirement for the amount of evidence required for calling a practice "evidence-based" (see the 5-3-20 rule in Horner & Kratochwill, 2012), more closely related to systematic replications, do not take into account the number of attempts for replication that did not yield the expected positive result. Following Kratochwill et al. (2018), it is possible to distinguish between a "negative result" (absence of a demonstration of an effect or lack of evidence for effectiveness) and a "negative effect" (an iatrogenic effect of the intervention). The implications of these two different kinds of unexpected and undesired results are not identical.
While a negative effect may more clearly provide evidence against an intervention, a lack of a positive result may lead to introducing methodological modifications (Tincani & Travers, 2018) or to identifying relevant moderator variables related to the characteristics of the participant and/or the target behavior (Ledford et al., 2016). Such considerations are only possible if selective reporting of positive results does not take place (Shadish et al., 2016; Simmons et al., 2011).
In summary, a given practice can be labelled as evidence-based, potentially evidence-based, neutral/mixed effects, insufficient evidence, or negative effects, according to the number of methodologically rigorous studies and their results (Cook et al., 2015). Specifically, for direct replication in the SCED context, it has been suggested that a ratio of at least 3:1 effects to no effects (with no evidence for negative effects) is necessary for demonstrating experimental control (Cook et al., 2015; Maggin et al., 2013). Incidentally, the suggested 3:1 ratio resembles the historically used critical ratio of three (Garrett, 1937), which usually related a mean difference to its standard error (e.g., Nolte, 1937). Before applying the 3:1 ratio, it is necessary to define what an "effect" is; the following paragraphs deal with this aspect.

Defining What an "Effect" Is
It may not be straightforward to define what an effect is when performing a visual analysis (see Wolfe et al., 2019), but we will not discuss this here, given that the focus is on multilevel models. When quantifying, it may be more straightforward to define an "effect" objectively, but it is still not a flawless process. At the outset, we discard grounding the definition of an "effect" on the estimate of the fixed effect (e.g., whether it is greater than zero), because it only refers to the average and not to each of the replications. Moreover, we also discard using statistical significance as the sole basis for defining an effect. Apart from the usually mentioned interpretative drawbacks of a p-value (Gigerenzer, 2004; Nickerson, 2000), it is not clear that any extrapolation to a population is reasonable in the absence of random sampling of individuals (Edgington & Onghena, 2007).
An initial option is to put the focus on the sign of the empirical Bayes estimates obtained for the individual treatment effects (Ferron et al., 2010). An individual treatment effect of the correct sign (indicating an improvement) would be interpreted as an "effect". Subsequently, if the ratio of individual effects with the predicted sign to the effects with the opposite sign is at least 3:1, this could be interpreted as sufficient evidence for direct replication. Additionally, borrowing the logic of the difference between p-rep (replication of the correct sign) and p-support (replication of an effect of a certain size or more; see Sanabria & Killeen, 2007), a minimally relevant difference can be determined prior to gathering the data for labeling the effect as significant.
However, we consider that the focus on the point estimates of the individual treatment effects may not be justified, given that these estimates are biased (Ferron et al., 2010). In order to take into account the precision of the estimates, a more stringent and probably more defensible option would be to count as an "effect" only those individual treatment effects whose confidence intervals are entirely on the predicted side of zero; that is, only intervals not containing zero would be considered positive effects. Analogously, it could be required that the confidence interval lies entirely beyond a pre-specified minimally important difference. Therefore, a successful replication would be defined as a 3:1 ratio (or greater) of individual treatment effects whose confidence intervals exclude zero (or the minimally important difference) to those whose confidence intervals do not.

Obtaining Individual Treatment Effects in Multilevel Models
When using multilevel models, it is necessary to construct a design matrix that represents the kind of effect that the researcher is interested in modelling (Moeyaert, Ugille, et al., 2014). In order to obtain the individual treatment effect estimates and their confidence intervals, the dummy variable representing the phase has to be included as a random effect but not as a fixed effect. It is noteworthy that, even if all individual treatment effects are greater than zero (or than the minimally relevant difference), this does not mean that they are similar in value. Thus, following this option we would have evidence on whether the replication is successful, but not on whether it is consistent. We deal with consistency of effects in the following section.
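A sketch of how this could be done with the lme4 R package used later in the text follows; the data frame and column names (scd_data, score, phase, case) are assumptions for illustration, and the 95% intervals use a normal approximation:

library(lme4)

# Phase dummy (0 = baseline, 1 = intervention) in the random part only
m_ind <- lmer(score ~ 1 + (1 + phase | case), data = scd_data)

re <- as.data.frame(ranef(m_ind, condVar = TRUE))  # condval = estimate, condsd = its SD
ind <- subset(re, term == "phase")                 # empirical Bayes individual effects
ind$ci_low  <- ind$condval - 1.96 * ind$condsd
ind$ci_high <- ind$condval + 1.96 * ind$condsd

# Tally as an "effect" each case whose interval excludes zero
# (here, for a target behavior that should decrease)
sum(ind$ci_high < 0)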

More Loosely Related Antecedents: A Review of Quantifications of Heterogeneity
The current text deals mainly with one of the two types of consistency: consistency of effects. In order to obtain some overall indication of the difference between conditions and to gain statistical power, "internal meta-analysis" of the results obtained in a single study has been suggested (Goh et al., 2016; Hales et al., 2019). Meta-analysis could thus be considered to provide a way to measure consistency or heterogeneity of effects (Swan et al., 2020). Specifically, a possible quantification of the degree of (lack of) consistency could stem from the heterogeneity test and related quantifications. However, the Q-test can be expected to have low statistical power when few effect sizes (here, direct replications) are being quantitatively integrated (Lipsey & Wilson, 2001). Additionally, a drawback of the descriptive quantification known as I² (the proportion of true variance in effect sizes, with respect to the total observed variance) is that it is only a relative measure that may not be informative enough (Borenstein et al., 2017). Therefore, it seems that these two options cannot be meaningfully borrowed from the general context of meta-analysis and adopted for a quantification of consistency of effects at the within-study level.
Two of the analytical procedures proposed for SCED data are noteworthy, due to the fact that they: (a) are directly applicable to studies including several participants; and (b) incorporate quantifications that can be useful for assessing consistency of effects, as an indicator of the degree to which direct replication has been achieved. The between-case standardized mean difference (BC-SMD; Hedges et al., 2012, 2013) yields, among other quantifications, an "intraclass correlation" (ICC), interpreted as the amount of variability across participants as a proportion of the whole variability (within and across participants). Therefore, this value could be understood to quantify the degree to which the data patterns are not consistent, with 0.3 as a possible cut-off value indicating consistency (Hedges et al., 2012). The ICC in the BC-SMD context can be understood as representing both the consistency of data in similar phases and the consistency of effects, because even if the average difference were the same for all participants, the ICC would not be equal to zero unless the phase means were also the same across participants. Thus, it is not a pure quantification of consistency of effects.
In the context of multilevel models, an ICC can also be computed, with a similar interpretation as for the BC-SMD (see Dixon & Cunningham, 2006, for several interpretations). Actually, the ICC is usually computed for a null (also called unconditional or intercept-only) model without predictors, in order to verify whether a multilevel model is needed, i.e., whether there are relevant dependencies to be modelled (Gage & Lewis, 2014). Thus, its use after the definitive model with predictors is built is not that common.

Tanious, De, Michiels, et al. (2019) propose a quantification of consistency of effects, called CONEFF, referring to five data aspects, as present in the What Works Clearinghouse (2020) Standards: change in level (standardized mean difference), change in trend (using ordinary least squares estimation), change in variability (variance ratio), immediacy of the effect (the last three baseline phase measurements compared to the first three intervention phase measurements), and overlap between data from adjacent phases (using the Nonoverlap of All Pairs; Parker & Vannest, 2009). Actually, CONEFF could be applied to other ways of quantifying these five data features. In contrast, we here focus on the assessment of consistency of the change in level and change in slope, in the context of a multilevel model. As a strength of the current proposal, using multilevel models eliminates the ambiguity regarding exactly how to operatively define data features such as overlap and trend, both with multiple definitions suggested in the SCED context (see Parker et al., 2011, and Manolov, 2018, respectively).

More Closely Related Antecedents: Quantifications of Consistency
A quantification of consistency of data in similar phases, called CONDAP, has been suggested for several SCEDs (Tanious, De, Michiels, et al., 2019; Tanious, Manolov, et al., 2019). CONDAP can be accompanied by a randomization test in case randomization is present in the design (Tanious, De, & Onghena, 2019). CONDAP is based directly on the data, without referring to any analytical procedure or representation such as a mean line or a trend line. In contrast, we here propose an assessment of the consistency of data in similar phases related to the estimates of the intercept and baseline trend, according to a multilevel model. The aim is to fully benefit from the output of a multilevel analysis (e.g., interpreting individual treatment effects and random effects). Nevertheless, if desired, an additional quantification such as CONDAP can be used for an assessment of consistency of data patterns in similar phases that is not based on modeling.

Discussing Initial Options
In the context of multilevel models, when the immediate change in level and the change in slope are modeled as random effects, it is possible to compute the variance in these effects. These variance estimates could then be used as an indicator of lack of consistency. One approach would be to argue that, if a variance is not statistically significant, then a random effect is not necessary in the model, because there is not sufficient variability across participants in the treatment effect. However, there are three reasons why we do not recommend using the statistical significance of the variance as a criterion. First, there are different ways to assess statistically the importance of a random effect: via a Z test under the assumption that the sampling distribution of the variances is normal (Moeyaert, 2019), or comparing the deviance values (−2 times the log likelihood) of the models with and without the random effect via a chi-square test (Hox, 2010). These two tests need not necessarily coincide, and both are suspect with small sample sizes, because the variance estimates are biased in such contexts (Ferron et al., 2009). Second, not rejecting the null hypothesis does not justify drawing a conclusion about similarity (Gigerenzer, 2004), and it is not the same as performing a test of statistical equivalence (Tryon, 2001). Third, a summary measure such as the variance and the evaluation of statistical significance seem excessively general ways of assessing consistency across individuals, as they collapse all the information about the variation into a single value (the estimate or the p-value). In contrast, in the SCED context, it is recommended to summarize the information in such a way as to maintain the information about each individual (Hagopian, 2020), which is also well-aligned with some statistical approaches for contrasting hypotheses for all participants, rather than on average (Klaassen, 2020). Accordingly, the proposal that we make in the following section allows representing how much each individual effect differs from the average, rather than how much all individuals, on average, differ from the average.
In order to help interpret the variance, and not to focus exclusively on its associated p-value, a coefficient of variation could be computed for each of the effects: the immediate change in level and the change in slope. The numerator would be the square root of the estimated variance of the effect and the denominator would be the absolute value of the corresponding fixed effect (i.e., the estimated average). The coefficient of variation can be expressed as a percentage and, unlike the ICC or I², it is relative to the average effect estimated, which may lead to more meaningful interpretations regarding whether this variability is considerable or not. For the coefficient of variation to serve as a quantification of how consistent (or, actually, how inconsistent) the effect is, the fixed effect estimate should be indicative of an effect being present.
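A minimal sketch of this computation, under the same assumed data frame and column names as above, with the phase dummy included both as a fixed and as a random effect:

library(lme4)

m <- lmer(score ~ phase + (phase | case), data = scd_data)

vc <- as.data.frame(VarCorr(m))  # variance components of the fitted model
sd_phase <- vc$sdcor[vc$grp == "case" & vc$var1 == "phase" & is.na(vc$var2)]
cv_percent <- 100 * sd_phase / abs(fixef(m)["phase"])  # coefficient of variation (%)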
As a limitation of the use of the coefficient of variation, it has to be mentioned that a specific and universal cut-off point for a "small" coefficient of variation (and thus for sufficient consistency to be interpreted as a successful and consistent replication) does not exist. A second, and more important, limitation is that there is evidence that the variance estimates can be biased for fewer than five participants in the study (Ferron et al., 2009; Moeyaert et al., 2017).¹ A third limitation is that, in case the fixed effect estimate is very small (e.g., close to zero), a large coefficient of variation can be expected, which would reduce its informative value. Therefore, the coefficient of variation needs to be interpreted with caution. As an alternative, we present our main proposal next.

¹ Ferron et al. (2009) used restricted maximum likelihood estimation applied to data including an immediate and sustained change in level and report that the between-participants variance in the treatment effect was overestimated. In contrast, Moeyaert et al. (2017) generated data including both an immediate change in level and a change in trend and report that the between-participants variance in the immediate treatment effect was underestimated, both for full and restricted maximum likelihood estimation. For the evaluation of consistency, underestimating the variance of the effect would induce false "evidence" for consistency (i.e., a false positive), whereas overestimating the variance would induce false "evidence" against consistency (i.e., a false negative). The former is likely to be considered more detrimental, considering the alpha and beta error rates that are usually deemed acceptable (Cohen, 1992).

A Proposal for Assessing Consistency of Individual Effects
An alternative to using the variance estimate as the basis for assessing consistency would be to use the random effect estimates. This proposal is similar to the previously mentioned possibility for assessing replication in that it is based on confidence intervals. For assessing replication, we focused on the confidence intervals for the individual intervention effects. In contrast, we here focus on the confidence intervals for the random effects (i.e., the difference between the fixed effect estimate and the individual treatment effect estimate). Specifically, it is possible to check how many of the confidence intervals for the random effects include 0. In this case, a value of zero for the random effect would represent an individual treatment effect equal to the fixed effect estimate (i.e., the average for all participants). In that sense, if the confidence interval for a random effect includes 0, then it would be plausible for the individual treatment effect to be equal to the average. This method of assessing consistency is strengthened by having longer observation series with less error variance, because studies designed in this manner will tend to have more precise estimates of the random effects (i.e., narrower confidence intervals that all include 0 make a stronger argument for consistency). With this option, it would still be necessary to check that the fixed effect estimate exceeds zero or a minimally relevant value. Note that, for obtaining the estimates of the random effects for the treatment effect, it is necessary to include the dummy variable representing the phase both in the fixed and in the random part of the equation.
Once the number of positive and consistent effects is tallied, two quantifications are possible. On the one hand, it can be checked whether the ratio of effects to no effects meets or exceeds 3:1. On the other hand, the percentage of positive and consistent effects can be computed. Obviously, the 3:1 ratio corresponds to 75% of the confidence intervals for the random effects including 0.
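The following R sketch summarizes the proposed assessment, again under the assumed names scd_data, score, phase, and case, and with normal-approximation 95% intervals (the website presented in the Software Considerations section performs these steps without requiring code):

library(lme4)

# Phase dummy in both the fixed and the random part, as required
m <- lmer(score ~ phase + (phase | case), data = scd_data)

confint(m, parm = "phase", method = "Wald")  # the fixed effect should exclude zero

re <- as.data.frame(ranef(m, condVar = TRUE))
ph <- subset(re, term == "phase")             # random effects: individual minus average
ph$ci_low  <- ph$condval - 1.96 * ph$condsd
ph$ci_high <- ph$condval + 1.96 * ph$condsd

consistent <- ph$ci_low < 0 & ph$ci_high > 0  # does each interval include zero?
100 * mean(consistent)                        # percentage of consistent effects
sum(consistent) / sum(!consistent)            # ratio to compare against 3:1 (Inf if all consistent)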

Illustrating the Assessment of Consistency of Effects: Lambert et al. (2006) Data
The data on disruptive behaviors, gathered by Lambert et al. (2006) following an ABAB design replicated over nine participants, have been used in several articles that present and compare different analytical options (e.g., Michiels & Onghena, 2019b; Peng & Chen, 2015; Shadish et al., 2014). In terms of a quantification of consistency, we will comment on the results of two of these articles. Shadish et al. (2014) applied the BC-SMD and obtained a bias-adjusted standardized mean difference equal to −2.51 and, more importantly for the current aim, an ICC equal to .03, suggesting that almost all the variability in scores is within participants, indicating consistent results across participants. Moeyaert, Ferron, et al. (2014) applied several multilevel models; we here focus on the model that quantifies the average difference in level, without considering trend or autocorrelation, presenting quantifications separately for the A1-B1 comparison and for the A2-B2 comparison (Model 1B in Moeyaert, Ferron, et al., 2014). For the change in level in the A1-B1 comparison, the variance reported is equal to 0, suggesting a marked consistency in the effect. For the change in level in the A2-B2 comparison, the variance reported is equal to 1.02, with an associated p-value of .148, indicative of lower consistency, as compared to the effect in the A1-B1 comparison. The software used by Moeyaert, Ferron, et al. (2014) for obtaining the estimates is SAS 9.3. In order to be able to use a caterpillar plot for representing the random effects, we used the R package called lme4 (https://cran.r-project.org/web/packages/lme4/index.html). The data file used for the illustration provided here can be downloaded from https://osf.io/p3bna/, where there is also a time series line plot representing the measurements obtained by Lambert et al. (2006). For this initial illustration ("Model.L1"), we apply a multilevel model which includes only a dummy variable representing phase and treats this dummy variable as a random effect. In that sense, the estimates obtained are the average baseline level and the average change in level when the intervention is introduced (as fixed effects), as well as the between-case variance of these effects.
The numerical results can be consulted from Table 1. Additionally, the individual empirical Bayes estimates for level and change in level were obtained, ranging from 5.86 to 7.47 for the baseline level and from −6.20 to −4.86 for the change in level. The graphical representation of the model can be consulted in Figure 1.
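For transparency, the following sketch shows how Model.L1 can be fitted and the caterpillar plot produced with lme4; the file name and column names are assumptions, and the data file itself is available from https://osf.io/p3bna/:

library(lme4)
library(lattice)

lambert <- read.csv("Lambert_A1B1.csv")  # hypothetical file name for the A1-B1 data

model_L1 <- lmer(score ~ phase + (phase | case), data = lambert)
summary(model_L1)  # fixed effects and variance components summarized in Table 1

dotplot(ranef(model_L1, condVar = TRUE))  # caterpillar plots (cf. Figure 2, upper panel)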

Table 1
Results from Applying Multilevel Models Representing Change in Level, Using the lme4 Package (columns: Aspect, Average Estimate, Standard Error, Standard Deviation; rows for Model.L1, Model.L2, and Model.S1).
Note. The average estimate represents the fixed effect, whereas the standard deviation represents the random effect.

For assessing consistency, the caterpillar plot for the A1-B1 comparison is presented in the upper panel of Figure 2, where it can be checked how many of the nine confidence intervals for the random effects include zero (the fixed effect estimate is equal to −5.7955 using the lme4 package vs. the corresponding value reported by Moeyaert, Ferron, et al., 2014, using SAS). Additionally, the lower panel of Figure 2, including the empirical Bayes estimates of the individual treatment effects (LevelChange), indicates that all nine point estimates suggest a reduction. Actually, eight of the individual effects exceed a reduction of five disruptive behaviors. However, a cut-off value for a minimally relevant difference would ideally be established prior to gathering the data. The coefficient of variation, dividing the square root of the variance by the absolute value of the fixed effect estimate, would be 100 × (0.62423/|−5.7955|) = 10.77%.

Figure 2
The upper panel includes the caterpillar plot for the random effects and their confidence intervals, as obtained via a multilevel model including only change in level, for the A1-B1 comparison from Lambert et al. (2006). The lower panel includes the empirical Bayes estimates of the individual treatment effects.
For the A2-B2 comparison ("Model.L2"), the graphical representation is available in Figure 3, whereas the numerical results regarding the fixed and random effects can be consulted from Table 1. Additionally, the individual empirical Bayes estimates for level and change in level were obtained, ranging from 4.18 to 8.52 for the baseline level and from −6.45 to −3.37 for the change in level.

Figure 3
Graphical Representation of the Multilevel Model Representing Change in Level, Applied to the A2-B2 Comparisons from the Lambert et al. (2006) Data

For assessing consistency, the caterpillar plot for the A2-B2 comparison is represented in Figure 4, upper panel. It can be seen that five out of nine confidence intervals (55.56%, or a ratio of 1.25:1) include the fixed effect estimate (equal to −5.06 using the lme4 package vs. −5.08 reported by Moeyaert, Ferron, et al., 2014, using SAS). Additionally, the lower panel of Figure 4, including the empirical Bayes estimates of the individual treatment effects, indicates that all nine point estimates suggest a reduction. However, greater variability in the A2-B2 effects is visible, as compared to the A1-B1 effects in the lower panel of Figure 2. Accordingly, the coefficient of variation, dividing the standard deviation of the random effect by the absolute value of the fixed effect estimate, is larger than for the A1-B1 comparison: 100 × (1.2349/|−5.061032|) = 24.40%.

Figure 4
The upper panel includes the caterpillar plot for the random effects and their confidence intervals, as obtained via a multilevel model including only change in level, for the A2-B2 comparison from Lambert et al. (2006). The lower panel includes the empirical Bayes estimates of the individual treatment effects.

Illustrating the Assessment of Consistency of Effects: Sherer and Schreibman (2005) Data
In order to illustrate the results for a data set with lower consistency, we use the data on appropriate speech, gathered by Sherer and Schreibman (2005) using a multiple-baseline design across participants, and previously used for illustrating multilevel modeling for meta-analysis. Just as for the previous illustration, we apply a multilevel model ("Model.S1") which includes only a dummy variable representing phase and treats this dummy variable as a random effect. The fixed and random effects can be consulted from Table 1. Additionally, the individual empirical Bayes estimates for level and change in level were obtained, ranging from −0.47 to 49.83 for the baseline level and from 1.10 to 56.82 for the change in level. The graphical representation of the model can be consulted in Figure 5. Two different profiles were distinguished in the Sherer and Schreibman (2005) study: responders and nonresponders. We here used the data in order to illustrate a study with lack of consistency, without suggesting that it is necessarily meaningful to integrate quantitatively the results of all participants. The data file used for the illustration provided here can be downloaded from https://osf.io/p3bna/, where there is also a time series line plot representing the measurements obtained by Sherer and Schreibman (2005).

Figure 6
The upper panel includes the caterpillar plot for the random effects and their confidence intervals, as obtained via a multilevel model including only change in level, for the Sherer and Schreibman (2005) data. The lower panel includes the empirical Bayes estimates of the individual treatment effects.

Additional Illustration with More Complex Models: Consistency of Effects and Consistency in Similar Phases
The illustrations presented in the text so far refer to the simplest model in which only a mean difference is modelled in absence of trend. In the current section, we present the results for a model that also includes general trend and change in trend after introducing the intervention. For such a model, it is most common to code and interpret the change in level as an immediate change taking place during the first intervention phase measurement occasion (Moeyaert, Ugille, et al., 2014). Moreover, there are two effects whose consistency can be assessed: the immediate change in level and the change in trend. Additionally, it is also possible to perform a more complete evaluation of the consistency of data in similar phases, by comparing the intercept (initial baseline level) and the baseline trend across participants. In contrast, in the previously presented simpler models, for performing an assessment of the consistency of similar phases, we could have only focused on the intercept, which then represented the average baseline level.
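Under the coding just described, one way to specify such a model in lme4 (variable names are again assumptions for illustration) is the following:

library(lme4)

# time      = measurement occasion, centered at the first intervention occasion
# phase     = 0/1 dummy for the intervention (immediate change in level)
# time_in_B = occasions elapsed since the start of the intervention (0 in baseline)
m_trend <- lmer(score ~ time + phase + time_in_B +
                  (time + phase + time_in_B | case),
                data = scd_data)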
The more complex model can be applied to the Lambert et al. (2006) data, because Moeyaert, Ferron, et al. (2014) and Shadish et al. (2014) also discuss possible baseline trends. We refer to this model as "Model.L3"; the numerical results can be consulted from Table 2.

Table 2
Results from Applying Multilevel Models Representing Immediate Change in Level and Change in Slope, Using the lme4 Package (columns: Aspect, Average Estimate, Standard Error, Standard Deviation; rows for Model.L3 and Model.S2).

Figure 7
Graphical Representation of the Multilevel Model Representing Immediate Change in Level and Change in Slope, Applied to the A1-B1 Comparisons from the Lambert et al. (2006) Data

For assessing consistency, the caterpillar plot for the A1-B1 comparison is presented in Figure 8. In terms of consistency of effects, all nine confidence intervals include the fixed effect estimate for the immediate change in level, whereas for the change in trend eight of the nine confidence intervals include the fixed effect estimate. According to the coefficient of variation, for the immediate change in level there is very small variability and high consistency: 100 × (0.423604/|−6.1857317|) = 6.85%. Given that the estimate for the change in trend is very close to zero (i.e., there is practically no change in trend), the coefficient of variation suggests less consistency (100 × (0.575299/|−0.2671592|) = 215.34%), but it should not be the main quantification for such a small effect. In terms of consistency of data in similar phases, focusing on the baseline, eight of the nine confidence intervals include the fixed effect estimate for the intercept and for the baseline trend. The coefficient of variation for the intercept is 100 × (0.424273/|6.2486904|) = 6.79%, whereas for the baseline trend it is 100 × (0.081208/|0.1432806|) = 56.68%. Once again, there is apparently lower consistency in baseline trend, but this is related to the data presenting almost no baseline trend on average.

Figure 8
Caterpillar plot for the random effects of a multilevel model including trend, change in trend, and immediate change in level, for the A1-B1 comparison from Lambert et al. (2006).
The visual inspection of the Sherer and Schreibman (2005) data suggests that there are different trends in the baseline and intervention phases, which makes the more complex model reasonable. We refer to this model as "Model.S2". Table 2 includes the numerical results for the fixed effects (baseline level, immediate change in level, baseline trend, and change in trend) and the standard deviations representing the random effects. Additionally, the individual empirical Bayes estimates were obtained: for the immediate change in level, ranging from −5.63 to 11.55; for the baseline trend, ranging from −0.08 to 0.57; and for the change in trend, ranging from −0.68 to 1.27. The graphical representation of the model can be consulted in Figure 9.
The caterpillar plot is presented in Figure 10. Regarding the consistency of effects, none of the six confidence intervals includes the fixed effect estimate for the immediate change in level, whereas for the change in trend two of the six confidence intervals include the fixed effect estimate. Accordingly, the coefficient of variation is very high in both cases. In summary, the graphical representation of the confidence intervals for the random effects can be used to distinguish between a data set with more consistent and successful replications (Lambert et al.) and a data set with lower consistency in similar phases and lower consistency in effects (Sherer and Schreibman).

Beyond Multiple-Baseline Designs
Multilevel modelling in general, and the current proposals for assessing consistency of effects at the within-study level in particular, are most straightforward for multiple-baseline designs. Actually, several reviews of published SCED research suggest that multiple-baseline designs are the most commonly used ones (Hammond & Gast, 2010; Shadish & Sullivan, 2011; Smith, 2012), present in more than half of the articles reviewed.
For other SCEDs, some decisions need to be made before applying a multilevel model. For instance, for an ABAB design replicated across participants, several design matrices are possible, allowing for different comparisons (Moeyaert, Ugille, et al., 2014). For an alternating treatments design replicated across participants, in case there is an initial baseline phase before the comparison phase with rapid alternation of conditions, it is possible to compare the baseline to each of the alternating conditions (Moeyaert, Ugille, et al., 2014); otherwise, the average difference between the alternating conditions can be computed. For a changing criterion design, one option is to compare the baseline phase to the last intervention subphase, i.e., the one corresponding to the final criterion level (Faith et al., 1996); another option is to quantify the slope of the trend line across all intervention subphases. It should be noted that, for applying a multilevel model and for assessing the consistency of effects within a study, it is necessary to replicate the reversal/withdrawal, alternating treatments, or changing criterion design across participants.
Once the appropriate design matrix is constructed and the multilevel analysis is carried out, the assessment of the consistency of effects can be performed as described in the previously presented examples.
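For instance (a sketch under the assumed names used earlier, with a hypothetical pair variable indicating the AB pair), an ABAB design replicated across participants can be analyzed separately per AB pair, mirroring the Model.L1 and Model.L2 illustrations:

library(lme4)

# One of several possible approaches: separate models for the two AB pairs
m_A1B1 <- lmer(score ~ phase + (phase | case),
               data = subset(scd_data, pair == "A1B1"))
m_A2B2 <- lmer(score ~ phase + (phase | case),
               data = subset(scd_data, pair == "A2B2"))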

Discussion
The research on multilevel models in their application to a single study has primarily focused on studying the estimation of fixed and random effects, as well as the coverage of confidence intervals (e.g., Baek & Ferron, 2013; Ferron et al., 2009; Ferron et al., 2010; Moeyaert et al., 2017), Type I error and power (Heyvaert et al., 2017), or dealing with count data (Declercq et al., 2019). Thus, the focus of the current text (namely, consistency of effects) is novel and it complements previous research. Moreover, the focus on consistency is well-aligned with recent research on the topic (Tanious, De, Michiels, et al., 2019; Tanious, Manolov, et al., 2019). As a strength of the proposal made here, this assessment of consistency can be performed using a free user-friendly website and it can be easily represented visually. This makes it more likely to be accepted by applied researchers.

The Assessment of Consistency in the Context of Model Building
One of the questionable research practices mentioned in relation to the "replicability crisis" is making ambiguous choices regarding data analysis (Hantula, 2019), which could be countered by preregistering analysis plans (Hales et al., 2019). A multilevel model, just like the BC-SMD, imposes the same kind of quantification for all participants for whom the different conditions are being compared. Such an analytical practice avoids the possibility of adapting the analysis or the quantitative emphasis to the most salient features of the data. However, the flexibility of multilevel models comes at the price of many decisions that need to be made regarding the exact model to apply (Baek et al., 2016).
In relation to model building, the decisions (e.g., whether to include trend, and which effects to include as random) could be made in relation to what is visible on the plots of raw data, but such a practice potentially leads to overfitting (Baek et al., 2016; Hox, 2010). The subjective visual inspection can be complemented by fit indices (e.g., the Akaike or the Bayesian information criterion) for deciding whether a more complex model offers sufficient improvement in fit (Dedrick et al., 2009; Ferron et al., 2008). In that sense, more complex models are not necessarily desirable, given that they may entail estimation problems and require larger samples (Wiley & Rapp, 2019). A different kind of comparison across models can be made via sensitivity analysis: checking the degree to which the conclusions change for different modeling options (Baek & Ferron, 2013).
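For instance, a hedged sketch of such a comparison (under the same assumed names, refitting with full maximum likelihood so that the deviances are comparable):

library(lme4)

m_simple <- lmer(score ~ phase + (phase | case), data = scd_data, REML = FALSE)
m_trend  <- lmer(score ~ time + phase + time_in_B +
                   (time + phase + time_in_B | case),
                 data = scd_data, REML = FALSE)

AIC(m_simple, m_trend)    # Akaike information criterion
BIC(m_simple, m_trend)    # Bayesian information criterion
anova(m_simple, m_trend)  # deviance-based (likelihood ratio) comparison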
In order to avoid excessively data-driven decisions and to reduce the possibility of overfitting, the model can be selected prior to data collection on the basis of theoretical considerations and previous evidence (Ferron et al., 2008; Onghena et al., 2018; Wiley & Rapp, 2019). For instance, whether to model baseline trend can be based on the expectations regarding spontaneous improvement (e.g., in neurorehabilitation; Krasny-Pacini & Evans, 2018) or on the knowledge about baseline stability (Baek et al., 2014), whereas whether to model change in trend can be related to whether a gradual effect is expected (e.g., in academic interventions; Maggin et al., 2018). Additionally, if modeling trend is considered necessary, Shadish et al. (2013) suggest that random intercepts and random slopes are both needed for the proper modelling of autocorrelation. In fact, several illustrations of the use of multilevel models incorporating terms for trend include both random intercepts and random slopes (e.g., Baek et al., 2014; Gage & Lewis, 2014). In the context of the current proposal, including random intercepts and random slopes allows for the assessment of consistency in similar phases (i.e., consistency of baseline level and baseline trend across cases) and consistency of effects (i.e., consistency of change in level and change in slope). In summary, considering that all models are wrong (Box & Draper, 1987), trying out multiple models without an a priori basis may lead not only to capitalizing on chance, but also to ethical concerns (Levin et al., 2017). Thus, we recommend that the rationale for the chosen model should at least partially be related to the expectations stemming from the available literature, whereas visual analysis can still be used post hoc, in order to comment on the meaningfulness of these quantifications (Parker et al., 2006). Specifically in relation to the current proposal, including a pre-defined criterion for what is considered to be a successful and consistent replication, as suggested here, is expected to lead to results that are less affected by the "researcher degrees of freedom" (Hantula, 2019).

Software Considerations
For the proposals made in the current text, we opted for a software implementation in R, because it offers the possibility to create a freely available menu-driven website, via the Shiny package. In contrast, software such as SAS (which has been previously presented for using multilevel models; Baek & Ferron, 2013; Moeyaert et al., 2013) is commercial and would require that the user works with programming code (syntax).
Using the website https://manolov.shinyapps.io/ExpectedPattern/ it is possible to obtain both numerical results and graphical representations. The website provides an example of the expected data structure, whereas the example data sets used in the current text can be obtained from https://osf.io/p3bna/. Once a data file is located and loaded, it is possible to specify expectations (such as the presence of baseline trend or the immediacy of the effect) that help with choosing a multilevel model. After the expectations are specified, the quantitative results of the multilevel models are obtained, along with line graphs representing the measurements for all participants, with superimposed mean or trend lines. Additionally, caterpillar plots such as the ones included in the present text are also obtained.
However, given that the topic is consistency, we have to mention that there may be inconsistencies between the different software programs used for carrying out multilevel analyses, and even between different packages within R. Specifically, the nlme package (https://cran.r-project.org/web/packages/nlme/index.html) allows modelling autocorrelation, which can be considered an advantage given the evidence available on the presence of autocorrelation in SCED data (Shadish & Sullivan, 2011).
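As an illustration of this capability (a sketch under the assumed names, with session as the measurement occasion within each case), the change-in-level model could be refitted in nlme with a first-order autoregressive structure:

library(nlme)

m_ar1 <- lme(score ~ phase,
             random = ~ phase | case,
             correlation = corAR1(form = ~ session | case),  # AR(1) within each case
             data = scd_data)
summary(m_ar1)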

Limitations and Future Research
The focus of the current text is on the quantification and graphical representation of the consistency of effects (i.e., direct replication) within a SCED study. Therefore, the illustrations are presented with verbal descriptions of the model building process. The reader interested in multilevel model building and in a formal presentation of multilevel models within a single study can refer to Baek et al. (2014) and Dedrick et al. (2009).
Provided that the focus is on within-study replication, we did not deal extensively with replication across studies. Although certain uses of SCEDs are not aimed at demonstrating the generality of the intervention effects (Riley-Tillman & Burns, 2009), if the aim is to establish the generalizability of the intervention effects, systematic replications across studies are relevant (Maggin, 2015; Onghena et al., 2018; Tate & Perdices, 2019). Even when generalization is desirable, external validity in the SCED context is not an issue of statistical inference and extrapolation, but rather follows a more inductive approach (Kennedy, 2005). In this approach, the descriptions of participants, interventions, target behaviors, and settings are crucial (Maggin, 2015; Tate et al., 2013), and the amount of generality can be understood as a continuum according to the number of variables (related to participants, target behaviors, and settings) that change across systematic replications across studies (Riley-Tillman & Burns, 2009). In that sense, failing to replicate an effect allows for discovering the limitations of an intervention, which is also useful for prompting further modifications and further research for better understanding why an intervention does or does not work. Finally, building the evidence about generality on the basis of a series of individual studies makes the meta-analyses of SCED studies and multilevel models relevant (Jenson et al., 2007; Onghena et al., 2018).
Regarding the proposal of quantifying the percentage of random effect confidence intervals that include 0, it should be noted that this percentage is not expected to approximate any theoretically desirable quantity. In that sense, we are not quantifying how many of the confidence intervals in different samples or replications include a population parameter, which would be equivalent to studying the coverage of a confidence interval (e.g., Baek et al., 2019; Ferron et al., 2009; Moeyaert et al., 2017), expected to be .95 for a 95% confidence interval. Additionally, what we are proposing is not the same as estimating the capture percentage of an initial confidence interval in reference to the means of subsequent replications, expected to be equal to .83 for a 95% confidence interval (called a "prediction interval for a replication mean" by Cumming, 2012). Therefore, for the percentage of random effects' confidence intervals that include 0, there is no exact cut-off point that suggests sufficient consistency, just like experimental control should be understood as a continuum and not as something that is either present or absent (Horner & Odom, 2014). The 3:1 ratio (Maggin et al., 2013) and the corresponding percentage of 75% are only an indication, not a fixed criterion. Nevertheless, it has been highlighted that statistical thinking is more important than mechanically applying a given ritual (Gigerenzer, 2004).
In terms of the statistical properties of the quantifications proposed, some comments are necessary. There is evidence that the confidence intervals for the variance of the treatment effect (i.e., change in level) present undercoverage (Ferron et al., 2009). However, it is unclear whether this evidence can be extrapolated to the confidence intervals for the random effects (i.e., the confidence intervals for the difference between the individual treatment effects and the fixed effect estimate). Similarly, it is not clear whether the evidence about the confidence intervals for the individual treatment effects (wider intervals, but better coverage, when using the Kenward-Roger estimation of the degrees of freedom; Ferron et al., 2010) can be extrapolated to the confidence intervals for the difference between the individual treatment effects and the fixed effect estimate. Therefore, more research is needed on the latter.

Open Practices Statements
The current text is not based on gathering data (e.g., in the context of an experiment). Therefore, there are no primary data or materials to be made available and there is no empirical study requiring preregistration. Nonetheless, the data used for the illustrations and the R code for constructing the caterpillar plots are available at https://osf.io/p3bna/.