A simulation study on two analytical techniques for alternating treatments designs

Alternating treatments designs (ATDs) are single-case experimental designs entailing the rapid alternation of conditions, with the specific sequence of conditions usually determined at random. The visual analysis of ATD data entails comparing the data paths formed by connecting the measurements from the same condition. Apart from visual analysis, there are at least two quantitative analytical options that also compare data paths. One option is a visual structured criterion (VSC) regarding the number of comparisons for which one condition has to be superior to the other in order to consider that the difference is not due only to random fluctuations. Another option, denoted as ALIV, computes the mean difference between the data paths and uses a randomization test to obtain a p value. In the current study, these two options are compared, along with a binomial test, in the context of simulated data representing ATDs with a maximum of two consecutive administrations of the same condition and a randomized block design. Both VSC and ALIV control Type I error rates, although these are closer to the nominal 5% for ALIV. In contrast, the binomial test is excessively liberal. In terms of statistical power, ALIV plus a randomization test is superior to VSC. We recommend that applied researchers complement visual analysis with the quantification of the mean difference, as per ALIV, and with a p value whenever the alternation sequence was determined at random. We have extended an already existing website providing the graphical representation and the numerical results.


Introduction
Alternating treatments designs (ATDs) are a type of single-case experimental design (SCED) that allows for rapid changes in the conditions in which the target behavior is measured.
According to reviews of published SCED studies, ATDs represent approximately 6% to 16% of the designs used (Hammond & Gast, 2010; Shadish & Sullivan, 2011; Smith, 2012). Moreover, ATDs are similar to the N-of-1 trials used in health research (Carriere, Li, Mitchell, & Senior, 2015), as they include several alternations from an A (baseline) phase to a B (intervention) phase and also in reversed order. The main characteristics of ATDs have been dealt with extensively elsewhere (Barlow & Hayes, 1979; Wolery, Gast, & Ledford, 2014). One of the main features that we will focus on here is the possibility to determine the sequence of conditions at random. In case randomization is used, we could distinguish three kinds of design: (a) ATDs defined as completely randomized designs allow any sequence of conditions (e.g., AAABBBBBAA); (b) ATDs including randomized blocks, sometimes also called randomized block designs, entail choosing at random the first condition in a pair of conditions (e.g., AB-BA-AB-AB-BA); and (c) ATDs including restricted randomization (Onghena & Edgington, 1994), in which the sequence can contain a maximum of two consecutive administrations of the same condition (e.g., AABBABBAAB). For this latter type of ATDs, the restriction mentioned is the most common one (Heyvaert & Onghena, 2014; Kratochwill et al., 2013; Wolery et al., 2014). Regarding the requirements for ATDs, a minimum of five repetitions of the alternating sequence has been suggested (Kratochwill et al., 2010), which entails and is well-aligned with the need for at least five measurement occasions per condition (Kratochwill et al., 2010; Wolery et al., 2014).
Actually, five measurements per condition is a relatively typical situation according to the data obtained in a previously performed review of published ATD literature (Manolov & Onghena, 2017): the median number of measurements per condition was 5.5, whereas the mean was 6.64.
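The three randomization schemes distinguished above can be made concrete by listing their admissible sequences. The following Python sketch is only an illustration under stated assumptions: the function names are hypothetical, two conditions labelled A and B are assumed, and brute-force enumeration is used instead of the SCDA or SCRT software mentioned later in the text.

```python
from itertools import combinations, product


def max_run(seq):
    """Length of the longest run of identical consecutive conditions."""
    longest = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def completely_randomized(n_a, n_b):
    """All orderings of n_a A's and n_b B's, without any restriction."""
    n = n_a + n_b
    sequences = []
    for a_positions in combinations(range(n), n_a):
        chosen = set(a_positions)
        sequences.append("".join("A" if i in chosen else "B" for i in range(n)))
    return sequences


def restricted_randomization(n_a, n_b, limit=2):
    """Orderings with at most `limit` consecutive administrations of a condition."""
    return [s for s in completely_randomized(n_a, n_b) if max_run(s) <= limit]


def randomized_blocks(n_blocks):
    """Sequences built from pairs whose first condition is chosen at random."""
    return ["".join(blocks) for blocks in product(("AB", "BA"), repeat=n_blocks)]


if __name__ == "__main__":
    print(len(completely_randomized(5, 5)))
    print(len(restricted_randomization(5, 5)))
    print(len(randomized_blocks(5)))
```

For five measurements per condition, this listing gives 252 completely randomized sequences, 84 sequences with at most two consecutive administrations of the same condition, and 32 randomized block sequences. Every randomized block sequence also satisfies the restriction of a maximum of two consecutive administrations.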
In terms of data analysis, visual inspection is suggested as a first choice (Barlow, Nock, & Hersen, 2009), but there have also been proposals for using quantifications. Actually, a recent review of published ATD research (Manolov & Onghena, 2017) revealed that most studies compute the difference between the means of the conditions compared, as well as a measure of variability. Additionally, there have been proposals for using an adapted version of Percentage of nonoverlapping data (Wolery et al., 2014) or for using piecewise regression (Moeyaert, Ugille, Ferron, Beretvas, & Van Den Noortgate, 2014) or local regression with nonparametric smoothers (Solmi, Onghena, Salmaso, & Bulté, 2014).
Finally, two more recent proposals focus on the same kind of comparison that is usually carried out in visual analysis, namely, the degree to which the data path¹ for one condition is different from (and superior to) the data path of the other condition (Ledford, Lane, & Severini, 2018). The visual structured criterion (VSC; Lanovaz, Cardinal, & Francis, 2017) assesses whether the number of comparisons for which one condition is superior can be considered to represent more than random fluctuations. In that sense, the comparison performed in VSC is ordinal and the empirically-derived cut-off points presented by Lanovaz, Cardinal, and Francis (2017) could be understood as critical values for identifying statistically significant results. Therefore, the VSC assesses the presence of an effect. In contrast, the comparison involving actual and linearly interpolated values, abbreviated as ALIV (Manolov & Onghena, 2017), assesses the magnitude of effect, by focusing on the average amount of distance between the data paths. Unlike the common difference in means, ALIV uses actually obtained values and interpolated values (located on the data path), providing a difference for each measurement occasion and afterwards computing the average of these differences. Alternatively, ALIV could be understood as a mean difference in which greater weight is assigned to the values from one condition, which are surrounded by more values from the other condition (see Manolov & Onghena, 2017, for more details).
¹ The data path is defined as the line that connects the measurements from the same condition. Therefore, the data path includes both actually obtained values (i.e., the points being connected) and the line that connects the points, thereby interpolating the values for the condition that could have been obtained for a measurement occasion in which the other condition was administered. See Figure 1 introduced later in the text.
For ALIV, the statistical significance of the average difference can be obtained using a randomization test. The valid application of such a test requires that the sequence of conditions is actually determined at random, prior to gathering the data, and that the randomization scheme used in the design is the same as the one used for obtaining the reference (randomization) distribution. This latter condition entails that if an ATD with restricted randomization, with a maximum of two consecutive administrations of the same condition, is used to gather the data, the randomization performed for obtaining the statistical significance should also be the one corresponding to such an ATD and not, for instance, the randomization for an ATD with randomized blocks (see Onghena & Edgington, 1994, 2005, for more details).
Regarding the existing evidence on the performance of ALIV and VSC, the former has not been formally tested yet, whereas the latter was tested in the context of an ATD with systematic alternation of conditions (i.e., ABABABABAB). The main findings of the study on VSC are the control of Type I error rates when there are at least five measurements per condition and an adequate power (i.e., 0.80) for an effect size expressed as a standardized mean difference of 2, even when there are only three measurements per condition.
The main aims of the present research are (a) to extend the amount of evidence available on the VSC with ATDs with restricted randomization (hereinafter, ATD-RR) and ATDs with randomized blocks (ATD-RB); and (b) to obtain initial evidence on the performance of ALIV used together with a randomization test (hereinafter ALIV+RT) for the same kinds of design.
The comparison in the performance is done in terms of Type I error rates (i.e., relative frequency of false alarms: indicating a statistically significant difference when no intervention effect actually exists) and in terms of statistical power (i.e., rate of detection of actually existing effects as statistically significant).

Rationale for the Simulation
Given that the aim is to explore Type I error rates and statistical power, using simulated data was the obvious choice, as it allows knowing (i.e., specifying) whether there is actually an intervention effect or not. Simulation was also used in the initial study on VSC (Lanovaz, Cardinal, & Francis, 2017) and has extensively been used for studying the performance of randomization tests with other test statistics (i.e., mean difference instead of ALIV) across several SCEDs: ABAB (Ferron, Foster-Johnson, & Kromrey, 2003), multiple baseline (Ferron & Ware, 1995), and ATD (Levin, Ferron, & Kratochwill, 2012).

Data Generation Model
The data generation model used was an adaptation of the commonly used model proposed by Huitema and McKean (2000): $y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \beta_3 D_t [T_t - (n_A + 1)] + \varepsilon_t$, where $\beta_0$ is the intercept at the moment prior to the first measurement occasion, $\beta_1$ is the parameter for the general linear trend, $\beta_2$ is the parameter for an average difference in level, $\beta_3$ is the parameter for the difference in slope, $T_t$ is the time variable, taking integer values from 1 to the number of measurement occasions, and $D_t$ is a dummy variable representing the change in level, taking the value of 0 for the A condition and 1 for the B condition. The adaptation consisted in dropping the term for the change in slope and, thus, the model was reduced to $y_t = \beta_0 + \beta_1 T_t + \beta_2 D_t + \varepsilon_t$.
The error term was specified to follow a commonly used (e.g., Levin, Lall, & Kratochwill, 2011) first-order autoregressive process: $\varepsilon_t = \varphi_1 \varepsilon_{t-1} + u_t$, where $\varphi_1$ is the autocorrelation parameter and the term $u_t$ is the random disturbance.
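The reduced model with a first-order autoregressive error can be illustrated with a short Python sketch. It is not the R code used in the study: the function name, the argument names, and the handling of the first error term (set equal to the first random disturbance, without a burn-in period) are assumptions made only for illustration.

```python
import random


def simulate_atd(sequence, beta0=10.0, beta1=0.0, beta2=2.0,
                 phi1=0.0, sigma=1.0, seed=None):
    """Simulate one ATD data series for a sequence of conditions such as 'ABBAAB'.

    Level-change model: intercept + general trend + level shift for B + AR(1) error.
    """
    rng = random.Random(seed)
    series = []
    previous_error = 0.0
    for t, condition in enumerate(sequence, start=1):
        dummy = 1.0 if condition == "B" else 0.0
        disturbance = rng.gauss(0.0, sigma)
        # Assumption: the first error equals the first disturbance (no burn-in).
        error = phi1 * previous_error + disturbance
        series.append(beta0 + beta1 * t + beta2 * dummy + error)
        previous_error = error
    return series


if __name__ == "__main__":
    # Hypothetical example: restricted-randomization sequence, level change of 2,
    # autocorrelation of 0.3.
    print(simulate_atd("AABBABBAAB", beta2=2.0, phi1=0.3, seed=1))
```

Setting `sigma=0.0` removes the random component, which makes it easy to check that the B measurements are shifted by exactly `beta2` relative to the A measurements.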
All the simulations were performed using the R software (https://cran.r-project.org). For obtaining all possible randomizations in a systematic way for ATD-RB and for alternating treatment designs with the same number of measurement occasions for each condition and a restriction of a maximum of two consecutive measurement occasions per condition the SCDA plug-in for R was used (Bulté & Onghena, 2013). For obtaining all possible randomizations in a systematic way for alternating treatment designs with an unequal number of measurement occasions for each condition and a restriction of a maximum of two consecutive measurement occasions per condition the SCRT 1.1 stand-alone software (Onghena & Van Damme, 1994) was used, as available in the CD accompanying the book by Edgington and Onghena (2007).

Simulation Parameters
In determining the simulation parameters, our intention was to match the conditions studied by Lanovaz, Cardinal, and Francis (2017), as the present research is an extension of their study. Nevertheless, there are some differences. First, as stated in the Introduction, we focused on ATD-RR and ATD-RB rather than on systematic alternation. Second, regarding the number of conditions being compared, Lanovaz et al. included scenarios with 2, 3, and 4 conditions, with no major effect on the results. Therefore, in the current study, we only compared two conditions. Third, in terms of series length, Lanovaz et al. included 6 to 24 measurements ($n = 6, 8, \ldots, 24$), with both conditions being equally represented ($n_A = n_B$). In the current study, the minimum number of measurements per condition is 5, given that this is also the minimum required for achieving five repetitions of the alternating sequence, as required by the What Works Clearinghouse Standards (Kratochwill et al., 2013). In that sense, the series lengths included were between 10 and 24 for ATD-RB and between 10 and 22 for ATD-RR, given that for the latter it was not possible to obtain the systematic listing of all possible randomizations for $n_A = n_B = 12$ in several hours.
Regarding the simulation parameters that were established in the same way as in Lanovaz et al. (2010), the average baseline level was set to 10 (i.e., $\beta_0 = 10$), there was no general trend simulated in the data (i.e., $\beta_1 = 0$), and the intervention effect simulated was a change in level ($\beta_2 \neq 0$), not a change in slope ($\beta_3 = 0$). Specifically, regarding the effect size $\beta_2$ simulated, the values used were 1, 2 and 3. These effect sizes are similar to the ones used in a recent simulation on randomization tests (Levin, Ferron, & Gafurov, 2017). Moreover, the effect size values cover a considerable range, because according to Harrington and Velicer (2015) an effect size of 1 would represent a small effect, whereas 2 (i.e., between 1 and 2.5) would be a medium effect and 3 (i.e., above 2.5) would be a large effect. The autocorrelation parameters ($\varphi_1$) were set to range from −0.3 to 0.6 in steps of 0.1, also coinciding with the ones studied by Levin et al. (2012).
Regarding the random disturbance term $u_t$, we specified it to follow a normal distribution with a mean of zero and a standard deviation equal to 1. The simulation study on randomization tests by Michiels, Heyvaert, and Onghena (2017) showed that the differences between independent normal error and an independent uniform error.
In total, 1,000 iterations per experimental condition were carried out, which is similar to previous simulation studies on randomization tests (e.g., Ferron & Ware, 1995) and other analytical techniques for SCED data (e.g., Arnau & Bono, 2004; Beretvas & Chung, 2008). This number of iterations was set in order to make the investigation feasible, because the use of R as statistical platform and our willingness to obtain exact p values through intensive computation (i.e., a randomization test) make the simulation of a single experimental condition rather slow (e.g., taking more than an hour when $n = 14$ measurements).

Data Analysis
Three ways of comparing the conditions in the ATDs were used: VSC, ALIV plus a randomization test, and a binomial test. The comparisons between data paths entail excluding the first and last measurements for which only one data path is present.
Figure 1. Upper panel: data by Yakubova and Bouck (2014), aim to increase the target behavior. Lower panel: data by Eilers and Hayes (2015), aim to decrease the target behavior.
ALIV computes the difference between the two data paths for each measurement occasion. Afterwards, the average of these differences is computed. In order to obtain the statistical significance of the outcome, ALIV is computed for all possible sequences that could have been obtained at random. These ALIV values form the reference (randomization) distribution. The actual outcome is located in the randomization distribution and the p value is the proportion of ALIV values that are as large as or larger than the outcome. That is, a one-tailed test is performed under the assumption that the researcher should know the direction of the difference before carrying out the experiment (Levin et al., 2017).
Type I error rates are estimated as the proportion of statistically significant results across the iterations of conditions with absence of effect ($\beta_2 = 0$). For an adequate control of Type I error rates, our intention was to use Serlin's (2000) criterion for robustness, $\alpha \pm 25\%\,\alpha$, requiring Type I error rates between 0.0375 and 0.0625, which is in between Bradley's (1978) stringent and liberal criteria. Nevertheless, according to Robey and Barcikowski, with 1,000 iterations Bradley's (1978) liberal criterion has to be followed, $\alpha \pm 50\%\,\alpha$, requiring proportions between 0.025 and 0.075. Statistical power is estimated as the same proportion but computed for conditions with effect simulated ($\beta_2 \neq 0$). Power is judged to be appropriate if it is at least 0.80, following Cohen (1992).
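The ALIV computation and the one-tailed randomization test described above can be sketched in Python as follows. This is a minimal illustration and not the authors' R implementation. It assumes two conditions labelled A and B, at least two measurements per condition, an expected superiority of B (differences are computed as B minus A), and a randomized block design for the reference distribution; all function names are hypothetical.

```python
from itertools import product


def interpolate(points, t):
    """Value of the data path through sorted (time, value) points at time t."""
    for (t0, y0), (t1, y1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return y0 + (y1 - y0) * (t - t0) / (t1 - t0)
    raise ValueError("t lies outside the data path")


def aliv(sequence, scores):
    """Mean B-minus-A distance between data paths where both paths exist."""
    path = {c: [(t, y) for t, (s, y) in enumerate(zip(sequence, scores), start=1)
                if s == c] for c in "AB"}
    # Occasions before both paths have started or after one has ended are excluded.
    start = max(path["A"][0][0], path["B"][0][0])
    end = min(path["A"][-1][0], path["B"][-1][0])
    diffs = [interpolate(path["B"], t) - interpolate(path["A"], t)
             for t in range(start, end + 1)]
    if not diffs:
        raise ValueError("the two data paths do not overlap")
    return sum(diffs) / len(diffs)


def randomized_block_sequences(n_blocks):
    """All sequences admissible under a randomized block design."""
    return ["".join(b) for b in product(("AB", "BA"), repeat=n_blocks)]


def randomization_test(sequence, scores, admissible):
    """Observed ALIV and one-tailed p value over the admissible sequences."""
    observed = aliv(sequence, scores)
    reference = [aliv(s, scores) for s in admissible]
    # Small tolerance guards against floating-point ties.
    as_large = sum(1 for value in reference if value >= observed - 1e-12)
    return observed, as_large / len(reference)


if __name__ == "__main__":
    # Hypothetical data for an ATD-RB with five blocks.
    seq = "ABBAABABBA"
    scores = [9, 12, 13, 10, 8, 12, 10, 13, 12, 9]
    print(randomization_test(seq, scores, randomized_block_sequences(5)))
```

The measurements keep their position in time and only the condition labels are reassigned for each admissible sequence, so the list passed as `admissible` has to correspond to the randomization scheme actually used in the design.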

Type I Error Rates
Effect of the number of measurements for independent data. For both ATD-RR and ATD-RB, the Type I error rates are controlled by VSC and ALIV+RT, regardless of the number of measurements (see the left panels of Figure 2). Actually, Type I error rates were controlled for all conditions tested (i.e., when $n \geq 10$). In contrast, the Type I error rates for the binomial test were systematically greater than the upper limit of the liberal criterion, 0.075: the false alarm rates were excessively high. Moreover, for the binomial test, there is an apparent increase of Type I error rates with the number of measurements. Thus, it could be stated that the binomial test applied to the number of comparisons for which the data path of one condition is superior to the data path of the other condition is inappropriate even for independent data.
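For reference, a one-tailed binomial test on the number of comparisons in which one data path is superior can be sketched as below. The exact specification used in the simulation is not restated here, so the null probability of 0.5 and the treatment of the comparisons as independent trials are assumptions of this illustration, and the function name is hypothetical.

```python
from math import comb


def binomial_p_value(n_superior, n_comparisons, p_null=0.5):
    """One-tailed P(X >= n_superior) for X ~ Binomial(n_comparisons, p_null)."""
    return sum(
        comb(n_comparisons, k) * p_null ** k * (1 - p_null) ** (n_comparisons - k)
        for k in range(n_superior, n_comparisons + 1)
    )


if __name__ == "__main__":
    # Hypothetical example: B is superior in 7 of 8 comparisons.
    print(binomial_p_value(7, 8))
```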
Effect of autocorrelation. The presence of autocorrelation does not seem to be related to systematic changes in the Type I error rates for any of the three tests (see the right panels of Figure 2 for an example). Thus, the results commented for independent data are also applicable here.
The graphical illustrations provided focus on the shortest series length, but given that the effect of the number of measurements is only slight, these illustrations provide an appropriate summary of the results. All values obtained can be consulted from https://osf.io/yr8tg/.

Statistical Power
Effect of the number of measurements for independent data. As expected, statistical power increases with the number of measurements (see Figure 3). When $\beta_2 = 1$, power never reaches 0.8 for any of the tests and designs. When $\beta_2 = 2$, a power of 0.8 is reached for all three tests for ATD-RB already for $n = 12$. Also for $\beta_2 = 2$ and $n = 12$, ALIV+RT and the binomial test reach power higher than 0.8 for ATD-RR. When $\beta_2 = 3$, all three tests reach a power of 0.8 for ATD-RB already for $n = 10$. For ATD-RR, ALIV+RT and the binomial test reach this power for $n = 10$ and VSC for $n = 12$.
In general, power is highest for the binomial test (which does not control for Type I error rates), whereas it is lowest for VSC. For VSC, power is clearly lower for ATD-RR than for ATD-RB.

Effect of autocorrelation.
Positive autocorrelation is associated with higher statistical power (see Figures 4 and 5). Considering that positive autocorrelation does not lead to higher Type I error rates, such conditions cannot be labelled as jeopardizing the performance of VSC and ALIV+RT. In contrast, data with negative autocorrelation are unfavorable for these techniques.
However, the average corrected autocorrelation reported for ATDs by Shadish and Sullivan (2011) is approximately zero (−0.01).

Figure 4.
A selection of results, for n=10, for statistical power as a function of the degree of autocorrelation; for alternating treatments designs with restricted randomization (ATD-RR) and with randomized blocks (ATD-RB). Legend: black dotted line: ALIV plus randomization test; dark grey dashed line: binomial test; grey solid line: visual structured criterion (VSC).

Figure 5.
A selection of results, for n=20, for statistical power as a function of the degree of autocorrelation; for alternating treatments designs with restricted randomization (ATD-RR) and with randomized blocks (ATD-RB). Legend: black dotted line: ALIV plus randomization test; dark grey dashed line: binomial test; grey solid line: visual structured criterion (VSC).

Discussion
The present study provides initial simulation evidence on the performance of ALIV (Manolov & Onghena, 2017) plus a randomization test and it provides further evidence on the performance of VSC (Lanovaz, Cardinal, & Francis, 2017) for ATD-RR and ATD-RB. Additionally, the study also extends the evidence available on the performance of randomization tests with ATDs: (a) Levin et al. (2012) studied systematically alternating designs (e.g., ABABABABABAB) with 12 and 24 measurement occasions; (b) Michiels et al. (2017) studied the conditional power (Keller, 2012) for ATD-RR and ATD-RB with 12 to 40 measurement occasions; and both (a) and (b) used the mean difference (not ALIV) as a test statistic. In what follows, we compare the current results with these previous findings.
Lanovaz, Cardinal, and Francis (2017), focusing on systematic ATDs, found that VSC controlled the Type I error rate for $n \geq 10$. Additionally, Type I error rates decreased with autocorrelation (i.e., there was a negative relation). Our findings for VSC applied to ATD-RR and ATD-RB, both with random alternation of conditions, are consistent, in terms of Type I error rates being controlled.
Autocorrelation does not seem to have a clear positive or negative relation with false alarm rates.
Statistical power reached 0.80 for $\beta_2 = 2$, but not for $\beta_2 = 1$, regardless of the number of measurements. Power increased with autocorrelation. Our findings concur with the positive relation between power and autocorrelation, but for VSC a power of 0.80 for $\beta_2 = 2$ is only reached when there are 12 measurements in an ATD-RB and 16 measurements in an ATD-RR.
Thus, the current results concur with previous findings for systematic ATDs that the main strength of VSC is ensuring that false alarm rates are controlled, whereas power may be insufficient in certain conditions. the autocorrelation was positive autocorrelation, whereas for ATD-RB it was always controlled.
Our findings for the randomization test using ALIV as test statistic show that the false alarm rate is always controlled for both the ATD-RR and the ATD-RB. According to Levin et al. (2012), for $\beta_2 = 2$ and $n = 12$, statistical power reached 0.8 for ATD-RB for practically all degrees of autocorrelation tested and it was above 0.8 for the systematic ATD. Our findings for $\beta_2 = 2$ and $n = 12$ using ALIV as test statistic are practically identical to those reported by Levin and colleagues (2012). Therefore, the correspondence between the current findings and previous evidence is high and the conclusions about the performance of the randomization test can be generalized beyond systematic ATDs and beyond the mean difference as a test statistic. Michiels et al. (2017) found that the conditional power for ATD-RR is higher than for ATD-RB and our findings are consistent for ALIV as a test statistic. Moreover, despite some notable differences between the two studies (i.e., test statistic, conditional vs. unconditional power studied, two-sided vs. one-sided alternative hypothesis), both suggest that a medium-sized effect (according to the benchmarks proposed by Harrington & Velicer, 2015) such as 2 standard deviations can be detected as statistically significant with sufficient power for ATD-RR with as few as 12 measurement occasions, whereas small effects such as 1 standard deviation require more than 30 measurement occasions.

Implications for Applied Researchers
The current evidence suggests that VSC with the cut-off values derived for systematic alternation (i.e., ABABABABAB) does not produce excessive false alarm rates even for ATD-RR and ATD-RB. In terms of statistical power, we found that it was higher for ATD-RB than for ATD-RR. We speculate that this finding might be related to the type of ATD for which the VSC were developed (systematic) and the sequences that can be obtained in ATD-RB and ATD-RR. For a systematic ATD, there is necessarily only one administration of the A or B conditions in the beginning and in the end of the sequence (i.e., for n=10, ABABABABAB or BABABABABA).
For an ATD-RB, this is also the case; for instance, for n=10, a sequence such as ABABBABAAB can be obtained, but the sequence AABABABABB cannot be obtained (but it is possible under ATD-RR). Both the systematic sequence ABABABABAB and the randomized block sequence ABABBABAAB (which is also acceptable in an ATD-RR) entail eight comparisons, whereas the ATD-RR sequence AABABABABB would entail only six comparisons. Therefore, for the same number of measurement occasions, for some random sequences but not for all of them, the number of comparisons can be different. This could explain the differences in power, given that when fewer comparisons are performed, the VSC requires a greater percentage of superiority of one condition over the other in order to detect the presence of an intervention effect. Therefore, applied researchers are encouraged to use VSC for either systematic ATDs or ATD-RB.
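The counts of comparisons mentioned above can be reproduced with a few lines of Python. The sketch assumes that a comparison is possible at every measurement occasion lying between the later of the two first measurements and the earlier of the two last measurements (i.e., where both data paths are present); the function name is hypothetical.

```python
from itertools import product


def n_comparisons(sequence):
    """Number of measurement occasions at which both data paths are present."""
    start = max(sequence.index("A"), sequence.index("B"))
    end = min(sequence.rindex("A"), sequence.rindex("B"))
    return end - start + 1


if __name__ == "__main__":
    print(n_comparisons("ABABABABAB"))  # systematic alternation: 8
    print(n_comparisons("ABABBABAAB"))  # randomized blocks: 8
    print(n_comparisons("AABABABABB"))  # restricted randomization only: 6

    # Every randomized block sequence with five blocks yields eight comparisons.
    blocks = ["".join(b) for b in product(("AB", "BA"), repeat=5)]
    print({n_comparisons(s) for s in blocks})
```

Under a randomized block design both conditions are necessarily present within the first two and the last two measurement occasions, which is why the number of comparisons does not vary across admissible sequences, unlike under restricted randomization.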
Overall, the Type I error rates for ALIV plus a randomization test are closer to the nominal value of .05 and this test also presents greater statistical power than the VSC. In order to be able to use the inferential information, we recommend using randomization in the design, ALIV as a descriptive measure of the difference between data paths and a randomization test for estimating the statistical significance of this difference.
Regarding the usefulness of ALIV as a descriptive measure (i.e., an effect size in raw or unstandardized terms), it does not depend on the presence of randomization in the design. Given that a comparison between data paths is performed, ALIV is appropriate for ATDs, but not for multiple-baseline designs or ABAB designs. In relation to adapted ATDs in which nonreversible behaviors are studied, ALIV can be applied to part of the information obtained (e.g., the percentage of steps executed correctly under two conditions compared), but it would not be useful to quantify other critical aspects, such as the rapidity of learning, the extent of maintenance and generalization, or the breadth of learning (Wolery et al., 2014). Thus, we encourage researchers to consider all pieces of evidence and we also echo recent calls for greater prominence of social validity assessment (Snodgrass, Chung, Meadan, & Halle, 2018).
Regarding the inferential use of ALIV, the possibility to obtain a p value via a randomization test applied to the ALIV outcome is not intended to be a substitute for visual analysis. Moreover, statistical significance should not be understood in terms of inference from an individual to a population, but rather in relation to the null hypothesis of no differential effect of the intervention (Edgington, 1967). We rather recommend that visual inspection be used together with the descriptive value of ALIV and the p value. In that sense, we concur with previous recommendations for the joint use of visual and statistical analysis (e.g., Franklin, Gorman, Beasley, & Allison, 1996; Harrington & Velicer, 2015). If all pieces of evidence coincide in indicating that there is a difference between the conditions, an inference of a causal effect of the intervention on the target behavior would be justified in the presence of a random determination of the alternating sequence (Kratochwill & Levin, 2010).
The application of ALIV and the randomization test has been made feasible thanks to the development of a web-based application. Specifically, we have extended the already existing web application for ATD data analysis (https://manolov.shinyapps.io/ATDesign/). In the newly created tab, the statistical significance of ALIV can be obtained for ATD-RR and ATD-RB having 3 or more measurement occasions per condition, although such short data series may be deemed insufficient (Kratochwill et al., 2010; Wolery et al., 2014). Additionally, it should be noted that the validity of the p values obtained is subject to randomization actually being used in the determination of the alternation sequence.
The p values obtained are the result of listing systematically all possible randomizations for ATD-RR with up to 11 measurements per condition and for ATD-RB with up to 12 measurements per condition (i.e., the same cases studied in the present simulation). For longer series, the p values are based on 1,000 randomly selected sequences, which are used for constructing the randomization distribution. In that sense, for these longer series, the p value obtained is not an exact p value, but a p value approximated by Monte Carlo sampling. Using 1,000 random samples for estimating the p value is well-aligned with previous research (Hayes, 1996;Levin et al., 2012;Michiels et al., 2017).
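The Monte Carlo approximation for longer series can be sketched as follows. This is not the code behind the web application: the rejection sampling of sequences, the sampling with replacement, and the use of a simple mean difference as a stand-in for ALIV are all assumptions made to keep the illustration short, and the function names are hypothetical.

```python
import random


def max_run(seq):
    """Length of the longest run of identical consecutive conditions."""
    longest = current = 1
    for prev, cur in zip(seq, seq[1:]):
        current = current + 1 if cur == prev else 1
        longest = max(longest, current)
    return longest


def random_rr_sequence(n_a, n_b, rng, limit=2):
    """Draw one sequence uniformly among those with runs no longer than `limit`."""
    labels = list("A" * n_a + "B" * n_b)
    while True:
        rng.shuffle(labels)
        if max_run(labels) <= limit:  # reject sequences violating the restriction
            return "".join(labels)


def mean_difference(sequence, scores):
    """Stand-in test statistic: mean of B minus mean of A."""
    a = [y for c, y in zip(sequence, scores) if c == "A"]
    b = [y for c, y in zip(sequence, scores) if c == "B"]
    return sum(b) / len(b) - sum(a) / len(a)


def monte_carlo_p_value(sequence, scores, statistic, n_samples=1000, seed=None):
    """Approximate one-tailed p value from randomly selected ATD-RR sequences."""
    rng = random.Random(seed)
    observed = statistic(sequence, scores)
    n_a, n_b = sequence.count("A"), sequence.count("B")
    as_large = 0
    for _ in range(n_samples):
        candidate = random_rr_sequence(n_a, n_b, rng)
        if statistic(candidate, scores) >= observed - 1e-12:
            as_large += 1
    return as_large / n_samples


if __name__ == "__main__":
    # Hypothetical data: 13 measurements per condition.
    seq = "ABBABAABABBAABABABBAABABAB"
    gen = random.Random(3)
    scores = [10 + (2 if c == "B" else 0) + gen.gauss(0, 1) for c in seq]
    print(monte_carlo_p_value(seq, scores, mean_difference, seed=7))
```

Because the reference distribution is built from a random subset of the admissible sequences, repeating the computation with a different seed can yield a slightly different p value, in contrast with the exact p value obtained from the complete listing.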

Implications for Methodologists
We consider that it is important that methodologists offer tools that could be potentially attractive to applied researchers, apart from being methodologically sound. In that sense, ATDs offer a unique opportunity, given that randomization has been shown to be common in these designs (Manolov & Onghena, 2017). Moreover, proposals such as VSC and ALIV are closely related to the graphical representation of the data, which is the basis of visual analysis, because both compare the data paths for the different conditions. Finally, apart from suggesting feasible design options (randomization in determining the condition for each measurement occasion) and an empirically-tested quantification based on the data features that are the object of visual analysis, another relevant ingredient of a potentially attractive analytical procedure is user-friendly free software.
However, it is yet to be verified whether these expected advantages of an analytical proposal such as ALIV plus a randomization test are perceived as such by applied researchers.

Limitations and Future Research
The current paper only focuses on analytical procedures that compare data paths (i.e., combinations of actual and interpolated values). Other analytical options include comparing only actually obtained measurements (e.g., Wolery et al., 2014) or comparing intercepts and slopes, that is, only the estimates of the parameters of the models underlying the actual data (Aerts, 2015).
In terms of the evidence provided here, the simulation conditions included data with no trends and they can be expanded by including trends in the data paths. For instance, further comparisons can be performed for crossing data paths (i.e., an upward trend in one condition, starting from a lower initial level, and a downward trend in the other condition, starting from a higher initial level) and for data paths that are increasingly separated with each successive measurement occasion (i.e., an upward trend in one condition, starting from a higher initial level, and a downward trend in the other condition, starting from a lower initial level).
Moreover, the evidence provided in the current text is solely based on simulated data.
Despite the fact that this is the most common way of assessing Type I error rates and statistical power, obtaining information regarding the former aspect is also possible using real data and extended baselines (Lanovaz, Huxley, & Dufour, 2017).
A different line of research can focus on studying the degree to which applied researchers are open to incorporating p values in their assessment of the difference between conditions in an ATD. They are likely to be fond of descriptive measures such as ALIV, considering the results of the review (Manolov & Onghena, 2017) showing that most published ATD studies incorporate a calculation of a mean difference.