Random assignment of intervention points in two-phase single-case designs: Data-division-specific distributions

The present study explored the statistical properties of a randomization test based on the random assignment of the intervention point in a two-phase (AB) single-case design. The focus is on randomization distributions constructed from the values of the test statistic for all possible random assignments and used to obtain p-values. The shape of those distributions is investigated for each specific data division defined by the moment at which the intervention is introduced. Another aim of the study was to test the detection of nonexistent effects (i.e., the production of false alarms) in autocorrelated data series, in which the assumption of exchangeability between observations may be untenable. In this way, it was possible to compare nominal and empirical Type I error rates to obtain evidence on the statistical validity of the randomization test for each individual data division. The results suggest that, when either of the two phases has considerably fewer measurement times, Type I errors may be too probable and, hence, the decision-making process to be carried out by applied researchers may be jeopardized.

Key words: randomization tests, AB single-case design, random intervention point

Single-case designs are useful in psychological and educational research, as they permit examining the effects of a treatment over time for an individual subject or a group taken as a whole. An important distinction to be made is between single-case designs and case studies in terms of experimental rigor (Backman & Harris, 1999). Regarding data analysis of single-case designs, agreement among researchers has been found to be low (Ferron & Ware, 1995). The main concern commonly arises from the autocorrelated errors that are often assumed to exist in behavioral data. Autocorrelation (also referred to as "serial dependence") concerns the existence of a relationship (i.e., a lack of independence) between measurements sequentially ordered in time. When an applied study involves the repeated measurement of a single experimental unit, it is likely that its behavior at one moment is related to its previous behavior. Although it has been advocated that conventional statistical methods can be properly employed for analyzing single-case design data (Huitema, 1985), empirical evidence suggests that the presence of serial dependence can be problematic for several analytical techniques. As regards visual inspection of graphed data, the most commonly applied method for single-case data analysis (Parker, Cryer, & Byrns, 2006), serial dependence disturbs agreement between statistical and visual inference (Jones, Weinrott, & Vaught, 1978) and increases Type I error rates. In relation to parametric statistical tests, the t-test for level does not perform properly in the presence of serial dependence, as empirical Type I error rates are distorted; similar results have been obtained for ANOVA (Toothaker, Banz, Noble, Camp, & Davis, 1983).
Another strategy for analyzing behavioral data consists in statistically modeling the dependencies in the error structure, but this requires phase lengths that are uncommon in single-case designs (Ferron & Ware, 1995). Permutation or randomization tests have also been proposed as a way of statistically analyzing single-case experiments (Edgington, 1967; Edgington & Onghena, 2007).
These permutation methods require some characteristic of the design to be randomized and a test statistic sensitive to the expected effect of the intervention to be chosen.
Random assignment is an essential condition for a randomization test to meet internal and statistical validity (Edgington, 1980a). After conducting the experiment, the researcher computes the test statistic and determines statistical significance by locating where the obtained test statistic falls within the permutation or randomization distribution. This randomization test allows researchers to test both change in level and change in slope, the permutation procedure being identical apart from the definition of the statistic of interest (Wampold & Furlong, 1981).
Randomization tests are supposed not to make any assumption about the shape of distributions and, as a consequence, have been considered distribution-free (Edgington, 1980a; Marascuilo & Busk, 1988). However, comparing average performance in different experimental conditions can be obstructed by differences in variance (Gorman & Allison, 1997). Moreover, the precision of the results obtained by randomization tests depends on the exchangeability of observations (Good, 1994). That is, data permutations are only suitable when the order of the measurements does not influence the value of the test statistic (Good, 1994; Randles & Wolfe, 1979). In cases where one observation is related to the previous one (i.e., when series are autocorrelated), the exchangeability of data points is dubious, as the sequence in which they are obtained is relevant (Good, 1994).
The exchangeability of observations is important for preventing Type I error rate distortions and thus for ensuring the validity of the randomization test. A statistical test is said to be statistically valid when the probability of committing a Type I error is less than or equal to the nominal alpha set by the applied researcher prior to conducting the experiment (Edgington, 1980a; Hayes, 1996). The need for the exchangeability assumption has been recognized in randomization tests, although it has often been stated as the requirement for independence among data or nonautocorrelated errors (Levin, Marascuilo, & Hubert, 1978; Marascuilo & Busk, 1988). Regarding serial dependence and the statistical validity of randomization tests, it has been stated that these tests overcome autocorrelation problems (Crosbie, 1987; Levin et al., 1978; Wampold & Worsham, 1986). Nevertheless, some preliminary results of simulation studies have shown that randomization tests do not control Type I error rates if data are autocorrelated (Gorman & Allison, 1997). Recently, other simulation studies have found that at least some randomization tests do not control Type I error rates in the presence of serial dependence (Ferron, Foster-Johnson, & Kromrey, 2003; Sierra, Quera, & Solanas, 2000; Sierra, Solanas, & Quera, 2005).
The AB single-case design is the most basic form of single-case phase design (see Bulté & Onghena, 2008, for a discussion of phase and alternation designs). It involves a succession of two experimental conditions: a baseline or control phase (designated A) is followed by a treatment phase (B), which lasts until the end of the study without being withdrawn. An effective treatment implies that the level of behavior during phase B deviates from the projected level of baseline performance (Kazdin, 1978). The fact that there is only one change in the experimental conditions implies that internal validity is not guaranteed; history, maturation, testing, and instrumentation effects are common examples of threats to internal validity. Nevertheless, the AB single-case design is often used in applied research, in both clinical and nonclinical settings, especially for nonreversible behaviors, in spite of its drawbacks. That is why the present study focuses on a randomization test for analyzing the data resulting from the AB single-case design.

Random assignment of an intervention point
Let us take for example a 30-point AB single-case design, in which the time of introduction of the intervention is randomly determined prior to collection of the data (Edgington, 1975). The selection of the intervention point determines the lengths of both phases, assigning the measurement times previous to that point to phase A and the remaining ones to phase B. The random choice of the point of intervention must be restricted to guarantee that neither of the two phases, A and B, has an excessively small number of data points; for instance, Edgington (1980b) suggests a minimum of five measurement times per phase, that is, k = 5. Therefore, considering the series' length (n = 30), the intervention point could be randomly selected from the set of integers ranging from p = 6 to p = 26, p_0 being used to denote the randomly chosen intervention point. Thus, there are 21 possible assignments (denoted by q) of the intervention point; in general, q = n − 2k + 1, which in the example presented equals 30 − 2(5) + 1 = 21. The experimenter could randomly select one of the following bipartitions, where the first and second numbers in each parenthesis respectively correspond to the number of measurements in phases A and B: (5, 25), (6, 24), …, (24, 6), (25, 5). It should be noted that (5, 25) is equivalent to p_0 = 6, (6, 24) to p_0 = 7, and so on, and that any bipartition is equally probable before the intervention point is randomly chosen. After randomly selecting the intervention point p_0, the experiment is carried out. The value of the statistic that is relevant and sensitive to the purpose of the research is first calculated for the observed data, that is, taking into consideration the actually selected intervention point, and the outcome (denoted by d_0) is obtained.
The same test statistic is then computed for all possible random assignments of the point of intervention, which are represented by the remaining 20 (not selected) data bipartitions. The randomization distribution is then constructed by sorting all 21 possible values of the statistic (denoted d_6, d_7, …, d_26) in ascending order by means of the order statistics, so that d_(1) ≤ d_(2) ≤ … ≤ d_(21). The value of the statistic for the data at hand (d_0) is located in the randomization distribution. It has been assumed that the statistical significance associated with the outcome is the proportion of test statistics as large as or larger than the obtained value (Edgington, 1980b; Wampold & Furlong, 1981). At least 20 possible intervention points would be required to allow for the possibility of statistical significance at the .05 level. When q = 21, the minimal possible p-value is 1/21 = .0476.
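To make the procedure concrete, the following minimal Python sketch (with hypothetical function names, not taken from the study) enumerates the q = 21 admissible data divisions for n = 30 and k = 5, computes the mean-difference statistic for each, and returns the proportion of values as large as or larger than the outcome:

```python
def mean_difference(y, p):
    """Statistic 1 for intervention point p (1-indexed): phase A contains
    measurement times 1..p-1, phase B contains the remaining times."""
    a, b = y[:p - 1], y[p - 1:]
    return sum(a) / len(a) - sum(b) / len(b)

def randomization_p_value(y, p0, k=5):
    """p-value of the outcome d0: the proportion of admissible data
    divisions whose statistic is as large as or larger than d0."""
    n = len(y)
    points = range(k + 1, n - k + 2)       # p = 6, ..., 26 when n = 30, k = 5
    d0 = mean_difference(y, p0)
    distribution = [mean_difference(y, p) for p in points]
    return sum(1 for d in distribution if d >= d0) / len(distribution)
```

With a constant series every division yields the same statistic, so the p-value is 1; with a clear level drop exactly at the selected point, the outcome is the largest of the 21 values and the p-value reaches its minimum of 1/21.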
This way of determining the statistical significance of the outcome is founded on the common randomization distribution, a procedure that mixes all possible intervention points to generate the randomization distribution independently of the specific random intervention point that was selected by chance. The abovementioned procedure for obtaining p-values is based on the idea that the randomization distribution follows a discrete uniform or rectangular distribution for all admissible randomly chosen intervention points. Evidence suggests that mixing all possible data divisions does lead to a uniform randomization distribution (Manolov & Solanas, 2008). However, when randomization distributions are investigated for each data division separately, shapes different from the rectangular appear (Manolov & Solanas, 2008; Sierra et al., 2005).
This variation in shape is reflected in disparate Type I error rates. Therefore, the statistical significance of the outcome ought to be determined individually for each specific data division (i.e., using data-division-specific randomization distributions).
The idea underlying the common randomization distribution can be expressed by Equation 1:

Pr(d ≥ d_0) = card{d_p : d_p ≥ d_0} / card{d_p},   (1)

where Pr(·) corresponds to the p-value associated with the outcome, d denotes the test statistic of interest (e.g., the mean difference between phases A and B), and card{·} denotes the number of set elements.
On the other hand, the idea underlying data-division-specific randomization distributions can be expressed by Equation 2:

Pr(d ≥ d_0 │ p_0) = card{d_p : d_p ≥ d_0 │ p_0} / card{d_p │ p_0},   (2)

where the only difference with respect to Equation 1 is that the p-value (Pr) and the number of set elements (card{·}) are conditional on the intervention point, as the term "│p_0" denotes. After randomizing the intervention point, the way in which the specific design will be carried out is completely determined. That is why the proper randomization distribution is the one associated with the specific intervention point that was randomly chosen. The data-division-specific randomization distribution, and not the common randomization distribution, is therefore the appropriate distribution for determining statistical significance (Sierra et al., 2005).
The main aim of the present study was to explore whether the variation of distribution shapes and Type I error rates across data divisions found for ABAB designs with independent data series (Manolov & Solanas, 2008; Sierra et al., 2005) also applies to two-phase designs. The influence of autocorrelation for each specific intervention point was also to be tested, while additional objectives consisted in proposing an explanation of the results and showing their practical importance for applied researchers.

Method
A Monte Carlo simulation was conducted to estimate data-division-specific randomization distributions and to determine the effect of autocorrelation levels on the statistical decision-making process when the method of randomization involves the random assignment of an intervention point within the series of measurement times. The AB single-case design consisted of 30 observations, and at least five observations in each phase were planned, leading to 21 possible data bipartitions (Wampold & Furlong, 1981).
These values are common in randomization test simulations (e.g., Ferron et al., 2003; Ferron & Onghena, 1996; Ferron & Sentovich, 2002; Ferron & Ware, 1995). The program then computed values of the statistic of interest and its randomization distribution. In the data-generation process, NAG mathematical-statistical libraries were used to generate normal random values for the error term of the autoregressive model and to set the initial seeds for data simulation, respectively. Data were generated according to Equation 3:

y_t = φ_1 y_{t−1} + ε_t,   (3)

where y_t and y_{t−1} are data points corresponding to measurement times t and t−1, φ_1 is the first-order autoregressive parameter, and the ε_t are N(0, 1) random variables. For each call to the NAG libraries, 130 data points (ε_t) were generated and the first 100 were discarded to reduce artificial effects, that is, to attenuate as far as possible the effect of anomalous initial values or seeds of the pseudorandom generator and to stabilize the series. The remaining 30 data points were used in the analysis.
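The data-generation step can be sketched as follows; Python's standard random module stands in for the NAG libraries used in the original study, and the function name is illustrative:

```python
import random

def generate_series(phi1, n=30, burn_in=100, rng=None):
    """Generate an AR(1) series y_t = phi1 * y_{t-1} + e_t with e_t ~ N(0, 1),
    following Equation 3. As in the study, burn_in + n values are produced
    and the first burn_in are discarded to let the series stabilize."""
    rng = rng or random.Random()
    y = 0.0
    series = []
    for _ in range(burn_in + n):
        y = phi1 * y + rng.gauss(0.0, 1.0)
        series.append(y)
    return series[burn_in:]
```

Setting phi1 = 0 yields independent (exchangeable) series; positive or negative values of phi1 induce the serial dependence examined later in the study.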
According to Robey and Barcikowski (1992), the number of iterations needed in a simulation to detect deviations from the exact Type I error rates under the strong criterion α ± 1/10 α, a Type I error rate ω = .01, and an a priori power 1 − β = .9, is 29,600. The 40,000 iterations used in the present study amply satisfy this criterion.

Test statistics
Two statistics were computed for each simulated data series. One of them was the difference between the mean for phase A and the mean for phase B, hereafter called Statistic 1. Statistic 2 was computed as presented in Equation 4:

Statistic 2 = (M_A − M_B) / √(s²(1/n_A + 1/n_B)),   (4)

where M_A and M_B are the phase means, and s², n_A, and n_B, respectively, correspond to the pooled estimate of the variance, the number of observations in phase A, and the number of measurement times in phase B. Both statistics were calculated, since empirical Type I error rates could depend upon how the statistic is defined. While Statistic 2 takes into account phase lengths and variability, Statistic 1 does not. Data-division-specific randomization distributions for Statistic 2 might therefore be more similar to the discrete uniform distribution than those for Statistic 1.
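A minimal sketch of the two statistics, assuming that Equation 4 is the usual pooled-variance t-type statistic implied by the description of s², n_A, and n_B (the function names are illustrative):

```python
import math

def statistic_1(a, b):
    """Statistic 1: mean of phase A minus mean of phase B."""
    return sum(a) / len(a) - sum(b) / len(b)

def statistic_2(a, b):
    """Statistic 2: mean difference scaled by the pooled variance
    estimate and the phase lengths, as described for Equation 4."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    s2 = (ss_a + ss_b) / (n_a + n_b - 2)   # pooled variance estimate
    return (mean_a - mean_b) / math.sqrt(s2 * (1 / n_a + 1 / n_b))
```

Unlike Statistic 1, Statistic 2 shrinks when a bipartition is very unequal or the within-phase variability is large, which is why its randomization distribution might lie closer to the discrete uniform.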

Simulation
The steps in the simulation were as follows: data points were generated according to Equation (3) for a given φ_1 and a random intervention point; the outcome was computed for the data series using both Statistic 1 and Statistic 2; the admissible intervention points were permuted and the statistic was computed for each; the values of the statistic were sorted to obtain the exact randomization distribution; and the outcome was located in the randomization distribution and its rank (i.e., an integer between 1 and 21) was obtained.
The abovementioned steps were repeated 40,000 times for each autoregressive parameter value and each possible random intervention point. In total, 147 experimental conditions were investigated, resulting from the combination of 21 possible random intervention points and 7 autocorrelation values. Table 1 shows summary statistics for data-division-specific randomization distributions as a function of the randomly selected intervention point and the test statistic.
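The simulation loop can be compressed into the following illustrative sketch (function name assumed; far fewer iterations than the 40,000 used in the study), which estimates, for one fixed intervention point under the null hypothesis, how often the outcome is the largest of the 21 statistic values, i.e., how often the minimal p-value of 1/21 is reached:

```python
import random

def empirical_type1_rate(phi1, p0, n=30, k=5, iters=2000, seed=0):
    """Estimate how often, under the null hypothesis, the outcome for the
    chosen intervention point p0 reaches the minimal p-value of 1/q."""
    rng = random.Random(seed)
    points = list(range(k + 1, n - k + 2))   # admissible points 6..26 (q = 21)
    hits = 0
    for _ in range(iters):
        # generate an AR(1) series with burn-in, as in Equation 3
        y, series = 0.0, []
        for _ in range(100 + n):
            y = phi1 * y + rng.gauss(0.0, 1.0)
            series.append(y)
        series = series[100:]

        def stat(p):
            # Statistic 1 for the bipartition at point p (phase A: times 1..p-1)
            a, b = series[:p - 1], series[p - 1:]
            return sum(a) / len(a) - sum(b) / len(b)

        d0 = stat(p0)
        # one-tailed p-value: proportion of divisions with a statistic >= d0
        p_value = sum(1 for p in points if stat(p) >= d0) / len(points)
        hits += p_value <= 1 / len(points) + 1e-12
    return hits / iters
```

Running this for each p0 from 6 to 26 and each φ_1 value reproduces, on a small scale, the grid of experimental conditions examined in the study.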

TABLE 1 ABOUT HERE
Since data series were generated for φ 1 = .0, the exchangeability condition was met.
While the mean of the ranks associated with the outcome was close to 11 for all data-division-specific randomization distributions, the variance of those ranks ranged from 27.099 to 55.574 according to the intervention point. The mean of the ranks corresponded to the mathematical expectation in case the data-division-specific randomization distribution follows a discrete uniform distribution. However, the variance expected for that distribution shape, (21² − 1)/12 ≈ 36.667, did not approximate the dispersion values obtained for all data bipartitions. As regards the two test statistics used, the main difference between them is that for some intervention points Statistic 1 presented greater variability, while for others it was Statistic 2. All data-division-specific randomization distributions showed an evident symmetry for both statistics, which is why the mean ranks were close to the mathematical expectation for each random intervention point. The kurtosis for a discrete uniform distribution ranging from 1 to 21 is approximately equal to −1.202, but the simulation study showed that, in general, data-division-specific randomization distributions had different kurtosis values. The two statistics also showed differences in their kurtosis values. Furthermore, considering the empirical Type I error rates, the values corresponding to ranks 1 and 21 did not match 1/21 = .0476 (see Figure 1), which is the expected value for a discrete uniform distribution. Therefore, data-division-specific randomization distributions are not uniformly distributed for independent data series.
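The discrete-uniform benchmarks quoted above can be computed directly from the rank distribution 1 to 21:

```python
# Moments of a discrete uniform distribution over the ranks 1..21
ranks = list(range(1, 22))
q = len(ranks)                                  # 21 possible ranks
mean = sum(ranks) / q                           # expected value: 11.0
var = sum((r - mean) ** 2 for r in ranks) / q   # (q**2 - 1) / 12 = 36.666...
m4 = sum((r - mean) ** 4 for r in ranks) / q
excess_kurtosis = m4 / var ** 2 - 3             # strongly negative (about -1.2)
```

These are the reference values against which the observed rank means, variances, and kurtosis values of the data-division-specific distributions are compared.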

FIGURE 1 ABOUT HERE
In contrast, if no distinction is made regarding the intervention points and the common randomization distribution is considered, all summary statistics resemble what is expected for a discrete uniform distribution (see Table 1).

FIGURES 2 AND 3 ABOUT HERE
The effect of autocorrelation

The results described above suggest that empirical Type I error rates are equal to or lower than nominal ones (i.e., statistical validity is ensured) for the majority of data divisions, namely, when the intervention point is between 9 and 23, both inclusive. For those cases, it was important to know whether the presence of autocorrelation in the data (i.e., the violation of the assumption of exchangeability of observations) distorted the false alarm rates. Table 2 shows that positive serial dependence can lead to underestimation or overestimation of Type I error rates in comparison to independent data series, according to the data division. For an applied researcher, this would imply an increased probability of omitting an effective intervention or of a false alarm, respectively. Nonetheless, the effect of autocorrelation was only slight for the random intervention points ranging from 9 to 23, for which the randomization test is statistically valid. In the case of negative serial dependence (see Table 3), the results were similar to those found for positively autocorrelated data series. It should be noted that if the empirical Type I error rate is estimated regardless of the random intervention point, its value practically matches .0476, the value expected for a discrete uniform distribution with a total of 21 possible values.

Discussion
The results of the present research suggest that applied researchers should be cautious when using the random intervention point randomization test studied here.
Psychologists ought to know that if the randomly chosen data division contains 7 or fewer measurement times in either of the phases, there is a high risk of labeling an ineffective treatment as effective. Therefore, in order to enhance the accuracy of the decision-making process, applied researchers should be cautious if the selected intervention point is not between the 9th and the 23rd observation. There are two reasons for accepting only integers in the interval 9-23. First, if α is set equal to .05, the statistical test is valid.
Second, although the exchangeability assumption was violated in several experimental conditions of the simulation study, the randomization test is relatively robust for random intervention points between the 9th and 23rd measurement. Also, note that this randomization test has zero power at α = .05 if p_0 equals 6, 7, 8, 24, 25, or 26. If the random intervention point equals one of those values, the statistical decision-making process should not be conducted and only descriptive statistical analysis should be carried out.
The rationale for the abovementioned recommendations can be found in the shape of the randomization distribution, which is used to obtain the p-value of the observed test statistic. It is often supposed that the statistic of interest follows a discrete uniform distribution when randomization tests are used to analyze the data resulting from single-case experiments. For example, if the number of possible random intervention points in an AB single-case design is equal to q, it is generally assumed that the minimal significance value equals 1/q (Edgington & Onghena, 2007). The present simulation study showed that this assumption is not met if data-division-specific randomization distributions are taken into account for obtaining statistical significance. It would be suitable if the standard errors of the statistic were identical for each intervention point, but this does not hold for all random intervention points. The results of the present simulation suggest that, under the null hypothesis and for independent series, the minimal significance value does not equal 1/q if data-division-specific randomization distributions are considered. In other words, the shape of the distribution of the statistic depends upon the random intervention point chosen, as the variance and kurtosis values showed. All data-division-specific randomization distributions were symmetrical, with the mean rank equal to the mathematical expectation, but the kurtosis values depended upon the random intervention point. That is, the randomization distribution of the statistic was conditional on the random intervention point.
The question remains of why the data-division-specific randomization distribution does not, in general, follow a discrete uniform distribution in the randomization test studied. Suppose that the random intervention point was chosen and the outcome was computed. It should be noted that in most cases the data-division-specific randomization distribution of the statistic will be generated by bipartitions of data that vary in size.
Therefore, data-division-specific randomization distributions are composed of mixed phase lengths, and the standard errors of the statistic differ across permutations. Thus, given that the null hypothesis is true, large departures of the statistic value from zero are likely to occur in permutations based on clearly unequal group sizes. The present simulation verified that the variance of the rank associated with the statistic value was larger for clearly unequal bipartition sizes than for approximately equal bipartition sizes. Although the data-division-specific randomization distributions are symmetrical, mass moved from the center of the distribution to the tails as the bipartitions of the data became more unequal. If the common randomization distribution is considered, the results concur with those of other simulation studies in which the common randomization distribution was analyzed instead of data-division-specific randomization distributions (Ferron & Ware, 1995).
The common randomization distribution suppresses the marked deviations from the discrete uniform distribution that can be clearly identified in data-division-specific randomization distributions. The main reason for this fact is the differential kurtosis in data-division-specific randomization distributions. If the random starting point divides data into two markedly different series lengths, the distribution of the statistic becomes more platykurtic than the discrete uniform distribution. When the phase lengths are approximately equal, the data-division-specific randomization distribution is less platykurtic than the discrete uniform distribution.
The conclusions of the present study are restricted by the experimental conditions explored, and generalization beyond them is not suggested. An AB design composed of 30 observations was considered because, with the intervention point constrained to ensure at least five observations in phases A and B, it provides 21 possible random intervention points, enough to reach a statistical significance value less than or equal to .05.
Future research could be directed towards studying whether the present results hold for longer data series and towards analyzing the power of this randomization test. In any case, the present simulation suggests that data-division-specific randomization distributions should be analyzed when the validity and power of randomization tests are studied.