Carryover negligibility and relevance in bioequivalence studies

The carryover effect is a recurring issue in the pharmaceutical field. It may strongly influence the final outcome of an average bioequivalence study. Testing a null hypothesis of zero carryover is useless: not rejecting it does not guarantee the non‐existence of carryover, and rejecting it is not informative of the true degree of carryover and its influence on the validity of the final outcome of the bioequivalence study. We propose a more consistent approach: even if some carryover is present, is it enough to seriously distort the study conclusions or is it negligible? This is the central aim of this paper, which focuses on average bioequivalence studies based on 2 × 2 crossover designs and on the main problem associated with carryover: type I error inflation. We propose an equivalence testing approach to these questions and suggest reasonable negligibility or relevance limits for carryover. Finally, we illustrate this approach on some real datasets. Copyright © 2015 John Wiley & Sons, Ltd.


INTRODUCTION
Average bioequivalence (ABE) studies are performed to demonstrate that the ratio of geometric mean bioavailabilities (BA) of a brand or reference (R) drug and a generic or test (T) drug lies within pre-specified limits of equivalence. In the original scale of measurements, these limits are typically 0.80 and 1.25 [1]. Bioavailability is measured in terms of specific variables like 'area under the curve until time t', $AUC_{0-t}$, or maximum concentration, Cmax. Normally, a logarithmic transformation of the data is recommended. In the transformed scale, these limits become ±0.2231, and the difference of mean log-bioavailabilities, the formulation effect, must lie between them. Most regulatory agencies recommend that ABE studies be based on a 2 × 2, RT/TR, crossover design (two treatments, two periods and two sequences) and that inference be based on the two one-sided tests (TOST) procedure. The level-$\alpha$ TOST is operationally equivalent to the interval inclusion principle, that is, to declaring ABE if the usual parametric normal $1 - 2\alpha$ 'shortest' confidence interval for the formulation effect lies within the bioequivalence limits.
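As a minimal illustration of the interval inclusion principle just described, the following Python sketch converts the 0.80/1.25 limits to the logarithmic scale and checks whether a given confidence interval falls inside them. The function name `declares_abe` and the example intervals are hypothetical and chosen only for illustration.

```python
import math

# Bioequivalence limits in the original (ratio) scale, as quoted in the text.
LOWER_RATIO, UPPER_RATIO = 0.80, 1.25

# The same limits after the logarithmic transformation (about -0.2231 and +0.2231).
LOG_LIMITS = (math.log(LOWER_RATIO), math.log(UPPER_RATIO))


def declares_abe(ci_lower, ci_upper, limits=LOG_LIMITS):
    """Interval inclusion rule: ABE is declared only if the whole
    1 - 2*alpha confidence interval lies strictly within the limits."""
    return limits[0] < ci_lower and ci_upper < limits[1]


if __name__ == "__main__":
    print(LOG_LIMITS)
    print(declares_abe(-0.10, 0.15))   # interval inside the limits
    print(declares_abe(-0.05, 0.25))   # upper end exceeds the upper limit
```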
Crossover designs allow within-subject comparison, but, as each subject receives a sequence of treatments, a carryover (or residual) effect may occur in the second (and any subsequent) administration period of the assay [1]. One of the assumptions underlying the standard ABE methods based on crossover trials is that carryover effects are absent [1]. In theory, we can avoid, minimise or rule out these effects if there is a presumed sufficient washout time between drug administrations. It is recommended that washout periods exceed five drug elimination half-lives [1,2].
Given the possibility of disturbing carryover effects, Grizzle [3] proposed a two-stage procedure for the analysis of data from 2 × 2 crossover studies. The first stage tests the null hypothesis of non-existence of carryover at a significance level of $\alpha = 0.1$, or even 0.15, to ensure there is enough power. In case of non-rejection of the null hypothesis, he recommended proceeding with the standard analysis under no carryover. Otherwise, the recommendation was to use only the data from the first period, like data obtained in a fully randomised parallel trial. This strategy has been recommended in the past by the Food and Drug Administration [1] and is widely used in practice despite much criticism [4-8]. The two-stage procedure is not mentioned in recent regulations (e.g. [2]).
Opponents of the two-stage procedure state that the best policy is not to test for carryover beforehand (or not to use this test as a basis for any further decisions on the analysis course) and to proceed as if it were absent. In well-performed experiments, carryover will commonly be absent, as the washout will normally succeed in eliminating it. This opinion seems to be confirmed by D'Angelo et al. [9] in their review of 324 two-way and 96 three-way crossover studies. Only a small proportion of these studies, compatible with the common significance level at which they were performed, resulted in a significant carryover. Moreover, for the subset of studies reporting the p-value, its empirical distribution was very close to the uniform. With these data, this distributional null hypothesis is never rejected by the Kolmogorov-Smirnov (KS) test [8]. These results are contested in [10] and [11], with simulations that suggest the lack of power of these KS tests. Senn et al. in [12] rebut these arguments, arguing the irrelevance of power calculations to interpret observational data. However, a presumed proper washout time does not always guarantee that carryover effects are removed, as is suggested, for example, in [13] and [14] (contested by Bolton [15]). Mills et al. in [16] review the methodological aspects of 116 crossover studies and conclude that carryover may likely be present in some of them. Their arguments mainly concern the design, including the lack of washout, and not the outcome of a carryover significance test. In 71% of papers, the possibility of carryover is not taken into consideration in the methods section. Similar conclusions are reported in [17].
In recent years, a growing body of pharmacogenetics evidence also suggests that avoiding carryover in bioequivalence studies may pose problems. Peiró et al. in [18] identify a single nucleotide polymorphism associated with cytochrome P-450 (CYP2C9*3), directly related to the pharmacokinetics of Tenoxicam. It may affect a bioequivalence study if, by chance, different proportions of each genotype are assigned to each sequence, as it is related to low drug clearance and high $AUC_{0-\infty}$ and $t_{1/2}$ (half-life) values. The study was conducted in 18 healthy volunteers. A detectable plasma drug concentration before the second administration (and after a presumed adequate washout period of 21 days) was observed in five volunteers. This situation could strongly influence the existence of carryover. Bioequivalence is declared when all volunteers are considered, but no bioequivalence is declared if only the volunteers with a particular variant of the polymorphism (CYP2C9) are considered. Wu et al. in [19] describe three different types of pharmacokinetic behaviour related to individual genotypes, the so-called extensive, high and early metabolisers. The previously mentioned results seem to reinforce the experimental grounds of the simulation studies in [20], where differences in pharmacokinetic behaviour between individuals may induce some carryover. It seems unquestionable that the genetic characteristics associated with the metabolising ability (high, medium or slow) of the volunteers in a bioequivalence study directly affect the concentration of a drug in the second period, and that, despite a presumed adequate washout period, in some cases, a percentage of the drug is left over from the first period.
Carryover considerations aside, in more general statistical terms, any pre-testing strategy like Grizzle's two-stage procedure should be avoided, as it leads to invalid tests, which do not respect the nominal global test size [6,7]. On the other hand, if used as a complementary diagnostic instead of a pre-test, it provides some insights into possible carryover, which seem desirable in any crossover study. But testing a null hypothesis of zero carryover is useless: not rejecting it does not guarantee the non-existence of carryover, and rejecting it is not informative of the true degree of carryover and its influence on the validity of the main conclusions of the study, for example, to conclude bioequivalence (or not). In other words, statistical significance is not synonymous with relevance.
A more reliable approach would be equivalence testing: even if some carryover is present, is it enough to seriously distort the study conclusions or is it negligible? This is the point of view taken in this paper, with average bioequivalence studies based on 2 × 2 crossover designs as the main goal.
In the next section, we summarize some results and notation. In section 3, an approach for establishing the equivalence or negligibility limits (and their complementary relevance limits) for carryover in ABE studies is proposed. Section 4 introduces an equivalence testing procedure based on these limits. Section 5 is devoted to some illustrative examples. The paper concludes with a short discussion and some conclusions.

BASIC RESULTS AND NOTATION
In a 2 × 2 crossover design, each experimental subject receives a single dose of both formulations, R and T, in only one of two possible orders or treatment sequences, RT or TR. A sample of $N = n_1 + n_2$ subjects is randomly allocated, $n_1$ to sequence RT and $n_2$ to sequence TR. For a given variable $Y$ in the logarithmic scale, say, $Y = \log C_{max}$ or $Y = \log AUC_{0-t}$, $Y_{ijk}$ will designate an observation made on the $i$th individual, in the $j$th period and the $k$th sequence, $i = 1, \ldots, n_k$, $j = 1, 2$ and $k = 1, 2$.
We consider the following underlying linear model:
$$Y_{ijk} = \mu + S_{i(k)} + P_j + F_{(j,k)} + C_{(j-1,k)} + e_{ijk}, \tag{1}$$
where $\mu$ is a global mean, $P_j$ is the fixed effect of the administration period $j$, $F_{(j,k)}$ is the fixed effect of the formulation administered in the $k$th sequence and $j$th period, and $C_{(j-1,k)}$ corresponds to the fixed effect of carryover. The possible carryover effect of the reference formulation from the first period to the second period in sequence 1 is denoted by $C_R$, while the equivalent effect of the test formulation in sequence 2 is denoted by $C_T$. Therefore,
$$C_{(j-1,k)} = \begin{cases} C_R = -C & \text{if } j = 2,\ k = 1, \\ C_T = C & \text{if } j = 2,\ k = 2, \\ 0 & \text{if } j = 1, \end{cases}$$
with $-F_R = F_T = F$ and $-P_1 = P_2 = P$, as we consider $\sum_{j=1}^{2} P_j = 0$. We will designate the formulation effect as $\phi = F_T - F_R = 2F$, the carryover effect as $\kappa = C_T - C_R = 2C$ and the period effect as $\pi = P_2 - P_1$. $S_{i(k)} \sim N(0, \sigma_S^2)$ represents the random effect of the $i$th subject nested in the $k$th sequence; $\sigma_S^2$ is the inter-subject variance. $e_{ijk} \sim N(0, \sigma^2)$ is the random error, residual or disturbance term. Additionally, we assume independence between all $S_{i(k)}$, between all $e_{ijk}$, and mutual independence between the $\{S_{i(k)}\}$ and the $\{e_{ijk}\}$.
For simplicity, we assume constant residual (or within-subject or intrasubject) variance, $\sigma^2$.
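The inference on the formulation effect is based on the period difference contrasts for each subject $i$ within each sequence $k$, $d_{ik} = 0.5\,(Y_{i2k} - Y_{i1k})$. Their expectation and variance are
$$E(d_{i1}) = \tfrac{1}{2}(\pi + \phi + C_R), \qquad E(d_{i2}) = \tfrac{1}{2}(\pi - \phi + C_T), \qquad \mathrm{var}(d_{ik}) = \sigma_d^2 = \sigma^2/2.$$
If $\bar{d}_{\cdot k} = (1/n_k)\sum_{i=1}^{n_k} d_{ik}$ are the sample means of the period differences, then $\bar{D} = \bar{d}_{\cdot 1} - \bar{d}_{\cdot 2}$, whose expectation is $\phi - \kappa/2$, is an unbiased estimate of the formulation effect $\phi$, provided that no carryover is present, that is, if $\kappa = 0$. The variance of the semidifference contrasts $d_{ik}$ may be estimated as
$$\hat{\sigma}_d^2 = \frac{1}{N-2}\sum_{k=1}^{2}\sum_{i=1}^{n_k}\left(d_{ik} - \bar{d}_{\cdot k}\right)^2,$$
and then the standard error of $\bar{D}$ can be independently estimated by $\hat{\sigma}_d\sqrt{1/n_1 + 1/n_2}$. According to the confidence interval inclusion principle, ABE is declared if the $1 - 2\alpha$ 'shortest' confidence interval
$$\bar{D} \pm t_{1-\alpha,\,N-2}\;\hat{\sigma}_d\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \tag{7}$$
lies within the bioequivalence limits. In (7), $t_{1-\alpha,\,N-2}$ corresponds to the $1 - \alpha$ quantile of a Student's t distribution with $N - 2$ degrees of freedom.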
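The computation of the confidence interval for the formulation effect from the period semidifferences can be sketched as follows. This is a minimal Python illustration under the standard pooled-variance assumptions; the function name `formulation_ci`, the toy data and the convention of passing the Student's t quantile as an argument (to avoid depending on any particular statistics library) are all choices made here, not part of the original analysis.

```python
import math
from statistics import mean


def formulation_ci(d_seq1, d_seq2, t_quantile):
    """Point estimate, standard error and 1 - 2*alpha confidence interval
    for the formulation effect, from the semidifferences
    d_ik = 0.5 * (Y_i2k - Y_i1k) of sequences RT (d_seq1) and TR (d_seq2).

    t_quantile is the 1 - alpha quantile of Student's t with N - 2 df.
    """
    n1, n2 = len(d_seq1), len(d_seq2)
    m1, m2 = mean(d_seq1), mean(d_seq2)
    d_bar = m1 - m2  # unbiased for the formulation effect only if carryover is zero
    ss = sum((d - m1) ** 2 for d in d_seq1) + sum((d - m2) ** 2 for d in d_seq2)
    var_d = ss / (n1 + n2 - 2)                      # pooled variance of the d_ik
    se = math.sqrt(var_d * (1.0 / n1 + 1.0 / n2))   # standard error of d_bar
    return d_bar, se, (d_bar - t_quantile * se, d_bar + t_quantile * se)


if __name__ == "__main__":
    # Toy data (hypothetical); 2.1318 is the 0.95 quantile of t with 4 df.
    print(formulation_ci([0.1, 0.2, 0.3], [0.0, 0.1, 0.2], 2.1318))
```

The resulting interval would then be compared with the ±0.2231 limits on the logarithmic scale.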
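While inference on the formulation effect is typically based on the difference contrasts, the inference on the carryover may be based on the sums of observations within each subject along all periods. Using the common 'dot' notation, writing $Y_{i\cdot k} = Y_{i1k} + Y_{i2k}$, we have $E(Y_{i\cdot 1}) = 2\mu + C_R$, $E(Y_{i\cdot 2}) = 2\mu + C_T$ and $\mathrm{var}(Y_{i\cdot k}) = \sigma_C^2 = 4\sigma_S^2 + 2\sigma^2$, a variance that may be estimated as
$$\hat{\sigma}_C^2 = \frac{1}{N-2}\sum_{k=1}^{2}\sum_{i=1}^{n_k}\left(Y_{i\cdot k} - \bar{Y}_{\cdot\cdot k}\right)^2.$$
From the previously presented results, the usual estimator of the carryover effect may be expressed as
$$\hat{\kappa} = \bar{Y}_{\cdot\cdot 2} - \bar{Y}_{\cdot\cdot 1}. \tag{11}$$
For a more in-depth introduction to these matters, see, for example, [21] or [22].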
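The estimation of carryover from within-subject totals can be sketched in Python as follows. The sketch assumes the usual sum-to-zero constraints, under which the difference between the sequence means of the subject totals estimates $C_T - C_R$; the function name `carryover_estimate` and the toy data are hypothetical.

```python
import math
from statistics import mean


def carryover_estimate(y_seq1, y_seq2):
    """Carryover estimate from subject totals Y_i1k + Y_i2k.

    y_seq1, y_seq2: lists of (period 1, period 2) log-observations for
    sequences RT and TR. Returns the estimate (mean total of TR minus mean
    total of RT), the pooled variance of the totals and the standard error.
    """
    totals1 = [p1 + p2 for p1, p2 in y_seq1]
    totals2 = [p1 + p2 for p1, p2 in y_seq2]
    n1, n2 = len(totals1), len(totals2)
    m1, m2 = mean(totals1), mean(totals2)
    kappa_hat = m2 - m1
    ss = sum((t - m1) ** 2 for t in totals1) + sum((t - m2) ** 2 for t in totals2)
    var_c = ss / (n1 + n2 - 2)   # estimates 4*sigma_S^2 + 2*sigma^2
    se = math.sqrt(var_c * (1.0 / n1 + 1.0 / n2))
    return kappa_hat, var_c, se


if __name__ == "__main__":
    rt = [(1.0, 1.2), (0.9, 1.1), (1.1, 1.3)]   # toy data, sequence RT
    tr = [(1.2, 1.3), (1.0, 1.1), (1.4, 1.5)]   # toy data, sequence TR
    print(carryover_estimate(rt, tr))
```

Because the variance of the totals includes four times the inter-subject variance, the standard error of this estimate is typically much larger than that of the formulation effect.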

ESTABLISHING CARRYOVER NEGLIGIBILITY (OR RELEVANCE) LIMITS
The numerical specification of the equivalence limits depends on each field of application, for example, as a consensus among experts in the field. This is the origin of the 0.80/1.25 or ±0.2231 limits used in ABE [1]. Many studies on the impact of carryover in crossover assays refer to the case where the end goal of these assays is establishing difference and the main magnitude under consideration is the test power. For example, in this context, Willan and Pater [23] obtained a threshold for the relative carryover, $\kappa/\phi$, in order to determine which strategy (either analysing the full set of data or only data from the first period) is better in terms of power.
In a previous paper [24], Sanchez et al. established that the most disturbing effect of carryover in bioequivalence studies is the considerable increase in the probability of type I error or consumer risk, that is, of inappropriately declaring bioequivalence. This inflation occurs when the carryover effect and the formulation effect have the same sign (and then the relative carryover $\kappa/\phi$ is positive), in accordance with the fact that the expectation of the usual estimator of the formulation effect is $\phi - \kappa/2$. Then, in a scenario of true non-bioequivalence (e.g. positive $\phi$, to the right of the upper bioequivalence limit), if the true carryover effect has the same sign as the formulation effect (e.g. it is positive), the estimated values of the formulation effect will more frequently tend to be within the bioequivalence limits (e.g. left-deviated with respect to $\phi$). On the other hand, when the carryover effect and the formulation effect have different signs, the size of the usual bioequivalence test is only slightly reduced. Thus, it seems appropriate to establish carryover negligibility limits in terms of its tolerable impact on the true test size, say $\alpha^*$. With a fixed nominal ABE significance level $\alpha$ (e.g. the usual 0.05), our proposed strategy will be to determine the maximum tolerable value of $\alpha^*$ over $\alpha$ (e.g. two times $\alpha$) and then to determine the level of carryover at which this level of true type I error is reached. In Appendix I (available online as Supporting Information), we conclude that the crucial parameter in establishing carryover negligibility should be based on the scaled carryover, $\kappa/\sigma$. Specifically, we recommend the parameter $\theta$ defined as
$$\theta = \frac{\kappa}{\sigma}\sqrt{\frac{n_1 n_2}{2N}}. \tag{13}$$
A good, simple approximation to the negligibility limit in terms of this parameter is (see Figure 1)
$$\theta_0(\alpha^*) \approx z_{1-\alpha} + \Phi^{-1}(\alpha^*), \tag{14}$$
where $\Phi$ corresponds to the N(0,1) distribution function and $z_{1-\alpha}$ to its $1 - \alpha$ quantile.
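A large-sample reasoning of the kind that leads to such a limit can be sketched numerically. Assume (this is an assumption of the sketch, not a statement taken from Appendix I) that, at the bioequivalence boundary, the usual estimator of the formulation effect is normal with mean shifted by $\kappa/2$ and known standard error $\sigma\sqrt{N/(2 n_1 n_2)}$; the probability of declaring ABE is then approximately $\Phi(\theta - z_{1-\alpha})$, and inverting this expression gives the limit associated with a tolerated size $\alpha^*$. The function names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def approx_true_size(theta, alpha=0.05):
    """Approximate true type I error of the ABE test at the bioequivalence
    boundary when the scaled carryover parameter equals theta
    (normal approximation, assumed for this sketch)."""
    return STD_NORMAL.cdf(theta - STD_NORMAL.inv_cdf(1.0 - alpha))


def approx_theta0(alpha_star, alpha=0.05):
    """Value of theta at which the approximate true size reaches alpha_star."""
    return STD_NORMAL.inv_cdf(1.0 - alpha) + STD_NORMAL.inv_cdf(alpha_star)


if __name__ == "__main__":
    for alpha_star in (0.06, 0.10, 0.15, 0.50):
        print(alpha_star, round(approx_theta0(alpha_star), 4))
```

Under this approximation, $\alpha^* = 0.50$ gives a limit of about 1.645, slightly below the values 1.6889 and 1.6977 quoted in the examples, which is in line with the remark that the more exact computation yields slightly more permissive limits.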
On the other hand, this negligibility limit may be computed more exactly (resulting in slightly more permissive negligibility limits), without a great deal of computational effort. In any case, irrespective of the origin of the limits, the carryover negligibility problem should be stated as an equivalence problem:
$$H_0: |\theta| \ge \theta_0 \quad \text{vs.} \quad H_1: |\theta| < \theta_0. \tag{15}$$
Alternatively, a carryover relevance test to prove the existence of a very disturbing level of carryover (beyond a given threshold $\theta_0$ associated with a given unacceptable level of consumer risk, $\alpha^*$) should be stated as the complementary problem:
$$H_0: |\theta| \le \theta_0 \quad \text{vs.} \quad H_1: |\theta| > \theta_0. \tag{16}$$
Note that greater sample sizes and/or lesser residual variabilities will tend to make $\theta = (\kappa/\sigma)\sqrt{n_1 n_2/(2N)}$ greater. In other words (and perhaps counter-intuitively at first sight), the same level of carryover will affect type I error to a greater extent than with smaller sample sizes and/or greater variability. This tendency and the validity of the aforementioned limits were confirmed in the simulations in [24] and in the simulations presented in the succeeding discussions. Note also that Wellek's test of carryover negligibility ([25], p. 284) is not directly applicable to (15), as its scaling variance is $\sigma_C^2 = 4\sigma_S^2 + 2\sigma^2$, while the scaling considered here is based on the residual variance $\sigma^2$.
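The dependence of the scaled parameter on sample size can be made concrete with a short Python sketch of the definition of $\theta$ given above. The function name `scaled_carryover_theta` and the choice of sample sizes in the loop are illustrative only.

```python
import math


def scaled_carryover_theta(kappa, sigma, n1, n2):
    """theta = (kappa / sigma) * sqrt(n1 * n2 / (2 * N)), with N = n1 + n2."""
    n_total = n1 + n2
    return (kappa / sigma) * math.sqrt(n1 * n2 / (2.0 * n_total))


if __name__ == "__main__":
    # Same scaled carryover kappa / sigma, increasing balanced sample sizes.
    for n in (6, 14, 28):
        print(n, round(scaled_carryover_theta(6.4878, 1.0, n, n), 4))
```

For a balanced design the factor reduces to $\sqrt{n}/2$, so quadrupling the number of subjects per sequence doubles $\theta$ for the same $\kappa/\sigma$. With $\hat{\kappa}/\hat{\sigma} = 6.4878$ and 14 subjects per sequence, the sketch gives about 12.14, the value quoted for the first example.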

Carryover negligibility
The testing problem (15) for carryover negligibility may be rewritten as
$$H_0: \eta \ge 0 \quad \text{vs.} \quad H_1: \eta < 0, \qquad \eta = \kappa^2 - \frac{2N}{n_1 n_2}\,\theta_0^2\,\sigma^2. \tag{17}$$
Note in advance that there may be some confusion because we are concerned with three 'alpha' values: the nominal significance level $\alpha$ of the BE test, the limit of permissibility for its true BE test size, $\alpha^*$, and the significance level at which we are testing if carryover is negligible, test (17). From now on, this last significance level will be designated as $\alpha_0$. Let $U_\eta$ be the upper limit of a one-sided $1 - \alpha_0$ confidence interval for $\eta$, so that negligibility is declared when $U_\eta < 0$. This upper limit may be derived using Howe's method [26], adapted to a bioequivalence context in [27] and [28]. For a linear combination of parameters $\sum_j c_j\theta_j$, like $\eta$ with $c_1 = 1$, $\theta_1 = \kappa^2$, $c_2 = -(2N/(n_1 n_2))\,\theta_0^2$ and $\theta_2 = \sigma^2$, let $E_j$ be independent point estimators for each summand $c_j\theta_j$ and $U_j$ be the corresponding upper limits of $1 - \alpha_0$ one-sided confidence intervals for $c_j\theta_j$. Then
$$U_\eta = \sum_j E_j + \sqrt{\sum_j \left(U_j - E_j\right)^2} \tag{18}$$
is the upper limit of an approximate $1 - \alpha_0$ one-sided confidence interval for $\eta$. Unfortunately, the variance of the usual estimator of $\kappa$ is governed by $\sigma_C^2$, which depends on the intrasubject and intersubject variation and is usually large. The variance of $\hat{\kappa}/\hat{\sigma}$ is even larger because of the random denominator. As a consequence, the test for (17) based on (18) tends to be biased for the most reasonable $\alpha^*$ values, like 0.06 (a 20% increase over 0.05), 0.1 or 0.15 for a nominal $\alpha = 0.05$, even using 'permissive' values $\alpha_0 = 0.10$ or 0.15 along the lines suggested by Grizzle in [3]. Its power properties improve for more extreme values like $\alpha^* = 0.50$, but the statement that carryover 'is negligible' because 'the risk of inadequately declaring ABE is not over 0.50' lacks any interest.
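The combination rule attributed to Howe can be sketched generically in Python: given point estimates of each summand and the corresponding one-sided upper limits, the approximate upper limit of the sum adds to the total estimate the root of the summed squared distances between each upper limit and its estimate. The function name `howe_upper_limit` and the toy numbers are hypothetical.

```python
import math


def howe_upper_limit(estimates, upper_limits):
    """Approximate one-sided upper confidence limit for a sum of
    independently estimated terms, combining the individual upper limits.

    estimates[j]    : point estimate E_j of the j-th summand
    upper_limits[j] : one-sided upper confidence limit U_j of that summand
    """
    if len(estimates) != len(upper_limits):
        raise ValueError("estimates and upper_limits must have the same length")
    total = sum(estimates)
    spread = math.sqrt(sum((u - e) ** 2 for e, u in zip(estimates, upper_limits)))
    return total + spread


if __name__ == "__main__":
    # Toy values: two summands with estimates 1.0 and -0.5.
    print(howe_upper_limit([1.0, -0.5], [1.3, -0.1]))
```

The equivalence-type null hypothesis is rejected when the returned limit is below zero.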
So, for the moment, the problem of carryover negligibility must remain in a descriptive but not inferential status: the estimate of the scaled carryover (13) may only suggest lack of alarming carryover levels. On the other hand, limited but possibly more interesting results may be obtained for the reciprocal problem of carryover relevance.

Carryover relevance
A test of carryover relevance for the problem (16) may be of interest for 'large' values of $\alpha^*$ like 0.10 or 0.20, as an a posteriori diagnostic of extreme carryover. An interesting value is $\alpha^* = 0.50$; then, rejecting the null hypothesis of carryover negligibility will suggest that the BE study under consideration has a user's risk control not better than simply tossing a coin and deciding to declare BE or not, ignoring data. In terms of the reciprocal $1/\theta = (\sigma/\kappa)\sqrt{2N/(n_1 n_2)}$, the aforementioned problem reduces to an equivalence or negligibility problem:
$$H_0: |1/\theta| \ge 1/\theta_0 \quad \text{vs.} \quad H_1: |1/\theta| < 1/\theta_0. \tag{19}$$
According to Howe's method, we can obtain an upper confidence interval limit $U_\eta$ for the parameter
$$\eta = \sigma^2 - \frac{n_1 n_2}{2N\,\theta_0^2}\,\kappa^2 \tag{20}$$
from the estimators and upper confidence interval limits summarised in Table I, where $\chi^2_{\alpha_0,\,N-2}$ corresponds to the $\alpha_0$ quantile of a chi-square distribution with $N - 2$ degrees of freedom, and $t_{\alpha_0,\,N-2}$ corresponds to the $\alpha_0$ quantile of a Student's t distribution with $N - 2$ degrees of freedom.
If $U_\eta < 0$, then the null hypothesis in (19) may be rejected, concluding that there is a relevant carryover, perhaps questioning the validity of a previous bioequivalence study declaring bioequivalence. This test is approximately valid provided that the intersubject variance $\sigma_S^2$ is not much larger than the residual variance, or more precisely, provided that the intraclass correlation $\rho_I = \sigma_S^2/(\sigma_S^2 + \sigma^2)$ is not too large. Once an upper bound for the true type I error level of the relevance test is fixed, the maximum allowable $\rho_I$ is an increasing function of $\alpha^*$. Figure 2 displays the maximum allowable intraclass correlation for which the true type I error probability of the test (19) is sufficiently close (±20%) to a nominal size $\alpha_0 = 0.05$, in a balanced 2 × 2 design for sample sizes n = 12, 24 and 36. These results were obtained in a simulation study whose complete results and R code are available at www.ub.edu/stat/recerca/materials/Carryover_negligibility_and_relevance.htm. Figure 3 displays a subset of the more interesting simulation results. It corresponds to the power curve of the relevance test (say the probability of declaring carryover relevance) when 'relevance' is set at $\alpha^* = 0.50$ and the test is performed at three possible significance levels, $\alpha_0 = 0.05$, 0.1 or 0.15, for a balanced sample size n = 12. The probability of declaring carryover relevance is displayed as a function of the parameter $\theta$ defined in (13), in terms of a fraction of the relevance limit $\theta_0(\alpha^*)$. Each probability line corresponds to a given proportion between the 'intra' and the 'inter' subject variances, $\sigma^2$ and $\sigma_S^2$, expressed in terms of a given value of intraclass correlation, $\rho_I$. Ideally, the probability of declaring carryover relevance should be below $\alpha_0$ (horizontal thick line) for fractions to the left of 1 on the abscissa axis, it should be exactly $\alpha_0$ for a unit fraction and should be above this reference value for fractions to the right of 1.
This behaviour is acceptably displayed in all situations except when the intraclass correlation is too high.
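The relevance test can be sketched in Python as follows. Since Table I is not reproduced in this text, the component limits used below are assumptions of the sketch: the upper limit for the variance term is taken as $(N-2)\hat{\sigma}^2/\chi^2_{\alpha_0,N-2}$, and the upper limit for the negative squared-carryover term is obtained from the squared lower limit of $|\hat{\kappa}|$ (truncated at zero). The standard error of $\hat{\kappa}$ is derived from the intraclass correlation under the model, which is also an assumption; all function names are hypothetical, and the quantiles are passed as arguments rather than computed.

```python
import math


def se_kappa_from_icc(sigma_hat, icc, n1, n2):
    """Standard error of the carryover estimate implied by the model:
    sigma_S^2 = sigma^2 * icc / (1 - icc), sigma_C^2 = 4*sigma_S^2 + 2*sigma^2."""
    sigma2 = sigma_hat ** 2
    sigma2_s = sigma2 * icc / (1.0 - icc)
    sigma2_c = 4.0 * sigma2_s + 2.0 * sigma2
    return math.sqrt(sigma2_c * (1.0 / n1 + 1.0 / n2))


def relevance_upper_limit(kappa_hat, se_kappa, sigma_hat, n1, n2,
                          theta0, t_quantile, chi2_quantile):
    """Howe-type upper limit for eta = sigma^2 - k * kappa^2,
    with k = n1 * n2 / (2 * N * theta0^2).

    t_quantile    : 1 - alpha0 quantile of Student's t with N - 2 df
    chi2_quantile : alpha0 quantile of chi-square with N - 2 df
    """
    n_total = n1 + n2
    df = n_total - 2
    k = n1 * n2 / (2.0 * n_total * theta0 ** 2)
    # Variance term: estimate and one-sided upper limit.
    e_sigma = sigma_hat ** 2
    u_sigma = df * e_sigma / chi2_quantile
    # Carryover term -k * kappa^2: its upper limit uses the lower limit of |kappa|.
    kappa_lower = max(0.0, abs(kappa_hat) - t_quantile * se_kappa)
    e_kappa = -k * kappa_hat ** 2
    u_kappa = -k * kappa_lower ** 2
    spread = math.sqrt((u_sigma - e_sigma) ** 2 + (u_kappa - e_kappa) ** 2)
    return e_sigma + e_kappa + spread


if __name__ == "__main__":
    # Cmax figures quoted in the second example; alpha0 = 0.05, 22 df.
    se = se_kappa_from_icc(0.333, 0.8446, 12, 12)
    u = relevance_upper_limit(1.096, se, 0.333, 12, 12, 1.6977,
                              t_quantile=1.7171, chi2_quantile=12.338)
    print(round(u, 3), "relevant" if u < 0 else "relevance not declared")
```

Under these assumptions, the Cmax figures of the second example give an upper limit of about 0.114 at $\alpha_0 = 0.05$, which is positive, so relevance is not declared at that level.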

Example 1
We illustrate the aforementioned procedures using a dataset that is accessible through the Food and Drug Administration website. It corresponds to dataset 29, 'Cholinesterase inhibitor', in Section 2, which is devoted to non-replicate designs, at: http://www.fda.gov/downloads/Drugs/ScienceResearch/UCM301914.txt.
These data correspond to a balanced 2 × 2 crossover design for a total of N = 28 subjects, n = 14 in each sequence. The measured variables were the area under the curve until time t, $AUC_{0-t}$, and the peak plasma concentration of a drug after oral administration, Cmax. Table II shows the main results of a standard bioequivalence analysis and analysis of variance (ANOVA). ANOVA is performed through a parameterisation that allows for estimation of the overall mean, period effects, treatment effects and carryover effects, assuming that no sequence effects exist. The drug has low within-subject variability and the study has adequate power, provided that the number of healthy volunteers included in the protocol is sufficient. For both variables, the standard ANOVA or Student's t-test procedures reject the null hypothesis of null carryover effect, $\kappa = 0$, but do not give any idea of the magnitude of these non-null carryovers and their possible impact on a bioequivalence study.
For the logarithmically transformed $AUC_{0-t}$, the estimated carryover (expression (11)) is $\hat{\kappa} = 0.7568$ and the residual standard deviation is $\hat{\sigma} = 0.1166$. This makes $\hat{\kappa}/\hat{\sigma} = 6.4878$, and the estimated $\theta$ parameter becomes 12.1376. Considering a limit for type I error $\alpha^* = 0.50$, and the standard $\alpha = 0.05$ for the bioequivalence test, the associated relevance limits for standardised carryover become $\pm\theta_0 = \pm 1.6889$, so the estimated $\theta$ is more than seven times this limit. This may be interpreted as a suggestion of a possibly highly relevant carryover. In fact, the test for carryover relevance proposed in Section 4 gives a significant result at a standard significance level $\alpha_0 = 0.05$, as the upper limit of the one-sided confidence interval for the parameter (20) is negative: $-0.0457$. Following Grizzle's recommendation of testing carryover with more permissive significance levels, like 0.10 or 0.15, the preceding result is still clearer; for example, for $\alpha_0 = 0.15$, the confidence interval upper limit becomes $-0.2969$.
These results must be taken with care as the estimated intraclass correlation is very high, 0.9245, which makes the relevance test too permissive, according to Figures 2 and 3. They may suggest the convenience of revising the experimental protocols but should not be taken as full evidence of distorting carryover.

Figure 3. Probability of declaring carryover relevance at significance levels $\alpha_0 = 0.05$, 0.10 and 0.15, for a balanced sample size n = 12 in a 2 × 2 crossover design. The probability is displayed as a function of the carryover relevance parameter $\theta$ defined in (13), in terms of a fraction of the relevance limit $\theta_0(\alpha^*)$, $\theta/\theta_0$. Fraction values below 1 reflect non-relevant carryover degrees, and fraction values above 1 reflect relevant carryovers, able to put the true user risk at a too high $\alpha^* = 0.50$ in a BE study. Each probability line corresponds to a given proportion between the 'intra' and the 'inter' subject variances, $\sigma^2$ and $\sigma_S^2$, expressed in terms of a given value of intraclass correlation, $\rho_I$.

The estimator of the formulation effect based only on the first period, $\hat{\phi}_1 = \bar{Y}_{\cdot 12} - \bar{Y}_{\cdot 11}$, where $\bar{Y}_{\cdot 1k} = (1/n_k)\sum_{i=1}^{n_k} Y_{i1k}$ stands for the mean of all observations in period 1 and sequence $k$, and the synthetic estimator of Longford [30] ($-0.2161$ and $-0.0787$, respectively), which is based on a weighted average of $\bar{D}$ and $\hat{\phi}_1$, give some credibility to the possibility of a truly negative formulation effect and thus to the formulation effect and the carryover effect having the same sign.

Example 2
As a second example, we use the results of a true but unrecognisable bioequivalence study available at www.ub.edu/stat/recerca/materials/Example2Carryover.pdf.
In short, for a balanced sample size of 12 in each sequence, for all three pharmacokinetic parameters, the ANOVA for carryover is non-significant at a 0.05 level. The p-values are 0.1859, 0.2077 and 0.1123 for the logarithms of $AUC_{0-t}$, $AUC_{0-\infty}$ and Cmax, respectively. The null hypothesis of zero carryover may be rejected for Cmax using the more permissive level 0.15. But in any case, these results do not provide any indication of the true distorting effect of carryover on the BE study, if present. For example, one may question if these carryovers may put the probability of erroneously declaring BE at an unacceptable $\alpha^* = 0.50$ level.
For Cmax, the estimated carryover is 1.096, and the estimated within-subject standard deviation is 0.333, which gives an estimated $\theta$ value of 5.699. This corresponds to more than three times the relevance limit $\theta_0 = 1.6977$. These results seem to suggest carryover relevance, but when relevance is tested at a 0.05 level, the upper confidence interval limit $U_\eta$ is 0.114, so no significant carryover relevance may be declared. On the other hand, if relevance is also tested at a 0.15 level, the upper confidence interval limit $U_\eta$ becomes $-0.0485$, and carryover relevance is declared, thus suggesting evidence for an unacceptable user's risk of 0.50 of incorrectly declaring bioequivalence. The intraclass correlation for Cmax is 0.8446, at the limit but still supporting a credible carryover relevance test.
The same results (evidence of relevant carryover at a 0.15 level but not at 0.05, when the possibility of reaching a true type I error probability 0.5 is considered) are obtained for the other two pharmacokinetic parameters, but at very high intraclass correlation values, 0.9538 and 0.9566, for $AUC_{0-t}$ and $AUC_{0-\infty}$, respectively, which decrease the credibility of the corresponding relevance results. Table III summarizes these results.

DISCUSSION
In our opinion, there is enough evidence to state that some factors may directly affect the concentration of a drug in the second period of a crossover study, and that despite a presumed adequate washout period, sometimes, a certain degree of carryover may be present. Among these factors, there is the possible presence of phenotypes associated with metabolising ability (e.g. extensive, intermediate, poor and ultra-rapid metabolizer, [31]) in the volunteers who participate in a bioequivalence study, a factor that may directly affect the concentration of a drug in the second period. Given that the frequencies of the alleles associated with these phenotypes may vary across human groups, these considerations also cast some doubt on the automatic transportability of bioequivalence studies between countries or ethnic groups. Obviously, these considerations are only relevant (for carryover) if the differences in metabolising ability are translated in some way to differences associated with the different formulations.
Our first example was chosen quite deliberately to illustrate a case where high carryover was suspected in advance because of heterogeneity with respect to gender. These effects of subgroup heterogeneity (gender, phenotypes, age, etc.) that induce some subject-by-formulation interaction (confounded with the carryover effect) have been emphasised by the regulators. Chapter III of [32], 'Methods to document BA and BE, Part A, Pharmacokinetic Studies, item 5. Study Population' recommends that 'in vivo BE studies be conducted in individuals representative of the general population, taking into account age, sex, and race. We recommend that if the drug product is intended for use in both sexes, the sponsor attempt to include similar proportions of males and females in the study'. Here we are dealing with subject-by-formulation interaction and not with representativeness, but it is worth pointing out that there is also some scepticism among specialists concerning these recommendations, as BE studies are performed on healthy volunteers and not on patients, and presumably, representativeness will not be an issue.
Therefore, ignoring the carryover issue in bioequivalence studies may not be the best strategy, especially given that carryover may severely affect the type I error, that is to say, the user risk associated with wrongly declaring bioequivalence. We suggest that bioequivalence studies should be accompanied by some analysis exploring the possible presence of disturbing degrees of carryover, as a way of reinforcing their credibility or lack thereof. In its present status, a positive result of the testing procedure for carryover relevance may not be presented as a proof of inadequacy of the BE study (and even more clearly, the test for carryover negligibility may not be presented as a proof of its adequacy), but perhaps should be taken by a regulatory authority as a suggestion that more information about the experiment should be requested from the applicant laboratory.
The aforementioned comment suggests where our method may be of main interest: when analysing data coming from an external source, with limited control over the amount of information available to the analyst (e.g. a journal reviewer or a regulatory agency examining a generic application). There are other possible ways of evaluating carryover, for example, using baseline measurements before the second administration; our method may lead to the conclusion that such complementary information is necessary and should be sought.
In any case, we are not promoting a two-stage approach to BE determination. Our point of view is strictly one-stage, always assuming null or negligible carryover and thus the correctness of the decision on BE based on the confidence interval (7). But in the same way that, for example, a look at the residuals is always advisable, a look at any suspected trace of possible disturbing carryover may be a good policy, possibly leading to a request for supplementary information on the experiment, with the desirable end goal of finally establishing its correctness.