Two‐stage designs versus European scaled average designs in bioequivalence studies for highly variable drugs: Which to choose?

The usual approach to determine bioequivalence for highly variable drugs is scaled average bioequivalence, which is based on expanding the limits as a function of the within‐subject variability in the reference formulation. This requires separately estimating this variability and thus using replicated or semireplicated crossover designs. On the other hand, regulations also allow using common 2 × 2 crossover designs based on two‐stage adaptive approaches with sample size reestimation at an interim analysis. The choice between scaled or two‐stage designs is crucial and must be fully described in the protocol. Using Monte Carlo simulations, we show that both methodologies achieve comparable statistical power, though the scaled method usually requires a smaller sample size, but at the expense of each subject being exposed more times to the treatments. With an adequate initial sample size (not too low, eg, 24 subjects), two‐stage methods are a flexible and efficient option to consider: They have enough power (eg, 80%) at the first stage for non‐highly variable drugs, and, otherwise, they provide the opportunity to step up to a second stage that includes additional subjects.

ratio (GMR) should lie fully within the ABE limits of 0.80 to 1.25 (=1/0.80), corresponding to ±0.223 on the logarithmic scale. 2,5 Highly variable drugs (HVD) are characterized by high within-subject variability in the rate and/or extent of absorption of their active principle. This hinders researchers from declaring ABE when it really holds, unless unacceptably large sample sizes are used. Most regulations classify a drug as HVD if the within-subject coefficient of variation of the reference formulation R (CV WR ) is 30% or greater on the original scale. The percentage of HVD is not negligible. Davit et al 6 collected data from all in vivo bioequivalence studies reviewed by the FDA's Office of Generic Drugs from 2003 to 2005, and they concluded that 31% of the studies (57/180) corresponded to HVD, many of them around CV WR = 30%.
If HVD is suspected, the European Medicines Agency (EMA) allows linearly scaling the Cmax margins as a function of the R variability to a maximum plateau of 0.6984 to 1.4319, and it further allows application of the interval inclusion rule over these expanded limits. 2 Similarly, the FDA also allows researchers to rescale the AUC limits. 1,3 These scaled approaches require the use of high-order crossover designs like the replicated TRTR/RTRT or semireplicated TRR/RTR/RRT designs. 2,7,8 However, these scaled methods, as defined by FDA and EMA regulations, do not adequately preserve the type I error rate in the neighborhood of CV WR = 30%. 9,10 Thus, the proportion of non-ABE products erroneously declared as ABE is higher than its desired nominal value.
Regulators also allow using two-stage adaptive designs (TSD) with unblinded interim sample size reestimation 2,5,11,12 based on the usual 2 × 2 crossover RT/TR design. Bioequivalence may be declared at the interim look with N 1 subjects; otherwise, the sample size can be increased on the basis of the estimated within-subject variability at the first stage, then ABE is tested again at a second stage with cumulated data N = N 1 + N 2 . Two-stage designs preserve the type I error rate 13 by adjusting significance boundaries at each stage in various ways that are not fully specified in the regulations. 14,15 In turn, the planned sample size is crucial because it may lead to underpowered studies, as there is a high uncertainty about the assumed GMR and/or variability.
The main objective of this paper is to critically compare the EMA's original scaled method based on a replicate TRTR/RTRT design (or, more precisely, an adjusted variant intended to preserve the type I error rate, as shown by Labes and Schütz 10 ) with 2 TSD methods based on the usual RT/TR crossover design. Section 2 describes the compared methods and details the simulation methodology. Section 3 shows the results, and Section 4 discusses them to recommend the most appropriate approach.
2 | STATISTICAL METHODOLOGY

2.1 | 2010 regulatory EMA reference scaled average bioequivalence approach (for Cmax only)
Replicate TRTR/RTRT designs allow separately estimating the CV WR 9 and can easily be rearranged for comparison with a 2 × 2 crossover design (needed for TSD) once the first 2 periods are sliced (see Section 2.3). We focus on the EMA regulation because the FDA's approach is based on scaled limits, which are discontinuous at CV WR = 30%. This discontinuity is associated with a sharp peak of type I probability around this CV value, which threatens its validity.
On the original scale, the null hypothesis of bioinequivalence is tested against an alternative of bioequivalence, as follows:

$$H_0: GMR \le 0.80 \ \text{or} \ GMR \ge 1.25 \quad \text{vs} \quad H_1: 0.80 < GMR < 1.25.$$

In the reference scaled average bioequivalence (RSABE) approach, the ABE limits are a function, say GMR EMA , of the unknown population within-subject R coefficient of variation CV WR , so the hypotheses being tested differ from the standard ones enunciated above:

$$H_0: GMR \le 1/GMR_{EMA}(CV_{WR}) \ \text{or} \ GMR \ge GMR_{EMA}(CV_{WR}) \quad \text{vs} \quad H_1: 1/GMR_{EMA}(CV_{WR}) < GMR < GMR_{EMA}(CV_{WR}).$$

3. Obtain the estimate of the within-subject coefficient of variation of the reference product, $\widehat{CV}_{WR} = \sqrt{\exp(\hat{\sigma}_{WR}^2) - 1}$, where $\hat{\sigma}_{WR}$ is the estimated value of the reference residual standard deviation in the logarithmic scale;
4. Obtain the 90% confidence interval for GMR around its estimate $\widehat{GMR}$.

Previous work, 9 among others, showed that the above decision criterion does not adequately control the type I error probability, or false positive rate (ie, the probability that bioequivalence is erroneously declared when in fact it does not hold), in the neighborhood of CV WR = 30%.
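For reference, the expanded upper limit can be written as a piecewise function of CV WR. This is a sketch reconstructed from the plateau value (1.4319) quoted in the Introduction and from the log-scale constants 0.2231 and 0.3590 used in Section 2.4. The value k EMA = 0.760 is the regulatory scaling constant and is stated here as an assumption, since it does not appear explicitly in this text. The lower limit is the reciprocal, 1/GMR EMA.

```latex
\[
GMR_{EMA}(CV_{WR}) =
\begin{cases}
1.25 = \exp(0.2231), & CV_{WR} < 30\% \\[4pt]
\exp\left\{ k_{EMA} \sqrt{\log\left(CV_{WR}^{2} + 1\right)} \right\}, & 30\% \le CV_{WR} < 50\% \\[4pt]
1.4319 = \exp(0.3590), & CV_{WR} \ge 50\%
\end{cases}
\qquad k_{EMA} = 0.760 \ \text{(assumed)}
\]
```

With this constant, the scaled limit at CV WR = 50% is exp{0.760 × sqrt(log 1.25)} ≈ 1.4319, which agrees with the plateau 0.6984 to 1.4319.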

2.2 | Significance-level adjustment on the regulatory EMA scaled approach
As has been previously stated, the 2010 former EMA RSABE procedure does not completely control the type I error probability. To focus on a method that is easy for practitioners to use, and that has a chance of being included in the regulations, we considered the method already implemented in the function "scABEL.ad" in the R package PowerTOST. 10 As a consequence of adjusting the significance level, the EMA's scaled method (labeled AdjEMA in the table results) may lose some power. But this (generally small) loss of power is worthwhile because it converts a potentially invalid procedure (with respect to the type I error probability) into a fully valid one.
As a function of the reference coefficient of variation, the type I error probability has a single maximum at CV WR = 30%. Consequently, though somewhat conservatively, we left the argument "CV" of scABEL.ad at its default value of 0.3. The alternative strategy of estimating the coefficient of variation from data and assigning this value (a random function of the data, unknown in advance) to the argument CV induces some type I error probability inflation.
In accordance with the EMA Questions & Answers guideline, 11 section 10, the estimation of the required parameters was based on the analysis of variance procedure labelled as "Method A" in that document, and not on intrasubject contrasts, as is, for example, allowed in the FDA regulation for scaled ABE.

2.3 | Two-stage modified Potvin B and C designs
We consider 2 adaptive TSD with one interim analysis (at the first stage) with N 1 subjects to (1) establish equivalence early, or (2) stop for futility, or (3) recruit an additional group of N 2 subjects to repeat the bioequivalence assessment at a second stage with N = N 1 + N 2 subjects. Each stage is based on a 2 × 2 crossover balanced RT/TR design, and so the within-subject variability CV W should be estimated by means of the pooled variability of R and T. Unlike the scaled approach, 2-stage hypotheses always rely on the standard fixed limits 0.8 to 1.25.
Among adaptive approaches to bioequivalence, 15 we focused on those (at least partially) mentioned in regulations, considering 2 "Pocock-like" variants, 16 as described by Potvin et al and labelled A, B, C, and D. 17 In particular, we studied a type 1 Potvin B method 5 consisting of using the same adjusted α in both stages regardless of whether a study stops in the first stage or proceeds to the second stage (Figure 1), and a type 2 Potvin C method where an unadjusted α may be used in the first stage, dependent on interim power (Figure 2).
Both methods calculate N 2 as the minimum even number of additional subjects required for having a total sample size of N, which achieves a conditional power of at least 80% for declaring bioequivalence at the second stage. This is conditional on the within-subject coefficient of variation estimated at the first stage, $\widehat{CV}_W$, for an assumed true GMR of 0.95.
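The sample size reestimation rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the paper: power is computed with a normal approximation to the two one-sided tests, whereas the paper relies on the noncentral t distribution in Power2Stage. The function names `tost_power_normal` and `second_stage_size` are hypothetical, and the default α = 0.0301 is the adjusted level reported for the modified Potvin B method.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution

def tost_power_normal(n_total, cv_w, gmr=0.95, alpha=0.0301,
                      lower=0.80, upper=1.25):
    """Approximate TOST power in a balanced 2x2 crossover.

    Assumption: normal approximation (the paper uses the noncentral t).
    """
    sigma_w = sqrt(log(cv_w ** 2 + 1.0))   # within-subject SD on the log scale
    se = sigma_w * sqrt(2.0 / n_total)     # SE of the estimated log(GMR)
    z = _STD_NORMAL.inv_cdf(1.0 - alpha)   # one-sided critical value
    delta = log(gmr)
    power = (_STD_NORMAL.cdf((log(upper) - delta) / se - z)
             + _STD_NORMAL.cdf((delta - log(lower)) / se - z) - 1.0)
    return max(power, 0.0)

def second_stage_size(n1, cv_w_hat, target=0.80, max_total=100000, **kwargs):
    """Smallest even N2 such that N = N1 + N2 reaches the target power,
    conditional on the first-stage estimate cv_w_hat and the assumed GMR."""
    n2 = 0
    while tost_power_normal(n1 + n2, cv_w_hat, **kwargs) < target:
        n2 += 2
        if n1 + n2 > max_total:
            raise ValueError("target power not reachable within max_total")
    return n2

if __name__ == "__main__":
    for cv in (0.20, 0.30, 0.40):
        print(cv, second_stage_size(24, cv))
```

Because the normal approximation is slightly optimistic for small samples, the values it returns may be a little smaller than those obtained with the exact method.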
Potvin A was discarded, as it did not adjust the significance boundaries; Potvin D was a more conservative variant of Potvin C and was therefore not recommended because it requires larger average sample sizes than Potvin C. 13 We propose a modification to the original Potvin B and C algorithms that adds 2 constraints: a minimum sample size in the second stage (as in other jurisdictions or organizations) 5 and a maximum overall number of 150 subjects enrolled 18,19 in ABE studies. Specifically, the total sample size must satisfy N ≥ 1.5N 1 , and if the total sample size N required to reach the target power exceeds 150 subjects, the trial fails and it is stopped at the first stage.
In any case, regardless of the method used, at least 12 evaluable subjects should be included in the first stage. 1,11 The adjusted significance level of α = 0.0294 used by Potvin et al 13,16-18 at each stage did not always control the overall type I error rate at a maximum of 0.05 (eg, when using our modified Potvin C algorithm with N 1 = 12 and considering a true unknown CV W = 20%, the false positive rate would be inflated to 0.053). Like Xu et al, 20 we searched for a significance level that strictly controls the type I error rate below 0.05 for our specific modified Potvin B and C methodologies. Because the sponsor is unaware of the true CV W value, we looked for a significance level that was applicable to a broad set of N 1 and CV W combinations, {N 1 /CV W } (scenarios shown in Section 2.4).
We used the method implemented in the function "power.2stage" (via noncentral t distribution) in the R package Power2Stage. The treatment effect was evaluated at the frontier 1.25, assuming an expected GMR = 0.95 and a target power of 80%. The procedure for assessing the adjusted significance level, α adj , is summarized as follows:
1. Define a grid with a set of {N 1 /CV W }.
2. Start with an arbitrary value, eg, α adj = 0.0290.
3. Obtain the empirical probability of type I error, Pr{TIE}, over the grid (m = 30 000 simulation trials per scenario). Filter for the scenarios where Pr{TIE} is at least 95% of the max(Pr{TIE}) observed in the grid, let us say {N 1 /CV W } TIE≥P95% .
4. For {N 1 /CV W } TIE≥P95% , find the N 1 /CV W with max(Pr{TIE}) (m = 1 000 000).
5. Set up a range of α j close to the one used before, α j ∈ {α adj ± δ j } j = 1…5 (eg, by δ increments of 0.0001 unit). By using the N 1 /CV W associated with max(Pr{TIE}), estimate the Pr{TIE} of all α j (m = 1 000 000).
6. Fit linear α = g lin (Pr{TIE}) and quadratic α = g quad (Pr{TIE}) models, with and without the intercept. Choose the model with the lowest Akaike information criterion value.
7. Use this model to predict a new α adj , where α adj = g(0.05).
8. Evaluate the entire grid of {N 1 /CV W } with this new α adj (m = 1 000 000).
9. If Pr{TIE} < 0.05 for all {N 1 /CV W }, STOP and select this new α adj ; otherwise, start over from step 4.
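Steps 5 to 7 of this procedure amount to an inverse regression: α is regressed on the simulated type I error and the fitted model is evaluated at 0.05. The sketch below illustrates only that idea and makes several simplifying assumptions. It fits a single straight line with intercept instead of comparing linear and quadratic models by Akaike information criterion, and the function `toy_tie_rate` is a made-up, deterministic stand-in for the 1 000 000-trial simulation, not a result from the paper. All function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def recalibrate_alpha(alpha_adj, tie_rate, delta=0.0001, steps=5, target=0.05):
    """One pass over steps 5-7: probe alpha_adj +/- j*delta (j = 1..steps),
    regress alpha on the type I error, and predict alpha at the target."""
    offsets = [j * delta for j in range(1, steps + 1)]
    alphas = [alpha_adj - d for d in offsets] + [alpha_adj + d for d in offsets]
    ties = [tie_rate(a) for a in alphas]   # in practice: simulated Pr{TIE}
    intercept, slope = fit_line(ties, alphas)
    return intercept + slope * target

def toy_tie_rate(alpha):
    """Hypothetical placeholder: a made-up linear link between alpha and
    Pr{TIE}, used only so that the example runs without a simulator."""
    return 1.72 * alpha

if __name__ == "__main__":
    print(recalibrate_alpha(0.0290, toy_tie_rate))
```

In an actual calibration, `tie_rate` would wrap a simulation at the worst-case {N 1 /CV W } scenario, and the predicted α adj would then be checked over the whole grid as in steps 8 and 9.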
As the 2010 EMA guideline uses a type 1 TSD method, 2 we used the modified Potvin B as the main TSD approach and the modified Potvin C as a sensitivity analysis.

2.4 | Simulation methods
The results described in the next sections are based on simulations using 64-bit R and Microsoft R Open. The main outputs are type I error rate, power, and the number of trials stopping at the first stage for the TSD approach. For most scenarios, m = 100 000 datasets were generated, but m = 1 000 000 were generated for those devoted to estimating the most crucial type I error probabilities, ie, for simulated GMRs just on the bioequivalence limit.
In the simulations, we considered all combinations of 3 factors: sample size, true GMR, and true within-subject variability under the homoscedasticity assumption that CV W = CV WR = CV WT (from now on, we use CV W and CV WR interchangeably, given the assumed simulated homoscedasticity). The sample sizes were N 1 = 12, 18, 24, 30, 36, 48, and 60 subjects for RSABE methods and at the first stage for TSD methods, always considering a balanced design, ie, 6, 9, 12, 15, 18, 24, and 30 subjects per sequence. The simulated population GMR values were 0.95, 1.00, 1.12, 1.25, and 1.31, with the first three corresponding to scenarios under true bioequivalence (alternative hypothesis) and the last two corresponding to true nonbioequivalence (null hypothesis). In fact, this statement is exactly true only for the TSD approach, where the bioequivalence limits are the constants 0.80 to 1.25; see the next paragraph for clarification in the RSABE case. Finally, the simulated within-subject coefficients of variation were 10%, 20%, 25%, 30%, 40%, 50%, and 60%. A coefficient of variation of 30% or higher indicates an HVD. Section 3 reports only the results for a subset of the simulated values of sample size, true GMR, and true coefficient of variation. In addition, these TSD simulations were done using the "exact" method.
Given that TSD and RSABE are based on different definitions of bioequivalence, comparing them is quite difficult. To have a reference case for comparison, we took the simulated true GMR values "on the frontier" of each approach (constant 1.25 in TSD or a function GMR EMA in RSABE for varying simulated CV WR values), which should provide similar proportions of bioequivalence declaration (near 0.05) if both approaches are adequately controlling the user's risk. For GMRs that are progressively inside or outside the corresponding bioequivalence regions, these probabilities should also be comparable. To define these concordant simulation scenarios, we reasoned at the logarithmic scale. The constant simulated GMR values in the TSD approach are 0.95, 1.00, 1.12, 1.25, and 1.31, and they correspond to formulation effects on the logarithmic scale of −0.0513, 0, 0.1133, 0.2231, and 0.2700, respectively. With respect to the (frontier) 0.2231 value, these formulation effects correspond to proportions λ = −0.230, 0, 0.508, 1, and 1.210, respectively. Then, λ = 1 refers to values on the frontier, |λ| < 1 to scenarios of true bioequivalence, and |λ| > 1 to scenarios of bioinequivalence. Therefore, the same λ value defines concordance in TSD and RSABE scenarios: The population GMRs in the original scale were taken as exp{λ 0.2231} in the TSD approaches and for all simulated CV WR values; while in the RSABE approach, they were taken as exp{λ 0.2231} for CV WR < 30%, as $\exp\{\lambda\, k_{EMA} \sqrt{\log(CV_{WR}^2 + 1)}\}$ for CV WR values between 30% and 50%, and as exp{λ 0.3590} for CV WR ≥ 50%. For simplicity, the simulated GMRs in the next sections will always be labeled as 0.95, 1.00, 1.12, 1.25, and 1.31; but it should be remembered that these values in the RSABE case correspond only to the simulated coefficients of variation below 30%.
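The λ mapping can be reproduced with a few lines of code. The following sketch converts a labeled GMR into its proportion λ of the log-scale frontier and then into the concordant population GMR for each approach. The constant K_EMA = 0.760 is an assumption (the regulatory scaling constant, not stated explicitly in this text), and the function names are hypothetical.

```python
from math import exp, log, sqrt

K_EMA = 0.760  # assumption: EMA regulatory scaling constant (not given in the text)

def lam_of(label_gmr):
    """Proportion of the standard log-scale frontier log(1.25) = 0.2231."""
    return log(label_gmr) / log(1.25)

def log_frontier(cv_wr, approach):
    """Upper bioequivalence limit on the log scale for 'TSD' or 'RSABE'."""
    if approach == "TSD" or cv_wr < 0.30:
        return log(1.25)                               # fixed limit, 0.2231
    if cv_wr < 0.50:
        return K_EMA * sqrt(log(cv_wr ** 2 + 1.0))     # scaled limit
    return K_EMA * sqrt(log(0.50 ** 2 + 1.0))          # plateau, about 0.3590

def concordant_gmr(label_gmr, cv_wr, approach):
    """Population GMR simulated for a labeled GMR under the given approach."""
    return exp(lam_of(label_gmr) * log_frontier(cv_wr, approach))

if __name__ == "__main__":
    for g in (0.95, 1.00, 1.12, 1.25, 1.31):
        print(g, round(lam_of(g), 3),
              round(concordant_gmr(g, 0.40, "TSD"), 4),
              round(concordant_gmr(g, 0.40, "RSABE"), 4))
```

For the labeled frontier value 1.25 and CV WR ≥ 50%, the RSABE population GMR produced by this sketch is the plateau limit of about 1.4319, whereas the TSD value stays at 1.25 for every CV WR.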
Following the EMA Questions & Answers guideline, 11 adjusted analysis of variance models for analysis of the combined second stage data included the following terms: stage, sequence, interaction sequence*stage, subject nested in sequence*stage, period nested in stage, and formulation.

3 | SIMULATION RESULTS
The adjusted significance level predicted for the modified Potvin B was assessed at α adj = 0.0301 at each stage; for the modified Potvin C, the adjusted significance level predicted was assessed at α adj = 0.0280 (Figures 1 and 2).
Both adaptive TSD modified Potvin B and C methods performed similarly with respect to the power achieved and the required median sample size Me[N] (Table 1). Because almost all simulated studies required stepping up to a second stage and resulted in large final sample sizes, it was not advisable to start with too small a sample size, like N 1 = 12, in scenarios with high variability (CV W ≥ 30%).
On the other hand, when N 1 ≥ 24, the global power (including both stages) was at least 80% for variabilities up to 40%. Additionally, those sample sizes increased the likelihood of stopping for bioequivalence at the first stage. For the high value of CV W = 60%, results were poor, with power always below 80%.
For the RSABE EMA method, a crucial variability value is at the threshold CV W = 30%, where the type I error reaches its maximum. Table 2 shows that for a true GMR of 1.25, the highest false positive rate is 0.085, confirming the already known risk control problems of the EMA scaled approach. On the other hand, the RSABE adjusted EMA method (AdjEMA) accurately respected the nominal 0.05 level. Both TSD approaches also respected the type I error at 0.05. In addition, for a sample size of N 1 = 24, all methods with a type I error close to the nominal 0.05 level provide satisfactory and similar powers on bioequivalent drugs (GMR = 0.95, 1.00, and 1.12). The apparently larger sample sizes required by TSD methods should be put in perspective: With half as many periods per subject, they did not double the mean sample size, and they reached a declaration of bioequivalence at the first stage in a notable proportion of cases (approximately 41%, 47%, and 24%).
Figure 3 shows a more comprehensive picture of the extended N 1 and CV W values for a bioequivalent scenario fixed at GMR = 0.95. When N 1 = 12, TSD methods showed higher power than the RSABE adjusted EMA method for CV W > 20%, requiring relatively larger global sample sizes of Me[N] = 44 and around 70 for CV W = 30% and 40%, respectively. For N 1 = 24, the RSABE adjusted EMA method showed a similar trend to both TSD methods; and for N 1 = 36, both approaches showed power above 80% for a true CV W below 60%. For a true CV W ≥ 60%, the power for both TSD methods seriously suffered from the futility criterion of not allowing studies with more than 150 subjects, though for the RSABE adjusted EMA, the power was still above 80%. Figure 4 explores the power for different true levels of bioequivalence: GMR = 0.95, 1.00, and 1.12. It is remarkable that for a true value of GMR = 1.12, no method reached 80% power for any HVD with CV W ≥ 30%.

4 | DISCUSSION
Bioequivalence studies are the pivotal clinical studies submitted to regulatory agencies to support the marketing applications of new generic drug products. High levels of within-subject variability make it difficult to assess bioequivalence through standard procedures using reasonable sample sizes, thus delaying treatment. After many years of discussion, some agencies issued regulations describing those methods. In general, their approach is based on bioequivalence limits being scaled as a function of the reference formulation variability. This is the RSABE approach of the EMA regulation issued in 2010. 2 Although also mentioned in the regulations, adaptive TSD are not used nearly as much as the widespread scaling methods, despite having some appealing characteristics. Deciding on the study's experimental design is crucial and must be done in advance (eg, including it in the study protocol), generally without full knowledge of the within-subject variability. We compared 2 variants of well-known adaptive methods and an RSABE adjusted (type I error) EMA approach. Both approaches showed similar statistical power, but the RSABE adjusted scaled method required a smaller sample size, although at the expense of exposing subjects twice as long as TSD methods. For initial sample sizes of at least 24 subjects, TSDs are a good option to consider, as they have a power of around 80% at the first stage for non-HVD while, at the same time, they offer the opportunity for stepping up to the second stage (including additional subjects) for truly bioequivalent products. Statistical power is used to evaluate the performance of adaptive methodologies in ABE clinical trials. A power of at least 80% is desirable when considering N 1 subjects at the first stage and assuming an expected but unknown within-subject coefficient of variation, CV w . In turn, this is always conditional on not exceeding the overall type I error rate of 0.05 for truly bioinequivalent drugs.
In our modified Potvin B and C methods, we found adjusted significance levels covering a wide range of N 1 and CV w combinations (ie, α adj = 0.0301 and α adj = 0.0280 at each stage for Potvin B and C, respectively). This is useful to regulators since they can widely rely on the protection of patients against false positive results. However, we understand that for a specific actual (local) N 1 and CV w combination, the power might be slightly reduced, although it is always above 80% in case of true bioequivalence. It has been shown that by using 2 × 2 crossover designs with conventional ABE limits of 0.8 to 1.25 and CV w of 60% or above, the required sample size exceeds 150 subjects (though replicate designs require a smaller sample size). Using adaptive designs, we avoid conducting studies with such a large sample size by imposing a futility criterion so that we can stop the trial at an interim look with only N 1 subjects. Following Karalis and Macheras, 19 we added a constraint to the original TSD methods, specifically by not recruiting more than 150 subjects overall. For example, in the case of a truly bioequivalent drug with 0.95 ≤ GMR ≤ 1.05, and for HVD with an estimated within-subject coefficient of variation above 58% at the interim analysis, the final sample size needed for achieving a power of 80% at the second stage already exceeds 150 subjects. At first glance, this constraint represents some global loss of power, but this possibility of cancelling a study for futility may ultimately be considered a positive trait, since the sponsor is unaware of the true treatment effect value during the planning phase, and the overall sample size could unnecessarily soar above this threshold for a scenario of true bioinequivalence. However, from an ethical perspective, even starting a study with such a low expected power might be questionable. 22
Kieser and Rauch 15 and Karalis and Macheras 19 pointed out a potential limitation of the original TSD methods stated by Potvin et al 17 and Montague et al: 13 although unblinded data are available after the first stage, the knowledge about the estimated GMR in the interim analysis is not used for sample size recalculation. We assumed a fixed true treatment effect of GMR = 0.95 after the first stage since Cui et al 23 showed that a determination of the second-stage sample size based on an interim estimate of the GMR can substantially inflate the probability of type I error in most practical situations.
In addition, the expected total sample size E[N] is usually used to compare the performance characteristics of different TSD methods. However, by their very nature in TSD, the distribution of total sample sizes N is bimodal, mainly due to the imposition of N ≥ 1.5N 1 . For example, using our modified Potvin B, with α adj = 0.0301 at each stage, GMR = 0.95, CV w = 0.3, N 1 = 24, and target power 80%, we obtain an E[N] of 40 subjects, but with 24 and 36 subjects having more likelihood of occurrence (Figure 5). As the average is skewed towards 2 sample values, we believe that the median of N is more useful to compare different TSD methods.

FIGURE 4 Bioequivalence acceptance of the adjusted reference scaled ABE EMA method and two-stage designs (TSD) modified Potvin B for different levels of true bioequivalence and a progressive increase in the within-subject variability. ABE, average bioequivalence; RSABE, reference scaled average bioequivalence; EMA, European Medicines Agency; N 1 , initial and fixed sample size (EMA method); GMR, geometric mean ratio; CV w , within-subject coefficient of variation; Me[N], TSD median total sample size (beside the squares in the figure); AdjEMA, type I error adjusted EMA
In general, regulators allow using adaptive methods, though they usually favor sample size reestimation procedures that maintain the blinding of the treatment allocations throughout the trial, as shown by Golkowski et al. 24 However, even though both TSD Potvin B and C methods studied in this article assume unblinded data at the interim analysis, the agencies also specifically recommend using these 2 TSD methods, 2 as they have been demonstrated to control the type I error rate in a strong way.
So, given that both the RSABE and TSD methods are suitable approaches for ABE studies, we have compared them through the behavior of their type I error rate and power to facilitate the discussion about which to choose. In terms of power, both approaches perform similarly, although both adaptive methods require a higher mean sample size to reach the same power, especially for clearly variable drugs. Nevertheless, they demonstrate suitable power at the first stage in some cases. However, as RSABE relies on replicate designs, double exposure of subjects is needed. The crucial point to consider is the assessment made by sponsors regarding the relative importance of the number of required subjects (an argument favoring the scaled approach) and the exposure of these subjects (which tips the balance in favor of the TSD approach).
The applicability of the TSD approaches is essentially the same as the classical approach, in that they have the same RT/TR design and fixed standard limits. 25 The RSABE approaches (with type I error adjustment) are appropriate for drugs with low to moderate variability, because dose-to-dose variability within a patient is comparable to the width of the criteria. However, with HVD, dose-to-dose variability within a patient is greater than the width of the standard criteria, and it is usually characterized by flat dose response curves and wide safety margins. Therefore, broadening the acceptance limits in the RSABE approach is at the very least controversial, since clinically sound criteria should be used to clearly prove if a greater difference in Cmax (and also in AUC for the FDA) is irrelevant.
In conclusion, the RSABE approach is well powered and usually requires enrolling fewer patients than adaptive TSD methods, even though scaling the ABE limits ultimately depends on additional clinical judgment. For HVD in general, samples of 36 subjects provided well-powered studies using RSABE methods. As there is a considerable chance of declaring ABE at the first stage in adaptive approaches, sponsors should consider them because they imply less subject exposure and less treatment duration.