Flexible Online Repeated Measures Experiment Yu Guo Alex Deng Microsoft Microsoft OneMicrosoftWay OneMicrosoftWay Redmond,WA98052 Redmond,WA98052 [email protected] [email protected] ABSTRACT 1. INTRODUCTION 5 Online controlled experiments, now commonly known as Manyrecentpublicationsattesttothepowerofusingonline 1 A/Btesting,arecrucialtocausalinferenceanddatadriven A/B testing as the golden rule for making causal inference 0 decisionmakinginmanyinternetbasedbusinesses. Whilea inwebfacingcompanieslargeandsmall. Byrandomassign- 2 simple comparison between a treatment (the feature under ment of feature to otherwise balanced groups of users and test) and a control (often the current standard), provides a measuring subsequent changes in user behavior, A/B test- n startingpointtoidentifythecauseofchangeinKeyPerfor- ingisolateseffectoffeaturechange,i.e. thetreatmenteffect a J manceIndicator(KPI),itisofteninsufficient,asthechange from extraneous sources of variance. wewishtodetectmaybesmall,andinherentvariationcon- 2 tainedindatamayobscuremovementsinKPI.Tohavesuffi- Toperformstatisticalinferenceinbothpointestimationand cientpowertodetectstatisticallysignificantchangesinKPI, hypothesistestingforthetreatmenteffect,whilecontrolling ] P an experiment needs to engage a sufficiently large propor- typeIerroratpre-specifiedlevel,wewoulddesirelowertype A tionoftraffictothesite,andalsolastforasufficientlylong II error, or equivalently, higher powered experiments. That duration. This limits the number of candidate variations is,wewishtobeabletodetecttheeffectwhenthereisany. . t to be evaluated, and the speed new feature iterations. We Runningunderpoweredexperimentshavemanyperils. Not a t introduce more sophisticated experimental designs, specifi- onlywouldwemisspotentiallybeneficialeffects,wemayalso s cally the repeated measures design, including the crossover getfalseconfidenceaboutlackofnegativeeffects. Statistical [ designandrelatedvariants,toincreaseKPIsensitivitywith powerincreaseswithlargereffectsize,andsmallervariances. 1 thesametrafficsizeanddurationofexperiment. Inthispa- Let us look at these aspects in turn. v perwepresentFORME(FlexibleOnlineRepeatedMeasures 0 Experiment),aflexibleandscalableframeworkforthesede- Whiletheactualeffectsizefromapotentialnewfeaturemay 5 signs. Weevaluatethetheoreticbasis,designconsiderations, notbeknown,wegenerallyselectasizethatmakesbusiness 4 practical guidelines and big data implementation. We com- sense, i.e. one that justifies the cost of feature development 0 pareFORMEtoanexistingmethodologycalledmixedeffect and ongoing maintenance of the code base. Dramatic fea- 0 model and demonstrate why FORME is more flexible and tures that drastically alter user behavior and get reflected 1. scalable. We present empirical results based on both sim- in KPI as large effect sizes are few and far in between. Of- 0 ulation and real data. Our method is widely applicable to tenthecandidatefeaturehasbutasmalleffectontheKPI. 5 online experimentation to improve sensitivity in detecting Nonetheless, by accumulating a portfolio of small changes, 1 movements in KPI, and increase experimentation capabil- a business can achieve big business success. Quote from : ity. Rule #2 of Kohavi et al. (2014), winning is done inch by v inch. This is especially true for mature web facing busi- i X CategoriesandSubjectDescriptors nesses where most low hanging fruits were picked already. r G.3 [ Probability and Statistics]: Experiment Design a In general one expects variance to decrease with increased samplesize. Butthisisnotalwaystrue. Foronlinebusiness, GeneralTerms at first glance it may seem that the number of visitors may Measurement, Experimentation Design, Web search, A/B belarge,andwithacasuallookpeoplemaythinkthepower Testing to detect any change is large. In reality, however, intrinsic variation between users is large and may obscure the small movement in KPI. Variation in measured treatment effect comes from various sources. Exogenous to the treatment itself includes user to user variation, e.g. some users from slower internet connection would always have slower page load time regardless of what experiments are run. Variance for some metrics does not decrease over time, instead they plateauaftersomeperiodoftime(say,twoweeks),andrun- ning longer experiments no longer results in corresponding benefits (see Kohavi et al. 2012, Section 3.4). This poses a limitation to any online experimentation platform, where within-subject variation. We also discuss practical consid- fast iterations and testing many ideas can reap the most erations to repeated measures design, with variants to the rewards. crossoverdesigntostudythecarryovereffect,includingthe “re-randomized”design (row 5 in table 1). 1.1 Motivation 1.2 MainContributions Toimprovesensitivityofmeasurement,apartfromaccurate implementation and increase sample size and duration, we Inthispaper,weproposeaframeworkcalledFORME(Flex- can employ statistical methods to reduce variance. Using ibleOnlineRepeatedMeasuresExperiment). Wemadecon- the user’s pre-experiment behavior as a baseline for his/her tributions in both novel application and new methodology. post-experiment behavior, we can reduce the variance in measured treatment effect. The experiment setup in a two- • Novel applications. We propose different experiment weekexperimentisshowninTable1. ThetypicalA/Btest designs with repeated measurement. We demonstrate is illustrated in the first row. In the past we have used through real examples the value of these new designs regression to reduce variance (CUPED: Controlled Experi- comparingtotraditionalA/Btest. Methodsformodel ments Using Pre-Experiment Data, see Deng et al. (2013)) assumption checking is also presented. We also com- and have achieved good results, e.g. reducing variance in pare different designs for practical use and propose a numberofqueriesperuniqueuserinagiventimeperiodby general workflow for practitioners. 40-50%. CUPEDhasthebenefitofhavingreadilyavailable • NewMethodology. Wereviewstandardrepeatedmea- baselinedata“forfree”. Thisimprovementisperformedwith sures models in the framework of mixed effect mod- existingdesign,usingthe“free”dataascovariatesonlyinthe els. We present a new method to fit the model that analysis stage. CUPED is in fact a form of repeated mea- is scalable to big data. Our method is flexible in suresdesign,wheremultiplemeasuresonthesamesubjects the sense that it makes far less assumptions than tra- are taken over time. In particular, in the pre-experiment ditional method based on mixed effect model (Bates stage, all users received the default feature C (control) and et al. 2012a). It naturally handles missing data with- none received the new feature T (treatment). outmissingatrandomassumption(commoninonline experimentation) and still provides unbiased average Pre- treatment effect estimation when mixed effect model Experiment Experiment fails. FORME canfit different types of repeatedmea- Groups Week0 Week1 Week2 sures models under the same framework. It also can 1 -T- beappliedtometricsbeyondthosedefinedasasimple A/B Test 2 -C- average,suchasmetricsdefinedasafunctionofother CUPED 1 C -T- metrics. 2 C -C- Parallel 1 T T 2. ILLUSTRATIONOFFORME 2 C C In this sections we will take a close look at several designs, 1 T C Crossover with a treatment and a control, and with experiments car- 2 C T riedoutover several periods. Many commononline metrics 1 C C 2 C T display different patterns between weekdays and weekends. Re-Randomized 3 T C Therefore experiments at Bing and many large IT compa- 4 T T nies, in general are run for at least a full week to account for the difference between weekdays and weekends. In the Table 1: Repeated Measures Designs following section we assume the minimum experimentation “period”to be one full week, and may extend to up to two In this paper we extend the idea further by employing the weeks. To facilitate our illustration, in all the derivation repeated measures design in different stages of treatment in this section we assume all users appear in all periods, assignment. The traditional A/B test can be analyzed us- i.e. no missing measurement. We also restrict ourselves ing the repeated measures analysis, reporting a“per week” to metrics that are defined as simple average and assume treatment effect, as show in row 3“parallel”design in ta- treatment and control have the same sample size. We fur- ble 1. The two week experiment can be considered to be ther assume treatment effects for each subjects are fixed. conducted in two periods, even though users received the We emphasis this is just for illustration purpose and our same treatment assignment during both periods. In one methoddoesnotrelyontheseassumptionsandwedescribe of the new designs, the“crossover”design, in contrast, we how we handle missing data and more complicated metrics swaptreatmentassignmenthalfwaythroughtheexperiment in Section 4. Impatient reader who are familiar with re- (row 4 in table 1). Each user will be exposed to both ver- peated measures analysis might jump over to Section 5 to sions of the treatments, instead of only one of the two in seedetailsofFORME’smodelassumptionsandcomparison the usual A/B testing scenario. In sequence, a user will re- to linear mixed effect model. ceive either T followed by C, or C followed by T, with the flight re-assignment happening at the same moment for all Denote the metric value mean in the treatment group as users. Instead of randomizing treatments to users, we ran- µ , and that in control as µ . We are interested in the T C dom treatment sequences (TC or CT) to users. This way average treatment effect (ATE) δ = µ −µ which is a T C eachuserservesashis/herowncontrolinthemeasurement. fixed effects in the model in this section. This way, various In fact, the crossover design is a type of repeated measures designs considered can be examined in the same framework design commonly used in biomedical research to control for and easily compared. We will proceed to show, with theoretical derivations, that 2.1 TwoSampleT-test given the same total traffic Let X denote the observed average metric value in control group and Y denote that in the treatment group. Since • Variance using CUPED ≤ T-Test usersarerandomlyassignedintoeithertreatmentorcontrol • With CUPED: Variance in parallel design ≤ Cumula- group, X and Y are thus independent. For simplicity of tive Design notation,weassumedvarianceinthetwogrouptobeequal. • Variance in Crossover design ≤ Parallel Design Given large enough sample size, under CLT, and plug in observed sample variances, we have: Denoteobservedsamplevaluesinthetreatmentgroupsand ttiomreofpmereiotrdicsavsalXu(cid:126)e,saXndftohreidriffmeeraennstβt(cid:126)i.mNeopteeritohdastiXn(cid:126)diesxaedvebcy- (cid:104)−CT(cid:105):(cid:104)XY (cid:105)∼N(cid:16)(cid:2)δ+µµ(cid:3),(cid:104)s02X s02Y (cid:105)(cid:17) i i,andthetreatmenteffectδcanbeformulatedasafunction where µ is mean metric value in the control group, δ is the of β(cid:126) depending on model specification. Under the central treatmenteffectcomparedtothecontrolgroup,ands2X and limit theorem (CLT), with sufficiently large samples X(cid:126) is s2Y are variances of X and Y respectively. Here (cid:126)λ = (µ,δ), asymptotically normal andthe−2logLikelihood,denotedbyl,ofparametervector β(cid:126)((cid:126)λ)=(µ,δ)T given observed data is then X(cid:126) ∼N(β(cid:126),Σ). l=(cid:104) X−µ (cid:105)T (cid:104)s2X 0 (cid:105)−1(cid:104) X−µ (cid:105)+const The likelihood of β(cid:126) given observed data is then Y−δ−µ 0 s2Y Y−δ−µ L= √ 1 exp(cid:0)1(X(cid:126) −β(cid:126))TΣ−1(X(cid:126) −β(cid:126))(cid:1) Solving for MLE of δ and obtain its variance as 2π|Σ|12 2 Var(δˆ)|TTest =s2X +s2Y (1) To get maximum likelihood estimates (MLE) of β(cid:126), denoted It is simply the sum of variances from the treatment and by βˆ, we seek to minimize -2log(Likelihood) control groups, which is the asymptotic variance of X−Y. 1 l=− (X(cid:126) −β(cid:126))TΣ−1(X(cid:126) −β(cid:126))+const 2.2 UsePre-experimentDataforVarianceRe- 2 duction Solving ∂l =0givesMLEofβ(cid:126). Anditsvariance-covariance ∂β(cid:126) At the analysis level, different models seek to explain the matrix is amountofvariationinobserveddata,whichmaycomefrom (cid:20) ∂2l (cid:21)−1 intrinsic, within-user difference, as well as variation intro- Var(βˆ)= =1/I(β(cid:126)) duced by differential treatment. For example, users that ∂β(cid:126)∂β(cid:126)T connect through broadband tend to have faster page load time than people using dial-up connection. This difference where Fisher Information existsregardlessofwhichtreatmentconditionstheusersare I(β(cid:126))=−E(cid:104)(cid:0) ∂ logf(X|β(cid:126))(cid:1)Tlogf(X|β(cid:126))(cid:12)(cid:12)β(cid:126)(cid:105). exposedto,andisthusirrelevantwhenmeasuringdifference ∂β(cid:126) (cid:12) introduced by different treatments. As a result, the mea- surements on the same users over time tend be positively In the following sections we will explicitly model the mean correlated. β(cid:126) asafunctionofotherparametersβ(cid:126)((cid:126)λ),oneofthecompo- nents is treatment effect δ, and study expected variance of CUPED and previous work has established that by includ- the MLEs of(cid:126)λ. In fact this is simply: ing covariates that are unrelated to the treatment, we can Var(λˆ)=I((cid:126)λ)−1 improve sensitivity and reduce variance of estimated treat- ment effect. Specifically, the users’ pre-experiment behav- =(cid:2)(cid:0)∂β(cid:1)TΣ−1E(cid:2)(X−β)(X−β)T(cid:3)Σ−1∂β(cid:3)−1 iorsservers as agood baseline for their behavior duringthe ∂λ ∂λ experiment. Byincludingpre-experimentdataasacovariate =(cid:2)(cid:0)∂β(cid:1)TΣ−1∂β(cid:3)−1 in the regression model for treatment effect, we can reduce ∂λ ∂λ the variance of the estimated treatment effect. Coefficientofvariation(CV)definedasthemeanoverstan- Denote the pre-experiment average metric value to be X darddeviationofametric,determinesthesensitivityorthe 0 and Y for the later control and treatment groups respec- power of the experiment, given the same sample size. To 0 tively. By CLT study sensitivity or power of various experimental designs, osotnnacbvelaewriaeactrhiooasnvseodfeisffetesatrbiemlnistahtmeeddeateshffuaerticntegffsipezceetrisosoidlzesel,yrw.eemSpcaaeinncisfithrceaelnlalytf,iovctehulyes (cid:34)−CCC(cid:35):XXY001∼N(cid:16)(cid:20)δ+µ+µµµ+θθ(cid:21),[Σ0 Σ0](cid:17),Σ=(cid:104)ρss020s1 ρss021s1(cid:105) diagonal cell in Var(λˆ) corresponding to treatment effect δˆ T Y1 gives its variance, and is our main focus in the following of where θ is the difference between the pre-experiment and this section. experimentperiods,i.e. thelongitudinaleffect,andρisthe correlation between the two periods. Here(cid:126)λ=(µ,δ,θ). We Analysis from randomized two-group experiments employs assume correlation ρ in both treatment and control groups thetwosamplet-testundertheusualA/Btestingscenario. to be the same for simplicity. Results do not dependent Asagentleintroductionwewillfirstlookatthet-testusing onthisassumption. Eventhoughthetwotreatmentgroups this notation. are still independent, metric value measured on the same group of users across different time periods are in general In practice, also there is a lot of non-recurring users. We correlated. As we will later see, this correlation effectively opted to show empirical results instead in results section. reduces variances on δˆ. Similarly we can solve for MLEs fromsolvingpartialderivativeofl =0andderivevariances Carefulreadersmayhavenoticed,thismethodmakesakey for these estimates. assumptionthattreatmenteffectδ remainsthesameinthe Var(δˆ)| =2s2(cid:0)1−ρ2(cid:1) (2) twoweeks. Tocheckthisassumption,wecanexplicitlytest CUPED 1 for δ =δ by fitting the model this way: 1 2 It’s easy to see (2) has smaller variance of δˆ than (1) by aaenmcrtootusisnmtteimopfee,2rρiio.2eds.21s,.wtiAhtihssunasomenro-szu’enrbtoeihcsaovprirooesrliatitisvioeun.suρTahallemyoacnmognosudinsifftteenort-f EXXYY1212=(cid:20)δ2δµ+1+µ+µθ+µθ(cid:21) variance reduced is ρ2 that of the original variance. and test for the equivalence of MLEs H : δˆ = δˆ. The 0 1 2 parallel design is appropriate if we fail to reject H . 0 2.3 Cumulativevs. ParallelDesign Notethatinthepreviousdesignwemakenoassumptionon 2.4 CrossoverDesign thedurationofthepre-experimentandexperimentperiods. Now with the preliminary background information setup, Empirical studies in Deng et al. (2013) have shown that we then look at variation reduction achieved through the usingone-weekpre-experimentdataprovidessimilaramount crossover design. of variance reduction as using even longer durations. For simplicity, in practice we recommend using one-week such ThecrossoverdesignemploysasimilarideatoCUPED.In- data. steadofusingpre-experimentdataasthebaseline,incross- over experiments, each user is exposed to both treatments And we have mentioned that to capture the difference be- sequentially, while the order of treatment groups is deter- tweenweekdayandweekends,werecommendrunningexper- minedrandomly. Eachuser’sbehaviorwhileheorsheison iments for whole weeks, typically 14 days. Assuming treat- thecontrolconditionservesasabaselineforhisorherbehav- ment effect is the same across time, this gives us two ways ior on the treatment condition. By accounting for within- of reporting treatment effects, i.e. reporting cumulative ef- user variation, analysis based on the crossover design also fectsforthewhole14days,andreportingweeklytreatment reduces variance for the estimated treatment effect. effect as a weighted average between observed values in the twoweeks. Forthelatter,usingthesamenotationasabove, Incausalinference,weoftenseektoeliminateanyconfound- we have ingfactorsandisolatetherootcauseofobserveddifference. (cid:34)−CCTT(cid:35):XXYY1212∼N(cid:16)(cid:20)δ+µδ++µµ+µθθ(cid:21),[Σ0 Σ0](cid:17),Σ=(cid:104)ρss121s2 ρss122s2(cid:105) DcWoumineesthofirpnaom2t0eo0wb7os)re,krrv(aRinnogdsoetmhnbeiazcauotmuionntaenridsfaucRtsuuebdailntion1tm9h8ae3k;peMottehonregtiacaonlnoaturnotd-l group as the surrogate for counterfactual. This surrogate only works on average. In reality often some imbalance in We can solve for MLE and their variances. some observed or unobserved factors will remain. Cross- s2s2(cid:0)1−ρ2(cid:1) overdesignuseseachtestsubjectashisorherowncontrol, Var(δˆ)| =2 1 2 (3) Parallel s2+s2−2ρs s thus reducing the influence of confounding covariates, and 1 2 1 2 achieve better sensitivity in estimating treatment effect. Fortheformer,ifthemetricvalueisstrictlyadditiveacross time,anexamplebeingrevenue,underourtoymodelwhere Distribution of observed sample averages is: all users appear in both periods, the cumulative treatment effect wo(cid:104)ul−CTd(cid:105)b:e(cid:104)δ˜XY=11++2XYδ22,(cid:105)si∼ncNe (cid:16)(cid:104)2δ2+µ2+µθ+θ(cid:105),[Σ0 Σ0](cid:17), (cid:34)−CCTT(cid:35):XXYY1212∼N(cid:16)(cid:20)δ+µδ++µµ+µθθ(cid:21),[Σ0 Σ0](cid:17),Σ=(cid:104)ρss121s2 ρss122s2(cid:105) Σ=Var(X1+X2). Similarly, treatment effect estimate has variance Using (1), variance for the MLE is s2s2(cid:0)1−ρ2(cid:1) Var(δˆ)| =2 1 2 (5) Var(δˆ˜)| =2Var(X +X )=2(s2+s2+2ρs s ) Crossover s21+s22+2ρs1s2 Cumulative 1 2 1 2 1 2 (4) Comparing (3) to (5), it is obvious that in the crossover Comparing coefficient of variation (CV) in (3) to (4), design, treatment effect has smaller variance as long as the correlation ρ is positive. Similar to CUPED, the amount of Var(δˆ˜) − Var(δˆ) = (s1+s2)2(s1−s2)2 ≥0 sensitivityimprovementisdeterminedbythesizeofρ. The 4δ2 δ2 2δ2(s21+s22−2ρs1s2) larger the correlation between time periods, the more im- provementthecrossoverdesignhasovertheparalleldesign. Equalityholdswhenthetwoperiodshaveidenticalvariance, Theequivalenceoftreatmenteffectcanbesimilarlychecked i.e. s = s . In other words, for additive metrics which 1 2 as in section 2.3. variation over time is large, reporting weekly metrics alone will improve metric sensitivity. 2.5 AbsoluteorRelativeChange? For non-additive metrics, such as ratio metrics like Click Sofarinthispaperweconsideredtheabsolutetreatmentdif- ThroughRate(CTR),thederivationbecomesmoreinvolved. ference δ = µ −µ . In practice we measure thousands of T C metrics simultaneously. These metrics may have vastly dif- no carry over effect exists. Usually a“wash-out”period can ferent magnitude in their treatment effects. Even the same be injected in between treatment periods. metric measured over different duration, or over different sample sizes may have different absolute δ’s. This renders 3.1 Wash-outPeriod comparison of effect size across different experiments diffi- This approach calls for a“wash-out”period after the end cult. To overcome this difficulty, we often seek to measure of the first period, where all users will receive the control. percent delta, %δ = δ ·100%. The relative change is less µC Datafromthewash-outperiodcanbeanalyzedinsimilarly influencedbythebase differenceandisamorerobustmea- in linear mixed model to estimate the carry over effect and sure of treatment effect. In online experimentation we usu- subsequentlyinformthedesignoflaterstage. Weleavethis ally deal with hundreds of thousands of samples, therefore as an exercise to the reader. CLT still holds and relative change would still have asymp- toticnormality. Theadditivemodeldescribedabovecanbe 3.2 EstimateCarryoverEffect readilyadaptedtomodelrelativedifferenceinsteadofabso- lutedifference,byformulatingtheexpectedgroupmeansin Inthecrossoverdesignwhereonlytwogroupsareallocated, the mixed variance-covariance structure model. For exam- it is not hard to see that potential carry over effect is con- ple, the crossover model with relative treatment effect can founded with the week to week difference of treatment ef- be written as: fect. Using only the crossover model, we can only measure one of these two effects. As an alternative, at the cost of (cid:34)−CCTT(cid:35):(cid:34)XXYY¯¯¯¯1212(cid:35)∼N(cid:16)(cid:34)µ(µ1(µ+1+µ+δθ)δ+)θ(cid:35),[Σ0 Σ0](cid:17),Σ=(cid:104)ρss121s2 ρss122s2(cid:105) lwfeoesllsomewffiainycgiee4nst-cgiymrogauatpiendtohevseeigrcnat.hrreytroavdeirtioeffneacltnoenxp-clricoistsloyveursidnegsitghne, Theoreticderivationtoshowvariancereductioncanbecom- plex, but MLE estimates and their variances can be easily In a 4-group re-randomized design, we conduct the experi- solved using numeric methods. ment over two periods, and split the users into four equally sizedgroups,onereceivingcontrolsinbothperiods,onere- 2.6 TheUnifiedTheme ceivingtreatmentsinboth,andonereceivingtreatmentfol- We illustrated different model designs of FORME. Careful lowed by control, and the last receiving control followed by readersmightalreadynoticedthattheunifiedthemehereis treatment, we can then tease apart carry over effect and to study the joint distribution of X(cid:126) and Y(cid:126), which by cen- treatment effect. We call this the re-randomized design, as trallimittheoremisknowntobemultivariatenormal. Each it is equivalent to having another round of user randomiza- model specification maps to the mean vector β(cid:126) of this mul- tionbetweenthefirstandthesecondperiod. Usingnotation tivariatenormal. Thereforeforanymeanvectorbasedona from the linear mixed model, the model considering a po- model specification, we can solve the MLE and estimate its tential carry over effect α is then vliaersiainncheowusitnogeFstiismheart’estIhnefocromvaatriioann.ceTmheatdriiffixcinulgtye,nheroawlecvaesre, −CT XX12 δ+µµ+θ wainrietphnaprotrtiecdsueelnfiacrne,edwofeamsailsssisominnpgeleeddaatasaawavnaeydraitgnoeg.deenFceoidrraeclrwfoohsrsemothveeetrrridwcseestichgaannt, −−CCCT:YYZZ1212 ∼N(cid:16)α+µδδ+++µµµµ+θθ,(cid:20)Σ000 Σ000 Σ000 Σ000(cid:21)(cid:17), safely assume the treatment effect in both periods are the T W1 δ+µ+θ same without any carry over effect. We will address these T W2 indetailsinSection3andSection4. Section5explainswhy Σ=(cid:104) s21 ρs1s2(cid:105) FORME is more flexible and scalable than existing method ρs1s2 s22 of fitting linear mixed effect model. Thecarryovereffectisinthegroupthatreceivedfirsttreat- mentandthenrevertedbacktocontrolinthesecondstage. 3. CARRYOVEREFFECT Using observed data we can estimate carry over effect α as an additional term in β(cid:126). When α is not statistically signifi- Thecrossoverdesignisnotwithoutconcerns. Animportant cantunderpre-specifiedtypeIerrorcutoff(usually0.05),it assumption in crossover model is that the treatment effect remains the same in the experimental periods. Since test issafetodropthetermαfromβ(cid:126),andre-fitthemodel. This subjectsrandomlyreceiveallcombinationsoftreatmentsin way, we gain one more degree of freedom and thus reduces sequence, different users will receive difference sequence or variances in the MLEs. “order”thetreatments. Itispossiblethattheorderinwhich users are exposed to treatments may change the effect. For Thisapproachenablesdirectestimationofcarryovereffect. an extreme example, suppose our treatment introduced a It can also be considered a hybrid approach between the bug that results in severely negative user experience, and crossover and parallel design, as half of the users received these group of users fail to revisit the website in the later crossed treatments, and half received the same treatments crossoverperiod,thetreatmenteffectisthendifferentinthe in the two periods. It is not hard to see this design will two periods. We call this the carry over effect, as the users achieve sensitivity improvement between the crossover and exposed to treatment first, and then the control later may parallel designs. behave differently from the other group. 4. MISSING VALUES AND METRICS BE- In some experiments where the treatment condition is less YONDAVERAGE noticeable to the users, the expected treatment effect is small, and based on historical insight, it is safe to assume 4.1 Lossoffollow-upandintenttotreat Loss of follow-up is a common term in clinical studies. It ing it as simple average over page level measurements this refers to patients who were active participants during some needsextracare. Abetterapproachistoseethisasaratio period of the trial, but stopped participating at some point oftwouserlevelmetrics: clicksperuserandpage-viewsper of follow-up. This can lead to incomplete study results and user. potential bias when the attrition is not random. Intention- (cid:80) (cid:80) (cid:80) Clicks Clicks / I to-treatanalysisiscommonlyemployed,whenthesubjects’ (cid:80)useri i = (cid:80)useri i (cid:80)useri i, Pages Pages / I initial assigned treatment is used regardless of actual re- useri i useri i useri i ceived treatment. In online A/B testing the idea is similar; where (cid:80) I is the the count of appeared users. The usersareassignedtotreatmentgroupsatsomepointintime same deltuas-emrietihod mentioned in Section 4.1 naturally ex- before the experiment starts, often by user id, but may or tends here, with slightly more complicated formula. Since may not appear in the actual duration of the experiment. delta-methodappliesingeneraltoanycontinuousfunction, Thismissingpatternisfarfromrandom,thereforemethods we can handle any metric that is defined as a continuous that rely on strong MCAR assumption (missing completely function of other metrics. at random) are not appropriate and even MAR (missing at random) assumption is questionable as it requires missing 5. FLEXIBLEANDSCALABLEREPEATED pattern to be random conditioned on observed covariates. MEASURESANALYSISVIAFORME One way tosee measurements are notmissingat randomis to realize infrequent users are more likely to have missing 5.1 ReviewofExistingMethods values and the absence in a specific time window can still Itiscommontoanalyzedatafromrepeatedmeasuresdesign provideinformationontheuserbehaviorandinrealitythere with the repeated measures ANOVA model and the F-test, might be other factors causing user to be missing that are undercertainassumptions,suchasnormality,sphericity(ho- not even observed. Instead of throwing away data points mogeneity of variances in differences between each pair of where user appeared in only one period and is exposed to within-subject values), equal time points between subjects, only one of the two treatments, in practice, we included an and no missing data. Such assumptions in general do not additionalindicatorforwhetherornottheuserappearedin hold for large-scale online experiments, where the assign- the study in the period. ment of users into different treatment group may not be completely balanced. Specifically, we use an additional indicator for the pres- ence/absence status of a user in each experimentation pe- A generally more applicable method is to analyze the data riod. For user j in period i, let Iij =1 if user j appears in usingthelinearmixedeffectmodel,forwhichcompletebal- period i, and 0 otherwise. For each user j in period i, in- ance is not necessary (Bates et al. 2012a). In particular, a steadofonescalarmetricvalue(Xij),weaugmenteditinto linear mixed effect model treat each measurement of a sub- a vector (Iij,Xij). When Iij = 0, i.e. user is missing, we ject as a data point, and model the measurement as defineX =0. Underthissimpleaugmentation,themetric ij valueX forperiodi,takingaverageoverthosenon-missing Y =θ+αX+βZ+(cid:15) i measurements,isthesameas (cid:80)(cid:80)kkXIiikk. Inthisconnection,to Here θ is the global mean and α stands for the vector of all obtain MLE and its variance, we only need to estimate the deterministicfixedeffectswhileβisthevectorofallrandom covariance matrices for each group across time periods, i.e. effects and (cid:15) is noise. X and Z are covariates in the model. Cov(Xi,Xi(cid:48))=Cov(cid:18)(cid:80)(cid:80)kkXIiikk,(cid:80)(cid:80)kk(cid:48)(cid:48)XIii(cid:48)(cid:48)kk(cid:48)(cid:48)(cid:19) IpAnesroiaoundrsecoxafastmehspeltmeh,eeyaosnuaerreepmoisnesnditbic,laeutsomerrosiddoe,flatnfroderaartnmeypeeonatthteeadrsscmiogvenaamrsiuearnteets., =Cov(cid:18)Xi,Xi(cid:48)(cid:19) using lme4’s formula syntax (Bates et al. 2012a;b) is Ii Ii(cid:48) Y ∼1+IsTreatment+Period+(1|UserID), wherethelastequalityisbydividingbothnumeratorandde- where the only difference of this model to the usual lin- nominatorbythesametotalnumberofuserswhohaveever ear model behind two sample test is the extra random ef- appeared in the experiments. Thanks to the central limit fect(clustered by UserID) to model user “baseline”. More theorem, the vector (Ii,Xi,Ii(cid:48),Xi(cid:48)) is also asymptotically complicated models exist to further model interaction and normal. Plugging in observed sample means and covari- joint random effects. ance matrix, Cov(Xi,Xi(cid:48)) can be trivially computed with the delta-method; see Deng et al. (2013, Appendix B) for Random effect makes modeling within-subject variability a similar example; also see (Van der Vaart 2000) for a text possible. In repeated measures data, users might appear in book treatment of the delta-method. multipleperiods,representedasmultiplerowsinthedataset. As a result, rows of the dataset are not independent but 4.2 MetricsBeyondAverage withdependenciesclusteredbyuser. Toseethis,eachuser’s Treatment groups are assigned to users, but not all metrics “baseline”measurementiscapturedasarandomeffect. The are simple averages across users. We can define a metric as same user in different period will share the same“baseline” a function of other metrics. One important family of met- random effect, therefore resulting in dependency. Mixed ef- rics is page level metrics such as click through rate. Page fect model effectively takes advantage of this and is able to level metrics use number of page-views as their denomina- estimatethevarianceoftherandomeffectwhilereducingthe tor. At first glance it might look like just a simple aver- variance of average treatment effect. In the case of cross- age. Sincetreatmentsareassignedtousers(theindependent over design, the model can take advantage of the positive unit), page-views are therefore not independent. Consider- correlationbetweenthetwoperiodsofthesameuser,which improvesaccuracyintheestimationoftreatmenteffect,sim- linear mixed effect model. In our experience FORME is 1 ilartotheillustrationwederivedinSection2.4. Treatment to 2 magnitudes faster than lme4 with much less memory effect can be modeled as either a fixed effect or random ef- footprint even without map-reduce type parallelism. fect1. If our interest is the average treatment effect, we can model it as a fixed effect. Note that modeling treatment In the remaining of this section, we explain why FORME effect as fixed effect does not mean we need to assume it is is both more scalable and flexible than linear mixed effect fixed,whichingeneralisnotsincedifferentsubjectsreactto model. the treatment differently, but rather because the focus here is the mean of the random treatment effect, not the vari- 5.2 FORMEisScalable ance of the random treatment effect. One can still fit the Insteadofmodelingatthelevelofeachindividualmeasure- model with random treatment effect and the results gener- ments, FORME sees the problem from a higher level and ally agree, though fixed effect is believed to be more robust takeadvantageofbigdata. Basedoncentrallimittheorem, against model assumptions; see Wooldridge (2012). metrics of interest in each period for treatment and control followsnormaldistribution. UsingthesamenotationinSec- We point out two issues of using traditional mixed effect tion2,thismultivariatenormalrandomvectorisdenotedby model, and claim that FORME is a better alternative on X ,Y ,i=0,...,2, with mean β(cid:126)((cid:126)λ) and certain covariance axes of flexibility and scalability. i i matrix. These metric values are correlated with each other via common user level random effects modeled explicitly in First, linearmixedeffectmodel(andalsogeneralizedlinear linear mixed effect model but not in FORME. This is be- mixedeffectmodel)isafamilyofparametricmodels,andre- cause when our interest is only in the average treatment lies on full knowledge of the likelihood function to perform effect, the estimates of those random effects are irrelevant. parameter fitting. This means the model need to rely on Instead,FORMEseestheaveragetreatmenteffectδ asjust distributionalassumptionssuchasnormality. Inparticular, all random effects are typically modeled as normally dis- oneparameterinthemeanvectorofthemetricvaluesβ(cid:126)((cid:126)λ). tributed or jointly normally distributed. And noise (cid:15) need That is, when modeling metric values directly using mul- to be either i.i.d normal or the modeler needs to provide a tivariate normal distribution with parameters in the mean knowncovariancematrix. Theseassumptionsareindispens- vector, all the complexities involving the structures of the able in the theory and pivotal in the fitting of the model. randomeffectsareburiedinthecovariancematrixofmulti- For our application in online A/B Testing, many of these variate normal and we are left with a simple task, which is assumptionsareinappropriate. Tonameafew,forametric to estimate the parameters(cid:126)λ of this multivariate normal. like revenue per user, it is inappropriate to model the user “baseline”revenue per week as normally distributed due to FORMEestimates(cid:126)λbyfittingMLE.Theuseofasymptotic itslargeskewness. Alsothenoiseterm(cid:15)ishardtojustifyto statistics also guarantees that the estimates are normally betrulyindependentofotherrandomeffect. Aheavieruser distributedwithcovariancematrixderivedfromFisher’sIn- mighthave bigger“baseline”revenue, andalsobigger noise, formation(VanderVaart2000). Notethatthescaleofthis and bigger (or smaller in some cases) treatment effect. It stepismuchsmallerthantheMLEfittingofatypicallinear alsoassumesdataaremissingatrandom. Modelersoflinear mixedeffectproblem. FORMEonlyneedtofitamultivari- mixedeffectmodelwillneedtomodifythemodelbymaking ate normal with small dimension, typically smaller than 12 randomeffectsjointlyrandom,orincludingmoreinteraction (6 in a crossover design: treatment and control for each of terms. However the more complicated the model, the more the pre-experiment, period 1 and period 2.) questionsonmodelassumptionswillarise. WeshowinSec- tion 6 through simulation study, linear mixed effect model Themaincomputationburdenisthereforeintheestimation fitted in R package lme4(Bates et al. 2012a) could result in of covariance matrix. Fortunately, this step only involves biasedestimationoftheaveragetreatmenteffectwhenthere estimation of pair-wise covariance between metric values, iscorrelationbetweendatamissingpatternanduserrandom andtheyallcanbemap-reducedwithonepassofthedata. effect. To handle missing data and general form of metrics (as a continuous function of other metrics), delta-method can be Second,fittingmixedeffectmodelcouldbeexpensive. Avail- employed(Section4). Theapplicationofdelta-methodonly able packages in SAS or R are based on fitting MLE or involves slightly more complicated covariance matrix so we REML(restrictedmaximumlikelihood). Ineithercase,much need to estimate more covariance pairs in one map-reduce effort is taken to estimate the variance of random effect(s) pass of the data, inducing negligible increase in complexity. or covariance matrix if they are jointly random. Fitting al- gorithm takes the full dataset with each row representing a 5.3 FORMEisFlexible measurement. InonlineA/Btesting,wheretensofmillions FORMEisnotonlyscalablebutalsomoreflexible. Because of users are involved, this dataset could be large. In model FORME doesn’t explicitly model random effects as linear fitting, each iteration requires some operations on this full mixed effect models do, FORME makes no distributional dataset. Making the efficiency of model fitting a concern assumptions on random effects and noises (cid:15). FORME also in big data scenario. To the authors’ best knowledge, there make zero assumption on missing data pattern. FORME is no literature on the topic of big data implementation of needs only one critical assumption , i.e. that central limit theoremisapplicable,whichisrarelyviolatedinonlineA/B 1When there are only two measurements for a subject like crossover design, modeling treatment effect and user“base- testing, since traffic size is large enough even for the most line”bothasrandomeffectisunidentifiable. Butthemodel highlyskewedmetricssuchasRevenue(Kohavietal.2014). can be fit if there are more measurements per subject. Specifically, FORME can be applied to all these cases: 1. Data can have arbitrary missing pattern. In other and all conditions have roughly 50% user missing in each words not assumptions on missing at random. period. 2. Treatment Effect is random. 3. TreatmentEffectanduserrandomeffect(baseline)are First of all in condition 3 when there is a random treat- not independent. ment effect, we found lme4 consistently gave biased esti- 4. Noises (cid:15) are not i.i.d. mation(when ground truth effect is 6.6, FORME estimates 5. Noise and random effects are not independent. are very close to ground truth while lme4 always gave bi- 6. Interactions. (e.g. treatment and control have differ- ased estimation around 7.2). This is because lme4 relies on ent user random effect distribution, etc.) the assumption of missing at random, and it is violated as randomeffectisnegativelycorrelatedtothechanceofmiss- To close this section, we make the final remark that the ing. We believe this is a fundamental problem issue with flexibility of FORME really comes from its simplicity, com- mixed effect model as missing data pattern is often corre- paring to linear mixed effect model. We believe FORME lated with some underlying user characteristics that is cor- is also easier for practitioners to understand. The cost of relatedwithuser’sresponsetotreatment. Onemightargue FORME to put less assumptions than mixed effect model thatmixedeffectmodelcanbefixedbythrowingmoreinter- is the expectation that when mixed effect model assump- actionterms. Howeverinpracticemorecomplexmodelsare tions hold, FORME estimate could possess larger variance often not identifiable (parameters more than data points) than mixed effect model estimate. Next we’ll explore these andtheyonlymakesmoreassumptions. Wealsonotedthat through simulation study. exceptforcondition3,lme4providedunbiasedestimatesfor condition 2, 4 and 5 where some assumptions in mixed ef- 6. RESULTS fect model are violated. We believe central limit theorem also helped in this case for lme4 to stay unbiased. But bias 6.1 SimulationfromKnownDistributions in condition 3 seems to be more fundamental. We leave a WecomparevariancesreportedfromourFORMEproduces more thorough study of the bias of lme4 with violation of to the traditional linear mixed model under various simu- different assumptions in future work. lation assumptions. As illustration we used the crossover design. We simulate a total of 2N users, where N =10000 WealsocomparedvarianceinLMEandFORMEunderthe and randomly split them into two treatment groups. crossovermodelbelowinFigure1. BothFORMEandlme4 X =µ+δ +u +(cid:15) ,(cid:15) ∼N(0,σ2) provided very good estimation of variance. And also as ex- ij ij i ij ij pected FORME pays a price for its flexibility and almost whereiisindexforuserandjfortimeperiod. (cid:15)ij represents “model free”as variance from lme4 estimations are gener- random noises and ui represents random user“baseline”ef- allysmaller. Thevariancegapisbiggerwhenmissingrateis fect. δij isthetreatmenteffectforuseriinperiodj (0ifnot higher and between-periods correlation is higher. Although intreatment). Inthismodel,thebetweenperiodcorrelation not shown, in the conditions when either there is no miss- is then σ2 . If user i is in treatment for time period j, ingdata,orthecorrelationis0,FORMEandlme4estimates σu2+σ2 havethesamevariance. Althoughlme4estimatehassmaller δ ∼N(δ,σ2)×p ,whereδ×E(p )isthegroundtruthaver- ij δ i i variance,itspotentialbiasisashow-stoppersincefortreat- agetreatmenteffectsize,p isacontinuousvaluebetween0 i ment effect estimation a low variance estimate is not useful and1,anditrepresentstheuser’sactivitylevel. Wedesigned if biased. p tobecorrelatedtou . Thiswayweallowtreatmenteffect i i tovarybyhowfrequentauservisitsthesite. Finallyweal- lowX tobemissingwithprobabilityofmax(90%,1−p ). ij i This is intuitive since a less active user would be missing more often. Note that in this simulation study we know exactly what the true average treatment effect is. We sim- ulated this process K = 10000 times so that we can have a good estimate of the ground truth variance of treatment effectestimatedbyFORMEandmixedeffectmodel(lme4). We want to learn the following for both FORME and lme4 from this simulation study: 1) is estimate unbiased, 2) is variance estimation correct. If both methods are unbiased, thenwewanttoknowwhichonehassmallervariance. With- out loss of generality, we used µ = 0, σ = 4, σ = 2 or 4, √ u δ=10,andσ =0.1 12. Wechose5simulationconditions δ as the following: Figure 1: Effect Variance in LME and FORME 1. Normalnoise,notreatmenteffect,normaluserrandom effect u ∼N(0,σ2) 6.2 SimulationfromEmpiricalData i u 2. Normal noise, no treatment effect, Poisson user ran- Next we randomly sampled from our in-house data a small dom effect u ∼Poisson(σ2) subset of N = 1250 users, randomly split the users into i u 3. Normal noise, with random treatment effect that is equal sized subsets, and applied various designs. We then correlated with user random effect: N(δ,σ2)×p simulate K = 10000 bootstrap samples (with replacement) δ i 4. NoiseiscorrelatedwithUseractivitylevel: σ=2×p from this dataset, fit FORME and report estimated MLEs. i 5. NoiseiscorrelatedwithUseractivitylevel: σ=4×p The variance based on these MLEs are then compare to i the variance estimated from Fisher Information using the full dataset. Figure 2 shows the two agrees well. Note the cumulativeeffecthaddifferenteffectsizefromtherestofthe designs. Forthisparticularmetric,usingCUPEDresultsin roughly 50% reduction in variance. Crossover design shows reduction of around 50% compare to the parallel design. Figure 3: Percent samples needed to achieve the same sensitivity for four metrics. Baseline is the crossover design. significanceinKPI,wemaychoosetoterminatetheexperi- mentalready. Howeverinpractice,wegenerallyrecommend Figure 2: Effect Variance from Fisher’s Information runningtheexperimentslongenoughtogainenoughpower and Bootstrap method for not only the KPI, but other metrics designed to moni- tor data quality and serve as guardrail against unexpected changes. 6.3 RealExperiments Otherwise, in running the second stage, we can use domain Finally, we report results from three typical metrics in one knowledge to inform about carry over effect. If historical of our real experiments. Here we used percent change as experiments in similar feature iterations indicate potential the effect size. This way, weekly effect size is comparable carryovereffect,werecommendrunningacomplete4-group to cumulative effect size across two weeks. The variance of crossoverexperiment,sowecandirectlyestimatecarryover effect size therefore indicates sample size needed to achieve effect. Otherwise, we recommend using the 2-group cross- the same sensitivity. Figure 3 displays the percent samples over design to achieve the maximum power for KPI. If we needed to achieve the same sensitivity for three metrics us- are not sure, it is still possible to leave a few days’“wash- ing various models, with the crossover design as baseline. out”period after completing the first stage, and see if any Therefore crossover design had value of 100. All models carry over effect can be observed. included CUPED since pre-experiment data always exists andisfree. Thecrossoverdesignconsistentlyhadthefewest • No swapping: When it is critically important to en- samplesneeded. Nextthere-randomizedesignhadvaluebe- sure consistent users experience, such as changing the tween crossover and the parallel design. Cumulative design entirelayoutofasite,itmaynotbedesirabletoshow follows. When the re-randomize model includes a leftover usersthenewsiteforaweek,andthenswapthemback effect, the samples needed can be larger than cumulative to the old site. The experience may be too jarring to design for metric 2. Note that compared to the previous users and hurt the brand. In such cases, we do not benchmark,thecumulativedesign,thecrossoverdesigncan recommend re-assigning treatment variants half way save up to 2/3 the traffic for metric 3, while for others, the through. traffic savings is in the 30-40% range. This is due to inher- • Crossover: Relatively small changes that are less di- ent difference in week to week correlation in different met- rectly noticeable are better candidates for treatment rics. Notethedrasticreductioninvarianceforsuchmetrics swapping. If similar experiments from the past, or means the same feature can be tested with only 1/3 of the earlierexplorationdatadonotindicatethepresenceof original traffic! carryovereffect,thecrossoverdesigncanbeemployed. • Re-randomized: If we suspect the presence of car- 7. PRACTICALCONSIDERATIONS ryover effect, the re-randomized design enables us to At the design stage, we face a few choices under the same measure it directly and should be used here. frameworkofrepeatedmeasuresdesign. Experimentersshould • Wash-out and decide: If we have little informa- use domain knowledge and past experiments to inform the tion to judge carry over effect, we can run the first design. This is rather an art than pure science. Here we week of the experiment, and then leave a few days as give guidelines according to our own experience. a “washout” period. The next stage is data driven. Using such data we can estimate the carry over effect 7.1 RecommendedWorkFlow explicitly. Due to the flexibility in a two-stage setup in repeated mea- – Ifthereisnosignificantcarryovereffect,proceed sures design, we can use the information gathered in the as the crossover design. first stage to inform procedures in the next stage. We rec- – Otherwise, proceed as the re-randomized design. ommend using crossover design as validation stage experi- ments, for which we already have gathered exploratory di- Having collected experiment data, they can then analyzed rectional data. If the first stage already result in statistical in the following work flow to achieve the most power. No swapping: 8.1 Extendingtomorefrequentswaps – Test equivalence of treatment effect across time The crossover design achieves sensitivity by exposing users – Iftheyareequivalence,reporttreatmenteffectin to both treatment variants in sequence, by swapping the the“pertimeunit”metricvaluesbyanalyzingus- treatment assignment once during the experiment. Using ing the parallel model, including pre-experiment each subject as his or her own control and this design to data. account for within-subject variance. A natural extension – Otherwise, analyze only cumulative effects, and of the idea is to swap treatment groups more than once. includingpre-experimentdata. NotethisisCUPED. Essentially, this changes to a more granular randomization Crossover design: unit,fromuserstopageviews. Exploratoryworkshowsthis – Test equivalence of treatment effect across time indeed achieves further variance reduction. – Iftheyareequivalence,reporttreatmenteffectin the“pertimeunit”metricvaluesbyanalyzingus- However,thisalsoraisestheconcernforinconsistentuserex- ingthecrossovermodel,includingpre-experiment perience,diminishedtreatmenteffectsize,strongerlearning data. effect,andlackofalongertermmeasure. Despitethesecon- – If, however, unexpected significant difference is cerns,itremainsavaluableoptioninearlystageexperiments found, you have several choices to quickly select promising features for further iteration. ∗ Report the two treatment effects separately ∗ To understand the difference properly, an- 8.2 Limitationsandconcerns otherphraseoftheexperimentcanbeadded, using re-randomized design. With a total of Duetouserbehaviordifferencesbetweenweekdayandweek- three weeks’ data, we can see whether the ends,weusuallyrecommendrunningeachphaseofthecross- treatmenteffectdifferenceisduetotrueweek- over design for at least a full week. A crossover experiment to-week to difference, and study its trend, or thenrequirestwocompleteweekstogatherdata,whichhin- due to carry over effect. dersagility. Anotherlimitationisthatforveryhighlyvisible Re-randomized design: featureslikechangingprominentUIfeatures,suchswapping – Test equivalence of treatment effect across time maynotbedesirablesinceitmayconfusetheusers. Finally, and presence of carryover effect not all features can be tested this way, as there might be – Reducethemodelifanyoftheeffectsarenotsta- a“learning”effect, where we can’t have the users exposed tistically significant, and report treatment effect. to treatment “unlearn” the feature, while having controls naivetothetreatment. Forexample,ifthewebsiteprovides This carries the subtle difference of reporting a treatment new features and personalized content to signed in users to effect in the entire duration of the experiment, versus that encourage higher rate of signing in and staying signed in. pertimeunit(aweekhere). Wearguethataslongasweekly Theseuserscannotthenbeforcedtologoutintothecontrol treatment effects are stable over time, reporting weekly ef- group. Ma et al. (2011) shows one interesting case where fect is intuitive, easy to understand, and easy to compare crossover design can be extended to tackle this issue. across different experiments. In real life, various things can happen during an experiment, and we may end up with an 8.3 FurtherImprovementsofFORME experimentthatranonlyinpartialweeks. Inthesecases,re- We’veshowninSection6.1thatmixedeffectmodelvialme4 portingtreatmenteffectintheentiredurationwillbebetter provides a competing estimate of the average treatment ef- than throwing away data or ignore weekdays difference. fect that could be biased when missing data pattern corre- latewithuserrandomeffect,butoftenwithsmallervariance 7.2 SampleSizeConsiderations thanFORME.WenotedthatFORMEhastopaysomeprice Whiledirectestimationinsamplesizeisdifficultinthelinear to be more flexible and robust, similar to nonparametric mixed model, in practice there is an easy work around. In model usually is less efficient than their parametric coun- the traditional design, using CLT, with a simple two-sided terparts. However we suspect that efficiency of FORME test for H :δ=0, sample sizes can be easily calculated. 0 can be further improved to match the efficiency of mixed δ2 effectmodelevenunderperfectmixedeffectmodelassump- n=(z +z )2/ 1−α/2 1−β Var(δ) tion. Such improvement would be very desirable. But even without such improvement we believe the bias when there where α is the allowed false positive rate, usually 0.05, and is missing data that is not missing at random is a big issue 1−β is the desired power, usually at 80% to 90%. for mixed effect model to be adopted in online controlled experiment. And FORME should be used instead. From historical data we can record the amount of variance reduced for each metric. The magnitude is determined by inherent variance in the metric, and correlation across time References periods,bothofwhichareobservedtobefairlystableacross Bakshy,E.andFrachtenberg,E.(n.d.),‘StatisticsandOpti- many experiments. Suppose the variance for metric X in malDesignforBenchmarkingExperimentsInvolvingUser crossover experiment is k% that in the conventional t-test. Traffic’. IfN subjectsarerequiredtodetectachangeofδ%int-test Bates, D., Maechler, M. and Bolker, B. (2012a), ‘Fitting with,say,80%power,thenk%N isthereducedsamplesize linear mixed-effects models using lme4’, J. Stat. Softw. to achieve the same power. p. 51. Bates,D.,Maechler,M.andBolker,B.(2012b),‘lme4: Lin- 8. DISCUSSIONSANDFUTUREWORK ear mixed-effects models using S4 classes’.