Bayesian Diagnosis Tracing Model: Assigning Blame for Multiple Knowledge Components (Draft)

I. Introduction

When a practice item is linked to multiple knowledge components, there is no widely accepted best practice for assigning blame for a wrong response. In the context of mathematics education in China, the blaming problem comes into prominence as early as second grade. Despite its practical importance, the problem is under-studied in both the literature and the industry. The solutions known to the authors are flawed as formative assessment tools, as we discuss in the next section.

This paper proposes a new framework, Bayesian Diagnosis Tracing (BDT), to deal with the blaming problem in a logically consistent way without losing track of learning progress. The BDT model combines two well-established streams in the learning analytics literature: research on the Bayesian Knowledge Tracing (BKT) model and the study of procedural misconceptions. In a nutshell, the binary response in the BKT model is replaced with a vector of misconception diagnoses, allowing an intelligent tutor to mimic the reasoning of a human tutor.

II. Relevant Theories of Learning

A.   The Bayesian Knowledge Tracing Model and the Cognitive Diagnosis Model

The BKT model is the most widely accepted analytical framework for modeling the learning process. It carries two implicit assumptions. The first is that each practice item maps to exactly ONE knowledge component. The second is that wrong responses are qualitatively the same conditional on the latent state[[1]]. In the context of multiple knowledge components, the first assumption is clearly violated. A common practice is to pretend that the learner practices N separate items, each mapping to one knowledge component: if the learner answers the original item correctly, they answer all N artificial items correctly; otherwise, they fail all N artificial items. This method has a fatal logical inconsistency. In the case of a correct response, the components are treated as if they were connected by an “AND” gate; in the case of a wrong response, as if they were connected by an “OR” gate. Both statements cannot be true for the same item.
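To see the inconsistency concretely (our own illustration, not from the original text): suppose the learner has mastered the two components independently with probabilities $p_1$ and $p_2$, and suppose responses are noise-free. Crediting a correct response to both components and blaming a wrong response on both components implies

    \[P(\text{right}) + P(\text{wrong}) = p_1 p_2 + (1-p_1)(1-p_2) \neq 1 \quad \text{in general};\]

for $p_1 = p_2 = 0.5$ the two cases sum to only $0.5$, so the two credit-assignment rules cannot describe the same item.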

The Deterministic Input Noisy “And” gate (DINA) model avoids this logical fallacy. However, the DINA model, and likewise the whole family of cognitive diagnosis models, cannot replace the BKT model because it models no learning process. Tracing the progress of a learner is a must-have feature of an Intelligent Tutoring System (ITS). In addition, the DINA model only uses the binary response, which means it throws away the information encoded in the different wrong responses.

To the best of our knowledge, no analytical framework can logically handle the problem of blaming multiple knowledge components in the context of dynamic mastery tracing. The ambition of this paper is to do just that.

B.   Procedural Misconceptions

Inspired by VanLehn's (1990) study of procedural misconceptions, this paper proposes a semi-automatic wrong-response generator based on a data-informed expert system. Such a data-informed expert system is important because the distribution of wrong responses has a long tail. Without a frequency table built from a large amount of data, experts are likely to recall only a few common mistakes, which probably account for less than 20% of the cumulative distribution. Conversely, without the help of a pedagogical expert, one cannot make sense of the wrong answers.

III. Enabling Technological Advances

A.   Variants, Solution Trees, and Automatic Diagnosis of Procedural Misconceptions

This paper defines the solution tree as the set of all possible paths of mathematical reasoning for solving a problem, including paths of correct reasoning and paths of incorrect reasoning[[2]]. A variant is defined as the class of problems that are identical in solution tree[[3]] but different in application settings[[4]]. In essence, the distribution of numerical responses to an instance can be generated from the distribution of algebraic responses of its variant once the instance's parameters are plugged in. The application section offers an example of using a solution tree to generate responses.

Given a wrong response to an instance of a variant, the analytical system can trace back through the solution tree and accurately blame the components that contribute to the error[[5]]. Under some assumptions, the answers of a practice sequence can be split into K answer sequences, one for each knowledge component, and the learner's mastery of each component can then be estimated by an independent hidden Markov model (HMM).

B.   Bayesian Diagnosis Tracing Model

Let $S_t = (s_{1,t}, \dots, s_{K,t})$ be the state vector of the $K$ knowledge components at time $t$, with $s_{k,t} \in \{1, \dots, M\}$. Let $T$ be the state transition matrix and $E$ the observation matrix. The response at time $t$ is diagnosed as a vector of diagnoses $D_t = (d_{1,t}, \dots, d_{K,t})$.

The likelihood is given as

    \[P(D_t \mid S_t) = E_{S_t, D_t}\]

The state transition is

    \[P(S_{t+1} \mid S_t) = T_{S_t, S_{t+1}}\]

If there are $K$ knowledge components, each with $M$ states, the state transition matrix $T$ has size $M^K \times M^K$. If there are $L$ possible diagnosis combinations, the observation matrix $E$ has size $M^K \times L$. In real applications, even when $K$, $M$, and $L$ are of reasonable size, the model has too many parameters to estimate. In practice, the joint likelihood has to be broken into the product of per-component likelihoods. In the product form, each component can be estimated by an independent HMM.

The product form requires the following assumptions; a sketch of the resulting per-component update follows the list:

A1: The learning processes of the components are independent of each other.

A2: The diagnoses of the knowledge components are exchangeable[[6]].

A3: The symptom of each knowledge component is unrelated to the other components.
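Under A1-A3, tracing reduces to one standard HMM per knowledge component. The following is a minimal sketch of the per-component forward update, not the authors' implementation; the state count, diagnosis categories, and all parameter values are illustrative.

```python
import numpy as np

M = 3  # latent states for one knowledge component (illustrative)

# Transition matrix T, restricted to adjacent states (rows sum to 1).
T = np.array([[0.90, 0.10, 0.00],
              [0.20, 0.60, 0.20],
              [0.00, 0.30, 0.70]])

# Emission matrix E over per-component diagnoses; columns (illustrative):
# right, wrong formula, flawed execution, give-up, no diagnosis.
E = np.array([[0.40, 0.07, 0.07, 0.22, 0.24],
              [0.70, 0.04, 0.03, 0.05, 0.18],
              [0.97, 0.01, 0.00, 0.00, 0.02]])

def forward_update(belief, diagnosis):
    """One HMM filter step: predict with T, then condition on the diagnosis."""
    predicted = belief @ T                   # state prediction
    posterior = predicted * E[:, diagnosis]  # weight by diagnosis likelihood
    return posterior / posterior.sum()

belief = np.full(M, 1.0 / M)  # uniform prior over the latent states
for d in [3, 1, 0]:           # e.g. give-up, then wrong formula, then right
    belief = forward_update(belief, d)
print(belief)
```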

IV. Real World Applications

This paper illustrates the Bayesian Diagnosis Tracing model, and contrasts it with the classical BKT model, using data collected between October and December 2018 in a Chinese learning app. About one thousand students finished three items in a row. After answering each item, the student learned whether she got it right. In the case of a wrong answer, the correct answer was revealed on the same screen as the mark, and the solution procedure was shown on the next screen.

Here are the translations of the three items.

Q1: The school has a pool, 16m in length and 5m in width. To set up a fence around the pool, the length of the fence is ____ m.

Q2: The width of a rectangle is 3cm. The length is twice the width. The circumference of the rectangle is ____ cm.

Q3: Boy: The length of the basketball court is 28m and the width is 15m.

Girl: I walk around the court twice.

Question: The girl has walked ____ m in total.

These are instances of three variants involving two key knowledge components: calculating the circumference of a rectangle given its length and width (K1), and calculating a product given the multiplicand and multiplier (K2). The difference between Q2 and Q3 shows that procedures alone are not sufficient to distinguish variants.

For each component, the misconceptions are listed in the following table.

Knowledge Component   Right Formula       Misconception          Symptom
K1                    W*2+L*2, (W+L)*2    Wrong formula          W*L, L*4, W*4, (W+L)*4
                                          Flawed execution       W+L, L+W*2[[7]], W+L*2, L*2[[8]], W*2
K2                    X*Y                 No multiplication      Y, X
                                          Mistake for addition   X+Y
The solution tree of Q1 is just K1. Q2 is K2->K1. Q3 is K1->K2. When multiple procedures are involved, the wrong responses are generated by the convolution of the procedures. For example, for Q2, K2 generates 4 intermediate answers. Taking each as an input to K1 generates 8 answers (but only 4 category labels, including right). In total, Q2 has 32 leaf nodes. If no two leaf nodes produce the same answer, then each answer pins down a particular set of procedural misconceptions. When different leaf nodes do produce the same answer, we rely on a pedagogical expert to decide which set of misconceptions is more likely and choose it as the diagnosis. A sketch of this enumeration follows.
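As a minimal sketch of the enumeration (our own illustration, not the authors' production code; the branch list mirrors the misconception table above, and the counts differ slightly from the text because branches that coincide numerically are merged), the following builds Q2's answer-to-diagnosis map for W=3:

```python
from itertools import product

W = 3  # Q2: width = 3cm; length = width * 2 (K2); then circumference (K1)

# K2 branches: diagnosis -> function of (multiplicand X, multiplier Y)
k2_branches = {
    "right":                   lambda x, y: x * y,
    "no multiplication: Y":    lambda x, y: y,
    "no multiplication: X":    lambda x, y: x,
    "mistake for addition":    lambda x, y: x + y,
}

# K1 branches: diagnosis -> function of (W, L); the two right formulas
# produce the same value, so one form represents both.
k1_branches = {
    "right":                   lambda w, l: (w + l) * 2,
    "wrong formula: W*L":      lambda w, l: w * l,
    "wrong formula: L*4":      lambda w, l: l * 4,
    "wrong formula: W*4":      lambda w, l: w * 4,
    "wrong formula: (W+L)*4":  lambda w, l: (w + l) * 4,
    "flawed execution: W+L":   lambda w, l: w + l,
    "flawed execution: L+W*2": lambda w, l: l + w * 2,
    "flawed execution: W+L*2": lambda w, l: w + l * 2,
    "flawed execution: L*2":   lambda w, l: l * 2,
    "flawed execution: W*2":   lambda w, l: w * 2,
}

# Convolve the procedures: every K2 leaf feeds into every K1 branch.
leaves = {}
for (d2, f2), (d1, f1) in product(k2_branches.items(), k1_branches.items()):
    answer = f1(W, f2(W, 2))
    leaves.setdefault(answer, []).append((d2, d1))

for answer, diagnoses in sorted(leaves.items()):
    # len(diagnoses) > 1 marks colliding leaves, resolved by a human expert.
    print(answer, diagnoses)
```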

There are two other types of special diagnosis, give-up and calculation error. A give-up is defined as an answer of 3, 5, or 8[[9]], or a number copied from the question text, such as 16 in Q1 or 28 in Q3. A give-up does not necessarily mean the student cannot solve the item; more likely the student did not bother to. Nevertheless, if a wrong answer is identified as a give-up, then all procedural diagnoses are give-up. A calculation error is defined as a wrong answer that does not contain any procedural misconception, for example 33[[10]] for Q3. If a wrong answer is identified as a calculation error, then all procedural diagnoses are right. A wrong answer is labelled no-diagnosis if it is neither diagnosed with procedural misconceptions, nor marked as a give-up, nor marked as a calculation error[[11]]. A sketch of this labeling precedence follows.
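Putting the labeling rules together (again an illustrative sketch; the helper below and the `leaves` map from the previous snippet are our own, and the calculation-error check is simplified):

```python
def diagnose(answer, correct_answer, leaves, question_numbers):
    """Label one response; the order of checks encodes the precedence above."""
    if answer == correct_answer:
        return "right"
    if answer in (3, 5, 8) or answer in question_numbers:
        return "give-up"       # give-up overrides any procedural diagnosis
    if answer in leaves:
        return leaves[answer]  # procedural misconception(s), expert-resolved
    # A wrong answer traceable to an arithmetic slip on a correct procedure
    # would be labeled a calculation error in the paper; everything else
    # falls through to no-diagnosis.
    return "no-diagnosis"
```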

Table 1 shows the composition of the wrong responses.

Table 1
Question   Procedural Misconceptions   Give-up   Calculation Error   No-diagnosis
Q1         27%                         10%       8%                  55%
Q2         41%                         47%       0%                  13%
Q3         58%                         6%        6%                  31%

Among the identified procedural misconceptions, Table 2 shows the share of blame. Within this category, the probability that both components are to blame is only about 10%. In other words, the prevailing BKT practice of blaming every component is logically correct for only about 10% of the wrong responses. No matter how sophisticated the BKT model becomes, if the input data are this seriously compromised, the analysis is likely to be far from the truth.

Table 2
  K1 is to blame K2 is to blame Both
Q2 28% 60% 12%
Q3 6% 87% 7%

 

In addition to better data quality, the vector of diagnoses enables the HMM to form better learner-type clusters. Table 3 is the observation matrix of the 3-state BKT model; it is almost impossible to interpret what each state means. Table 4 is the observation matrix of the 3-state BDT model, where one can easily name the three clusters: those who skip, those who slip, and those who master.

Table 3 (entries are the probability of a correct response in each state)
          K1     K2
State 1   93%    97%
State 2   10%    13%
State 3   2%     2%
Table 4
                             State 1   State 2   State 3
K1   Right                   96.6%     68.5%     38.5%
     Wrong Formula           0.6%      4.3%      6.8%
     Flawed Execution        0.2%      3.4%      7.4%
     Give-up                 0.2%      5.5%      22.1%
     No Diagnosis            2.3%      18.2%     25.1%
K2   Right                   93.9%     43.2%     33.9%
     No Multiplication       4.4%      36.7%     12.3%
     Mistake for Addition    0.6%      1.3%      1.7%
     Give-up                 0.0%      0.9%      39.5%
     No Diagnosis            1.1%      17.9%     12.7%

Table 5 shows the transition matrices, i.e., the learning process. The BDT model only allows transitions between adjacent states, so the cells two states apart are empty.

Table 5-1
K1        State 1   State 2   State 3
State 1   91.8%     8.2%      -
State 2   38.6%     38.8%     22.6%
State 3   -         52.8%     47.2%

Table 5-2
K2        State 1   State 2   State 3
State 1   58.9%     41.1%     -
State 2   16.4%     60.4%     23.2%
State 3   -         67%       33%

 

V. Evidence of Potential Impacts

There are two potentially productive applications of the BDT model. First, it makes the latent states much easier to interpret and communicate. This is essential for human tutors interfacing with an ITS; they often complain that a probability statement of mastery is either mysterious or meaningless. Second, the authors believe that different diagnoses should lead to different pedagogical interventions.

VI. Summary

The Bayesian Diagnosis Tracing model solves the blaming problem in the context of mastery tracing. The paper provides some evidence on its benefits and some of its practical limitations. This paper only scratches the surface of the work that can and needs to be done. For one thing, how much more effective a BDT-driven intelligent tutor is compared with the current system remains to be seen. While we have strong faith in its accuracy, it is expensive and risky to build a solution-tree-compatible assessment content inventory and state-based teaching and practice materials, so its commercial viability remains in doubt. For another, the BDT framework needs to be extended to handle multiple diagnoses. In principle, since the likelihood function can be written explicitly, the parameters can be estimated with an MCMC algorithm. However, the computational complexity on real datasets and the convergence quality of the resulting estimates need to be examined.

References
  • Corbett, Albert T., and John R. Anderson. “Knowledge tracing: Modeling the acquisition of procedural knowledge.” User Modeling and User-Adapted Interaction 4.4 (1994): 253-278.
  • De La Torre, Jimmy. “The generalized DINA model framework.” Psychometrika 76.2 (2011): 179-199.
  • Piech, Chris, et al. “Deep knowledge tracing.” Advances in Neural Information Processing Systems. 2015.
  • VanLehn, Kurt. Mind Bugs: The Origins of Procedural Misconceptions. MIT Press, 1990.

 

 

 

[[1]] In the classical BKT framework, not all wrong responses are created equal: a wrong answer signals incompetence in the non-mastery state, while it signals carelessness (a slip) in the mastery state. This metaphor runs into trouble when there are three latent states.

[[2]] As demonstrated in the application section, an incorrect reasoning branch may produce the correct answer if the problem is poorly designed. Therefore, branches of the solution tree are categorized by their reasoning process rather than by the answers they produce.

[[3]] The complexity of calculation is ignored in this definition. Chinese pedagogical experts generally disagree with this simplification. However, from the perspective of diagnosis, computational errors can be handled separately from the mathematical modeling. In fact, if an error is attributed to calculation, the solution blames none of the components.

[[4]] This definition is not very precise because a novel application context challenges a learner's creativity. However, it is not clear where the line should be drawn.

[[5]] If the parameters are not well crafted, multiple branches produce the same answer. This paper picks the branch that a human expert favors based on experience.

[[6]] In practice, it means that the order of the knowledge components on the solution tree is irrelevant. This is not always true.

[[7]] The origin of this error can be either (L+W)*2 with the parentheses forgotten, or L*2+W*2 with the last *2 forgotten.

[[8]] It is debatable whether the student does not remember the circumference formula or simply forgot to complete the calculation. Here we chalk it up to flawed execution.

[[9]] Because students answer on a classic 4*3 number-pad interface, they can type in a number and submit without thinking. Since a student faces almost no penalty for answering an item wrong, such thoughtless submissions are common.

[[10]] 28+15 = 33

[[11]] For a few answers, the paper also treats typing errors as calculation errors. For example, 421 for Q1 is likely the result of the student accidentally typing an extra “1” after the correct answer 42. These cases are rare.

An IRT-Based Computerized Adaptive Testing System

This article introduces the adaptive algorithm with the most mature theoretical foundation and the widest adoption in education: Computerized Adaptive Testing (CAT). Well-known computerized testing services, such as ETS's GRE and TOEFL, use CAT as their recommendation engine. The measurement theory behind CAT, Item Response Theory (IRT), is a mainstay of psychometrics, with more than 40 years of theoretical development. Finally, the adaptive algorithms used by most Chinese education technology companies probably do not go beyond the scope of CAT.

This article first presents the intuition behind CAT, then briefly introduces the IRT model it relies on and describes the simplest possible CAT engineering design. It closes with a short critique of CAT's shortcomings as a learning algorithm.

Compared with a traditional paper-and-pencil test, CAT achieves higher measurement precision with the same number of items; equivalently, it needs fewer items to reach the same precision. The intuitive reason is that CAT can pick the most informative (i.e., most discriminating) item for each response pattern. The IRT model is the methodological foundation for quantifying how much information an item carries about an examinee.

 

1. The Theoretical Foundation of Adaptive Testing

Suppose the only difference between a paper-and-pencil test and a computerized test (CAT) is adaptivity. When is CAT better than the paper test? This subsection answers that question.

Throughout this subsection, assume students fall into two types along a single ability dimension: strong students and weak students. As the names suggest, strong students have high ability and weak students low ability. Assume each type makes up 50% of the population.

(A)

Suppose there are two items: an easy giveaway item and a hard capstone item. Both strong and weak students can solve the easy item, but only strong students can solve the hard item.

Suppose your goal is to determine whether a student is strong or weak. What paper should you assemble?

If you give the easy item, both types will solve it, so the result says nothing about the student's type. If you give the hard item, whoever solves it is strong and whoever fails is weak, so the result is extremely informative.

Let us translate this intuition into more rigorous mathematics (assuming familiarity with Bayes' theorem). Suppose you give the student the hard item (Y_h) and the student answers correctly (1). The posterior probability that the student is strong (X=1) is

    \[P(X=1|Y_h=1) =\frac{P(X=1,Y_h=1)}{P(Y_h=1)}=1\]

Conversely, if the student answers it wrong, the posterior probability of being strong is 0%. If we give the easy item and the student answers correctly, the posterior probability of being strong stays at 50%, unchanged from the prior. (Note that a wrong answer on the easy item is impossible here.)

Paper / answer   1      0
Hard item        100%   0%
Easy item        50%    NA

The table above (see Example 1 at this link for the calculation) lists the posterior probability of being strong under every paper/answer combination. Given the answer, the optimal paper is the one that produces the most extreme posterior (i.e., minimizes entropy). For example, when the student answers correctly, the hard item beats the easy item (100% > 50%).

In summary, in this case the paper only needs the hard item. The optimal paper is unique, so the paper test and CAT are equally good.

(B)

The conclusion of Case A rests on the very strong assumption that a student's type determines the outcome deterministically. In reality the relationship is uncertain: strong students can slip, and weak students can guess right. The uncertainty between unobservable ability and observable outcomes is the fundamental problem of learning analytics.

Now suppose there are three items, hard, medium, and easy, where:

(1) strong students solve the hard item with probability 60%, weak students with probability 10%;

(2) strong students solve the medium item with probability 80%, weak students with probability 60%;

(3) strong students solve the easy item with probability 95%, weak students with probability 90%.

Suppose your goal is to classify the student as precisely as possible. How should you assemble the paper?

The answer: use all three items. Every item has some discriminating power, so its information is not zero. If we want the most precise classification, every item adds value. Hence, if testing were costless, item selection would add nothing, and the paper test would be exactly as good as CAT.

(C)

In reality, however, testing is costly, chiefly in time; that is why most exams have time limits. A more realistic goal is therefore:

Given a budget of two items, classify the student as precisely as possible. How should you assemble the paper?

Let me state the answer first: the optimal test depends on the student's answers. CAT can adapt its item selection to the student's responses while a paper test cannot. Therefore, given the same item bank and the same test length, CAT beats the paper test.

[1] The paper-and-pencil test

Using Bayes' rule, we can compute the posterior probability of being strong for every paper + answer combination (see Example 2 at this link for the calculation[1]). Because there are only two student types, the best scheme for identifying strong students is also the best scheme for identifying weak ones.

Paper / answers   11     10     01     00
Hard + Medium     89%    75%    37%    18%
Hard + Easy       86%    75%    31%    18%
Medium + Easy     58%    40%    34%    20%

Similarly, given the answer combination, the paper with the most extreme posterior is optimal. For example, when both items are answered correctly, Hard + Medium is optimal because its posterior is the largest (89% > 86%, 89% > 58%). The table shows that in most cases (every column but the third), Hard + Medium is the optimal paper, so the paper test would choose "hard item + medium item". The sketch below reproduces these posteriors.
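The posterior table can be reproduced with a few lines of Bayes' rule (our own sketch; the prior and per-item accuracies are those assumed in (B)):

```python
from itertools import product

# P(correct | type) for each item, from the setup in (B); prior P(strong)=0.5.
p_correct = {"hard":   {"strong": 0.60, "weak": 0.10},
             "medium": {"strong": 0.80, "weak": 0.60},
             "easy":   {"strong": 0.95, "weak": 0.90}}

def posterior_strong(items, answers, prior=0.5):
    """P(strong | answers) for a fixed paper, by Bayes' rule."""
    like = {"strong": 1.0, "weak": 1.0}
    for item, y in zip(items, answers):
        for t in like:
            p = p_correct[item][t]
            like[t] *= p if y == 1 else (1 - p)
    joint_strong = prior * like["strong"]
    joint_weak = (1 - prior) * like["weak"]
    return joint_strong / (joint_strong + joint_weak)

for paper in [("hard", "medium"), ("hard", "easy"), ("medium", "easy")]:
    row = [posterior_strong(paper, ans) for ans in product([1, 0], repeat=2)]
    print(paper, [f"{p:.0%}" for p in row])
```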

[2] CAT item selection

CAT's advantage is that by the time it picks the second item, it has already observed the result of the first, so it can make a better choice for the second item.

By a calculation similar to (A), CAT picks the hard item first. If the student answers it correctly, the posterior probability of being strong is 86%; if the student answers it wrong, the posterior is 31%. The posterior from the first item becomes the prior for the second item; the hard item is then removed from the bank, leaving only the medium and easy items to consider.

                        First item right   First item wrong
Second item / answer    1       0          1       0
Medium                  89%     75%        36%     18%
Easy                    86%     75%        31%     18%

The table above says: if the student gets the first item right, the second item should be the medium one; if the student gets the first item wrong, the second item should be the easy one. CAT therefore chooses Hard + Medium (first item right) / Hard + Easy (first item wrong).

In this specific example, CAT achieves a slightly lower posterior variance than the paper test when the student misses the first item and gets the second one right, although the gap is small (0.1476 vs. 0.1474).

2. The IRT Model

Section 1 sketched the principle of CAT: test items probe latent ability with uncertainty, and the same item carries different amounts of uncertainty for students of different ability. Because CAT can dynamically select the item with the least measurement uncertainty given the examinee's responses, it needs fewer items than a paper test for the same precision, or achieves higher precision with the same number of items.

The analysis in Section 1 made two very strong assumptions:

(1) we know the student's type (strong vs. weak);

(2) we know each type's probability of answering each item correctly (e.g., strong students answer the hard item correctly 60% of the time).

Neither assumption holds in reality, so we need a model to supply these two pieces of information. Item Response Theory (IRT) is the most popular such model. For a more detailed introduction, see this link (ETS published a review of IRT's development in 2013). A typical IRT model assumes:

(1) the examinee's ability is unidimensional (\theta);

(2) the examinee's ability is continuous (i.e., there are no finitely many interpretable groups like strong vs. weak, but students can be ranked);

(3) the item parameters (\alpha,\beta) and the student ability (\theta) map to the probability of a correct answer (P(Y=1), assuming binary right/wrong items) through

    \[P(Y=1|\theta,\alpha,\beta) = \frac{e^{\alpha(\theta-\beta)}}{1+e^{\alpha(\theta-\beta)}}\]

Estimating \theta, \alpha, \beta from data yields the distribution of student ability and, for any single item, the distribution of correct-answer probabilities across students. Moreover, some CAT item-selection algorithms require exactly this structural form.

3. An Engineering Design for a Minimal CAT

The two sections above covered the principle of CAT and an outline of IRT. This section describes the simplest (and most classic) CAT engineering design. It is heavy on mathematics and notation; readers who only want the big picture can skip ahead to Section 4 on advanced topics.

There are many item-selection algorithms. Here we introduce the oldest: the Maximum Information Criterion.

 

Let i index students. Student i's true ability is \theta^i, and the ability estimate after the t-th item is \hat{\theta}^i_t.

Let j index items. Item j's parameters are \Theta_j. In the two-parameter IRT model (the example of Section 2), \Theta_j= \{\alpha_j,\beta_j\}.

We treat only student ability as a variable; the item parameters and the responses are treated as constants.

Under the two-parameter IRT model, the mapping from item and student parameters to the probability of a correct answer is

    \[p^i_j(\theta^i) = P(Y^i_j=1\mid\theta^i,\Theta_j) = \frac{e^{\alpha_j(\theta^i-\beta_j)}}{1+e^{\alpha_j(\theta^i-\beta_j)}}\]

A convenient property of this functional form is that the Fisher information (one way of quantifying information) is additive. Hence, if the student has already answered k-1 items, the Fisher information that candidate item j would contribute at the k-th position is

    \[I_j(\theta^i) = \frac{[p^i_j{}'(\theta^i)]^2}{p^i_j(\theta^i)[1-p^i_j(\theta^i)]}\]

where

    \[p^i_j'(\theta^i) = \frac{\partial}{\partial \theta^i} p^i_j(\theta^i)\]

If \theta^i were known, we could compute each item's Fisher information and pick the item with the largest value. Unfortunately, we do not know \theta^i, so at every step we substitute the estimate \hat{\theta}^i_t, computed as follows.

(1) Initialization

Given a prior distribution g(\theta) over student ability,

    \[\hat{\theta}^i_1 = \int \theta g(\theta) d\theta\]

(2) Subsequent iterations

Every subsequent \hat{\theta}^i_k is the Maximum Likelihood Estimator, where the likelihood function is

    \[L(\theta) = \prod_{j=1}^{k-1} \frac{(e^{\alpha_j(\theta-\beta_j)})^{Y^i_j}}{1+e^{\alpha_j(\theta-\beta_j)}}\]
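Putting the pieces together, here is a minimal sketch of the maximum-information loop (our own illustration: the item bank is randomly generated, the answers are simulated, and a grid search stands in for a proper MLE routine):

```python
import numpy as np

rng = np.random.default_rng(0)
bank = [{"alpha": a, "beta": b}  # hypothetical 2PL item bank
        for a, b in zip(rng.uniform(0.5, 2.0, 50), rng.normal(0.0, 1.0, 50))]

def p_correct(theta, item):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-item["alpha"] * (theta - item["beta"])))

def fisher_info(theta, item):
    """2PL Fisher information: alpha^2 * p * (1 - p)."""
    p = p_correct(theta, item)
    return item["alpha"] ** 2 * p * (1 - p)

def mle_theta(responses, grid=np.linspace(-4, 4, 801)):
    """Grid-search MLE of theta from (item, y) pairs."""
    loglik = np.zeros_like(grid)
    for item, y in responses:
        p = p_correct(grid, item)
        loglik += y * np.log(p) + (1 - y) * np.log(1 - p)
    return grid[np.argmax(loglik)]

true_theta, theta_hat = 1.0, 0.0  # 0.0 plays the role of the prior mean
responses, remaining = [], list(bank)
for t in range(10):
    item = max(remaining, key=lambda it: fisher_info(theta_hat, it))
    remaining.remove(item)                                # no item reuse
    y = int(rng.random() < p_correct(true_theta, item))   # simulated answer
    responses.append((item, y))
    theta_hat = mle_theta(responses)
print(f"estimate after 10 items: {theta_hat:.2f} (truth {true_theta})")
```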

 

 

4. Advanced Topics in CAT

CAT is a field with nearly 30 years of history, and its theoretical depth and breadth go far beyond this article. Interested readers can consult Computerized Adaptive Testing: Theory and Practice, edited by Wim J. van der Linden and Cees A.W. Glas, which covers the state of CAT around the year 2000.

Broadly speaking, a complete CAT system has to solve the following thorny problems:

(1) Section 3 assumed the item parameters are known, while in reality they must be estimated. How do we handle the item-selection bias caused by estimation error in the item parameters?

(2) Because the ability estimate is noisy, always choosing the most discriminating item can make the information computed at that step unreliable.

(3) How do we assess the quality of the item bank? How do we make sure the bank holds enough items for many administrations (the GRE, for example, must maintain about 70 complete forms)? How do we make sure refreshing the bank does not break comparability across administrations (a GRE item lives about 3 months)?

(4) What if ability is multidimensional? Worse (especially relevant to China's zhongkao and gaokao), what if a single item involves several abilities?

(5) How do we handle additional test constraints, such as total testing time, mean-score targets, and caps on item exposure?

 

5. CAT's Shortcomings as an Adaptive "Learning" Algorithm

As a testing algorithm, CAT has proven feasible and efficient. As human-computer interfaces and algorithms improve, computers will not only fully accommodate paper-and-pencil item types (chiefly open-ended items) but also support many item types that paper tests cannot (e.g., VR-based hands-on experiments). CAT will undoubtedly advance further in the assessment domain.

As a learning algorithm, however, CAT is congenitally deficient. CAT assumes that a student's ability stays constant while answering items, which denies the very possibility of learning. How can an algorithm that denies the possibility of learning guide instructional recommendations? CAT (or IRT) alone therefore cannot support a learning recommendation system.

When adaptation happens across knowledge components (see my other article), CAT can serve as the assessment module, paired with an instructional module based on knowledge space theory (e.g., ALEKS) or an expert system (e.g., Khan Academy), to form an intelligent learning recommendation system.

 

An A/B Testing Methodology for Online Services

Introduction:

This article introduces a methodology for A/B testing online services. The Chinese-language web currently lacks a comprehensive introductory treatment of the topic.

I will first discuss the decision problem that A/B testing poses for online services, and then introduce two less common statistical tests: the Sequential Probability Ratio Test (SPRT) and tests based on Multi-Armed Bandit (MAB) algorithms.

I. More Than the P-Value

Classical statistics runs an A/B test as follows (a code sketch follows the list):

(F1) Choose a sample size N.

(F2) Randomly split the sample evenly into groups A and B.

(F3) Compute the p-value of the t statistic (two samples, unknown variance) for the null hypothesis H_0:\mu_A=\mu_B.

(F4) If the p-value is below a chosen significance level (e.g., 0.05), reject the null hypothesis; otherwise nothing can be concluded.
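As a minimal sketch of F1-F4 (synthetic data; the metric, effect size, and sample size are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N = 2000                                # F1: fix the sample size in advance
a = rng.normal(0.10, 1.0, N // 2)       # F2: group A (synthetic metric)
b = rng.normal(0.12, 1.0, N // 2)       #     group B

# F3: two-sample t test with unknown (possibly unequal) variances (Welch).
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)

# F4: reject H0 at the 0.05 level; otherwise conclude nothing.
print(f"t = {t_stat:.2f}, p = {p_value:.3f}, reject = {p_value < 0.05}")
```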

Classical statistical tests, born of agricultural and industrial production, do not necessarily suit online services. This is not only because the p-value is routinely misunderstood, but more because classical hypothesis testing cannot meet the business requirements of online services.

1.1 A/B testing for continuous service improvement is a decision problem, not an inference problem

Most of the time we only care whether variant A is better than variant B, not by how much. In other words, we care about the business correctness of the decision, not about the statistical reliability of the inference.

From this angle, the two-sided test poses a poor decision problem. Rejecting the null H_0:\mu_A=\mu_B does not tell us whether to choose variant A or variant B. The one-sided test H_0:\mu_A<\mu_B is a better null hypothesis: if it is rejected, choose A; otherwise choose B.

Hypothesis tests can take other forms as well. We may not care whether A and B are exactly identical (a Bayesian would say they never are); we may instead care whether the gap between A and B lies within a certain range. If it does, we should simply pick one; otherwise we can keep the current variant. Such a null hypothesis is the equivalence test: H_0: |\mu_A-\mu_B| < \epsilon. A sketch follows.
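A sketch of such an equivalence check via two one-sided tests (TOST; note that the standard TOST formulation takes non-equivalence as the null, the mirror image of the hypothesis written above, and the data here are made up):

```python
import numpy as np
from scipy import stats

def tost(a, b, eps):
    """Two one-sided tests: a small p supports |mu_A - mu_B| < eps."""
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    df = len(a) + len(b) - 2                          # crude; Welch is safer
    p_lower = 1 - stats.t.cdf((diff + eps) / se, df)  # H0: diff <= -eps
    p_upper = stats.t.cdf((diff - eps) / se, df)      # H0: diff >= +eps
    return max(p_lower, p_upper)  # reject both one-sided nulls to conclude

rng = np.random.default_rng(2)
a, b = rng.normal(0.50, 1.0, 1000), rng.normal(0.52, 1.0, 1000)
print(tost(a, b, eps=0.1))
```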

In short, F3 does not suit online A/B testing. In fact, Ronald Fisher, the progenitor of hypothesis testing, opposed using null hypothesis tests as the basis for decisions.

1.2 We care about the cost of the experiment, not just its result

Running experiments online (especially at scale) can hurt the user experience and hence key business metrics such as retention. Unless your boss happens to be a zealous empiricist, they will probably grumble about experiments. And if you are lucky enough to join a company entirely driven by A/B testing, the cost problem does not disappear; it becomes even more important.

We therefore want to minimize how often the inferior variant is shown to users during the experiment. There are two ways to do this: dynamic stopping and unbalanced allocation. Dynamic stopping ends the experiment as soon as one variant is confirmed the winner; adopting it means F1 does not suit online A/B testing, because we need not wait until the planned sample is exhausted. Unbalanced allocation assigns more users to the variant currently estimated to be better, a point we discuss in detail in the multi-armed bandit section; adopting it means F2 does not suit online A/B testing.

1.3 Samples are generated sequentially

Classical statistics implicitly assumes the data are generated all at once rather than one after another (probably thanks to Ronald Fisher, the founder of hypothesis testing, being an agricultural statistician). Users of an online service, however, always arrive in sequence. This key feature lets us balance reliability against cost, and it is central to the two testing methods below.

II. The Sequential Probability Ratio Test

The Sequential Probability Ratio Test (SPRT) can be viewed as the variant of classical hypothesis testing for sequential samples. At the same significance level and power, SPRT can stop earlier than the classical t test and needs fewer samples. The price is a more complicated test statistic and a more complicated decision procedure.

Readers interested in SPRT's theoretical properties and formal procedures can study David Siegmund's Sequential Analysis (1985). Here we point to two more practical recipes. One is Optimizely's stats engine; the other is Evan Miller's simple sequential test. The former provides a rigorous false-positive bound but does not fully disclose its core parameters, so it cannot be used off the shelf. The latter is foolproof, but its significance and power are questionable, and it fits few business scenarios (almost exclusively binary metrics).

If you are especially attached to the notion of significance, SPRT may be worth your study time.

III. Multi-Armed Bandit Tests

If the mathematics behind SPRT is deep, the mathematics required to truly understand the multi-armed bandit (MAB) problem is superhuman. Scott (2010) offers an accessible review of the history of the MAB problem, and an O'Reilly book covers the classic MAB solutions along with Python code.

Fortunately, MAB has a Bayesian solution called Thompson Sampling. It is easy to understand, easy to implement, and highly extensible, which makes it well suited to online A/B testing. It is, in fact, the method actually used by Google Analytics, Yahoo!, and Bing.

For the technical details of Thompson sampling, see the Google Analytics help pages and examples; Scott (2010) gives a more formal treatment. Here I discuss the core idea of the GA algorithm.

3.1 What is a multi-armed bandit?

"Bandit" is an American nickname for the slot machine. A machine with a single lever is a one-armed bandit: you pull the lever and receive a reward. Assume the reward is a 0/1 random variable that equals 1 with probability p.

Now suppose the machine has several levers, numbered 1 to N, each paying out with probability p_1, p_2, \dots, p_n, unknown to you. In each round you may pull one lever.

Assuming the payout probabilities never change, what strategy maximizes the cumulative reward?

That, in short, is the multi-armed bandit problem. Recall the discussion of online service optimization in Section I and note the strong parallels: both aim at making a decision; both weigh cost against benefit; and both have sequential data-generating processes.

3.2 The exploration-exploitation trade-off

If you attacked the multi-armed bandit problem with classical t-test thinking, your strategy would look like this:

(1) In rounds 1 through tN, make sure each lever is pulled t times.

(2) At the end of round tN, compute each lever's sample mean \hat{p_i} and pick the lever i^* with the highest empirical payout rate.

(3) From round tN+1 onward, always pull lever i^*.

The strategy from round tN+1 onward is pure exploitation: choosing the option with the highest empirical reward (play the winner). The strategy of the first tN rounds is pure exploration: ignoring empirical rewards and letting the design decide.

A strategy with such a clean split is generally not optimal. Early in the experiment, exploration should be encouraged because we do not yet know which option is better. But once evidence suggests that some options are better, the share of exploitation should grow. Doing so has two benefits: first, playing the winner more often raises the strategy's overall payoff; second, playing the winner more often shrinks its standard error faster than even allocation would, raising our confidence in the estimated winner. The share of exploitation should therefore increase over time.

Thompson sampling offers exactly such an intuitive Bayesian solution:

In each round, compute the posterior probability that lever i is the best lever, and use these probabilities to randomize the next round's allocation across levers.

Thompson sampling keeps the share of users exposed to inferior variants during the experiment to a minimum. Moreover, if you know some Bayesian modeling, you can see how flexible and extensible this approach is. A minimal sketch follows.
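A minimal sketch of Thompson sampling for binary rewards with Beta posteriors (the arm count and true conversion rates are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
true_p = [0.04, 0.05, 0.06]   # unknown to the algorithm
wins = np.ones(3)             # Beta(1, 1) uniform priors
losses = np.ones(3)

for _ in range(10_000):
    # One posterior draw per arm; play the arm with the largest draw.
    # This allocates traffic according to each arm's posterior probability
    # of being the best, which is exactly the rule described above.
    draws = rng.beta(wins, losses)
    arm = int(np.argmax(draws))
    reward = rng.random() < true_p[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward

print("posterior means:", wins / (wins + losses))
print("pulls per arm:", (wins + losses - 2).astype(int))
```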

3.3 The stopping rule: Potential Value Remaining

Although Thompson sampling appears to crush the classical t test, it has a fatal problem as a decision procedure: no MAB policy is itself a statistical test, because none comes with a stopping rule. Thompson sampling would happily run forever, but practice does not allow infinitely long experiments. So far, nobody has published a simple, reliable stopping rule. Compared with the classical t test, this is an important drawback of MAB-based testing.

Google Analytics recommends using "Potential Value Remaining" (PVR) to decide when to end an experiment. Having chosen lever i^*, we want to know, based on the posterior distribution of the payout probabilities, how large the potential gain from switching to another lever could be. If that gain is small, we have sufficient reason to settle on the current best; if it is still large, we need more data to reduce the uncertainty about the payout probabilities. PVR is in fact a distribution rather than a single number, so the "size" of the potential gain is measured with two numbers, a threshold \epsilon and a threshold probability \delta:

Stop when the probability that the potential best payout exceeds the current best by more than \epsilon% is below \delta.

For example, the Google Analytics default is (0.01, 0.05): stop when the probability that the potential best payout beats the current best by more than 1% falls below 5%. A Monte Carlo sketch of this check follows.
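A Monte Carlo sketch of the PVR stopping check (our own illustration; the posterior counts below are hypothetical):

```python
import numpy as np

def pvr_stop(wins, losses, eps=0.01, delta=0.05, n_draws=20_000):
    """Stop when P(best alternative beats the leader by more than eps,
    in relative terms) < delta, per the rule described above."""
    rng = np.random.default_rng(4)
    draws = rng.beta(wins, losses, size=(n_draws, len(wins)))
    leader = np.argmax(wins / (wins + losses))     # current winner i*
    lift = (draws.max(axis=1) - draws[:, leader]) / draws[:, leader]
    return np.mean(lift > eps) < delta

# e.g. posterior counts accumulated by Thompson sampling (hypothetical):
print(pvr_stop(wins=np.array([40.0, 60.0]), losses=np.array([960.0, 940.0])))
```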

A word of warning: even if variants A and B are identical, the PVR stopping rule will terminate a sizable share of experiments in finite time, and the probability of this Type I error keeps growing as the experiment runs longer. This is the opposite of the classical fixed-sample t test, which controls the Type I error rate at a fixed level regardless of sample size. If you care deeply about the Type I error rate, MAB testing is not a good choice.

IV. When Should You Care About Significance?

In hypothesis testing we usually speak of Type I and Type II errors. Suppose the null hypothesis is H0: you are not pregnant, and the alternative is H1: you are pregnant. Then a Type I error is telling someone who is not pregnant that they are, and a Type II error is telling someone who is pregnant that they are not.

Scott has argued in several places that online experiments need hardly worry about Type I errors, because their cost is tiny; it is the Type II error that is deadly. In fact, what Scott opposes is not Type I error control per se but the two-sided hypothesis test.

Consider the business meaning of the two error types under H_0:\mu_A=\mu_B. If we commit a Type I error, variants A and B are exactly equivalent, yet we wrongly conclude that one of them is better. What is the practical cost? If the costs of switching the code and of changing the user experience are negligible (true for small changes), the cost of a Type I error is indeed zero. In sharp contrast, if we commit a Type II error and the new variant is in fact better, we have missed a permanent improvement: a costly mistake.

Scott's critique is most useful once the product line has stabilized and improvement can only come from the details. The PMs have already shipped every improvement they could think of, and counting on human imagination to uncover surprise wins is no longer realistic; instead, generating large numbers of combinations algorithmically and testing them continuously may accumulate small steps into a long march. For companies like Google or Facebook, experiment-driven development is therefore the more sensible choice.

If the product line is not stable and the experimental data serve as research for a redesign, then we are doing inference, not making a decision. People are easily fooled by randomness into seeing mirage-like improvements, so the experimental methodology should be designed more conservatively. An all-in bet that fails because of a Type I error is a costly mistake.

I therefore personally recommend Thompson sampling as the statistical engine of an automated experimentation pipeline, and the classical t test for strategic, decision-critical experiments.

Further Reading

To keep this article readable, I have avoided the following important topics. They help deepen the understanding of how hypothesis testing fits the business. Interested readers can start from these blogs:

(1) The love-hate saga between frequentists and Bayesians over the p-value.

(2) Repeated significance testing: why a Bayesian test may or may not be the cure. I personally regard believing in Bayes while also believing in p-values as a split personality; Andrew Gelman has a more scientific explanation (the comments under his post are worth reading carefully!).

(3) Multiple comparisons: why excessive (naive) A/B testing is pseudoscience.

On how to run a classical t test properly, John List (2010), a leading experimental economist, offers a very practical guide. The classical t test is far more than splitting the sample fifty-fifty.