Bayesian Diagnosis Tracing Model: Assigning Blame for Multiple Knowledge Components (Draft)

I. Introduction

When a practice item is linked to multiple knowledge components, there is no widely accepted best practice for assigning blame. In the context of mathematics education in China, the blaming problem comes into prominence as early as the second grade. Despite its practical importance, the problem is under-studied in both the academic literature and industry. The solutions known to the authors are flawed as formative assessment tools, as we discuss in the next section.

This paper proposes a new framework, Bayesian Diagnosis Tracing (BDT), to deal with the blaming problem in a logically consistent way without losing track of learning progress. The BDT model combines two well-established streams in the learning analytics literature: research on the Bayesian Knowledge Tracing (BKT) model and the study of procedural misconceptions. In a nutshell, the binary response in the BKT model is replaced with a vector of misconception diagnoses, allowing an intelligent tutor to mimic the reasoning of a human tutor.

II. Relevant Theories of Learning

A.   Bayesian Knowledge Tracing and the Cognitive Diagnosis Model

The BKT model is the most widely accepted analytical framework for modeling the learning process. It carries two implicit assumptions. The first is that each practice item maps to exactly ONE knowledge component. The second is that wrong responses are qualitatively the same conditional on the latent state[[1]]. In the context of multiple knowledge components, the first assumption is clearly violated. A common practice is to pretend that the learner practices N separate items, each mapping to one knowledge component. If the learner answers the original item correctly, she answers all N artificial items correctly; otherwise, she fails all N artificial items. This method has a fatal logical inconsistency. In the case of a correct response, the components are treated as if they were connected by an "AND" gate. In the case of a wrong response, the components are treated as if they were connected by an "OR" gate. Both statements cannot be true for the same item.
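The inconsistency can be made concrete with a small sketch (a hypothetical two-component item; all names are ours):

```python
from itertools import product

# A hypothetical item requires two knowledge components, K1 and K2;
# the response is correct only if BOTH are applied correctly, so the
# item itself is an "AND" gate over the components.
def item_correct(k1_ok, k2_ok):
    return k1_ok and k2_ok

# The common splitting practice copies the single observed outcome onto
# N artificial per-component items (here N = 2).
def split_blame(correct):
    return {"K1": correct, "K2": correct}

# Compare what the split asserts with what actually happened.
for k1_ok, k2_ok in product([True, False], repeat=2):
    correct = item_correct(k1_ok, k2_ok)
    implied = split_blame(correct)
    faithful = (implied["K1"], implied["K2"]) == (k1_ok, k2_ok)
    print((k1_ok, k2_ok), "->", implied, "faithful:", faithful)
```

Only two of the four cases are faithfully represented: full mastery and full non-mastery. A wrong answer caused by a single weak component wrongly blames the mastered one.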

The Deterministic Input Noisy "And" gate (DINA) model avoids this logical fallacy. However, the DINA model, like the broader family of cognitive diagnosis models, cannot replace the BKT model because it models no learning process, and tracing the progress of a learner is a must-have feature of an Intelligent Tutoring System (ITS). In addition, the DINA model only uses the binary response, which means it throws away the information encoded in the different wrong responses.

To the best of our knowledge, no analytical framework can logically handle the problem of blaming multiple knowledge components in the context of dynamic mastery tracing. The ambition of this paper is to do just that.

B.   Procedural Misconceptions

Inspired by VanLehn's (1990) study of procedural misconceptions, this paper proposes a semi-automatic wrong-response generator based on a data-informed expert system. The data-informed expert system is important because the distribution of wrong responses has a long tail. Without a frequency table built from a large amount of data, experts are likely to recall only a few common mistakes, which probably account for less than 20% of the cumulative distribution. Conversely, without the help of a pedagogical expert, one cannot make sense of the wrong answers.

III. Enabling Technological Advances

A.   Variants, Solution Trees and Automatic Diagnosis of Procedural Misconceptions

This paper defines the solution tree as the set of all possible paths of mathematical reasoning for solving a problem, including both correct and incorrect reasoning[[2]]. A variant is defined as the class of problems that are identical in solution tree[[3]] but different in application settings[[4]]. In essence, the distribution of numerical responses of an instance can be generated from the distribution of algebraic responses of its variant once the parameters of the instance are plugged in. The application section offers an example of using a solution tree to generate responses.

Given a wrong response to an instance of a variant, the analytical system can trace back the solution and accurately blame the components that contribute to the error[[5]]. Under some assumptions, the answers of a practice sequence can be split into K answer sequences, one for each knowledge component, and the learner's mastery of each knowledge component can then be estimated by an independent HMM.

B.   Bayesian Diagnosis Tracing Model

Let X_t = (x_t^1, ..., x_t^K) be the state vector of the K knowledge components at time t, where each x_t^k takes one of M states. Let T be the state transition matrix. When the response is diagnosed as a vector of diagnoses D_t = (d_t^1, ..., d_t^K), the observation matrix E gives the emission probability P(D_t | X_t).

The likelihood is given as

    \[L(D_{1:t}) = \sum_{X_{1:t}} P(X_1) \prod_{s=2}^{t} P(X_s \mid X_{s-1}) \prod_{s=1}^{t} P(D_s \mid X_s)\]

The state transition is

    \[P(X_{t+1} = j \mid X_t = i) = T_{ij}\]

If there are K knowledge components, each with M states, the joint state space has M^K states and the state transition matrix T is of size M^K x M^K. If there are L possible diagnosis combinations, the observation matrix E is of size M^K x L. In real applications, even when K, M, and L are of reasonable size, the model has too many parameters to estimate. Practically, the joint likelihood has to be broken into the product of the likelihoods of each knowledge component. In the product form, each component can be estimated by an independent HMM.
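A quick count with the numbers from the application section (K = 2 components, M = 3 states, and the five diagnosis labels per component that appear in Table 4) illustrates the parameter explosion:

```python
# Parameter count of the joint BDT model versus its factorized form.
K, M = 2, 3        # two knowledge components, three latent states each
D = 5              # diagnosis labels per component, as in Table 4
L = D ** K         # joint diagnosis combinations

joint_states = M ** K                          # joint latent states
joint_T = joint_states * (joint_states - 1)    # free transition parameters
joint_E = joint_states * (L - 1)               # free observation parameters

# Factorized form: one independent M-state HMM per component.
factored_T = K * M * (M - 1)
factored_E = K * M * (D - 1)

print("joint:", joint_T + joint_E, "factored:", factored_T + factored_E)
```

Even at this modest scale the joint model has 288 free parameters against 36 for the factorized form, and the gap grows exponentially in K.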

The product form requires the following assumptions:

A1: The learning process of each knowledge component is independent.

A2: The diagnoses of each knowledge component are exchangeable[[6]].

A3: The symptom of each knowledge component is unrelated to the other components.
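Under assumptions A1-A3, the joint likelihood factorizes into per-component HMM likelihoods, each computable with a standard scaled forward pass. A minimal sketch (the parameter values are illustrative, not the estimated values from the application section):

```python
import numpy as np

def forward_loglik(pi, T, E, obs):
    """Scaled forward algorithm: log P(obs) for one component's HMM.
    pi: (M,) initial distribution; T: (M, M) transition matrix;
    E: (M, D) diagnosis emission matrix; obs: sequence of label indices."""
    alpha = pi * E[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
        loglik += np.log(alpha.sum())
        alpha = alpha / alpha.sum()
    return loglik

# Illustrative 2-state parameters for one component (NOT estimated values).
# Diagnosis labels: 0 = right, 1 = procedural misconception, 2 = give-up.
pi = np.array([0.6, 0.4])          # state order: [mastery, non-mastery]
T = np.array([[0.9, 0.1],          # mastery is sticky
              [0.3, 0.7]])         # non-mastery may transition to mastery
E = np.array([[0.90, 0.05, 0.05],  # mastery: mostly right
              [0.30, 0.50, 0.20]]) # non-mastery: mostly symptomatic

# Under A1-A3, the joint log-likelihood is the sum of per-component
# log-likelihoods, one independent HMM per knowledge component.
diag_k1 = [1, 1, 0]  # misconception, misconception, then right
diag_k2 = [0, 0, 0]  # right throughout
joint_loglik = forward_loglik(pi, T, E, diag_k1) + forward_loglik(pi, T, E, diag_k2)
print(joint_loglik)
```

The factorization is exactly what makes estimation tractable: each component's HMM can be fit separately from its own diagnosis sequence.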

IV. Real World Applications

This paper illustrates the Bayesian Diagnosis Tracing model, and its contrast with the classical BKT model, using data collected between October 2018 and December 2018 in a Chinese learning app. About one thousand students finished three items in a row. After answering each item, the student was told whether she got it right. In the case of a wrong answer, the correct answer was revealed on the same screen as the mark, and the solution procedure was shown on the next screen.

Here are the translations of the three items.

Q1: The school has a pool, 16m in length and 5m in width. To set up a fence around the pool, the length of the fence is ____ m.

Q2: The width of a rectangle is 3cm. The length is twice the width. The circumference of the rectangle is ____ cm.

Q3: Boy: The length of the basketball court is 28m and the width is 15m.

Girl: I walk around the court twice.

Question: The girl has walked ____m in total.

These are instances of three variants, involving two key knowledge components: calculating the circumference of a rectangle given its length and width (K1), and calculating a product given the multiplicand and multiplier (K2). The difference between Q2 and Q3 shows that procedures alone are not sufficient to distinguish variants.

For each component, the misconceptions are listed in the following table.

Knowledge Component   Right Formula   Misconception          Symptom
K1                    W*2+L*2         Wrong formula          W*L
                                      Flawed execution       W+L[[7]][[8]]
K2                    X*Y             No multiplication      Y
                                      Mistake for addition   X+Y
The solution tree of Q1 is just K1. Q2 is K2 -> K1. Q3 is K1 -> K2. When multiple procedures are involved, the wrong responses are generated by the convolution of the procedures. For example, for Q2, K2 generates 4 intermediate answers. Taking each as an input to K1 generates 8 answers (but only 4 category labels, including right). In total, Q2 has 32 leaf nodes. If no two leaf nodes produce the same answer, then each answer pins down a particular set of procedural misconception(s). When different leaf nodes do produce the same answer, we rely on a pedagogical expert to decide which set of misconceptions is more likely and choose it as the diagnosis.
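The convolution of procedures can be sketched by composing symptom formulas along Q2's solution tree (K2 first, then K1). This sketch uses only the symptoms listed in the misconception table, so it enumerates fewer than the full tree's 32 leaves; the branch names are ours:

```python
# Compose symptom formulas along Q2's solution tree: K2 finds the
# length, K1 finds the circumference.
W = 3  # Q2: width 3cm, length is twice the width

k2_branches = {                      # compute length from X=2, Y=W
    "right":       lambda x, y: x * y,
    "no_mult":     lambda x, y: y,
    "add_mistake": lambda x, y: x + y,
}
k1_branches = {                      # compute circumference from l, w
    "right":         lambda l, w: l * 2 + w * 2,
    "wrong_formula": lambda l, w: l * w,
    "flawed_exec":   lambda l, w: l + w,
}

leaves = {}
for k2_name, k2_f in k2_branches.items():
    length = k2_f(2, W)
    for k1_name, k1_f in k1_branches.items():
        answer = k1_f(length, W)
        leaves.setdefault(answer, []).append((k2_name, k1_name))

# Unique answers pin down the misconception(s); colliding answers need
# a pedagogical expert to pick the likelier branch.
for answer in sorted(leaves):
    print(answer, leaves[answer])
```

Two collisions appear even in this reduced tree: 9 is reachable by two different misconception paths, and with W = 3 the wrong formula W*L on the correctly computed length also yields 18, the correct answer, precisely the poorly-designed-parameter situation footnote [[2]] warns about.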

There are two other types of special diagnosis: give-up and calculation error. Give-up is defined as an answer of 3, 5, or 8[[9]], or a number copied from the question text, such as 16 in Q1 or 28 in Q3. A give-up does not necessarily mean the student cannot solve the item; more likely, she did not bother to. Nevertheless, if a wrong answer is identified as give-up, then all procedural diagnoses are marked give-up. Calculation error is defined as a wrong answer that does not contain any procedural misconception, for example 33[[10]] for Q3. If a wrong answer is identified as a calculation error, then all procedural diagnoses are marked right. A wrong answer is labelled no-diagnosis if it is neither diagnosed with a procedural misconception, nor marked as give-up, nor marked as calculation error[[11]].
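The labeling rules above can be written as a short decision procedure. Function and parameter names are ours; `is_calc_slip` stands in for the expert check that a wrong answer is procedurally sound (a pure arithmetic slip):

```python
# A sketch of the diagnosis priority: right, give-up, procedural
# misconception, calculation error, then no-diagnosis.
def diagnose(answer, correct, question_numbers, misconception_table,
             is_calc_slip=lambda a: False):
    if answer == correct:
        return {"label": "right"}
    if answer in (3, 5, 8) or answer in question_numbers:
        return {"label": "give-up"}            # overrides procedural diagnoses
    if answer in misconception_table:
        return {"label": "misconception",
                "blame": misconception_table[answer]}
    if is_calc_slip(answer):
        return {"label": "calculation-error"}  # all procedures marked right
    return {"label": "no-diagnosis"}

# Q1: fence around a 16m x 5m pool, correct answer (16 + 5) * 2 = 42.
table_q1 = {16 * 5: ["K1: wrong formula"],     # W*L -> 80
            16 + 5: ["K1: flawed execution"]}  # W+L -> 21
print(diagnose(80, 42, {16, 5}, table_q1))
print(diagnose(16, 42, {16, 5}, table_q1))
```

Note that the give-up check fires before the misconception lookup, matching the rule that a give-up overrides all procedural diagnoses.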

Table 1 shows the composition of the wrong responses.

Table 1
Question   Procedural Misconceptions   Give-up   Calculation Error   No-diagnosis
Q1         27%                         10%       8%                  55%
Q2         41%                         47%       0%                  13%
Q3         58%                         6%        6%                  31%

Among the identified procedural misconceptions, Table 2 shows the share of blame. Within this category, the probability that both components are to blame is only about 10%. This means the prevailing BKT practice of blaming every component is logically correct for only about 10% of the wrong responses. No matter how sophisticated the BKT model becomes, if the input data are this seriously compromised, the analysis is likely to be far from the truth.

Table 2
     K1 is to blame   K2 is to blame   Both
Q2   28%              60%              12%
Q3   6%               87%              7%


In addition to better data quality, the vector of diagnoses enables the HMM to form more interpretable learner clusters. Table 3 is the observation matrix of the 3-state BKT model; it is almost impossible to interpret what each state means. Table 4 is the observation matrix of the 3-state BDT model; one can easily name the three clusters: those who skip, those who slip, and those who master.

Table 3
          K1    K2
State 1   93%   97%
State 2   10%   13%
State 3   2%    2%

Table 4
                            State 1   State 2   State 3
K1   Right                  96.6%     68.5%     38.5%
     Wrong Formula          0.6%      4.3%      6.8%
     Flawed Execution       0.2%      3.4%      7.4%
     Give-up                0.2%      5.5%      22.1%
     No Diagnosis           2.3%      18.2%     25.1%
K2   Right                  93.9%     43.2%     33.9%
     No Multiplication      4.4%      36.7%     12.3%
     Mistake for Addition   0.6%      1.3%      1.7%
     Give-up                0.0%      0.9%      39.5%
     No Diagnosis           1.1%      17.9%     12.7%

Table 5 shows the transition matrices, i.e., the learning process. The BDT model only allows transitions between adjacent states, so the corner entries are zero.

Table 5-1
K1        State 1   State 2   State 3
State 1   91.8%     8.2%      0%
State 2   38.6%     38.8%     22.6%
State 3   0%        52.8%     47.2%

Table 5-2
K2        State 1   State 2   State 3
State 1   58.9%     41.1%     0%
State 2   16.4%     60.4%     23.2%
State 3   0%        67%       33%


V. Evidence of Potential Impacts

There are two potentially productive applications of the BDT model. First, it makes the latent states much easier to interpret and communicate. This is essential for human tutors when interfacing with an ITS; they often complain that a probabilistic statement of mastery is either mysterious or meaningless. Second, the authors believe that different diagnoses should lead to different pedagogical interventions.

VI. Summary

The Bayesian Diagnosis Tracing model solves the blaming problem in the context of mastery tracing. The paper provides some evidence of its benefits and notes some of its practical limitations. This paper only scratches the surface of the work that can and needs to be done. First, how much more effective a BDT-driven intelligent tutor is compared to the current system remains to be seen. While we have strong faith in its accuracy, it is expensive and risky to build a solution-tree-diagnosis-compatible assessment content inventory and state-based teaching and practice material, and whether it is commercially viable remains in doubt. Second, the BDT framework needs to be extended to handle multiple diagnoses jointly. In principle, since the likelihood function can be written explicitly, the parameters can be estimated with an MCMC algorithm; however, the computational complexity on real datasets and the convergence quality of the resulting parameters need to be examined.

  • De La Torre, Jimmy. "The generalized DINA model framework." Psychometrika 76.2 (2011): 179-199.
  • Piech, Chris, et al. "Deep knowledge tracing." Advances in Neural Information Processing Systems. 2015.
  • Corbett, Albert T., and John R. Anderson. "Knowledge tracing: Modeling the acquisition of procedural knowledge." User Modeling and User-Adapted Interaction 4.4 (1994): 253-278.
  • VanLehn, Kurt. Mind Bugs: The Origins of Procedural Misconceptions. MIT Press, 1990.




[[1]] In the classical BKT framework, not all wrong responses are created equal: a wrong answer signals incompetence in the non-mastery state, while it signals carelessness for a student in the mastery state. This metaphor runs into trouble when there are three latent states.

[[2]] As demonstrated in the application section, an incorrect reasoning branch may produce the correct answer if the problem is poorly designed. Therefore, branches of the solution tree are categorized by their reasoning process rather than by the answers they produce.

[[3]] The complexity of calculation is ignored in this definition. Chinese pedagogical experts generally disagree with this simplification. However, from the perspective of diagnosis, computational error can be handled separately from the mathematical modeling. In fact, if the error is attributed to calculation error, the solution blames none of the components.

[[4]] This definition is not very precise, because a novel application context challenges a learner's creativity. However, it is not clear where the line should be drawn.

[[5]] If the parameters are not well crafted, multiple branches may produce the same answer. In that case, this paper picks the branch favored by a human expert based on experience.

[[6]] In practice, this means the order of the knowledge components in the solution tree is irrelevant. This is not always true.

[[7]] The origin of this error can be either (L+W)*2 with the parentheses forgotten, or L*2+W*2 with the last *2 forgotten.

[[8]] It can be debated whether the student does not remember the circumference formula or simply forgot to complete the calculation. Here we chalk it up to flawed execution.

[[9]] Because students use a classical 4*3 number pad interface, they can type in a number and submit without thinking. Since a student faces almost no penalty for answering an item wrong, this careless behavior is common.

[[10]] 28+15 = 33

[[11]] For a few answers, the paper also treats typing errors as calculation errors. For example, 421 for Q1 is likely the result of the student accidentally typing an extra "1" after the correct answer 42. These cases are rare.

