Probabilistic Models for Computerized Adaptive Testing: Experiments

Martin Plajner, Jiří Vomlel
Institute of Information Theory and Automation
Academy of Sciences of the Czech Republic
Pod vodárenskou věží 4, Prague 8, CZ-18208, Czech Republic

arXiv:1601.07929v2 [cs.AI] 1 Feb 2016

Abstract

This paper follows previous research we have already performed in the area of Bayesian network models for CAT. We present models using Item Response Theory (IRT, the standard CAT method), Bayesian networks, and neural networks. We conducted simulated CAT tests on empirical data. Results of these tests are presented for each model separately and compared.

1 INTRODUCTION

All of us are in touch with different ability and skill checks almost every day. The computerized form of testing is also getting increasing attention with the spread of computers, smartphones, and other devices which allow easy contact with target groups. This paper focuses on Computerized Adaptive Testing (CAT) (van der Linden and Glas, 2000; Almond and Mislevy, 1999; Almond et al., 2015) and it follows a previous research paper (Plajner and Vomlel, 2015).

In that previous paper we explained the concept of CAT and described our empirical data set. The use of Bayesian networks for CAT was discussed and we constructed different types of Bayesian network models for CAT. These models were tested on empirical data. The results were presented and discussed.

In this paper we present two additional model types for CAT: Item Response Theory (IRT) and neural networks. Moreover, new BN models are proposed. We conducted simulated CAT tests on the same empirical data as in the previous paper. This allows us to compare the two new model types (BN and NN) with the CAT-standard IRT model. Results are presented for each model separately and then all models are compared.

2 CAT PROCEDURE AND MODEL EVALUATION

All models proposed in this paper are supposed to serve for adaptive testing. In this section we briefly outline the process of adaptive testing¹ with the help of these models and the methods for their evaluation. For every model we used similar procedures. The specific details for each model type are discussed in the corresponding sections. At this point we discuss the common aspects.

In every model type we have the following types of variables. For some models they have a different specific name because of an established naming convention of the corresponding method. Nevertheless, the meaning of these variables is the same and we explain the differences for each model type. In this paper we use two types of variables:

- A set of n variables we want to estimate, S = {S_1, ..., S_n}. These variables represent latent skills (abilities, knowledge) of a student. We will call them skills or skill variables. We will use the symbol S to denote the multivariable S = (S_1, ..., S_n) taking states s = (s_{1,i_1}, ..., s_{n,i_n}).
- A set of p questions X = {X_1, ..., X_p}. We will use the symbol X to denote the multivariable X = (X_1, ..., X_p) taking states x = (x_1, ..., x_p).

We collected data from paper tests conducted by grammar school students. The description of the test and its statistics can be found in the paper (Plajner and Vomlel, 2015). Altogether, we have obtained 281 test results. Experiments were performed with each model of each type that is described in the following sections. We used the 10-fold cross-validation method: we learned each model from 9/10 of the randomly divided data, and the remaining 1/10 of the data set served as a testing set. This procedure was repeated 10 times to obtain 10 learned student models with the same structure and different parameters.

¹ Additional information about CAT can be found in (Wainer and Dorans, 2015).
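The cross-validation scheme just described can be sketched as follows (a minimal illustration in Python; the function and variable names are ours, not the paper's actual code):

```python
import random

def ten_fold_splits(results, seed=0):
    """Split the test results into 10 folds; each fold in turn serves
    as the testing set while the remaining 9/10 of the data are used
    for learning one student model."""
    shuffled = list(results)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::10] for i in range(10)]
    for k in range(10):
        test_set = folds[k]
        train_set = [r for j, fold in enumerate(folds) if j != k for r in fold]
        yield train_set, test_set
```

Each of the 10 iterations yields one (train, test) pair, i.e. one of the 10 learned models with the same structure and different parameters.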
With these learned models we simulate CAT using the test sets. For every student in a test set a CAT procedure consists of the following steps:

- The next question to be asked is selected.
- This question is asked and an answer is obtained.
- This answer is inserted into the model.
- The model (which provides estimates of the student's skills) is updated.
- (optional) Answers to all questions are estimated given the current estimates of the student's skills.

This procedure is repeated as long as necessary, i.e., until we reach a termination criterion, which can be, for example, a time restriction, the number of questions, or a confidence interval of the estimated variables. Each of these criteria would lead to a different learning strategy (Vomlel, 2004b), but finding a globally optimal selection with these strategies would be NP-hard (Lín, 2005). We have chosen a heuristic approach based on greedy optimization methods. Methods of question selection differ for each model type and are explained in the respective sections. All of them use the greedy strategy to select questions.

To evaluate the models we performed a simulation of a CAT test for every model and for every student. During testing we first estimated the skill(s) of a student based on his/her answers. Then, based on these estimated skills, we used the model to estimate answers to all questions X. Let the test be in the step s (s-1 questions asked). At the end of the step s (after updating the model with a new answer) we compute marginal probability distributions for all skills S. Then we use these to compute estimates of the answers to all questions, where we select the most probable state of each question X_i ∈ X:

x*_i = argmax_{x_i} P(X_i = x_i | S).

By comparing this value to the real answer x'_i to the i-th question we obtain a success ratio of the response estimates for all questions X_i ∈ X of a test result t (a particular student's result) in the step s:

SR^t_s = ( Σ_{X_i ∈ X} I(x*_i = x'_i) ) / |X|,  where
I(expr) = 1 if expr is true, 0 otherwise.

The total success ratio of one model in the step s for all test data is defined as

SR_s = ( Σ_{t=1}^{N} SR^t_s ) / N.

SR_0 is the success rate of the prediction before asking any questions.

3 ITEM RESPONSE THEORY

The beginnings of Item Response Theory (IRT) stem back five decades and there is a large amount of resources available, for example, (Lord and Novick, 1968; Rasch, 1960, 1993). IRT allows more specific measurements of certain abilities of an examinee. It expects a student to have an ability (skill) which directly influences his/her chance of answering a question correctly. When there is only one such variable², it is common to refer to it as a proficiency variable. This ability is called a latent ability or a latent trait θ. The trait θ corresponds to the general skill S_1 defined in Section 2. Every question of the IRT model has an associated item response function (IRF), which gives the probability of a successful answer given θ.

We fitted our data with the 2-parameter IRT model. This means that the item response functions, giving the probability of a correct answer to the i-th question given the ability θ, are computed by the formula

p_i(θ) = 1 / (1 + e^{-a_i(θ - b_i)}),

where a_i sets the scale of the question (its discrimination ability: a steeper curve better differentiates between students) and b_i is the difficulty of the question (the horizontal position of the curve).

For the question selection step of CAT we use the item information of a question i, which is given by the formula

I_i(θ) = (p'_i(θ))² / ( p_i(θ) q_i(θ) ),

where p'_i is the derivative of the item response function p_i and q_i(θ) = 1 - p_i(θ). This item information provides one, and the most straightforward, way of selecting the next question. In every step the selected question X* is the one with the highest item information:

X*(θ) = argmax_i I_i(θ).

This approach minimizes the standard error of the test procedure (Hambleton et al., 1991) because the standard error of measurement SE_i produced by the i-th item is defined as

SE_i(θ) = 1 / √( I_i(θ) ).

This means that the better the precision we are able to achieve while asking questions, the smaller the error of measurement. The result of the CAT simulation is displayed in Figure 1.

² There are variants of multidimensional IRT models where it is possible to have more than one ability, but in this section we discuss only models with a single one.
Figure 1: Success rates of the IRT model (success rate vs. test step)
We can notice that this model is able to choose the right questions to ask very quickly: its prediction success rises after asking the first two questions. After these questions it cannot improve much anymore. This is caused by the simplicity of the model.

4 BAYESIAN NETWORKS

In this section we use Bayesian networks (BNs) as CAT models. Details about BNs can be found in (Nielsen and Jensen, 2007; Kjærulff and Madsen, 2008). The use of BNs in educational assessment is discussed in (Almond et al., 2015; Culbertson, 2015; Millán et al., 2010). This topic is also discussed, for example, in (Vomlel, 2004a,b).

A Bayesian network is a probabilistic graphical model, a structure representing conditional independence statements. It consists of the following:

- a set of variables (nodes),
- a set of edges,
- a set of conditional probabilities.

Specific details about the use of BNs for CAT can be found in (Plajner and Vomlel, 2015). The types of nodes in our BNs correspond to the types of variables defined in Section 2. In this paper we use question nodes with only Boolean states, i.e., a question is answered either correctly or incorrectly. Edges are usually defined between skills and questions (we present examples of connections in the figures). Conditional probability values have been learned using the standard EM algorithm for BN learning.

In this paper we use a modified method for model scoring compared to the method used in our previous research. The current method is described in Section 2. The difference is that in this case we estimate answers to all questions in the question pool and then compare them to the real answers in every step. In the previous version we were estimating answers only to the unanswered questions in every step. This led to a skewed interpretation of the results, because the value of the denominator of the success rate

SR^t_s = ( Σ_{X_i ∈ X} I(x*_i = x'_i) ) / |X|

was decreasing in every step. The modified version compares all questions and because of that the denominator stays the same in every step.

From the previous models we selected the models marked as "b3" and "expert". The former has Boolean answer values, one skill variable with 3 states, and no additional information (personal data of students) was used; see Figure 2 for its structure. The latter is an expert model with 7 skill nodes (each having 2 states), Boolean answer values, and again no additional information about students; see Figure 3 for its structure.

In this paper we present three new BN models. The first two are modifications of the "b3" model. They have the same structure and differ only in the number of states of their skill node. We present experiments with 4 and 9 states. We performed experiments with other numbers of states as well, but they do not provide more interesting results. Next, we add a modified expert model. This modified model also has Boolean questions and no additional information. We have added one state to the 7 skill nodes from the previous version (they have 3 states in total now). The reason for this addition is an analysis of the question selection criterion.
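The modified success ratio from Section 2, whose denominator is the full question pool |X| in every step, can be sketched as follows (a minimal illustration; the function names are ours):

```python
def success_ratio(estimated, real):
    """SR_s^t: fraction of ALL questions in the pool whose most probable
    estimated state equals the student's real answer.  Because the
    denominator is the full pool size |X|, it stays the same in every
    step of the adaptive test (the modification described above)."""
    if len(estimated) != len(real):
        raise ValueError("estimates must cover the whole question pool")
    hits = sum(1 for e, r in zip(estimated, real) if e == r)
    return hits / len(real)

def total_success_ratio(per_student):
    """SR_s: average of SR_s^t over all N test results."""
    return sum(per_student) / len(per_student)
```

In the previous version the denominator shrank as questions were answered, which inflated later steps; here every step is scored against the same |X|.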
We select questions by minimizing the expected entropy over the skill nodes. With only two states this means that we are pushing a student to one or the other side of the spectrum (basically, we want him to be either good or bad). With 3 states we allow students to approach a mediocre skill quality as well. Moreover, we realized that the model structure in Figure 3 has only very specialized skills. We therefore introduce a new, 8th skill node which connects the previous 7 skill nodes. It represents an overall mathematical skill combining all the other skills. It allows the skills on the lower level to influence each other and to provide evidence between themselves. The final model structure is in Figure 4.

All models are summarized in Table 1:

Model name   Figure   No. of skill nodes   No. of skill states
simple 3s    2        1                    3
simple 4s    2        1                    4
simple 9s    2        1                    9
expert old   3        7                    2
expert new   4        7+1                  3

Table 1: Overview of Bayesian network models

Results of the CAT simulation with BN models are displayed in Figure 5. Increasing the number of states of the single skill node improved the prediction accuracy of the model (simple 4s, simple 9s), but only slightly. As we can see, one additional state (4 states in total) is better than more states (9). This confirms our expectation that simply adding node states cannot improve the model quality for long, due to overfitting of the model. Next, we can observe that there is a large difference between the new and the old expert model. The success rate of the new version exceeds all other models. Adding an additional skill node connecting the other skills proved to be a correct step. The space of possible model structures is still large and it remains to be explored how to create the best possible structure.

5 NEURAL NETWORKS

Neural networks (NNs) are models for approximation of nonlinear functions. For more details about NNs, please refer to (Haykin, 2009; Aleksander and Morton, 1995). There are three different parts of a NN:

1. an input layer,
2. several hidden layers, and
3. an output layer.

We use a NN as a student model. We feed student answers to the input layer. These values are transformed to the hidden layer(s). There is no general rule how to choose the number of hidden layers and their size. In our case we performed experiments with one hidden layer of different sizes. The hidden layer then further transforms to the output layer. NNs are not suitable for unsupervised learning; because of that, we do not estimate an unknown student skill in the output layer, as we would not have any target value needed during the learning step of the NN. Instead, we estimate the score (the test result) of a student directly. The score of a student is known for every student at the time of learning. The output layer then provides an estimate of this score. Nevertheless, this score corresponds to the skill variables described in Section 2.

To select the next question we use the following procedure. We want the selected question to provide us as much information as possible about the tested student. That means that a student who answers incorrectly should be as far as possible on the score scale from another student who answers correctly. Let SC|X_{i,x} be the score prediction after answering the i-th question with the state x, and P(X_{i,x}) the probability of the state x being the answer to the question i. P(X_{i,x}) can be obtained, for example, by a statistical analysis of answers. We select the question X* maximizing the variance of the predicted scores:

X* = argmax_i Var_x( SC|X_{i,x} )
   = argmax_i Σ_x P(X_{i,x}) ( SC|X_{i,x} - E[SC|X_i] )²,  where
E[SC|X_i] = Σ_x P(X_{i,x}) SC|X_{i,x}

is the mean value of the predicted scores.

In our experiments we used one hidden layer with many different numbers of hidden neurons. From these we selected the models with 3, 5, and 7 neurons in the hidden layer because they provide the most interesting results. The structure of the network with 5 hidden neurons is shown in Figure 6. Results of the CAT simulation with NN models are displayed in Figure 7. As we can see in this figure, the quality of the estimates when using NNs increases very slowly. This may be caused by the question selection criterion. If we were selecting better questions, it is possible that the success rate would increase faster. It remains to be explored which selection criterion would provide such questions. Nevertheless, a better question selection would not change the final prediction power of the model (the maximal success rate would not be exceeded). This prediction power could be increased by using a different version of NNs. More specifically, we will perform experiments with one of the recurrent versions of NNs, i.e., Elman's networks or Jordan's networks.
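The variance-maximizing selection rule above can be sketched as follows (a minimal illustration; p_answer and predicted_score stand for the model-derived quantities P(X_{i,x}) and SC|X_{i,x} and are hypothetical interfaces, not the paper's code):

```python
def select_question_by_variance(candidates, p_answer, predicted_score):
    """Pick the question whose possible answers spread the predicted
    score the most: argmax_i Var_x(SC|X_{i,x})."""
    def score_variance(i):
        states = (0, 1)  # Boolean answers, as in the paper's models
        mean = sum(p_answer(i, x) * predicted_score(i, x) for x in states)
        return sum(p_answer(i, x) * (predicted_score(i, x) - mean) ** 2
                   for x in states)
    return max(candidates, key=score_variance)
```

A question whose two possible answers lead to nearly the same predicted score has low variance and is skipped; a question that separates the correct and incorrect answerers far apart on the score scale wins.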
Figure 2: Bayesian network with one hidden variable and personal information about students

6 MODEL COMPARISON AND CONCLUSIONS

We present a graphical comparison of all three model types in Figure 8. One model is selected from each type. We can see that the neural network model scored the worst result. This may be further improved by a better NN structure and a better question selection process. The new BN expert model scores the best. Even in this case we believe that further improvements are possible to increase its success rate. We will focus our future research on methods for BN model creation and on criteria for their comparison. Especially, we would like to use the concept of local structure in BN models (Díez and Druzdzel, 2007). That would allow us to create more complex models, yet with fewer parameters to be estimated during learning. Both previous models can be compared with the IRT model, which is the standard in the field of CAT.

Acknowledgements

The work on this paper has been supported by GACR project n. 16-12010S.

References

Aleksander, I. and Morton, H. (1995). An Introduction to Neural Computing. Information Systems. International Thomson Computer Press.

Almond, R. G. and Mislevy, R. J. (1999). Graphical Models and Computerized Adaptive Testing. Applied Psychological Measurement, 23(3):223-237.

Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., and Williamson, D. M. (2015). Bayesian Networks in Educational Assessment. Statistics for Social and Behavioral Sciences. Springer New York, New York, NY.

Culbertson, M. J. (2015). Bayesian Networks in Educational Assessment: The State of the Field. Applied Psychological Measurement, 40(1):3-21.

Díez, F. J. and Druzdzel, M. J. (2007). Canonical Probabilistic Models for Knowledge Engineering. Technical report, Research Centre on Intelligent Decision-Support Systems.

Hambleton, R. K., Swaminathan, H., and Rogers, H. J. (1991). Fundamentals of Item Response Theory, volume 21. SAGE Publications.

Haykin, S. S. (2009). Neural Networks and Learning Machines. Number v. 10 in Neural networks and learning machines. Prentice Hall.

Kjærulff, U. B. and Madsen, A. L. (2008). Bayesian Networks and Influence Diagrams. Springer.

Lín, V. (2005). Complexity of Finding Optimal Observation Strategies for Bayesian Network Models. In Proceedings of the conference Znalosti, Vysoké Tatry.

Lord, F. M. and Novick, M. R. (1968). Statistical Theories of Mental Test Scores. (Behavioral science: quantitative methods). Addison-Wesley.

Millán, E., Loboda, T., and Pérez-de-la-Cruz, J. L. (2010). Bayesian networks for student model engineering. Computers & Education, 55(4):1663-1683.

Nielsen, T. D. and Jensen, F. V. (2007). Bayesian Networks and Decision Graphs (Information Science and Statistics). Springer.
Figure 3: Bayesian network with 7 hidden variables (the old expert model)
Plajner, M. and Vomlel, J. (2015). Bayesian Network Models for Adaptive Testing. Technical report, arXiv:1511.08488, http://arxiv.org/abs/1511.08488.

Rasch, G. (1960). Studies in mathematical psychology: I. Probabilistic models for some intelligence and attainment tests. Danmarks Paedagogiske Institut.

Rasch, G. (1993). Probabilistic Models for Some Intelligence and Attainment Tests. Expanded Ed. MESA Press.

van der Linden, W. J. and Glas, C. A. W. (2000). Computerized Adaptive Testing: Theory and Practice, volume 13. Kluwer Academic Publishers.

Vomlel, J. (2004a). Bayesian networks in educational testing. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 12(supp01):83-100.

Vomlel, J. (2004b). Building Adaptive Test Using Bayesian Networks. Kybernetika, 40(3):333-348.

Wainer, H. and Dorans, N. J. (2015). Computerized Adaptive Testing: A Primer. Routledge.
Figure 4: Bayesian network with 7+1 hidden variables (the new expert model)
Figure 5: Results of CAT simulation with BNs (success rate vs. step for the models simple_3s, simple_4s, simple_9s, expert_old, expert_new)
Figure 6: Neural network with 5 hidden neurons (inputs X_1, ..., X_m; hidden neurons H_1, ..., H_5; output SC = score)
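The network in Figure 6 can be sketched as a single forward pass (a minimal illustration; the weights here are placeholders, whereas in the paper they are learned from the students' known test scores, and the tanh activation is our assumption):

```python
import math

def predict_score(answers, W_hidden, b_hidden, w_out, b_out):
    """Feed m Boolean answers X_1..X_m through one hidden layer of
    tanh neurons H_1..H_5 to the single output SC (the score)."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, answers)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```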
Figure 7: Results of CAT simulation with NNs (success rate vs. step for the models neural_5, neural_7, neural_10)
Figure 8: CAT simulation results comparison (success rate vs. step for IRT, neural_7, expert_new)