
ERIC EJ1064356: Adaptive Educational Software by Applying Reinforcement Learning PDF

2013·68.2 MB·English
by  ERIC


Informatics in Education, 2013, Vol. 12, No. 1, 13–27
2013 Vilnius University

Adaptive Educational Software by Applying Reinforcement Learning

Abdellah BENNANE
Centre de Formation des Inspecteurs de l'Enseignement (CFIE) UM5, Rabat Morocco & Laboratoire des Systèmes de Télécommunications et Ingénierie de la Décision (LASTID), Université Ibn Tofail, Kénitra, Morocco
e-mail: [email protected]

Received: December 2012

Abstract. The introduction of intelligence into teaching software is the object of this paper. In the software elaboration process, one uses learning techniques in order to adapt the teaching software to the characteristics of the student. Generally, one uses artificial intelligence techniques such as reinforcement learning and Bayesian networks in order to adapt the system to the internal and external conditions of the environment, and to allow the system to interact efficiently with its potential users. The intention is to automate and manage the pedagogical process of the tutoring system, in particular the selection of the content and manner of pedagogic situations. Researchers create a pedagogic learning agent that simplifies the manual logic and supports progress and the management of the teaching process (tutor-learner) through natural interactions.

Keywords: adaptive system, tutoring system, agent, environment, natural interaction, reinforcement learning, pedagogical process.

1. Problematic

The classical conception of intelligent tutoring systems is composed of four elements: the domain model (knowledge), the student model, the pedagogic module, and the communication module (Wenger, 1987).

The pedagogic module is the manager of the pedagogic action undertaken in a tutoring system. Choosing the situations that the system presents to the learners is the function of the pedagogic module. Traditionally, the pedagogic module is a collection of rules elaborated by the author or designer of the system. These rules are defined and developed manually, based on the experience of the author or designer of the tutoring system. This "manual" logic is more subjective than objective, because it is linked to the author and the author's pedagogic experiences.
The tendency of research in this field is to replace the traditional approaches, based on manually established rules, with statistical approaches. It is necessary to develop probabilistic approaches and flexible models which allow the machine to learn (Aimeur, 2004) by itself from received data. The correlation of the learning performance is based on a statistical inference engine which collects information about user behaviors for every individualized course of study and creates a likelihood distribution for the whole training module (Sonwalkar, 2004). Research in machine learning has largely focused on developing algorithms that can label new data, rather than systems that learn naturally in interaction with people. I would like to contribute to changing this focus and to developing a new type of system that learns with its users through natural interaction (Picard et al., 2004).

To solve the problem mentioned above, one asks: is it possible to replace the "manual logic" with a "numerical logic" which can exploit the student's learning route? Can it ensure the adaptability of the teaching environment to learners, and can this be done objectively? The hypothesis is that the natural interaction between the learner and the teaching environment can automate the teaching module. The aim is to prove this hypothesis. Several machine learning techniques have been used to solve this type of problem. Automation saves effort and time and overcomes the numerous adaptability problems of tutoring systems. The pedagogic module can be generated dynamically, based only on the interaction of learners with their teaching environment. The effort was invested in two directions. The first was the design of a database which takes into account the significant characteristics of the pedagogic act. The second was to explore and try out, among the learning techniques based on numerical calculations, the one most appropriate to the field of study, such as the reinforcement learning techniques.

2. Elaboration of Solution

The solution rests on two pillars. The first is pedagogic and concerns the way in which the author organizes and structures the teaching environment in order to respond to the differences which exist between learners. From this perspective, the author finds that differentiated pedagogy can be very useful. The second pillar is technical and concerns the use of a learning agent which will occupy the function of tutor (pedagogue) in the system. To this end, the author uses the reinforcement learning model, which has demonstrated its effectiveness as a learning controller.

The goal of adaptive educational systems is to create an instructionally sound and flexible environment that supports learning for students with a range of abilities, disabilities, interests, backgrounds, and other characteristics (Shute, 2011).

3. Differentiated Pedagogy: Individualization and Variety of Methods

We linked the modular approach and differentiated pedagogy (Przesmycki, 1991) to structure and organize the teaching environment. Differentiated pedagogy renews the conditions of training by opening a maximum of access routes for learners.

In this sense, a pedagogic sequence is a succession of learning situations, and a situation is no more and no less than an encounter of circumstances. A situation poses a problem when it puts the subject in front of a task to be fulfilled for which the subject does not control all the procedures (Hoc, 1996). An apprenticeship is a task that poses a cognitive challenge to the learner. The development of a teaching situation is therefore based on two important parameters, individualization and variety (Bennane, 2001). Individualized teaching is a pedagogy that recognizes the student as an individual with his or her own representations of the teaching act. Individualized teaching or learning adapts to efficiency levels, the rhythm of work, reactions to failure and success, etc. Variety of teaching situations, in turn, is a pedagogy which offers a range of approaches and strategies.
This approach can help to solve the problem of school failure, where the level of learners is missing from or ignored in the evaluation process. In general, when teachers and trainers prepare a teaching module, they focus their effort on the medium learner, so differentiation between learners and individualization are omitted. The design calls for an extra effort from teachers and trainers so that they take individual characteristics, such as study level, into account when they prepare their teaching module. For this reason, a teaching situation will be a package of sub-situations. How?

In general, the learners of a class form a heterogeneous public. In each class, the author finds five groups (subclasses): good, relatively good, medium, weak, and very weak. From this decomposition of the class into learner levels, the author deduces the value 5. We propose this value (5) in order to move forward, knowing that the choice of this value depends on the teaching domain and the level of heterogeneity of the public.

Every situation will contain five sub-situations, taking two dimensions into account. The first concerns the heterogeneity of learners, who can be regrouped into five levels. The second concerns the different learning strategies that will be used.

In developing a learning module, it is advisable (1) to solicit the learner's activity permanently; (2) to treat simple situations first; (3) to lead the learner progressively to master the lesson goals by offering situations of increasing complexity. By following an approach that respects progressiveness, gradual difficulty, and variety of pedagogic methods, a teaching module has every chance of leading the learner to acquire solid and directly operational competences.

Fig. 1. Sequence: difficulty and complexity.

EXAMPLE 1. Figure 1 is a pedagogic sequence of ten situations, with each situation having five sub-situation levels. One asks, what is the number of course possibilities (NCP) in this network (sequence)? The NCP is $5^{10} = 9{,}765{,}625$. This number is very large. With the support of this network, which is open to all possibilities and not frozen, adaptation can come true.
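The count in Example 1 can be checked with a short sketch. The function name `number_of_course_possibilities` is hypothetical, and the sketch assumes that a course is one sub-situation chosen per situation, so the count is the number of levels raised to the number of situations. The brute-force enumeration is only run on a small network, because listing all courses of the full network would be wasteful.

```python
from itertools import product


def number_of_course_possibilities(n_situations: int, n_levels: int) -> int:
    """Count the courses when one sub-situation is picked per situation."""
    return n_levels ** n_situations


# Network of Example 1: ten situations, five sub-situation levels each.
ncp = number_of_course_possibilities(10, 5)
print(ncp)  # 9765625

# Cross-check the closed form by enumerating a small network
# (3 situations, 2 levels), where every course can be listed explicitly.
small_courses = list(product(range(2), repeat=3))
assert len(small_courses) == number_of_course_possibilities(3, 2)
```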
The learning agent that was designed and implemented offers its services to adapt the learning environment to learners, and not the contrary.

4. Reinforcement Learning

4.1. Introduction

Reinforcement learning (RL) is the study of how animals and artificial systems can learn to optimize their behavior in response to rewards and punishments. RL algorithms have been developed that are closely related to the methods of dynamic programming, which is a general approach to optimal control (Sutton et al., 1998). RL phenomena have been observed in psychological studies of animal behavior, in neurobiological investigations, etc. (Dayan et al., 2001). One way in which animals learn complex behavior is by learning to get rewards and avoid punishments. For this type of learning, RL theory is a formal computational model (Fig. 2).

In the standard reinforcement learning paradigm, an agent is connected to an environment by perception and action. A learning agent (an animal, a robot, etc.) repeatedly observes the state of the environment and then selects and executes an action. The execution of the action changes the state of the "world" and the agent acquires an immediate numerical reward. Positive earnings are called "rewards" and negative ones are called "punishments".

Reinforcement learning consists of a set of concepts and algorithms. RL is not defined by a certain class of algorithms but by the problem which it tries to solve, that is, optimal control.

RL is traditionally defined in the framework of Markov decision processes (MDPs). Bellman founded the theory of MDPs (Bellman, 1957a, 1957b) by unifying previous work on sequential analysis, statistical decision functions (Wald, 1950), and models of two-person dynamic games (Shapley, 1953). An MDP is a quadruplet $(S, A, P, R)$ such that $S$ is a set of states, $A$ is a set of actions, and $P$ and $R$ are the probability distribution and the rewards, respectively.

At time $t$, an agent receives situation/state $s_t$ and chooses an action $a_t$; the world changes to situation/state $s_{t+1}$; and the agent perceives situation $s_{t+1}$ and gets reward $r_{t+1}$.

Fig. 2. Environment–agent.

Learning is a mapping from situations to actions in order to maximize a reinforcement signal (scalar reward).

The general reinforcement learning algorithm may be:
1. Initialize learner's internal state;
2. Do for a long time:
   • Observe current state (s);
   • Choose action (a) using the policy;
   • Execute action (a);
   • Let (r) be immediate reward, (s') new state;
   • Update internal state based on (s, a, r, s').
3. Output a policy based on (Hayes, 2008).

4.2. Methods of Resolutions

In reinforcement learning, there are three methods of solving the learning problem: the model-based method, the model-free method, and the unified method (planning and learning method).

The model-based method finds the environment model $P$ and $R$ and then uses dynamic programming techniques. Let there be given a database with $m$ observations $(s_t, a_t, s_{t+1}, r_t) \in S \times A \times S \times \mathbb{R}$ generated from some experiences. This method works in two steps:
1. Extraction of the probability distribution and the reinforcement function. The estimated values of these distributions are based on the occurrences of $(s_t, a_t, s_{t+1}, r_t)$ in the database:
   • $P(s_{t+1} \mid s_t, a_t)$ is the probability that taking action $a_t$ in state $s_t$ will result in a transition to state $s_{t+1}$.
   • $r(s_t, a_t, s_{t+1})$ is the expected reward when transitioning from $s_t$ to $s_{t+1}$ by action $a_t$.
2. Use of dynamic programming techniques to determine the optimal policy. This goal is achieved indirectly by computing the Q-values, using the Bellman optimality equation:

$$Q^*(s, a) = \sum_{s'} P_{sas'} \left( R_{sas'} + \gamma \max_{a'} Q^*(s', a') \right) \quad \forall (s, a) \in S \times A,$$

where $\gamma$ is a discount factor, $0 \le \gamma \le 1$.

The model-free method avoids the explicit calculation of the environment model. The Monte Carlo and temporal-difference techniques are used. Q-Learning, for example, generates a sequence of functions $Q_1, Q_2, Q_3, \ldots$ It is an approximation of the Q-value, computed online and without a model. It allows an optimal policy to be constructed directly.
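As an illustration of the model-free method, the following is a minimal sketch of tabular Q-Learning that follows the observe, choose, execute, update loop of the general algorithm. It is not the author's implementation. The environment `toy_step` (a chain of three situations where action 1 advances and action 0 stays), its reward values, and the parameters `alpha` (step size) and `epsilon` (exploration rate) are all assumptions made for the example.

```python
import random
from collections import defaultdict


def q_learning(step, start_state, actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning; `step(s, a)` must return (s_next, reward, done)."""
    rng = random.Random(seed)
    q = defaultdict(float)  # (state, action) -> current Q-value estimate
    for _ in range(episodes):
        s, done = start_state, False
        while not done:
            # Epsilon-greedy policy: explore sometimes, otherwise act greedily.
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: q[(s, act)])
            s_next, r, done = step(s, a)
            # Temporal-difference update toward r + gamma * max_a' Q(s', a').
            best_next = 0.0 if done else max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q


def toy_step(state, action):
    """Hypothetical environment: situations 0, 1, 2, then a terminal state 3."""
    if action == 1:  # advance to the next situation
        nxt = state + 1
        done = nxt == 3
        return nxt, (1.0 if done else 0.0), done
    return state, -0.1, False  # stay in place with a small penalty


q_table = q_learning(toy_step, start_state=0, actions=[0, 1])
greedy_policy = {s: max([0, 1], key=lambda a: q_table[(s, a)]) for s in range(3)}
```

In this toy chain the greedy policy read from `q_table` chooses to advance in every situation, since staying is penalized and only reaching the end is rewarded.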
The planning and learning method (unified method) is a unified view of methods that require a model of the environment, such as dynamic programming, and methods that can be used without a model, such as temporal-difference methods. We think of the former as planning methods and of the latter as learning methods. Within a planning agent, there are at least two roles for real experience: it can be used to improve the model, and it can be used to directly improve the value function and policy using direct reinforcement learning methods (Sutton et al., 1998; see Fig. 3).

Fig. 3. The general Dyna architecture.

The heart of planning and learning methods is the estimation of value functions by backup operations. The difference is that whereas planning uses simulated experience generated by a model, learning methods use real experience generated by the environment.

Both direct and indirect methods have advantages and disadvantages. Indirect methods often make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. On the other hand, direct methods are much simpler and are not affected by the design of the model, but they are expensive in the number of experiences (Sutton, 1998).

4.3. Interaction Scenario

The choice of RL is due to the nature of the model. It is adapted to human learning, as has been done since the work of pioneers such as Watson, Pavlov, Skinner, Bellman, etc.

The choice starts from the following idea: in the teaching environment, one finds two agents. The first one is external to the system. It is a student who needs to learn a teaching module. This natural agent needs a tutor who can select situations adapted to the student's level. The second agent is internal and represents the tutor (the pedagogic agent). The pedagogic agent cannot fulfill this function unless it has the ability to learn. The RL theory and model can enable the pedagogic agent to learn through trial and error (experience). In other words, the pedagogic agent's learning can be achieved only through the student's learning (Fig. 4).

Fig. 4. Interaction scenario.

4.4. Dynamic Generation of Pedagogic Multi-Agents

Teaching and learning can take place anywhere, synchronously or asynchronously. Our goal is that the characteristics of learners, such as level of study among others, be taken into account by the tutoring system via specialized agents in order to produce adaptive pedagogic courses. From this perspective, the tutoring system must have, on one hand, an agent that supports the student model. On the other hand, the dynamic generation process of pedagogic agents starts each time a student connects to the system, and a new agent is created automatically. The student will be assisted by an agent who individualizes the learning course, taking into account the student's particular features (Fig. 5).

Fig. 5. Student and dynamic generation of agents.

For distributed tutoring systems, the dynamic generation of pedagogical agents may become a key element supporting the adaptability of the system to potential users (Bennane, 2010).

4.5. RL and Multi-Agent Systems (MAS)

Agent-based systems technology has generated lots of excitement in recent years because of its promise as a new paradigm for conceptualizing, designing, and implementing software systems. This promise is particularly attractive for creating software that operates in environments that are distributed and open, such as the internet. Currently, the great majority of agent-based systems consist of a single agent (Sycara, 1998).

Research in MAS is concerned with the study, behaviour and construction of a collection of possibly pre-existing autonomous agents that interact with each other and their environments.

A stochastic game is a tuple $(A, S, \{U_i\}_{i \in A}, p, \{r_i\}_{i \in A})$ where:
• $A = \{1, \ldots, n\}$ is a set of agents;
• $S$ is a finite state space;
• $U_i$ is the finite set of actions of agent $i$, and $U = \times_{i \in A} U_i$;
• $p : S \times U \times S \to [0, 1]$ is the state transition probability distribution;
• $r_i : S \times U \times S \to \mathbb{R}$ is the reward function of agent $i$.

State transitions and rewards depend on the joint action. The consequence for learning is that the optimal policy depends on the policies of the other agents.
According to their tasks, MAS may be classified into three kinds: (1) cooperative tasks, where $r_1 = r_2 = \cdots = r_n$; (2) competitive tasks, where $r_1 = -r_2$ (zero-sum games); (3) mixed tasks, which are the general case (general-sum games).

There are three RL approaches in MAS. The first applies single-agent reinforcement learning by ignoring the presence of other agents. It considers other agents indirectly, through the reward signal. The second approach guarantees convergence independently of other agents, but it is aware of learning agents. In the third, an agent adapts to other agents in order to strive for optimality (Babuska, 2006). Better results can be obtained if the agents attempt to model each other. In this case, each agent maintains an action value function $Q^{(i)}(s, a)$ for all state and joint action pairs. A joint action (a) is executed in the state (s) and a new state (s') is observed, and each agent (i) then updates its $Q^{(i)}(s, a)$. The first and second approaches are the simplest case, in which the agents learn independently of each other. Each agent treats the other agents as part of the environment, and does not attempt to model them or predict their actions (Vlassis, 2003).

This is a model of stochastic games, also called Markov games (Tuyls, 2005). In this approach, it is certain that the influence of an agent on all other agents can be modelled so that the Markov property still stands. This, combined with a unique solution, such as the notion of bootstrap Stackelberg equilibrium, is generally performed in the RL techniques (Könönen, 2004).

In reinforcement learning, the behaviour of a single agent is specified by a policy. In stochastic games, policies are separate for each agent, and the goal of each agent is to find an equilibrium policy, i.e., a policy that is the best response to the policies of the "adversaries".

Independent RL agents try to optimize their behaviour without any form of communication with other agents. They use only the feedback received from the environment.
These independent agents can use the traditional RL algorithms developed for the stationary, single-agent case. Note that the feedback from the environment generally depends on a combination of actions taken by multiple agents, not only by a single agent. In this case, the Markov property no longer holds, and the problem becomes dependent and non-stationary.

In multi-agent environments, if the behaviour of the other agents converges, i.e., the selection distribution becomes stationary in the limit, the Q-learning update converges to the optimal function $Q^*$ with probability one.

In many real applications where RL techniques are used in MAS, RL techniques for a single agent are applied directly. Although some assumptions that underlie these learning models are violated, the methods work surprisingly well in many cases (Könönen, 2004).

5. Student Model

How can an agent select sub-situations adapted to the student's level? In order to answer this question, we must first present the function of the student model (Vanlehn, 1988). A student model is a qualitative representation of student knowledge of a domain, a particular subject or a competence, which can take into account, fully or partially, specific aspects of student behavior (Sison et al., 1998). In other words, it describes the objects and processes in spatial, temporal or causal relationship (Clancey, 1986). In this sense, the student model may have two functions. First, it is a memory storing all transactions passing from one object to another, organized using at least the quadruplet (situation, action, next situation, reward). The second function is a set of methods using dynamic resources to manage the student model. The pedagogic module selects the student's teaching course based on student behaviors, and the pedagogic agent needs some data and methods in order to meet the requirements of the goal task. Its goal is to determine an optimal function $Q^*$ (Sutton et al., 1998) that will be the criterion of sub-situation selection. It is the engine of the tutoring system's adaptability. It uses the student model database in order to compute the optimal policy.
In our case, we first determine the environment model and then compute the optimal value function.

6. Looking for an Optimal Policy

6.1. Model-Based Method

In order to determine the optimal function $Q^*$, which represents the selection criterion of the student's adaptive course, we have adopted the model-based method because:
• With few environment interactions, one can deduce the optimal policy (Sutton et al., 1998);
• The state space dimension is modest;
• The exploration and exploitation phases will be separated. This separation makes it easy to compare the two phases according to criteria such as "transition probability distribution", "reduction of time in the student learning process", etc.

The values of the optimal function $Q^*$ are computed as follows:
• collect $m$ interactions $(s_t, a_t, s_{t+1}, r_t)$;
• compute the values of the probability distribution;
• compute the values of the reward function;
• compute the values of the optimal function.

The environment model is determined, and then the optimal values are updated using Q-Learning.

6.2. Probabilities Distribution and Reward Function

According to the organization of a teaching sequence as we have defined it, the successors of a given situation (s) are:
• (s) itself if the action ends in failure;
• (s') if the action ends in success.

The reward function is defined from $S \times A \times S$ to $\mathbb{R}$ and is denoted $f_r$. We underline that the values $Ra_1^{su}, Ra_1^{ec}, Ra_2^{su}, Ra_2^{ec}, Ra_3^{su}, Ra_3^{ec}, Ra_4^{su}, Ra_4^{ec}, Ra_5^{su}, Ra_5^{ec}$, shown as indicative in Fig. 6, are unknowns.

The choice of the reward function is not evident because, in several cases, the proposed reward function does not satisfy the given hypothesis (Beck et al., 2000) needed to determine the optimal policy.

The probability distribution is defined from $S \times A \times S$ to $[0, 1]$ and is denoted $P$. We underline that the values $\alpha$, $\chi$, $\delta$, $\beta$, and $\varphi$, shown as indicative in Fig. 6, are unknowns.

The values of the two functions $f_r$ and $P$ are unknown. They will be learned by experience in order to determine the environment model.
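The values of the probability distribution are computed as follows:

$$P(s, a, s') = \frac{\left|\{x \in LS \mid s_t(x) = s,\ a_t(x) = a,\ s_{t+1}(x) = s' \text{ and } s' \neq s\}\right|}{\left|\{x \in LS \mid s_t(x) = s,\ a_t(x) = a\}\right|} \quad \text{if } s \neq s',$$

$$P(s, a, s) = \frac{\left|\{x \in LS \mid s_t(x) = s,\ a_t(x) = a,\ s_{t+1}(x) = s\}\right|}{\left|\{x \in LS \mid s_t(x) = s,\ a_t(x) = a\}\right|} \quad \text{if } s' = s,$$

$$P(s, a, s') + P(s, a, s) = 1.$$

Fig. 6. State diagram from a non-terminal state (s) to another (s').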
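The counting estimate above, together with the steps of Section 6.1, can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `estimate_model` and `q_value_iteration`, the state and action labels, and the reward values in the example log are all made up. The dynamic programming step uses plain value iteration on the Bellman optimality equation of Section 4.2, a swapped-in technique, whereas the text mentions updating the optimal values with Q-Learning.

```python
from collections import defaultdict


def estimate_model(log):
    """Estimate P and the reward function from logged (s, a, s_next, r) tuples."""
    counts = defaultdict(int)         # (s, a, s_next) -> number of occurrences
    totals = defaultdict(int)         # (s, a) -> number of occurrences
    reward_sums = defaultdict(float)  # (s, a, s_next) -> summed rewards
    for s, a, s_next, r in log:
        counts[(s, a, s_next)] += 1
        totals[(s, a)] += 1
        reward_sums[(s, a, s_next)] += r
    # Relative frequency of each successor, as in the counting formulas.
    P = {key: n / totals[key[:2]] for key, n in counts.items()}
    # Average observed reward for each transition.
    R = {key: reward_sums[key] / n for key, n in counts.items()}
    return P, R


def q_value_iteration(P, R, gamma=0.9, sweeps=200):
    """Repeatedly apply the Bellman optimality backup to every (s, a) pair."""
    successors = defaultdict(list)  # (s, a) -> [(s_next, probability)]
    actions_in = defaultdict(set)   # s -> actions observed in s
    for (s, a, s_next), p in P.items():
        successors[(s, a)].append((s_next, p))
        actions_in[s].add(a)
    q = {sa: 0.0 for sa in successors}
    for _ in range(sweeps):
        new_q = {}
        for (s, a), outcomes in successors.items():
            value = 0.0
            for s_next, p in outcomes:
                # States with no observed action are treated as terminal.
                best = max((q[(s_next, b)] for b in actions_in.get(s_next, ())),
                           default=0.0)
                value += p * (R[(s, a, s_next)] + gamma * best)
            new_q[(s, a)] = value
        q = new_q
    return q


# Hypothetical log: in situation "s1" the action succeeds half of the time.
log = [("s1", "a", "s2", 1.0), ("s1", "a", "s1", 0.0),
       ("s1", "a", "s2", 1.0), ("s1", "a", "s1", 0.0),
       ("s2", "a", "end", 1.0)]
P, R = estimate_model(log)
q_star = q_value_iteration(P, R)
```

For the log above, the estimated probabilities of success and failure in "s1" are both 0.5 and sum to 1, matching the constraint $P(s, a, s') + P(s, a, s) = 1$.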
