arXiv:cs.LG/9901002 v1 10 Jan 1999

KnightCap: A chess program that learns by combining TD(λ) with game-tree search

Jonathan Baxter, Department of Systems Engineering, Australian National University, Canberra 0200, Australia
Andrew Tridgell, Department of Computer Science, Australian National University, Canberra 0200, Australia
Lex Weaver, Department of Computer Science, Australian National University, Canberra 0200, Australia

Abstract

In this paper we present TDLeaf(λ), a variation on the TD(λ) algorithm that enables it to be used in conjunction with game-tree search. We present some experiments in which our chess program "KnightCap" used TDLeaf(λ) to learn its evaluation function while playing on the Free Internet Chess Server (FICS, fics.onenet.net). The main success we report is that KnightCap improved from a 1650 rating to a 2150 rating in just 308 games and 3 days of play. As a reference, a rating of 1650 corresponds to about level B human play (on a scale from E (1000) to A (1800)), while 2150 is human master level. We discuss some of the reasons for this success, principal among them being the use of on-line play rather than self-play.

1 Introduction

Temporal Difference learning, first introduced by Samuel [5] and later extended and formalized by Sutton [7] in his TD(λ) algorithm, is an elegant technique for approximating the expected long-term future cost (or cost-to-go) of a stochastic dynamical system as a function of the current state. The mapping from states to future cost is implemented by a parameterized function approximator such as a neural network. The parameters are updated online after each state transition, or possibly in batch updates after several state transitions. The goal of the algorithm is to improve the cost estimates as the number of observed state transitions and associated costs increases.

Perhaps the most remarkable success of TD(λ) is Tesauro's TD-Gammon, a neural network backgammon player that was trained from scratch using TD(λ) and simulated self-play. TD-Gammon is competitive with the best human backgammon players [9]. In TD-Gammon the neural network played a dual role, both as a predictor of the expected cost-to-go of the position and as a means to select moves. In any position the next move was chosen greedily by evaluating all positions reachable from the current state, and then selecting the move leading to the position with smallest expected cost. The parameters of the neural network were updated according to the TD(λ) algorithm after each game.

Although the results with backgammon are quite striking, there is lingering disappointment that despite several attempts, they have not been repeated for other board games such as othello, Go and the "drosophila of AI", chess [10, 12, 6].

Many authors have discussed the peculiarities of backgammon that make it particularly suitable for Temporal Difference learning with self-play [8, 6, 4]. Principal among these are speed of play (TD-Gammon learnt from several hundred thousand games of self-play), representation smoothness (the evaluation of a backgammon position is a reasonably smooth function of the position, viewed, say, as a vector of piece counts, making it easier to find a good neural network approximation), and stochasticity (backgammon is a random game, which forces at least a minimal amount of exploration of the search space).

As TD-Gammon in its original form only searched one ply ahead, we feel this list should be appended with: shallow search is good enough against humans. There are two possible reasons for this: either one does not gain a lot by searching deeper in backgammon (questionable given that recent versions of TD-Gammon search to three ply and this significantly improves their performance), or humans are simply incapable of searching deeply and so TD-Gammon is only competing in a pool of shallow searchers. Although we know of no psychological studies investigating the depth to which humans search in backgammon, it is plausible that the combination of high branching factor and random move generation makes it quite difficult to search more than one or two ply ahead. In particular, random move generation effectively prevents selective search or "forward pruning" because it enforces a lower bound on the branching factor at each move.
In contrast, finding a representation for chess, othello or Go which allows a small neural network to order moves at one ply with near-human performance is a far more difficult task. It seems that for these games, reliable tactical evaluation is difficult to achieve without deep lookahead. As deep lookahead invariably involves some kind of minimax search, which in turn requires an exponential increase in the number of positions evaluated as the search depth increases, the computational cost of the evaluation function has to be low, ruling out the use of expensive evaluation functions such as neural networks. Consequently most chess and othello programs use linear evaluation functions (the branching factor in Go makes minimax search to any significant depth nearly infeasible).

In this paper we introduce TDLeaf(λ), a variation on the TD(λ) algorithm that can be used to learn an evaluation function for use in deep minimax search. TDLeaf(λ) is identical to TD(λ) except that instead of operating on the positions that occur during the game, it operates on the leaf nodes of the principal variation of a minimax search from each position (also known as the principal leaves).

To test the effectiveness of TDLeaf(λ), we incorporated it into our own chess program, KnightCap. KnightCap has a particularly rich board representation enabling relatively fast computation of sophisticated positional features, although this is achieved at some cost in speed (KnightCap is about 10 times slower than Crafty, the best public-domain chess program, and 6,000 times slower than Deep Blue).

We trained KnightCap's linear evaluation function using TDLeaf(λ) by playing it on the Free Internet Chess Server (FICS, fics.onenet.net) and on the Internet Chess Club (ICC, chessclub.com). Internet play was used to avoid the premature convergence difficulties associated with self-play (footnote 1). The main success story we report is that starting from an evaluation function in which all coefficients were set to zero except the values of the pieces, KnightCap went from a 1650-rated player to a 2150-rated player in just three days and 308 games. KnightCap is an ongoing project with new features being added to its evaluation function all the time. We use TDLeaf(λ) and Internet play to tune the coefficients of these features.

Footnote 1: Randomizing move choice is another way of avoiding problems associated with self-play (this approach has been tried in Go [6]), but the advantage of the Internet is that more information is provided by the opponent's play.

The remainder of this paper is organized as follows. In section 2 we describe the TD(λ) algorithm as it applies to games. The TDLeaf(λ) algorithm is described in section 3. Experimental results for Internet play with KnightCap are given in section 4. Section 5 contains some discussion and concluding remarks.

2 The TD(λ) algorithm applied to games

In this section we describe the TD(λ) algorithm as it applies to playing board games. We discuss the algorithm from the point of view of an agent playing the game.

Let S denote the set of all possible board positions in the game. Play proceeds in a series of moves at discrete time steps t = 1, 2, .... At time t the agent finds itself in some position x_t ∈ S, and has available a set of moves, or actions, A_{x_t} (the legal moves in position x_t). The agent chooses an action a ∈ A_{x_t} and makes a transition to state x_{t+1} with probability p(x_t, x_{t+1}, a). Here x_{t+1} is the position of the board after the agent's move and the opponent's response. When the game is over, the agent receives a scalar reward, typically "1" for a win, "0" for a draw and "-1" for a loss.

For ease of notation we will assume all games have a fixed length of N (this is not essential). Let r(x_N) denote the reward received at the end of the game. If we assume that the agent chooses its actions according to some function a(x) of the current state x (so that a(x) ∈ A_x), the expected reward from each state x ∈ S is given by

    J*(x) := E_{x_N|x} r(x_N),    (1)

where the expectation is with respect to the transition probabilities p(x_t, x_{t+1}, a(x_t)) and possibly also with respect to the actions a(x_t) if the agent chooses its actions stochastically.

For very large state spaces S it is not possible to store the value of J*(x) for every x ∈ S, so instead we might try to approximate J* using a parameterized function class J~: S × R^k → R, for example linear functions, splines, neural networks, etc. J~(·, w) is assumed to be a differentiable function of its parameters w = (w_1, ..., w_k). The aim is to find a parameter vector w ∈ R^k that minimizes some measure of error between the approximation J~(·, w) and J*(·). The TD(λ) algorithm, which we describe now, is designed to do exactly that.
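KnightCap itself uses a linear evaluation function of exactly this kind. As a point of reference, the following is a minimal sketch of such a function class in Python; the feature map is a hypothetical placeholder, not KnightCap's actual board representation.

    import numpy as np

    def features(x):
        # Hypothetical feature map phi(x) for a position x. In a chess program this
        # would include material counts, mobility, king safety, and so on.
        return np.asarray(x, dtype=float)

    def J_tilde(x, w):
        # Linear evaluation: J~(x, w) = <w, phi(x)>.
        return float(np.dot(w, features(x)))

    def grad_J_tilde(x, w):
        # For a linear evaluation the gradient with respect to w is simply phi(x).
        return features(x)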
Suppose x_1, ..., x_{N-1}, x_N is a sequence of states in one game. For a given parameter vector w, define the temporal difference associated with the transition x_t → x_{t+1} by

    d_t := J~(x_{t+1}, w) - J~(x_t, w).    (2)

Note that d_t measures the difference between the reward predicted by J~(·, w) at time t+1, and the reward predicted by J~(·, w) at time t. The true evaluation function J* has the property

    E_{x_{t+1}|x_t} [J*(x_{t+1}) - J*(x_t)] = 0,

so if J~(·, w) is a good approximation to J*, E_{x_{t+1}|x_t} d_t should be close to zero. For ease of notation we will assume that J~(x_N, w) = r(x_N) always, so that the final temporal difference satisfies

    d_{N-1} = J~(x_N, w) - J~(x_{N-1}, w) = r(x_N) - J~(x_{N-1}, w).

That is, d_{N-1} is the difference between the true outcome of the game and the prediction at the penultimate move.

At the end of the game, the TD(λ) algorithm updates the parameter vector w according to the formula

    w := w + α Σ_{t=1}^{N-1} ∇J~(x_t, w) [ Σ_{j=t}^{N-1} λ^{j-t} d_j ],    (3)

where ∇J~(·, w) is the vector of partial derivatives of J~ with respect to its parameters. The positive parameter α controls the learning rate and would typically be "annealed" towards zero during the course of a long series of games. The parameter λ ∈ [0, 1] controls the extent to which temporal differences propagate backwards in time. To see this, compare equation (3) for λ = 0:

    w := w + α Σ_{t=1}^{N-1} ∇J~(x_t, w) d_t
       = w + α Σ_{t=1}^{N-1} ∇J~(x_t, w) [ J~(x_{t+1}, w) - J~(x_t, w) ]    (4)

and for λ = 1:

    w := w + α Σ_{t=1}^{N-1} ∇J~(x_t, w) [ r(x_N) - J~(x_t, w) ].    (5)

Consider each term contributing to the sums in equations (4) and (5). For λ = 0 the parameter vector is being adjusted in such a way as to move J~(x_t, w), the predicted reward at time t, closer to J~(x_{t+1}, w), the predicted reward at time t+1. In contrast, TD(1) adjusts the parameter vector in such a way as to move the predicted reward at time step t closer to the final reward at time step N. Values of λ between zero and one interpolate between these two behaviors. Note that (5) is equivalent to gradient descent on the error function E(w) := Σ_{t=1}^{N-1} [ r(x_N) - J~(x_t, w) ]².

Successive parameter updates according to the TD(λ) algorithm should, over time, lead to improved predictions of the expected reward J~(·, w). Provided the actions a(x_t) are independent of the parameter vector w, it can be shown that for linear J~(·, w) the TD(λ) algorithm converges to a near-optimal parameter vector [11]. Unfortunately, there is no such guarantee if J~(·, w) is non-linear [11], or if a(x_t) depends on w [2].
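Equation (3) can be applied directly once a game's positions and the final reward are known. The following is a minimal sketch of the end-of-game update, assuming the hypothetical J_tilde and grad_J_tilde helpers sketched earlier; it illustrates the formula rather than KnightCap's implementation.

    def td_lambda_update(positions, outcome, w, alpha=1.0, lam=0.7):
        # One end-of-game TD(lambda) update, equation (3).
        # positions: x_1, ..., x_{N-1}; outcome: r(x_N), e.g. +1 win, 0 draw, -1 loss.
        values = [J_tilde(x, w) for x in positions] + [float(outcome)]  # J~(x_N, w) := r(x_N)
        d = [values[t + 1] - values[t] for t in range(len(values) - 1)]  # temporal differences
        w_new = np.array(w, dtype=float)
        for t in range(len(d)):
            # lambda-weighted sum of the temporal differences from t onwards
            weighted = sum(lam ** (j - t) * d[j] for j in range(t, len(d)))
            w_new += alpha * grad_J_tilde(positions[t], w) * weighted
        return w_new

With λ = 0 the inner sum collapses to d_t, giving the one-step update (4); with λ = 1 each prediction is moved toward the final outcome, as in (5).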
3 Minimax Search and TD(λ)

For argument's sake, assume any action a taken in state x leads to a predetermined state which we will denote by x'_a. Once an approximation J~(·, w) to J* has been found, we can use it to choose actions in state x by picking the action a ∈ A_x whose successor state x'_a minimizes the opponent's expected reward (footnote 2):

    a*(x) := argmin_{a ∈ A_x} J~(x'_a, w).    (6)

Footnote 2: If successor states are only determined stochastically by the choice of a, we would choose the action minimizing the expected reward over the choice of successor states.

This was the strategy used in TD-Gammon. Unfortunately, for games like othello and chess it is very difficult to accurately evaluate a position by looking only one move or ply ahead. Most programs for these games employ some form of minimax search. In minimax search, one builds a tree from position x by examining all possible moves for the computer in that position, then all possible moves for the opponent, and then all possible moves for the computer, and so on to some predetermined depth d. The leaf nodes of the tree are then evaluated using a heuristic evaluation function (such as J~(·, w)), and the resulting scores are propagated back up the tree by choosing at each stage the move which leads to the best position for the player on the move. See figure 1 for an example game tree and its minimax evaluation. With reference to the figure, note that the evaluation assigned to the root node is the evaluation of the leaf node of the principal variation: the sequence of moves taken from the root to the leaf if each side chooses the best available move.

In practice many engineering tricks are used to improve the performance of the minimax algorithm, α-β search being the most famous.

Figure 1: Full breadth, 3-ply search tree illustrating the minimax rule for propagating values. Each of the leaf nodes (H-O) is given a score by the evaluation function, J~(·, w). These scores are then propagated back up the tree by assigning to each opponent's internal node the minimum of its children's values, and to each of our internal nodes the maximum of its children's values. The principal variation is then the sequence of best moves for either side starting from the root node, and this is illustrated by a dashed line in the figure. Note that the score at the root node A is the evaluation of the leaf node (L) of the principal variation. As there are no ties between any siblings, the derivative of A's score with respect to the parameters w is just ∇J~(L, w).
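What will matter below is not only the backed-up score but also which leaf produced it. The following is a minimal sketch of full-width minimax that returns both; legal_moves and apply_move are hypothetical placeholders, and a real program such as KnightCap uses α-β pruning and MTD(f) rather than this naive search.

    def minimax_with_leaf(x, depth, maximizing, evaluate, legal_moves, apply_move):
        # Full-width, depth-limited minimax returning (score, principal leaf).
        # evaluate(x): heuristic leaf evaluation, e.g. J~(x, w).
        # legal_moves(x), apply_move(x, a): hypothetical move generator / successor function.
        moves = legal_moves(x)
        if depth == 0 or not moves:
            return evaluate(x), x
        best_score, best_leaf = None, None
        for a in moves:
            score, leaf = minimax_with_leaf(apply_move(x, a), depth - 1, not maximizing,
                                            evaluate, legal_moves, apply_move)
            better = (best_score is None
                      or (maximizing and score > best_score)
                      or (not maximizing and score < best_score))
            if better:
                best_score, best_leaf = score, leaf  # ties broken arbitrarily (cf. figure 2)
        return best_score, best_leaf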
Thus we can tionsx ,...,x ina gamewe definethe temporaldiffer- 1 N still arbitrarily choose a particular value for ∇J˜(x,w) if ences d whappenstolandononeofthebadpoints. dt :=J˜d(xt+1,w)−J˜d(xt,w) (7) Based on these observations we modified the TD(λ) al- gorithm to take account of minimax search in an almost as per equation (2), and then the TD(λ) algorithm (3) for trivial way: instead of working with the root positions updatingtheparametervectorwbecomes x ,...,x ,theTD(λ)algorithmisappliedtotheleafpo- 1 N sitions found by minimax search from the root positions. N−1 N−1 w :=w+α ∇J˜(x ,w) λj−td . (8) WecallthisalgorithmTDLeaf(λ). Fulldetailsaregivenin X d t X t figure3. t=1 j=t  Oneproblemwithequation(8)isthatford > 1,J˜(x,w) d isnotnecessarilyadifferentiablefunctionofwforallval- 4 TDLeaf(λ) andChess uesofw,evenifJ˜(·,w)iseverywheredifferentiable.This isbecauseforsomevaluesofw therewillbe“ties”inthe In this section we describe the outcome of several ex- minimaxsearch,i.e.therewillbemorethanonebestmove periments in which the TDLeaf(λ) algorithm was used availableinsomeofthepositionsalongtheprincipalvari- to train the weights of a linear evaluation function in ation,whichmeansthattheprincipalvariationwillnotbe our chess program “KnightCap”. KnightCap is a reason- unique(seefigure2). Thus,theevaluationassignedtothe rootnode,J˜(x,w), willbetheevaluationofanyoneofa ably sophisticated computer chess program for Unix sys- d tems. It has all the standard algorithmic features that numberofleafnodes. modern chess programs tend to have as well as a num- Fortunately,undersomemildtechnicalassumptionsonthe ber of features that are much less common. For more behaviorofJ˜(x,w), itcanbeshownthatforeachstatex, details on KnightCap, including the source code, see thesetofw ∈ Rk forwhichJ˜d(x,w)isnotdifferentiable wwwsyseng.anu.edu.au/lsg. has Lebesgue measure zero. Thus for all states x and for “almostall”w ∈ Rk, J˜(x,w) isadifferentiablefunction d Let J˜(·,w) be a class of evaluation functionsparameterized by w ∈ Rk. Let x1,...,xN be N positions that occurred duringthecourseofagame,withr(x )theoutcomeofthegame.FornotationalconveniencesetJ˜(x ,w):=r(x ). N N N 1. Foreachstatex ,computeJ˜(x ,w)byperformingminimaxsearchtodepthdfromx andusingJ˜(·,w)toscorethe i d i i leafnodes.Notethatdmayvaryfrompositiontoposition. 2. Letxl denotetheleafnodeoftheprinciplevariationstartingatx .Ifthereismorethanoneprincipalvariation,choose i i aleafnodefromtheavailablecandidatesatrandom.Notethat J˜(x ,w)=J˜(xl,w). (9) d i i 3. Fort=1,...,N −1,computethetemporaldifferences: d :=J˜(xl ,w)−J˜(xl,w). (10) t t+1 t 4. UpdatewaccordingtotheTDLeaf(λ)formula: N−1 N−1 w :=w+α ∇J˜(xl,w) λj−td . (11) X t X t t=1 j=t  Figure3:TheTDLeaf(λ)algorithm 4.1 ExperimentswithKnightCap • Theraw(linear)leaf nodeevaluationsJ˜(xl,w) were i convertedtoascorebetween−1and1bycomputing In our main experiment we took KnightCap’s evaluation function and set all but the material parameters to zero. vil :=tanhhβJ˜(xli,w)i. The material parameters were initialized to the standard “computer” values: 1 for a pawn, 4 for a knight, 4 for a This ensured small fluctuationsin the relative values bishop, 6 for a rook and 12 for a queen. With these pa- of leaf nodes did not produce large temporal differ- rametersettingsKnightCap(underthepseudonym“Wimp- ences (the values vil were used in place of J˜(xli,w) Knight”) was started on the Free Internet Chess server in the TDLeaf(λ) calculations). 
4 TDLeaf(λ) and Chess

In this section we describe the outcome of several experiments in which the TDLeaf(λ) algorithm was used to train the weights of a linear evaluation function in our chess program "KnightCap". KnightCap is a reasonably sophisticated computer chess program for Unix systems. It has all the standard algorithmic features that modern chess programs tend to have, as well as a number of features that are much less common. For more details on KnightCap, including the source code, see wwwsyseng.anu.edu.au/lsg.

4.1 Experiments with KnightCap

In our main experiment we took KnightCap's evaluation function and set all but the material parameters to zero. The material parameters were initialized to the standard "computer" values: 1 for a pawn, 4 for a knight, 4 for a bishop, 6 for a rook and 12 for a queen. With these parameter settings KnightCap (under the pseudonym "WimpKnight") was started on the Free Internet Chess Server (FICS, fics.onenet.net) against both human and computer opponents. We played KnightCap for 25 games without modifying its evaluation function so as to get a reasonable idea of its rating. After 25 games it had a blitz (fast time control) rating of 1650 ± 50 (footnote 3), which put it at about B-grade human performance (on a scale from E (1000) to A (1800)), although of course the kind of game KnightCap plays with just the material parameters set is very different to human play of the same level (KnightCap makes no short-term tactical errors but is positionally completely ignorant).

Footnote 3: The standard deviation for all ratings reported in this section is about 50.

We then turned on the TDLeaf(λ) learning algorithm, with λ = 0.7 and the learning rate α = 1.0. The value of λ was chosen heuristically, based on the typical delay in moves before an error takes effect, while α was set high enough to ensure rapid modification of the parameters. A couple of minor modifications to the algorithm were made (a short sketch of the first two follows the list):

• The raw (linear) leaf node evaluations J~(x_i^l, w) were converted to a score between -1 and 1 by computing

    v_i^l := tanh[β J~(x_i^l, w)].

This ensured that small fluctuations in the relative values of leaf nodes did not produce large temporal differences (the values v_i^l were used in place of J~(x_i^l, w) in the TDLeaf(λ) calculations). The outcome of the game r(x_N) was set to 1 for a win, -1 for a loss and 0 for a draw. β was set to ensure that a value of tanh[β J~(x_i^l, w)] = 0.25 was equivalent to a material superiority of 1 pawn (initially).

• The temporal differences, d_t = v_{t+1}^l - v_t^l, were modified in the following way. Negative values of d_t were left unchanged, as any decrease in the evaluation from one position to the next can be viewed as a mistake. However, positive values of d_t can occur simply because the opponent has made a blunder. To avoid KnightCap trying to learn to predict its opponent's blunders, we set all positive temporal differences to zero unless KnightCap predicted the opponent's move (footnote 4).

Footnote 4: In a later experiment we only set positive temporal differences to zero if KnightCap did not predict the opponent's move and the opponent was rated less than KnightCap. After all, predicting a stronger opponent's blunders is a useful skill, although whether this made any difference is not clear.

• The value of a pawn was kept fixed at its initial value so as to allow easy interpretation of weight values as multiples of the pawn value (we actually experimented with not fixing the pawn value and found it made little difference: after 1764 games with an adjustable pawn its value had fallen by less than 7 percent).
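The first two modifications are small enough to state concretely. A minimal sketch follows; the pawn constant and the move-prediction test are placeholders, not KnightCap's internal values.

    import math

    PAWN = 1.0  # placeholder material unit; KnightCap's internal scale differs

    # Choose beta so that an evaluation of exactly one pawn maps to tanh(beta * PAWN) = 0.25.
    BETA = math.atanh(0.25) / PAWN

    def squash(raw_eval):
        # Convert a raw leaf evaluation to a score in (-1, 1).
        return math.tanh(BETA * raw_eval)

    def clipped_temporal_difference(v_t, v_next, predicted_opponent_move):
        # d_t = v_{t+1} - v_t, with positive differences zeroed unless KnightCap
        # predicted the opponent's move, so it does not learn to expect blunders.
        d = v_next - v_t
        if d > 0 and not predicted_opponent_move:
            return 0.0
        return d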
Within 300 games KnightCap's rating had risen to 2150, an increase of 500 points in three days, and to a level comparable with human masters. At this point KnightCap's performance began to plateau, primarily because it does not have an opening book and so will repeatedly play into weak lines. We have since implemented an opening book learning algorithm, and with this KnightCap now plays at a rating of 2400-2500 (peak 2575) on the other major internet chess server, ICC (chessclub.com) (footnote 5). It often beats International Masters at blitz. Also, because KnightCap automatically learns its parameters we have been able to add a large number of new features to its evaluation function: KnightCap currently operates with 5872 features (1468 features in each of four stages: opening, middle, ending and mating (footnote 6)). With this extra evaluation power KnightCap easily beats versions of Crafty restricted to search only as deep as itself. However, a big caveat to all this optimistic assessment is that KnightCap routinely gets crushed by faster programs searching more deeply. It is quite unlikely this can be easily fixed simply by modifying the evaluation function, since for this to work one has to be able to predict tactics statically, something that seems very difficult to do. If one could find an effective algorithm for "learning to search selectively" there would be potential for far greater improvement.

Footnote 5: There appears to be a systematic difference of around 200-250 points between the two servers, so a peak rating of 2575 on ICC roughly corresponds to a peak of 2350 on FICS. We transferred KnightCap to ICC because there are more strong players playing there.

Footnote 6: In reality there are not 1468 independent "concepts" per stage in KnightCap's evaluation function, as many of the features come in groups of 64, one for each square on the board (like the value of placing a rook on a particular square, for example).

Note that we have twice repeated the learning experiment and found a similar rate of improvement and final performance level. The rating as a function of the number of games from one of these repeat runs is shown in figure 4 (we did not record this information in the first experiment). Note that in this case KnightCap took nearly twice as long to reach the 2150 mark, but this was partly because it was operating with limited memory (8Mb) until game 500, at which point the memory was increased to 40Mb (KnightCap's search algorithm, MTD(f) [3], is a memory-intensive variant of α-β, and when learning KnightCap must store the whole position in the hash table, so small memory really hurts performance). Another reason may also have been that for a portion of the run we were performing parameter updates after every four games rather than after every game.

Figure 4: KnightCap's rating as a function of games played (second experiment). Learning was turned on at game 0.

Plots of various parameters as a function of the number of games played are shown in figure 5 (these plots are from the same experiment as figure 4). Each plot contains three graphs corresponding to the three different stages of the evaluation function: opening, middle and ending (footnote 7).

Footnote 7: KnightCap actually has a fourth and final stage, "mating", which kicks in when all the pieces are off, but this stage only uses a few of the coefficients (opponent's king mobility and proximity of our king to the opponent's king).

Figure 5: Evolution of two parameters (the bonus for castling, CASTLE_BONUS, and the penalty for a doubled pawn, DOUBLED_PAWN) as a function of the number of games played; scores are plotted in units where one pawn equals 10000. Note that each parameter appears three times: once for each of the three stages in the evaluation function.

Finally, we compared the performance of KnightCap with its learnt weights to KnightCap's performance with a set of hand-coded weights, again by playing the two versions on ICC. The hand-coded weights were close in performance to the learnt weights (perhaps 50-100 rating points worse). We also tested the result of allowing KnightCap to learn starting from the hand-coded weights, and in this case it seems that KnightCap performs better than when starting from just material values (peak performance was 2632 compared to 2575, but these figures are very noisy). We are conducting more tests to verify these results. However, it should not be too surprising that learning from a good quality set of hand-crafted parameters is better than just learning from material parameters. In particular, some of the hand-crafted parameters have very high values (the value of an "unstoppable pawn", for example) which can take a very long time to learn under normal playing conditions, particularly if they are rarely active in the principal leaves. It is not yet clear whether, given a sufficient number of games, this dependence on the initial conditions can be made to vanish.
4.2 Discussion

There appear to be a number of reasons for the remarkable rate at which KnightCap improved.

1. As all the non-material weights were initially zero, even small changes in these weights could cause very large changes in the relative ordering of materially equal positions. Hence even after a few games KnightCap was playing a substantially better game of chess.

2. It seems to be important that KnightCap started out life with intelligent material parameters. This put it close in parameter space to many far superior parameter settings.

3. Most players on FICS prefer to play opponents of similar strength, and so KnightCap's opponents improved as it did. This may have had the effect of guiding KnightCap along a path in weight space that led to a strong set of weights.

4. KnightCap was learning on-line, not by self-play. The advantage of on-line play is that there is a great deal of information provided by the opponent's moves. In particular, against a stronger opponent KnightCap was being shown positions that 1) could be forced (against KnightCap's weak play) and 2) were mis-evaluated by its evaluation function. Of course, in self-play KnightCap can also discover positions which are mis-evaluated, but it will not find the kinds of positions that are relevant to strong play against other opponents. In this setting, one can view the information provided by the opponent's moves as partially solving the "exploration" part of the exploration/exploitation trade-off.

To further investigate the importance of some of these reasons, we conducted several more experiments.

Good initial conditions. A second experiment was run in which KnightCap's coefficients were all initialised to the value of a pawn. The value of a pawn needs to be positive in KnightCap because it is used in many other places in the code: for example, we deem the MTD search to have converged if α < β + 0.07 * PAWN. Thus, to set all parameters equal to the same value, that value had to be a pawn.

Playing with these initial weight settings KnightCap had a blitz rating of around 1250. After more than 1000 games on FICS KnightCap's rating has improved to about 1550, a 300 point gain. This is a much slower improvement than in the original experiment. We do not know whether the coefficients would have eventually converged to good values, but it is clear from this experiment that starting near to a good set of weights is important for fast convergence. An interesting avenue for further exploration here is the effect of λ on the learning rate. Because the initial evaluation function is completely wrong, there would be some justification in setting λ = 1 early on so that KnightCap only tries to predict the outcome of the game and not the evaluations of later moves (which are extremely unreliable).

Self-play. Learning by self-play was extremely effective for TD-Gammon, but a significant reason for this is the randomness of backgammon, which ensures that with high probability different games have substantially different sequences of moves, and also the speed of play of TD-Gammon, which ensured that learning could take place over several hundred thousand games. Unfortunately, chess programs are slow, and chess is a deterministic game, so self-play by a deterministic algorithm tends to result in a large number of substantially similar games. This is not a problem if the games seen in self-play are "representative" of the games played in practice; however, KnightCap's self-play games with only non-zero material weights are very different to the kind of games humans of the same level would play.

To demonstrate that learning by self-play for KnightCap is not as effective as learning against real opponents, we ran another experiment in which all but the material parameters were initialised to zero again, but this time KnightCap learnt by playing against itself. After 600 games (twice as many as in the original FICS experiment), we played the resulting version against the good version that learnt on FICS for a further 100 games with the weight values fixed. The self-play version scored only 11% against the good FICS version.

Simultaneously with the work presented here, Beal and Smith [1] reported positive results using essentially TDLeaf(λ) and self-play (with some random move choice) when learning the parameters of an evaluation function that only computed material balance. However, they were not comparing performance against on-line players, but were primarily investigating whether the weights would converge to "sensible" values at least as good as the naive (1, 3, 3, 5, 9) values for (pawn, knight, bishop, rook, queen). They did, within 2000 games, and using a value of λ = 0.95, which supports the discussion in "good initial conditions" above.

On the theoretical side, it has recently been shown that TD(λ) converges for linear evaluation functions [11] (although only in the sense of prediction, not control). An interesting avenue for further investigation would be to determine whether TDLeaf(λ) has similar convergence properties.
5 Conclusion

We have introduced TDLeaf(λ), a variant of TD(λ) suitable for training an evaluation function used in minimax search. The only extra requirement of the algorithm is that the leaf nodes of the principal variations be stored throughout the game.

We presented some experiments in which a chess evaluation function was trained from B-grade to master level using TDLeaf(λ) by on-line play against a mixture of human and computer opponents. The experiments show both the importance of "on-line" sampling (as opposed to self-play) for a deterministic game such as chess, and the need to start near a good solution for fast convergence, although just how near is still not clear.

Acknowledgements

Thanks to several of the anonymous referees for their helpful remarks. Jonathan Baxter was supported by an Australian Postdoctoral Fellowship. Lex Weaver was supported by an Australian Postgraduate Research Award.

References

[1] D. F. Beal and M. C. Smith. Learning Piece Values Using Temporal Differences. Journal of The International Computer Chess Association, September 1997.

[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[3] A. Plaat, J. Schaeffer, W. Pijls, and A. de Bruin. Best-First Fixed-Depth Minimax Algorithms. Artificial Intelligence, 87:255-293, 1996.

[4] J. Pollack, A. Blair, and M. Land. Coevolution of a Backgammon Player. In Proceedings of the Fifth Artificial Life Conference, Nara, Japan, 1996.

[5] A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 3:210-229, 1959.

[6] N. Schraudolph, P. Dayan, and T. Sejnowski. Temporal Difference Learning of Position Evaluation in the Game of Go. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, San Francisco, 1994. Morgan Kaufmann.

[7] R. Sutton. Learning to Predict by the Method of Temporal Differences. Machine Learning, 3:9-44, 1988.

[8] G. Tesauro. Practical Issues in Temporal Difference Learning. Machine Learning, 8:257-278, 1992.

[9] G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6:215-219, 1994.
[10] S. Thrun. Learning to Play the Game of Chess. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7, San Francisco, 1995. Morgan Kaufmann.

[11] J. N. Tsitsiklis and B. Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

[12] S. Walker, R. Lister, and T. Downs. On Self-Learning Patterns in the Othello Board Game by the Method of Temporal Differences. In C. Rowles, H. Liu, and N. Foo, editors, Proceedings of the 6th Australian Joint Conference on Artificial Intelligence, pages 328-333, Melbourne, 1993. World Scientific.
