Modelling Competitive Sports: Bradley-Terry-Élő Models for Supervised and On-Line Learning of Paired Competition Outcomes

Franz J. Király∗1 and Zhaozhi Qian†1,2

1 Department of Statistical Science, University College London, Gower Street, London WC1E 6BT, United Kingdom
2 King Digital Entertainment plc, Ampersand Building, 178 Wardour Street, London W1F 8FY, United Kingdom

January 30, 2017

arXiv:1701.08055v1 [stat.ML] 27 Jan 2017

Abstract

Prediction and modelling of competitive sports outcomes has received much recent attention, especially from the Bayesian statistics and machine learning communities. In the real world setting of outcome prediction, the seminal Élő update still remains, after more than 50 years, a valuable baseline which is difficult to improve upon, though in its original form it is a heuristic and not a proper statistical "model". Mathematically, the Élő rating system is very closely related to the Bradley-Terry models, which are usually used in an explanatory fashion rather than in a predictive supervised or on-line learning setting.

Exploiting this close link between these two model classes and some newly observed similarities, we propose a new supervised learning framework with close similarities to logistic regression, low-rank matrix completion and neural networks. Building on it, we formulate a class of structured log-odds models, unifying the desirable properties found in the above: supervised probabilistic prediction of scores and wins/draws/losses, batch/epoch and on-line learning, as well as the possibility to incorporate features in the prediction, without having to sacrifice simplicity, parsimony of the Bradley-Terry models, or computational efficiency of Élő's original approach.

We validate the structured log-odds modelling approach in synthetic experiments and English Premier League outcomes, where the added expressivity yields the best predictions reported in the state-of-art, close to the quality of contemporary betting odds.
∗[email protected]  †[email protected]

Contents

1 Introduction
  1.1 Modelling and predicting competitive sports
  1.2 History of competitive sports modelling
  1.3 Aim of competitive sports modelling
  1.4 Main questions and challenges in competitive sports outcomes prediction
  1.5 Main contributions
  1.6 Manuscript structure

2 The Mathematical-Statistical Setting
  2.1 Supervised prediction of competitive outcomes
    2.1.1 The Generative Model
    2.1.2 The Observation Model
    2.1.3 The Learning Task
  2.2 Losses for probabilistic classification
  2.3 Learning with structured and sequential data
    2.3.1 Conditioning on the pairing
    2.3.2 Conditioning on time

3 Approaches to competitive sports prediction
  3.1 The Bradley-Terry-Élő models
    3.1.1 The original formulation of the Élő model
    3.1.2 Bradley-Terry-Élő models
    3.1.3 Glickman's Bradley-Terry-Élő model
    3.1.4 Limitations of the Bradley-Terry-Élő model and existing remedies
  3.2 Domain-specific parametric models
    3.2.1 Bivariate Poisson regression and extensions
    3.2.2 Bayesian latent variable models
  3.3 Feature-based machine learning predictors
  3.4 Evaluation methods used in previous studies

4 Extending the Bradley-Terry-Élő model
  4.1 The structured log-odds model
    4.1.1 Statistical definition of structured log-odds models
    4.1.2 Important special cases
    4.1.3 Connection to existing model classes
  4.2 Predicting non-binary labels with structured log-odds models
    4.2.1 The structured log-odds model with features
    4.2.2 Predicting ternary outcomes
    4.2.3 Predicting score outcomes
  4.3 Training of structured log-odds models
    4.3.1 The likelihood of structured log-odds models
    4.3.2 Batch training of structured log-odds models
    4.3.3 On-line training of structured log-odds models
  4.4 Rank regularized log-odds matrix estimation

5 Experiments
  5.1 Synthetic experiments
    5.1.1 Two-factor Bradley-Terry-Élő model
    5.1.2 Rank-four Bradley-Terry-Élő model
    5.1.3 Regularized log-odds matrix estimation
  5.2 Predictions on the English Premier League
    5.2.1 Description of the dataset
    5.2.2 Validation setting
    5.2.3 Prediction Strategy
    5.2.4 Quantitative comparison for the evaluation metrics
    5.2.5 Performance of the structured log-odds model
    5.2.6 Performance of the batch learning models
  5.3 Fairness of the English Premier League ranking

6 Discussion and Summary
  6.1 Methodological findings
  6.2 Findings on the English Premier League
  6.3 Open questions

1. Introduction

1.1. Modelling and predicting competitive sports

Competitive sports refers to any sport that involves two teams or individuals competing against each other to achieve higher scores. Competitive team sports include some of the most popular and most watched games such as football, basketball and rugby.
Such sports are played both in domestic professional leagues such as the National Basketball Association, and in international competitions such as the FIFA World Cup. For football alone, there are over one hundred fully professional leagues in 71 countries globally. It is estimated that the Premier League, the top football league in the United Kingdom, attracted a (cumulative) television audience of 4.7 billion viewers in the last season [47].

The outcome of a match is determined by a large number of factors. Just to name a few, they might involve the competitive strength of each individual player in both teams, the smoothness of collaboration between players, and the team's strategy of playing. Moreover, the composition of any team changes over the years, for example because players leave or join the team. The team composition may also change within the tournament season or even during a match because of injuries or penalties.

Understanding these factors is, by the prediction-validation nature of the scientific method, closely linked to predicting the outcome of a pairing. By Occam's razor, the factors which empirically help in prediction are exactly those that one may hypothesize to be relevant for the outcome.

Since keeping track of all relevant factors is unrealistic, one cannot of course expect a certain prediction of a competitive sports outcome. Moreover, it is also unreasonable to believe that all factors can be measured or controlled, hence it is reasonable to assume that unpredictable, or non-deterministic, statistical "noise" is involved in the process of generating the outcome (or to subsume the unknowns as such noise). A good prediction will, hence, not exactly predict the outcome, but will anticipate the "correct" odds more precisely. The extent to which the outcomes are predictable may hence be considered a surrogate quantifier of how much the outcome of a match is influenced by "skill" (as surrogated by determinism/prediction), or by "chance"¹ (as surrogated by the noise/unknown factors).

Phenomena which cannot be specified deterministically are in fact very common in nature.
Statistics and probability theory provide ways to make inference under randomness. Therefore, modelling and predicting the results of competitive team sports naturally falls into the area of statistics and machine learning. Moreover, any interpretable predictive model yields a possible explanation of what constitutes the factors influencing the outcome.

1.2. History of competitive sports modelling

Research on modelling competitive sports has a long history. In its early days, research was often closely related to sports betting or player/team ranking [22, 26]. The two most influential approaches are due to Bradley and Terry [3] and Élő [15]. The Bradley-Terry and Élő models allow estimation of player rating; the Élő system additionally contains algorithmic heuristics to easily update a player's rank, which have been in use for official chess rankings since the 1960s. The Élő system is also designed to predict the odds of a player winning or losing to the opponent. In contemporary practice, Bradley-Terry and Élő type models are broadly used in modelling of sports outcomes and ranking of players, and it has been noted that they are very close mathematically.

In more recent days, relatively diverse modelling approaches originating from the Bayesian statistical framework [37, 13, 20], and also some inspired by machine learning principles [36, 23, 43], have been applied to modelling competitive sports. These models are more expressive and remove some of the Bradley-Terry and Élő models' limitations, though usually at the price of interpretability, computational efficiency, or both.

A more extensive literature overview of existing approaches will be given later in Section 3, as the literature spans multiple communities and, in our opinion, a prior exposition of the technical setting and a simultaneous straightening of thoughts benefits the understanding and allows us to give proper credit and context for the widely different ideas employed in competitive sports modelling.

¹We expressly avoid use of the word "luck", as in vernacular use it often means "chance", jointly with the belief that it may be influenced by esoterical, magical or otherwise metaphysical means. While in the suggested surrogate use it may well be that the "chance" component of a model subsumes possible points of influence which simply are not measured or observed in the data, an extremely strong corpus of scientific evidence implies that these will not be metaphysical, only unknown: two qualifiers which are obviously not the same, despite strong human tendencies to believe the contrary.

1.3. Aim of competitive sports modelling

In the literature, the study of competitive team sports may be seen to lie between two primary goals. The first goal is to design models that make good predictions for future match outcomes. The second goal is to understand the key factors that influence the match outcome, mostly through retrospective analysis [45, 50]. As explained above, these two aspects are intrinsically connected, and in our view they are the two facets of a single problem: on one hand, proposed influential factors are only scientifically valid if confirmed by falsifiable experiments such as predictions on future matches. If the predictive performance does not increase when information about such factors enters the model, one should conclude by Occam's razor that these factors are actually irrelevant². On the other hand, it is plausible to assume that predictions are improved by making use of relevant factors (also known as "features") as they become available, for example because they are capable of explaining unmodelled random effects (noise). In light of this, the main problem considered in this work is the (validatable and falsifiable) prediction problem, which in machine learning terminology is also known as the supervised learning task.

1.4. Main questions and challenges in competitive sports outcomes prediction

Given the above discussion, the major challenges may be stated as follows:

On the methodological side, what are suitable models for competitive sports outcomes?
Current models are not at the same time interpretable, easily computable, able to use feature information on the teams/players, and able to predict scores or ternary outcomes. It is an open question how to achieve this in the best way, and this manuscript attempts to highlight a possible path.

The main technical difficulty lies in the fact that off-the-shelf methods do not apply due to the structured nature of the data: unlike in individual sports such as running and swimming, where the outcome depends only on the given team, and where the prediction task may be dealt with by classical statistics and machine learning technology (see [2] for a discussion of this in the context of running), in competitive team sports the outcome may be determined by potentially complex interactions between two opposing teams. In particular, the performance of any team is not measured directly using a simple metric, but only in relation to the opposing team's performance.

On the side of domain applications, which in this manuscript is premier league football, it is of great interest to determine the relevant factors determining the outcome, the best way to predict, and which ranking systems are fair and appropriate.

All these questions are related to predictive modelling, as well as to the availability of suitable amounts of quality data. Unfortunately, the scarcity of features available in systematic presentation places a hurdle to academic research in competitive team sports, especially when it comes to assessing important factors such as team member characteristics, or strategic considerations during the match.

Moreover, closely linked is also the question to which extent the outcomes are determined by "chance" as opposed to "skill".
For if, on one hypothetical extreme, results proved to be completely unpredictable, there would be no empirical evidence to distinguish the matches from a game of chance such as flipping a coin. On the other hand, importance of a measurement for predicting would strongly suggest its importance for winning (or losing), though without an experiment not necessarily a causative link.

We attempt to address these questions in the case of premier league football within the confines of readily available data.

²...to distinguish/characterize the observations, which in some cases may plausibly pertain to restrictions in the set of observations, rather than to causative relevance. Hypothetical example: age of football players may be identified as unimportant for the outcome, which may plausibly be due to the fact that the data contained no players of ages 5 or 80, say, as opposed to player age being unimportant in general. Rephrased, it is only unimportant for cases that are plausible to be found in the dataset in the first place.

1.5. Main contributions

Our main contributions in this manuscript are the following:

(i) We give what we believe to be the first comprehensive literature review of state-of-art competitive sports modelling that comprises the multiple communities (Bradley-Terry models, Élő type models, Bayesian models, machine learning) in which research so far has been conducted mostly separately.

(ii) We present a unified Bradley-Terry-Élő model which combines the statistical rigour of the Bradley-Terry models with fitting and update strategies similar to those found in the Élő system. Mathematically only a small step, this joint view is essential in a predictive/supervised setting as it allows efficient training and application in an on-line learning situation. Practically, this step solves some problems of the Élő system (including ranking initialization and choice of K-factor), and establishes close relations to logistic regression, low-rank matrix completion, and neural networks.
(iii) This unified view on Bradley-Terry-Élő allows us to introduce classes of joint extensions, the structured log-odds models, which unite desirable properties of the extensions found in the disjoint communities: probabilistic prediction of scores and wins/draws/losses, batch/epoch and on-line learning, as well as the possibility to incorporate features in the prediction, without having to sacrifice structural parsimony of the Bradley-Terry models, or simplicity and computational efficiency of Élő's original approach.

(iv) We validate the practical usefulness of the structured log-odds models in synthetic experiments and in answering domain questions on English Premier League data, most prominently on the importance of features, fairness of the ranking, as well as on the "chance"-"skill" divide.

1.6. Manuscript structure

Section 2 gives an overview of the mathematical setting in competitive sports prediction. Building on the technical context, Section 3 presents a more extensive review of the literature related to the prediction problem of competitive sports, and introduces a joint view on Bradley-Terry and Élő type models. Section 4 introduces the structured log-odds models, which are validated in empirical experiments in Section 5. Our results and possible future directions for research are discussed in Section 6.

Authors' contributions

This manuscript is based on ZQ's MSc thesis, submitted September 2016 at University College London, written under supervision of FK. FK provided the ideas of re-interpretation and possible extensions of the Élő model. The literature overview is jointly due to ZQ and FK, and in parts follows some very helpful pointers by I. Kosmidis (see below). Novel technical ideas in Sections 4.2 to 4.4, and experiments (set-up and implementation), are mostly due to ZQ.

The present manuscript is a substantial re-working of the thesis manuscript, jointly done by FK and ZQ.
Acknowledgements

We are thankful to Ioannis Kosmidis for comments on an earlier form of the manuscript, for pointing out some earlier occurrences of ideas presented in it but not given proper credit, as well as for relevant literature in the "Bradley-Terry" branch.

2. The Mathematical-Statistical Setting

This section formulates the prediction task in competitive sports and fixes notation, considering it as an instance of supervised learning with several non-standard structural aspects being of relevance.

2.1. Supervised prediction of competitive outcomes

We introduce the mathematical setting for outcome prediction in competitive team sports. As outlined in the introductory Section 1.1, three crucial features need to be taken into account in this setting:

(i) The outcome of a pairing cannot be exactly predicted prior to the game, even with perfect knowledge of all determinates. Hence it is preferable to predict a probabilistic estimate for all possible match outcomes (win/draw/loss) rather than deterministically choosing one of them.

(ii) In a pairing, two teams play against each other, one as a home team and the other as the away or guest team. Not all pairs may play against each other, while others may play multiple times. As a mathematically prototypical (though inaccurate) sub-case one may consider all pairs playing exactly once, which gives the observations an implicit matrix structure (row = home team, column = away team). Outcome labels and features crucially depend on the teams constituting the pairing.

(iii) Pairings take place over time, and the expected outcomes are plausibly expected to change with (possibly hidden) characteristics of the teams. Hence we will model the temporal dependence explicitly, to be able to take it into account when building and checking predictive strategies.

2.1.1. The Generative Model.
Following the above discussion, we will fix a generative model as follows: as in the standard supervised learning setting, we will consider a generative joint random variable (X, Y) taking values in X × Y, where X is the set of features (or covariates, independent variables) for each pairing, while Y is the set of labels (or outcome variables, dependent variables).

In our setting, we will consider only the cases Y = {win, lose} and Y = {win, lose, draw}, in which case an observation from Y is a so-called match outcome, as well as the case Y = ℕ², in which case an observation is a so-called final score (where, by convention, the first component of Y is that of the home team), or the case of score differences, where Y = ℤ (with the convention that a positive number is in favour of the home team). From the official rule set of a game (such as football), the match outcome is uniquely determined by a score or score difference. As all the above sets Y are discrete, predicting Y will amount to supervised classification (the score difference problem may be phrased as a regression problem, but we will abstain from doing so for technical reasons that become apparent later).

The random variable X and its domain X shall include information on the teams playing, as well as on the time of the match.

We will suppose there is a set I of teams, and for i, j ∈ I we will denote by (X_ij, Y_ij) the random variable (X, Y) conditioned on the knowledge that i is the home team and j is the away team. Note that information in X_ij can include any knowledge on either single team i or j, but also information corresponding uniquely to the pairing (i, j).

We will assume that there are Q := #I teams, which means that the X_ij and Y_ij may be arranged in (Q × Q) matrices each.

Further, there will be a set T of time points at which matches are observed. For t ∈ T we will denote by (X(t), Y(t)) or (X_ij(t), Y_ij(t)) an additional conditioning that the outcome is observed at time point t.

Note that the indexing X_ij(t) and Y_ij(t) formally amounts to a double conditioning and could be written as X | I = i, J = j, T = t and Y | I = i, J = j, T = t, where I, J, T are random variables denoting the home team, the away team, and the time of the pairing. Though we do believe that the index/bracket notation is easier to carry through and to follow (including an explicit mirroring of the "matrix structure") than the conditional or "graphical models" type notation, which is our main reason for adopting the former and not the latter.

2.1.2. The Observation Model.

By construction, the generative random variable (X, Y) contains all information on having any pairing playing at any time. However, observations in practice will concern two teams playing at a certain time, hence observations in practice will only include independent samples of (X_ij(t), Y_ij(t)) for some i, j ∈ I, t ∈ T, and never full observations of (X, Y), which can be interpreted as a latent variable.

Note that the observations can be, in principle, correlated (or unconditionally dependent) if the pairing (i, j) or the time t is not made explicit (by the conditioning which is implicit in the indices i, j, t).

An important aspect of our observation model will be that whenever a value of X_ij(t) or Y_ij(t) is observed, it will always come together with the information of the playing teams (i, j) ∈ I² and the time t ∈ T at which it was observed. This fact will be implicitly made use of in the description of algorithms and validation methodology. (Formally, this could be achieved by explicitly exhibiting/adding I × I × T as a Cartesian factor of the sampling domains X or Y, which we will not do for reasons of clarity and readability.)

Two independent batches of data will be observed in the exposition. We will consider:

a training set D := {(X^{(1)}_{i_1 j_1}(t_1), Y^{(1)}_{i_1 j_1}(t_1)), ..., (X^{(N)}_{i_N j_N}(t_N), Y^{(N)}_{i_N j_N}(t_N))},

a test set T := {(X^{(1∗)}_{i∗_1 j∗_1}(t∗_1), Y^{(1∗)}_{i∗_1 j∗_1}(t∗_1)), ..., (X^{(M∗)}_{i∗_M j∗_M}(t∗_M), Y^{(M∗)}_{i∗_M j∗_M}(t∗_M))},

where (X^{(i)}, Y^{(i)}) and (X^{(i∗)}, Y^{(i∗)}) are i.i.d. samples from (X, Y).
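As a concrete illustration of the label conventions above (a minimal sketch of our own; the team names, time points and scores are made up, not from the manuscript), a final score in ℕ² determines the ternary match outcome for the home team, and every observed sample carries its pairing (i, j) and time point t alongside the label:

```python
# A minimal sketch (illustration only): map final scores to ternary outcomes,
# with each observed sample of the form (home team i, away team j, time t, score).

def outcome_from_score(home_goals, away_goals):
    """Map a final score (home, away) to the home team's win/draw/lose label."""
    if home_goals > away_goals:
        return "win"
    if home_goals < away_goals:
        return "lose"
    return "draw"

# Hypothetical observed samples (i, j, t, final score); placeholder values.
observations = [
    ("A", "B", 1, (2, 0)),
    ("B", "C", 2, (1, 1)),
    ("C", "A", 3, (0, 3)),
]

labels = [outcome_from_score(*score) for (_i, _j, _t, score) in observations]
print(labels)  # ['win', 'draw', 'lose']
```

This mirrors the remark that the match outcome is uniquely determined by the score, while the converse does not hold.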
Note that, unfortunately (from a notational perspective), one cannot omit the superscripts κ as in X^{(κ)} when defining the samples, since the figurative "dice" should be cast anew for each pairing taking place. In particular, if all games consisted of a single pair of teams playing, with results independent of time, they would all be the same (and not only identically distributed) without the super-index, i.e., without distinguishing different games as different samples from (X, Y).

2.1.3. The Learning Task.

As set out in the beginning, the main task we will be concerned with is predicting future outcomes given past outcomes and features, observed from the process above. In this work, the features will be assumed to change slowly over time. It is not our primary goal to identify the hidden features in (X, Y), as they are never observed and hence not accessible as ground truth which can validate our models. However, these will be of secondary interest and considered empirically validated by a well-predicting model.

More precisely, we will describe methodology for learning and validating predictive models of the type

f : X × I × I × T → Distr(Y),

where Distr(Y) is the set of (discrete probability) distributions on Y. That is, given a pairing (i, j) and a time point t at which the teams i and j play, and information of type x = X_ij(t), make a probabilistic prediction f(x, i, j, t) of the outcome.

Most algorithms we discuss will not use added information in X, hence will be of type f : I × I × T → Distr(Y). Some will disregard the time in T. Indeed, the latter algorithms are to be considered scientific baselines above which any algorithm using information in X and/or T has to improve.

The models f above will be learnt on a training set D, and validated on an independent test set T as defined above. In this scenario, f will be a random variable which may implicitly depend on D but will be independent of T. The learning strategy, that is, how f depends on D, may take any form and is considered in a full black-box sense. In the exposition, it will in fact take the form of various parametric and non-parametric prediction algorithms.
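As a sketch of this signature (our own illustration, not the authors' code; the team names are placeholders), a predictor is any function returning a probability mass function over Y. The uniform baseline below is the simplest possible instance of the feature-agnostic type I × I × T → Distr(Y):

```python
# A minimal sketch (illustration only) of the predictor signature
# f : X x I x I x T -> Distr(Y): given features x, home team i, away team j
# and a time point t, return a probability mass function over
# Y = {win, draw, lose}, here represented as a dict.

def uniform_baseline(x, i, j, t):
    """The simplest baseline: ignore all inputs and predict uniformly over Y."""
    return {"win": 1 / 3, "draw": 1 / 3, "lose": 1 / 3}

p = uniform_baseline(None, "A", "B", 1)
print(sum(p.values()))  # masses sum to 1 (up to floating point), a valid pmf on Y
```

Any of the algorithms discussed later can be wrapped in this interface; only what happens inside the function changes.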
The goodness of such an f will be evaluated by a loss L : Distr(Y) × Y → ℝ which compares a probabilistic prediction to the true observation. The best f will have a small expected generalization loss

ε(f | i, j, t) := E_{(X,Y)} [ L( f(X_ij(t), i, j, t), Y_ij(t) ) ]

at any future time point t and for any pairing i, j. Under mild assumptions, we will argue below that this quantity is estimable from T and only mildly dependent on t, i, j.

Though a good form for L is not a priori clear. Also, it is unclear under which assumptions ε(f | t) is estimable, due to the conditioning on (i, j, t) in the training set. These special aspects of the competitive sports prediction setting will be addressed in the subsequent sections.

2.2. Losses for probabilistic classification

In order to evaluate different models, we need a criterion to measure the goodness of probabilistic predictions. The most common error metric used in supervised classification problems is the prediction accuracy. However, the accuracy is often insensitive to probabilistic predictions.

For example, on a certain test case model A predicts a win probability of 60%, while model B predicts a win probability of 95%. If the actual outcome is not a win, both models are wrong. In terms of prediction accuracy (or any other non-probabilistic metric), they are equally wrong because both of them made one mistake. However, model A should be considered better than model B, since it assigned a higher probability to the outcome that actually occurred.

Similarly, if a large number of outcomes of a fair coin toss have been observed as training data, a model that predicts 50% for both outcomes on any test data point should be considered more accurate than a model that predicts 100% for either outcome 50% of the time.

There exist two commonly used criteria that take into account the probabilistic nature of predictions, which we adopt. The first is the log-loss or log-likelihood loss (Equation 1 below), the second is the Brier score (Equation 2 below). Both losses compare a distribution to an observation, hence mathematically have the signature of a function Distr(Y) × Y → ℝ.
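Both losses are straightforward to compute. As a concrete illustration (a minimal sketch of our own; the predicted probabilities are made up), the log-loss is −log p_y and the Brier loss is (1 − p_y)² plus the sum of the squared masses on the other labels:

```python
from math import log

# A minimal sketch (illustration only) of the two losses for a pmf p over
# Y = {win, draw, lose}, represented as a dict, and an observed outcome y.

def log_loss(p, y):
    """Log-loss L_ell(p, y) = -log p_y."""
    return -log(p[y])

def brier_loss(p, y):
    """Brier loss L_Br(p, y) = (1 - p_y)^2 + sum of p_yt^2 over labels yt != y."""
    return (1.0 - p[y]) ** 2 + sum(q ** 2 for label, q in p.items() if label != y)

p = {"win": 0.6, "draw": 0.3, "lose": 0.1}
print(round(log_loss(p, "win"), 4))  # -log(0.6), approximately 0.5108
print(brier_loss(p, "win"))          # 0.4^2 + 0.3^2 + 0.1^2 = 0.26 (up to float error)
```

Note how the log-loss depends only on the mass p_y assigned to the observed label, while the Brier loss also involves how the remaining mass is spread over the other labels.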
By (very slight) abuse of notation, we will identify distributions on (discrete) Y with their probability mass functions; for a distribution p and for y ∈ Y, we write p_y for the mass on the observation y (= the probability to observe y in a random experiment following p). With this convention, log-loss L_ℓ and Brier loss L_Br are defined as follows:

L_ℓ : (p, y) ↦ −log p_y    (1)

L_Br : (p, y) ↦ (1 − p_y)² + Σ_{ỹ ∈ Y∖{y}} p_ỹ²    (2)

The log-loss and the Brier loss functions have the following properties:

(i) The Brier score is only defined on a Y with an addition/subtraction and a norm defined. This is not necessarily the case in our setting, where it may be that Y = {win, lose, draw}. In the literature, this is often identified with Y = {1, 0, −1}, though this identification is arbitrary, and the Brier score may change depending on which numbers are used. On the other hand, the log-loss is defined for any Y and remains unchanged under any renaming or renumbering of a discrete Y.

(ii) For a joint random variable (X, Y) taking values in X × Y, it can be shown that the expected losses E[L_ℓ(f(X), Y)] are minimized by the "correct" prediction f : x ↦ ( p_y = P(Y = y | X = x) )_{y ∈ Y}.

The two loss functions are usually introduced as empirical losses on a test set T, i.e.,

ε̂_T(f) = (1/#T) Σ_{(x,y) ∈ T} L_∗( f(x), y ),

where ∗ stands for ℓ or Br. The empirical log-loss is the (negative log-)likelihood of the test predictions.

The empirical Brier loss, usually called the "Brier score", is a straightforward translation of the mean squared error used in regression problems to the classification setting, as the expected mean squared error of predicted confidence scores. However, in certain cases, the Brier score is hard to interpret and may behave in unintuitive ways [27], which may partly be seen as a phenomenon caused by the above-mentioned lack of invariance under class re-labelling.

Given this, and the interpretability of the empirical log-loss as a likelihood, we will use the log-loss as the principal evaluation metric in the competitive outcome prediction setting.
2.3. Learning with structured and sequential data

The dependency of the observed data on pairing and time makes the prediction task at hand non-standard. We outline the major consequences for learning and model validation, as well as the implicit assumptions which allow us to tackle these. We will do this separately for the pairing and the temporal structure, as these behave slightly differently.

2.3.1. Conditioning on the pairing

Match outcomes are observed for given pairings (i, j); that is, each feature-label pair will be of the form (X_ij, Y_ij), where as above the subscripts denote conditioning on the pairing. Multiple pairings may be observed in the training set, but not all; some pairings may never be observed. This has consequences for both learning and validating models.

For model learning, it needs to be made sure that the pairings to be predicted can be predicted from the pairings observed. In other words, the label Y∗_ij in the test set that we want to predict is (in a practically substantial way) dependent on the training set D = {(X^{(1)}_{i_1 j_1}, Y^{(1)}_{i_1 j_1}), ..., (X^{(N)}_{i_N j_N}, Y^{(N)}_{i_N j_N})}. Note that smart models will be able to predict the outcome of a pairing even if it has not been observed before, and even if it has, they will use information from other pairings to improve their predictions.

For various parametric models, "predictability" can be related to completability of a data matrix with Y_ij as entries. In Section 4, we will relate Élő type models to low-rank matrix completion algorithms; completion can be understood as low-rank completion, hence predictability corresponds to completability. Though working out completability exactly is not the primary aim of this manuscript, and for our data of interest, the English Premier League, all pairings are observed in any given year, so completability is not an issue. Hence we refer to [33] for a study of low-rank matrix completability. General parametric models may be treated along similar lines.
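The link between predictability and low-rank structure can be made concrete in a small sketch (our own illustration with made-up ratings, not the manuscript's algorithm): in a Bradley-Terry type model, the log-odds of a home win are a rating difference θ_i − θ_j, so all pairwise entries of the log-odds matrix are determined by the ratings, and entries for unobserved pairings follow from observed ones by additivity:

```python
from math import exp, log

# A minimal sketch (illustration only): hypothetical team ratings, with the
# win probability for pairing (i, j) modelled as the logistic function of the
# rating difference theta_i - theta_j, as in Bradley-Terry type models.

theta = {"A": 2.0, "B": 1.0, "C": 0.5, "D": -1.0}  # made-up ratings

def win_prob(i, j):
    """Model probability that team i beats team j."""
    z = theta[i] - theta[j]
    return 1.0 / (1.0 + exp(-z))

def log_odds(i, j):
    """Log-odds of i beating j; equals theta_i - theta_j by construction."""
    p = win_prob(i, j)
    return log(p / (1.0 - p))

# Additivity: the entry for the pairing (A, C) is fixed by those of (A, B), (B, C),
# which is the sense in which the full pairing matrix is "completable" from few entries.
assert abs(log_odds("A", "C") - (log_odds("A", "B") + log_odds("B", "C"))) < 1e-9
print(round(win_prob("A", "D"), 3))  # logistic(3.0), approximately 0.953
```

This is why, for such models, predictability of unobserved pairings reduces to a matrix completability question rather than requiring every pairing to appear in the training data.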
For model-agnostic model validation, it should hold that the expected generalization loss

ε(f | i, j) := E_{(X,Y)} [ L( f(X_ij, i, j), Y_ij ) ]

can be well-estimated by empirical estimation on the test data. For league-level team sports datasets, this can be achieved by having multiple years of data available: even if not all pairings are observed, the set of pairings which is observed is usually (almost) the same in each year, hence the pairings will be similar in the training and test set if whole years (or half-seasons) are included. Further, we will consider an average over all observed pairings, i.e., we will compute the empirical loss on the test set T as

ε̂(f) := (1/#T) Σ_{(X_ij, Y_ij) ∈ T} L( f(X_ij, i, j), Y_ij ).

By the above argument, the set of all observed pairings in any given year is plausibly modelled as similar, hence it is plausible to conclude that this empirical loss estimates some expected generalization loss

ε(f) := E_{X,Y,I,J} [ L( f(X_IJ, I, J), Y_IJ ) ],

where I, J (possibly dependent) are random variables that select the teams which are paired. Note that this type of aggregate evaluation does not exclude the possibility that predictions for single teams (e.g., newcomers or after re-structuring) may be inaccurate, but only that the "average" prediction
