DEGREE PROJECT IN ELECTRICAL ENGINEERING,
SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018
Deep Reinforcement Learning for
Adaptive Resource Allocation in
Virtualized Network Functions
SIMON IGNAT
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Deep Reinforcement Learning
for Adaptive Resource
Allocation in Virtualized
Network Functions
SIMON IGNAT, SIMON.G.IGNAT@GMAIL.COM
M.Sc. in Electrical Engineering
Systems, Control and Robotics
Date: October 7, 2018
Supervisor: Johannes A. Stork
Examiner: Danica Kragic
KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science
Nobody ever figures out what life is all about, and it doesn’t matter.
Explore the world.
Nearly everything is really interesting if you go into it deeply enough.
RICHARD FEYNMAN
Abstract
Network Function Virtualization (NFV) is the transition from proprietary hardware functions to virtualized counterparts of them within the telecommunication industry. These virtualized counterparts are known as Virtualized Network Functions (VNFs) and are the main building blocks of NFV. The transition started in 2012 and is still ongoing, with research and development moving at a high pace. It is believed that when using virtualization, both capital and operating expenses can be lowered as a result of easier deployments, cheaper systems and networks that can operate more autonomously. This thesis examines whether the current state of NFV can lower the operating expenses while maintaining a high quality of service (QoS) by using current state-of-the-art machine learning algorithms. More specifically, the thesis analyzes the problem of adaptive autoscaling of virtual machines (VMs) allocated by the VNFs with deep reinforcement learning (DRL). To analyze the task, the thesis implements a discrete-time model of VNFs with the purpose of capturing the fundamental characteristics of the scaling operation. It also examines the learning and robustness/generalization of six state-of-the-art DRL algorithms. The algorithms are examined since they have fundamental differences in their properties, ranging from off-policy methods such as DQN to on-policy methods such as PPO and Advantage Actor Critic. The policies are compared to a baseline P-controller to evaluate their performance with respect to simpler methods. The results from the model show that DRL needs around 100,000 samples to converge, which in a real setting would represent around 70 days of learning. The thesis also shows that the final policy applied by the agent does not show considerable improvements over a simple control algorithm with respect to reward and performance when multiple experiments with varying loads and configurations are tested. Due to the lack of data and slow real-time systems, with robustness being an important consideration, the time to convergence required by a DRL agent is too long for an autoscaling solution to be deployed in the near future. Therefore, the author cannot recommend DRL for autoscaling in VNFs given the current state of the technology. Instead, the author recommends simpler methods, such as supervised machine learning or classical control theory.
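
To make the baseline mentioned above concrete, the following is a minimal sketch of a proportional (P) control rule for VM autoscaling: the VM count is adjusted in proportion to the deviation of measured utilization from a target. It is written in Python for illustration only; the identifiers and parameter values (target_util, gain, the VM bounds) are assumptions and not the controller used in the thesis.

    def p_controller_scaling(num_vms, utilization, target_util=0.6, gain=2.0,
                             min_vms=1, max_vms=20):
        """Return a new VM count from the deviation between measured and target utilization."""
        error = utilization - target_util        # positive error -> the VNF is overloaded
        delta = round(gain * error * num_vms)    # scale the step with the size of the deployment
        return max(min_vms, min(max_vms, num_vms + delta))

    # Example: 8 VMs at 80 % average utilization with a 60 % target
    # suggests scaling out to 11 VMs under these assumed parameters.
    print(p_controller_scaling(8, 0.80))  # -> 11
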
Sammanfattning
Network Function Virtualization (NFV) is the transition from proprietary hardware functions to virtualized counterparts of them within the telecommunication industry. These virtualized counterparts are known as Virtualized Network Functions (VNFs) and can be seen as the building blocks of NFV. The ideas around virtualization started in 2012 and are still being developed, with research and development proceeding at a high pace. The hope is that virtualization will lower both capital and operating expenses as a result of simpler deployments, cheaper systems and more autonomous solutions. This thesis examines whether the current state of NFV can lower the operating expenses while keeping the quality of service (QoS) high by using machine learning. More specifically, Deep Reinforcement Learning (DRL) is examined together with the problem of adaptive autoscaling of the virtual machines used by the VNFs. To analyze the task, the thesis implements a discrete-time model of VNFs with the purpose of capturing the fundamental characteristics of scaling operations. It also examines the learning and robustness of six DRL algorithms. The algorithms are examined since they have fundamental differences in their properties, from off-policy methods such as DQN to on-policy methods such as PPO and Advantage Actor Critic. The algorithms are then compared to a P-controller to evaluate their performance with respect to simpler methods. The results from the study show that DRL needs around 100,000 interactions with the model to converge, which in a real setting would correspond to around 70 days of learning. The thesis also shows that the converged algorithms do not show considerable improvements over the simple P-controller when multiple experiments with varying loads and configurations are tested. Due to the lack of data and the slow real-time systems, with robustness being an important consideration, the time to convergence required by a DRL agent is seen as a major problem. Therefore, the author cannot recommend DRL for autoscaling in NFV given the current state of the technology. Instead, the author recommends simpler methods, such as supervised machine learning or classical control theory.
Acknowledgements
Firstly I would like to thank my supervisor and domain expert at Ericsson, Herbie Francis, for his support and trust in me. Our daily discussions about AI, life and the fate of humanity have helped me in writing this thesis and growing as a person.
I would also like to express my gratitude to Johannes Stork as my supervisor from KTH; without his support and ideas this thesis would not have been what it is.
In addition to my supervisors I also want to thank Wenfeng and Tobias for their support, Ibrahim for our daily fussball matches, Tord for his genuine interest in the project and Jörgen for his help with identifying and analysing the data from VNFs.
Lastly I would like to thank my boss, Thomas Edwall, for entrusting me with this thesis and giving me the opportunity to explore and learn about the subject of reinforcement learning.
Contents
1 Introduction 1
1.1 Motivation . . . 1
1.2 Research Question . . . 3
1.3 Related Work . . . 3
1.4 Overview . . . 4
2 Network Function Virtualization 5
2.1 Motivation of Virtualization . . . 5
2.2 Management Operations . . . 6
2.2.1 The Scaling Action . . . 6
2.2.2 Ways to Measure Performance . . . 8
2.2.3 Getting Data from Customers . . . 11
3 Reinforcement Learning 13
3.1 Introduction to Reinforcement Learning . . . 14
3.2 Solving a Task using RL . . . 15
3.2.1 State Selection . . . 15
3.2.2 Reward Function . . . 16
3.2.3 Episodic and Continuous Tasks . . . 17
3.3 Mathematical Framework . . . 19
3.3.1 Markov Decision Process . . . 20
3.3.2 Value Functions . . . 21
3.4 Dynamic Programming . . . 22
3.4.1 Bellman Optimality Equations . . . 22
3.4.2 Policy Iteration . . . 24
3.5 Learning from Experience . . . 25
3.5.1 Exploring and Exploiting Experience . . . 25
3.5.2 Monte Carlo Learning . . . 28
3.5.3 Temporal Difference (TD) Learning . . . 29
3.5.4 Unifying Methodology, n-step TD Learning . . . 30
3.5.5 Function Approximation . . . 34
3.5.6 Policy Gradient Methods . . . 35
4 Deep Reinforcement Learning 38
4.1 Artificial Neural Networks (ANN) . . . 39
4.1.1 Feedforward Neural Networks . . . 39
4.1.2 Network Structures . . . 42
4.1.3 Training the Network . . . 43
4.2 Deep Reinforcement Learning Algorithms . . . 46
4.2.1 Neural Fitted Q (NFQ) Iteration . . . 48
4.2.2 Deep Q-Network (DQN) . . . 49
4.2.3 Trust Region Policy Optimization (TRPO) . . . 50
4.2.4 Proximal Policy Optimization (PPO) . . . 52
4.2.5 Actor Critic Algorithms . . . 54
4.2.6 Evolutionary Strategies . . . 55
4.3 Improvements to DRL Algorithms . . . 56
4.3.1 DDQN and Rainbow DQN . . . 56
4.3.2 Advantage Actor Critic (A2C) with GAE(λ) . . . 58
4.3.3 Asynchronous Learning . . . 60
5 Method 64
5.1 Modelling of a VNF During Scaling . . . 64
5.1.1 States in the Model . . . 65
5.1.2 Actions Possible to Perform on the Model . . . 66
5.1.3 External Load, l_t^tr . . . 66
5.1.4 State-transition Dynamics . . . 67
5.2 Modeling the Task . . . 71
5.2.1 Reward Signal . . . 71
5.2.2 Model Parameters . . . 72
5.3 Autoscaling using DRL on Model . . . 74
5.4 Implementation Details . . . 75
6 Experiments and Results 77
6.1 Experiment 1: Gathering Statistics of Training . . . 77
6.1.1 Results . . . 78
6.1.2 Interpretation of Results . . . 82
6.2 Experiment 2: Changing the Model Parameters . . . 83
6.2.1 Results . . . 83
6.2.2 Interpretation of Results . . . 83
7 Discussion 89
7.1 Limitations . . . 89
7.1.1 Uncertainty Due to Different Configurations and Load Patterns . . . 90
7.1.2 Suboptimal Performance due to Exploring . . . 91
7.1.3 Validating Reward Function and Design Considerations . . . 91
7.1.4 Model and the Agents Applied to It . . . 92
7.2 Ethics . . . 95
8 Conclusion 96
8.1 Future Work on Model . . . 96