ebook img

Predicting Short-Term Electricity Demand by Combining the Advantages of ARMA and XGBoost in ... PDF

19 Pages·2017·2.41 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Predicting Short-Term Electricity Demand by Combining the Advantages of ARMA and XGBoost in ...

Hindawi Wireless Communications and Mobile Computing Volume 2018, Article ID 5018053, 18 pages https://doi.org/10.1155/2018/5018053 Research Article Predicting Short-Term Electricity Demand by Combining the Advantages of ARMA and XGBoost in Fog Computing Environment ChuanbinLi ,XiaosenZheng ,ZikunYang ,andLiKuang SchoolofSoftware,CentralSouthUniversity,Changsha410075,China CorrespondenceshouldbeaddressedtoLiKuang;[email protected] Received 26 January 2018; Accepted 25 March 2018; Published 6 May 2018 AcademicEditor:XuyunZhang Copyright©2018ChuanbinLietal.ThisisanopenaccessarticledistributedundertheCreativeCommonsAttributionLicense, whichpermitsunrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalworkisproperlycited. With the rapid development of IoT, the disadvantages of Cloud framework have been exposed, such as high latency, network congestion,andlowreliability.Therefore,theFogComputingframeworkhasemerged,withanextendedFogLayerbetweenthe Cloudandterminals.Inordertoaddressthereal-timepredictiononelectricitydemand,weproposeanapproachbasedonXGBoost andARMAinFogComputingenvironment.BytakingtheadvantagesofFogComputingframework,wefirstproposeaprototype- basedclusteringalgorithmtodivideenterpriseusersintoseveralcategoriesbasedontheirtotalelectricityconsumption;wethen proposeamodelselectionapproachbyanalyzingusers’historicalrecordsofelectricityconsumptionandidentifyingthemost importantfeatures.Generallyspeaking,ifthehistoricalrecordspassthetestofstationarityandwhitenoise,ARMAisusedto modeltheuser’selectricityconsumptionintimesequence;otherwise,ifthehistoricalrecordsdonotpassthetest,andsomediscrete featuresarethemostimportant,suchasweatherandwhetheritisweekend,XGBoostwillbeused.Theexperimentresultsshow thatourproposedapproachbycombiningtheadvantageofARMAandXGBoostismoreaccuratethantheclassicalmodels. 1.Introduction aims at providing enterprises with safe, reliable, and high- quality electric power, has become an indispensable part in Inrecentyears,withtheriseofCloudComputing[1,2],more theconstructionofnationaleconomyandpeople’slife,soit andmorecomputingandstorageprocessingaretakingplace is affected at first. Under the current technical conditions, inCloud,andthevastemploymentofCloudinevitablyleads it is still not possible to achieve large-scale storage of elec- tohighlatency,networkcongestion,andlowreliability.Atthe tric energy; therefore, it is required to generate electricity sametime,withthewideadoptionofIoTservices,avariety accordingtothesystemloadatanytime,orelsethequality ofhouseholdappliancesandsensorswillbeconnectedtothe ofelectricitysupplyandusagemaybeaffected,andeventhe Internetandproducealargeamountofdata[3–5].Ithasbeen safetyandstabilityofthesystemmaybeendangered.Ithas estimatedthatthenumberofdevicesconnectedbyIoTwill become an urgent and important research issue to improve reach 50 billion to 100 billion by 2020, which means there the accuracy of electricity demand prediction in Fog Com- will be more and more data without the control of existing putingframework. techniquesondataprocessingandanalysis,privacyleaksmay Inthefieldofelectricitydemandprediction,scholarshave becaused,andthequalityofservicewillbedecreased[6,7]. carried out extensive research. In the early stage, scholars In this regard, the rapid development of IoT has deepened basically followed the technology in the field of economic the dilemma of Cloud Computing. The emergence of Fog prediction,focusingontheruleoftheloadsequenceinthe Computingmakesupfortheseshortcomingsbutalsobrings formoftimeseriesitself.Thepredictionmodelisestablished newopportunitiesandchallengestothetransformationand byanalyzingthequalitativerelationshipbetweenthehistori- upgradingoftraditionalindustries.Electricitysystem,which calloadandrelatedfactors,andtheparametersareestimated 2 WirelessCommunicationsandMobileComputing accordingtohistoricaldata.However,timeseriesbasedap- [14–16], Neural Network [17–20], Bayes [21], Fuzzy Theory proachrequireshistoricaldatawithhighaccuracy,insensitive [20,22],andWaveletEchoStateNetwork[23].Eachkindof tothefactorssuchasweatherandholidays.Actually,itisvery method has its own applicable scenario, and no model can difficulttoexpressthenonlinearrelationbetweentheinput achievedesiredsatisfyingresultalone. andoutputbyusingaclearmathematicalequationsincethe Inordertoimprovetheaccuracyofprediction,thecur- electricitydataarenonlinear,time-varying,anduncertain.In rent research works mainly focus on three directions. The ordertofurtherimprovetheaccuracyofelectricitydemand firstoneistoexploretheoptimizationofsinglemodel.Zhu prediction,artificialintelligencemethodshavebeenapplied et al. [24] propose to predict daily load roughly by ARMA sincethe1990s,suchasneuralnetwork,expertsystem,and first,thenobtainthedifferencesequencesthatarenoncyclical waveletanalysis.However,theexistingmethodsusuallycan andstronglyinfluencedbytheweather,andfinallyproposean just be applied to a limited scenario and only effective for improvedARIMApredictionmodelwithstrongadaptability simpleelectricitysystemswithasmallquantityoffactors. toweather.Factorswhichinfluencetheelectricityconsump- The main deficiencies of existing work include the fol- tion are recognized and the mapping relation between key lowing:(1)electricitydatasampleshavebeenproventohave factors and electricity consumption are mined [25, 26]. anomalies but few of the existing solutions detect and deal Ghelardonietal.[27]usetheempiricalmodedecomposition withtheabnormaldata.(2)Whenclassifyingusersandtheir methodtodividethetimeseriesintotwoparts,describingthe data,thenumberofclustersneedstobespecifiedinadvance, trendandthelocaloscillationofenergyconsumptionvalues, and the density distribution on electricity consumption is respectively, and then use them to train the support vector usuallyignoredinclusteringusers.(3)Intheprocessofmod- regressionmodel.Cheetal.[28]usethehumanknowledge eling, the temporal features of data are not fully excavated, to construct fuzzy membership functions for each similar andtheinteractionamongthefeaturesisnotfullyconsidered. subgroup and then build an adaptive fuzzy comprehensive (4) Most of the solutions just adopt a sole model, which model based on self-organizing mapping, support vector cannotgivefullplaytotheadvantageofeachmodel. machine, and fuzzy reasoning for prediction. An electricity Inordertoovercometheshortcomingsofexistingsolu- loadmodelisestablishedbasedonimprovedparticleswarm tions, in this paper, we propose a short-term electrical- optimizationalgorithmandgeneticalgorithm[29,30]. demandpredictionapproachbycombiningtheadvantagesof Theseconddirectionistoimprovetheaccuracyofpre- XGBoostandARMAinFogComputingframework.Firstly, dictionbyintegratingdifferentmodels.Haqueetal.propose sensorscollectthereal-timedataofelectricityconsumption, a hybrid intelligent algorithm based on wavelet transform andthentheFogNodeswouldclassifyenterpriseusersinto and fuzzy adaptive resonance theory [31]. In [32–34], the differentgroupsaccordingtotheamountofelectricitycon- wavelet decompositionis used to project theload sequence sumption and perform anomaly and outlier data detection decomposition onto different scales, different models are and procession for each group. Secondly, a model selection usedtopredictthedifferentcomponents,andfinallythefinal process would be performed, that is, based on a series of result is obtained by reconstructing the components. Pin- tests including stationarity, white noise and Pearson corre- doriyaetal.[35]proposeanadaptivewaveletneuralnetwork lation coefficient of data, as well as the observed electricity (AWNN) for short-term price prediction in the electricity consumptionrule;wewilldecidewhethertousetimeseries market.PanyandGhoshal[36]proposealocallinearwavelet model or decision tree based model for the modeling of neural network (LLWNN) model instead of wavelet neural eachenterprisegroup.Finally,thepredictionvaluesofeach network for the electricity price prediction. Che and Wang enterprise user are combined to obtain the final result. The [37] propose a hybrid model that combines the unique accuracy of the proposed approach is verified to be 20% advantagesofSVRandARIMAmodelsinbothnonlinearand higherthanclassicalmodelsbyexperiments. linearmodeling. The rest of the paper is organized as follows: Section2 The third direction is to explore composite models for reviewstherelatedworkonpredictingshort-termelectricity prediction.Theweightedaverageofalltheresultsbyvarious demand.Section3describestheframeworkoftheproposed algorithmisusuallyused,andtherearetwokindsofwaysto solution,aswellasthedetailsoftheinvolvedkeytechniques. determinetheweights.Thefirstkindistoimprovethefitting Section4introducestheexperimentandanalyzestheresults. accuracy of historical electricity consumption by minimiz- Section5 summarizes our work and provides the future ing the fitting error. The main methods include monotone researchplan. iterativealgorithm[38],evolutionaryprogramming[39],and quadraticprogramming[40].Wangetal.[41]proposeusing 2.RelatedWork adaptivepracticalswarmoptimizationalgorithmtooptimize the weight of the integrated model. The second kind is to Becauseofthenonlinear,time-varying,anduncertaintychar- determine the weights by evaluating the algorithm’s score. acteristicsofelectricitydata,itisdifficulttoaccuratelygrasp Elliott and Timmermann [42] introduce the concept of the the related factors and the rules of electricity consumption lossfunction,quantifythenegativeimpactcausedbydifferent change. How to effectively improve the accuracy of elec- prediction errors, then take the minimum loss expectation tricity demand prediction has become a major challenge to as the goal, and perform optimization to get the weights. researchers[8].Atpresent,themethodsusedforshort-term Yaoetal.[43,44]employanalytichierarchyprocess(AHP) electricitydemandpredictionmainlyincludetimeseries[9– inmultiobjectivedecisionanalysistogettherelativemerits 11], Regression Analysis [12, 13], Support Vector Regression of each algorithm in fitting accuracy, model adaptability, WirelessCommunicationsandMobileComputing 3 Table1:Examplesoftheenterpriseelectricityconsumptiondataset. AsanextensionofCloudComputing,theframeworkpre- processesenterprisedataandmakesreal-timedecisionand record date user id power consumption providestemporarystoragetoenhancetheusers’experience. 2015/1/1 1 1135 Inthecurrentelectricitysystems,thenumberofhopsfrom 2015/1/2 1 570 enterpriseterminaltoCloudisgenerally3to4orevenmore, 2015/1/3 1 3418 sothesystemwillhavetofacethenetworkdelaywhenmak- 2015/1/4 1 3968 ing real-time decisions. Figure1 shows the framework of a smartelectricitysysteminFogComputing.Electricitymeters collectdataassensors,and,forsomeenterpriseswithmajor and result reliability, the judgment matrices are obtained, changesofdataandhighreal-timerequirement,wecanmake andthentheweightsarecombinedbycalculatingthemain real-time decisions in Fog Nodes to meet the need of real- eigenvectors of each matrix. Petridis et al. study the use of time electrical-demand prediction; otherwise, the data can probability models to determine the weight of each model bebufferedatFogNodes,compressedtosavenetworkband- andcombinethevaluesofeachalgorithmtoobtainthefinal width,andthentransmittedtotheCloud. result [45]. Without enough quantitative theoretical basis, the weights of such models only reflect the advantages and 3.2.2. Process of Electricity Demand Prediction Approach. disadvantagesofthealgorithms. Figure2showstheprocessofelectricitydemandprediction Insummary,therehavebeenmanyresearchworksonthe approach based on multimodel fusion algorithm, which prediction of short-term electricity demand, and exploring includesfivemainsteps. compositemodelsforpredictionisthemaintrend.However, existingmethodsstillhavelimitations.Inthecontextofthe (1) Data Preprocessing. After data are collected, the missing rapiddevelopmentofsmartelectricitygrid,inthispaper,we valueswillbefilledandtheirformwillbeunified. proposeashort-termelectricitydemandpredictionmethod basedonXGBoostandARMA. (2)EnterpriseUsersClustering.Thesizeofelectricitydemand for each enterprise user is measured by counting the total 3.PredictingShort-TermElectricityDemand amountofitshistoricalelectricityconsumption.Thenenter- prisesuserswillbedividedintodifferentgroupsbyclustering 3.1.ProblemDefinition. Givenadataset𝑅 = {𝑟1,𝑟2,𝑟3⋅⋅⋅𝑟𝑛}, themaccordingtotheirsizesofelectricitydemand. 𝑟𝑖 is the 𝑖th electricity consumption record for a certain enterprise user, and 𝑟𝑖 can expressed as a 3-tuple, that is, (3)ModelSelection.Wethenaimtodetermineanappropriate 𝑟𝑖 = ⟨record date, user id, power consumption⟩, where trainingmodelforeachgroupofenterpriseusers.First,the record date represents the date time, user id is the ID of rulesofelectricityconsumptionfordifferentgroupsofenter- the enterprise user, and power consumption represents the priseusersareanalyzedforprejudgementofmodelselection. electricityconsumptionamountoftheenterpriseuseronthat If the electricity consumption changes with time, showing day. a periodical change or an obvious rising/falling trend, we We use the dataset from Tianchi, which contains the consider to model the users’ data by XGBoost. If the elec- historical records of 1454 enterprises in Yangzhong High- tricityconsumptionshowsanirregularchangeovertimeor TechIndustrialDevelopmentZoneofJiangshuProvincefrom fluctuatesaroundacertainconstantandthefluctuationrange 2015-01-01 to 2016-11-30, for the following illustration and islimited,weconsidertomodeltheusers’databytimeseries experiments.ExamplesofthedatasetareshowninTable1. model. Giventhedataset𝑅andamonthtimeinfuture,weaim Second,aseriesoftestswouldbeperformedtoverifythe topredictthetotalamountoftheelectricitydemandonthe prejudgement and finally determine the selected model for desiredmonthinthisregiononthebasisofhistoricalrecords. eachgroupofusers.Featurecorrelationanalysisandfeature importance scores would be performed before selecting 3.2.FrameworkoftheProposedApproach XGBoost modeling. Stationary and white noise test would be performed before selecting ARMA modeling. If neither 3.2.1.SmartElectricitySysteminFogComputing. FogCom- XGBoostnorARMAisappropriate,meanmodelwillbeused. putingframeworkhastheadvantagesoflowlatency,saving corebandwidth,andhighreliability[46,47].FogNodesare (4) Model Building. After model is selected for a given located lower in the network topology and thus they have group of users, data cleaning is performed first, including lessnetworklatencyandmorereactivity.Asanintermediate thedetectionandprocessingofanomaliesandoutliers.For between Cloud and terminals, Fog Layer can filter and XGBoost modeling, anomalies and outliers will be deleted aggregate enterprise messages and only send the necessary first, the influence of time factor and temperature on the messages to Cloud, thus reducing the pressure on the core predictionwillbeconsideredemphatically,Pearsoncorrela- network. In order to serve enterprises in different regions, tion coefficient will be used to identify redundant features, thesameserviceswillbedeployedontheFogNodesineach andappropriatefeatureswillbeselectedtobuildthemodel region.Oncetheservicesinacertainareaareabnormal,the bycombiningthefeatureimportanceoutputtedbypretrain- requests can be quickly forwarded to other same services ing. For AMRA modeling, anomalies and outliers will be nearby,whichmakestheframeworkhighlyreliable. modified by average value, and parameters 𝑝 and 𝑞 will be 4 WirelessCommunicationsandMobileComputing Cloud layer Core network Fog layer User terminal Figure1:SmartelectricitysysteminFogComputing. Data Collecting Processing Unifying Data Preprocessing Data Missing values Form Enterprise Users Counting Size of Clustering Users based on Clustering Electricity Demand Prototype-based K-means Model Model Pre-judgement Model Determination based Selection based on Observation on a Series of Tests Model Building XGBoost ARMA Mean Predicting Predicting Values for Summing up to Get Total Electricity Different User Groups Electricity Demand Demand Figure2:Processofelectricitydemandpredictionapproach. optimized based on the minimum amount of information collect external weather data from http://lishi.tianqi.com. principleofBIC. SamplesareshowninTable2. (5)PredictingElectricityDemand.Aftermodelingfordifferent (2) Processing Missing Value. We find that there are some enterprisegroups,thepredictionvaluesofeachgroupwillbe missingvaluesinTianchidataset,sowethenfillthemissing summed up to get the final daily electricity demand in the valueswiththemeanvalueofthethreedaysbeforeandafter desiredmonthofthisregion. thedatewithmissingvalue.Thedetailedcalculationisshown asfollows: 3.3.KeyTechniques. Inthissection,wewillelaborateonthe (𝑐 +𝑐 +𝑐 +𝑐 +1,𝑐 ,𝑐 ) detailed key techniques in the five steps of the electricity 𝑐 = 𝑖,𝑗−3 𝑖,𝑗−2 𝑖,𝑗−1 𝑖,𝑗 𝑖,𝑗+2 𝑖,𝑗+3 , (1) demandpredictionapproach. 𝑖𝑗 6 inwhich𝑐𝑖𝑗 representsuser𝑖’smissingrecordonthe𝑗thday 3.3.1.DataPreprocessing oftheyear. (1)CollectingExternalWeatherData.Takingintoaccountthe (3)UnifyingDataForm.Theelectricityconsumptionrecords impact of temperature on electricity consumption, we first are reorganized to facilitate the follow processing. Each WirelessCommunicationsandMobileComputing 5 Table2:Samplesofweatherinformation. Record date Highesttemperature Lowesttemperature Conditions Winddirection Windforce 2016/10/12 21 16 Overcast West breezy 2016/10/13 20 16 PartlyCloudy Calm lightbreeze 2016/10/14 19 17 MostlyCloudy West lightbreeze 2016/10/15 20 16 Showers East breezy 2016/10/16 23 17 Overcast NorthWest lightbreeze column represents records of an enterprise user, and there classifiedtothesecondcategory.TheuserswithID90,129, are totally 1454 users. Each row represents the records on 1262,1307,and1310,whoserelativedistancesarebetween100 a certain day, and the dates are sorted in ascending order. to20,shouldbethethirdcategory.Andtheremaining1447 Eachgridrepresentstheelectricityconsumptionofaspecific enterpriseusers,whoserelativedistancesarebelow20,should user on a certain day, which can be expressed as 𝑟𝑖𝑗 = bethefourthcategory.Sowesetk=4whenusingK-means {Record date𝑗,Power𝑖,𝑐𝑖𝑗}.Theunifieddatasheetisshownas algorithm.Andthefinalresultofusersclusteringisshownas Table3. Table4. 3.3.2. Clustering Users by Prototype-Based K-Means Algo- 3.3.3. Model Selection. Model selection is the basis of next rithm. Clustering analysis can find locally strongly related stepanditisespeciallyimportant.Figure4showstheprocess object groups. Outlier detection can detect objects that are and rules of model selection, which mainly includes four significantly different from most objects by detecting the parts. discreteness of the data. Based on the characteristics of the two techniques, in this paper, we use a prototype-based (1) Data Preparing. The data now mainly include the enter- clusteringmethodtodetectthedegreeofdatadiscreteness, priseuserID,itsgroupID,thehistoryelectricityconsump- soastolearnthedistributionofthedataandthendetermine tion,andtheweatherdatafrom2015/1/1to2016/11/30. the range of 𝑘 first, and next use K-means algorithm to clusterenterpriseusersinordertoachieveenterprisegroups (2)ModelPrejudgementBasedon division. Periodicity/TrendandNonlinear/WeakStationaryAnalyzing Theprincipleoftheprototype-basedclusteringmethodis toclusteralltheobjectsfirstandthenevaluatethedegreethat (2a) Periodicity/Trend Analyzing. Periodicity analyzing is to theobjectsbelongtotheclustersaccordingtothedistance.In findoutwhetherelectricityconsumptionofauserwillchange traditionalmethod,thedistancebetweentheobjectandthe periodically with time. If a user’s electricity consumption cluster center is used to measure the degree that the object showsregularfluctuations,theincreasingspeedisthesimilar belongstothecluster.Inthispaper,weconsiderthedensity at both sides of the wave peak, and the appearance of peak of data distribution and adopts the relative distance. Based isstronglyrelatedtotimeperiod,thatis,Power(𝑛)=Power onthisidea,wedesigntheclusteringalgorithmforenterprise (𝑛 + 𝑇), where 𝑇 is the time span; then we can make a groupsdivisionasshowninAlgorithm1. prejudgement that XGBoost model would be appropriate WewilltaketheTianchidatasetasanexampletoillustrate withthischaracter. the process of prototype-based K-means clustering algo- Trend analyzing is to find out whether change pattern rithm: of electricity consumption will show some trend. If a user’s First,wecalculatethehistoricalelectricityconsumption electricityconsumptionshowsatendencytoriseorfall,orthe ofeachenterprisefrom01/01/2015to10/31/2016,andcluster appearanceofpeakchangeswithtimeandshowsarelationof allthesamplesbyusingprototype-basedalgorithm.Wetake positiveandnegativecorrelation,thenwecanalsoconsider therelativedistancetomeasurethediscretedegreeofsamples usingXGBoostmodelforthischaracter. andchoosethethreshold𝜆as20,sothesamplewithrelative distance greater than 20 is deemed as an outlier. Figure3 (2b)Nonlinear/WeakStationaryAnalyzing.Iftheuser’selec- showsthediscretepointswithrelativedistances. tricityconsumptionpresentsirregularchangeormutationor EachofthepointsinFigure3ismarkedwithapair,which noobviouschange,itshowsthatthesequenceofelectricity indicates the enterprise ID and its relative distance to the consumptionisnonlinear.Iftheuser’selectricityconsump- clustercenter.Thereare7pointswithhigherdispersionthat tion shows a fluctuation around a certain constant and the are 1416, 175, 174, 90, 129, 1262, 1307, and 1310 from high to fluctuationisapproximatelyinthesamerange,itcanbeseen low.AccordingtoFigure3,thedataaregenerallydistributing thatthemeanandvarianceofthedataareallconstant,which infourdistancesegments.User1416islocatedatthetopof indicates that the data has no obvious trend. If the mean thefigure,whoserelativedistanceisgreaterthan400,much value has nothing to do with the change of time, and the higherthantheotherpoints,anditshouldbeclassifiedtothe influence among the sequence variables is almost the same firstcategory.Atthemiddleofthefigure,175and174,whose afterdelaying𝑘periods,itreflectstheweakstationaryofdata. relativedistancesarebetweenabout150and200,shouldbe Sowecanmakeaprejudgementthattimeseriesmodelinglike 6 WirelessCommunicationsandMobileComputing Table3:Unifieddatasheet. Record date Power1 Power2 Power3 Power4 Power5 Poweri Power1454 2015/1/1 1135 24 385 206 156 ⋅⋅⋅ 976 2015/1/2 570 22 475 5134 368 ⋅⋅⋅ 520 2015/1/3 3418 18 526 6784 359 ⋅⋅⋅ 536 2015/1/4 3968 119 1 6475 467 ⋅⋅⋅ 488 2015/1/5 3986 108 535 6592 433 ⋅⋅⋅ 406 2015/1/6 4082 109 663 6742 452 ⋅⋅⋅ 554 2015/1/7 4172 86 606 6963 467 ⋅⋅⋅ 803 2015/1/8 4022 106 735 6301 495 ⋅⋅⋅ 975 2015/1/9 4025 103 698 6693 542 ⋅⋅⋅ 796 Input:HistoricalelectricityconsumptionrecordsR Output:thegroupdivisionofenterpriseusersClusterSet(Power) ∗ i ∗ (1)/ Calculatethehistoricaltotalelectricityconsumptionofeachenterprise / INPUT:𝑅={Record date𝑗,user𝑖,𝑐𝑖𝑗} FOR𝑖=1Tolength(user) i DO SumPower𝑖=0 FOR𝑗=1Tolength(Record date) j DO SumPoweri=SumPoweri+𝑐𝑖𝑗; END END OUTPUT:{SumPower𝑖} ∗ (2)/ Selecttheclusteringalgorithm(K-means)toclusteralltheobjectsandthenchoosek= ∗ 1toclusterallthesamplesintoaclusterandfindthecentroidofthecluster / INPUT:{SumPower𝑖} DO Select𝐾=1 ClusterCenter=Kmeans(SumPower𝑖) END OUTPUT:ClusterCenter ∗ ∗ (3)/ Calculatetherelativedistance / INPUT:ClusterCenter,{SumPower𝑖} FOR𝑖=1Tolength(user𝑖) DO AbsoluteDistance𝑖=SumPower𝑖−ClusterCenter MedianDistance𝑖=median(AbsoluteDistance𝑖) RelativeDistance𝑖=AbsoluteDistance𝑖/MedianDistance𝑖 END OUTPUT:{RelativeDistance𝑖} (4)Makediscretepointrelativedistanceerrorfigureandroughlydeterminetherangeofk accordtotherelativedistance (5)/∗Use𝐾-meansalgorithmtodoclusteringforenterprises,andgetthegroupsofenterprises ∗ accordingtotheresultofclustering / INPUT:{⟨user𝑖,SumPower𝑖⟩},k DO Select𝐾=𝑘 ClusterSet(user𝑖)=Kmeans(user𝑖) END OUTPUT:ClusterSet(user𝑖) Algorithm1:Prototype-basedK-meansclusteringalgorithm WirelessCommunicationsandMobileComputing 7 Table4:Clusteringresult. ClusterID EnterpriseuserID Semanticmeaning 400<relativedistance, 1 1416 largeenterprise 150<relativedistance<200, 2 174,175 Secondlargeenterprise 20<relativedistance<100, 3 90,129,1262,1307,1310 mediumenterprise 0<relativedistance<20, 4 Otherenterprisesexcept1416,174,175,90,129,1262,1307,1310 smallenterprise (1416, 424.86) eachfeatureandthetargetvalueandstudyingtheinfluence 400 of each feature on the change of the target. By outputting the importance of all features through the pretraining of XGBoost model, if the factors which cannot be character- e 300 izedintimeseries,suchasweather,vacation,andworkday/ c an nonworkday, have high scores, it can be determined to use dist XGBoostmodel. ve 200 (175, 194.30) Relati (174, 157.40) dataIfabreotnhottessutsithabavleefloorwXsGcoBroeos,stitminoddiecla,teasndthwatetchaenutsuerrn’s 100 totestwhethertheyaresuitableforthetimeseries(ARMA) model. (90, 60.59) (1262, 35.77) (129, 35.33) ((11331007,, 2244..5533)) 0 (4)ModelVerificationforARMA.IfARMAisprejudgedasan 0 200 400 600 800 1000 1200 1400 appropriatechoiceforauser,orifXGBoostisnotsuitable,we thenperformaseriesoftestsincludingPearsoncorrelation ID coefficient, stationarity testing, and white noise testing to Figure3:Discretepointswithrelativedistances. verifywhetherARMAissuitable. (4a)CorrelationAnalyzing.Sameasin(3b),wetestthecorre- lation between features and the target value by using Pear- ARMAwouldbeappropriateforsuchkindofdatawithweak son’s correlation coefficient. If the correlation between the stationary. targetanddiscretefeaturesislow,wecanconsidertousethe timeseries(ARMA)modelpreliminarily. (3)ModelVerificationforXGBoost.IfXGBoostisprejudged as an appropriate choice for a user, a series of tests includ- (4b) Stationarity Testing. Stationarity is the prerequisite for ing Pearson correlation coefficient and feature importance timeseriesmodeling.Ifthedataisstationary,thefittingcurve analyzingwouldbeperformedbeforefinallydecidingtouse obtained through the sample time series can still inertially XGBoost. continue for a period of time in the future. If the data is not stationary, it indicates that the shape of the sample (3a)FeatureGenerating.Featureistheinformationextracted fittingcurvedoesnothavethecharacteristicof“inertia”con- fromthedatathatisusefulinprediction.Featureextraction tinuation;thatis,thecurvefittingbasedonthesampletime is mainly based on the existing background knowledge, so seriestobeobtainedinthefuturewillbedifferentfromthe thatthefeaturecanplayabetterroleinthemachinelearning currentsamplefittingcurve.Totestthestationarityofdata, algorithm. Based on thorough analysis of the features, we weusetheunitrootforinspection.Iftheelectricityconsump- mainly extract the time and weather features as the input tionsequencedatahasunitroot,thedataisstationary. features. (4c)WhiteNoiseTesting.Whitenoisesequenceisastationary (3b)CorrelationAnalyzing.WethenusePearsoncorrelation randomsequencewithoutanyinformation.Ifthesequenceis coefficienttesttoanalyzethecorrelationbetweenthefeatures whitenoise,itindicatesthatthereisnorelationshipbetween and the prediction value. We use the results of feature thevaluesofsequence,anditisapurelyrandomsequence. generationandelectricityconsumptiondatafortesting.Ifthe Theautocorrelationcoefficientisequaltozero,thatis,𝜆(𝑘)= correlationbetweenthepredictiontargetandsomediscrete 0, 𝑘 ≠ 0. If the white noise test is passed, it shows that the features,suchasweather,holidays,workday/nonworkday,is sequenceisanon-whitenoisesequence. strong,itisbettertouseXGBoostformodeling. Ifbothstationarityandwhitenoisetestingarepassed,we (3c)FeatureImportanceAnalyzing.Featureimportanceanal- candeterminethatARMAmodelissuitableformodelingthe ysisreferstoanalyzingtheimportancerelationshipbetween user’sdata. 8 WirelessCommunicationsandMobileComputing (1) Data preparing user (2a) Periodicity/Trend analyzing (2b) Nonlinear/Weak stationary category analyzing ∘C wedaatthaer −−00000.....1100005005 20040060080010001200 876543210000000006008001000120014001600 Periodicity Tendency Weak stationary Nonlinear electricity demand Periodicity/Tendency Nonlinear/weak stationary (3a) Feature generating Day Mon YearDoy 1 1 1 4 2 1 1 5 (4a) Correlation analyzing (4b) Stationary testing (3b) Correlation analyzing (3c) Feaatnuarely izminpgortance Lowfe faetautruer ree ilmevpaonrctea nsccoer/elow SSS123SSS123 0003...10003974...126819742681/000/3...00078713...582787115821#000#3...00078813...852788118521 TTn1::{{!!$$·&&·,,·......,,00LLII<<}} SSS132SSS123 0003...10003974...126819742681/000/3...00078713...582787115821#000#3...00078813...852788118521 Feature importrance score111420864200000000 f2 f0 f7Featurfe9 Impof1r0tancesf1 f6 f3 Ffaaiilleedd wcohrirtee lnatoioisne tteesstt/ ARMA model (4c) White noise testing 110 110086 T1:{!==IL,...,0LI<} XGBoost model (5) Mean value model 110042 ··· Tree 1 Tree 2 10908 Tn:{32n1,...,0LI<} YIs mYale?AgNe < 15N YUse cdoamilyputerN 96969800020406081012 +2 +0.1 −1 +0.9 −0.9 f()=2+0.9=2.9f()=−1+0.9=−0.1 Mean value model XGBoost model ARMA model Figure4:Frameworkofmodelselection. (5) Mean Value Model. If neither XGBoost nor ARMA is andnonlinear.SoitcanbeprejudgedtoemployARMAfor suitableformodeling,wethenusethemeanmodelasafinal modeling. choice.Themeanmodeltakesthemeanofthehistoricalelec- Figure6showsthedataofasmallenterprise. tricityconsumptiondataasthepredictionvalue. From Figure6, it can be seen that the curve presents a We then show the process of model selection by two cyclicalfluctuationwiththechangeoftime.Throughanalysis, examples: one is a small enterprise and another one is the wecanfindthatcurvepresentsa“W”shape,itissymmetric enterprise with ID 1307. Figure5 shows the daily electricity on the 1/1/2016, and every small peak fluctuates in the unit consumptionamountofthe1307thenterprisefrom1/1/2015 of week. The data wave within each week is like a convex to10/31/2016,inwhichthe𝑥-axisisrepresentedasthe𝑥thday line,showingahighmiddleandlowsides.Itcanbejudged intheperiod. thattheelectricityconsumptionofsmallenterprisesisgreatly From Figure5, we can find that the curve of the 1307th influencedbytheyearandweek.Soitcanbeprejudgedtouse enterprise fluctuates on the mean value line from 2015 to XGBoostformodeling. 2016 and the fluctuation amplitudes are almost the same, From the analysis above, we can find out that most which is consistent with characteristics of weak stationarity enterprisesarehighlycorrelatedwithtimefeatures,sowecan WirelessCommunicationsandMobileComputing 9 300000 Line charts of 1307th enterprise FromTable6,aftersmoothprocessing,wecanfindthat the𝑝valueoftheunitrootteststatistic(0.0089)oftheseries is significantly less than 0.01, so the original hypothesis is 250000 strictly rejected, which judges that the series is a stationary sequence.The𝑝valueofwhitenoisetestissignificantlyless 200000 Mean value than0.01,sowestrictlyrejecttheoriginalhypothesis,which wer 150000 judges that the logarithmic processed series is a stationary o non-whitenoisesequence.Combinedwiththeprejudgement P andtheresultofthethreetests,wecandeterminetochoose 100000 ARMAformodelingthedataofthe1307thenterprise. FromFigure8,wecanfindthatthescoresoftimefeatures 50000 for the small enterprise are relatively high, so it can be in- ferredthattheelectricityconsumptioncharacteristicsofsmall 0 enterprises are highly correlated with the time features. In 0 100 200 300 400 500 600 particular,thefeature“holidays”hasthestrongestcorrelation, Daydistance whiletimeseriesmodelcannotmakefulluseoftemperature Figure5:Dailyelectricityconsumptionamountofthe1307thenter- andholidayfeatures.Sotimeseriesmodelingisnotsuitable prise. forsmallenterprises. Wethencalculatefeatureimportancescoresforthesmall enterprisebypretraininginXGBoost,andtheresultisshown inFigure9. 2500000 Aftertraining,theXGBoostsortstheimportanceoffea- turesfromhightolowandtheresultisdoy,daydis,dow,maxt, 2000000 mint,dom,holiday,andwoy(f2,f7,f0,f9,f10,f1,f6,andf3). Wecanfindoutthatsmallenterprisesareinfluencedbydoy, 1500000 dow,daydis,maxt,andmintgreatly,whichisconsistentwith wer previousanalysis. o P1000000 Combining the prejudgement with the test of Pearson correlationcoefficientandfeatureimportancescores,wecan 500000 determinethatXGBoostmodelingismoresuitableforsmall enterprises. 0 0 100 200 300 400 500 600 3.3.4. Abnormal Data Processing. Data quality is crucial for theperformanceofmodels.Alargenumberofabnormaldata Daydistance intheoriginaldatamayleadtothedeviationoftheresult,so Figure6:Dailyelectricityconsumptionamountofasmallenter- itisnecessarytocleanthedata.Missingvalueprocessinghas prise. beendoneinthedatapreprocessing,andthen,inthispart, wemainlyperformthedetectionandprocessingofabnormal data. extract temporal features [48] as attributes for feature con- Inordertodetectabnormaldata,outlierdetectionisusu- struction.Sahayetal.[49]introducedtheinfluenceoftem- allyusedtofindoutthevalueswhichobviouslydeviatefrom peraturetoelectricityconsumption,sowealsoconsiderthe mostofthesamples.Forsomeenterprises,weusepretraining effectoftemperature.ThenweusePearsoncorrelationcoeffi- modelingtomarkthesuddenpointswhichdeviatetoomuch cienttotestthecorrelationbetweenelectricityconsumption from the fitted curve as the outlier. Prototype-based clus- andfeaturesforenterprisesfromeachgroup. tering outlier detection method is used to detect the out- Themainfactorswhichmaycausechangesinelectricity liers which are deviating from the centroid obviously. This consumptionarespecificallyshowninTable5. outlierdetectionalgorithmalsoadoptstherelativedistance Figure7 is the Pearson correlation coefficient test result whichissimilartoAlgorithm1.Andthereisalittledifference; of1307thenterprise. that is, the outlier detection algorithm based on prototype Thecorrelationcoefficientmatrixisasymmetricmatrix. clusteringfilterstheoutliersbyusingtheappropriatethresh- The correlation between the feature and the target can be oldvalue𝑎inthe4thstepandoutputsthedetectedoutliers. regarded as the importance of the feature, which is more Figure10showstheoriginaldailyelectricityconsumption important if it is closer to 1 or−1. From Figure7, we can amountofthe175thenterprise. find that the scores of time features for the 1307th enter- ItcanbeseenfromFigure10thattheelectricityconsump- prisearerelativelylow,soitcanbeinferredthatitselectricity tionofthe175thenterprisehasalargedifferencebetweenthe consumption characteristics are irrelevant to the time fea- first year and the second year, so segment detection will be tures.Nextweperformstationaritytestandwhitenoisetest used.Wetakeyearasaunittocarryoutsegmentdetection. onthedataofthe1307thenterprise. Figure11 shows the results of outlier detection based on Table6showsthetestresultsofthe1307thenterprise. prototype-basedclustering. 10 WirelessCommunicationsandMobileComputing target 1 0.0096 0.12 0.017 0.012 0.0092 0.0093 −0.24 −0.045 −0.05 0.2 0.19 −0.0011 0.8 f0 0.0096 1 0.004 −0.007 −0.011 −0.0074 0.79 0.098 −0.0013−0.0015 0.0066 0.0031 0.0013 f1 0.12 0.004 1 0.099 0.072 0.01 0.0069 −0.24 0.052 0.0071 0.012 0.032 0.012 f2 0.017 −0.007 0.099 1 0.97 1 −0.0041 −0.07 0.37 0.37 0.39 0.46 0.96 0.4 f3 0.012 −0.011 0.072 0.97 1 0.97 −0.006 −0.019 0.34 0.34 0.37 0.44 0.94 f4 0.0092 −0.0074 0.01 1 0.97 1 −0.0047 −0.05 0.36 0.36 0.4 0.46 0.97 f5 0.0093 0.79 0.0069 −0.0041 −0.006 −0.0047 1 0.098 0.0016 0.0013 −0.0048−0.0024 0.002 0.0 f6 −0.24 0.098 −0.24 −0.07 −0.019 −0.05 0.098 1 −0.014 −0.0035 −0.03 −0.021 −0.032 f7 −0.045 −0.0013 0.052 0.37 0.34 0.36 0.0016 −0.014 1 1 0.29 0.33 0.36 −0.4 f8 −0.05 −0.0015 0.0071 0.37 0.34 0.36 0.0013 −0.0035 1 1 0.3 0.34 0.36 f9 0.2 0.0066 0.012 0.39 0.37 0.4 −0.0048 −0.03 0.29 0.3 1 0.95 0.39 f10 0.19 0.0031 0.032 0.46 0.44 0.46 −0.0024 −0.021 0.33 0.34 0.95 1 0.45 −0.8 f11 −0.0011 0.0013 0.012 0.96 0.94 0.97 0.002 −0.032 0.36 0.36 0.39 0.45 1 target f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 Figure7:Pearsoncorrelationcoefficienttestresultofthe1307thenterprise. target 1 −0.26 0.13 0.2 0.18 0.19 −0.32 −0.56 0.22 0.22 0.17 0.18 0.18 0.8 f0 −0.26 1 0.004 −0.007 −0.011 −0.0074 0.79 0.098 −0.0013−0.0015 0.0066 0.0031 0.0013 f1 0.13 0.004 1 0.099 0.072 0.01 0.0069 −0.24 0.052 0.0071 0.012 0.032 0.012 f2 0.2 −0.007 0.099 1 0.97 1 −0.0041 −0.07 0.37 0.37 0.39 0.46 0.96 0.4 f3 0.18 −0.011 0.072 0.97 1 0.97 −0.006 −0.019 0.34 0.34 0.37 0.44 0.94 f4 0.19 −0.0074 0.01 1 0.97 1 −0.0047 −0.05 0.36 0.36 0.4 0.46 0.97 f5 −0.32 0.79 0.0069 −0.0041 −0.006 −0.0047 1 0.098 0.0016 0.0013 −0.0048−0.0024 0.002 0.0 f6 −0.56 0.098 −0.24 −0.07 −0.019 −0.05 0.098 1 −0.014 −0.0035 −0.03 −0.021 −0.032 f7 0.22 −0.0013 0.052 0.37 0.34 0.36 0.0016 −0.014 1 1 0.29 0.33 0.36 −0.4 f8 0.22 −0.0015 0.0071 0.37 0.34 0.36 0.0013 −0.0035 1 1 0.3 0.34 0.36 f9 0.17 0.0066 0.012 0.39 0.37 0.4 −0.0048 −0.03 0.29 0.3 1 0.95 0.39 f10 0.18 0.0031 0.032 0.46 0.44 0.46 −0.0024 −0.021 0.33 0.34 0.95 1 0.45 −0.8 f11 0.18 0.0013 0.012 0.96 0.94 0.97 0.002 −0.032 0.36 0.36 0.39 0.45 1 target f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 Figure8:Pearsoncorrelationcoefficienttestresultofasmallenterprise.

Description:
6963. 467. ⋅⋅⋅. 803. 2015/1/8. 4022. 106. 735. 6301. 495. ⋅⋅⋅. 975. 2015/1/ .. mint, dom, holiday, and woy (f2, f7, f0, f9, f10, f1, f6, and f3). We can
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.