DTIC ADA441009: Automated Network Fault Management
TECHNICAL RESEARCH REPORT

Automated Network Fault Management

by J.S. Baras, M. Ball, S. Gupta, P. Viswanathan, P. Shah

CSHCN T.R. 97-24 (ISR T.R. 97-64)

The Center for Satellite and Hybrid Communication Networks is a NASA-sponsored Commercial Space Center also supported by the Department of Defense (DOD), industry, the State of Maryland, the University of Maryland and the Institute for Systems Research. This document is a technical report in the CSHCN series originating at the University of Maryland. Web site: http://www.isr.umd.edu/CSHCN/

Report documentation (Standard Form 298, OMB No. 0704-0188): report date 1997; performing organization: Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783; distribution/availability statement: approved for public release, distribution unlimited; supplementary notes: the original document contains color images; 8 pages, unclassified.

AUTOMATED NETWORK FAULT MANAGEMENT*

J.S. Baras, M. Ball, S. Gupta, P. Viswanathan, and P. Shah
Center for Satellite and Hybrid Communication Networks
Institute for Systems Research
University of Maryland
College Park, Maryland 20742

ABSTRACT

Future military communication networks will have a mixture of backbone terrestrial, satellite, and wireless terrestrial networks. The speeds of these networks vary and they are very heterogeneous. As networks become faster, it is not enough to do reactive fault management. Our approach combines proactive and reactive fault management. Proactive fault management is implemented by dynamic and adaptive routing. Reactive fault management is implemented by a combination of a neural network and an expert system. The system has been developed for the X.25 protocol. Several fault scenarios were modeled and included in the study: reduced switch capacity, increased packet generation rate of a certain application, a disabled switch in the X.25 cloud, and disabled links. We also modeled the occurrence of alarms, including the severity of the problem, the location of the event, and a threshold. To detect and identify faults we use both numerical data associated with the performance objects (attributes) in the MIB as well as SNMP traps (alarms). Simulation experiments have been performed in order to understand the convergence of the algorithms, the training of the neural networks involved, the G2/NeurOnLine software environment, and MIB design.

INTRODUCTION

Fault management [14, 15, 16, 17] includes detecting, isolating, and repairing problems in the network; tracing faults, given many alarms in the system; using error logs; and tracing errors through the log reports [1]. One of the problems faced by network control centers is that of handling extremely large volumes of data dealing with the performance of the networks. The data volume makes the task of finding the problem a very time-consuming process. However, unlike some of the other network management functions listed in the ISO model, in fault management speed is very crucial, and recovery from a problem has to occur quickly.

Several efforts have been made to tackle the fault management problem, some of which are described in [4, 5, 6, 7, 8, 9, 10]. Although several interesting issues have been addressed in these papers, such as trouble ticketing and alarm correlation, most of the work has been done through the use of expert systems alone [12, 13], without the use of neural networks. Furthermore, in these sources, we have not seen the use of SNMP statistics for the fault management problem.

The expert system approach to diagnosis is intuitively attractive, as symptoms can be linked to causes explicitly, in a rule-based knowledge representation scheme. The limitations of rule-based expert systems are revealed when they are confronted with novel fault situations for which no specific rules exist. Novel faults, for which the neural network has not been trained, or for which no output neuron has been assigned, are generalized and matched to the closest fault scenario for which the network has been trained. Each approach contains its own strengths and weaknesses. In order to take advantage of the strengths of each technique, as well as to avoid the weaknesses of either, we used an integrated neural network/expert system diagnostic strategy [2, 3].

Dynamic fault management is a critical element of network management. It is even more difficult in military networks because, in addition to hard faults and soft faults (caused by performance degradation), we also have faults caused by the varying situation and scenario of the battle. Future military communication networks will have a mixture of backbone terrestrial, satellite, and wireless terrestrial networks. The speeds of these networks vary and they are very heterogeneous. As networks become faster, it is not enough to do reactive fault management. Our approach combines proactive and reactive fault management. Proactive fault management is implemented by dynamic and adaptive routing. Reactive fault management is implemented by a combination of a neural network and an expert system.

* This work was supported in part by the U.S. Department of the Army, Army Research Laboratory under Cooperative Agreement DAAL01-96-2-0002, Federated Laboratory ATIRP Consortium, in part by the Center for Satellite and Hybrid Communication Networks under NASA cooperative agreement NCC3-528, and in part by a grant from Space Systems Loral.

Figure 1: The overall system. (Alarms and SNMP information — traps and X.25 statistics — together with the output of the neural network (fault type) feed the expert system, which reports the fault location and possible cause; performance data from the X.25 network is stored in a relational database and feeds the neural network.)

In the work reported here we concentrate on fault management at the application level. Each application generates packets using a Markov Modulated Poisson Process (MMPP). We assume two packet priorities, and non-preemptive queue management. The system has been developed for the X.25 protocol. The dynamic routing is based on dynamically adjustable link costs, driven by utilization, to induce correction via rerouting based on minimum cost.
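As a rough illustration of the MMPP traffic model mentioned above, the following sketch generates arrival times from a two-state modulated Poisson source: a background Markov chain alternates between a quiet state and a burst state, and packets arrive at a rate determined by the current state. All numeric values and names here are illustrative assumptions; the report does not specify its MMPP parameters.

```python
import random

def mmpp_arrivals(t_end, rates=(0.5, 5.0), leave=(0.1, 0.3), seed=0):
    """Return packet arrival times in [0, t_end).

    rates[k] -- Poisson arrival rate while the modulating chain is in state k
    leave[k] -- rate of leaving state k (exponential sojourn times)
    """
    rng = random.Random(seed)
    t, state, arrivals = 0.0, 0, []
    while t < t_end:
        # How long the modulating chain stays in the current state.
        segment_end = min(t + rng.expovariate(leave[state]), t_end)
        # Poisson arrivals at the current state's rate within this segment
        # (resampling after a state change is valid by memorylessness).
        t_next = t + rng.expovariate(rates[state])
        while t_next < segment_end:
            arrivals.append(t_next)
            t_next += rng.expovariate(rates[state])
        t, state = segment_end, 1 - state
    return arrivals
```

In the simulation described here, each such arrival would then be assigned a fixed-size packet and a priority; alternating the rate between states is what produces the bursty traffic pattern.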
In our model the following performance data are collected by the network: blocking of packets, queue sizes, packet throughput from all applications, utilization on links connecting subnetworks, and end-to-end delays experienced by packets. Several fault scenarios were modeled and included in the study: reduced switch capacity, increased packet generation rate of a certain application, a disabled switch in the X.25 cloud, and disabled links. These scenarios are used to train the neural network, so as to predictively recognize the genesis of faults.

We also modeled the occurrence of alarms, including the severity of the problem, the location of the event, and a threshold. Decisions of whether or not to send an alarm are determined by examining data over a user-specified time window. We implemented components of SNMP monitoring based on RFC 1382, including agents and traps [18, 19, 20]. We have completed a small prototype demonstration system which consists of: an OPNET simulation of a network and its faults, a MIB, and a tightly coupled neural network and expert system. We used neural networks based on radial basis functions. To detect and identify faults we use both numerical data associated with the performance objects (attributes) in the MIB as well as SNMP traps (alarms). Performance data from the X.25 network is supplied as input to the neural network, and data concerning SNMP statistics, SNMP traps, and alarms are supplied as input to the expert system. We use both neural networks and expert systems since not all faults can be explained through the use of alarms or SNMP traps alone. An overview of the system is shown in Figure 1.

NETWORK TOPOLOGY AND SIMULATION

The network that we have simulated in OPNET is based on the X.25 protocol. Each user corresponds to a Data Terminal Equipment (DTE), connected to a Data Communications Equipment (DCE). Thus, having 10 users implies having 10 DTE/DCE pairs, where each DTE can have several logical channels. Each DTE can handle 2 applications, thus making it possible to run up to 20 applications at a time. There are both permanent virtual circuits (PVCs) and virtual calls. Four PVCs have been predefined. The PVCs are DTE-to-DTE connections. In addition to the DTEs and DCEs pertaining to the X.25 model, there is also an SNMP manager.

In our simulation, the X.25 cloud consists of 15 nodes used to transmit the packets in a store-and-forward manner. These 15 nodes are grouped into 3 subnetworks, where each subnetwork consists of 5 nodes. The division of the X.25 cloud into subnetworks should be done so that a single neural network can be appropriately assigned to monitor each subnetwork. A typical subnetwork is shown in Figure 2.

We have incorporated the following assumptions in the simulation model. Each application generates packets using a Markov Modulated Poisson Process (MMPP). The source sends packets whose sizes are fixed. The MMPP source is used in order to simulate a bursty traffic model for data. The amount of data transferred is established by a random number generator. Each packet has a priority of 0 (low) or 1 (high), depending on the user generating the packet. The input and output queues have finite capacity and fixed service rates that are user-specified. The rate of the input queue corresponds to the switch rate, while the rate of the output queue corresponds to the link rate. There is a collection of source/destination DTE pairs.
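The queueing assumptions stated here — two priorities, non-preemptive service, finite capacity with blocked packets counted — can be sketched as follows. The class name, capacity, and counter are illustrative; the report does not give an implementation.

```python
import heapq
from itertools import count

class PriorityQueue:
    """Finite-capacity, two-priority, non-preemptive queue sketch.

    High priority (1) is served before low (0), FIFO within a priority.
    A packet arriving to a full queue is dropped and counted as blocked.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []            # entries: (-priority, arrival_order, packet)
        self._order = count()      # tie-breaker preserving FIFO order
        self.blocked = 0           # packets dropped because the queue was full

    def enqueue(self, packet, priority):
        if len(self._heap) >= self.capacity:
            self.blocked += 1      # finite capacity: drop and count
            return False
        heapq.heappush(self._heap, (-priority, next(self._order), packet))
        return True

    def dequeue(self):
        # Non-preemptive discipline: priorities are consulted only when the
        # server frees up and picks the next packet; service is never cut short.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The blocked-packet counter corresponds directly to the "blocking of packets" statistic collected as performance data in the model.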
The association between DTEs is many-to-many, as in electronic mail, for example. We used queueing with priorities, but non-preemptive. All the packets (arriving from the various nodes via the input links) are inserted into one queue.

We performed simulations with widely different traffic patterns. When running a simulation for T seconds, we vary the traffic in such a way that there are periods when traffic is high and other periods when traffic is light. The performance data being collected consist of statistics about the following parameters: packet drop rates, queue sizes, packet throughput from all the applications, link utilizations, and end-to-end delays experienced by packets. In addition, in the simulation we also have SNMP variables monitored and have implemented traps.

Figure 2: A typical subnetwork. (Five nodes, node1 through node5, plotted on a map of the southwestern United States spanning roughly the Los Angeles-Phoenix region.)

A MINIMUM COST ROUTING ALGORITHM

When a packet is created by an application running on a DTE, it is divided into a number of packets of fixed length (the length can be chosen by the network designer). In our simulation, the length of each packet was 128 bytes, the default value as specified in the X.25 Recommendations. A source-to-destination pair is then assigned to the message. Based on this pair, a route is selected from the routing table and is assigned to the message based on a minimum cost routing algorithm.

We used minimum cost routing based on dynamically changing link costs, in order to implement some proactive fault management. At the start of the simulation, all the links have zero cost assigned to them. As the simulation progresses, this cost is updated periodically by relating the cost of the link to the utilization on the link, using the following cost function:

    c_i = (1 - α) c_{i-1} + α (1/ρ_i - 1)^(-1)

where c_i is the cost of the link at the i-th time instant (when the data is sampled), c_{i-1} is the cost of the link at the previous time instant, α is a weighting factor between 0 and 1 (inclusive), and ρ_i is the utilization on the link at the i-th time instant. The weighting factor α is used in order to take into account the dynamics of the network. The choice of α is left to the network designer. In our simulations, we chose α to be a number greater than 0.5, thus assigning more weight to the current value of the utilization. When the utilization of a link increases, the cost of the link increases also (though not in a linear fashion, since (1/ρ_i - 1)^(-1) = ρ_i/(1 - ρ_i)). As a result, the traffic will be re-routed through links that are relatively underutilized.

In addition to depending on the source and destination addresses, this routing algorithm also depends on the maximum number of hops allowed. This parameter is specified by the user.

FAULT SCENARIOS AND DATA

The modeling of faults is done as follows. We define a normal state in the network, where normal refers to levels of traffic flow that are not unusually low or high, e.g. link utilization between 0.20 and 0.70. Then, a set of fault scenarios are modeled and used to train the neural network system. By training the neural network to understand a normal state of operation, it would then be able to recognize abnormal states also. The fault scenarios that we have simulated are the following:

1. Reducing switch capacity, i.e. dropping the service rate. This would affect dropping of packets and response times for applications.
2. Increasing the (normal) packet generation rate of a certain application (e.g. 3 times the original amount of traffic).
3. Disabling certain switches in the X.25 cloud. This means that the switch is not functional and cannot be used as a hop for a call. Such a fault would cause re-routing of calls via other (working) switches.
4. Disabling certain links.

Alarms

A method for simulating the occurrence of alarms is also incorporated in the simulation. The alarm contains information regarding the severity of the problem, the location of the event (i.e. which node in which subnetwork), and a threshold. The severity levels and alarm codes are: critical (5), major (4), minor (3), warning (2), informational (1), cleared (0). The decision of whether or not to send an alarm is determined by examining the sampled data over a user-specified time window.

SNMP Monitoring

In much of the literature that was reviewed [4, 5, 6, 7, 8], there had been little mention of the use of SNMP variables to perform fault management. In our approach, we log statistics pertaining to SNMP variables based on RFC 1382 (SNMP MIB Extension for the X.25 Packet Layer). A list of variables was extracted from RFC 1382 and logged during the simulation. This subset of variables was chosen from the RFC because they are helpful in identifying faults that could occur in the X.25 simulation. The variables are logged on a per-DTE basis and not on a per-logical-channel basis. This is implemented by assigning IDs to each DTE.

SNMP Traps

In addition, we also have the facility for agents to send traps to a manager when something goes wrong. Here, an agent refers to a node in the X.25 cloud. This manager is designed to manage the switches in the X.25 cloud. It does not receive traps from the DTEs or DCEs in the network. According to RFC 1215 ("A Convention for Defining Traps for use with the SNMP"), there are six basic types of traps, together with a seventh (enterprise-specific) trap. These are: coldStart(0), warmStart(1), linkDown(2), linkUp(3), authenticationFailure(4), egpNeighborLoss(5), enterpriseSpecific(6). In our simulation, we have implemented traps 2, 3, and 6 above.

EXPERT SYSTEMS AND NEURAL NETWORKS

OPNET/NEURONLINE Interface

The data from the X.25 simulation in OPNET is gathered in a flat file and stored in an ORACLE database. The data is then read by G2 and NeurOnLine, where the former is the expert system and the latter is the neural network component. After careful review of the alternatives, we chose radial basis function networks (RBFNs) as the neural network architecture for conducting classification. In implementing our system, we used a combination of both neural networks and expert systems.

Radial Basis Function Networks

Recently, researchers have been using radial basis function networks for handling classification problems [3, 11]. RBFNs are three-layered networks, with an input layer, a hidden layer, and an output layer. Unlike backpropagation networks, RBFNs use Gaussian transfer functions, one per hidden node. The hidden nodes have spherical (or elliptical) regions of significant activation. The finite bounding of the activation regions enables RBFNs to detect novel cases. Another advantage of RBFNs is that they require less time for training (typically an order of magnitude less) compared to backpropagation networks. However, they have a slower run-time execution [11].

The training of RBFNs is done in three stages. In the first stage, the center of each of the radial basis function units is determined using the k-means clustering algorithm. This is an unsupervised technique that places unit centers centrally among clusters of points. In the second stage, the unit widths are determined using the nearest neighbor technique, which ensures the smoothness and continuity of the fitted function. In the final stage, the weights of the second layer of connections are found using linear regression.

Network Monitoring

One of the most crucial elements in performing fault management of networks is speed: for fault detection, fault location, and identification of the type of fault. For managing the X.25 network, we used a hybrid architecture of neural networks and expert systems to perform the fault management functions. Specifically, we used RBFNs to analyze the performance data being generated by OPNET. There is one RBFN for each subnetwork. The size and structure of each subnetwork need not be the same; this is an arbitrary design issue that is left to the network designer. The possible outputs of the neural networks are the different classes of faults that could occur in the X.25 subnetworks. When a fault occurs within a certain subnetwork, the RBFN assigned to monitor that subnetwork will alert the network operator that a fault of a specific class (e.g. disabled node) has occurred. However, this will not inform the operator of the location of the fault. Thus, in the example above, the operator would know that a node in a specific subnetwork was disabled, but he/she would not know which node was disabled. Then, based on the outcome of the neural network, appropriate action is taken by the expert system. The expert system uses information about alarms and SNMP traps, together with the SNMP variables which we chose from RFC 1382, to make its conclusions regarding the possible location and cause of the fault. We implemented special rules to handle disabled nodes, others to handle failed links, and so on.

FIRST LEVEL OF FAULT DETECTION AND DIAGNOSIS: NEURAL NETWORKS

We used one radial basis function network for each subnetwork in the X.25 cloud. In the training phase for a specific neural network we used the performance data obtained directly from the network. This data is then scaled using a data rescaler, which was configured to use zero-mean, unit-variance scaling on the input and no scaling on the output. The scaled data is then used by the trainer to train the RBFN. A fit tester is also available. The criterion chosen for the fit tester is fraction misclassified. Thus, the output of the fit tester is a number between 0 and 1, reflecting how accurately data samples are classified.

The neural network has spherical nodes for its hidden layer. The number of hidden nodes per class was chosen through trial and error, after several training sessions, until the desired performance was achieved. During our experiments, it was found that as the number of hidden nodes increased, the fit tester error decreased (though not linearly), thus implying that there was a better fit of the data by the neural network. However, a higher number of hidden nodes also meant a longer training period. The training of the neural networks is affected by the following factors: the quality of the input data and how well it reflects the conditions of the X.25 network; the number of hidden nodes in the hidden layer of the RBFN; the number of input variables that are supplied to the neural network (we supplied the utilization levels on all the links, the queue sizes, and the measured packet throughput at each node); and the discriminating characteristics of the data for faults occurring simultaneously.

Since a neural network observes patterns and makes inferences based on those patterns, similar patterns for different fault classes would lead to misclassifications. In several experiments, when one node in a subnetwork was blocked, the average queue size of other nodes in the subnetwork increased drastically, beginning at the time of the node blockage. This occurs as a result of re-routing of the X.25 calls. In such cases, there are certain distinct patterns that help the neural network to identify the different cases. However, there are other instances when it is more difficult. For example, a link failure and a node failure both lead to re-routing of traffic. If the samples of training data are small, it is very difficult for the neural network to distinguish between a node failure and a link failure simply by analyzing the re-routing that occurs. Thus, more data is needed to distinguish between the re-routing that occurs in these two cases in order to have a small percentage of misclassification; this was verified by our experiments.

Since there is no fixed method to train neural networks, we arbitrarily selected a few different test cases to develop a better understanding of how the neural networks were trained. The data obtained from the simulations in OPNET was divided so that two-thirds was used for training and one-third for testing. In the first test, we considered 3 classes (normal, disabled node, and excess user traffic). We used 180 samples of data for the normal class and 90 samples for each of the other two classes. In the second test, we repeated the first test but changed the number of samples of training data to 180 samples per class and retrained the RBFN networks. Comparison of the results indicates that with more data points per class, the total number of hidden nodes decreases for a certain range of error values for the fit tester.

In the third test, we considered all five classes of faults and trained the RBFNs with different sample sizes. We first trained the RBFN with 150 samples for the normal class and 80 samples for each of the other fault classes, giving a total of 470 points in the training set. The results for this case are shown in Table 1. By looking at the last two columns in each row, it is observed that the percentage error is higher for those two fault classes. This provided the motivation for the next test.

Table 1: Neural network training chart for the third test (five classes).

    Total Hidden   %Error         %Error          %Error          %Error            %Error
    Nodes          Normal State   Disabled Node   Excess Thrput   Degraded Buffer   Disabled Link
    175            0.13           0.16            0.14            0.19              0.21
    200            0.10           0.13            0.13            0.17              0.18
    210            0.08           0.11            0.12            0.16              0.15
    230            0.07           0.10            0.09            0.14              0.11

In the fourth and final test, we again considered all five classes of faults and trained the RBFNs with different sample sizes. We trained the RBFN with 180 samples each for the normal, disabled node, and excess user traffic classes. For the remaining two fault classes, we used 320 samples for each class, giving a total of 1180 points in the training set. The reason for this is that these two cases do not manifest themselves in an obvious manner through the performance data from the network. When the training for these two cases was performed with 180 samples per class, the percentage of misclassification was very high (approximately 0.40 for each fault class). On the other hand, when we tried using 500 samples for each of these two classes, the RBFN was overtrained and all data points (from the testing data) were classified either as degraded buffer or failed link. Thus, we had to use an intermediate number of points between these two extreme cases of training; the results for this case are shown in Table 2.

Table 2: Neural network training chart for the fourth test (five classes).

    Total Hidden   %Error         %Error          %Error          %Error            %Error
    Nodes          Normal State   Disabled Node   Excess Thrput   Degraded Buffer   Disabled Link
    60             0.08           0.07            0.06            0.07              0.08
    80             0.06           0.05            0.04            0.04              0.07
    100            0.05           0.03            0.03            0.04              0.05
    125            0.03           0.04            0.03            0.05              0.04

After analyzing the behavior of the network under fault conditions in several experiments, it appears that the network topology influences the neural network's ability to discriminate faults. Since all occurrences of a particular fault class are not identical, several different cases need to be presented to the RBFN for the same fault class. Obviously, this corresponds to longer training sessions for the neural networks. Similar observations were also recorded for the other fault classes.

The output of the neural network is used by a classifier to inform the network operator of the current status of the network; the neural network outputs a fault code. If a certain fault code is observed several times (e.g. K times out of M samples), then the expert system is activated to determine further information about the location and cause of the fault, as described in the next section.

SECOND LEVEL OF FAULT DETECTION AND DIAGNOSIS: EXPERT SYSTEMS

The neural network for each subnetwork analyzes the incoming data, and if a state other than a normal one appears to be present, then the expert system makes queries to an ORACLE database to determine further information about the observed fault in the network. Different fault conditions induce different queries, as described below.

Single Faults

To detect a node failure at node i, the algorithm first searches for an SNMP trap. Reception of a trap would solve the problem. If, due to some problem in the network, the trap was not received by the SNMP manager (a feature that exists in our simulation), then we execute a query from the expert system looking for the following condition:

    Σ_j ρ_ij < ε,   summing over all j such that link (i,j) exists

In our implementation, we set ε = 0.01. To confirm the hypothesis, we examine:
1. The x25StatCallTimeouts counter at the DTEs that are the "source" part of the source/destination pairs for the DTEs connected to node i.
2. The x25StatOutCallFailures and x25StatOutCallAttempts counters at the source DTE.

To detect a user connected to node i that is submitting excess traffic to the network, we look for the following condition:

    Σ_j ρ_ij > τ,   summing over all j such that link (i,j) exists

To confirm the hypothesis, we check:
1. x25StatOutDataPackets at the DTEs connected to node i.
2. The measured packet throughput at node i.
3. x25StatInDataPackets at the destination DTE, i.e. node i, obtained by checking the source/destination pairs in the case of PVCs.
4. x25StatInCalls at the destination DTE, obtained by checking the source/destination pairs in the case of PVCs.

To detect a degraded switch, the algorithm first searches for an SNMP trap. Reception of a trap would solve the problem. To confirm the hypothesis, we check the following:
1. Alarms corresponding to high queue sizes and/or blocking of packets.
2. High end-to-end delays experienced by packets.

To detect a link failure on link (i,j), the algorithm first searches for an SNMP trap. Reception of a trap would solve the problem. If the SNMP manager does not receive a trap, then we execute a query from the expert system looking for the following condition: find i and j such that

    ρ_ij = 0 and ρ_ji = 0.

In addition, we check the x25StatRestartTimeouts and x25StatResetTimeouts counters.

Multiple Faults

In the case of multiple faults, we simply need to examine the outputs of the RBFN neural networks and determine which ones do not correspond to normal traffic. By doing so, we eliminate a large number of nodes in the X.25 network and can focus on those subnetworks that are experiencing problems. In the work to date, we did not consider multiple faults occurring simultaneously within the same subnetwork, since the probability of occurrence of such an event is much smaller than the probability of occurrence of multiple faults within different subnetworks. In the event of faults occurring in one subnetwork, with the resulting effects propagating to another subnetwork, the RBFNs of both subnetworks would indicate problem situations, and the results of the queries from both subnetworks would have to be examined.

These strategies were validated with the simulation results from OPNET. One should note that the rules which were constructed for the expert system are driven by the X.25 network architecture. Different architectures would probably use the above rules with some modifications.

REFERENCES

[1] Kornel Terplan. Communication Networks Management. Prentice Hall, 2nd edition, 1992.
[2] W.R. Becraft and P.L. Lee. An integrated neural network/expert system approach for fault diagnosis. Computers Chem. Engng, 17(10):1001-1014, 1993.
[3] James Hendler and Leonard Dickens. Radial basis function networks for classifying process faults. AISB Conference, April 1991.
[4] Joseph Pasquale. Using expert systems to manage distributed computer systems. IEEE Network Magazine, pages 22-28, September 1988.
[5] Sameh Rabie, Andrew Rau-Chaplin, and Taro Shibahara. DAD: A real-time expert system for monitoring of data packet networks. IEEE Network Magazine, pages 29-34, September 1988.
[6] Wei-Dong Zhan, Suchai Thanawastien, and Lois M.K. Delcambre. SimNetMan: An expert system for designing rule-based network management systems. IEEE Network Magazine, pages 35-42, September 1988.
[7] Mark T. Sutter and Paul E. Zeldin. Designing expert systems for real-time diagnosis of self-correcting networks. IEEE Network Magazine, pages 43-51, September 1988.
[8] Robert N. Cronk, Paul H. Callahan, and Lawrence Bernstein. Rule-based expert systems for network management and operations: An introduction. IEEE Network Magazine, pages 7-21, September 1988.
[9] Gabriel Jakobson, Robert Weihmayer, and Mark Weissman. A dedicated expert system shell for telecommunication network alarm correlation. IEEE Network Magazine, pages 277-288, September 1993.
[10] A. Finkel, K.C. Houck, S.B. Calo, and A.T. Bouloutas. An alarm correlation system for heterogeneous networks. IEEE Network Magazine, pages 289-309, September 1993.
[11] James A. Leonard and Mark A. Kramer. Radial basis function networks for classifying process faults. IEEE Control Systems, pages 31-38, April 1991.
[12] Larry L. Ball. Network Management with Smart Systems. McGraw-Hill, 1994.
[13] James Malcolm and Trish Wooding. IKBS in network management. Computer Communications, 13(9):542-546, November 1990.
[14] Marshall T. Rose. How to Manage Your Network Using SNMP: The Networking Management Practicum. PTR Prentice Hall, 1995.
[15] William Stallings. Network Management. IEEE Computer Society Press, 1993.
[16] William Stallings. SNMP, SNMPv2, and CMIP: The Practical Guide to Network-Management Standards. Addison-Wesley, 1993.
[17] John Mueller. The Hands-On Guide to Network Management. Windcrest/McGraw-Hill, 1st edition, 1993.
[18] RFC 1382: SNMP MIB extension for the X.25 packet layer.
[19] RFC 1215: A convention for defining traps for use with the SNMP.
[20] X.25 recommendations.
