5 overlooked principles in the race for autonomous networks
Future network management systems must be able to auto-provision, auto-scale and auto-heal. In this article, TM Forum Outstanding Contributor Yuval Stein identifies five overlooked principles required for network automation.
10 Oct 2019
As the telecom industry strives to lower costs and improve performance, tomorrow’s networks will depend heavily upon automating network and management processes. In this article, TEOCO’s Yuval Stein, a TM Forum Outstanding Contributor, identifies five overlooked principles required for network automation, which he has discovered during customer deployments, through his work on the Forum’s Open Digital Architecture project, and as an active participant in several Catalyst proofs of concept.
Future network management systems will undoubtedly need to be able to auto-provision, auto-scale and auto-heal. This will happen through a ‘closed-loop’ process that collects data, identifies problems, recommends or makes decisions, and then takes action.
When translating these needs to a network architecture, there is often a tendency to highlight the role of service assurance in the collection part of the process, while the identification and recommendation stages are either set aside, treated as lower priority or even ignored. It’s important to shine a spotlight on these stages, focusing on functionality that is critical to successful autonomous networks and how it relates to service assurance.
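As a rough illustration of that closed-loop pattern, the sketch below wires the four stages together in Python. The function names and data shapes are hypothetical placeholders, not part of any TM Forum specification; the point is simply that the loop only acts once a problem, rather than a symptom, has been identified.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Problem:
    """An identified, actionable problem (not a raw symptom)."""
    resource_id: str
    root_cause: str

def closed_loop_iteration(
    collect: Callable[[], Iterable[dict]],                    # gather events, KPIs, probe data
    identify: Callable[[Iterable[dict]], Optional[Problem]],  # turn symptoms into a problem
    decide: Callable[[Problem], Optional[str]],               # map a problem to a remediation
    act: Callable[[str], None],                               # execute: scale, heal, open a ticket
) -> None:
    """One pass of a collect -> identify -> decide -> act loop."""
    data = collect()
    problem = identify(data)
    if problem is None:
        return                      # only symptoms observed; do not act yet
    action = decide(problem)
    if action is not None:
        act(action)
```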
Act upon identified problems, not symptoms
Service assurance systems rely upon operational data that comes from several sources, such as:
- Alarm events and logs
- Measurements, operational status values (including telemetry) and usage records
- Active or passive probe data
Data from these sources goes through a collection phase, which means that all the data messages that originate from multiple management sources, including various vendors and technologies, must be unified by their structure (per source type) and by their attributes – for example, unifying the semantics of probable cause values, formulas of measurements, units of measurement and so on.
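To make the unification step concrete, the sketch below maps vendor-specific event attributes onto a single common structure. The field names, mapping tables and unit factors are hypothetical; a real collection layer would be driven by per-source adaptation rules rather than hard-coded dictionaries.

```python
# Hypothetical per-vendor mappings that unify attribute semantics.
PROBABLE_CAUSE_MAP = {
    ("vendor_a", "LOS"): "loss_of_signal",
    ("vendor_b", "SIGNAL LOST"): "loss_of_signal",
}

# Unify units of measurement: everything is stored in bits per second.
UNIT_FACTORS = {"bps": 1, "kbps": 1_000, "Mbps": 1_000_000}

def normalize_event(source: str, raw: dict) -> dict:
    """Map one raw event from a given source onto the unified structure."""
    return {
        "resource": raw["ne_id"],
        "probable_cause": PROBABLE_CAUSE_MAP.get((source, raw["cause"]), "unknown"),
        "traffic_bps": raw["traffic"] * UNIT_FACTORS[raw["unit"]],
        "timestamp": raw["ts"],
    }

print(normalize_event("vendor_b", {"ne_id": "ne-17", "cause": "SIGNAL LOST",
                                   "traffic": 40, "unit": "Mbps", "ts": 1570694400}))
```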
Once collected, identifying network or service problems from all these data sources still requires a significant amount of logic. This is because most of the data just describes symptoms of problems, not the problems themselves.
Looking at each data source, we realize the following:
- Alarm events and logs – this data typically requires more processing to identify real network problems. Service layer problems must be deduced, as they are often not directly reported.
- Measurements, operational status values and usage records – additional processing is required to identify problems, detect anomalies and realize root causes.
- Active or passive probes – accuracy depends on the exact domain and the elements that are being probed. Their proximity to a problem depends on the granularity of the test. When a test is specific, probes may detect a software or hardware component that is not functioning. However, for an end-to-end test, such as a protocol connectivity test, a failure will typically point only to a symptom, because the test cannot identify the underlying cause of the actual problem.
Elevating problems to the level of actionable events
Identifying problems rather than symptoms requires addressing five critical service assurance principles that are often overlooked.
1. Alarms are not the same as alarm events
In today’s network management systems, the notifications coming from the network layer are resource alarm events, only some of which trigger alarms. These may indicate high CPU, low memory, high traffic or too many dropped packets. Alarm events and alarms are commonly confused, but they are not the same thing: alarm events are spontaneous notifications emitted by the network, whereas alarms are the filtered, longer-lived records that fault management derives from those events.
Most modern fault management systems are designed to manage large numbers of alarm events. A common ratio of alarm events to alarms is 10:1, meaning that the number of actual alarms is one tenth of the number of alarm events. This early processing stage of the alarm lifecycle is common. As a filtering process, it provides another layer of cleaning, bringing alarm events from a symptom level closer to identifying the actual problem, and it is applicable to both manual resolution and automated responses. (See graphic below.)
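A minimal sketch of this filtering stage is shown below, assuming a simple de-duplication policy in which repeated alarm events on the same resource and cause update a single stateful alarm rather than raising new ones. Real fault managers apply far richer correlation rules; the class and field names here are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class Alarm:
    resource: str
    cause: str
    count: int = 1          # how many alarm events this alarm summarizes

class AlarmList:
    """De-duplicate spontaneous alarm events into stateful alarms."""

    def __init__(self) -> None:
        self._alarms: dict[tuple[str, str], Alarm] = {}

    def on_alarm_event(self, resource: str, cause: str) -> Alarm:
        key = (resource, cause)
        alarm = self._alarms.get(key)
        if alarm is None:               # the first event raises an alarm
            alarm = Alarm(resource, cause)
            self._alarms[key] = alarm
        else:                           # repeated events only update it
            alarm.count += 1
        return alarm

    def on_clear_event(self, resource: str, cause: str) -> None:
        """A clear event closes the corresponding alarm, if one exists."""
        self._alarms.pop((resource, cause), None)
```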
2. Analytics are mandatory
Even when looking at actual alarms, there are still too many ‘symptoms’ taking the focus away from actionable, identifiable problems. How do we know this? Because some of today’s root-cause analysis software tools can reduce alarm volumes by 50% to 60% or more, which shows that even genuine alarms are not always actionable: the majority still represent symptoms rather than real problems.
Many of the industry’s closed-loop proofs of concept in recent years illustrate that use cases are often built upon reacting to alarms and alarm events. It’s true that in many cases there may be a common underlying reason for symptoms like high CPU or low memory, which are often due to a lack of resources. But if the communications service provider (CSP) creates an automated, closed-loop reaction that always allocates more resources, this can be destructive, especially if the underlying problem is related to an issue with the virtual network function (VNF) or to a major external issue, such as a failure in the provisioning process or an extreme weather event.
Therefore, using analytics and root-cause analysis for self-healing cannot be deferred. CSPs will implement only auto-healing use cases that respond to actionable, well-identified root-cause problems. This requires early deployments of artificial intelligence (AI) and machine learning.
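The danger of reacting to raw symptoms can be made concrete with a small guard. In the sketch below, which uses hypothetical cause labels and action names, the loop only scales out when the analytics stage has attributed the symptom to a genuine resource shortage; anything else is escalated rather than masked by blindly adding capacity.

```python
def choose_remediation(problem: dict) -> str:
    """Pick an automated action only for well-identified root causes.

    `problem` is assumed to carry the output of the analytics /
    root-cause-analysis stage, e.g.
    {"root_cause": "resource_shortage", "resource": "vnf-42"}.
    """
    cause = problem.get("root_cause")
    if cause == "resource_shortage":
        return "scale_out"           # safe: more resources address the cause
    if cause == "faulty_vnf_instance":
        return "restart_vnf"         # heal the instance instead of masking it
    # Provisioning failures, extreme weather, unknown causes:
    # allocating more resources would only hide the symptom.
    return "escalate_to_noc"
```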
The diagram below represents the hierarchy of processing events, from symptoms to identified problems. It shows the nature of each level, and the processing required to take the information to the next level until it eventually filters through the process to become an identified problem that is actionable.
3. Near real-time calculated KPIs are mandatory
Using raw measurement data to create calculated, useful measurements is something practically all network management systems provide. In fact, engineering departments are known for developing their own key performance indicators (KPIs) for planning and optimization, and for creating management summaries. Some KPIs are also necessary when developing operational requirements. For example:
- Time Measurement KPIs – taking cumulative samples of time-window specific measurements can help CSPs calculate the number of dropped sessions between two data points so that unusual trends can be identified. Another example is measuring traffic across short spurts of time (for example, every 2 seconds) and then averaging it across 1-minute and 5-minute time windows, creating average, minimum and maximum values. These time window measurements then become the actionable KPIs for detecting abnormal traffic variances.
- Correlation KPIs – calculating KPIs when there is a dependency among other KPIs can also be helpful. For example, correlating the number of attempts against the number of failures can be used to calculate a ratio or percentage. Often, this percentage will be a much better KPI to act upon than looking only at the number of failures.
- Topology KPIs – there is also value in creating KPIs when additional topology information is required. For example, calculating interface throughput, which depends on the allocated bandwidth, or the survivability of a VNF or network element, which depends on its location in the topology.
In cases like those described above, the required actions are derived from the near-real-time calculated KPIs, not from the raw collected measurements.
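As a minimal sketch of the first two KPI types above, the example below aggregates short traffic samples into time-window KPIs and derives a failure ratio from correlated counters. The sample rate, window size and raw data format are illustrative assumptions.

```python
from statistics import mean

def traffic_window_kpis(samples_bps: list) -> dict:
    """Aggregate short traffic samples (e.g. one every 2 seconds)
    into a time-window KPI with average, minimum and maximum values."""
    return {
        "avg_bps": mean(samples_bps),
        "min_bps": min(samples_bps),
        "max_bps": max(samples_bps),
    }

def failure_ratio_kpi(attempts: int, failures: int) -> float:
    """Correlation KPI: failures as a percentage of attempts."""
    if attempts == 0:
        return 0.0
    return 100.0 * failures / attempts

# A 1-minute window of 2-second samples (30 values).
one_minute = [12e6, 15e6, 11e6] * 10
print(traffic_window_kpis(one_minute))
print(failure_ratio_kpi(attempts=2000, failures=37))   # ~1.85%
```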
4. Near real-time abnormality detection (analytics!) is mandatory
When network or service problems are identified from measurement-based data sources, whether from the network or from probes, the measurements need to pass through an abnormality-detection process. This is to identify exactly what is behaving differently and, more importantly, why it is behaving that way.
Exceptional behavior depends on the ‘regular’ behavior at the network object instance level. What’s exceptional for a certain link may be normal for another. This means that when a CSP needs to manage hundreds or thousands of links or connectivity services, either a rule must be defined for each link or connectivity service instance, which is an impossible burden to maintain, or threshold analytics can be applied to identify exceptional behavior. This technology is available.
Simple instance-based policies will not be enough to identify abnormalities in the network automatically.
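As a sketch of what such threshold analytics replaces, the example below learns a per-link baseline from recent samples and flags values that deviate from it, instead of relying on a hand-maintained static rule per instance. The window size and the three-sigma rule are illustrative choices, not a recommendation.

```python
from collections import defaultdict, deque
from statistics import mean, stdev

class PerInstanceBaseline:
    """Flag measurements that deviate from each link's own recent behavior."""

    def __init__(self, window: int = 288, sigmas: float = 3.0) -> None:
        # One rolling history per link instance, learned from the data itself.
        self._history = defaultdict(lambda: deque(maxlen=window))
        self._sigmas = sigmas

    def is_abnormal(self, link_id: str, value: float) -> bool:
        history = self._history[link_id]
        abnormal = False
        if len(history) >= 30:                  # wait for enough samples
            mu, sd = mean(history), stdev(history)
            abnormal = sd > 0 and abs(value - mu) > self._sigmas * sd
        history.append(value)
        return abnormal
```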
5. The NOC/SOC needs to understand current network and service problems
OSS systems are advancing quickly towards greater automation and cloud-based deployment. Even with this fast progress, however, CSPs will continue to operate network operations centers (NOC) and service operations centers (SOC) for many years, because:
- The move to cloud native is evolutionary; not all networks and IT systems can be replaced and migrated at the same time.
- Some problems still reside at the physical network layer and offer no opportunity for automation. For example, cut cables, cellular towers damaged by bad weather, or broken customer access points will always require field technicians and repair services.
- Some root causes may originate from higher management levels, especially at the business layer, where decisions depend on people and not on machines. In cases like erroneous planning or complex network configurations, human intervention is required.
- Some problems are related to network or service upgrades and maintenance operations. Cases with this kind of external context usually need human oversight.
So long as controllers sit in the NOC/SOC, they will need to understand what’s happening in the network – even in areas where actions are occurring automatically. This means that, at least for the foreseeable future, the NOC will manage network problems with a mixture of manual and automatic resolutions. The needs of the NOC/SOC should continue to be considered – and respected.
NOC/SOC systems need to:
- Show relevant alarms at all layers. As noted, modern management systems are able to filter a large percentage of symptomatic alarms, but existing problems need to be listed.
- Provide root-cause analysis, even if the problem is resolved by a lower-layer management system. The NOC needs a clear, holistic view to understand what was resolved automatically – and what needs manual intervention.
- Enrich alarms with organizational data that assists controllers. Information such as geography, administrative ownership, and relevant network change requests must be associated with alarms (a sketch follows this list).
- Track the source of abnormalities, whether they are alarm-based or measurement-based. NOC controllers should understand the nature of the abnormalities as much as possible, which may require additional alarms, measurements or metadata.
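A small sketch of the enrichment step mentioned in this list is shown below, joining an alarm to hypothetical inventory and change-management lookups. The attribute names are placeholders for whatever organizational data a CSP actually maintains.

```python
# Hypothetical lookups standing in for inventory and change-management systems.
SITE_INFO = {"ne-17": {"region": "north", "owner": "transport-team"}}
OPEN_CHANGE_REQUESTS = {"ne-17": ["CR-1042: capacity upgrade"]}

def enrich_alarm(alarm: dict) -> dict:
    """Attach geography, ownership and related change requests to an alarm."""
    resource = alarm["resource"]
    enriched = dict(alarm)
    enriched.update(SITE_INFO.get(resource, {}))
    enriched["change_requests"] = OPEN_CHANGE_REQUESTS.get(resource, [])
    return enriched

print(enrich_alarm({"resource": "ne-17", "cause": "loss_of_signal"}))
```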
If you’d like to get involved in the important work Yuval and other TM Forum members are doing on network automation, please contact TM Forum’s George Glass.