The ‘Multi-level multi-agent network fault healing’ Catalyst delivers a hierarchical multi-agent system that uses LLM-driven diagnosis, digital-twin simulation and cross-domain coordination to automate end-to-end network fault detection, analysis and repair.

How multi-agent AI is transforming network fault repair
Commercial context
Large 5G rollouts and the growth of computing-network services have created dense layers of operational systems that generate vast volumes of alarms and events. These environments now demand response times that conventional tools cannot provide. CSPs face surges in work orders triggered by alarm storms, limited coordination between domains, and manual processes that slow fault handling. Diagnosis often relies on isolated tools and human knowledge, with little ability to form a complete cross-domain view. As a result, network operations centers experience long mean time to repair, inconsistent practices, and rising operational costs.
This pressure is particularly visible in transport networks, where faults often involve multiple layers and vendors. A single issue can trigger related alarms across fiber, optical, IP and service layers, yet existing systems treat these fragments separately. CSPs cannot easily connect symptoms to a root cause or confirm the impact on services. Manual effort still dominates routine faults, and repetitive tasks consume valuable specialist time. These challenges impose commercial strain. Slow diagnosis prolongs service interruptions and weakens SLA performance. High OPEX limits the ability to scale new services. As networks evolve toward Level 4 autonomy, these constraints become incompatible with the operational maturity required.
The market context also shows clear demand for more intelligent and automated models. CSPs need systems that integrate network insights, interpret intent, act across domains, and verify change before execution. They require architectures that support collaboration rather than isolated automation. Crucially, they need solutions that can be deployed at scale without heavy customization, and with standardized interfaces that avoid vendor lock-in.
The solution
This is the environment in which the Multi-level multi-agent network fault healing Catalyst was created. The project introduces a hierarchical multi-agent architecture that builds an automated closed loop for network fault healing. It combines large language models, knowledge-graph techniques, multi-agent coordination, and digital-twin simulation to deliver accurate diagnosis and efficient repair. The system is structured around service-layer agents and network-layer agents, each with defined roles that collaborate across every stage of fault handling.
At the network layer, the system begins by aggregating large volumes of alarms using a lightweight, small-model AI algorithm. This reduces alarm noise and identifies fault patterns with over 95% aggregation accuracy. The process constructs a knowledge graph of resources and alarms, enabling spatiotemporal correlation that maps symptoms to likely root events. CSPs can then see fault names, root alarms, derived alarms and corresponding work tickets in a unified view. This step alone reduces diagnostic effort by 15% and helps ensure that no key issue is missed.
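The spatiotemporal correlation described above can be illustrated with a minimal sketch. It is not the Catalyst's algorithm: the `DEPENDS_ON` map, the resource names, the 60-second window and the `correlate` function are all assumptions made for illustration, and a plain dictionary stands in for the knowledge graph. Alarms that share a lowest-layer resource and arrive close together in time are grouped, and the alarm on the lowest layer is treated as the root.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    alarm_id: str
    resource: str
    timestamp: float  # seconds


# Hypothetical dependency map: resource -> the resource it depends on.
# A real system would query a resource and alarm knowledge graph instead.
DEPENDS_ON = {
    "fiber-1": None,
    "otn-port-7": "fiber-1",
    "ip-link-3": "otn-port-7",
    "vpn-svc-42": "ip-link-3",
    "fiber-9": None,
}


def depth(resource):
    """Number of dependency hops between a resource and its lowest layer."""
    hops = 0
    while DEPENDS_ON.get(resource) is not None:
        resource = DEPENDS_ON[resource]
        hops += 1
    return hops


def root_resource(resource):
    """Follow dependencies down to the lowest-layer resource."""
    while DEPENDS_ON.get(resource) is not None:
        resource = DEPENDS_ON[resource]
    return resource


def correlate(alarms, window_s=60.0):
    """Group alarms by shared root resource (space) and arrival window (time)."""
    groups = []
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        key = root_resource(alarm.resource)
        for group in groups:
            if group["key"] == key and alarm.timestamp - group["start"] <= window_s:
                group["alarms"].append(alarm)
                break
        else:
            groups.append({"key": key, "start": alarm.timestamp, "alarms": [alarm]})

    incidents = []
    for group in groups:
        # The root alarm is the one on the lowest layer; ties go to the earliest.
        root = min(group["alarms"], key=lambda a: (depth(a.resource), a.timestamp))
        derived = [a for a in group["alarms"] if a is not root]
        incidents.append({"root": root, "derived": derived})
    return incidents
```

Given alarms on `vpn-svc-42`, `ip-link-3` and `fiber-1` within a few seconds of each other, this sketch returns one incident whose root alarm is the one on `fiber-1`, with the other two listed as derived.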
The next stage uses a diagnosis agent built on a fine-tuned large language model. Trained on more than 100,000 fault corpus samples and 237 detailed fault scenarios, the model generates a chain-of-thought reasoning path matched to the fault type. It then schedules the system's atomic capabilities, step by step, to locate the root cause. Experts can also inject reasoning steps in natural language, strengthening accuracy and extending the model’s reach into emerging or rare scenarios.
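A rough sketch of how a reasoning path might drive atomic capabilities is shown below. Everything in it is hypothetical: the capability functions, the thresholds, the `REASONING_PATHS` table (a hard-coded stand-in for the path the LLM would generate) and the `expert_steps` parameter, which here takes capability names rather than natural language.

```python
from typing import Optional


# Hypothetical atomic capabilities: each inspects the fault context and
# returns a finding, or None if it sees nothing wrong.
def check_optical_power(ctx: dict) -> Optional[str]:
    if ctx.get("rx_power_dbm", 0.0) < -30.0:
        return "fiber break suspected: received optical power too low"
    return None


def check_board_status(ctx: dict) -> Optional[str]:
    if ctx.get("board_state") == "power-off":
        return "board powered off"
    return None


def check_config_drift(ctx: dict) -> Optional[str]:
    if ctx.get("running_config") != ctx.get("intended_config"):
        return "configuration drift"
    return None


CAPABILITIES = {
    "check_optical_power": check_optical_power,
    "check_board_status": check_board_status,
    "check_config_drift": check_config_drift,
}

# Stand-in for the reasoning path an LLM would produce for each fault type.
REASONING_PATHS = {
    "link_down": ["check_optical_power", "check_board_status"],
}


def diagnose(fault_type: str, ctx: dict, expert_steps=()):
    """Run the planned steps in order and stop at the first finding."""
    plan = list(REASONING_PATHS.get(fault_type, [])) + list(expert_steps)
    trace = []
    for step in plan:
        finding = CAPABILITIES[step](ctx)
        trace.append((step, finding))  # keep every step for later review
        if finding is not None:
            return finding, trace
    return None, trace
```

Returning the trace alongside the finding keeps the reasoning inspectable, which matters when experts need to see why a step was taken or add one of their own.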
Once the model identifies the likely cause, the system creates a repair solution and verifies it through a digital twin. The twin offers a high-fidelity simulation of resources, equipment and services, allowing the system to test changes before they reach the live network. This reduces the risk of cascading issues and enables automatic repair for soft faults. CSPs can view simulation results and solution details through a visual interface, making the AI's decisions transparent.
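The verify-before-execute pattern can be sketched as follows. This is a toy illustration, not the Catalyst's twin: the state layout (`links`, `services`), the change format (`set_links`, `reroute`) and all function names are assumptions, and a deep copy of a dictionary stands in for a high-fidelity simulation.

```python
import copy


def services_up(state):
    """A service counts as up when every link on its path is up."""
    return {
        svc: all(state["links"][link] == "up" for link in path)
        for svc, path in state["services"].items()
    }


def apply_change(state, change):
    """Apply link state changes and service reroutes in place."""
    for link, status in change.get("set_links", {}).items():
        state["links"][link] = status
    for svc, path in change.get("reroute", {}).items():
        state["services"][svc] = path


def verify_on_twin(live_state, change):
    """Try the change on a copy and report which services break or recover."""
    twin = copy.deepcopy(live_state)
    before = services_up(twin)
    apply_change(twin, change)
    after = services_up(twin)
    broken = [s for s in after if before[s] and not after[s]]
    restored = [s for s in after if not before[s] and after[s]]
    return {"safe": not broken, "broken": broken, "restored": restored}


def repair(live_state, change):
    """Touch the live state only if the twin run broke nothing."""
    report = verify_on_twin(live_state, change)
    if report["safe"]:
        apply_change(live_state, change)
    return report
```

In this sketch a change that would take down a healthy service is rejected and the live state is left untouched, while the returned report can be surfaced to operators.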
The multi-agent layer coordinates the entire process. Agents collaborate to report faults, exchange diagnosis results, split work when needed, generate repair scripts, and confirm outcomes. The scheduling agent orchestrates cross-domain activity. Sub-agents manage service logic, equipment data, or specific repair tasks.
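As a minimal sketch of this coordination, the following shows a scheduling agent that splits a fault report across domain sub-agents and collects their replies. The `Message` fields, the class names and the placeholder diagnosis are all assumptions for illustration; the source does not describe the actual message format or agent interfaces.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    kind: str  # e.g. "fault_report" or "diagnosis"
    payload: dict


class DomainAgent:
    """Hypothetical sub-agent responsible for one network domain."""

    def __init__(self, domain: str):
        self.domain = domain

    def handle(self, report: Message) -> Message:
        # Placeholder: a real agent would run its own diagnosis here.
        return Message(
            sender=self.domain,
            kind="diagnosis",
            payload={"fault_id": report.payload["fault_id"], "domain": self.domain},
        )


class SchedulingAgent:
    """Routes a fault report to every affected domain that has an agent."""

    def __init__(self, agents):
        self.agents = {agent.domain: agent for agent in agents}
        self.log = []  # every message exchanged, for later review

    def dispatch(self, report: Message):
        self.log.append(report)
        replies = []
        for domain in report.payload["domains"]:
            agent = self.agents.get(domain)
            if agent is None:
                continue  # no sub-agent registered for this domain
            reply = agent.handle(report)
            self.log.append(reply)
            replies.append(reply)
        return replies
```

Keeping all routing in the scheduling agent means sub-agents do not need to know about each other, which is one simple way to add a new domain without changing the existing ones.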
Application
The project shows clear impact. Zhejiang Mobile saves around 6.3 million RMB in annual maintenance costs and 2,250 person-days of work. With nationwide deployment across China Mobile’s provincial networks, annual OPEX savings could reach 180 million RMB. Fiber break location time has dropped from two hours to two minutes. Service restoration time for a batch of 100 services has fallen from two hours to twenty minutes, contributing to an 83% reduction in service interruption duration. The 5G service SLA compliance rate has increased to 99.5%.
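The restoration figures above are consistent with the quoted 83% reduction, as a quick check shows. The helper name `percent_reduction` is illustrative only; the inputs are the before and after times from the text, in minutes.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which a duration shrinks from `before` to `after`."""
    return (before - after) / before * 100.0


# Batch restoration of 100 services: two hours down to twenty minutes.
restoration = percent_reduction(120, 20)

# Fiber break location: two hours down to two minutes.
fiber_location = percent_reduction(120, 2)

print(f"Service restoration time reduced by {restoration:.1f}%")
print(f"Fiber break location time reduced by {fiber_location:.1f}%")
```

The first value rounds to 83%, matching the reported reduction in service interruption duration. The second, which the article does not state as a percentage, works out to roughly 98%.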
The architecture also scales well beyond the transport network. The agent model can extend to wireless backhaul, dedicated enterprise lines and core-network scenarios without retraining the underlying LLM. Prompt engineering and feedback loops allow the system to adapt to new network types with minimal effort. The hierarchical framework supports cross-domain collaboration, enabling operators to evolve towards unified autonomous operations. The model reduces mean time to repair by 40% and delivers 90% automation coverage across the workflow. In China Mobile’s Zhejiang Branch, a fault copilot component further shortens handling times to around forty minutes by assisting field teams and enabling remote collaboration.
The project also realized a cross-domain fault self-healing scenario spanning the wireless and transport networks. Specifically, the OSS (operations support system) receives a wireless cell out-of-service alarm and a transmission equipment board power-off alarm. Through association analysis, an AI agent determines that the root cause of the cell outage is the transmission board power-off fault. The fault self-healing agent then sends a board power-on command, which clears the board alarm and, in turn, the cell out-of-service alarm.
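This scenario can be sketched in a few lines. The `CELL_TO_BOARD` mapping, the alarm dictionaries, the `power_on` command name and the `send_command` callback are all hypothetical; a fixed lookup table stands in for the agent's association analysis, and no real OSS or equipment API is implied.

```python
# Hypothetical topology: the transmission board each wireless cell depends on.
CELL_TO_BOARD = {"cell-101": "board-7"}


def find_root_causes(alarms):
    """Pair each cell outage with the board power-off alarm that explains it."""
    board_alarms = {
        a["resource"]: a for a in alarms if a["type"] == "board_power_off"
    }
    pairs = []
    for alarm in alarms:
        if alarm["type"] != "cell_out_of_service":
            continue
        board = CELL_TO_BOARD.get(alarm["resource"])
        if board in board_alarms:
            pairs.append((board_alarms[board], alarm))
    return pairs


def self_heal(alarms, send_command):
    """Power on faulty boards and return the alarms that remain active."""
    active = list(alarms)
    for root, derived in find_root_causes(alarms):
        # send_command is supplied by the caller and returns True on success.
        if send_command(root["resource"], "power_on"):
            active = [a for a in active if a is not root and a is not derived]
    return active
```

A cell outage with no matching board alarm stays active in this sketch, so it would still be handed to other diagnosis paths rather than silently cleared.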
Wider value
Long-term value includes reduced rollout costs, stronger ecosystem independence, and the potential for new service models. Standardized interfaces help CSPs avoid dependence on single-vendor ecosystems. Digital-twin capability creates a safe environment for change validation. The approach also lays the foundation for 'intelligent O&M as a service,' where CSPs provide autonomous maintenance capabilities to enterprise customers. As networks move toward greater autonomy, this multi-level, multi-agent architecture provides a formula for others to follow.
By integrating structured knowledge, reasoning models, simulation and collaborative agents, the Catalyst demonstrates a credible means to achieve high-level autonomous network operations at scale. With measurable gains in efficiency, resilience and service quality, it gives the industry a strong benchmark for how CSPs can modernize O&M at pace.