Autonomous AI Self-Healing Distributed Systems Using Deep Reinforcement Learning (DRL)
Main Article Content
Abstract
The cloud-native and distributed systems of modernity create complex failures that can hardly be detected and recovered manually or through the rule of thumb. The current paper is a proposal of an Autonomous AI Self-Healing Distributed System based on Deep Reinforcement Learning (DRL). The structure integrates real time observability, artificial intelligence fault detection and a DRL based action engine to make autonomous choices and take autonomous action by selecting and executing the recovery actions. Controlled failure injection was used as a quantitative experimentation. The findings indicate that there are a great deal of improvement in Mean Time to Repair (MTTR), increased availability of the system, and there is also a low rate of false positive remediation when using the traditional ones. The results prove that DRL facilitates efficient, persistent, and self-reliant system resilience.