NOC Multi-LLM Agent

Project Date: October 2025
Technologies: Python (FastAPI, LangGraph), PostgreSQL, Redis, Weaviate, Gemini APIs, Docker, Grafana MCP
Repository: github.com/sonali-rajput/noc-ai-agent

Summary

An automated network operations centre agent that orchestrates multiple LLMs to triage incidents, correlate telemetry, and notify the right response team in seconds instead of hours.

Key Capabilities

  • Multi-agent orchestration: Coordinates specialised LangGraph agents for intake, enrichment, triage, and remediation planning across every alert.
  • Vector-powered correlation: Uses Weaviate embeddings to match incoming incidents with historical fixes and suppress duplicate pages (see the correlation sketch after this list).
  • Real-time observability: Streams metrics and incident timelines into Grafana MCP dashboards for end-to-end visibility.
  • Automated routing: Applies Gemini policy agents to pick the correct on-call team based on runbooks, severity, and blast radius.
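
A minimal sketch of how the duplicate-suppression step could look, assuming the Weaviate v4 Python client, a locally running instance, and an "Incident" collection with a text vectoriser already configured. The collection and property names are illustrative, not the project's actual schema.

```python
# Sketch: find historical incidents similar to a new alert via Weaviate.
# Assumes a local Weaviate instance and an "Incident" collection whose
# objects were stored with a text vectoriser; names are illustrative.
import weaviate

def find_similar_incidents(alert_summary: str, limit: int = 3):
    client = weaviate.connect_to_local()
    try:
        incidents = client.collections.get("Incident")
        # near_text embeds the new alert summary and returns the closest
        # historical incidents, which the triage agent can treat as
        # candidate duplicates or prior fixes.
        response = incidents.query.near_text(query=alert_summary, limit=limit)
        return [obj.properties for obj in response.objects]
    finally:
        client.close()
```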

Architecture

  • Event ingestion: FastAPI webhook gateway normalises alerts from Prometheus, CloudWatch, and custom sensors (ingestion sketch after this list).
  • State and context: Redis backs short-lived agent memory while PostgreSQL persists incident timelines, actions, and audit trails (state sketch below).
  • AI execution layer: LangGraph orchestrates task-specific LLMs (intake, triage, diagnostics) with Gemini APIs providing long-context reasoning (orchestration sketch below).
  • Insights & visualisation: Grafana MCP renders live incident heatmaps, recovery timers, and post-incident analytics.

Impact

  • Reduced manual alert handling time by over 70% through automated triage and escalation.
  • Improved mean time to acknowledge (MTTA) and resolve (MTTR) by surfacing runbook matches within seconds.
  • Delivered a reusable template for LLM-first NOC automation that integrates cleanly with existing SRE tooling.
