AI Agents for Ops – Automated Incident Response (AIOPS)

In this hands-on course (30% theory, 70% practice), you will learn to apply multi-agent AI systems to automated incident investigation and resolution in IT operations. Sessions run in a prepared Linux lab, with an emphasis on security and cost control.

This course explains agent architecture and the observe/decide/act loop, as well as the subagent pattern for reducing blast radius and cost. You will run agent teams, design MCP safety boundaries, work through layered simulated incidents, and apply production patterns for logging and escalation.
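As a rough illustration of the observe/decide/act loop mentioned above, here is a minimal Python sketch. It is not taken from the course material: the `Agent` class, the `scripted_policy` function (a stand-in for a real model call), and the two tool names are all hypothetical. It shows only the shape of the loop, with a step budget so that a run cannot continue indefinitely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical agent: all names here are illustrative assumptions."""
    tools: Dict[str, Callable[[], str]]      # tool name -> callable returning text
    decide: Callable[[List[str]], str]       # picks the next tool name, or "stop"
    max_steps: int = 5                       # step budget against runaway loops
    history: List[str] = field(default_factory=list)

    def run(self, alert: str) -> List[str]:
        self.history.append(f"alert: {alert}")            # observe the trigger
        for _ in range(self.max_steps):
            action = self.decide(self.history)            # decide
            if action == "stop":
                break
            if action not in self.tools:
                self.history.append(f"escalate: unknown tool {action!r}")
                break
            result = self.tools[action]()                 # act
            self.history.append(f"{action}: {result}")    # observe the result
        else:
            # The loop ran out of steps without the policy choosing to stop.
            self.history.append("escalate: step budget exhausted")
        return self.history


def scripted_policy(history: List[str]) -> str:
    """Stand-in for a model call: follows a fixed plan based on history length."""
    plan = ["check_disk", "check_logs", "stop"]
    return plan[len(history) - 1]


tools = {
    "check_disk": lambda: "/ is 100% full",
    "check_logs": lambda: "mysql: lock wait timeout",
}
report = Agent(tools=tools, decide=scripted_policy).run("HTTP 500")
```

In a real agent the `decide` step would be a model call rather than a fixed plan; the step budget and the escalation path are the parts worth keeping.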

The course:

  • Module 1 — Agent architecture for Ops
    1. Intro: what LLMs do, what chat can and cannot do
    2. What is an agent: observe / decide / act loop; difference from chat (tools + autonomy)
    3. When NOT to use an agent: scriptable tasks, deterministic processes
    4. Subagent pattern: why not one big agent — blast radius, costs, focused context
    5. Three roles: coordinator (strong model), syscheck + logcheck (cheaper models)
    6. Hands-on: build a coordinator and two specialist agents; run on a simulated alert
  • Module 2 — MCP servers as a safety boundary
    1. Problem from Module 1: agents with full shell access are unacceptable in production
    2. What is an MCP server: networked tools, not shell access; like sudo rules, allowed commands only
    3. Two tool types: read-only (investigation) vs write (action)
    4. MCP anatomy: entry point, tool definitions, input/output schema
    5. Hands-on — Method 1: plug an existing syscheck-MCP into a syscheck profile
    6. Hands-on — Method 2: build a logcheck-MCP using a dev-squad (developer + tester + security subagents)
  • Module 3 — Automated investigation
    1. Investigation loop: trigger → investigate → report → (optionally) act
    2. When to stop, when to act, when to escalate
    3. Triggers: polling as a simple start (why start simple, not webhooks)
    4. Live demo: full chain on a prepared incident
    5. Hands-on: connect a trigger, run the full team on a layered, realistic scenario (disk full → MySQL lock → HTTP 500)
    6. Scenario design: layered checks require the coordinator to correlate syscheck + logcheck outputs
  • Module 4 — Production patterns and cost control
    1. Model tiering: coordinator vs workers; real cost numbers for pipeline runs
    2. Failures and mitigations: infinite loops, token burn, hallucinated actions, cascading delegation, stale context
    3. Production checklist: read-only default, human-in-the-loop for write, cost budget, logging, alerting-on-alerting, graceful degradation
    4. When NOT to use agents: deterministic tasks, compliance-critical actions, tasks without an audit trail
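The read-only default and human-in-the-loop items in the outline can be pictured with a small sketch. This is plain Python, not the MCP SDK and not the course's lab code: `Tool`, `ToolGate`, the `approve` callback, and the tool names are hypothetical. The idea it illustrates is an allowlist in which investigation tools run freely while write tools need explicit approval.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Tool:
    """Hypothetical tool description; `writes` defaults to read-only."""
    name: str
    run: Callable[[], str]
    writes: bool = False


class ToolGate:
    """Allowlist of tools; write tools require approval before they run."""

    def __init__(self, tools: Iterable[Tool], approve: Callable[[str], bool]):
        self._tools = {t.name: t for t in tools}
        self._approve = approve              # e.g. asks a human operator

    def call(self, name: str) -> str:
        tool = self._tools.get(name)
        if tool is None:
            # Anything not on the allowlist is rejected outright.
            raise PermissionError(f"tool not allowed: {name}")
        if tool.writes and not self._approve(name):
            raise PermissionError(f"write tool needs approval: {name}")
        return tool.run()


gate = ToolGate(
    tools=[
        Tool("disk_usage", lambda: "/ is 100% full"),
        Tool("restart_mysql", lambda: "mysql restarted", writes=True),
    ],
    approve=lambda name: False,              # assumption: nobody has approved yet
)
status = gate.call("disk_usage")             # read-only, runs without approval
```

With `approve` returning `False`, `gate.call("restart_mysql")` raises `PermissionError`, as does any tool name that is not on the list.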
Assumed knowledge:
Basic Linux server skills (SSH, shell, reading logs) and prior LLM chat experience.
Recommended previous course:
Linux – Basic Administration (LNX1)
Schedule:
1 day (9:00 AM - 5:00 PM)
Course price:
316.00 € (382.36 € incl. 21% VAT)
Language: