AI Reliability Engineer - DevOps

3 - 5 years

0 Lacs

Posted: 2 weeks ago | Platform: Foundit

Work Mode

On-site

Job Type

Full Time

Job Description

We're looking for an Agentic AI Reliability Engineer who thrives at the intersection of DevOps, support engineering, and AI infrastructure. This role is critical to maintaining the stability, uptime, and seamless performance of our no-code Agentic AI platform, which powers autonomous digital workers across industries. You'll be responsible for building core support and monitoring infrastructure, handling incident response, collaborating with engineering on escalations, and applying site reliability best practices to intelligent, multi-agent systems.

Responsibilities

  • Build and manage support tooling and observability pipelines for AI agent operations.
  • Troubleshoot, investigate, and resolve incidents across multi-agent AI workflows.
  • Collaborate with engineering on complex technical escalations and fixes.
  • Monitor system health using tools like Prometheus, CloudWatch, and custom LLM telemetry.
  • Ensure CI/CD reliability for agent deployment cycles.
  • Maintain clear, proactive communication with customers and internal teams.
  • Create and maintain high-quality documentation and knowledge base articles.
  • Continuously improve incident response playbooks and automation.

Requirements

  • BE/BTech/BS or MS in Computer Science or related field.
  • 3+ years of experience in SRE, DevOps, or technical support engineering.
  • Deep understanding of SDLC, release management, and system reliability.
  • Familiarity with support systems and ticketing workflows.
  • Proficiency in AWS/Azure, CI/CD pipelines (Jenkins), Ansible, and infrastructure-as-code.
  • Experience with observability tools like Datadog, Prometheus, SIP, Homer, and CloudWatch.
  • Strong written and verbal communication skills.
  • Experience supporting GenAI or agentic AI applications in production.
  • Familiarity with LLM orchestration, prompt reliability, or RAG systems.
  • Passion for automation and building resilient AI-powered platform infrastructure.
  • Exposure to managing infrastructure for applications built on LangChain, AutoGen, or similar agent orchestration frameworks.

This job was posted by Jayanth Babu from Avaamo.

Avaamo

Software Development

Los Altos, California
