Agentic Infrastructure Observability Engineer

4 - 8 years

0 Lacs

Posted: 1 week ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

You will be working as an Agentic Infrastructure Observability Engineer at XenonStack, a leading Data and AI Foundry for Agentic Systems. Your role will involve designing and implementing end-to-end observability frameworks for AI-native and multi-agent systems. Here are the key responsibilities associated with this role:

- **Observability Frameworks**:
  - Design and implement observability pipelines covering metrics, logs, traces, and cost telemetry for agentic systems.
  - Build dashboards and alerting systems to monitor reliability, performance, and drift in real-time.
- **Agentic AI Monitoring**:
  - Track LLM usage, context windows, token allocation, and multi-agent interactions.
  - Build monitoring hooks into LangChain, LangGraph, MCP, and RAG pipelines.
- **Reliability & Performance**:
  - Define and monitor SLOs, SLIs, and SLAs for agentic workflows and inference infrastructure.
  - Conduct root cause analysis of agent failures, latency issues, and cost spikes.
- **Automation & Tooling**:
  - Integrate observability into CI/CD and AgentOps pipelines.
  - Develop custom plugins/scripts to extend observability for LLMs, agents, and data pipelines.
- **Collaboration & Reporting**:
  - Work with AgentOps, DevOps, and Data Engineering teams to ensure system-wide observability.
  - Provide executive-level reporting on reliability, efficiency, and adoption metrics.
- **Continuous Improvement**:
  - Implement feedback loops to improve agent performance and reduce downtime.
  - Stay updated with state-of-the-art observability and AI monitoring frameworks.

**Skills & Qualifications**:

**Must-Have**:
- 3-6 years of experience in SRE, DevOps, or Observability Engineering.
- Strong knowledge of observability tools such as Prometheus, Grafana, ELK, OpenTelemetry, Jaeger.
- Experience with cloud-native infrastructure (AWS, GCP, Azure) and Kubernetes monitoring.
- Proficiency in scripting and automation using Python, Go, or Bash.
- Understanding of AI/LLM pipelines, RAG systems, and vector databases.
- Hands-on experience with CI/CD pipelines and monitoring-as-code.

**Good-to-Have**:
- Experience with AgentOps tools like LangSmith, PromptLayer, Arize AI, Weights & Biases.
- Exposure to AI-specific observability including token usage, model latency, and hallucination tracking.
- Knowledge of Responsible AI monitoring frameworks.
- Background in BFSI, GRC, SOC, or other regulated industries.

If you are passionate about metrics, monitoring, and making complex systems transparent and reliable, this role offers you the opportunity to define observability for the next generation of enterprise AI at XenonStack.
