Agentic Infrastructure Observability Engineer

6 years

0 Lacs

Posted: 11 hours ago | Platform: LinkedIn

Work Mode: On-site

Job Type: Full Time

Job Description

About XenonStack

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We Deliver Innovation Through

  • Agentic Systems for AI Agents → akira.ai
  • Vision AI Platform → xenonstack.ai
  • Inference AI Infrastructure for Agentic Systems → nexastack.ai

Our mission is to accelerate the world’s transition to AI + Human Intelligence by building platforms that are scalable, reliable, and observable by design.

THE OPPORTUNITY

We are seeking an Agentic Infrastructure Observability Engineer to design and implement end-to-end observability frameworks for AI-native and multi-agent systems. This role sits at the heart of AgentOps and Reliability Engineering, ensuring that agents, pipelines, and infrastructure are monitored, measurable, and continuously optimized.

If you thrive on metrics, monitoring, and making complex systems transparent and reliable, this role offers a chance to define observability for the next generation of enterprise AI.

Key Responsibilities

  • Observability Frameworks
    • Design and implement observability pipelines covering metrics, logs, traces, and cost telemetry for agentic systems.
    • Build dashboards and alerting systems to monitor reliability, performance, and drift in real time.
  • Agentic AI Monitoring
    • Track LLM usage, context windows, token allocation, and multi-agent interactions.
    • Build monitoring hooks into LangChain, LangGraph, MCP, and RAG pipelines.
  • Reliability & Performance
    • Define and monitor SLOs, SLIs, and SLAs for agentic workflows and inference infrastructure.
    • Conduct root cause analysis of agent failures, latency issues, and cost spikes.
  • Automation & Tooling
    • Integrate observability into CI/CD and AgentOps pipelines.
    • Develop custom plugins/scripts to extend observability for LLMs, agents, and data pipelines.
  • Collaboration & Reporting
    • Work with AgentOps, DevOps, and Data Engineering teams to ensure system-wide observability.
    • Provide executive-level reporting on reliability, efficiency, and adoption metrics.
  • Continuous Improvement
    • Implement feedback loops to improve agent performance and reduce downtime.
    • Stay updated with state-of-the-art observability and AI monitoring frameworks.

Skills & Qualifications

Must-Have

  • 3–6 years of experience in SRE, DevOps, or Observability Engineering.
  • Strong knowledge of observability tools (Prometheus, Grafana, ELK, OpenTelemetry, Jaeger).
  • Experience with cloud-native infrastructure (AWS, GCP, Azure) and Kubernetes monitoring.
  • Proficiency in Python, Go, or Bash for scripting and automation.
  • Understanding of AI/LLM pipelines, RAG systems, and vector databases.
  • Hands-on experience with CI/CD pipelines and monitoring-as-code.

Good-to-Have

  • Experience with AgentOps tools (LangSmith, PromptLayer, Arize AI, Weights & Biases).
  • Exposure to AI-specific observability (token usage, model latency, hallucination tracking).
  • Knowledge of Responsible AI monitoring frameworks.
  • Background in BFSI, GRC, SOC, or other regulated industries.

WHY SHOULD YOU JOIN US?

  • Agentic AI Product Company – Build observability frameworks for next-gen enterprise AI systems.
  • A Fast-Growing Category Leader – Be part of one of the fastest-growing AI Foundries, powering mission-critical agent deployments.
  • Career Mobility & Growth – Advance into roles like Reliability Architect, AgentOps Lead, or Head of Observability.
  • Global Exposure – Work on observability challenges across Fortune 500 enterprises and global innovators.
  • Create Real Impact – Ensure transparency, trust, and resilience in production-grade AI systems.
  • Culture of Excellence – Our values (Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession) give you autonomy to innovate and accountability to deliver.
  • Responsible AI First – Help enterprises adopt AI that is not just powerful, but explainable and auditable.

XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values

  • Agency – Be self-directed and proactive.
  • Taste – Sweat the details and build with precision.
  • Ownership – Take responsibility for outcomes.
  • Mastery – Commit to continuous learning and growth.
  • Impatience – Move fast and embrace progress.
  • Customer Obsession – Always put the customer first.

Our Product Philosophy

  • Obsessed with Adoption – Making observability and trust an integral part of enterprise AI.
  • Obsessed with Simplicity – Turning complex monitoring into seamless, actionable insights.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making agentic AI systems transparent, observable, and reliable at scale.
