Agentic Infrastructure Observability Engineer

3 years

2 - 7 Lacs

Posted: 1 day ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Full Time

Job Description

Job Information

    Date Opened: 08/08/2025
    Job Type: Full time
    Industry: Technology
    Work Experience: 3+ Years
    City: Mohali
    State/Province: Punjab
    Country: India
    Zip/Postal Code: 160071


ABOUT XENONSTACK


XenonStack is the fastest-growing data and AI foundry for agentic systems, enabling people and organizations to gain real-time and intelligent business insights.


  • Agentic Systems for AI Agents: akira.ai

  • Vision AI Platform: xenonstack.ai

  • Inference AI Infrastructure for Agentic Systems: nexastack.ai


THE OPPORTUNITY


We are seeking an Agentic Infrastructure Observability Engineer to design, implement, and maintain visibility, monitoring, and assurance systems for large-scale AI agent deployments.


This role focuses on observability, telemetry, and evaluation pipelines across multi-agent and multi-context workflows, ensuring AI systems are measurable, trustworthy, and compliant in enterprise and regulated environments.


If you’re passionate about SRE principles for AI, LLM evaluation, and agentic system transparency, this role offers the chance to shape observability for the next generation of intelligent automation.


RESPONSIBILITIES


  • Design and Implement Telemetry Pipelines

    Build observability infrastructure to capture logs, metrics, traces, and behavioral data from AI agents, orchestration layers, and integrated tools.

  • Develop Evaluation Dashboards & KPIs

    Track accuracy, latency, reliability, cost, token usage, and success rates for agentic workflows.

  • Enable Full-Stack Tracing

    Build execution flow tracing for multi-agent, multi-tool pipelines, with attribution for each decision, prompt, and retrieval step.

  • Monitor Behavioral Reliability

    Detect and flag hallucinations, decision drift, prompt degradation, or tool misuse in real time.

  • Integrate with Evaluation Frameworks

    Work with LLM eval tools like TruLens, Ragas, Arize AI, and custom scoring systems for continuous quality monitoring.

  • Ensure Compliance & Auditability

    Implement observability features for regulatory audits (e.g., PCI-DSS, GDPR), including secure logging of prompts, retrieved context, and decisions.

  • Provide Cost & Resource Observability

    Track model/API usage, compute cost, and token consumption to enable optimization decisions.

  • Collaborate Across Teams

    Partner with AgentOps Engineers, AI Interaction Engineers, and Model Reliability teams to turn observability insights into operational improvements.


SKILLS & QUALIFICATIONS


Must-Have:


  • 3–5 years in SRE, DevOps, AI infrastructure, or ML systems engineering.


  • Proficiency in Python and observability stacks (Prometheus, OpenTelemetry, Grafana, ELK, etc.).


  • Familiarity with LLM architectures, multi-agent orchestration frameworks (LangGraph, LangChain, AgentBridge), and context pipelines.


  • Experience with logging, tracing, and performance profiling for distributed systems.


  • Understanding of LLM evaluation metrics (factuality, coherence, toxicity, cost efficiency).


  • Knowledge of privacy and compliance standards for AI systems.


Good-to-Have:


  • Hands-on experience with LLM eval tools (TruLens, Ragas, Arize AI, Weights & Biases).


  • Familiarity with RAG, vector databases, and knowledge graph-based retrieval.


  • Experience in regulated industries (BFSI, healthcare, GRC).


  • Background in anomaly detection or behavioral monitoring for ML systems.


CAREER GROWTH & BENEFITS


Continuous Learning & Growth


  • Training and certifications in AI observability, LLM evaluation, and Responsible AI.


  • Hands-on exposure to enterprise-scale agentic infrastructure.


Recognition & Rewards


  • Incentives for innovations in AI observability and monitoring.


  • Fast-track opportunities into AI Reliability Architecture or Model Ops Leadership roles.


Work Benefits & Well-Being


  • Comprehensive medical insurance and project-based allowances.


  • Cab facilities for women employees and special project perks.


XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!


We foster a culture of cultivation with bold, human-centric leadership principles. We value deep work, experimentation, and ownership in every initiative, and we are on a mission to reshape how enterprises adopt AI + Human Intelligence systems.


Product Values:


  • Obsessed with Adoption – Making AI accessible and enterprise-ready.

  • Obsessed with Simplicity – Turning complexity into seamless, intuitive AI experiences.


Be a part of our vision to accelerate the world’s transition to AI + Human Intelligence.
