About XenonStack
XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time, intelligent business insights.
We Deliver Innovation Through
- Agentic Systems for AI Agents → akira.ai
- Vision AI Platform → xenonstack.ai
- Inference AI Infrastructure for Agentic Systems → nexastack.ai
Our mission is to accelerate the world's transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.
THE OPPORTUNITY
We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness. This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, and on building frameworks for reliability, robustness, and continuous improvement. If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.
Key Responsibilities
- Evaluation Frameworks
- Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.
- Develop automated systems for benchmarking models on enterprise-relevant tasks.
- Reliability Engineering
- Conduct stress tests, adversarial testing, and edge-case evaluations.
- Build tools to measure latency, consistency, and error recovery in multi-turn interactions.
- Metrics & Monitoring
- Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.
- Establish real-time monitoring for drift, anomalies, and performance regressions.
- Collaboration & Alignment
- Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.
- Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.
- Continuous Improvement
- Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.
- Maintain a central repository of test cases, benchmarks, and evaluation results.
- Research & Innovation
- Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.
- Explore automated evaluation using agentic test harnesses and synthetic data generation.
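The evaluation-pipeline and metrics responsibilities above can be sketched in a few lines. This is a minimal, illustrative harness only: the task format, the keyword-based hallucination proxy, and all names here are assumptions for the sketch, not XenonStack's actual framework (real pipelines would use tools such as Ragas, OpenAI Evals, or DeepEval).

```python
# Minimal illustrative evaluation harness (hypothetical example).
# It scores a model callable on a small task set for exact-match
# accuracy and a crude "ungrounded answer" rate as a hallucination proxy.

def evaluate(model, tasks):
    """Run each task through `model` and aggregate two toy metrics."""
    exact, hallucinated = 0, 0
    for task in tasks:
        answer = model(task["prompt"])
        # Metric 1: exact-match accuracy against the reference answer.
        if answer.strip().lower() == task["expected"].lower():
            exact += 1
        # Metric 2: hallucination proxy -- the answer uses terms not
        # grounded in the task's allowed vocabulary.
        tokens = set(answer.lower().split())
        if not tokens <= task["allowed_terms"]:
            hallucinated += 1
    n = len(tasks)
    return {"accuracy": exact / n, "hallucination_rate": hallucinated / n}

# Stub standing in for a real LLM call.
def stub_model(prompt):
    return "Paris" if "France" in prompt else "unknown entity"

tasks = [
    {"prompt": "Capital of France?", "expected": "Paris",
     "allowed_terms": {"paris"}},
    {"prompt": "Capital of Atlantis?", "expected": "no such place",
     "allowed_terms": {"no", "such", "place"}},
]

report = evaluate(stub_model, tasks)
```

In production, the same shape generalizes: the task set becomes a versioned benchmark suite, and the per-task checks become model-graded or retrieval-grounded metrics rather than keyword tests.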
Skills & Qualifications
Must-Have
- 3–6 years in AI/ML, NLP, or applied model evaluation.
- Strong understanding of LLM architectures, prompt engineering, and failure modes.
- Hands-on experience with evaluation frameworks and harnesses (Ragas, OpenAI Evals, DeepEval).
- Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, Hugging Face.
- Experience with vector databases, RAG pipelines, and knowledge graph integration.
- Familiarity with bias/fairness testing and Responsible AI frameworks.
Good-to-Have
- Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.
- Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).
- Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.
- Contributions to open-source evaluation libraries or research papers.
WHY SHOULD YOU JOIN US?
- Agentic AI Product Company
Ensure reliability in cutting-edge AI platforms that are redefining enterprise adoption.
- A Fast-Growing Category Leader
Be part of one of the fastest-growing AI Foundries, powering Fortune 500 enterprises with trustworthy AI.
- Grow into roles such as AI Systems Architect, Responsible AI Engineer, or Reliability Engineering Lead.
- Work on enterprise-scale evaluation challenges across BFSI, Healthcare, Telecom, and GRC.
- Your evaluations will directly shape production-grade AI agents used in mission-critical systems.
- Our values of Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession empower you to innovate fearlessly.
- Join a company that prioritizes trustworthy, explainable, and compliant AI.
XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!
At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.
Our Cultural Values
- Agency – Be self-directed and proactive.
- Taste – Sweat the details and build with precision.
- Ownership – Take responsibility for outcomes.
- Mastery – Commit to continuous learning and growth.
- Impatience – Move fast and embrace progress.
- Customer Obsession – Always put the customer first.
Our Product Philosophy
- Obsessed with Adoption – Making AI accessible, reliable, and enterprise-ready.
- Obsessed with Simplicity – Turning complex evaluation challenges into seamless, automated frameworks.
Be part of our mission to accelerate the world's transition to AI + Human Intelligence by making AI agents not just powerful, but trustworthy and reliable.