LLM Reliability & Evaluation Engineer

Experience: 6 years


Posted: 3 weeks ago | Platform: LinkedIn


Work Mode: On-site

Job Type: Full Time

Job Description

About XenonStack

XenonStack is the fastest-growing Data and AI Foundry for Agentic Systems, enabling enterprises to gain real-time and intelligent business insights.

We Deliver Innovation Through

  • Agentic Systems for AI Agents → akira.ai
  • Vision AI Platform → xenonstack.ai
  • Inference AI Infrastructure for Agentic Systems → nexastack.ai
Our mission is to accelerate the world’s transition to AI + Human Intelligence by making AI agents reliable, explainable, and enterprise-ready.

THE OPPORTUNITY

We are seeking an LLM Reliability & Evaluation Engineer to ensure that large language models (LLMs) and agentic AI systems meet enterprise-grade standards of accuracy, safety, and trustworthiness. This role focuses on evaluating, benchmarking, and stress-testing LLMs in real-world workflows, and on building frameworks for reliability, robustness, and continuous improvement. If you thrive at the intersection of AI research, applied testing, and responsible deployment, this is the role for you.

Key Responsibilities

  • Evaluation Frameworks
    • Design and implement LLM evaluation pipelines covering accuracy, robustness, safety, and bias.
    • Develop automated systems for benchmarking models on enterprise-relevant tasks.
  • Reliability Engineering
    • Conduct stress tests, adversarial testing, and edge-case evaluations.
    • Build tools to measure latency, consistency, and error recovery in multi-turn interactions.
  • Metrics & Monitoring
    • Define KPIs such as factual accuracy, hallucination rate, toxicity, and compliance alignment.
    • Establish real-time monitoring for drift, anomalies, and performance regressions.
  • Collaboration & Alignment
    • Partner with ML engineers, product managers, and domain experts to align evaluation with business objectives.
    • Work with Responsible AI teams to implement ethical, explainable, and compliant evaluation practices.
  • Continuous Improvement
    • Feed insights from evaluation into fine-tuning, RLHF/RLAIF pipelines, and model selection.
    • Maintain a central repository of test cases, benchmarks, and evaluation results.
  • Research & Innovation
    • Stay current with state-of-the-art LLM evaluation techniques, from academic benchmarks to applied enterprise metrics.
    • Explore automated evaluation using agentic test harnesses and synthetic data generation.
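
To give a flavour of the day-to-day work, here is a minimal, illustrative sketch of an evaluation harness that scores a model on two of the KPIs named above (accuracy and hallucination rate). The `EvalCase` and `evaluate` names are hypothetical, not part of any specific framework used at XenonStack:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    """One benchmark item: a prompt, a reference answer, and terms
    whose presence in the model output counts as a hallucination."""
    prompt: str
    expected: str
    forbidden: List[str] = field(default_factory=list)


def evaluate(cases: List[EvalCase], model: Callable[[str], str]) -> Dict[str, float]:
    """Run every case through the model and report aggregate KPIs."""
    correct = 0
    hallucinated = 0
    for case in cases:
        answer = model(case.prompt)
        # Exact-match accuracy (case- and whitespace-insensitive).
        if answer.strip().lower() == case.expected.strip().lower():
            correct += 1
        # Crude hallucination check: any forbidden term in the output.
        if any(term.lower() in answer.lower() for term in case.forbidden):
            hallucinated += 1
    n = len(cases)
    return {
        "accuracy": correct / n,
        "hallucination_rate": hallucinated / n,
    }
```

In practice, exact match would be replaced by semantic similarity or LLM-as-judge scoring, and the `model` callable would wrap a real inference endpoint; the structure of the pipeline (cases in, per-case checks, aggregate KPIs out) stays the same.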

Skills & Qualifications

Must-Have

  • 3–6 years in AI/ML, NLP, or applied model evaluation.
  • Strong understanding of LLM architectures, prompt engineering, and failure modes.
  • Hands-on experience with evaluation frameworks (eval harnesses, Ragas, OpenAI Evals, DeepEval).
  • Proficiency in Python and libraries like LangChain, LangGraph, LlamaIndex, Hugging Face.
  • Experience with vector databases, RAG pipelines, and knowledge graph integration.
  • Familiarity with bias/fairness testing and Responsible AI frameworks.

Good-to-Have

  • Experience with reinforcement learning (RLHF, RLAIF) and reward modeling.
  • Exposure to agentic evaluation frameworks (multi-agent stress testing, synthetic user simulators).
  • Knowledge of compliance and safety requirements for BFSI, GRC, or SOC use cases.
  • Contributions to open-source evaluation libraries or research papers.

WHY SHOULD YOU JOIN US?

  • Agentic AI Product Company
Ensure reliability in cutting-edge AI platforms that are redefining enterprise adoption.
  • A Fast-Growing Category Leader
Be part of one of the fastest-growing AI Foundries, powering Fortune 500 enterprises with trustworthy AI.
  • Career Mobility & Growth
Grow into roles such as AI Systems Architect, Responsible AI Engineer, or Reliability Engineering Lead.
  • Global Exposure
Work on enterprise-scale evaluation challenges across BFSI, Healthcare, Telecom, and GRC.
  • Create Real Impact
Your evaluations will directly shape production-grade AI agents used in mission-critical systems.
  • Culture of Excellence
Our values of Agency, Taste, Ownership, Mastery, Impatience, and Customer Obsession empower you to innovate fearlessly.
  • Responsible AI First
Join a company that prioritizes trustworthy, explainable, and compliant AI.

XENONSTACK CULTURE – JOIN US & MAKE AN IMPACT!

At XenonStack, we believe in shaping the future of intelligent systems. We foster a culture of cultivation built on bold, human-centric leadership principles, where deep work, simplicity, and adoption define everything we do.

Our Cultural Values

  • Agency – Be self-directed and proactive.
  • Taste – Sweat the details and build with precision.
  • Ownership – Take responsibility for outcomes.
  • Mastery – Commit to continuous learning and growth.
  • Impatience – Move fast and embrace progress.
  • Customer Obsession – Always put the customer first.

Our Product Philosophy

  • Obsessed with Adoption – Making AI accessible, reliable, and enterprise-ready.
  • Obsessed with Simplicity – Turning complex evaluation challenges into seamless, automated frameworks.

Be part of our mission to accelerate the world’s transition to AI + Human Intelligence by making AI agents not just powerful, but trustworthy and reliable.
