Site Reliability Engineer Associate

JPMorgan Chase Bank

7 - 12 years

20 - 25 Lacs

Bengaluru

Posted: 3 months ago

Skills Required

system architecture, service level, networking, circuit breakers, scheduling, troubleshooting, open source, load balancing, monitoring

Work Mode

Work from Office

Job Type

Full Time

Job Description

As a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division, you will join our dynamic team of innovators and technologists. Your mission will be to enhance the reliability and resilience of AI systems that revolutionize how the Bank services and advises clients. You will focus on ensuring the robustness and availability of AI models, deepening client engagements, and promoting process transformation. We seek team members passionate about leveraging advanced reliability engineering practices, AI observability, and incident response strategies to solve complex business challenges through high-quality, cloud-centric software delivery.

Job Responsibilities

Develop and refine Service Level Objectives (including metrics like accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability and latency with development velocity
Design, implement, and continuously improve monitoring systems covering availability, latency, and other salient metrics
Collaborate in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
Champion site reliability culture and practices, providing technical leadership and exerting influence within your team and across teams to foster a culture of reliability and resilience
Develop and manage automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
Develop AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
Lead incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
Build and maintain cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
Engineer for Scale and Security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
Implement Continuous Evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.

Required qualifications, capabilities, and skills

Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
Proficient with continuous integration and continuous delivery tools like Jenkins, GitLab, or Terraform
Proficient with containers and container orchestration (ECS, Kubernetes, Docker)
Experience with troubleshooting common networking technologies and issues
Understand the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
Have proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
Comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
Can effectively bridge the gap between ML engineers and infrastructure teams
Have excellent communication skills

Preferred qualifications, capabilities, and skills

Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
Understand ML model deployment strategies and their reliability implications
Have contributed to open-source infrastructure or ML tooling
Have experience with chaos engineering and systematic resilience testing

JPMorgan Chase Bank

Financial Services

New York
