5.0 - 9.0 years
0 Lacs
Karnataka
On-site
Role Overview:
You have the unique opportunity to work as a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division. Your mission will involve enhancing the reliability and resilience of AI systems to transform how the Bank services and advises clients. You will play a crucial role in ensuring the robustness and availability of AI models, deepening client engagements, and driving process transformation through advanced reliability engineering practices and cloud-centric software delivery.

Key Responsibilities:
- Develop and refine Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
- Design, implement, and continuously improve monitoring systems, including availability, latency, and other key metrics.
- Collaborate in designing high-availability language model serving infrastructure for high-traffic workloads.
- Champion site reliability culture and practices, providing technical leadership to foster a culture of reliability and resilience.
- Develop automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
- Create AI Incident Response playbooks for AI-specific failures and lead incident response for critical AI services.
- Build and maintain cost optimization systems for large-scale AI infrastructure to ensure efficient resource utilization.
- Engineer for scale and security, leveraging techniques like load balancing, caching, and optimized GPU scheduling.
- Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure.
- Implement continuous evaluation processes for pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.

Qualifications Required:
- Proficiency in reliability, scalability, performance, security, enterprise system architecture, and other site reliability best practices.
- Knowledge and experience in observability tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
- Experience with continuous integration and delivery tools like Jenkins, GitLab, or Terraform.
- Familiarity with container and container orchestration tools like ECS, Kubernetes, Docker.
- Understanding of common networking technologies and troubleshooting.
- Experience in operating AI infrastructure, including model serving, batch inference, and training pipelines.
- Ability to implement and maintain SLO/SLA frameworks for business-critical services.
- Proficiency in working with traditional and AI-specific metrics.
- Excellent communication skills.
Posted 1 day ago