
1 Continuous Evaluation Jobs


5.0 - 9.0 years

0 Lacs

Karnataka

On-site

Role Overview:
You have the unique opportunity to work as a Senior AI Reliability Engineer at JPMorgan Chase within the Technology and Operations division. Your mission will involve enhancing the reliability and resilience of AI systems to transform how the Bank services and advises clients. You will play a crucial role in ensuring the robustness and availability of AI models, deepening client engagements, and driving process transformation through advanced reliability engineering practices and cloud-centric software delivery.

Key Responsibilities:
- Develop and refine Service Level Objectives for large language model serving and training systems, balancing availability/latency with development velocity.
- Design, implement, and continuously improve monitoring systems, including availability, latency, and other key metrics.
- Collaborate in designing high-availability language model serving infrastructure for high-traffic workloads.
- Champion site reliability culture and practices, providing technical leadership to foster a culture of reliability and resilience.
- Develop automated failover and recovery systems for model serving deployments across multiple regions and cloud providers.
- Create AI Incident Response playbooks for AI-specific failures and lead incident response for critical AI services.
- Build and maintain cost optimization systems for large-scale AI infrastructure to ensure efficient resource utilization.
- Engineer for Scale and Security, leveraging techniques like load balancing, caching, and optimized GPU scheduling.
- Collaborate with ML engineers to ensure seamless integration and operation of AI infrastructure.
- Implement Continuous Evaluation processes for pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.

Qualifications Required:
- Proficiency in reliability, scalability, performance, security, enterprise system architecture, and other site reliability best practices.
- Knowledge and experience in observability tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, etc.
- Experience with continuous integration and delivery tools like Jenkins, GitLab, or Terraform.
- Familiarity with container and container orchestration tools like ECS, Kubernetes, Docker.
- Understanding of common networking technologies and troubleshooting.
- Experience in operating AI infrastructure, including model serving, batch inference, and training pipelines.
- Ability to implement and maintain SLO/SLA frameworks for business-critical services.
- Proficiency in working with traditional and AI-specific metrics.
- Excellent communication skills.

(Note: The job description did not include any additional details about the company.)

Posted 1 day ago
