Senior Site Reliability Engineer - AI/ML

4 - 11 years

25 - 30 Lacs

Posted: 4 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Position Summary:

The Reliability Engineering Automation team prides itself on keeping Visa systems up and secure, catering to the 24x7 needs of the business. The GenAI Senior Site Reliability Engineer is a highly motivated senior individual contributor based in Bengaluru, India, responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning. The role calls for a senior technologist with a passion for solving problems by developing systems and software that increase site reliability and performance. Site reliability engineering (SRE) fuses the software engineering and operations disciplines in the GenAI ecosystem.

Responsibilities:

- System Reliability: Ensure the uptime, reliability, and scalability of GenAI platforms and services.
- Monitoring & Alerting: Design, implement, and improve monitoring, logging, and alerting for AI workloads and infrastructure.
- Incident Response: Respond to, investigate, and resolve production incidents, ensuring minimal disruption to GenAI services.
- Automation: Develop and maintain automation scripts for deployment, scaling, and recovery of GenAI systems.
- Performance Optimization: Analyze system bottlenecks and optimize resource utilization for AI model training and inference.
- Collaboration: Work closely with ML engineers, data scientists, DevOps, and platform teams to support end-to-end GenAI pipelines.
- Security & Compliance: Implement robust security practices and ensure compliance with relevant data and AI regulations.
- Documentation: Maintain clear documentation for processes, runbooks, and system architecture.

Required Skills:

- Kubernetes & Containers: Proficiency in Kubernetes, Docker, and related tools for orchestration of AI workloads.
- Infrastructure as Code: Skills in Terraform, Ansible, or similar.
- Monitoring & Logging: Familiarity with Prometheus, Grafana, ELK stack, or similar tools.
- Scripting & Programming: Ability to write scripts (Python, Bash, Go, etc.) for automation and tooling.
- CI/CD Pipelines: Knowledge of CI/CD workflows, especially for ML/AI projects.
- AI/ML Workloads: Understanding of the ML model lifecycle, distributed training, and inference serving (e.g., using Ray, Kubeflow, MLflow).
- Troubleshooting: Strong analytical and troubleshooting skills, especially in complex, distributed environments.

This is a hybrid position. The expected number of days in office will be confirmed by your Hiring Manager.

Bachelor's or master's in Computer Science, Engineering, or a related field.
- Professional Experience: 4+ years as an SRE, DevOps Engineer, or similar, preferably supporting AI/ML or large-scale data platforms.
- AI/ML Infrastructure: Han

Visa

IT Services and IT Consulting

Foster City, California