Senior Site Reliability Engineer, ML Platforms

6 - 10 years

0 Lacs

Posted: 19 hours ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a Senior Site Reliability Engineer (SRE) for the Data Science & ML Platform(s) team at NVIDIA, you will play a crucial role in designing, building, and maintaining services that support advanced data science and machine learning applications. Your responsibilities will involve implementing software and systems engineering practices to ensure high efficiency and availability of the platform, applying SRE principles to improve production systems, and collaborating with customers to implement changes while monitoring capacity, latency, and performance.

Key Responsibilities:

- Develop software solutions to ensure reliability and operability of large-scale systems supporting machine-critical use cases.
- Gain a deep understanding of system operations, scalability, interactions, and failures to identify improvement opportunities and risks.
- Create tools and automation to reduce operational overhead and eliminate manual tasks.
- Establish frameworks, processes, and standard methodologies to enhance operational maturity and team efficiency and to accelerate innovation.
- Define meaningful and actionable reliability metrics to track and improve system and service reliability.
- Oversee capacity and performance management to facilitate infrastructure scaling across public and private clouds globally.
- Build tools to improve service observability for faster issue resolution.
- Practice sustainable incident response and blameless postmortems.

Qualifications Required:

- Minimum of 6 years of experience in SRE, Cloud platforms, or DevOps with large-scale microservices in production environments.
- Master's or Bachelor's degree in Computer Science, Electrical Engineering, or equivalent experience.
- Strong understanding of SRE principles, including error budgets, SLOs, and SLAs.
- Proficiency in incident, change, and problem management processes.
- Skilled in problem-solving, root cause analysis, and optimization.
- Experience with streaming data infrastructure services, such as Kafka and Spark.
- Expertise in building and operating large-scale observability platforms for monitoring and logging (e.g., ELK, Prometheus).
- Proficiency in programming languages such as Python, Go, Perl, or Ruby.
- Hands-on experience with scaling distributed systems in public, private, or hybrid cloud environments.
- Experience in deploying, supporting, and supervising services, platforms, and application stacks.

Additional Company Details:

NVIDIA leads the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. The GPU, their invention, serves as the visual cortex of modern computers and is at the heart of their products and services. NVIDIA's work opens up new universes to explore, enables amazing creativity and discovery, and powers what were once science fiction inventions, from artificial intelligence to autonomous cars. NVIDIA is looking for exceptional individuals to help accelerate the next wave of artificial intelligence.
