Senior Site Reliability Engineer, AI Infrastructure

12 - 16 years

0 Lacs

Posted: 1 day ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a member of the NVIDIA team, you will join a group of innovative and hardworking individuals at the forefront of technology. NVIDIA has a rich history of transforming computer graphics, PC gaming, and accelerated computing over the past 30 years, and is now leveraging the power of AI to shape the future of computing. If you are creative, autonomous, and driven to make a difference, we invite you to join our AI Infrastructure Production engineering team and contribute to groundbreaking advancements in the field.

Role Overview:
- Develop and maintain large-scale systems that support critical AI Infrastructure use cases, ensuring reliability, operability, and scalability in global public and private clouds.
- Implement Site Reliability Engineering (SRE) fundamentals, such as incident management, monitoring, and performance optimization, while creating automation tools to streamline manual processes.
- Create tools and frameworks to enhance observability, define reliability metrics, and facilitate quick problem resolution to continuously improve system performance.
- Establish operational frameworks, lead incident response procedures, and conduct postmortems to enhance team efficiency and system resilience.
- Collaborate with engineering teams to deliver innovative solutions, mentor peers, maintain high coding and infrastructure standards, and participate in team-building activities.

Qualifications Required:
- Bachelor's degree in Computer Science or a related field, or equivalent experience, with a minimum of 12 years in Software Development, SRE, or Production Engineering.
- Proficiency in Python and at least one other language (e.g., C/C++, Go, Perl, Ruby).
- Expertise in systems engineering on Linux or Windows environments and various cloud platforms (AWS, OCI, Azure, GCP).
- Strong grasp of SRE principles, including error budgets, SLOs, and SLAs, and of Infrastructure as Code tools (e.g., Terraform CDK).
- Hands-on experience with observability platforms (ELK, Prometheus, Loki) and CI/CD systems (GitLab).
- Effective communication skills to convey technical concepts to diverse audiences.
- Dedication to promoting diversity, curiosity, and continuous improvement within the team.

Preferred Qualifications:
- Experience in AI training, inferencing, and data infrastructure services.
- Proficiency in deep learning frameworks such as PyTorch, TensorFlow, JAX, and Ray.
- Solid background in hardware health monitoring and system reliability.
- Hands-on expertise in operating and scaling distributed systems with stringent SLAs for high availability and performance.
- Proven track record in incident, change, and problem management processes within sophisticated environments.

If you have the required qualifications and are eager to contribute to the cutting-edge work at NVIDIA, we look forward to welcoming you to our team.
