Senior Site Reliability Engineer - GPU Cloud

5 - 8 years

7 - 10 Lacs

Posted: 4 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

 
The NVIDIA GPU cloud is a hosted platform for internal R&D teams and external AI/ML stack customers. This SRE team is accountable for the setup, management, reliability and availability of this infrastructure, which spans thousands of GPU nodes. As a senior SRE, you are responsible for:
  • Providing scalable and robust service-oriented infrastructure automation, monitoring and analytics solutions for NVIDIA's on-prem and cloud-based GPU infrastructure.
  • Owning the whole life cycle of new tools and services, from requirements gathering and design documentation to validation and deployment.
  • Providing customer support on a rotation basis.
What we need to see:
  • Minimum of 8 years of experience in automating and handling large-scale distributed system software deployments in on-prem/cloud environments.
  • Proficiency in at least one of these languages: Go, Python, Perl, C++, Java or C.
  • Strong command of Terraform, Kubernetes and cloud infrastructure administration.
  • Excellent debugging and troubleshooting skills.
  • Ability to design simple and reliable systems that can work without much support.
  • Outstanding teammate who can collaborate and influence in a multifaceted environment.
  • Excellent interpersonal and written communication skills.
  • M.Sc. or B.E. in Computer Science or a related technical field involving coding (e.g., physics or mathematics).
Ways to stand out from the crowd:
  • Ability to decompose complex requirements into simple tasks and reuse available solutions to implement most of them.
  • Proven record of maintaining platform SLAs through accurate resolutions.
  • Unit testing and benchmarking are an integral part of your code.
  • Ability to reason and choose the best possible algorithm to meet scaling and availability challenges.

Nvidia

Computer Hardware Manufacturing

Santa Clara CA
