Staff SRE, Agentic AI

8 - 13 years

25 - 30 Lacs

Posted: Just now | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

About Netskope

Today, there's more data and users outside the enterprise than inside, causing the network perimeter as we know it to dissolve. We realized a new perimeter was needed, one that is built in the cloud and follows and protects data wherever it goes, so we started Netskope to redefine Cloud, Network and Data Security.

About the role:

Please note, this team is hiring across all levels and candidates are individually assessed and appropriately leveled based upon their skills and experience.

As an SRE for MLOps, you will be critical to deploying and managing the cutting-edge infrastructure that AI/ML operations depend on. You will collaborate with AI/ML engineers and researchers to develop a robust CI/CD pipeline that supports safe and reproducible experiments. You will also set up and maintain monitoring, logging, and alerting systems to oversee extensive training runs and client-facing APIs. You will ensure that training environments are available and efficiently managed across multiple clusters, and you will enhance our containerization and orchestration systems with tools like Docker and Kubernetes.

What's in it for you

distributed systems challenges

What you will be doing


  • Work closely with AI/ML engineers and researchers on the design and architecture of AI/ML applications for scale and reliability. Design and deploy a CI/CD pipeline that ensures safe and reproducible experiments.

  • Participate in production troubleshooting of AI/ML application code as well as infrastructure configurations.

  • Set up and manage monitoring, logging, and alerting systems for extensive training runs and client-facing APIs.

  • Ensure training environments are consistently available and prepared across multiple clusters.

  • Develop and manage containerization and orchestration systems utilizing tools such as Docker and Kubernetes.

  • Operate and oversee large Kubernetes clusters with GPU workloads.

  • Improve the reliability, quality, and time-to-market of our suite of software solutions.

  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.

  • Provide primary operational support and engineering for multiple large-scale distributed software applications.





Required skills and experience


  • 8+ years of professional experience building core infrastructure systems.

  • Hands-on experience with core model training principles and major frameworks like PyTorch and Hugging Face Transformers.

  • Familiarity with LLM development, deployment, and optimization techniques (e.g., TensorRT).

  • Familiarity with high-performance, large-scale ML systems and their unique infrastructure needs.

  • Experience with major cloud providers (Google Cloud, AWS, or Azure).

  • Proficiency with Infrastructure as Code (IaC) tools like Terraform.

  • Strong scripting skills using languages like Python or Bash, and experience with Git and GitHub workflows.

  • Expert experience operating orchestration systems such as Kubernetes at scale.

  • Experience setting up and using monitoring tools such as Prometheus, Grafana, or similar for comprehensive tracing and monitoring.

  • Proven track record of building and operating scalable, reliable, and secure systems.

  • A natural knack for troubleshooting complex systems and solving challenging technical problems.

  • Proactive attitude in identifying problems, performance bottlenecks, and areas for improvement.

  • Comfortable working with ambiguity and rapid change in a dynamic environment.

Education


  • BSCS or equivalent required, MSCS or equivalent strongly preferred


Netskope is committed to implementing equal employment opportunities for all employees and applicants for employment. Netskope does not discriminate in employment opportunities or practices based on religion, race, color, sex, marital or veteran status, age, national origin, ancestry, physical or mental disability, medical condition, sexual orientation, gender identity/expression, genetic information, pregnancy (including childbirth, lactation and related medical conditions), or any other characteristic protected by the laws or regulations of any jurisdiction in which we operate.

Netskope respects your privacy and is committed to protecting the personal information you share with us. Please refer to Netskope's Privacy Policy for more details.

NetSkope Software

Cloud Security

San Francisco
