ML Platform Engineer

5 years

30 - 45 Lacs

Posted: 4 hours ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Full Time

Job Description

We are looking for a specialist who can provide operational support for AI/ML compute environments. This role centers on overseeing machine-learning workloads executed on Kubernetes and Ray-based ecosystems. The position acts as the initial escalation point for issues involving training pipelines, inference services, deployment workflows, and GPU consumption. A key expectation is maintaining the stability and efficiency of the infrastructure that drives large-scale AI initiatives. The job includes participating in an on-call roster and working closely with engineering teams, researchers, and platform owners to keep the platform dependable and performant.

Key Responsibilities

  • Act as the first responder for support inquiries related to the AI/ML execution environment, covering model training jobs, inference services, deployment tasks, and GPU usage.
  • Provide operational coverage, including on-call shifts, for Ray and Kubernetes clusters running distributed ML jobs across both cloud and on-prem setups.
  • Track system health, investigate incidents, and resolve problems linked to failed jobs, scaling issues, cluster performance degradation, or GPU contention.
  • Administer GPU allocation policies and enforce usage limits across teams to maintain fair distribution and utilization efficiency.
  • Support users leveraging Ray Train/Tune for distributed training and Ray Serve for scalable inference, ensuring smooth and predictable performance.
  • Diagnose Kubernetes-related problems involving pods, networking, images, scheduling, or resource shortages in multi-team environments.
  • Work with SREs, platform engineers, and ML practitioners to resolve infrastructure, orchestration, and dependency issues affecting ML workloads.
  • Strengthen monitoring and alerting for Ray/Kubernetes systems using observability tools such as Prometheus, Grafana, and OpenTelemetry.
  • Update and expand runbooks, automation utilities, and documentation to streamline incident handling and reduce repeated issues.
  • Contribute to RCAs and post-incident reviews and assist in driving automation and reliability enhancements across the platform.

Required Background

  • Degree in Computer Science, Engineering, or a related technical field, or equivalent professional experience.
  • 5+ years in ML operations, DevOps, platform support, or similar fields handling distributed AI systems.
  • Experience offering L1/L2 support and on-call coverage for Ray-based and Kubernetes-based environments running ML workloads.
  • Deep understanding of Ray operations, including job orchestration, scaling behavior, and scheduling across CPU/GPU infrastructure.
  • Strong practical knowledge of Kubernetes internals: control plane, data plane, RBAC, namespaces, ingress, and resource segmentation.
  • Expertise in GPU scheduling and optimization, NVIDIA tooling, and related compute frameworks.
  • Proficiency in Python or Go for building automation tools and debugging distributed systems.
  • Familiarity with cloud-native environments on AWS, Azure, or GCP, as well as CI/CD practices.
  • Hands-on experience with metrics, tracing, and alerting solutions such as Prometheus, Grafana, and OpenTelemetry, plus incident-management platforms like PagerDuty or ServiceNow.
  • Understanding of ML frameworks (e.g., PyTorch, TensorFlow) and how they behave in Ray/Kubernetes-based distributed setups.
  • Strong problem-solving and communication skills, with the ability to work effectively across engineering and research groups.
  • Disciplined operational mindset focused on reliability, performance, and user experience.

Job Types: Full-time, Permanent

Pay: ₹3,000,000.00 - ₹4,500,000.00 per year

Benefits:

  • Cell phone reimbursement
  • Food provided
  • Health insurance
  • Provident Fund

Work Location: In person
