ML Platform Engineer

North Hires

Experience: 5 years

Salary: INR 30 - 45 Lacs

Bengaluru

Posted: 4 hours ago | Platform: Glassdoor

Skills Required

ml support ai learning kubernetes escalation training inference deployment stability efficiency engineering model resolve scaling networking scheduling orchestration monitoring automation documentation reliability devops data optimization tooling python debugging aws azure gcp metrics opentelemetry management pytorch tensorflow communication research

Work Mode

On-site

Job Type

Full Time

Job Description

We are looking for a specialist who can provide operational support for AI/ML compute environments. This role centers on overseeing machine-learning workloads executed on Kubernetes and Ray-based ecosystems. The position acts as the initial escalation point for issues involving training pipelines, inference services, deployment workflows, and GPU consumption. A key expectation is maintaining the stability and efficiency of the infrastructure that drives large-scale AI initiatives. The job includes participating in an on-call roster and working closely with engineering teams, researchers, and platform owners to keep the platform dependable and performant.

Key Responsibilities

Act as the first responder for support inquiries related to the AI/ML execution environment, covering model training jobs, inference services, deployment tasks, and GPU usage.
Provide operational coverage—including on-call shifts—for Ray and Kubernetes clusters running distributed ML jobs across both cloud and on-prem setups.
Track system health, investigate incidents, and resolve problems linked to failed jobs, scaling issues, cluster performance degradation, or GPU contention.
Administer GPU allocation policies and enforce usage limits across teams to maintain fair distribution and utilization efficiency.
Support users leveraging Ray Train/Tune for distributed training and Ray Serve for scalable inference, ensuring smooth and predictable performance.
Diagnose Kubernetes-related problems involving pods, networking, images, scheduling, or resource shortages in multi-team environments.
Work with SREs, platform engineers, and ML practitioners to resolve infrastructure, orchestration, and dependency issues affecting ML workloads.
Strengthen monitoring and alerting for Ray/Kubernetes systems using observability tools such as Prometheus, Grafana, and OpenTelemetry.
Update and expand runbooks, automation utilities, and documentation to streamline incident handling and reduce repeated issues.
Contribute to RCAs and post-incident reviews and assist in driving automation and reliability enhancements across the platform.

Required Background

Degree in Computer Science, Engineering, or a related technical field, or equivalent professional experience.
5+ years in ML operations, DevOps, platform support, or similar fields handling distributed AI systems.
Experience offering L1/L2 support and on-call coverage for Ray-based and Kubernetes-based environments running ML workloads.
Deep understanding of Ray operations, including job orchestration, scaling behavior, and scheduling across CPU/GPU infrastructure.
Strong practical knowledge of Kubernetes internals—control plane, data plane, RBAC, namespaces, ingress, and resource segmentation.
Expertise in GPU scheduling and optimization, NVIDIA tooling, and related compute frameworks.
Proficiency in Python or Go for building automation tools and debugging distributed systems.
Familiarity with cloud-native environments on AWS, Azure, or GCP, as well as CI/CD practices.
Hands-on experience with metrics, tracing, and alerting solutions such as Prometheus, Grafana, and OpenTelemetry, plus incident-management platforms like PagerDuty or ServiceNow.
Understanding of ML frameworks (e.g., PyTorch, TensorFlow) and how they behave in Ray/Kubernetes-based distributed setups.
Strong problem-solving and communication skills, with the ability to work effectively across engineering and research groups.
Disciplined operational mindset focused on reliability, performance, and user experience.

Job Types: Full-time, Permanent

Pay: ₹3,000,000.00 - ₹4,500,000.00 per year

Benefits:

Cell phone reimbursement
Food provided
Health insurance
Provident Fund

Work Location: In person
