Principal Platform Engineer

5 - 9 years

0 Lacs

Posted: 6 days ago | Platform: Shine

Skills Required

Work Mode

On-site

Job Type

Full Time

Job Description

As a Senior Platform Engineer specializing in ML infrastructure within the AI/ML infrastructure and deep-tech industry, you will play a crucial role in designing and scaling the foundational systems that drive AI products. If you are passionate about creating robust, efficient, and innovative ML infrastructure, we invite you to join our core infrastructure team.

Your responsibilities will include designing, building, and operating scalable ML and data infrastructure across on-premises and cloud environments such as AWS, Azure, and GCP. You will set up and automate multi-node Kubernetes and GPU clusters and keep them healthy and cost-effective. You will also develop golden-path CI/CD and MLOps pipelines for training, serving, RAG, and agentic workflows using tools like Kubeflow, Flyte, and Ray. Collaborating with ML engineers to troubleshoot challenging CUDA and Kubernetes issues before they impact production systems will be a key part of your role. You will be expected to uphold Infrastructure as Code (IaC) standards with tools like Terraform and Pulumi, as well as configuration-as-code with Ansible. You will also have the opportunity to mentor developers on platform best practices and promote a platform-first mindset within the organization.

To excel in this position, you should have a minimum of 5 years of experience in DevOps, Site Reliability Engineering (SRE), or Platform Engineering, with at least 2 years focused on ML infrastructure at scale. Hands-on experience with Docker, Kubernetes, Helm, and other Kubernetes-native tools is required. Proficiency in managing distributed GPU scheduling, CUDA drivers, and networking is essential. Strong skills in Terraform, Pulumi, Ansible, and scripting languages like Bash and Python are expected. You should also be experienced in operating data lakes, high-availability databases, and object stores.

Familiarity with ML orchestration tools such as Kubeflow, Flyte, and Prefect, and with model registries, is advantageous. Knowledge of RAG, LLM fine-tuning, or agentic frameworks is a significant plus. While not mandatory, experience with Ray, Spark, or Dask is beneficial. Proficiency in security and RBAC design and contributions to open-source projects in the cloud-native/MLOps space are also desirable. Skills required for this role include Data Management, Machine Learning, and Computing.
