DevOps Engineer

5 - 9 years

0 Lacs

Posted: 2 days ago | Platform: Shine


Work Mode

On-site

Job Type

Full Time

Job Description

The company is a deep-tech innovator at the intersection of artificial intelligence, machine-learning infrastructure, and edge-to-cloud platforms. Its award-winning solutions enable Fortune 500 enterprises to build, train, and deploy large-scale AI models seamlessly, securely, and at lightning speed. As global demand for generative AI, RAG pipelines, and autonomous agents accelerates, it is scaling its MLOps team to keep customers two steps ahead of the curve.

**Role Overview:**

As part of the MLOps team, you will own the full MLOps stack: designing, building, and hardening GPU-accelerated Kubernetes clusters across on-prem data centers and cloud platforms for model training, fine-tuning, and low-latency inference. You will craft IaC modules and CI/CD pipelines that deliver zero-downtime releases and reproducible experiment tracking. You will also ship production-grade LLM workloads, optimize RAG/agent pipelines, manage model registries, and implement self-healing workflow orchestration. Eliminating bottlenecks is central to the role: profiling CUDA, resolving driver mismatches, and tuning distributed frameworks for multi-node scale-out. Reliability is equally important: you will architect HA data lakes, databases, ingress/egress, and DNS, and ensure end-to-end observability targeting 99.99% uptime. You will also mentor team members, instill a platform-first mindset, codify best practices, and report progress and roadblocks directly to senior leadership.
**Key Responsibilities:**

- Own the full MLOps stack, including designing, building, and hardening GPU-accelerated Kubernetes clusters
- Automate processes by crafting IaC modules and CI/CD pipelines
- Ship production-grade LLM workloads, optimize RAG/agent pipelines, and implement self-healing workflow orchestration
- Eliminate bottlenecks by profiling CUDA, resolving driver mismatches, and tuning distributed frameworks
- Champion reliability by architecting HA data lakes and databases and ensuring end-to-end observability
- Mentor team members, instill a platform-first mindset, and report progress to senior leadership

**Qualifications Required:**

**Must-Have:**

- 5+ years of DevOps/Platform experience with Docker and Kubernetes; expert in Bash/Python/Go scripting
- Hands-on experience building ML infrastructure for distributed GPU training and scalable model serving
- Deep fluency in cloud services, networking, load balancing, RBAC, and Git-based CI/CD
- Proven mastery of IaC and configuration management using Terraform, Pulumi, and Ansible

**Preferred:**

- Production experience with LLM fine-tuning, RAG architectures, or agentic workflows at scale
- Exposure to Kubeflow, Flyte, Prefect, or Ray; track record of setting up observability and data-lake pipelines

This job requires expertise in cloud services, containerization, automation tools, version control, and DevOps.
