Senior DevOps Engineer (Kubernetes & AI Infra)

Experience

8 - 12 years

Salary

12 - 16 Lacs

Posted: 2 weeks ago | Platform: Naukri


Work Mode

Work from Office

Job Type

Full Time

Job Description

Job Summary:

We're looking for an experienced Senior DevOps Engineer who loves working with Kubernetes and AI-driven applications. In this role, you'll be responsible for designing, implementing, and maintaining scalable cloud infrastructure while supporting MLOps pipelines for AI workloads.

What You'll Be Doing:


  1. Building Scalable Infrastructure: You'll design, implement, and maintain cloud infrastructure using Kubernetes to handle AI and non-AI workloads efficiently.

  2. Developing CI/CD & MLOps Pipelines: Help us automate AI/ML workflows using tools like Kubeflow, MLflow, or Argo Workflows, ensuring seamless deployment and monitoring of AI models.

  3. Optimizing AI Model Deployments: Work with ML engineers to tune LLMs, AI-driven applications, and their containerized environments for smooth operation.

  4. Monitoring & Performance Tuning: Keep an eye on Kubernetes clusters and AI workloads, using tools like Prometheus, Grafana, and Loki to ensure high availability and performance (see the monitoring sketch after this list).

  5. Automating Everything: Whether it's infrastructure provisioning (Terraform, Helm) or Kubernetes security best practices, you'll help drive efficiency and compliance.

  6. Staying Ahead of the Curve: Keeping up with emerging tools and best practices across Kubernetes, cloud-native infrastructure, and MLOps.
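
For a flavour of the monitoring work in point 4, here is a minimal, purely illustrative Python sketch that queries Prometheus's HTTP API for GPU utilization. The endpoint URL, namespace, and threshold are assumptions for this sketch, and DCGM_FI_DEV_GPU_UTIL is the utilization metric exposed by NVIDIA's DCGM exporter, which may or may not be installed in a given cluster.

    import requests

    # Purely illustrative: the Prometheus URL, namespace, and threshold below
    # are assumptions for this sketch, not details from the posting.
    PROM_URL = "http://prometheus.monitoring.svc.cluster.local:9090/api/v1/query"

    resp = requests.get(
        PROM_URL,
        params={"query": 'avg by (pod) (DCGM_FI_DEV_GPU_UTIL{namespace="ml-serving"})'},
        timeout=10,
    )
    resp.raise_for_status()

    # The query endpoint returns an instant vector: one sample per pod.
    for result in resp.json()["data"]["result"]:
        pod = result["metric"].get("pod", "<unknown>")
        utilization = float(result["value"][1])
        if utilization < 20:
            print(f"{pod}: GPU utilization {utilization:.1f}% - consider right-sizing")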

What We're Looking For:


  1. 8+ years of experience in a DevOps, SRE, or Platform Engineering role, with expertise in Kubernetes and cloud-native DevOps.

  2. Strong knowledge of Kubernetes fundamentals (deployments, services, ingress, storage, GPU scheduling, multi-cluster management).

  3. Proficiency in scripting & automation with Python, Bash, or Go, particularly for AI-related workflows.

  4. Hands-on experience with AWS, Azure, or GCP, especially in Kubernetes-based AI/ML infrastructure (e.g., Amazon SageMaker, GKE with AI, Azure ML).

  5. Hands-on experience with model deployment frameworks (NVIDIA Triton, vLLM, TGI, etc.).

  6. Experience with distributed computing and multi-GPU training on Kubernetes and on-prem GPU clusters.

  7. Experience managing resource allocation and autoscaling for large training/inference workloads (KEDA, HPA, etc.); a minimal autoscaling sketch follows this list.

  8. Experience with CI/CD & MLOps tools such as Jenkins, Argo CD, Kubeflow, MLflow, or Tekton.

  9. Familiarity with GenAI model deployment, including fine-tuning, inference optimization, and A/B testing.

  10. Hands-on experience with managed ML services (AWS Bedrock, Vertex AI models, etc.).

  11. Strong problem-solving skills and a mindset of automating repetitive tasks.

  12. Excellent communication skills to collaborate with ML engineers, data scientists, and software teams.
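
As a concrete (and deliberately simplified) illustration of point 7, the sketch below uses the official Kubernetes Python client to create a CPU-based HorizontalPodAutoscaler for an inference Deployment. All names and namespaces are hypothetical; a real setup for AI workloads would more likely scale on GPU or queue-depth metrics via KEDA or custom metrics.

    from kubernetes import client, config

    # Purely illustrative: deployment name, namespace, and thresholds are hypothetical.
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    autoscaling = client.AutoscalingV2Api()

    hpa = client.V2HorizontalPodAutoscaler(
        api_version="autoscaling/v2",
        kind="HorizontalPodAutoscaler",
        metadata=client.V1ObjectMeta(name="inference-hpa", namespace="ml-serving"),
        spec=client.V2HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V2CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name="inference-server"
            ),
            min_replicas=2,
            max_replicas=10,
            metrics=[
                client.V2MetricSpec(
                    type="Resource",
                    resource=client.V2ResourceMetricSource(
                        name="cpu",
                        target=client.V2MetricTarget(type="Utilization", average_utilization=70),
                    ),
                )
            ],
        ),
    )
    autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="ml-serving", body=hpa)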

Bonus Points If You Have:


  1. Experience with LLMOps (Large Language Model Operations) and deploying LLM-based applications at scale.

  2. Knowledge of Vector Databases (FAISS, Weaviate, Qdrant) for AI-driven applications (see the retrieval sketch after this list).
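
As an illustration of the vector-database point above, here is a minimal FAISS sketch showing an exact nearest-neighbour search over toy embeddings; the dimensions and vectors are made up and stand in for embeddings produced by a real retrieval pipeline.

    import numpy as np
    import faiss

    # Purely illustrative: dimensions and vectors are random stand-ins for real
    # embeddings produced by a model-serving pipeline.
    dim = 384
    index = faiss.IndexFlatL2(dim)          # exact L2 nearest-neighbour index

    embeddings = np.random.rand(10_000, dim).astype("float32")
    index.add(embeddings)

    query = np.random.rand(1, dim).astype("float32")
    distances, ids = index.search(query, 5)  # top-5 closest vectors to the query
    print(ids[0], distances[0])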

Navikenz India

Information Technology and Services

Faridabad
