This role is for one of our clients.

Industry: Technology, Information and Media
Seniority level: Mid-Senior level
Min Experience: 6 years
Location: Remote (India)
Job Type: Full-time

We are seeking a highly skilled Lead AI Infrastructure Engineer to drive the development and management of our AI and ML infrastructure. This role blends technical leadership with hands-on execution, overseeing the end-to-end ML lifecycle, from model training and deployment to monitoring, optimization, and scaling. You will lead a small team of engineers while ensuring seamless collaboration between research, engineering, and operations teams.

Key Responsibilities

ML Infrastructure & Lifecycle Management
Design, maintain, and optimize scalable infrastructure for ML training, inference, and experimentation.
Ensure model deployment pipelines are reliable, efficient, and cost-effective.
Implement robust monitoring, alerting, and automated rollback mechanisms to maintain system reliability.

Collaboration with Research & Product Teams
Partner with research teams to streamline workflows for training, evaluation, and fine-tuning of models.
Support AI-driven initiatives across product teams by providing reliable infrastructure and operational expertise.

Team Leadership & Mentorship
Lead a small team of ML engineers, providing guidance, mentoring, and technical support.
Balance hands-on engineering work with strategic oversight of infrastructure projects.

Performance & Optimization
Improve model inference latency, throughput, and cost efficiency.
Apply model optimization techniques such as quantization, distillation, and TensorRT integration.

Automation & Best Practices
Develop and enforce CI/CD practices for ML models, including versioning, testing, and deployment.
Establish MLOps standards and operational excellence across teams.

Cloud & Platform Management
Leverage cloud-based ML platforms (AWS SageMaker, GCP Vertex AI, Azure ML) to optimize workflows and costs.
Maintain secure, compliant, and scalable AI environments for both training and inference workloads.

Architecture & Strategy
Contribute to ML architecture design, documentation, and roadmap planning.
Continuously evaluate emerging AI infrastructure technologies to improve efficiency and performance.

Qualifications & Skills
5+ years of hands-on experience in MLOps, ML Engineering, or AI Infrastructure roles.
Strong understanding of ML/DL concepts with applied experience in model training and deployment.
Proficiency with cloud-native ML platforms: AWS SageMaker, GCP Vertex AI, or Azure ML.
Experience with Kubernetes, Docker, MLflow, Kubeflow, or similar orchestration tools.
Familiarity with model optimization techniques: quantization, distillation, TensorRT, FasterTransformer.
Proven ability to lead technical projects and mentor engineers in a fast-paced environment.
Excellent communication and cross-functional collaboration skills.
Ownership-driven mindset and ability to bring clarity to ambiguous technical challenges.

Core Skills
MLOps | ML Infrastructure | Model Deployment | Model Monitoring | CI/CD for ML | Cloud ML Platforms | Kubernetes | Docker | Vertex AI | AWS SageMaker | Kubeflow | MLflow | Model Optimization