Role Summary
The Director of AI Platform & MLOps is a critical, hands-on technology leader responsible for the architecture, execution, and day-to-day management of the infrastructure powering our AI Training Data Services. You will lead a world-class team of engineers to build, scale, and operate a robust, secure, and highly automated platform. This role requires a deep technical background in cloud infrastructure, MLOps, and large-scale systems, combined with proven experience in leading high-performing engineering teams.
Responsibilities
Technical Execution & Architecture
- Translate the long-term vision into an executable technical roadmap, making key architectural decisions for the platform.
- Lead the hands-on design and implementation of our MLOps/LLMOps framework, including CI/CD for models, data and model versioning, automated workflows, and monitoring.
- Engineer and manage a scalable, multi-cloud infrastructure (AWS/GCP/Azure) using Infrastructure as Code (IaC) principles (Terraform, CloudFormation).
- Oversee the technical integration, scaling, and reliability of data annotation platforms and the gig-worker technology layer.
- Drive SRE best practices to ensure high availability, performance, and security of the entire platform.
Team & Operational Leadership
- Recruit, lead, mentor, and directly manage a team of Cloud, DevOps, MLOps, and Site Reliability Engineers.
- Foster a culture of technical excellence, automation, and accountability within the engineering team.
- Manage day-to-day project timelines, resource allocation, and operational activities to ensure successful platform delivery.
- Own and optimize the cloud budget for your domain using FinOps practices, ensuring cost-effective scaling.
Stakeholder & Client Engagement
- Act as the primary technical expert in project-level discussions with clients, providing detailed solutioning and estimation for new deals.
- Collaborate closely with product management and service delivery teams to ensure the platform meets their requirements.
Skills & Qualifications
- Experience: 12+ years in platform engineering or DevOps, with at least 4 years in a leadership role directly managing engineering teams.
- Cloud Architecture Mastery: Deep, hands-on experience designing and managing complex infrastructure on AWS, GCP, or Azure. Expertise with Kubernetes (EKS, GKE, AKS), serverless, and core cloud services.
- MLOps/LLMOps Expertise: Demonstrable, in-depth implementation experience with the full MLOps lifecycle (e.g., Kubeflow, MLflow, Seldon, DVC, Airflow) and with infrastructure for LLMs (vector databases, fine-tuning environments).
- Infrastructure as Code (IaC): Strong, hands-on proficiency with Terraform or CloudFormation is required.
- Technical Leadership: Proven ability to lead, mentor, and scale a technical team, driving projects from conception to production.
- Problem Solving: Exceptional ability to debug and solve complex technical issues in a distributed systems environment.