Job Summary
We are seeking a Senior MLOps / AIOps Platform Engineer with deep DevSecOps expertise and hands-on experience managing enterprise-grade AI/ML platforms. This critical role focuses on building, configuring, and operationalizing secure, scalable, and reusable infrastructure and pipelines that support AI and ML initiatives across the enterprise. The ideal candidate will have a strong background in Infrastructure as Code (IaC), pipeline automation, and platform engineering, with specific experience configuring and maintaining IBM watsonx and Google Cloud Vertex AI environments.
Key Responsibilities
Platform Engineering & Operations
- Lead the provisioning, configuration, and ongoing support of IBM watsonx and Google Cloud Vertex AI platforms.
- Ensure platforms are production-ready, secure, cost-efficient, and performant across training, inference, and orchestration workflows.
- Manage lifecycle tasks such as patching, upgrades, integrations, and service reliability.
- Partner with security, compliance, and product teams to align platforms with enterprise and regulatory standards.
Enterprise MLOps / AIOps Enablement
- Define and implement standardized MLOps/AIOps practices across business units for consistency and scalability.
- Build and maintain reusable workflows for model development, deployment, retraining, and monitoring.
- Provide onboarding, enablement, and support to AI/ML teams adopting enterprise platforms and tools.
- Support the development and deployment of GenAI applications and maintain them at enterprise scale.
DevSecOps Integration
- Embed security and compliance guardrails across the ML lifecycle, including CI/CD pipelines and IaC templates.
- Implement policy-as-code, access controls, vulnerability scanning, and automated compliance checks.
- Ensure all deployments meet enterprise and regulatory requirements (HIPAA, SOX, FedRAMP, etc.).
Infrastructure as Code & Automation
- Design and maintain IaC templates (Terraform, Pulumi, Ansible, CloudFormation) for reproducible ML infrastructure.
- Build and optimize CI/CD pipelines for AI/ML assets including data pipelines, training workflows, deployment artifacts, and monitoring systems.
- Enforce best practices around automation, reusability, and observability of infrastructure and workflows.
Monitoring, Logging & Observability
- Implement comprehensive observability for AI/ML workloads using Prometheus, Grafana, Google Cloud Monitoring (formerly Stackdriver), or Datadog.
- Monitor both infrastructure health (CPU, memory, cost) and ML-specific metrics (model drift, data integrity, anomaly detection).
- Define KPIs and usage metrics to measure platform performance, adoption, and operational health.
Qualifications
Education
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
Experience
- 5+ years in MLOps, DevOps, Platform Engineering, or Infrastructure Engineering.
- 2+ years applying DevSecOps practices (secure CI/CD, vulnerability management, policy enforcement).
- Hands-on experience configuring and managing enterprise AI/ML platforms (IBM watsonx, Google Vertex AI).
- Demonstrated success in building and scaling ML infrastructure, automation pipelines, and platform support models.
Technical Skills
- Proficiency with IaC tools (Terraform, Pulumi, Ansible, CloudFormation).
- Strong scripting skills in Python and Bash.
- Deep understanding of containerization and orchestration (Docker, Kubernetes).
- Experience with model lifecycle tools (MLflow, TFX, Vertex Pipelines, or equivalents).
- Familiarity with secrets management, policy-as-code, access control, and monitoring tools.
- Working knowledge of data engineering concepts and their integration into ML pipelines.
Preferred
- Cloud certifications (e.g., GCP Professional ML Engineer, AWS DevOps Engineer, IBM Cloud AI Engineer).
- Experience supporting platforms in regulated industries (HIPAA, FedRAMP, SOX, PCI-DSS).
- Contributions to open-source projects in MLOps, automation, or DevSecOps.
- Familiarity with responsible AI practices including governance, fairness, interpretability, and explainability.
- Hands-on experience with enterprise feature stores, model monitoring frameworks, and fairness toolkits.