Director of Infrastructure, Cloud, and ML Operations

10 years

0 Lacs

Posted: 2 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Position Overview

The Director of Infrastructure, Cloud, and ML Operations is a critical role within Invent Health, responsible for architecting, scaling, securing, and managing the company’s technology platforms. This position blends technical expertise, strategic vision, and people leadership to ensure that all infrastructure and machine learning operations are robust, scalable, and aligned with the company’s growth ambitions.


Key Responsibilities

Infrastructure Strategy & Architecture

·       Develop and execute a forward-looking infrastructure strategy across cloud and network environments, ensuring seamless alignment with business objectives and scalability requirements.

·       Design and maintain modern, scalable, and secure architectures supporting multi-tenancy, high availability, and disaster recovery.

·       Architect, implement, and operate cloud platforms (IaaS, PaaS, SaaS) for high performance, scalability, and security.

Operational Excellence & Reliability

·       Ensure the operational reliability, uptime, and service availability of the SaaS platform, including rapid incident response and robust observability practices.

·       Implement monitoring and incident management best practices, utilizing KPIs to drive continuous improvement.

DevOps & Automation (including Cloud Ops)

·       Advance DevOps practices by driving automation throughout CI/CD pipelines and adopting Infrastructure as Code (IaC) methodologies to accelerate deployments and enhance engineering efficiency.

·       Implement standardized processes for service provisioning, deployment, and maintenance, blending DevOps and SRE principles.

·       Automate cloud provisioning and management using IaC and DevOps methodologies to streamline deployments and reduce manual intervention.


Customer Support & Incident Management

·       Diagnose and resolve customer issues and lead incident management processes.

·       Conduct root-cause analyses and oversee the remediation of critical incidents impacting service availability.

Security & Compliance

·       Enforce security best practices and ensure compliance with relevant industry standards such as SOC 2, HIPAA, and HITRUST.

·       Implement and oversee security controls and compliance policies within cloud environments.

·       Lead disaster recovery and business continuity planning, and develop governance frameworks.

Cross-functional Collaboration

·       Work closely with product, engineering, and executive teams to inform architectural decisions and support key company initiatives.

·       Manage vendor relationships and infrastructure budgets, and communicate platform health and project outcomes effectively to leadership.

Team Leadership & Development

·       Build, mentor, and manage a high-performing engineering team, fostering a culture of innovation, accountability, and operational excellence.

Specialized Operations Area

Machine Learning Operations (ML Ops)

·       Design and maintain ML infrastructure supporting the entire machine learning lifecycle: data ingestion, model training, validation, deployment, and monitoring.

·       Automate ML workflows with CI/CD pipelines tailored to machine learning models for reproducibility and reliability.

·       Monitor deployed models for performance, drift, and data quality, implementing alerting and retraining as required.

·       Ensure security, compliance, and governance of ML data, models, and endpoints in production.

·       Collaborate with data science and engineering teams to ensure smooth handoffs and operational support.

·       Optimize resource allocation for ML workloads, balancing cost, performance, and scalability in cloud-based environments.

Qualifications

·       10+ years’ experience in infrastructure and cloud; 4+ years’ experience in machine learning operations preferred.

·       Proficiency in cloud platforms (AWS preferred), DevOps tools, and Infrastructure as Code.

·       Strong knowledge of modern security and compliance standards for SaaS and ML operations.

·       Demonstrated success in team leadership and cross-functional collaboration.

·       Clear communication skills and the ability to translate technical concepts into business outcomes.
