Operations & Service Support Lead - Cloud Infrastructure

3 - 8 years

0 Lacs

Posted: 1 day ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As the Operations & Service Support Manager, your primary responsibility is to ensure operational excellence and customer satisfaction for cloud infrastructure offerings, including GPU-accelerated compute solutions. You will oversee day-to-day operations, manage support teams (Tier 2-3), and collaborate closely with product and engineering teams to maintain high availability, performance, and robust service for enterprise customers running AI, HPC, or other mission-critical workloads.

**Key Responsibilities:**
- Lead and coordinate daily operations in multi-cloud or hybrid environments (e.g., AWS, Azure, GCP, on-prem HPC).
- Maintain operational dashboards (uptime, ticket volumes, SLAs) and proactively address performance or capacity bottlenecks.
- Ensure adherence to ITIL or other standard frameworks for incident, change, and problem management.
- Manage support tiers (L2, L3) and operations staff (NOC, monitoring specialists) to handle escalations, incident triage, and root cause analysis.
- Set clear KPIs and SOPs for the team, focusing on quick resolution times, high first-contact resolution rates, and continuous improvement.
- Oversee major incidents and ensure timely resolution of critical outages or severe performance degradations, especially in GPU-based clusters.
- Proactively monitor metrics and logs (e.g., GPU utilization, HPC job performance, cost anomalies) to spot potential issues before they escalate.
- Act as a liaison between support/ops teams and key customers, ensuring visibility into operational performance and planned maintenance windows.
- Manage relationships with external vendors and partners (e.g., GPU hardware providers, colocation/DC hosts, cloud service providers).
- Implement and enforce security policies for HPC/GPU clusters and cloud environments.

**Qualifications & Skills:**
- **Education & Experience:**
  - Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  - 8+ years in operations/support management roles, with 3+ years in cloud infrastructure or HPC/AI environments.
- **Technical & Domain Expertise:**
  - Strong understanding of cloud computing concepts and GPU-accelerated computing.
  - Familiarity with infrastructure automation and observability tools.
  - Knowledge of distributed systems, HPC clusters, performance tuning, and relevant DevOps/SRE practices.
- **Operations Management:**
  - Proven track record of implementing ITIL/ITSM frameworks for incident, change, and problem management at scale.
  - Experience running 24/7 support teams, establishing SLAs, and delivering on operational KPIs.
- **Leadership & Communication:**
  - Excellent people management and coaching skills.
  - Strong communication skills to engage stakeholders at all levels.
  - Adept at crisis management and systematic escalation protocols.
- **Analytical & Continuous Improvement:**
  - Experience analyzing operational metrics to identify trends and drive reliability enhancements.
  - Ability to propose and execute process improvements or automation for operational optimization.

**Preferred Certifications:**
- ITIL Foundation/Intermediate, PMP/PRINCE2 for project oversight.
- Exposure to cloud certifications or HPC vendor certifications is beneficial.
