Operations & Service Support Lead - Cloud Infrastructure

3 - 8 years

0 Lacs

Posted: 2 days ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As the Operations & Service Support Manager, your primary responsibility is to ensure 24/7 operational excellence and customer satisfaction for cloud infrastructure offerings, including GPU-accelerated compute solutions. You will oversee day-to-day operations, manage support teams (Tier 2/3), and collaborate closely with product and engineering teams to maintain high availability, performance, and robust service for enterprise customers running AI, HPC, or other mission-critical workloads.

**Key Responsibilities:**

- Lead and coordinate daily operations in multi-cloud or hybrid environments (e.g., AWS, Azure, GCP, on-prem HPC).
- Maintain operational dashboards (uptime, ticket volumes, SLAs) and proactively address performance or capacity bottlenecks.
- Ensure adherence to ITIL or other standard frameworks for incident, change, and problem management.
- Manage support tiers (L2, L3) and operations staff (NOC, monitoring specialists) to handle escalations, incident triage, and root cause analysis.
- Set clear KPIs and SOPs for the team, focusing on quick resolution times, high first-contact resolution rates, and continuous improvement.
- Oversee major incidents and ensure timely resolution of critical outages or severe performance degradations, especially in GPU-based clusters.
- Proactively monitor metrics and logs to spot potential issues before they escalate.
- Act as a liaison between support/ops teams and key customers, ensuring visibility into operational performance and planned maintenance windows.
- Manage relationships with external vendors and partners to ensure optimal resource allocation and meet cost targets.
- Implement and enforce security policies for HPC/GPU clusters and cloud environments.

**Qualifications & Skills:**

- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- 8+ years in operations/support management roles, with 3+ years in cloud infrastructure or HPC/AI environments.
- Strong understanding of cloud computing concepts and GPU-accelerated computing.
- Familiarity with infrastructure automation and observability tools.
- Proven track record of implementing ITIL/ITSM frameworks and running 24/7 support teams.
- Excellent people management, coaching, and communication skills.
- Experience analyzing operational metrics and proposing process improvements or automation.
- Preferred certifications: ITIL Foundation/Intermediate, PMP/PRINCE2, cloud certifications (AWS, Azure, GCP), or HPC vendor certifications.
