About the Role:
We are looking for a seasoned DevOps/SRE Manager with deep experience on GCP as the primary cloud platform and AWS as the secondary. The ideal candidate will lead a DevOps team managing GCP-based infrastructure and a CloudOps/SRE team ensuring 24x7 uptime for critical services.
This role requires a strong technical background in DevOps & SRE, leadership and team management skills, and the ability to own customer relationships while ensuring seamless cloud operations.
The candidate should have hands-on expertise with Terraform, Kubernetes (GKE), Prometheus, and Grafana, along with working knowledge of AWS. They will play a crucial role in managing customer expectations, ensuring timely project deliveries, and driving operational excellence.
Key Responsibilities:
1. DevOps Management (GCP-Focused Infrastructure)
- Own and oversee DevOps operations in a GCP environment using Terraform, Kubernetes (GKE), Prometheus, and Grafana.
- Ensure timely execution of DevOps tasks while optimizing infrastructure automation.
- Drive CI/CD pipeline enhancements and cloud security best practices.
- Enhance monitoring, logging, and alerting capabilities to improve system reliability.
- Optimize cloud costs, scalability, and security for long-term efficiency.
2. CloudOps / SRE Management (24x7 Support)
- Manage and guide a 24x7 CloudOps/SRE team responsible for uptime and incident response.
- Create and maintain rosters to ensure continuous 24x7 support coverage.
- Oversee incident management, root cause analysis (RCA), and SLA compliance.
- Implement observability best practices using Grafana, Prometheus, and Opsgenie.
- Reduce manual intervention by promoting automation and self-healing infrastructure.
3. Leadership & Team Management
- Build and maintain strong customer relationships, ensuring clear and transparent communication.
- Lead and mentor a cross-functional team of DevOps and CloudOps/SRE engineers.
- Drive team productivity, conduct performance reviews, and support professional growth.
- Drive continuous improvement through feedback, training, and best practices.
4. AWS (Good to Have)
- Maintain basic to intermediate AWS knowledge (IAM, EC2, EKS, S3, Lambda, CloudFormation).
- Assist in AWS networking, security, and infrastructure optimization when required.
- Provide support for AWS-based workloads where integration with GCP exists.
Technical Stack Expertise Required:
Primary (GCP-Focused DevOps & CloudOps):
- Cloud Platform: Google Cloud Platform (GCP) as primary, AWS as secondary
- Infrastructure as Code (IaC): Terraform
- Containerization & Orchestration: Kubernetes (GKE)
- CI/CD & Automation: Jenkins, GitOps, Ansible
- Monitoring & Observability: Prometheus, Grafana
- Incident & Alerting Tools: Opsgenie
- Big Data & Streaming Technologies: Kafka, Airflow, Druid
- AWS Services: IAM, EC2, S3, Lambda, CloudFormation, CloudWatch
Required Skills & Qualifications:
- B.Tech/B.E. graduate with 10-15 years of experience in DevOps, CloudOps, or SRE roles.
- Prior experience in handling 24x7 operations and multi-cloud environments.
- Proven experience in managing DevOps & CloudOps/SRE teams, ensuring smooth operations.
- Hands-on expertise with GCP infrastructure, Terraform, Kubernetes (GKE), and CI/CD pipelines.
- Experience in incident management, RCA, monitoring, and alerting tools (Prometheus, Grafana, Opsgenie).
- Strong understanding of reliability engineering, automation, and cloud security best practices.
Nice to Have:
- Experience with Kafka, Airflow, and Druid in large-scale environments.
- Certifications: GCP Professional DevOps Engineer, AWS Solutions Architect, or Kubernetes certifications.
- Working knowledge of AWS cloud services, with the ability to assist in hybrid-cloud scenarios.