AWS and GCP Team Lead

5 - 9 years

0 Lacs

Posted: 1 month ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

Role Overview: As the AWS & GCP SRE Team Lead, your primary responsibility is to lead and mentor a team of Site Reliability Engineers in ensuring the reliability and performance of the cloud infrastructure across AWS and GCP. You will collaborate with cross-functional teams to establish resilient and automated systems while fostering a culture of continuous improvement. Your expertise in both AWS and GCP, combined with leadership experience, will be crucial in building highly available systems.

Key Responsibilities:

- Leadership & Team Development:
  - Lead, mentor, and manage a team of SREs specializing in AWS and GCP infrastructure, emphasizing collaboration, innovation, and operational excellence.
  - Establish clear goals and priorities for the team to align with business objectives.
  - Provide guidance in resolving complex technical issues, promoting best practices, and facilitating knowledge sharing within the team.
- Reliability & Scalability:
  - Take ownership of the availability, reliability, and scalability of core services on AWS and GCP.
  - Implement and monitor service-level objectives (SLOs), service-level indicators (SLIs), and error budgets to enhance system reliability.
  - Collaborate with engineering teams to develop and deploy scalable, high-performance cloud solutions meeting operational and business requirements.
- Automation & Infrastructure Management:
  - Drive automation of operational workflows, encompassing provisioning, deployment, monitoring, and incident management for AWS and GCP.
  - Promote the adoption of Infrastructure as Code (IaC) tools and practices like Terraform, AWS CloudFormation, and GCP Deployment Manager.
  - Supervise the implementation of CI/CD pipelines and deployment strategies for faster and more reliable releases.
- Incident Management & Resolution:
  - Lead incident management initiatives to ensure swift responses to critical incidents and minimize downtime.
  - Conduct post-incident reviews (PIRs) to identify root causes, enhance processes, and share insights across teams.
  - Continuously refine the incident response process to reduce MTTR (Mean Time to Recovery) and enhance system stability.
- Collaboration & Communication:
  - Collaborate with engineering, product, and DevOps teams to incorporate SRE best practices into the software development lifecycle.
  - Communicate effectively with both technical and non-technical stakeholders, providing regular updates on service reliability, incident status, and team accomplishments.
- Monitoring & Observability:
  - Ensure comprehensive monitoring and observability practices across AWS and GCP environments utilizing tools such as AWS CloudWatch, GCP Stackdriver, Prometheus, Grafana, or ELK stack.
  - Proactively identify performance bottlenecks and system failures, leveraging metrics and logs to drive enhancements in system reliability.
- Continuous Improvement:
  - Advocate for ongoing enhancement in operational practices, tooling, and processes to drive efficiency and reliability.
