Job Description
Job Title: Site Reliability Engineering (SRE) Lead
Location: Hyderabad / Bengaluru
Job Type: Full-time
Experience Level: 10+ years

Job Overview:
We are seeking a seasoned Site Reliability Engineering (SRE) Lead with a strong background in cloud operations, production systems, and automation. This is a senior-level, hands-on role that combines leadership with deep technical expertise in AWS, DevOps, and infrastructure reliability. You will lead a team focused on ensuring availability, scalability, and operational excellence for our cloud-native product environments.

Key Responsibilities:

Leadership & Operations Management
- Lead and mentor a team of SREs and Cloud Operations Engineers.
- Define and enforce reliability standards, SLOs/SLIs, and incident response practices.
- Drive reliability, observability, and automation improvements across cloud-based platforms.
- Act as the bridge between product engineering, DevOps, and support teams for operational readiness.

Cloud & Infrastructure Reliability
- Manage production-grade environments hosted on AWS with a focus on high availability and performance.
- Lead incident management processes, perform root cause analysis, and implement corrective actions.
- Own and evolve monitoring, alerting, and observability using tools like CloudWatch, Prometheus, Grafana, ELK.
- Ensure compliance with security and regulatory standards (e.g., HIPAA, SOC2, GDPR).

DevOps & Automation
- Design and improve CI/CD pipelines using tools like Jenkins, GitHub Actions, or Azure DevOps.
- Implement Infrastructure as Code (IaC) using CloudFormation.
- Experience with Packer and Ansible.
- Automate manual operational tasks and production workflows.
- Support containerized workloads using Docker, ECS, or Kubernetes (EKS).

Stakeholder Communication
- Present technical issues, incident reports, and performance metrics to business and technical stakeholders.
- Collaborate with Engineering, Product, and Security teams to embed reliability across the software lifecycle.
- Provide guidance on cloud cost optimization, performance tuning, and capacity planning.

Required Qualifications:
- 10+ years of overall IT experience, including:
  - At least 5 years in AWS cloud operations or SRE.
  - Minimum 3 years in production-grade environments and incident response.
- Strong leadership experience managing high-performing technical teams.
- Deep understanding of SRE principles, DevOps practices, and cloud-native architecture.
- Proven experience in:
  - AWS core services (VPC, EC2, RDS, ECS, EKS, IAM, S3)
  - Container orchestration and microservices
  - Infrastructure as Code (Terraform / CloudFormation)
  - Monitoring & observability tools (ELK, Prometheus, CloudWatch)

Preferred Qualifications:
- AWS Certified Solutions Architect or DevOps Engineer.
- Experience working on SaaS or multi-tenant platforms.
- Familiarity with multi-cloud and hybrid cloud strategies.