Posted: 1 week ago | Platform:
Work from Office
Full Time
Experience: 3 to 5 years in cloud infrastructure operations, L1 incident management, automation support, and observability, with team coordination or mentoring experience.
Location: Pune
Shift: 24x7 Support (Rotational Shifts)
Education: BE/B.Tech (relevant certifications preferred: AWS Cloud Practitioner/Associate, Azure Fundamentals, CKA, Terraform Associate)

Job Summary:
We are seeking an L1 Lead – Site Reliability Engineer (SRE) to guide and manage the frontline SRE team in ensuring the stability, availability, and efficiency of enterprise-scale cloud infrastructure operations. This role involves supervising incident response, ensuring adherence to runbooks and SOPs, providing technical guidance to L1 engineers, and serving as the key escalation point for L1 issues. You will be responsible for monitoring cloud services, triaging alerts, validating remediation efforts, mentoring junior engineers, and collaborating with L2/L3 teams on escalations and root cause analysis.

Responsibilities:
- Lead and mentor the L1 SRE team during shifts, ensuring timely response and proper handling of incidents, service requests, and alerts.
- Oversee infrastructure and application monitoring using tools such as Prometheus, Grafana, AWS CloudWatch, and Azure Monitor.
- Validate and guide remediation actions such as pod restarts, disk space cleanup, scaling, and alert verification.
- Ensure SOPs, runbooks, and shift handover notes are followed and updated regularly.
- Execute and validate predefined Ansible playbooks, Terraform scripts, and CI/CD pipelines with junior team members.
- Act as the first point of escalation for unresolved L1 issues and coordinate with L2/L3 teams for resolution and RCA.
- Govern and track shift performance, including SLA compliance, FCR (First Call Resolution), and ticket hygiene.
- Coordinate patching, backup checks, standard changes, and validations in AWS/Azure environments.
- Facilitate onboarding of new L1 engineers, and deliver knowledge-sharing and refresher training sessions.
- Support automation initiatives by identifying repetitive tasks and creating/reviewing simple scripts.
- Prepare weekly/monthly shift reports and participate in SRE governance and review calls with operations leadership.
- Monitor the health of Kubernetes clusters and guide the team in basic pod/node/service troubleshooting.

Skills/Expertise:
- 3+ years of experience in cloud infrastructure operations, with at least 1 year in a lead or mentoring role.
- Strong troubleshooting, coordination, documentation, and escalation management skills.
- Proven ability to lead shifts in a 24x7 support model.
- Familiarity with ITSM practices and SLA management (ServiceNow or similar).
- Proactive and structured communicator, capable of shift planning, reporting, and stakeholder updates.

Technical Skills:
- Experience monitoring and operating cloud-based environments, with basic troubleshooting of system and application-level issues.
- Familiarity with cloud services and concepts across AWS (such as EC2, S3, IAM, and VPC) and Azure DevOps services.
- Basic knowledge of container platforms such as Docker and Kubernetes (understanding pod/service basics, logs, etc.).
- Exposure to scripting using Shell, Bash, or Python for automation of routine tasks.
- Basic understanding of version control systems like Git, GitHub, or GitLab.
- Awareness of infrastructure-as-code and automation tools such as Ansible, Terraform, or CloudFormation (execution under guidance).
- Familiarity with CI/CD concepts and tools like Jenkins or GitLab CI (executing builds, monitoring pipelines).
- Understanding of alerting and monitoring tools like Grafana, ELK, Site24x7, CloudWatch, and Prometheus.
- Hands-on experience with ITSM tools such as ServiceNow for incident and ticket tracking.