Job Description
Note: This job role is part of MetLife's Hack4Job India (a hiring hackathon).
Only shortlisted candidates will be invited.

Department: Global Technology

Role Overview

MetLife is seeking an experienced Site Reliability Engineer (SRE) to ensure the availability, scalability, and performance of critical systems and services. The role involves monitoring, automation, incident management, and collaboration with engineering teams to optimize system reliability and efficiency.

Key Responsibilities

System Reliability & Performance: Ensure system uptime, troubleshoot issues, and optimize performance.
Service Design & Automation: Develop automation scripts and tools to streamline operations.
Monitoring & Alerting: Implement observability solutions using ELK, Grafana, Splunk, and Azure Monitor.
Incident Response & Management: Lead root cause analysis, post-mortems, and corrective actions.
Collaboration: Work with engineering teams to align system performance with business goals.
Documentation & Knowledge Sharing: Maintain accurate system documentation and promote best practices.

Qualifications & Skills

Experience: 3+ years as an SRE, supporting hybrid cloud platforms (on-prem and Azure).
Programming: Java, Python, Bash, PowerShell.
Cloud & Containers: Azure services, Docker, Kubernetes, Terraform.
Monitoring & Logging: ELK stack, Grafana, Splunk, Azure Application Insights.
Database: Strong hands-on experience with SQL.
Tools: Azure DevOps, Pipelines, Repos, ServiceNow.
Soft Skills: Strong analytical, problem-solving, and communication skills.
Language: Business proficiency in English; Japanese language is a plus.

This is a great opportunity to be part of MetLife's technology transformation journey.