Posted: 2 months ago
Remote
Full Time
We are seeking a talented SRE Engineer to drive the reliability, scalability, and performance of critical systems in Azure Core, with a specific focus on Office 365 buildouts. The ideal candidate will have hands-on experience in cloud infrastructure, automation, monitoring, and incident management. You will be responsible for deploying resources in public and sovereign clouds, troubleshooting complex system issues, and working with large datasets to generate operational insights.

Role & responsibilities
- Participate in on-call rotations, responding to production incidents during non-business hours, weekends, and holidays as needed.
- Manage and resolve system incidents by leading incident bridges, troubleshooting, and driving resolution.
- Continuously monitor system performance using telemetry tools to identify and resolve potential issues before they impact service reliability.
- Ensure all performance metrics remain within acceptable limits and drive towards KPIs.
- Maintain automation tools to reduce manual effort and increase reliability.
- Lead and execute buildouts, ensuring timely delivery and troubleshooting deployment issues.
- Analyze operational data, create dashboards, and report on system chokepoints, throughput, and performance.
- Identify areas for cycle time reduction and incident toil minimization.
- Lead blameless post-incident reviews to determine root cause and improve service resiliency.
- Implement preventive measures to avoid repeat issues.
- Create and maintain comprehensive documentation, including technical procedures, playbooks, and TSGs, to streamline incident response and improve operational knowledge sharing.

Perks and benefits
PF, Medical Insurance, Paid time off