Posted:1 month ago| Platform:
Hybrid
Full Time
Hiring, Lead Site Reliability Engineer with following skills and expertise. What will this person do? Provide leadership in designing and implementing reliable, scalable, and secure infrastructure solutions. Develop and maintain observability solutions, ensuring visibility into system performance using native Azure Cloud solutions. Define and track SLIs, ensuring compliance with SLOs and SLAs. Lead incident response efforts, conduct root cause analysis, and implement preventive measures to minimize downtime. Automate infrastructure provisioning, configuration and management using Terraform & Ansible. Build and maintain robust Observability pipelines to support automated deployments and continuous monitoring practices. Continuously analyze system health and optimize performance by identifying and resolving bottlenecks. Work with our BCDR team to minimize business impact during failures and measure the quality of services. Work with Cloud Governance team to monitor cloud infrastructure spending and implement cost-saving strategies. Implement centralized logging, metric collection, and distributed tracing for troubleshooting and debugging. Deploy, Manage and Monitor containerized workloads. Maintain configuration consistency and compliance across cloud environments using tools like Ansible. Partner with software development teams to integrate reliability best practices into the application development lifecycle. Conduct detailed post-mortems, document learnings, and drive improvements to reduce future incidents. Develop automation scripts in Python, Bash, or other languages to reduce manual efforts and improve efficiency. Provide mentorship to junior engineers, fostering a culture of learning and continuous technical growth. Research and evaluate new technologies, tools, and methodologies to improve system reliability and efficiency. Maintain detailed documentation on infrastructure, monitoring setups, incident responses, and best practices. Qualifications Bachelors degree in Computer Science, Engineering, or a related field. 10+ years in Observability, DevOps, and Site Reliability Engineering (SRE). At least 2 years of experience in defining Observability KPIs for both on-premises and cloud environments. Strong experience with cloud platforms (AWS, Azure, GCP) and cloud-native technologies. Passion for automation, reducing toil and implementing reliability-focused best practices. Deep knowledge of services/tools like Grafana, PowerBI, Prometheus, Azure Monitor, Application Insights & Azure Metrics. Expertise in Terraform, Ansible, Chef, and CI/CD pipeline tools like GitHub Actions, Jenkins, and GitOps methodologies. Working understanding of load balancing, authentication (AAA), encryption, and network parameters monitoring. Strong troubleshooting skills and experience handling on-call incidents and post-mortem analysis. Ability to work cross-functionally, drive technical discussions, and mentor junior engineers. Ability to work in a dynamic team environment and possess time management skills to meet deadlines. Sense of ownership and pride in your performance and its impact on the companys success. Critical thinker with problem-solving skills. Good interpersonal and communication skills.
Upload Resume
Drag or click to upload
Your data is secure with us, protected by advanced encryption.
16.0 - 20.0 Lacs P.A.
16.0 - 20.0 Lacs P.A.
Noida, Uttar Pradesh, India
5.0 - 9.0 Lacs P.A.
Pune, Maharashtra, India
3.0 - 8.0 Lacs P.A.
7.0 - 12.0 Lacs P.A.
8.0 - 9.49244 Lacs P.A.
Hyderabad, Telangana
Salary: Not disclosed
1.92 - 5.2 Lacs P.A.
Hyderabad
13.0 - 18.0 Lacs P.A.
32.5 - 40.0 Lacs P.A.