SRE Lead (Support & Operations)

7 - 12 years

20 - 35 Lacs

Posted: 10 hours ago | Platform: Naukri

Work Mode: Work from Office

Job Type: Full Time

Job Description

Job Summary:

We are seeking an experienced and proactive Site Reliability Engineering (SRE) Lead with a strong background in support operations, service management, and debugging complex systems built on Java and microservices architecture. This role is crucial in ensuring the reliability, stability, and efficiency of our critical systems while driving process improvements, incident management, and cross-functional collaboration. As an SRE Lead, you will oversee system health, manage escalations, track and ensure ticket closures, follow up on issues, and enhance support processes to deliver a seamless operational experience.

Experience: 7-12 years

Key Responsibilities:

Service Reliability & Operational Excellence:

  • Ensure high availability and performance of critical services through proactive monitoring and issue resolution.
  • Define and uphold Service Level Indicators (SLIs) and Service Level Objectives (SLOs) aligned with business needs.
  • Identify recurring operational challenges and implement process improvements to enhance service reliability.

Incident & Problem Management:

  • Lead incident response efforts, ensuring quick resolution and minimal business impact.
  • Establish robust on-call processes and ensure smooth incident handling across teams.
  • Conduct post-incident reviews, documenting learnings and driving continuous improvement initiatives.
  • Collaborate with engineering teams to ensure long-term fixes for recurring incidents.
  • Apply strong debugging skills to analyze and resolve complex issues.

Support & Escalation Management:

  • Act as the primary point of contact for major incidents, working with cross-functional teams to resolve issues.
  • Manage support escalations efficiently, ensuring timely communication and resolution.
  • Track and ensure timely closure of support tickets and incidents.
  • Follow up on pending issues to drive resolution and prevent recurring problems.
  • Develop and enhance support playbooks and standard operating procedures (SOPs).
  • Foster a culture of accountability and knowledge sharing within the team.

Collaboration & Stakeholder Management:

  • Work closely with development, infrastructure, and business teams to align operational goals.
  • Ensure seamless communication between engineering teams, customer support, and leadership.
  • Provide regular updates on system health, incidents, and improvements to stakeholders.
  • Advocate for operational needs in engineering and product discussions.

Process Improvement & Automation:

  • Streamline support workflows and implement best practices for efficient issue resolution.
  • Drive automation initiatives to reduce manual operational tasks and improve response times.
  • Ensure documentation and knowledge management practices are maintained effectively.

Leadership & Team Development:

  • Mentor and support a team of SREs, fostering a culture of reliability and operational excellence.
  • Promote a customer-first mindset within the team.
  • Encourage collaboration, learning, and professional growth among team members.

Skills & Qualifications:

  • Strong experience in IT operations, support, or service reliability roles.
  • Proven track record in incident management, troubleshooting, and root cause analysis.
  • Strong Java knowledge with an understanding of microservices architecture.
  • Experience with monitoring and alerting tools (e.g., Grafana, Prometheus, New Relic, or similar).
  • Familiarity with Kubernetes and cloud-based environments (AWS, Azure, GCP).
  • Familiarity with ITIL practices and service management methodologies.
  • Strong communication and stakeholder management skills.
  • Ability to manage escalations effectively and ensure timely issue resolution.
  • Strong skills in tracking support issues, ensuring ticket closures, and following up on action items.

Preferred Qualifications:

  • Prior experience in an SRE, IT operations, or support leadership role.
  • Knowledge of ticketing and ITSM tools (e.g., ServiceNow, Jira Service Management, or similar).
  • Understanding of compliance, security, and best practices in support operations.
  • Exposure to automation and process improvement initiatives.
