Site Reliability Engineer Lead - Support & Operations

7 - 12 years

30 - 40 Lacs

Platform: Naukri


Work Mode

Work from Office

Job Type

Full Time

Job Description

Job Title: SRE Lead (Support & Operations)

Job Summary:

We are seeking an experienced and proactive Site Reliability Engineering (SRE) Lead with a strong background in support operations, service management, and debugging complex systems built on Java and microservices architecture. This role is crucial to ensuring the reliability, stability, and efficiency of our critical systems while driving process improvements, incident management, and cross-functional collaboration. As an SRE Lead, you will oversee system health, manage escalations, track tickets through to closure, follow up on open issues, and enhance support processes to deliver a seamless operational experience.

Experience: 7+ years

Key Responsibilities:

Service Reliability & Operational Excellence:

• Ensure high availability and performance of critical services through proactive monitoring and issue resolution.

• Define and uphold Service Level Indicators (SLIs) and Service Level Objectives (SLOs) aligned with business needs.

• Identify recurring operational challenges and implement process improvements to enhance service reliability.

Incident & Problem Management:

• Lead incident response efforts, ensuring quick resolution and minimal business impact.

• Establish robust on-call processes and ensure smooth incident handling across teams.

• Conduct post-incident reviews, documenting learnings and driving continuous improvement initiatives.

• Collaborate with engineering teams to ensure long-term fixes for recurring incidents.

• Apply strong debugging skills to analyze and resolve complex issues.

Support & Escalation Management:

• Act as the primary point of contact for major incidents, working with cross-functional teams to resolve issues.

• Manage support escalations efficiently, ensuring timely communication and resolution.

• Track and ensure timely closure of support tickets and incidents.

• Follow up on pending issues to drive resolution and prevent recurring problems.

• Develop and enhance support playbooks and standard operating procedures (SOPs).

• Foster a culture of accountability and knowledge sharing within the team.

Collaboration & Stakeholder Management:

• Work closely with development, infrastructure, and business teams to align operational goals.

• Ensure seamless communication between engineering teams, customer support, and leadership.

• Provide regular updates on system health, incidents, and improvements to stakeholders.

• Advocate for operational needs in engineering and product discussions.

Process Improvement & Automation:

• Streamline support workflows and implement best practices for efficient issue resolution.

• Drive automation initiatives to reduce manual operational tasks and improve response times.

• Ensure documentation and knowledge management practices are maintained effectively.

Leadership & Team Development:

• Mentor and support a team of SREs, fostering a culture of reliability and operational excellence.

• Promote a customer-first mindset within the team.

• Encourage collaboration, learning, and professional growth among team members.

Skills & Qualifications:

Required Skills:

• Strong experience in IT operations, support, or service reliability roles.

• Proven track record in incident management, troubleshooting, and root cause analysis.

• Strong Java knowledge with an understanding of microservices architecture.

• Experience with monitoring and alerting tools (e.g., Grafana, Prometheus, New Relic, or similar).

• Familiarity with Kubernetes and cloud-based environments (AWS, Azure, GCP).

• Familiarity with ITIL practices and service management methodologies.

• Strong communication and stakeholder management skills.

• Ability to manage escalations effectively and ensure timely issue resolution.

• Strong skills in tracking support issues, ensuring ticket closures, and following up on action items.

Preferred Qualifications:

• Prior experience in an SRE, IT operations, or support leadership role.

• Knowledge of ticketing and ITSM tools (e.g., ServiceNow, Jira Service Management, or similar).

• Understanding of compliance, security, and best practices in support operations.

• Exposure to automation and process improvement initiatives.

WOW Softech

Software Development

San Francisco
