Site Reliability Engineer

8 years

0 Lacs

Posted: 7 hours ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

Job Title: Site Reliability Engineer (SRE) - LEAD

Department:

Job Summary:

Site Reliability Engineer (SRE)

Key Responsibilities:

Strategic Leadership & Governance

  • Define and evolve the SRE CoE vision, strategy, and roadmap.
  • Establish enterprise-wide SRE standards, frameworks, and maturity models.
  • Drive adoption of SRE principles across product and platform teams.

Enablement

  • Act as a subject matter expert and advisor to engineering teams on reliability, scalability, and performance.
  • Conduct workshops, training sessions, and knowledge-sharing forums.
  • Promote a culture of observability, automation, and continuous improvement.

Collaboration & Mentorship

  • Partner with engineering, product, and operations leaders to align reliability goals with business outcomes.
  • Mentor SREs and engineers across teams, fostering a community of practice.
  • Lead cross-functional reliability reviews and architecture assessments.
  • Collaborate with development, operations, and network teams.
  • Align infrastructure reliability with application SLOs/SLIs.
  • Advocate for best practices in system architecture and operations.

Infrastructure & Reliability

  • Design, implement, and maintain scalable, reliable infrastructure.
  • Ensure high availability and maintain disaster recovery strategies.
  • Improve reliability for legacy and hybrid (cloud/on-prem) systems.

Monitoring & Incident Management

  • Develop and maintain monitoring, alerting, and incident response systems.
  • Conduct root cause analysis and post-mortems.
  • Participate in on-call rotations and respond to production issues.

Automation & Efficiency

  • Automate repetitive tasks using scripting and tooling.
  • Lead Infrastructure-as-Code (IaC) and automation for provisioning and scaling.
  • Create sustainable systems through automation and continuous improvement.
  • Evaluate and recommend tools for monitoring, alerting, incident management, and chaos engineering.
  • Build reusable automation frameworks and templates for onboarding teams to SRE practices.
  • Collaborate with DevOps and platform teams to integrate reliability tooling into CI/CD pipelines.
  • Support rigorous testing and release procedures.

Performance & Capacity

  • Lead capacity planning, system upgrades, and OS patching.
  • Gather and analyze system/application metrics for performance tuning.

Containerization & Cloud

  • Support Kubernetes and container platforms in hybrid environments.
  • Work with OpenShift, GCP, Azure, and AWS for cloud-integrated services.

Required Qualifications:

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).
  • 8+ years of experience in SRE.
  • Proficiency in at least one programming/scripting language (e.g., Python).
  • Experience with cloud platforms (AWS, GCP, Azure).

Preferred Qualifications:

  • Experience in setting up or leading a CoE or similar strategic function.
  • Certifications in cloud, DevOps, or SRE-related domains.
  • Experience with chaos engineering and resilience testing.
  • Experience with observability tools (Prometheus, Grafana, ELK, Datadog, etc.).
  • Experience with incident management and SLO/SLI/SLA frameworks.

HCLTech

Information Technology Services

New Delhi

Noida, Uttar Pradesh, India