Lead Site Reliability Engineer

7 years

0 Lacs

Posted: 2 days ago | Platform: LinkedIn


Work Mode

On-site

Job Type

Full Time

Job Description

Job Title: Lead Site Reliability Engineer

Location:

Experience: 7+ years

Shift Time:

Working:

Notice Period:


About the Role

SRE Lead


As the SRE Lead, you will collaborate closely with development, operations, and security teams to keep our services highly available, secure, and performant, while fostering a culture of automation, monitoring, and continuous improvement.


Key Responsibilities

  • Lead and mentor a team of site reliability engineers to design, build, and maintain reliable, scalable, and secure cloud infrastructure across AWS and Azure.
  • Architect and implement Infrastructure as Code (IaC) solutions primarily using Terraform to manage multi-cloud environments efficiently.
  • Develop, maintain, and optimize CI/CD pipelines leveraging GitHub Actions to enable fast and reliable software delivery.
  • Establish and drive best practices in site reliability, monitoring, alerting, and incident response using Datadog and other observability tools.
  • Collaborate with software engineering teams to improve system reliability through automation, load testing, and performance tuning.
  • Define and track SLOs, SLIs, and error budgets; lead incident retrospectives and continuous improvement initiatives.
  • Manage cloud resource costs and optimize usage across multiple cloud providers.
  • Promote a DevOps culture emphasizing automation, continuous deployment, and proactive incident management.
  • Stay current with the latest industry trends and technologies in cloud, automation, and SRE practices.

Required Skills

  • 7+ years of experience in Site Reliability Engineering, DevOps, or cloud infrastructure roles.
  • Experience implementing dashboards to monitor and track SLOs, SLIs, and error budgets, and leading incident retrospectives and continuous improvement initiatives.
  • Proven experience leading and mentoring engineering teams.
  • Strong hands-on experience with AWS and Azure cloud platforms.
  • Expertise in Infrastructure as Code with Terraform, including multi-cloud deployments.
  • Proficiency in building and managing CI/CD pipelines with GitHub Actions.
  • Deep knowledge of monitoring and observability tools, especially Datadog.
  • Solid understanding of networking, security, container orchestration (Kubernetes is a plus), and cloud-native architectures.
  • Strong scripting and automation skills (Python, Bash, or similar).
  • Experience with incident management, root cause analysis, and capacity planning.
  • Excellent communication, leadership, and collaboration skills.

Technical Skills

  • IaC: Terraform
  • CI/CD: GitHub Actions, Git workflows, and Argo CD
  • Observability: Datadog, Prometheus, and Fluent Bit
  • Pod orchestration: Amazon EKS and EKS on AWS Fargate
  • Cloud: AWS and Azure

Preferred

  • Certifications such as AWS Certified DevOps Engineer, Azure DevOps Engineer, or HashiCorp Terraform Associate.
  • Experience with Kubernetes and service mesh technologies.
  • Familiarity with chaos engineering and resilience testing.
  • Knowledge of security best practices in cloud environments.