Principal Site Reliability Engineer

8 - 13 years

20 - 25 Lacs

Posted: 2 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

As a Lead SRE Engineer, you will drive reliability, scalability, and operational excellence across a portfolio of applications deployed on a modern internal platform. You will lead and mentor a team of SRE engineers, establish best practices, and collaborate closely with product and development teams to ensure robust, automated, and self-healing systems. Your leadership will be critical in shaping the SRE function and enabling the team to deliver high-impact solutions that support Lilly's mission.
 
What You'll Be Doing
  • Lead the SRE team responsible for the reliability and performance of applications deployed on a cloud-native internal platform.
  • Design, implement, and maintain automation frameworks, self-service tooling, and auto-healing systems to eliminate manual toil.
  • Build and enhance end-to-end observability, monitoring, logging, and alerting systems for proactive issue detection and resolution.
  • Ensure uptime: Take ultimate ownership of the stability of our production environments. Lead end-to-end incident management, from escalation to Root Cause Analysis (RCA). Manage patching, upgrades, and disaster recovery processes.
  • Champion Infrastructure as Code (IaC) and CI/CD best practices to ensure consistent, repeatable, and secure deployments.
  • Collaborate with development and product teams to embed reliability and scalability into application design and architecture.
  • Continuously evaluate and introduce emerging tools and technologies to keep the SRE stack modern and efficient.
  • Mentor and guide SRE engineers, fostering a culture of ownership, innovation, and continuous improvement.
  • Implement AIOps frameworks to improve operational tasks and enhance system self-healing capabilities.
  • Participate in and optimise the on-call rotation, striving to minimise human intervention through automation.
  • Drive capacity planning, disaster recovery, and business continuity initiatives.
  • Support onboarding, documentation, and knowledge sharing for platform services and operational best practices.
How You Will Succeed
  • Demonstrate technical leadership and strategic thinking in SRE practices.
  • Proactively identify and resolve reliability risks and bottlenecks.
  • Foster strong cross-functional relationships with engineering, product, and operations teams.
  • Lead by example in incident management, troubleshooting, and performance optimisation.
  • Promote a culture of blameless postmortems and continuous learning.
  • Effectively communicate complex technical concepts to both technical and non-technical stakeholders.
What You Should Bring
  • Proven experience leading SRE or DevOps teams in a complex, cloud-native environment.
  • Deep expertise in at least one major cloud platform (AWS, Azure, or GCP).
  • Advanced knowledge of Linux/Unix systems, networking, and distributed systems.
  • Proficiency in programming/scripting (Python, Go, or similar).
  • Hands-on experience with containers and orchestration (Docker, Kubernetes at scale).
  • Strong background in CI/CD pipelines and Infrastructure as Code (Terraform, Ansible, Helm, etc.).
  • Expertise with observability platforms (Prometheus, Grafana, ELK, Datadog, Splunk).
  • Experience with SRE practices (SLIs, SLOs, error budgets, blameless postmortems).
  • Excellent problem-solving, debugging, and performance optimisation skills.
  • Experience with security engineering, IAM, secrets management, and vulnerability scanning is a plus.
  • Exposure to cloud cost optimisation strategies is desirable.
  • Experience mentoring and developing engineers.
Basic Qualifications and Experience Requirement
  • Bachelor's degree in Computer Science, Engineering, or a related field.
  • 8+ years of hands-on experience in SRE, DevOps, or related roles, with at least 2 years in a technical leadership capacity.
  • Demonstrated success in managing reliability for large-scale, distributed systems.
  • Relevant certifications (e.g., AWS Certified DevOps Engineer, CKA, etc.) are a plus.
Additional Skills/Preferences
  • Experience with AI/ML in operations (AIOps) for anomaly detection, predictive scaling, or automated incident triage.
  • Contribution to open-source projects or thought leadership in SRE/DevOps communities.
  • Knowledge of Agile principles and frameworks (e.g., Scrum, SAFe), including related tools (such as Jira).
  • Excellent analytical, problem-solving, and investigative skills.
  • Strong communication and collaboration skills.
Additional Information
Availability to work flexible hours may be required. This team supports continuous operations across two shifts; this role will therefore require non-standard work hours and some work on weekends and holidays. Appropriate adjustments in benefits will be provided for employees working non-standard hours where applicable.

Eli Lilly And Company

Pharmaceutical Manufacturing

Indianapolis, Indiana
