Principal Site Reliability Engineer

8 years

Posted: 4 days ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

At Lilly, we unite caring with discovery to make life better for people around the world. We are a global healthcare leader headquartered in Indianapolis, Indiana. Our employees around the world work to discover and bring life-changing medicines to those who need them, improve the understanding and management of disease, and give back to our communities through philanthropy and volunteerism. We give our best effort to our work, and we put people first. We’re looking for people who are determined to make life better for people around the world.

About The Technology Organization

Technology at Lilly builds and maintains capabilities using pioneering technologies, much as the most prominent tech companies do. What differentiates Technology at Lilly is that we create new possibilities through tech, such as data-driven drug discovery and connected clinical trials, to advance our purpose: creating medicines that make life better for people around the world. We hire the best technology professionals from a variety of backgrounds, so they can bring an assortment of knowledge, skills, and diverse thinking to deliver solutions in every area of our business.

About The Business Function

The Software Product Engineering (SPE) team is a specialised engineering group that delivers strategic solutions and differentiated capabilities. We take a forward-thinking approach, focusing on an enterprise platform and product mindset, ensuring that the solutions we build can be leveraged across Technology teams for broader impact and efficiency.

Job Title:

Principal Site Reliability Engineer

Role Summary

As a Principal Site Reliability Engineer, you will drive reliability, scalability, and operational excellence across a portfolio of applications deployed on a modern internal platform. You will lead and mentor a team of site reliability engineers, establish best practices, and collaborate closely with product and development teams to ensure robust, automated, and self-healing systems. Your leadership will be critical in shaping the SRE function and enabling the team to deliver high-impact solutions that support Lilly’s mission.

What You’ll Be Doing

  • Lead the SRE team responsible for the reliability and performance of applications deployed on a cloud-native internal platform.
  • Design, implement, and maintain automation frameworks, self-service tooling, and auto-healing systems to eliminate manual toil.
  • Build and enhance end-to-end observability, monitoring, logging, and alerting systems for proactive issue detection and resolution.
  • Ensure Uptime: Take ultimate ownership of our production environment's stability. Lead end-to-end incident management, from escalation to Root Cause Analysis (RCA). Manage patching, upgrades, and disaster recovery processes.
  • Champion Infrastructure as Code (IaC) and CI/CD best practices to ensure consistent, repeatable, and secure deployments.
  • Collaborate with development and product teams to embed reliability and scalability into application design and architecture.
  • Continuously evaluate and introduce emerging tools and technologies to keep the SRE stack modern and efficient.
  • Mentor and guide site reliability engineers, fostering a culture of ownership, innovation, and continuous improvement.
  • Implement AIOps frameworks to improve operational tasks and enhance system self-healing capabilities.
  • Participate in and optimise the on-call rotation, striving to minimise human intervention through automation.
  • Drive capacity planning, disaster recovery, and business continuity initiatives.
  • Support onboarding, documentation, and knowledge sharing for platform services and operational best practices.

How You Will Succeed

  • Demonstrate technical leadership and strategic thinking in SRE practices.
  • Proactively identify and resolve reliability risks and bottlenecks.
  • Foster strong cross-functional relationships with engineering, product, and operations teams.
  • Lead by example in incident management, troubleshooting, and performance optimisation.
  • Promote a culture of blameless postmortems and continuous learning.
  • Effectively communicate complex technical concepts to both technical and non-technical stakeholders.

What You Should Bring

  • Proven experience leading SRE or DevOps teams in a complex, cloud-native environment.
  • Deep expertise in at least one major cloud platform (AWS, Azure, or GCP).
  • Advanced knowledge of Linux/Unix systems, networking, and distributed systems.
  • Proficiency in programming/scripting (Python, Go, or similar).
  • Hands-on experience with containers and orchestration (Docker, Kubernetes at scale).
  • Strong background in CI/CD pipelines and Infrastructure as Code (Terraform, Ansible, Helm, etc.).
  • Expertise with observability platforms (Prometheus, Grafana, ELK, Datadog, Splunk).
  • Experience with SRE practices (SLIs, SLOs, error budgets, blameless postmortems).
  • Excellent problem-solving, debugging, and performance optimisation skills.
  • Experience with security engineering, IAM, secrets management, and vulnerability scanning is a plus.
  • Exposure to cloud cost optimisation strategies is desirable.
  • Experience mentoring and developing engineers.

Basic Qualifications And Experience Requirement

  • Bachelor’s degree in Computer Science, Engineering, or related field.
  • 8+ years of hands-on experience in SRE, DevOps, or related roles, with at least 2 years in a technical leadership capacity.
  • Demonstrated success in managing reliability for large-scale, distributed systems.
  • Relevant certifications (e.g., AWS Certified DevOps Engineer, CKA) are a plus.

Additional Skills/Preferences

  • Experience with AI/ML in operations (AIOps) for anomaly detection, predictive scaling, or automated incident triage.
  • Contribution to open-source projects or thought leadership in SRE/DevOps communities.
  • Knowledge of Agile principles and frameworks (e.g., Scrum, SAFe), including related tools (such as Jira).
  • Excellent analytical, problem-solving, and investigative skills.
  • Strong communication and collaboration skills.

Additional Information

Availability to work flexible hours may be required. This team supports continuous operations across two shifts; this role will therefore require non-standard work hours and some work on weekends and holidays. Appropriate adjustments in benefits will be provided for employees working non-standard hours, where applicable.

Lilly is dedicated to helping individuals with disabilities to actively engage in the workforce, ensuring equal opportunities when vying for positions. If you require accommodation to submit a resume for a position at Lilly, please complete the accommodation request form (https://careers.lilly.com/us/en/workplace-accommodation) for further assistance. Please note this is for individuals to request an accommodation as part of the application process and any other correspondence will not receive a response.

Lilly does not discriminate on the basis of age, race, color, religion, gender, sexual orientation, gender identity, gender expression, national origin, protected veteran status, disability or any other legally protected status.

#WeAreLilly
