We are seeking a self-driven, inquisitive Site Reliability Engineer (SRE) to drive reliability, availability, performance, and security across our global digital product ecosystem. This role is central to ensuring a seamless and resilient experience for our users by blending deep engineering expertise with operational excellence and automation.
You will be part of a global SRE practice supporting a portfolio of 260+ modern cloud-native applications across consumer, commercial, supply chain, and enablement functions. Your mission: prevent incidents before they occur, ensure rapid recovery when they do, and build scalable systems that evolve with our growing business.
Responsibilities
- Champion reliability, observability, and operational excellence across mission-critical applications.
- Develop and maintain service-level indicators (SLIs), service-level objectives (SLOs), and error budgets to measure and improve system performance.
- Implement automated monitoring, alerting, and recovery mechanisms to reduce manual intervention and improve response times.
- Collaborate closely with software engineering, platform, and operations teams to embed SRE practices across the development lifecycle.
- Lead and participate in incident response, root cause analysis, and postmortem reviews to drive long-term improvements.
- Identify and eliminate sources of toil through automation, tooling, and process refinement.
- Continuously improve resiliency design, capacity planning, and release management in production systems.
- Influence engineering teams with best practices on cloud-native architecture, observability, and deployment strategies.
Qualifications
Required Skills:
- 5+ years of experience in production engineering, DevOps, or SRE roles.
- Strong foundation in Linux systems, networking, and cloud platforms (Azure, AWS, or GCP).
- Hands-on experience with observability tools (e.g., AppDynamics, Prometheus, Grafana, ELK, FullStory).
- Proficiency in scripting or programming (e.g., Python, Bash, Go) and automation frameworks (e.g., Ansible, Terraform).
- Deep understanding of CI/CD pipelines, release strategies, and deployment automation.
- Experience managing high-scale, distributed systems in cloud-native environments.
- Strong analytical skills and a passion for continuous improvement.
Preferred Skills:
- Familiarity with microservices, Kubernetes, containers, and service mesh architecture.
- Exposure to incident and problem management frameworks (e.g., ITIL, RCA practices).
- Experience working in global teams supporting mission-critical applications.