What you will do
In this vital role you will build, scale, and secure the platforms that underpin Amgen's global digital initiatives. This role focuses on ensuring the reliability, performance, and efficiency of cloud-native platforms while enabling development velocity and operational excellence.
You will be responsible for designing and operating infrastructure and shared platforms used across the enterprise, including CI/CD, observability, incident management, and collaboration systems.
You will work extensively with containerized environments, manage multi-tenant Kubernetes platforms, and automate processes to improve resilience and reduce operational burden. This role requires technical depth, leadership skills, and the ability to drive initiatives across cross-functional teams and global stakeholders.
Roles & Responsibilities:
Platform Reliability Engineering
- Design, operate, and scale secure, highly available cloud-based infrastructure using Infrastructure as Code (IaC).
- Manage multi-tenant container orchestration environments with advanced access controls, workload isolation, and governance policies.
- Ensure enterprise CI/CD platforms are performant, secure, and optimized for high-throughput engineering teams.
Monitoring, Observability & Incident Management
- Build and manage observability platforms for full-stack visibility, leveraging metrics, logs, and traces.
- Define, implement, and continuously refine SLIs, SLOs, and error budgets for platform health and service performance.
- Automate incident response workflows, integrate with incident management platforms, and lead post-incident reviews and root cause analysis.
Enterprise Platform Administration
- Operate and improve core engineering platforms (e.g., CI/CD, collaboration, knowledge sharing) to ensure availability, security, and ease of use.
- Automate platform provisioning, upgrades, access controls, and integration pipelines to reduce manual effort and improve consistency.
- Implement compliance, audit logging, and policy enforcement through code-driven governance models.
AI Adoption & Enablement
- Drive the adoption of AI/ML-based tools to enhance observability, incident prediction, remediation, and intelligent alerting.
- Evaluate and integrate AI-assisted automation platforms to reduce toil and improve operational efficiency.
- Partner with platform, security, and development teams to embed predictive analytics into dashboards, workflows, and root cause tooling.
- Champion a data-driven SRE practice by enabling actionable insights and anomaly detection across systems and platforms.
Leadership & Collaboration
- Serve as a technical thought leader and mentor within the SRE organization.
- Promote SRE principles and reliability culture across engineering teams.
- Collaborate with cross-functional stakeholders to influence architecture, roadmaps, and platform investment.
- Lead operational reviews and service health retrospectives, with a focus on continuous improvement.
- Participate in Agile and SAFe delivery processes, including sprint planning, stand-ups, retrospectives, and PI planning, to ensure security and platform reliability are embedded across development cycles.
Basic Qualifications:
- Doctorate, Master's, or Bachelor's degree and 8 to 13 years of experience in Computer Science, Information Technology, or a related technical field
- Demonstrated success operating cloud-native infrastructure in production environments
- Practical experience managing Kubernetes clusters and CI/CD environments at enterprise scale
- Exposure to global on-call or incident support rotations
- Excellent collaboration and communication skills across technical and non-technical teams
Preferred Qualifications:
Must-Have Skills:
- Deep experience with cloud platforms (AWS, Azure, or GCP), including services such as compute, networking, IAM, and VPC design
- Proven proficiency in Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation
- Advanced skills in managing container orchestration platforms (e.g., Kubernetes), including workload isolation, resource quotas, and role-based access control
- Strong understanding of Linux system administration, process management, and system performance tuning
- Hands-on experience with CI/CD platforms and pipelines (build automation, artifact storage, environment provisioning, rollback strategies)
- Strong background in observability tooling, including Prometheus, Grafana, Dynatrace, and distributed tracing frameworks like OpenTelemetry or Jaeger
- Strong practical experience with incident management platforms and practices (e.g., alert routing, runbooks, escalation paths)
- Automation and scripting proficiency in languages such as Python, Go, or Bash
- Experience with configuration management tools like Ansible, Chef, or SaltStack
- Strong grasp of networking fundamentals, such as routing, DNS, OSI layers, load balancing, firewalls, TLS, and security groups
- Version control and collaboration workflows using Git and GitOps principles
- Experience with enterprise collaboration platforms, including provisioning, integration, and permission control
Good-to-Have Skills:
- Exposure to service mesh technologies (e.g., Istio, Linkerd) and zero-trust network concepts
- Familiarity with secrets management platforms (e.g., HashiCorp Vault, AWS Secrets Manager)
- Experience using incident response and chaos engineering tools (e.g., Gremlin, Chaos Mesh)
- Background in cost optimization, budgeting, and resource tracking (FinOps)
- Awareness of policy-as-code frameworks (e.g., OPA, Kyverno)
- Familiarity with feature flagging and progressive delivery tools (e.g., LaunchDarkly, Argo Rollouts)
- Integration experience with ticketing and change management platforms (e.g., ServiceNow, Jira)
- Understanding of compliance standards (e.g., HIPAA, GDPR, SOC 2) and how they apply to infrastructure operations
- Understanding of security and encryption technologies and authentication protocols such as OpenID, OIDC, OAuth, SAML, and LDAP
Professional Certifications (Preferred)
- Cloud DevOps Certification (AWS/Azure/GCP)
- Certified Kubernetes Administrator (CKA) or Security Specialist (CKS)
- CI/CD Platform Certification
- ITIL Foundation or equivalent service management certification
Soft Skills:
- High level of ownership and accountability for platform reliability
- Strong diagnostic and analytical capabilities with a bias for action
- Clear and confident communicator with an ability to influence without authority
- Passion for automation, operational excellence, and team mentorship