What's cool about this job
As an Associate Site Reliability Engineer, you'll be at the forefront of our platform evolution, architecting solutions that ensure reliability, performance, and efficiency for our customers and their production workloads. You'll lead initiatives to enhance system reliability, implement innovative fault-tolerance strategies, and drive automation that significantly reduces toil. If you're passionate about solving complex challenges, mentoring the next generation of SREs, and implementing best practices that make a real difference in system reliability and cost-effectiveness, this is the role for you. You'll have the opportunity to work with a diverse set of technologies, influence our technical direction, and make a tangible impact on our platform's performance and reliability.
The day to day
- Work with a team of SREs to design, implement, and maintain highly scalable and resilient systems
- Help execute initiatives to significantly reduce toil through automation and process improvements
- Aid in executing performance optimization initiatives to enhance system efficiency and user experience
- Architect and implement robust, secure, and scalable software solutions to support WP Engine's platform
- Continuously improve WP Engine's secure, performant platform that supports tens of millions of end users
- Develop and implement strategies to optimize infrastructure costs without compromising reliability or performance
- Drive continuous improvement in observability, including metrics, logging, and tracing to enhance system visibility and troubleshooting capabilities
- Assist in implementing CI/CD pipelines to enhance deployment velocity while maintaining system stability and reliability
- Design and implement sophisticated SLOs and SLIs to better align with business objectives
- Constantly look for opportunities to automate and optimize
- Contribute to alert management and incident response processes, reducing alert fatigue and minimizing MTTR
- Establish monitoring systems to ensure the health, performance, and reliability of WP Engine platforms
- Collaborate with development teams to build reliability and operability into services from the ground up
- Participate in the on-call rotation, and identify and implement solutions that reduce production interruptions
Your expertise and passion
- 2+ years of experience in SRE, Production Engineering, or DevOps roles
- Familiarity with modern observability practices and tools (e.g., Grafana, Prometheus, TICK stack, ELK stack, distributed tracing)
- Experience with at least one major cloud platform and the ability to design and troubleshoot multi-cloud architectures
- Proven track record of significantly reducing toil and improving system reliability in large-scale environments
- Demonstrated experience in performance tuning and cost optimization for large-scale systems
- Proactive with natural problem-solving abilities, an inquisitive personality, a continuous learning approach, and an eagerness to tackle big problems even with uncertain requirements
- Experience designing and implementing effective alerting strategies that minimize noise and maximize signal
- Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders
- Proven ability to drive adoption of SRE best practices across an organization
- Experience with a Kubernetes environment at large scale
- On-call experience for critical services and strong troubleshooting skills
- Bachelor’s degree in Computer Science (or a related field) OR equivalent experience
Desired experience
- Programming skills in languages commonly used for SRE tasks (e.g., Python, Go, Bash)
- Understanding of Linux/Unix systems and networking principles
- Proven ability to design and implement robust CI/CD pipelines
- Experience with containerization and orchestration technologies, particularly Kubernetes
- Experience in implementing and managing large-scale distributed systems
- Track record of driving adoption of SRE best practices across an organization
- Experience participating in major incident responses
- Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs)
This role involves on-call work
- On-call is a weekly rotation among team members
- Level-two escalation point on a follow-the-sun model