Job Description
As an SRE Engineer, you will champion all things reliability at Okta on our Auth0 product. Working closely with product engineers, quality engineers, platform engineers, and architecture teams, you will focus primarily on keeping production systems operational at all times, while continually setting and achieving long-term performance, reliability, and scalability goals on a platform with a growth plan for the coming years.
You will play a key role in Auth0's dedication to ensuring customers have uninterrupted access to business-critical enterprise and consumer applications. This is a hands-on role where you will directly operate, troubleshoot, and scale our production systems by responding to monitoring alerts and managing incidents as part of a team's 24/7 on-call rotation. Your work is critical to meeting the demands of ever-increasing traffic and user growth for our customers, who rely on us to provide a reliable product experience.

At Auth0, we celebrate a variety of perspectives and experiences. We are not looking for someone who checks every single box. We're looking for lifelong learners and people who can make us better with their unique experiences.

You will:
Drive the technical direction of the team, working with SRE leadership to translate the organizational vision into an actionable technical roadmap.
Participate in a global on-call rotation featuring a follow-the-sun model on weekdays and a lower-frequency, shared rotation for weekends to remediate incidents on critical systems.
Lead and drive complex, cross-functional initiatives that require partnership with external platform & product teams.
Use existing monitoring tools to identify problems and resolve and/or escalate to service teams.
Implement changes to enable or improve infrastructure resilience, monitoring, and alerting.
Develop and continuously refine SRE tools and processes to improve software delivery, observability, reliability, and operational efficiency.
Optimize existing systems and eliminate toil through simplification and automation.
Define, document, and advocate reliability best practices and policies.
Represent SRE as a senior technical expert in architectural reviews and strategic planning, ensuring reliability is a primary consideration in all major engineering efforts.
Mentor other SREs through pair programming, design discussions, and code reviews to level up the team's technical capabilities.

You might be a good fit if you:
Have 8+ years of industry experience, with a proven track record of leading complex, cross-functional technical projects.
Believe in the SRE mindset: you are data-driven, embrace a blameless culture, and tackle operational problems with a software engineering approach.
Have demonstrable experience participating in a 24/7 on-call rotation.
Possess deep expertise in a major cloud provider (Azure, AWS).
Have demonstrable experience managing infrastructure as code with Terraform at scale.
Have a strong understanding of cloud-native architecture, including containers (Docker, Kubernetes), microservices, modern networking concepts, and various database technologies (SQL, NoSQL, etc.).
Demonstrate strong proficiency in Go, with proven experience building and maintaining production-grade software, tools, and automation.
Have a systematic problem-solving approach, coupled with a strong sense of ownership and the drive to see complex issues through to resolution.
Possess exceptional proficiency in verbal and written English, allowing you to drive clarity during high-pressure incidents and articulate complex concepts.
Have strong interpersonal and collaboration skills, with a proven ability to build relationships and work effectively in a globally distributed, remote-first team.
Are passionate about acting as a force multiplier, mentoring senior engineers and elevating the technical capabilities of the entire team.
Have a strong interest in shaping the team's technical vision and actively contributing to its strategic direction and leadership decisions.