This is a hands-on leadership role requiring deep technical expertise, a proven ability to scale engineering organizations, and a track record of building reliable systems at scale. The ideal candidate will balance long-term reliability goals with tactical execution, driving both immediate operational excellence and, where necessary, long-term architectural improvements.
Key Responsibilities
Strategic Leadership & Vision
- Define and execute the long-term SRE strategy aligned with business objectives and technical roadmap
- Establish reliability standards, SLI/SLO frameworks, and error budget policies across services
- Drive architectural decisions that improve system reliability, scalability, and operational efficiency
- Partner with engineering leadership to influence platform and application design for reliability
- Represent SRE perspective in executive technical discussions and strategic planning
Team Leadership & Development
- Build, lead, and scale a high-performing SRE organization
- Recruit, hire, and onboard top-tier SRE talent across multiple experience levels
- Develop career progression frameworks and growth paths for SRE professionals
- Foster a culture of continuous learning, blameless post-mortems, and operational excellence
- Provide technical mentorship and leadership development for senior SRE staff
Operational Excellence & Incident Management
- Manage and oversee enterprise-wide incident response processes and on-call practices
- Drive root cause analysis programs and ensure systematic elimination of failure modes
- Implement sustainable on-call practices that maintain work-life balance while ensuring coverage
- Oversee capacity planning and resource optimization strategies across all services
- Establish metrics and reporting frameworks for reliability, performance, and operational health
Cross-Functional Partnership
- Collaborate with VP- and Director-level peers in Engineering, Product, and Infrastructure
- Work with Security leadership to integrate reliability and security practices
- Partner with Finance on cost optimization initiatives and capacity planning budgets
- Engage with Customer Success and Support teams on reliability-impacting issues
Platform & Tooling Strategy
- Drive the simplification and reduction of observability, monitoring, and alerting platforms
- Establish automation standards and drive toil reduction initiatives
- Help improve CI/CD pipeline architecture and deployment practices
- Influence infrastructure-as-code and configuration management strategies
Organizational & Process Innovation
- Implement SRE best practices including error budgets, toil tracking, and reliability reviews
- Establish metrics-driven decision-making and continuous improvement processes
- Drive adoption of chaos engineering and proactive reliability testing
- Create and maintain SRE documentation, runbooks, and knowledge sharing systems
- Develop and execute disaster recovery and business continuity plans
Required Skills
Leadership & Management Experience
- Bachelor's or Master's degree in Computer Science, Engineering, or equivalent experience
- 8+ years in engineering leadership roles, with 4+ years managing managers
- Proven track record of building and scaling engineering teams
- Experience with performance management, career development, and succession planning
- Strong executive presence and ability to influence without authority
- Experience driving organizational change and cultural transformation
Technical Expertise
- Experience with multiple cloud platforms (AWS, GCP, Azure) and hybrid environments
- Deep understanding of distributed systems, microservices architecture, and cloud platforms
- Hands-on experience with modern observability tools (Prometheus, Grafana, Datadog, etc.)
- Strong background in infrastructure automation, CI/CD, and infrastructure-as-code
- Expertise in capacity planning, performance optimization, and cost management
SRE & Operations Mastery
- Deep understanding of SRE principles, practices, and implementation at scale
- Experience establishing SLI/SLO frameworks and error budget management
- Proven track record of improving system reliability and reducing operational toil
- Experience with incident management, post-mortem processes, and reliability engineering
- Background in 24/7 operations and on-call management best practices
Business & Strategic Acumen
- Understanding of budget management, resource allocation, and ROI analysis
- Ability to communicate technical concepts to non-technical stakeholders and executives
- Experience with vendor management and technology partnership decisions
- Knowledge of compliance frameworks and regulatory requirements
Desired Skills
Advanced Technical Background
- Background in container orchestration (Kubernetes) and service mesh technologies
- Knowledge of database administration and data platform reliability
- Experience with security engineering and DevSecOps practices
Success Metrics
Reliability & Performance
- Achieve and maintain service availability targets (typically 99.9%+ uptime)
- Reduce mean time to detection (MTTD) and mean time to recovery (MTTR)
- Improve capacity planning accuracy and reduce over-provisioning costs
- Increase deployment frequency while maintaining reliability standards
Team & Organizational Development
- Build and retain a high-performing SRE organization with low attrition
- Establish clear career progression and achieve high employee satisfaction scores
- Develop internal talent and promote from within the SRE organization
- Create sustainable on-call practices with reasonable operational load
Operational Excellence
- Drive measurable reduction in operational toil and manual interventions
- Establish comprehensive observability and proactive alerting across all services
- Implement effective incident response with blameless post-mortem culture
- Achieve cost optimization targets while maintaining reliability standards