Director of Site Reliability Engineering

8 - 13 years

20 - 25 Lacs

Posted: 1 hour ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

This is a hands-on leadership role requiring deep technical expertise, proven ability to scale engineering organizations, and a track record of building reliable systems at scale. The ideal candidate will balance reliability with tactical execution, driving both immediate operational excellence and long-term architectural improvements where necessary.
Key Responsibilities
Strategic Leadership & Vision
  • Define and execute the long-term SRE strategy aligned with business objectives and technical roadmap
  • Establish reliability standards, SLI/SLO frameworks, and error budget policies across services
  • Drive architectural decisions that improve system reliability, scalability, and operational efficiency
  • Partner with engineering leadership to influence platform and application design for reliability
  • Represent SRE perspective in executive technical discussions and strategic planning
Team Leadership & Development
  • Build, lead, and scale a high-performing SRE organization
  • Recruit, hire, and onboard top-tier SRE talent across multiple experience levels
  • Develop career progression frameworks and growth paths for SRE professionals
  • Foster a culture of continuous learning, blameless post-mortems, and operational excellence
  • Provide technical mentorship and leadership development for senior SRE staff
Operational Excellence & Incident Management
  • Manage and oversee enterprise-wide incident response processes and on-call practices
  • Drive root cause analysis programs and ensure systematic elimination of failure modes
  • Implement sustainable on-call practices that maintain work-life balance while ensuring coverage
  • Oversee capacity planning and resource optimization strategies across all services
  • Establish metrics and reporting frameworks for reliability, performance, and operational health
Cross-Functional Partnership
  • Collaborate with VP/Director level peers in Engineering, Product, and Infrastructure
  • Work with Security leadership to integrate reliability and security practices
  • Partner with Finance on cost optimization initiatives and capacity planning budgets
  • Engage with Customer Success and Support teams on reliability-impacting issues
Platform & Tooling Strategy
  • Drive the simplification and reduction of observability, monitoring, and alerting platforms
  • Establish automation standards and drive toil reduction initiatives
  • Help improve CI/CD pipeline architecture and deployment practices
  • Influence infrastructure-as-code and configuration management strategies
Organizational & Process Innovation
  • Implement SRE best practices including error budgets, toil tracking, and reliability reviews
  • Establish metrics-driven decision making and continuous improvement processes
  • Drive adoption of chaos engineering and proactive reliability testing
  • Create and maintain SRE documentation, runbooks, and knowledge sharing systems
  • Develop and execute disaster recovery and business continuity plans
Required Skills
Leadership & Management Experience
  • Bachelor's or Master's degree in Computer Science, Engineering, or equivalent experience
  • 8+ years in engineering leadership roles, with 4+ years managing managers
  • Proven track record of building and scaling engineering teams
  • Experience with performance management, career development, and succession planning
  • Strong executive presence and ability to influence without authority
  • Experience driving organizational change and cultural transformation
Technical Expertise
  • Experience with multiple cloud platforms (AWS, GCP, Azure) and hybrid environments
  • Deep understanding of distributed systems, microservices architecture, and cloud platforms
  • Hands-on experience with modern observability tools (Prometheus, Grafana, Datadog, etc.)
  • Strong background in infrastructure automation, CI/CD, and infrastructure-as-code
  • Expertise in capacity planning, performance optimization, and cost management
SRE & Operations Mastery
  • Deep understanding of SRE principles, practices, and implementation at scale
  • Experience establishing SLI/SLO frameworks and error budget management
  • Proven track record of improving system reliability and reducing operational toil
  • Experience with incident management, post-mortem processes, and reliability engineering
  • Background in 24/7 operations and on-call management best practices
Business & Strategic Acumen
  • Understanding of budget management, resource allocation, and ROI analysis
  • Ability to communicate technical concepts to non-technical stakeholders and executives
  • Experience with vendor management and technology partnership decisions
  • Knowledge of compliance frameworks and regulatory requirements
Desired Skills
Advanced Technical Background
  • Background in container orchestration (Kubernetes) and service mesh technologies
  • Knowledge of database administration and data platform reliability
  • Experience with security engineering and DevSecOps practices
Success Metrics
Reliability & Performance
  • Achieve and maintain service availability targets (typically 99.9%+ uptime)
  • Reduce mean time to detection (MTTD) and mean time to recovery (MTTR)
  • Improve capacity planning accuracy and reduce over-provisioning costs
  • Increase deployment frequency while maintaining reliability standards
Team & Organizational Development
  • Build and retain a high-performing SRE organization with low attrition
  • Establish clear career progression and achieve high employee satisfaction scores
  • Develop internal talent and promote from within the SRE organization
  • Create sustainable on-call practices with reasonable operational load
Operational Excellence
  • Drive measurable reduction in operational toil and manual interventions
  • Establish comprehensive observability and proactive alerting across all services
  • Implement effective incident response with blameless post-mortem culture
  • Achieve cost optimization targets while maintaining reliability standards
 

Five9

Cloud Computing / Software as a Service (SaaS)

San Ramon
