Site Reliability Engineer

4 - 7 years

12 - 14 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

What's cool about this job

As an Associate Site Reliability Engineer, you'll be at the forefront of our platform evolution, architecting solutions that ensure reliability, performance, and efficiency for our customers and their production workloads. You'll lead initiatives to enhance system reliability, implement innovative fault-tolerance strategies, and drive automation that significantly reduces toil. If you're passionate about solving complex challenges, mentoring the next generation of SREs, and implementing best practices that make a real difference in system reliability and cost-effectiveness, this is the role for you. You'll have the opportunity to work with a diverse set of technologies, influence our technical direction, and make a tangible impact on our platform's performance and reliability.

The day to day

  • Work with a team of SREs to design, implement, and maintain highly scalable and resilient systems
  • Help execute initiatives to significantly reduce toil through automation and process improvements
  • Aid in executing performance optimization initiatives to enhance system efficiency and user experience
  • Architect and implement robust, secure, and scalable software solutions to support WP Engine's platform
  • Continuously improve WP Engine's secure, performant platform, which supports tens of millions of end users
  • Develop and implement strategies to optimize infrastructure costs without compromising reliability or performance
  • Drive continuous improvement in observability, including metrics, logging, and tracing to enhance system visibility and troubleshooting capabilities
  • Assist in implementing CI/CD pipelines to enhance deployment velocity while maintaining system stability and reliability
  • Design and implement sophisticated SLOs and SLIs to better align with business objectives
  • Constantly look for opportunities to automate and optimize
  • Contribute to alert management and incident response processes, reducing alert fatigue and minimizing MTTR
  • Establish monitoring systems to ensure the health, performance, and reliability of WP Engine platforms
  • Collaborate with development teams to build reliability and operability into services from the ground up
  • Participate in the on-call rotation, and identify and implement solutions to reduce production interrupts

Your expertise and passion

  • 2+ years of experience in SRE, Production Engineering, or DevOps roles
  • Familiarity with modern observability practices and tools (e.g., Grafana, Prometheus, TICK stack, ELK stack, distributed tracing)
  • Experience with at least one major cloud platform and ability to design and troubleshoot multi-cloud architectures
  • Proven track record of significantly reducing toil and improving system reliability in large-scale environments
  • Demonstrated experience in performance tuning and cost optimization for large-scale systems
  • Proactive with natural problem-solving abilities, an inquisitive personality, a continuous learning approach, and an eagerness to tackle big problems even with uncertain requirements
  • Experience designing and implementing effective alerting strategies that minimize noise and maximize signal
  • Excellent communication skills with the ability to explain complex technical concepts to both technical and non-technical stakeholders
  • Proven ability to drive adoption of SRE best practices across an organization
  • Experience with a Kubernetes environment at large scale
  • On-call experience for critical services with good troubleshooting skills
  • Bachelor’s degree in Computer Science (or a related field) OR equivalent experience

Desired experience

  • Programming skills in languages commonly used for SRE tasks (e.g., Python, Go, Bash)
  • Understanding of Linux/Unix systems and networking principles
  • Proven ability to design and implement robust CI/CD pipelines
  • Experience with containerization and orchestration technologies, particularly Kubernetes
  • Experience in implementing and managing large-scale distributed systems
  • Track record of driving adoption of SRE best practices across an organization
  • Experience participating in major incident responses
  • Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs)

This role involves on-call work

  • On-call is a weekly rotation among the team members
  • Level-two escalation point on a follow-the-sun model

Cybage

Information Technology & Services

Pune
