Senior Site Reliability Engineer - DevOps

5 - 9 years

0 Lacs

Posted: 2 days ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a Senior Site Reliability Engineer (SRE), you will play a crucial role in ensuring the reliability, scalability, and observability of the DevOps ecosystem at the company. Your responsibilities will involve managing DevOps components and tools across more than 100 production environments, administering and optimizing Kubernetes clusters, implementing and maintaining observability stacks, ensuring high availability of CI/CD pipelines, and automating infrastructure provisioning using Terraform and Ansible. You will also be responsible for building alerting, monitoring, and dashboarding systems, leading root cause analysis for incidents, collaborating with engineering teams to design reliable systems, and participating in on-call rotations.

Key Responsibilities:
- Own and manage DevOps components and tooling across 100+ production environments.
- Administer, scale, and optimize Kubernetes clusters for application and infrastructure workloads.
- Implement and maintain observability stacks including Prometheus, OpenTelemetry, Elasticsearch, and ClickHouse.
- Ensure high availability of CI/CD pipelines and automate infrastructure provisioning using Terraform and Ansible.
- Build alerting, monitoring, and dashboarding systems to proactively detect and resolve issues.
- Lead root cause analysis for incidents and drive long-term stability improvements.
- Collaborate with engineering teams to design reliable, secure, and observable systems.
- Participate in on-call rotations and lead incident response efforts when required.
- Provide guidance to the cloud platform team to enhance system reliability and scalability.

In addition to the above responsibilities, you will participate in the development process by supporting new features, services, and releases, and you will maintain an ownership mindset for cloud platform technologies.

You are expected to have expertise in at least one programming language such as Java, Python, or Go, proficiency in writing Bash scripts, a good understanding of SQL and NoSQL systems, and knowledge of systems programming. Hands-on experience with Ansible for automating day-to-day activities is required.

Required Skills & Experience:
- 5+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Expertise in Kubernetes, including deployment, scaling, troubleshooting, and operations in production.
- Strong Linux systems background and scripting skills in Python, Bash, or Go.
- Hands-on experience with CI/CD tools like Jenkins, GitLab CI, or similar.
- Infrastructure-as-Code skills using tools such as Terraform, Ansible, or equivalent.
- Solid knowledge of observability tools such as Prometheus, OpenTelemetry, Elasticsearch, ClickHouse, and AppDynamics.
- Experience with containerization (Docker) and orchestration at scale.
- Familiarity with cloud platforms like AWS, GCP, or Azure, and hybrid-cloud architecture.
- Ability to debug and optimize system performance under production load.

Qualys

Computer and Network Security

Foster City CA
