Senior Site Reliability Engineer

5 - 7 years

20 - 24 Lacs

Posted: 3 months ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

  • Design, build, and maintain scalable, reliable, and high-performance infrastructure and services.
  • Implement and manage monitoring, alerting, and automated remediation tools to ensure maximum uptime.
  • Participate in on-call rotations, resolve production incidents, and improve system reliability.
  • Work closely with development teams to ensure smooth application deployment and continuous integration/continuous deployment (CI/CD) pipelines.
  • Develop and maintain system observability frameworks, including logs, metrics, and tracing.
  • Drive the implementation of SLOs (Service Level Objectives) and SLIs (Service Level Indicators), ensuring systems meet reliability targets.
  • Write Python scripts to automate system operations and improve the deployment process.
  • Build and manage services using GCP resources such as Compute Engine, Kubernetes Engine, Cloud Functions, BigQuery, and Cloud Storage.
  • Ensure proper integration and optimization between GCP services and the overall architecture.
  • Leverage GCP tools for cost management, security, and performance monitoring.
  • Collaborate with engineering teams to integrate reliability practices into the software development lifecycle.
  • Provide guidance on best practices for system design, disaster recovery, and fault tolerance.
  • Lead post-incident analysis and conduct retrospectives to prevent future issues.
  • Mentor junior engineers and help build a culture of reliability and high availability.

  • 6+ years of experience in Site Reliability Engineering or related roles.
  • Strong experience in Python programming, including automation, scripting, and developing tools for system management.
  • Expertise in Google Cloud Platform (GCP) services, including Compute Engine, Kubernetes Engine, Cloud Functions, and more.
  • In-depth understanding of cloud infrastructure, containerization (Docker, Kubernetes), and orchestration.
  • Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, Stackdriver, etc.).
  • Proven experience with CI/CD pipelines, version control (Git), and automation tools (Ansible, Terraform, etc.).
  • Strong understanding of networking, load balancing, and high-availability architectures.
  • Experience in incident management and working in an on-call rotation.

UST

IT Services and IT Consulting

Aliso Viejo CA

10001 Employees

1845 Jobs

    Key People

  • Kris Canekeratne, Co-Founder & CEO
  • Sandeep Reddy, President
