Site Reliability Engineer

7 - 12 years

17 - 30 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode: Hybrid

Job Type: Full Time

Job Description

Job Summary

Site Reliability Engineer (SRE)

This position requires a strong SRE mindset, production ownership, and close collaboration with development, QA, DevOps, and platform teams.

Roles and Responsibilities

  • Own the reliability, availability, and performance of mission-critical production systems across IBM Cloud and GCP.
  • Operate, monitor, and scale Kubernetes platforms (IKS / GKE), including deployments, upgrades, node pool management, and capacity planning.
  • Design, implement, and maintain monitoring, alerting, logging, and dashboards using cloud-native and open-source observability tools.
  • Define, measure, and continuously improve SLIs, SLOs, error budgets, and service KPIs.
  • Participate in on-call rotations, lead incident response, perform root cause analysis (RCA), and drive post-incident reviews with clear corrective actions.
  • Proactively analyse system performance, traffic patterns, and failure trends to prevent outages and reduce MTTR.
  • Manage and support PostgreSQL databases in production, including backups, restores, replication, failover, upgrades, and performance tuning.
  • Support event-driven architectures and MQTT-based messaging systems, ensuring message reliability, scalability, and low latency.
  • Implement and enforce cloud and Kubernetes security best practices, including IAM, RBAC, secrets management, certificate lifecycle, and network security.
  • Automate operational, reliability, and maintenance tasks using Python and Shell scripting.
  • Support CI/CD pipelines, enabling safe release strategies such as blue-green and canary deployments.
  • Troubleshoot build, deployment, application, and infrastructure failures and drive long-term reliability improvements.
  • Monitor infrastructure utilization and cloud costs, and recommend performance and cost-optimization measures.
  • Collaborate with development, QA, DevOps, and platform teams to improve delivery velocity and operational excellence.
  • Maintain clear runbooks, SOPs, and operational documentation for incident handling and platform operations.

Required Skills & Qualifications

  • Strong hands-on experience operating Kubernetes in production environments
  • Proven expertise in monitoring, alerting, observability, and SRE KPIs
  • Hands-on experience supporting PostgreSQL databases in production
  • Knowledge of event-driven architectures and MQTT
  • Solid understanding of cloud security principles and best practices
  • Strong automation and scripting skills (Python, Shell)
  • Experience working with IBM Cloud and/or Google Cloud Platform (GCP)
  • Ability to handle production incidents, perform RCA, and operate in a reliability-focused environment

Preferred Skills

  • Experience with Prometheus, Grafana, OpenTelemetry, or similar observability platforms
  • Infrastructure as Code or GitOps experience (Terraform, Kustomize, Argo CD)
  • Kubernetes, Cloud, or SRE-related certifications
  • Experience with cloud cost optimization (FinOps) practices
  • Exposure to multi-cloud or hybrid cloud environments

Qentelli

Software Development

Dallas, Texas