Platform Site Reliability Engineer

6 - 9 years

12 - 19 Lacs

Chennai, Bengaluru, Mumbai (all areas)

Posted: -1 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Location:

Experience:

CTC:

Notice Period:

Role Overview

Platform Site Reliability Engineer (SRE)

SLA/SLO targets

Key Roles & Responsibilities

Reliability Engineering & Availability

  • Own and drive SLIs, SLOs, SLAs, and error budgets for platform services.
  • Balance feature velocity with system reliability using error budget frameworks.
  • Design and implement high-availability architectures with redundancy, failover, and disaster recovery.
  • Lead initiatives to achieve and sustain 99.9%+ uptime for critical systems.
  • Perform capacity planning, forecasting, and scalability assessments.
  • Conduct chaos engineering experiments to proactively identify system weaknesses.
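
As a rough illustration of the error budget arithmetic behind an availability target such as 99.9%, the Python sketch below converts an SLO into allowed downtime. The function names and the 30-day window are illustrative assumptions, not part of the employer's actual tooling.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Return the allowed downtime (in minutes) for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for 99.9%. The 30-day default window
    is an assumption chosen for illustration.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Return the error budget left after the downtime observed so far.

    A negative result means the budget for the window is exhausted.
    """
    return error_budget_minutes(slo, window_days) - downtime_minutes


if __name__ == "__main__":
    # 99.9% over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After a hypothetical 20-minute outage, about 23.2 minutes remain.
    print(round(budget_remaining(0.999, 20.0), 1))
```

When the remaining budget approaches zero, a team following this approach would typically slow feature releases in favour of reliability work.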

Infrastructure Automation & Platform Engineering

  • Build and maintain Infrastructure-as-Code (IaC) using Terraform (preferred), CloudFormation, Pulumi, or Ansible.
  • Develop automation and tooling using Python, Go, Bash, or similar languages to eliminate manual toil.
  • Implement self-healing systems and automated remediation workflows.
  • Design and maintain internal platform services and developer tools.
  • Implement GitOps workflows and declarative infrastructure management.
  • Automate infrastructure provisioning, configuration, and deployments.

Cloud & Kubernetes Engineering

  • Design and operate infrastructure across AWS, Azure, and GCP environments.
  • Build and manage Kubernetes clusters (EKS, AKS, GKE, self-managed).
  • Implement container orchestration best practices including HPA, VPA, and cluster autoscaler.
  • Design and operate service mesh solutions (Istio, Linkerd, Consul).
  • Optimize container networking, security, storage, and scheduling.
  • Manage container registries (ECR, ACR, GCR, Harbor).

CI/CD & Release Engineering

  • Build and maintain CI/CD pipelines using Jenkins, GitLab CI, GitHub Actions, Azure DevOps, or CircleCI.
  • Implement deployment strategies such as blue-green, canary, and rolling deployments.
  • Manage artifact repositories (Nexus, Artifactory).
  • Implement deployment gates, approval workflows, and rollback mechanisms.
  • Integrate security scanning and secrets management into pipelines.

Incident Management & Operational Excellence

  • Participate in 24/7 on-call rotations for production systems.
  • Lead incident response, triage, and resolution during outages.
  • Conduct blameless postmortems and drive actionable improvements.
  • Track and improve MTTD and MTTR metrics.
  • Build and maintain runbooks, playbooks, and incident automation.

Observability & Performance Optimization

  • Deploy and manage monitoring, logging, and alerting platforms.
  • Design dashboards for system health, performance, and SLO tracking.
  • Configure intelligent alerts to minimize noise and alert fatigue.
  • Perform performance tuning across applications, databases, and infrastructure.
  • Implement caching strategies (Redis, Memcached, CDN) to improve performance.
  • Optimize infrastructure for cost efficiency and performance balance.

Security & Compliance

  • Implement cloud and infrastructure security best practices (least privilege, zero trust).
  • Manage secrets using Vault, AWS Secrets Manager, Azure Key Vault.
  • Implement network security (VPCs, firewalls, security groups, network policies).
  • Ensure compliance with SOC 2, ISO 27001, PCI-DSS, HIPAA, and internal standards.
  • Integrate security scanning and vulnerability management into CI/CD pipelines.

Collaboration & Leadership

  • Partner with development teams to improve service reliability.
  • Participate in architecture and design reviews.
  • Mentor junior engineers and promote SRE best practices.
  • Create and maintain documentation, diagrams, and knowledge bases.
  • Lead cross-functional reliability and platform improvement initiatives.

Required Skills & Experience

Core Technical Skills

  • Strong proficiency in Python, Go, Bash, Ruby, or Java.
  • Experience building production-grade automation and tooling.
  • Deep experience with AWS (EC2, EKS, RDS, VPC, IAM, CloudWatch).
  • Working knowledge of Azure and GCP infrastructure services.
  • Strong expertise in Kubernetes and Docker containerization.
  • Hands-on experience with Terraform (highly preferred).
  • Experience with configuration management tools (Ansible, Chef, Puppet).
  • Strong CI/CD experience with modern pipeline tools.

Systems & Networking

  • Solid understanding of distributed systems and microservices.
  • Knowledge of TCP/IP, DNS, load balancing, CDNs, and networking concepts.
  • Experience with relational and NoSQL databases.
  • Strong observability experience with Prometheus, Grafana, ELK, Datadog, or New Relic.

Professional Experience

  • 6-9 years in SRE, Platform Engineering, Infrastructure, or DevOps roles.
  • 3+ years managing high-availability production systems.
  • Proven track record of automation-driven reliability improvements.
  • Experience operating infrastructure at scale.

Soft Skills & Mindset

  • Strong problem-solving and analytical skills.
  • Ownership-driven mindset with accountability for system reliability.
  • Ability to remain calm and effective during high-severity incidents.
  • Excellent communication and collaboration skills.
  • Strong commitment to blameless culture and continuous improvement.

Certifications (Preferred)

  • AWS Certified Solutions Architect / DevOps Engineer
  • Azure Solutions Architect / DevOps Engineer
  • Google Cloud Professional Cloud Architect
  • Certified Kubernetes Administrator (CKA) / Certified Kubernetes Application Developer (CKAD)
  • HashiCorp Terraform Associate

Nice-to-Have Skills

  • Chaos engineering (Chaos Monkey, Gremlin, LitmusChaos).
  • FinOps and cloud cost optimization.
  • GitOps tools (ArgoCD, Flux).
  • Progressive delivery tools (Spinnaker, Flagger).
  • Serverless platforms and architectures.
  • Contributions to open-source SRE or infrastructure projects.

Education

  • Bachelor’s degree in Computer Science, Engineering, or a related field.

Net Connect

Software Development

Schinnen Amsterdam
