Site Reliability Engineer

5 - 10 years

20 - 35 Lacs

Posted: 3 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Interested candidates can directly apply through our careers page:

https://aviato.jobs.growhire.com/jobs/cm5xeghu601jtqmznz5zmlr38

own critical infrastructure

What's In It For You?

  • Learn from the Best:

    Report directly to and receive mentorship from our Head of SRE, an experienced ex-Google SRE Manager. Gain invaluable insights into scaling, reliability, and leadership honed at one of the world's tech giants.
  • High-Impact Projects:

    Take ownership of complex GCP environments for diverse, significant clients across Australia and the EU. Your work directly influences the stability and performance of critical systems.
  • Drive Innovation, Not Just Tickets:

    We empower our Senior SREs to think strategically. You'll architect solutions, implement cutting-edge practices (SLOs, error budgets, advanced automation), and proactively improve systems, not just react to issues.
  • A Culture That Works:

    Founded by ex-Googlers, we foster a transparent, collaborative, and low-bureaucracy environment where doing the right thing matters. We value SRE principles and give you the autonomy to implement them effectively.
  • Cutting-Edge Tech:

    Deepen your expertise with GCP, Kubernetes, Terraform, modern observability tooling (Grafana, Dynatrace, Sentry), and sophisticated CI/CD pipelines.

What You'll Do (Your Impact):

  • Own & Architect Reliability:

    Design, implement, and manage highly available, scalable, and resilient architectures on Google Cloud Platform (GCP) for key customer environments.
  • Lead GCP Expertise:

    Serve as a subject matter expert for GCP within the team and potentially wider organisation, driving best practices for security, cost optimization, and performance.
  • Master Kubernetes at Scale:

    Architect, deploy, secure, and manage production-grade Kubernetes clusters (GKE preferred), ensuring optimal performance and reliability for critical applications (including API platforms like Apigee, though prior Apigee experience isn't mandatory).
  • Drive Automation & IaC:

    Lead the design and implementation of robust automation strategies using Terraform, Ansible, and scripting (Python, Go, Bash) for provisioning, configuration management, and CI/CD pipelines (Jenkins, GitHub Actions, etc.).
  • Elevate Observability:

    Architect and refine comprehensive monitoring, logging, and alerting strategies using tools like Grafana, Dynatrace, and Sentry to ensure proactive issue detection and rapid response.
  • Lead Incident Response & Prevention:

    Spearhead incident management efforts, conduct blameless post-mortems, and drive the implementation of preventative measures to continuously improve system resilience.
  • Champion SRE Principles:

    Actively promote and embed SRE best practices (SLOs, SLIs, error budgets) within delivery teams and operational processes.
  • Mentor & Collaborate:

    Share your expertise, potentially mentor junior team members, and collaborate effectively across teams to foster a strong reliability culture.

What You'll Bring (Your Expertise):

  • Proven SRE Experience:

    5+ years of hands-on experience in a Site Reliability Engineering, DevOps, or Cloud Engineering role, with a significant focus on production systems.
  • Deep GCP Knowledge:

    Demonstrable, in-depth expertise in designing, deploying, and managing services within Google Cloud Platform (Compute Engine, GKE, Networking, IAM, Cloud SQL/Spanner, Pub/Sub, Monitoring/Logging, etc.). GCP certifications are a plus.
  • Strong Kubernetes Skills:

    Proven experience managing Kubernetes clusters in production environments (GKE highly desirable). Understanding of networking, security, and operational best practices within Kubernetes.
  • Infrastructure as Code Mastery:

    Significant experience using Terraform in complex environments. Proficiency with configuration management tools (Ansible, Puppet, Chef) is beneficial.
  • Automation & Scripting Prowess:

    Strong proficiency in scripting languages like Python or Go, with experience in automating operational tasks and building tooling.
  • Observability Expertise:

    Experience implementing and leveraging monitoring, logging, and tracing tools (e.g., Prometheus, Grafana, ELK Stack, Dynatrace, Datadog, Sentry).
  • Problem-Solving Acumen:

    Strong analytical and troubleshooting skills, with experience leading incident response for critical systems.
  • Collaboration & Communication:

    Excellent communication skills and a collaborative mindset, with the ability to explain complex technical concepts clearly. Experience mentoring others is advantageous.
  • (Desirable):

    Experience with API Management platforms (Apigee, Kong, etc.), advanced networking concepts, or security hardening in cloud environments.

Technologies We Use (You'll Master):

  • Cloud:

    Google Cloud Platform (GCP)
  • Containerisation & Orchestration:

    Kubernetes (GKE), Docker
  • Infrastructure & Automation:

    Terraform, Ansible
  • Monitoring & Observability:

    Grafana, Dynatrace, Sentry, Google Cloud Operations Suite
  • CI/CD:

    Jenkins, GitHub Actions, Bamboo (or similar)
  • Scripting:

    Python, Go, Bash
  • Collaboration:

    JIRA, Confluence, Slack
