Lead Site Reliability Engineer (SRE)

6 - 10 years

0 Lacs

Posted: 1 day ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

You will lead site reliability engineering efforts focused on infrastructure performance, scaling, and optimization; observability; incident management; zero-downtime deployments; rollback reliability; secret management; IAM risk mitigation; configuration drift; environment parity; and application-level performance and engineering quality.

Key Responsibilities:
- Take ownership of end-to-end system reliability, including cloud resource planning and code-level instrumentation.
- Review and enhance backend code for performance, resiliency, and observability, such as retries, timeouts, connection pools, and logging.
- Architect and scale multi-environment Kubernetes deployments, with a preference for GKE, to ensure high availability and low drift.
- Define and implement zero-downtime deployment strategies like canary, blue-green, and progressive delivery.
- Collaborate with full-stack teams to ensure release readiness, CI/CD quality gates, and infra-aware feature rollout.
- Strengthen secret management, IAM policies, and privilege boundaries for applications and services.
- Lead incident response, root cause analysis, and long-term reliability enhancements.
- Develop and review Terraform modules, Helm charts, or platform tooling using Bash, Python, or Go as necessary.
- Lead design reviews and make cross-functional decisions that impact both product and platform reliability.

Requirements:
- 6+ years of experience in full-stack development, SRE, or platform engineering.
- Proficiency in one or more backend stacks like Python/Django, Node/NestJS, Go, or Java/Spring, with the ability to review or contribute code.
- Strong expertise in Kubernetes, preferably GKE, and Helm for optimizing, securing, and debugging real-world workloads.
- Proficiency in Terraform and Infrastructure as Code (IaC) workflows, ideally with Terraform Cloud and a remote state strategy.
- Solid understanding of GCP or similar cloud provider services such as IAM, VPCs, CloudSQL, networking, Secret Manager, and monitoring.
- Experience implementing progressive delivery with tools and practices such as ArgoCD, Flux, GitOps, and CI/CD patterns.
- Proven ability to enhance system observability using tools like Datadog, Prometheus, and OpenTelemetry.
- Ability to dive deep into application repositories, identify architectural flaws or infrastructure misuse, and provide solutions or guidance.
- Experience with incident management, staying calm under pressure, and fostering a postmortem culture.

Tools and Expectations:
- Datadog: Monitor infrastructure health, capture service-level metrics, and reduce alert fatigue through high-signal thresholds.
- PagerDuty: Manage the incident management pipeline, route alerts based on severity, and align with business SLAs.
- GKE / Kubernetes: Enhance cluster stability and workload isolation, define auto-scaling configurations, and tune for efficiency.
- Helm / GitOps (ArgoCD/Flux): Validate release consistency across clusters, monitor sync status, and ensure rollout safety.
- Terraform Cloud: Support disaster recovery planning and detect infrastructure changes through state comparisons.
- CloudSQL / Cloudflare: Diagnose database and networking issues, monitor latency, enforce access patterns, and validate WAF usage.
- Secret Management & IAM: Ensure secure handling of secrets, manage time-privileged credentials, and define alerts for abnormal usage.
