High Availability and Scalability Engineering Lead

8 years

6 - 20 Lacs

Posted: 1 week ago | Platform: Glassdoor

Work Mode

On-site

Job Type

Full Time

Job Description

Job Title:

High Availability & Scalability Engineering Lead

Role Overview:

The High Availability & Scalability Engineering Lead is responsible for designing, implementing, and managing highly available, fault-tolerant, and scalable systems to support critical business applications. This role blends deep technical expertise in distributed systems, cloud infrastructure, and performance optimization with leadership and cross-functional collaboration.

You will lead a team of engineers to ensure that all platforms meet stringent SLAs for uptime, resilience, and scalability, especially under peak loads and in failure scenarios.

Key Responsibilities:

Architecture & Design

  • Design and implement high-availability architectures using clustering, load balancing, replication, and failover strategies.
  • Lead design reviews for scalable distributed systems (microservices, event-driven, or service mesh architectures).
  • Evaluate and adopt cloud-native technologies (e.g., Kubernetes, ECS, autoscaling groups, service meshes, serverless) to enhance elasticity and resilience.
  • Drive the definition of RTO/RPO, failover automation, and multi-region deployment strategies.

Implementation & Operations

  • Develop and enforce SLAs, SLOs, and SLIs for reliability, latency, and performance.
  • Lead efforts in capacity planning, performance tuning, and chaos testing to ensure predictable system behavior under stress.
  • Collaborate with DevOps and SRE teams to automate infrastructure provisioning (e.g., Terraform, Pulumi, CloudFormation).
  • Establish monitoring, alerting, and self-healing mechanisms using tools such as Prometheus, Grafana, Datadog, or New Relic.

Leadership & Strategy

  • Mentor and guide engineers on designing resilient, performant, and secure architectures.
  • Partner with product and platform engineering to forecast future growth and capacity needs.
  • Create frameworks and best practices for high availability, DR, and horizontal scalability across teams.
  • Lead incident reviews, root cause analysis, and reliability retrospectives to drive continuous improvement.

Required Skills & Qualifications:

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
  • 8+ years of experience in backend, infrastructure, or systems engineering; 3+ years in a leadership or architect role.
  • Deep expertise with cloud platforms (Azure), containers (Docker), and container orchestration (Kubernetes, ECS).
  • Proficiency in distributed systems design, load balancing, replication, failover, and data partitioning.
  • Strong programming experience in one or more: Go, Python, Java, or C++.
  • Experience with observability and reliability engineering (monitoring, logging, tracing, SLOs).
  • Proven ability to lead cross-functional initiatives, drive architectural decisions, and scale systems supporting millions of users or high transaction volumes.

Preferred Qualifications:

  • Hands-on experience with multi-region, multi-cloud architectures.
  • Certification in Microsoft Azure.
  • Background in SRE principles, Chaos Engineering, or Resilience Engineering.
  • Knowledge of event-streaming and messaging technologies (Kafka, Pulsar, RabbitMQ) and distributed databases (Cassandra, CockroachDB, DynamoDB).

Success Indicators:

  • Uptime and latency SLAs met consistently across services.
  • Reduced mean time to recovery (MTTR) and incident frequency.
  • Documented and automated failover and scaling strategies.
  • Demonstrated mentorship and technical leadership within engineering teams.

Job Type: Full-time

Pay: ₹670,805.33 - ₹2,059,333.85 per year

Benefits:

  • Cell phone reimbursement
  • Health insurance
  • Internet reimbursement
  • Paid sick time
  • Paid time off
