Associate Director, Platform Engineering

10 - 20 years

30 - 45 Lacs

Posted: -1 days ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

Position summary

We are seeking a seasoned Senior Site Reliability Engineer (SRE) to join our team. You will be responsible for the big-picture architecture, day-to-day operations, and continuous improvement of our production systems, ensuring their availability, performance, and resilience. This role is pivotal in blending cutting-edge observability and automation with proactive engineering practices.

Responsibilities

  • Design, implement, and maintain comprehensive observability solutions to track the health and performance of our systems.
  • Analyze observability data and explore AIOps methodologies to identify potential issues, predict failures, and proactively troubleshoot problems before they impact users.
  • Develop and implement alerts and notifications for critical events to ensure timely intervention.
  • Collaborate with development teams to design and implement solutions that enhance system resilience and reduce downtime, in part by designing and running chaos engineering experiments (e.g., using AWS FIS).
  • Analyze performance metrics to identify and resolve latency bottlenecks in our infrastructure.
  • Implement performance optimization techniques and tools to improve the overall responsiveness of our systems.
  • Work with development teams to ensure that new features and code changes do not introduce performance regressions.
  • Develop and maintain metrics dashboards to track key performance indicators (KPIs) for our critical systems.
  • Identify performance trends and anomalies that may indicate potential issues or areas for improvement.
  • Recommend and implement performance optimization strategies to enhance the overall efficiency of our systems.
  • Optimize resource utilization and minimize unnecessary expenditure on IT infrastructure.
  • Identify and implement cost-effective solutions that improve the efficiency of our IT operations and reduce toil.
  • Design and implement automated deployment and rollback procedures to mitigate risks associated with software updates.
  • Monitor the performance of new releases and address any issues that arise promptly.
  • Analyze the root causes of incidents and implement preventive measures to minimize recurrence.
  • Document incident responses and communicate lessons learned to enhance our incident handling processes.

Requirements

  • Proficient in application and infrastructure observability; Splunk OpenTelemetry preferred.
  • A deep understanding and practical application of Site Reliability Engineering principles.
  • Ability to build and maintain a system and culture that supports and implements SLOs.
  • Experienced in production environments running in AWS.
  • Comfortable with Infrastructure as Code; Terraform is preferred.
  • Familiar with Docker & Kubernetes, specifically EKS & ECS.
  • Familiar with programming languages, with a strong preference for Python (for scripting, automation, and data analysis/AI).
  • Comfortable with CI/CD pipelines such as GitHub Actions or Azure DevOps.
  • Understanding of the application lifecycle.
  • Familiarity working in an agile environment.
  • Ability to review architecture designs for observability coverage, high availability, resilience, and disaster recovery.
  • Familiarity with Chaos Engineering principles and experience designing or running controlled experiments to test system resilience.
  • Demonstrable interest or experience in AIOps, including the application of AI/ML to operational data and familiarity with platforms like Amazon Bedrock.
  • Excellent troubleshooting and problem-solving skills with a knack for identifying and resolving complex technical issues.
  • Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders.
  • A passion for maintaining high availability, performance, and reliability of critical systems in a fast-paced environment.
  • Ability to maintain relationships with other disciplines and stakeholders.
  • Strong sense of ownership, urgency, and drive.
  • Potential participation in an on-call rotation.

Qualifications

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 10+ years of experience as a Site Reliability Engineer or in an equivalent role.
  • Proven experience in monitoring, analyzing, and optimizing the performance of large-scale distributed systems in a cloud environment.
  • Proven experience with Windows or Linux production environments, including managing servers, operating systems, and network configurations within the cloud.
  • Proven scripting and automation skills, preferably in PowerShell, Bash, or Python.
  • AWS certification preferred.

S&P Global Market Intelligence

Financial Services

New York
