Senior Member of Technical Staff - SMTS

5 - 8 years

Posted: 1 week ago | Platform: Foundit

Work Mode

On-site

Job Type

Full Time

Job Description

Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Join Our Cloud Infrastructure Engineering & Operations Division as a Senior Site Reliability Engineer

We are seeking a highly skilled Senior Site Reliability Engineer to elevate our Cloud Infrastructure Engineering & Operations team. Your primary mission will be to enhance the performance, reliability, and scalability of our platforms by spearheading the development of a world-class observability ecosystem that drives business success.

The Team:

The Logging, Metrics, and Monitoring (LMM) team is at the forefront of building and delivering observability services and tools for our engineering communities within the Cloud Engineering & Operations and Research & Development zones. Our solutions are critical: hundreds of developers use them daily to develop, monitor, troubleshoot, and optimize our web services. We manage large-scale, distributed, fault-tolerant systems that collect and host vast volumes of log and metric data, enabling data-driven decision-making across the organization.

Our work has a direct, measurable impact on the productivity of our engineering teams across athenaNation, empowering them to innovate faster and operate more reliably.

In this role, you will tackle a diverse set of challenges, from fine-tuning system performance and scaling services to debugging complex issues. You will partner closely with development teams to deliver new monitoring features, improve existing tools, and solve pressing engineering problems, all within an agile environment that leverages both private and public cloud platforms.

Job Responsibilities

  • Automate the deployment, configuration, and management of logging, metrics, and monitoring services using Puppet and Infrastructure as Code best practices to ensure reliable and scalable operations.
  • Proactively troubleshoot and resolve complex production incidents, leveraging deep Linux system administration and engineering expertise to minimize downtime.
  • Lead cross-functional projects from conception through delivery, including designing scalable technical solutions, managing timelines, and ensuring successful implementation.
  • Architect and implement comprehensive monitoring strategies by developing metrics, dashboards, and alerting criteria to enable proactive service performance management and dynamic scaling.
  • Collaborate closely with engineering teams during weekly on-call rotations to swiftly diagnose and resolve high-impact issues, fostering a culture of reliability.
  • Partner with development teams to enhance their logging and telemetry capabilities, improving observability and operational efficiency.
  • Mentor and guide team members on best practices for incident response, system tuning, and service reliability.

Required Qualifications

  • 5-8 years of hands-on experience managing mission-critical production environments with a focus on Linux system administration and DevOps practices.
  • Expertise in Amazon Web Services and cloud-native approaches.
  • Experience working with microservices and production-grade infrastructure.
  • Proven expertise in managing and optimizing large-scale logging and data platforms such as Kafka, OpenSearch/Elasticsearch, and log forwarding agents like Vector or Fluentd.
  • Extensive experience with configuration management tools such as Puppet or Ansible, automating deployment and operations at scale.
  • Scripting experience with Python or Bash.
  • Demonstrated success troubleshooting and resolving issues in Linux-based production services, including participating actively in on-call rotations.
  • Proficiency in scripting and programming languages including Bash, Python, and Golang for automation, tooling, and integrations.
  • Strong expertise in Infrastructure as Code using Terraform and AWS CloudFormation to build resilient, repeatable deployment workflows.
  • Ability to rapidly adapt to evolving technology environments and business priorities with a bias toward reliability and automation.

Additional Qualifications

  • Experience managing large-scale production server fleets (thousands of nodes) with high availability and fault tolerance.
  • Deep subject matter expertise in technologies such as Graphite, ClickHouse, Prometheus, Grafana, Docker, Jenkins, and Git.
  • Familiarity with AWS cloud architecture, deployment, and operational best practices, with hands-on experience deploying scalable cloud-native applications.
  • Proficiency with protocol analyzers like tcpdump and Wireshark for network troubleshooting and performance diagnostics.
