Senior Site Reliability Engineer

5 - 10 years

9 - 13 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

System Design and Architecture: Design, implement, and maintain scalable and reliable systems, ensuring they can handle both current and future demands.
Incident Management: Lead incident response efforts, diagnose root causes, and implement long-term solutions to prevent recurrence. Ensure effective communication during outages.
Monitoring and Observability: Develop and maintain comprehensive monitoring and alerting systems to proactively identify and address issues before they impact users.
Automation and Efficiency: Automate repetitive tasks and processes to improve operational efficiency and reduce manual intervention.
Performance Tuning: Continuously optimize system performance, including fine-tuning applications, databases, and infrastructure to meet service level objectives (SLOs).
Capacity Planning: Forecast future system requirements based on growth trends and current usage, and plan capacity upgrades to ensure system reliability.
Collaboration and Mentoring: Work closely with development teams to integrate reliability into the software development lifecycle. Mentor junior SREs and share best practices.
Documentation and Knowledge Sharing: Create and maintain detailed documentation on system design, incident response procedures, and operational practices to ensure knowledge is preserved and accessible.

Requirements

5+ years of experience as an SRE within AWS environments at medium to large-scale organizations.
5+ years of hands-on experience implementing and managing observability tools, such as Prometheus, New Relic, Grafana, or similar.
Advanced programming skills in Python, Groovy, and Bash.
Strong understanding of database technologies, including both SQL and NoSQL systems.
3+ years of experience developing and managing infrastructure deployment pipelines using Git, Terraform, Helm, Jenkins/Jenkins X/ArgoCD, or similar tools.
Proven expertise in designing, evaluating, and supporting production environments in AWS, including VPCs, EKS, IAM, AMI, EC2, CloudWatch, CloudTrail, Control Tower, GuardDuty, MSK, S3, Glacier, Gateways, Direct Connect, Route 53, RDS, ALBs, Autoscaling, and more.
Hands-on experience with Linux systems and protocols and technologies such as HTTP, REST, TCP/IP, SSL, DNS, SMTP, SSH, NTP, Load Balancing, SQL/NoSQL, Message Brokers, Nginx, Vault, etc.
Extensive experience with Kubernetes and various container and cloud-native technologies.
Significant experience in managing 24/7 on-call rotations, creating runbooks, establishing support procedures, and proactively monitoring systems across multiple geographic locations.
Ability to thrive under pressure and excel in a technically challenging environment.

Lytx

Telematics / Fleet Management

San Diego
