5.0 - 8.0 years
7 - 12 Lacs
Hyderabad
Work from Office
Job Description

The Lead Site Reliability Engineer is a hands-on role that also provides mentorship to other team members on core SRE principles and tools. The lead SRE will participate in the end-to-end operational aspects of the Production environment, work on cloud systems, networks, and databases, and help drive incident lifecycle management. As a member of the SRE team, you will also work closely with the Architects, DevOps, Product, and development teams to ensure we get the most out of the software on the AWS platform.

This role requires a highly skilled technology professional with excellent communication skills, a strategic mindset, and strong analytical and troubleshooting skills on the AWS Cloud Platform. Other responsibilities include working with internal business partners to gather requirements, prototyping, architecting, implementing and updating solutions, building and executing test plans, performing quality reviews, managing operations, and triaging and fixing operational issues. Site Reliability Engineers must be able to adjust to constant business change; common types of change include new requirements, evolving goals and strategies, and emerging technologies.

About the Role:

- Be hands-on and provide mentorship to a growing SRE team on core SRE principles and tools.
- Foster an automation mindset in issue resolution: everything possible should be automated, and people should get involved only when automation cannot resolve an issue.
- Lead efforts to update production with new versions and infrastructure as they become available.
- Lead capacity planning efforts in collaboration with Architects and DevOps engineers to determine the infrastructure changes needed to support new load and performance characteristics.
- Lead engagement with software developers, DevOps, and other infrastructure engineers to integrate software development and delivery from inception to full operation, ensuring robust released software and systems.
- Ensure the highest level of uptime to meet the customer SLA by implementing system-wide corrections that prevent recurrence of issues.
- Mentor other SRE team members to further develop their soft and hard skills.
- Triage, troubleshoot, and resolve issues using golden signals.
- Go past golden signals with additional principles such as chaos engineering (to detect failure points) and synthetic monitoring, and lead Game Days to test the team's resiliency in incident response and remediation.
- Lead SRE team members in creating and maintaining Recovery Procedures and RCAs in collaboration with other engineering teams.
- Ensure incidents assigned to the team are managed within agreed SLAs.
- Ensure alarms are documented in up-to-date Knowledge Base Articles.
- Ensure Production infrastructure is up to date with server and security patches and certificates.
- Continuously improve system and application monitoring and automation.
- Identify and automate manual workarounds and process improvements.
- Proactively monitor the availability, latency, scalability, and efficiency of all services.
- Perform periodic on-call duty as part of the SRE team.

About You:

- Skilled with cloud operations/administration in Amazon AWS.
- Tax/Accounting domain experience.
- Bachelor's or Master's degree in a Computer Science discipline.
- 5+ years of experience focused on Site Reliability Engineering or a related position on the AWS Cloud Platform.
- At least two AWS certifications are a must (AWS SysOps Administrator and Architect certifications preferred).
- Experience working with SQL, Windows Servers, load balancers, and Linux.
- Deep experience with AWS, Docker and Kubernetes, CloudFormation, CloudWatch, CodeDeploy, DynamoDB, Lambda, SQS, Amazon FSx, Elasticsearch, and networking concepts is a must.
- Ability to program at a high level in at least one language such as Java, C#, JavaScript, Python, or Ruby.
- Integration experience with PagerDuty, ServiceNow, Datadog, and CloudWatch.
- Good understanding of Site Reliability Engineering (SRE) philosophies, technologies, platforms and tools, SLO management, incident resolution, and automation.
- Ability to explain technical concepts in clear, non-technical language.
- Working knowledge of infrastructure components (e.g. routers, load balancers, cloud products, container systems, compute, storage, and networks).
- Knowledge of security and compliance standards such as SOC/PCI is a plus.
Posted 4 weeks ago