3.0 - 7.0 years
0 Lacs
Punjab
On-site
As a Site Reliability Engineer at HRS, you will play a crucial role in ensuring the reliability, scalability, and performance of the Lodging-as-a-Service (LaaS) platform. Collaborating across engineering, operations, and development teams, you will implement reliability standards, maintain the infrastructure architecture, and pursue operational excellence while adhering to service level objectives (SLOs) and reducing toil.
Your main responsibility will be incident handling: you will be at the forefront of identifying, responding to, and resolving production issues to minimize their impact on services. On-call rotations require quick thinking and decisive action during critical incidents, so you will need to remain calm under pressure and make data-driven decisions to uphold the platform's reliability.
Contributing to the reliability roadmap, supporting platform observability, and driving automation initiatives to enhance system resilience are key aspects of the role. You will monitor critical metrics such as error budgets, mean time to recovery (MTTR), and service level indicators (SLIs) daily to keep platform performance and availability on target. Technical expertise in cloud infrastructure, distributed systems, and automation, coupled with problem-solving and incident management skills, is essential in this position.
Operating according to HRS' leadership principles, the SRE department prioritizes system reliability and customer experience. The team embraces a culture of blameless post-mortems, continuous improvement, and proactive problem-solving, and you will actively participate in incident reviews to prevent recurrences and enhance overall system reliability. As an SRE at HRS, you will also explore new technologies and methodologies to improve system reliability and operational efficiency.
Your responsibilities include working with infrastructure as code, maintaining robust monitoring and alerting systems, and developing automation that reduces manual intervention and shortens incident response times. You will take full ownership of production systems, from capacity planning to disaster recovery, to keep the infrastructure resilient and scalable. You will also collaborate with team leads and other SREs to implement best practices, refine incident response procedures, and contribute to the reliability and performance of the LaaS platform. Your expertise in incident handling, system optimization, and proactive problem-solving will play a vital role in maintaining and raising the high standards of the SRE department at HRS.
If you have 3-5 years of experience in site reliability engineering or related areas, a Bachelor's degree in Computer Science, Engineering, or a related field, and proficiency in Java, Python, AWS cloud services, and monitoring tools (New Relic, Kibana, Prometheus, Grafana, Elasticsearch), we invite you to join our team and contribute to shaping the future of business travel at HRS.
Posted 21 hours ago