Site Reliability Engineer

3 - 7 years

0 Lacs

Posted: 18 hours ago | Platform: Shine

Work Mode

On-site

Job Type

Full Time

Job Description

As a Site Reliability Engineer at HRS, you will play a crucial role in ensuring the reliability, scalability, and performance of the Lodging-as-a-Service (LaaS) platform. Collaborating across engineering, operations, and development teams, you will implement reliability standards, maintain the infrastructure architecture, and work toward operational excellence while meeting service level objectives (SLOs) and reducing toil.

Your main responsibility will be incident handling: identifying, responding to, and resolving production issues to minimize their impact on services. You will participate in on-call rotations, which call for quick thinking and decisive action during critical incidents, staying calm under pressure, and making data-driven decisions to uphold the platform's reliability. You will also contribute to the reliability roadmap, support platform observability, and drive automation initiatives that enhance system resilience. Daily monitoring of critical metrics such as error budgets, mean time to recovery (MTTR), and service level indicators (SLIs) will help ensure optimal platform performance and availability. Technical expertise in cloud infrastructure, distributed systems, and automation, coupled with problem-solving and incident management skills, is essential in this position.

Operating according to HRS' leadership principles, the SRE department prioritizes system reliability and customer experience. It embraces a culture of blameless post-mortems, continuous improvement, and proactive problem-solving, and you will actively participate in incident reviews to prevent recurrences and improve overall system reliability. As an SRE at HRS, you will also explore new technologies and methodologies to improve system reliability and operational efficiency.

Your responsibilities include working with infrastructure as code, maintaining robust monitoring and alerting systems, and developing automation that reduces manual intervention and shortens incident response times. You will take full ownership of production systems, from capacity planning to disaster recovery, to keep the infrastructure resilient and scalable. You will collaborate with team leads and other SREs to implement best practices, refine incident response procedures, and contribute to the reliability and performance of the LaaS platform. Your expertise in incident handling, system optimization, and proactive problem-solving will play a vital role in maintaining and raising the high standards of the SRE department at HRS.

If you have 3-5 years of experience in site reliability engineering or related areas, a Bachelor's degree in Computer Science, Engineering, or a related field, and proficiency in Java, Python, AWS cloud services, and monitoring tools (New Relic, Kibana, Prometheus, Grafana, ElasticSearch), we invite you to join our team and contribute to shaping the future of business travel at HRS.
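For readers less familiar with the metrics named in the description, the following is a minimal Python sketch of how an availability SLI, the remaining error budget, and MTTR are commonly computed. It is an illustration only, not HRS's tooling: the function names, the request-based SLI definition, and the sample numbers are all assumptions made for this example.

```python
from datetime import timedelta


def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that succeeded (a request-based SLI, assumed here)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the error budget still unspent for the measured window.

    The budget is the number of failures the SLO tolerates:
    (1 - slo_target) * total_requests. A negative result means the
    budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def mean_time_to_recovery(recovery_times: list[timedelta]) -> timedelta:
    """Average time from incident start to recovery."""
    if not recovery_times:
        raise ValueError("at least one incident is required")
    return sum(recovery_times, timedelta()) / len(recovery_times)


if __name__ == "__main__":
    # Hypothetical sample window: 1,000,000 requests, 400 failures, 99.9% SLO.
    print("SLI:", availability_sli(1_000_000, 400))
    print("Budget left:", error_budget_remaining(0.999, 1_000_000, 400))
    incidents = [timedelta(minutes=m) for m in (30, 60, 90)]
    print("MTTR:", mean_time_to_recovery(incidents))
```

In practice these figures usually come from a monitoring stack such as the Prometheus and Grafana tools listed above rather than from hand-written scripts; the sketch only shows the arithmetic behind them.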

HRS Group

Travel & Hospitality Technology

Münster

Hyderabad, Chennai, Bengaluru