Site Reliability Engineer

HRS Group

3 - 7 years

0 Lacs

Punjab

Posted: 3 months ago | Platform: Shine

Skills Required

Java, Python, Kibana, Elasticsearch, Kubernetes, networking, distributed systems, AWS cloud services, monitoring tools, New Relic, Prometheus, Grafana, infrastructure as code tools, Terraform, CloudFormation, containerization, orchestration, Docker, scripting, automation

Work Mode

On-site

Job Type

Full Time

Job Description

As a Site Reliability Engineer at HRS, you will play a crucial role in ensuring the reliability, scalability, and performance of the Lodging-as-a-Service (LaaS) platform. Collaborating across engineering, operations, and development teams, you will implement reliability standards, maintain infrastructure architecture, and achieve operational excellence while adhering to service level objectives (SLOs) and reducing toil.

Your main responsibility will be incident handling: you will be at the forefront of identifying, responding to, and resolving production issues to minimize their impact on services. On-call rotations are part of the role and call for quick thinking and decisive action during critical incidents, which means remaining calm under pressure and making data-driven decisions to uphold the platform's reliability.

You will also contribute to the reliability roadmap, support platform observability, and drive automation initiatives that enhance system resilience. You will monitor critical metrics such as error budgets, mean time to recovery (MTTR), and service level indicators (SLIs) daily to keep platform performance and availability on target. Technical expertise in cloud infrastructure, distributed systems, and automation, coupled with problem-solving and incident management skills, is essential in this position.

The SRE department operates according to HRS' leadership principles and prioritizes system reliability and customer experience. It embraces a culture of blameless post-mortems, continuous improvement, and proactive problem-solving, and you will actively participate in incident reviews to prevent recurrences and enhance overall system reliability. As an SRE at HRS, you will innovate by exploring new technologies and methodologies to improve system reliability and operational efficiency.

Your responsibilities include working with infrastructure as code, maintaining robust monitoring and alerting systems, and developing automation that reduces manual intervention and shortens incident response times. You will take full ownership of production systems, from capacity planning to disaster recovery, to ensure resilient and scalable infrastructure. You will collaborate with team leads and other SREs to implement best practices, refine incident response procedures, and contribute to the reliability and performance of the LaaS platform. Your expertise in incident handling, system optimization, and proactive problem-solving will play a vital role in maintaining and raising the high standards of the SRE department at HRS.

If you have 3-5 years of experience in site reliability engineering or related areas, a Bachelor's degree in Computer Science, Engineering, or a related field, and proficiency in Java, Python, AWS cloud services, and monitoring tools (New Relic, Kibana, Prometheus, Grafana, Elasticsearch), we invite you to join our team and contribute to shaping the future of business travel at HRS.