Posted: 3 weeks ago
Work from Office
Full Time
Job Title: Site Reliability Engineer 1 (SRE 1)

Overview:
As a Site Reliability Engineer 1 (SRE 1) within the DevOps team, you will be instrumental in maintaining and enhancing the reliability, performance, and scalability of our infrastructure. Your role will involve proactive monitoring, incident response, and continuous improvement of our systems. You will collaborate with clients for deployment support and ensure adherence to best practices in site reliability engineering.

Responsibilities:
1. Infrastructure Monitoring and Support: Continuously monitor infrastructure using tools such as Prometheus, Alertmanager, and Grafana to ensure optimal performance and reliability. Implement and maintain monitoring solutions to detect and address issues proactively.
2. Client Interaction and Deployment Support: Collaborate with clients to provide deployment support, ensuring seamless implementation and operational functionality. Assist in on-prem and cloud-based deployments as needed.
3. Troubleshooting and Incident Management: Utilize expertise in Linux, Docker, and networking to diagnose and resolve technical issues efficiently. Manage incidents from detection to resolution, ensuring minimal impact on service availability.
4. Problem Management and Root Cause Analysis: Conduct root cause analysis for recurring issues and implement solutions to prevent future occurrences. Identify and mitigate potential problems to enhance system stability.
5. 24/7 Shift Support: Participate in a rotating shift schedule to provide 24/7 support, ensuring continuous monitoring and rapid response to incidents. Maintain high availability and reliability of services during off-hours.
6. Maintenance and Change Management: Execute maintenance tasks and implement change requests as per Standard Operating Procedures (SOPs). Ensure adherence to change management protocols to maintain system integrity.
7. Documentation and Reporting: Maintain comprehensive documentation of incidents, changes, and maintenance activities. Generate and analyze reports to identify areas for performance improvement and optimization.

Requirements:
Technical Proficiency: Strong understanding of IT infrastructure with expertise in monitoring tools like Prometheus, Alertmanager, and Grafana. Proficiency in Linux, Docker, and networking is essential for troubleshooting and incident management.
Client Communication: Excellent interpersonal skills for effective client interaction and support.
Incident and Problem Management: Experience in managing incidents, conducting root cause analysis, and implementing problem management strategies.
Adaptability and Shift Work: Ability to work in rotating shifts for 24/7 support and adapt to dynamic operational needs.
SOP Adherence: Ability to follow defined SOPs for maintenance tasks and change implementations.