Site Reliability Engineer

1 - 4 years

3.0 - 6.0 Lacs P.A.

Gurugram

Posted: 3 weeks ago | Platform: Naukri

Skills Required

Root cause analysis, Change management, Linux, Networking, Problem management, Incident management, Troubleshooting, Continuous improvement, Operations, Performance improvement

Work Mode

Work from Office

Job Type

Full Time

Job Description

Job Title: Site Reliability Engineer 1 (SRE 1)

Overview: As a Site Reliability Engineer 1 (SRE 1) within the DevOps team, you will be instrumental in maintaining and enhancing the reliability, performance, and scalability of our infrastructure. Your role will involve proactive monitoring, incident response, and continuous improvement of our systems. You will collaborate with clients for deployment support and ensure adherence to best practices in site reliability engineering.

Responsibilities:

1. Infrastructure Monitoring and Support: Continuously monitor infrastructure using tools such as Prometheus, Alertmanager, and Grafana to ensure optimal performance and reliability. Implement and maintain monitoring solutions to detect and address issues proactively.
2. Client Interaction and Deployment Support: Collaborate with clients to provide deployment support, ensuring seamless implementation and operational functionality. Assist in on-prem and cloud-based deployments as needed.
3. Troubleshooting and Incident Management: Utilize expertise in Linux, Docker, and networking to diagnose and resolve technical issues efficiently. Manage incidents from detection to resolution, ensuring minimal impact on service availability.
4. Problem Management and Root Cause Analysis: Conduct root cause analysis for recurring issues and implement solutions to prevent future occurrences. Identify and mitigate potential problems to enhance system stability.
5. 24/7 Shift Support: Participate in a rotating shift schedule to provide 24/7 support, ensuring continuous monitoring and rapid response to incidents. Maintain high availability and reliability of services during off-hours.
6. Maintenance and Change Management: Execute maintenance tasks and implement change requests as per Standard Operating Procedures (SOPs). Ensure adherence to change management protocols to maintain system integrity.
7. Documentation and Reporting: Maintain comprehensive documentation of incidents, changes, and maintenance activities. Generate and analyze reports to identify areas for performance improvement and optimization.

Requirements:

Technical Proficiency: Strong understanding of IT infrastructure with expertise in monitoring tools like Prometheus, Alertmanager, and Grafana. Proficiency in Linux, Docker, and networking is essential for troubleshooting and incident management.
Client Communication: Excellent interpersonal skills for effective client interaction and support.
Incident and Problem Management: Experience in managing incidents, conducting root cause analysis, and implementing problem management strategies.
Adaptability and Shift Work: Ability to work in rotating shifts for 24/7 support and adapt to dynamic operational needs.
SOP Adherence: Ability to follow defined SOPs for maintenance tasks and change implementations.

Knowmax
