SRE Engineer

5 - 10 years

8.0 - 13.0 Lacs P.A.

Bengaluru

Posted: 2 months ago | Platform: Naukri

Skills Required

Python, Java, Docker, GCP, Go, system design, Prometheus, networking, Grafana, ELK Stack, Kubernetes

Work Mode

Work from Office

Job Type

Full Time

Job Description

Key Responsibilities

Command Center Design & Implementation:
- Architect and implement a centralized command center that provides comprehensive visibility into both infrastructure and application layers
- Establish standardized operational procedures, runbooks, and escalation protocols for incident management
- Design and implement monitoring solutions that provide real-time insights into system health, performance metrics, and business KPIs

Operations Management:
- Lead the development of automated remediation solutions for common operational issues
- Implement and maintain SLOs/SLIs across critical services and applications
- Drive continuous improvement in incident response times and system reliability metrics
- Collaborate with development teams to ensure applications are designed with operational excellence in mind

Tool Development & Integration:
- Develop and maintain monitoring dashboards that provide actionable insights for both technical and non-technical stakeholders
- Implement and customize monitoring tools for infrastructure and application performance monitoring
- Create automation scripts and tools to streamline operational processes
- Integrate various monitoring and alerting systems to provide a unified view of system health

Leadership & Collaboration:
- Mentor junior engineers in SRE practices and command center operations
- Collaborate with security, development, and infrastructure teams to ensure comprehensive monitoring coverage
- Partner with business stakeholders to align monitoring strategies with business objectives
- Lead post-incident reviews and drive implementation of learned improvements

Preferred Qualifications:
- Experience in designing and implementing enterprise-scale command centers
- Knowledge of AIOps and machine learning for IT operations
- Certification in relevant cloud platforms or technologies is good to have
- Experience with chaos engineering and resilience testing
- Background in implementing ITIL practices across any of the IT services

Technical and Professional Requirements:
- Bachelor's degree in Computer Science, Engineering, or related field
- 5+ years of experience in Site Reliability Engineering or similar roles
- Strong experience with cloud platforms (AWS/Azure/GCP) and infrastructure-as-code
- Extensive knowledge of monitoring tools (e.g., Prometheus, Grafana, ELK Stack)
- Proficiency in at least one programming language (Python, Go, or Java preferred)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Strong understanding of networking, system design, and distributed systems

Preferred Skills:
- Foundational -> Service Management -> ITIL
- Domain -> Telecom -> Operations Management
- Technology -> Cloud Security -> AWS - Infrastructure Security -> AWS Network Security Groups (NSG)
- Technology -> Cloud Security -> GCP - GRC
- Technology -> Cloud Platform -> Azure Networking Services -> Azure Bastion

Additional Responsibilities:
- Excellent problem-solving and analytical abilities
- Strong communication skills and ability to work with cross-functional teams
- Experience in incident management and on-call rotations
- Proven track record of improving system reliability and performance
- Ability to handle high-pressure situations and make quick decisions
- Strong documentation and technical writing skills

Educational Requirements: Bachelor of Engineering

Service Line: Cloud & Infrastructure Services

* Location of posting is subject to business requirements

IT Services and IT Consulting
Bangalore Karnataka +140
