Site Reliability Engineer (SRE) - L1 Commander

4 - 6 years

6.0 - 8.0 Lacs P.A.

Bengaluru

Posted:2 weeks ago| Platform: Naukri logo

Apply Now

Skills Required

Site Reliability EngineeringRDSAzureNetworkingSREVPCDevOpsDockerGCPHelmchartsECS/EKSRoute 53AWS EC2AWSLambdaKubernetes

Work Mode

Work from Office

Job Type

Full Time

Job Description

We are looking for Site Reliability Engineer! Youll make a difference by: SRE L1 Commander is responsible for ensuring the stability, availability, and performance of critical systems and services. As the first line of defense in incident management and monitoring, the role requires real-time response, proactive problem solving, and strong coordination skills to address production issues efficiently. Monitoring and Alerting: Proactively monitor system health, performance, and uptime using monitoring tools like Datadog, Prometheus. Serving as the primary responder for incidents to troubleshoot and resolve issues quickly, ensuring minimal impact on end-users. Accurately categorizing incidents, prioritize them based on severity, and escalate to L2/L3 teams when necessary. Ensuring systems meet Service Level Objectives (SLOs) and maintain uptime as per SLAs. Collaborating with DevOps and L2 teams to automate manual processes for incident response and operational tasks. Performing root cause analysis (RCA) of incidents using log aggregators and observability tools to identify patterns and recurring issues. Following predefined runbooks/playbooks to resolve known issues and document fixes for new problems. Youd describe yourself as: Experienced professional with 4 to 6 years of relevant experience in SRE, DevOps, or Production Support with monitoring tools (e.g., Prometheus, Datadog). Working knowledge of Linux/Unix operating systems and basic scripting skills (Python, Gitlab actions) cloud platforms (AWS, Azure, or GCP). Familiarity with container orchestration (Kubernetes, Docker, Helmcharts) and CI/CD pipelines. Exposure with ArgoCD for implementing GitOps workflows and automated deployments for containerized applications. Possessing experience in Monitoring: Datadog, Infrastructure: AWS EC2, Lambda, ECS/EKS, RDS, Networking: VPC, Route 53, ELB and Storage: S3, EFS, Glacier. Strong troubleshooting and analytical skills to resolve production incidents effectively. Basic understanding of networking concepts (DNS, Load Balancers, Firewalls). Good communication and interpersonal skills for incident communication and escalation. Having preferred certifications: AWS Certified SysOps Administrator Associate, AWS Certified Solutions Architect Associate or AWS Certified DevOps Engineer Professional

Automation Machinery Manufacturing
Munich Brande +

RecommendedJobs for You