Lead Site Reliability Engineer

4 - 8 years

11 - 15 Lacs

Posted: 3 hours ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

As an SRI Engineer, you will operate at the intersection of reliability, data intelligence, and automation. You will monitor and optimize the health of Azure and AWS environments, use AI-driven insights to prevent incidents, and continuously improve system performance and scalability.

You'll be part of a global 24/7 reliability team, collaborating closely with Cloud Infrastructure Engineering (CIE), DevOps, Customer Success, and Product teams to ensure SymphonyAI's SaaS platforms consistently meet or exceed customer SLAs and availability targets.

This role is ideal for engineers who combine deep technical skill with curiosity, problem-solving, and a passion for customer excellence.

Key Responsibilities

1. Monitoring, Observability & Proactive Detection

  • Maintain real-time visibility of all customer environments using Datadog, CloudWatch, Azure Monitor, and Prometheus.
  • Develop advanced monitoring dashboards, synthetic checks, and trend analyses to detect early warning signals.
  • Use machine learning and anomaly detection (e.g., Datadog AIOps, Azure AI) to predict and prevent outages.
  • Continuously tune monitoring thresholds to reduce noise while maximizing incident insight.
  • Establish performance baselines and proactively address deviations before an SLA breach.

2. Automation, AI & Operational Excellence

  • Build intelligent automation using Power Automate, Datadog Workflows, Azure Logic Apps, and Defender for Cloud to reduce manual interventions.
  • Design auto-remediation workflows that fix common or predictable issues in real time.
  • Contribute to the creation and enhancement of AI-driven playbooks that guide faster triage, root-cause identification, and resolution.
  • Partner with the internal GenAI and Automation teams to develop custom models improving incident response and capacity management.
  • Champion a Zero Toil mindset: automate repetitive tasks, streamline response, and free engineers for innovation.

3. Reliability Engineering & Continuous Improvement

  • Ensure all SaaS environments achieve or exceed defined SLOs (Service Level Objectives) and SLAs.
  • Use trend and correlation analysis to identify recurring issues, performance degradations, and potential bottlenecks.
  • Participate in game days and chaos testing to validate system resilience and recovery readiness.
  • Drive post-incident reviews (PIRs) that focus on learning, not blame, and ensure that preventive measures are implemented.
  • Collaborate on release readiness reviews and verify reliability criteria before deployments and updates.

4. Cross-Functional Collaboration & Customer Engagement

  • Partner with Cloud Infrastructure Engineering (CIE) and DevOps to improve release pipelines, scaling, and observability.
  • Collaborate with Service Delivery Managers (SDMs) to interpret trends and communicate proactively with customers.
  • Participate in major incident bridges, communicating clearly, confidently, and constructively with both customers and executives.
  • Support customer onboarding, environment validation, and performance benchmarking.
  • Act as a reliability advocate internally, educating teams on best practices for operability and monitoring.

5. 24/7 Global Operations

  • Work as part of a follow-the-sun model, ensuring continuous global coverage.
  • Take part in on-call rotations, providing leadership during major incidents and escalations.
  • Maintain accurate handovers and documentation between shifts, ensuring transparency and continuity.

Required Skills & Experience

  • Proven experience as a Site Reliability Engineer, Cloud Operations Engineer, or SaaS Support Engineer.
  • Strong proficiency in Azure and AWS monitoring and automation ecosystems.
  • Expert knowledge of observability platforms such as Datadog, CloudWatch, Azure Monitor, and Grafana.
  • Hands-on experience with automation frameworks (Power Automate, Terraform, Ansible, Azure Logic Apps, or similar).
  • Familiarity with AIOps and intelligent alerting platforms.
  • Strong grasp of ITIL practices, particularly Incident, Problem, Change, and Event Management.
  • Skilled in Kubernetes, AKS, and containerized environments.
  • Understanding of databases (PostgreSQL, Oracle, etc.) and performance tuning.
  • Excellent customer communication and presentation skills; able to explain technical issues to non-technical stakeholders confidently.
  • Experience working in 24/7 operational models.

Preferred Experience

  • Exposure to SymphonyAI products such as Sensa, NetReveal, InvestigationHub, or DataHub.
  • Experience building or integrating AI-based monitoring or predictive analytics solutions.
  • Experience in financial services or regulated industries (AML, KYC, Fraud, WLM).
  • Scripting in Python, Bash, or PowerShell.
  • Familiarity with DevOps, CI/CD, and Infrastructure-as-Code (IaC) practices.
  • Certifications: Azure Administrator/DevOps, AWS Solutions Architect, or Datadog Certified Professional.

What Success Looks Like

  • Customer environments consistently meet or exceed uptime, SLA, and availability goals.
  • Automation replaces manual effort, reducing incident MTTR and improving recovery.
  • Trends and AI insights drive proactive prevention rather than reactive firefighting.
  • Customers experience confidence and trust through transparent, data-driven communication.
  • The SRI team becomes recognized as a world-class reliability and intelligence function within SymphonyAI.
