Position Title
Sr. D&T System Reliability Engineer
Function/Group
Digital & Technology
Location
Mumbai/Pune
Shift Timing
Regular
Role Reports to
Manager SRE
Remote/Hybrid/in-Office
Hybrid
JOB OVERVIEW
KEY ACCOUNTABILITIES
Provide technical expertise and guidance across the SRE team, acting as a subject matter expert and leading the adoption of best practices in implementing SRE.
Help the SRE team provide technical assurance on significant projects, ensuring quality technical deliverables that may involve several teams or technologies.
Help deliver SRE objectives and priorities, such as implementing a strategic Monitoring and Observability framework that scales across multiple stakeholders.
Collaborate with tech leads and application architects to build new infrastructure in the cloud and ensure the stability and scalability of our internal systems.
Help teams with Observability, eliminating tech debt, Tech Resilience, Automation, and AI capabilities.
Ensure SLOs with SLIs are defined for services so that the organization can keep its operational objectives closely aligned.
Implement new technologies to build the future of Application Hosting capabilities and to shape how applications are built and delivered.
Investigate and resolve complex and multi-faceted issues, spanning the entire technology stack, which require working across teams and technology boundaries.
Proactively improve site reliability and key metrics, such as uptime, application performance, time to issue resolution, time spent resolving incidents, and other key operational SLAs.
Primary Experience and Expertise
- A total of 6-8 years of experience in design, automation, and solution deployment (infrastructure/platforms/applications) on hybrid cloud, preferably with an AI-native approach.
- 3+ years of experience in GCP and Infrastructure as Code (IaC) tools such as Terraform, Ansible, Google Cloud Deployment Manager, and Azure Resource Manager, and in designing cloud solutions.
- In-depth understanding of Monitoring and Observability tools such as Datadog, GCP/Azure Cloud Monitoring, Grafana, Splunk, and Instatus. Experience in OpenTelemetry.
- Expertise in CI/CD pipelines and GitOps. Hands-on experience with GitHub, GitHub Actions, HashiCorp Vault, and other DevOps toolchains.
- Experience in operating AIOps (Moogsoft, BigPanda, etc.).
- Experience in Gen AI and Agents/Agentic AI driven observability and auto-remediation.
- Experience in Intelligent DevOps/CI-CD toolchains infused by cloud-native AI agents/frameworks.
- Proficiency in scripting and automation using Python, Bash, or similar languages.
- Experience in troubleshooting and debugging deployment pipelines and DevOps toolsets.
- Strong knowledge of containerization, Kubernetes, and cloud platforms (GCP, Azure), and of deployment pipelines using GitHub Actions workflows.
- Proven track record of improving system reliability, performance, and scalability in complex systems.
- Good knowledge of Linux and/or Windows administration and troubleshooting.
- Strong problem-solving skills and the ability to make decisions under pressure.
- Experience working in Agile teams, defining SLOs and SLIs for products.
- Excellent Communication Skills to effectively collaborate with cross-functional teams.
- Desired knowledge in multiple areas:
- Deployment, building, scanning, and monitoring of the applications in Cloud (GCP)
- Traffic load balancers such as GCP Load balancers, F5, etc.
- Networking protocols and topology, Firewalls, DNS, TCP/IP, HTTP
- Kafka and Elastic.
- Debugging tools
- Server configuration and hardening
- SRE practices, cloud solutioning, and ways of working across multi-disciplinary teams, business frameworks, and culture.
Preferred Qualification
- Master's Degree in Computer Science, Engineering, Information Technology, or a related field.
- GCP (or equivalent cloud provider) certifications such as Professional Cloud DevOps Engineer or Professional Cloud Architect.