SRE Principal Engineer - Technical Lead

12 years

0 Lacs

Hyderabad, Telangana, India

Posted: 2 weeks ago | Platform: LinkedIn


Skills Required

management splunk datadog azure python pipeline kubernetes docker ansible devops terraform analysis monitoring reliability engineering automation optimization resolve leadership tooling analyze drive technology deployment configuration chef aws integration gitlab jenkins strategies code troubleshooting service latency instrumentation development software automate compliance

Work Mode

On-site

Job Type

Full Time

Job Description

Change Management, Incident Response, Dynatrace, Grafana, Splunk, Datadog, New Relic, Azure, Python, CI/CD/CT Pipeline, Kubernetes, Docker, Ansible, DevOps, Terraform, Root Cause Analysis (RCA), SLO/SLA Monitoring, E2E Implementation

Description

GSPANN is hiring a Principal Engineer - Technical Lead for Site Reliability Engineering (SRE) to lead reliability engineering initiatives in Pune or Hyderabad. This full-time role focuses on driving enterprise-wide observability, automation, and infrastructure optimization across global production systems.

Location: Pune / Hyderabad
Role Type: Full Time
Published On: 2 June 2025
Experience: 12 - 15 Years

Role and Responsibilities

- Demonstrate deep expertise in monitoring and observability tools such as Dynatrace, Splunk, Datadog, Grafana, and New Relic.
- Apply modern observability practices and tools across enterprise environments.
- Resolve organizational gaps in SRE implementation by designing scalable, long-term solutions.
- Lead cross-functional initiatives to adopt emerging technologies and reliability frameworks.
- Influence senior leadership on strategic decisions related to tooling, observability, and transformation.
- Analyze complex system issues, uncover performance bottlenecks, and drive root cause resolution.
- Drive automation and foster a culture of continuous improvement aligned with evolving technology trends.
- Manage cloud infrastructure efficiently, with a strong preference for Microsoft Azure experience.
- Write automation scripts proficiently, preferably using Python.
- Work with cloud deployment tools including Ansible, Terraform, and Azure DevOps.
- Architect and operate containerized environments using Kubernetes and Docker.
- Utilize configuration management solutions such as Chef, Ansible, and AWS CodeDeploy.
- Implement and optimize Continuous Integration/Continuous Deployment (CI/CD) pipelines using tools like GitLab, Jenkins, Bamboo, Travis CI, and CircleCI.
- Solve technical issues independently and deliver sustainable solutions with minimal supervision.
- Lead change and incident management processes, while driving strategic SRE transformation at scale.
- Standardize observability across teams with end-to-end (E2E) implementation and innovative approaches.
- Champion enterprise-grade monitoring strategies using industry-leading tools.
- Build scalable infrastructure using Infrastructure as Code (IaC) principles and technologies.
- Exhibit soft skills such as visionary thinking, proactive leadership, and deep-rooted troubleshooting expertise.
- Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Coordinate and lead incident response while conducting thorough Root Cause Analysis (RCA).

Skills and Experience

- Bachelor's degree in Computer Science, Information Science, Engineering, or a related field.
- 12+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a strong focus on managing production systems.
- Ensure high availability, low latency, optimal performance, and cost-efficient operations for global e-commerce platforms.
- Spearhead change and incident management across business-critical systems.
- Mentor and guide product teams in embedding observability and operational excellence throughout the delivery pipeline.
- Architect and deploy unified, end-to-end observability dashboards tailored for engineering and business stakeholders.
- Define instrumentation standards and build reusable patterns to scale best practices across teams.
- Collaborate with cross-functional stakeholders to integrate reliability into every stage of product development.
- Develop proprietary tools that close gaps in software delivery and incident response.
- Lead the adoption of SRE best practices to systematically improve resilience and uptime.
- Automate key operations to ensure rapid and effective incident handling.
- Monitor and enforce compliance with SLOs and ensure uninterrupted availability of mission-critical services.
- Continuously optimize infrastructure to lower operational costs and seamlessly manage demand surges.
