12 years
Hyderabad, Telangana, India
Posted: 2 weeks ago
On-site
Full Time
Change Management, Incident Response, Dynatrace, Grafana, Splunk, Datadog, New Relic, Azure, Python, CI/CD/CT Pipeline, Kubernetes, Docker, Ansible, DevOps, Terraform, Root Cause Analysis (RCA), SLO/SLAs Monitoring, E2E Implementation

Description
GSPANN is hiring a Principal Engineer – Technical Lead for Site Reliability Engineering (SRE) to lead reliability engineering initiatives in Pune or Hyderabad. This full-time role focuses on driving enterprise-wide observability, automation, and infrastructure optimization across global production systems.

Location: Pune / Hyderabad
Role Type: Full Time
Published On: 2 June 2025
Experience: 12 - 15 Years

Role and Responsibilities
Demonstrate deep expertise in monitoring and observability tools such as Dynatrace, Splunk, Datadog, Grafana, and New Relic.
Apply modern observability practices and tools across enterprise environments.
Resolve organizational gaps in SRE implementation by designing scalable, long-term solutions.
Lead cross-functional initiatives to adopt emerging technologies and reliability frameworks.
Influence senior leadership on strategic decisions related to tooling, observability, and transformation.
Analyze complex system issues, uncover performance bottlenecks, and drive root cause resolution.
Drive automation and foster a culture of continuous improvement aligned with evolving technology trends.
Manage cloud infrastructure efficiently, with a strong preference for Microsoft Azure experience.
Write automation scripts proficiently, preferably using Python.
Work with cloud deployment tools including Ansible, Terraform, and Azure DevOps.
Architect and operate containerized environments using Kubernetes and Docker.
Utilize configuration management solutions such as Chef, Ansible, and AWS CodeDeploy.
Implement and optimize Continuous Integration/Continuous Deployment (CI/CD) pipelines using tools like GitLab, Jenkins, Bamboo, Travis CI, and CircleCI.
Solve technical issues independently and deliver sustainable solutions with minimal supervision.
Lead change and incident management processes, while driving strategic SRE transformation at scale.
Standardize observability across teams with end-to-end (E2E) implementation and innovative approaches.
Champion enterprise-grade monitoring strategies using industry-leading tools.
Build scalable infrastructure using Infrastructure as Code (IaC) principles and technologies.
Exhibit soft skills such as visionary thinking, proactive leadership, and deep-rooted troubleshooting expertise.
Define, implement, and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
Coordinate and lead incident response while conducting thorough Root Cause Analysis (RCA).

Skills And Experience
Bachelor's degree in Computer Science, Information Science, Engineering, or a related field.
12+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a strong focus on managing production systems.
Ensure high availability, low latency, optimal performance, and cost-efficient operations for global e-commerce platforms.
Spearhead change and incident management across business-critical systems.
Mentor and guide product teams in embedding observability and operational excellence throughout the delivery pipeline.
Architect and deploy unified, end-to-end observability dashboards tailored for engineering and business stakeholders.
Define instrumentation standards and build reusable patterns to scale best practices across teams.
Collaborate with cross-functional stakeholders to integrate reliability into every stage of product development.
Develop proprietary tools that close gaps in software delivery and incident response.
Lead the adoption of SRE best practices to systematically improve resilience and uptime.
Automate key operations to ensure rapid and effective incident handling.
Monitor and enforce compliance with SLOs and ensure uninterrupted availability of mission-critical services.
Continuously optimize infrastructure to lower operational costs and seamlessly manage demand surges.
GSPANN Technologies, Inc
Salary: Not disclosed