Posted: 3 months ago
Work from Office
Full Time
Design, build, and maintain scalable, reliable, and high-performance infrastructure and services. Implement and manage monitoring, alerting, and automated remediation tools to ensure maximum uptime. Participate in on-call rotations, resolve production incidents, and improve the reliability of systems.

Work closely with development teams to ensure smooth application deployment through continuous integration/continuous deployment (CI/CD) pipelines. Develop and maintain system observability frameworks, including logs, metrics, and tracing. Drive the implementation of SLOs (Service Level Objectives) and SLIs (Service Level Indicators), ensuring systems meet reliability targets. Write Python scripts to automate system operations and improve the deployment process.

Build and manage services using GCP resources such as Compute Engine, Kubernetes Engine, Cloud Functions, BigQuery, and Cloud Storage. Ensure proper integration and optimization between GCP services and the overall architecture. Leverage GCP tools for cost management, security, and performance monitoring.

Collaborate with engineering teams to integrate reliability practices into the software development lifecycle. Provide guidance on best practices for system design, disaster recovery, and fault tolerance. Lead post-incident analysis and conduct retrospectives to prevent future issues. Mentor junior engineers and help build a culture of reliability and high availability.

6+ years of experience in Site Reliability Engineering or related roles. Strong experience in Python programming, including automation, scripting, and developing tools for system management. Expertise in Google Cloud Platform (GCP) services, including Compute Engine, Kubernetes Engine, Cloud Functions, and more. In-depth understanding of cloud infrastructure, containerization (Docker, Kubernetes), and orchestration. Experience with monitoring, logging, and alerting tools (Prometheus, Grafana, Stackdriver, etc.). Proven experience with CI/CD pipelines, version control (Git), and automation tools (Ansible, Terraform, etc.). Strong understanding of networking, load balancing, and high-availability architectures. Experience in incident management and working in an on-call rotation.
UST