Posted: 3 months ago
Work from Office
Full Time
Design, implement, and maintain scalable, highly reliable cloud infrastructure using Google Cloud Platform (GCP) services such as Compute Engine, Kubernetes Engine, Cloud Functions, and BigQuery.
Write Python scripts to automate operations and deployment processes and to improve system performance.
Collaborate with engineering teams to improve system architecture, application deployment, and continuous integration/continuous deployment (CI/CD) pipelines.
Develop and maintain system observability frameworks, including logs, metrics, and tracing, to ensure visibility into system health and performance.
Implement and manage monitoring and alerting systems using tools like Prometheus, Grafana, or Stackdriver to ensure system reliability and uptime.
Participate in on-call rotations to address production incidents and drive incident management and root cause analysis.
Improve system performance, cost management, and security using GCP-native tools.
Define and track SLOs (Service Level Objectives) and SLIs (Service Level Indicators) to ensure that systems meet reliability targets.
Automate and streamline processes for system provisioning, configuration, and deployment.
Conduct post-incident reviews to identify areas for improvement and prevent recurrence of issues.

4+ years of experience in Site Reliability Engineering (SRE), DevOps, or similar roles.
Strong experience with Python programming, including automation, scripting, and system management tools.
Hands-on experience with Google Cloud Platform (GCP) services such as Compute Engine, Kubernetes Engine, Cloud Functions, and BigQuery.
Strong understanding of containerization and orchestration tools, particularly Docker and Kubernetes.
Proficiency in monitoring and alerting tools such as Prometheus, Grafana, Stackdriver, or similar.
Experience working with CI/CD tools and practices (e.g., GitLab, Jenkins).
Solid understanding of system performance optimization, security, and cost management practices on GCP.
Strong knowledge of networking concepts, high-availability architectures, and system troubleshooting techniques.
Experience with infrastructure automation and configuration management tools (e.g., Terraform, Ansible).
Experience in production environment management, incident resolution, and on-call support.
Good understanding of software development practices and of collaborating with development teams to improve reliability.
UST
Trivandrum
13.0 - 15.0 Lacs P.A.