Job Description
We are looking for a Platform Production Engineer to join our team and take ownership of scaling, automating, and optimizing our multi-cloud platform infrastructure across AWS, GCP, and Azure. In this role, you will design, implement, and operate highly available and efficient systems that power mission-critical applications. Your expertise in Kubernetes, cloud infrastructure, and distributed systems will be key in driving reliability, automation, and operational excellence to deliver a seamless experience for our customers and internal teams.
What you will do:
• Ensure platform reliability and performance: Monitor, troubleshoot, and optimize production systems running on Kubernetes (EKS, GKE, AKS).
• Automate operations: Develop and maintain automation for infrastructure provisioning, scaling, and incident response.
• Incident response & on-call support: Participate in on-call rotations to quickly detect, mitigate, and resolve production incidents.
• Kubernetes upgrades & management: Own and drive Kubernetes version upgrades, node pool scaling, and security patches.
• Observability & monitoring: Implement and refine observability tools (Datadog, Prometheus, Splunk, VictoriaMetrics, etc.) for proactive monitoring and alerting.
• Infrastructure as Code (IaC): Manage infrastructure using Terraform, Terragrunt, Helm, and Kubernetes manifests.
• CI/CD & release automation: Build, maintain, and improve CI/CD pipelines using GitHub Actions, ArgoCD, and related tooling to streamline application delivery and platform updates.
• Cross-functional collaboration: Work closely with developers, SREs, and other teams to improve platform stability.
• Performance tuning: Analyze and optimize cloud and containerized workloads for cost efficiency and high availability.
• Security & compliance: Enforce platform security best practices, support incident response, and maintain compliance.
Required education: Bachelor's Degree
Preferred education: Master's Degree
Required technical and professional expertise:
• 7-9 years of relevant experience.
• Strong expertise in Kubernetes (EKS, GKE, AKS) and container orchestration.
• Experience with AWS, GCP, or Azure, particularly in managing large-scale cloud infrastructure.
• Proficiency in Terraform, Helm, and Infrastructure as Code (IaC).
• Strong understanding of Linux systems, networking, and security best practices.
• Experience with monitoring & logging tools (Datadog, Splunk, Prometheus, Grafana, VictoriaMetrics, etc.).
• Hands-on experience with automation & scripting (Python, Go, Bash).
• Experience in incident management & debugging complex distributed systems.
• Familiarity with CI/CD pipelines and release automation.