Job Description
We are looking for a Platform Production Engineer to join our team and take ownership of scaling, automating, and optimizing our multi-cloud platform infrastructure across AWS, GCP, and Azure. In this role, you will design, implement, and operate highly available and efficient systems that power mission-critical applications. Your expertise in Kubernetes, cloud infrastructure, and distributed systems will be key in driving reliability, automation, and operational excellence to deliver a seamless experience for our customers and internal teams.

What you will do:
• Ensure platform reliability and performance: Monitor, troubleshoot, and optimize production systems running on Kubernetes (EKS, GKE, AKS).
• Automate operations: Develop and maintain automation for infrastructure provisioning, scaling, and incident response.
• Incident response & on-call support: Participate in on-call rotations to quickly detect, mitigate, and resolve production incidents.
• Kubernetes upgrades & management: Own and drive Kubernetes version upgrades, node pool scaling, and security patches.
• Observability & monitoring: Implement and refine observability tools (Datadog, Prometheus, Splunk, VictoriaMetrics, etc.) for proactive monitoring and alerting.
• Infrastructure as Code (IaC): Manage infrastructure using Terraform, Terragrunt, Helm, and Kubernetes manifests.
• CI/CD & release automation: Build, maintain, and improve CI/CD pipelines using GitHub Actions, ArgoCD, and related tooling to streamline application delivery and platform updates.
• Cross-functional collaboration: Work closely with developers, SREs, and other teams to improve platform stability.
• Performance tuning: Analyze and optimize cloud and containerized workloads for cost efficiency and high availability.
• Security & compliance: Follow platform security best practices, support incident response, and maintain compliance.

Required education: Bachelor's Degree
Preferred education: Master's Degree

Required technical and professional expertise:
• 7-9 years of relevant experience.
• Strong expertise in Kubernetes (EKS, GKE, AKS) and container orchestration.
• Experience with AWS, GCP, or Azure, particularly in managing large-scale cloud infrastructure.
• Proficiency in Terraform, Helm, and Infrastructure as Code (IaC).
• Strong understanding of Linux systems, networking, and security best practices.
• Experience with monitoring & logging tools (Datadog, Splunk, Prometheus, Grafana, VictoriaMetrics, etc.).
• Hands-on experience with automation & scripting (Python, Go, Bash).
• Experience in incident management & debugging complex distributed systems.
• Familiarity with CI/CD pipelines and release automation.