Overview
We are seeking a skilled Platform Engineer to join our team and drive the development, deployment, and supportability of our Kubernetes-based microservices platform, which customers deploy on-premises. You will build comprehensive observability, enable log and report extraction for service cases where we have no real-time access to the customer environment, and reduce our over-reliance on Kafka by integrating Redis and batch processing. This role requires expertise in Kubernetes, Azure DevOps, C++ service support, deployment sizing, and designing for reliability, availability, and serviceability (RAS).
Responsibilities
Build Comprehensive Observability: Implement centralized metrics, logging, and tracing (e.g., Prometheus, Fluentd, OpenTelemetry) for .NET, Python, Java, C++, Kafka, and Redis, ensuring supportability in on-premises environments.
Enable Log/Report Extraction: Design customer-facing tools (e.g., CLI scripts, Helm chart options) to collect and export logs/metrics from on-premises deployments for service cases, without real-time access.
Optimize Kafka Usage: Audit and optimize Kafka configurations (e.g., topics, partitions, compression) to reduce metadata streaming overhead, monitored with Prometheus or Azure Monitor.
Implement Alternatives: Integrate Redis (e.g., Azure Cache for Redis) for metadata caching/pub-sub and batch processing (e.g., Azure Data Factory, Kubernetes Jobs) for high-volume data, reducing Kafka dependency.
Troubleshoot Customer Environments: Debug issues in on-premises customer deployments for services (C++, .NET, Python, Java), Kafka, and Redis, using exported logs and metrics.
Enhance Product Supportability: Build Azure DevOps pipelines and installers (e.g., Helm charts) for consistent, supportable deployments, with documentation for customer support.
Contribute to RAS: Own serviceability by building observability and diagnostic tools; support reliability/availability via Kubernetes optimization, autoscaling, and fault-tolerant designs.
Enforce Standards: Implement and enforce structured logging (e.g., JSON with correlation IDs) and resource sizing standards via Azure DevOps pipelines.
Optimize Deployment Sizing: Set Kubernetes resource requests/limits and autoscaling policies (e.g., HPA, VPA) for services, Kafka, Redis, and batch jobs, based on profiling.
Evaluate Service Meshes: Assess service meshes (e.g., Linkerd) for improving microservice and data platform observability and communication.
Support C++ Services: Assist developers in containerizing, deploying, and debugging C++ services, ensuring integration with observability, Kafka, Redis, or batch workflows.
Automate with Azure DevOps: Build CI/CD pipelines in Azure DevOps for automated builds, tests, and deployments, integrating with AKS, Kafka, and Redis.
Qualifications
Experience: 3–5 years with Kubernetes (including AKS), Azure DevOps pipelines, and Kafka administration.
Technical Skills:
- Expert in Kubernetes (CKA/CKAD preferred) and Azure DevOps (YAML pipelines, AKS integration).
- Proficient in observability tools (e.g., Prometheus, Grafana, Fluentd, OpenTelemetry, Azure Monitor) for metrics, logs, and tracing.
- Experience with on-premises Kubernetes deployments and log/report extraction for service cases.
- Proficient in Kafka optimization (e.g., topic management, consumer groups) and monitoring.
- Knowledge of Redis (e.g., Azure Cache for Redis, pub/sub) and batch processing (e.g., Azure Data Factory, Kubernetes Jobs).
- Familiarity with C++ build systems (e.g., CMake) and debugging (e.g., gdb) in Kubernetes.
- Proficiency in Kubernetes resource management and autoscaling (e.g., HPA, VPA).
- Scripting skills (e.g., Python, Bash) for automation, diagnostics, and log extraction.
Customer Focus: Proven ability to troubleshoot on-premises customer environments and build supportable deployment and observability tools.
Standards Enforcement: Experience enforcing logging, sizing, and data platform standards via Azure DevOps pipelines.
RAS Expertise: Ability to design for serviceability (observability, diagnostics) and contribute to reliability/availability through platform optimization.
Nice-to-Haves
- Experience with service meshes (e.g., Linkerd, Istio) and their integration with Azure.
- Familiarity with .NET, Python, or Java for developer collaboration.
- Knowledge of air-gapped Kubernetes deployments (e.g., kubeadm, K3s).