Posted: 18 hours ago
Hybrid
Full Time
MLOps Engineering
Experience operationalizing & managing ML/AI workloads in production environments
Distributed Tracing & Observability
Strong understanding and hands-on implementation of metrics, logs, and traces (three pillars of observability)
Monitoring & Alerting
Production experience building Grafana dashboards and actionable alert systems; understands that dashboards without alerts lack operational value
Azure Databricks Operations
Cluster management, performance optimization, timeout resolution, library troubleshooting, and compute issue resolution
Azure Cloud Services
Deep knowledge of Azure PaaS, AKS, cloud-native architectures, and Azure monitoring/diagnostics ecosystem
Good-to-Have Skills
GCP Experience
Exposure to Google Cloud Platform services and telemetry collection
Multi-Cloud Operations
Experience across Azure, GCP, or AWS environments
Apache Airflow
Workflow orchestration experience (basic level acceptable; can be learned on job)
Python/Scripting
Automation and scripting proficiency
MLOps Knowledge
Understanding of ML lifecycle management and MLOps practices
Technology Stack
Primary Cloud: Microsoft Azure
Key Platforms: Azure Databricks, Azure Kubernetes Services (AKS), Azure PaaS services
Observability: Grafana, distributed tracing tools, metrics/logs/traces platforms
Orchestration: Apache Airflow (basic usage)
Secondary Cloud: GCP services (limited scope)
Key Responsibilities
Design and implement comprehensive observability solutions using metrics, logs, and distributed traces
Build unified Grafana dashboards for single-pane-of-glass visibility across multi-cloud environments
Establish actionable alerting frameworks that drive incident response
Implement distributed tracing for AI/ML workloads and microservices
Proactively identify and remediate performance bottlenecks
Monitor, troubleshoot, and optimize Azure Databricks compute environments
Right-size clusters and resolve performance issues (timeouts, long-running jobs, library failures)
Build observability layers where current gaps exist
Manage and optimize AKS workloads and Azure PaaS offerings
Collect telemetry from Azure and GCP services and pipe to observability stack
Integrate diverse cloud services into unified monitoring infrastructure
Implement logging, metrics collection, and tracing across heterogeneous environments
Ensure comprehensive visibility across entire technology stack
Create and manage support cases with Databricks and Microsoft
Provide technical support for AI/ML workloads on cloud infrastructure
Research and implement solutions for unfamiliar technologies
Citiustech