MLOps Architect

12 - 20 years

35 - 50 Lacs

Posted: 18 hours ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

MLOps Engineering: Experience operationalizing & managing ML/AI workloads in production environments
Distributed Tracing & Observability: Strong understanding and hands-on implementation of metrics, logs, and traces (three pillars of observability)
Monitoring & Alerting: Production experience building Grafana dashboards and actionable alert systems; understands that dashboards without alerts lack operational value
Azure Databricks Operations: Cluster management, performance optimization, timeout resolution, library troubleshooting, and compute issue resolution
Azure Cloud Services: Deep knowledge of Azure PaaS, AKS, cloud-native architectures, and Azure monitoring/diagnostics ecosystem

Good-to-Have Skills
GCP Experience: Exposure to Google Cloud Platform services and telemetry collection
Multi-Cloud Operations: Experience across Azure, GCP, or AWS environments
Apache Airflow: Workflow orchestration experience (basic level acceptable; can be learned on the job)
Python/Scripting: Automation and scripting proficiency
MLOps Knowledge: Understanding of ML lifecycle management and MLOps practices

Technology Stack
Primary Cloud: Microsoft Azure
Key Platforms: Azure Databricks, Azure Kubernetes Service (AKS), Azure PaaS services
Observability: Grafana, distributed tracing tools, metrics/logs/traces platforms
Orchestration: Apache Airflow (basic usage)
Secondary Cloud: GCP services (limited scope)

Key Responsibilities
Design and implement comprehensive observability solutions using metrics, logs, and distributed traces
Build unified Grafana dashboards for single-pane-of-glass visibility across multi-cloud environments
Establish actionable alerting frameworks that drive incident response
Implement distributed tracing for AI/ML workloads and microservices
Proactively identify and remediate performance bottlenecks
Monitor, troubleshoot, and optimize Azure Databricks compute environments
Right-size clusters and resolve performance issues (timeouts, long-running jobs, library failures)
Build observability layers where current gaps exist
Manage and optimize AKS workloads and Azure PaaS offerings
Collect telemetry from Azure and GCP services and pipe to observability stack
Integrate diverse cloud services into unified monitoring infrastructure
Implement logging, metrics collection, and tracing across heterogeneous environments
Ensure comprehensive visibility across entire technology stack
Create and manage support cases with Databricks and Microsoft
Provide technical support for AI/ML workloads on cloud infrastructure
Research and implement solutions for unfamiliar technologies

Citiustech

IT Services and IT Consulting

Princeton NJ
