SRE/AIOps Manager

7 - 12 years

20 - 25 Lacs

Posted: 2 weeks ago | Platform: Naukri

Work Mode

Work from Office

Job Type

Full Time

Job Description

  • Observability & Monitoring Excellence
      - Design and implement end-to-end observability pipelines spanning AI solutions, data processing workflows, and automation execution environments
      - Establish comprehensive monitoring strategies for AI model performance, drift detection, data quality, and service health across Databricks and UiPath platforms
      - Build real-time dashboards and alerting systems that provide actionable insights into system performance, resource utilization, and service reliability
      - Develop custom metrics and KPIs specific to AI/ML workloads, including model accuracy, latency, throughput, and resource consumption
      - Implement distributed tracing and logging solutions to enable rapid troubleshooting across complex AI and automation pipelines
  • Automated Resolution & Self-Healing Systems
      - Architect and deploy automated incident response systems that can detect, diagnose, and resolve common reliability issues without human intervention
      - Build intelligent event-triggered runbook automation
      - Implement chaos engineering practices to proactively identify and strengthen system weaknesses
      - Develop automated remediation workflows for infrastructure issues, service degradations, and capacity constraints
      - Create self-healing mechanisms for AI inference services, data pipeline failures, and automation workflow interruptions
  • Team Leadership & Development
      - Build, mentor, and lead a team of Site Reliability Engineers with expertise in AI/ML operations, data platforms, and automation technologies
      - Establish SRE best practices, standards, and processes tailored to AI and automation workloads
      - Foster a culture of reliability engineering, continuous improvement, and data-driven decision-making
      - Conduct regular performance reviews, career development discussions, and technical skill assessments
      - Collaborate with engineering teams to embed reliability principles into the software development lifecycle
  • Platform Reliability & Performance
      - Ensure near-zero downtime and optimal performance of AI solutions, Databricks analytics workloads, and UiPath automation processes
      - Design and implement disaster recovery and business continuity plans for critical AI and automation services
      - Optimize resource allocation and cost management across cloud infrastructure supporting AI, analytics, and automation workloads
      - Establish and maintain service level objectives (SLOs) and error budgets for all managed services
      - Drive capacity planning initiatives to support growing AI model deployment and automation scale requirements
  • Cross-Functional Collaboration
      - Partner with AI/ML developers to integrate reliability considerations into AI solutions and deployment pipelines
      - Work closely with data engineering teams to ensure robust, monitored data flows within Databricks environments
      - Collaborate with automation developers to build resilient UiPath bot deployment and execution frameworks
      - Interface with security teams to implement observability solutions that maintain compliance and data protection standards

DXC Technology

Information Technology and Services

Tysons
