SRE/AIOps Manager

DXC Technology

7 - 12 years

20 - 25 Lacs

Chennai

Posted: 2 weeks ago

Skills Required

Automation, social media, disaster recovery, engineering manager, workflow, troubleshooting, distribution system, analytics, monitoring, Python

Work Mode

Work from Office

Job Type

Full Time

Job Description

Observability & Monitoring Excellence
- Design and implement end-to-end observability pipelines spanning AI solutions, data processing workflows, and automation execution environments
- Establish comprehensive monitoring strategies for AI model performance, drift detection, data quality, and service health across Databricks and UiPath platforms
- Build real-time dashboards and alerting systems that provide actionable insights into system performance, resource utilization, and service reliability
- Develop custom metrics and KPIs specific to AI/ML workloads, including model accuracy, latency, throughput, and resource consumption
- Implement distributed tracing and logging solutions to enable rapid troubleshooting across complex AI and automation pipelines

Automated Resolution & Self-Healing Systems
- Architect and deploy automated incident response systems that can detect, diagnose, and resolve common reliability issues without human intervention
- Build intelligent event-triggered runbook automation
- Implement chaos engineering practices to proactively identify and strengthen system weaknesses
- Develop automated remediation workflows for infrastructure issues, service degradations, and capacity constraints
- Create self-healing mechanisms for AI inference services, data pipeline failures, and automation workflow interruptions

Team Leadership & Development
- Build, mentor, and lead a team of Site Reliability Engineers with expertise in AI/ML operations, data platforms, and automation technologies
- Establish SRE best practices, standards, and processes tailored to AI and automation workloads
- Foster a culture of reliability engineering, continuous improvement, and data-driven decision-making
- Conduct regular performance reviews, career development discussions, and technical skill assessments
- Collaborate with engineering teams to embed reliability principles into the software development lifecycle

Platform Reliability & Performance
- Ensure near-zero downtime and optimal performance of AI solutions, Databricks analytics workloads, and UiPath automation processes
- Design and implement disaster recovery and business continuity plans for critical AI and automation services
- Optimize resource allocation and cost management across cloud infrastructure supporting AI, analytics, and automation workloads
- Establish and maintain service level objectives (SLOs) and error budgets for all managed services
- Drive capacity planning initiatives to support growing AI model deployment and automation scale requirements

Cross-Functional Collaboration
- Partner with AI/ML developers to integrate reliability considerations into AI solutions and deployment pipelines
- Work closely with data engineering teams to ensure robust, monitored data flows within Databricks environments
- Collaborate with automation developers to build resilient UiPath bot deployment and execution frameworks
- Interface with security teams to implement observability solutions that maintain compliance and data protection standards
