15 years
0 Lacs
Posted:8 hours ago|
Platform:
On-site
Full Time
Job Title: VP-Digital Expert Support Lead Experience : 15 + Years Location : Pune Position Overview The Digital Expert Support Lead is a senior-level leadership role responsible for ensuring the resilience, scalability, and enterprise-grade supportability of AI-powered expert systems deployed across key domains like Wholesale Banking, Customer Onboarding, Payments, and Cash Management . This role requires technical depth, process rigor, stakeholder fluency , and the ability to lead cross-functional squads that ensure seamless operational performance of GenAI and digital expert agents in production environments. The candidate will work closely with Engineering, Product, AI/ML, SRE, DevOps, and Compliance teams to drive operational excellence and shape the next generation of support standards for AI-driven enterprise systems. Role-Level Expectations Functionally accountable for all post-deployment support and performance assurance of digital expert systems. Operates at L3+ support level , enabling L1/L2 teams through proactive observability, automation, and runbook design. Leads stability engineering squads , AI support specialists, and DevOps collaborators across multiple business units. Acts as the bridge between operations and engineering , ensuring technical fixes feed into product backlog effectively. Supports continuous improvement through incident intelligence, root cause reporting, and architecture hardening . Sets the support governance framework (SLAs/OLAs, monitoring KPIs, downtime classification, recovery playbooks). Position Responsibilities Operational Leadership & Stability Engineering Own the production health and lifecycle support of all digital expert systems across onboarding, payments, and cash management. Build and govern the AI Support Control Center to track usage patterns, failure alerts, and escalation workflows. Define and enforce SLAs/OLAs for LLMs, GenAI endpoints, NLP components, and associated microservices. Establish and maintain observability stacks (Grafana, ELK, Prometheus, Datadog) integrated with model behavior. Lead major incident response and drive cross-functional war rooms for critical recovery. Ensure AI pipeline resilience through fallback logic, circuit breakers, and context caching. Review and fine-tune inference flows, timeout parameters, latency thresholds, and token usage limits. Engineering Collaboration & Enhancements Drive code-level hotfixes or patches in coordination with Dev, QA, and Cloud Ops. Implement automation scripts for diagnosis, log capture, reprocessing, and health validation. Maintain well-structured GitOps pipelines for support-related patches, rollback plans, and enhancement sprints. Coordinate enhancement requests based on operational analytics and feedback loops. Champion enterprise integration and alignment with Core Banking, ERP, H2H, and transaction processing systems. Governance, Planning & People Leadership Build and mentor a high-caliber AI Support Squad – support engineers, SREs, and automation leads. Define and publish support KPIs , operational dashboards, and quarterly stability scorecards. Present production health reports to business, engineering, and executive leadership. Define runbooks, response playbooks, knowledge base entries, and onboarding plans for newer AI support use cases. Manage relationships with AI platform vendors, cloud ops partners, and application owners. Must-Have Skills & Experience 15+ years of software engineering, platform reliability, or AI systems management experience. Proven track record of leading support and platform operations for AI/ML/GenAI-powered systems . Strong experience with cloud-native platforms (Azure/AWS), Kubernetes , and containerized observability . Deep expertise in Python and/or Java for production debugging and script/tooling development. Proficient in monitoring, logging, tracing, and alerts using enterprise tools (Grafana, ELK, Datadog). Familiarity with token economics , prompt tuning, inference throttling, and GenAI usage policies. Experience working with distributed systems, banking APIs, and integration with Core/ERP systems . Strong understanding of incident management frameworks (ITIL) and ability to drive postmortem discipline . Excellent stakeholder management, cross-functional coordination, and communication skills. Demonstrated ability to mentor senior ICs and influence product and platform priorities. Nice-to-Haves Exposure to enterprise AI platforms like OpenAI, Azure OpenAI, Anthropic, or Cohere. Experience supporting multi-tenant AI applications with business-driven SLAs. Hands-on experience integrating with compliance and risk monitoring platforms. Familiarity with automated root cause inference or anomaly detection tooling. Past participation in enterprise architecture councils or platform reliability forums Show more Show less
Intellect Design Arena Ltd
Upload Resume
Drag or click to upload
Your data is secure with us, protected by advanced encryption.
My Connections Intellect Design Arena Ltd
Pune, Maharashtra, India
Salary: Not disclosed
Pune, Maharashtra, India
Salary: Not disclosed