Project Overview:
Join a dynamic team responsible for maintaining and enhancing an enterprise-grade GenAI assistant developed for a network operations team. This platform provides intelligent query responses, generates reports, and delivers insights by integrating with various internal data sources including SharePoint, Splunk, ServiceNow, and client-specific documentation. It also orchestrates critical operational workflows like automated alarm correlation and incident resolution.
This role will involve L2/L3 support, ongoing system upkeep, and implementing incremental feature improvements.
Job Responsibilities:
1. Production Support (L2/L3):
- Triage and resolve bugs and incidents.
- Develop hotfixes and minor enhancements.
- Maintain release hygiene and manage change requests.
- Perform root cause analysis and post-mortem documentation.
- Work closely with the client leadership team to prioritize issues.
2. Feature Enhancements:
- Gather and validate requirements with customer SMEs.
- Develop and test enhancements, including unit testing and QA.
- Manage deployments across environments (dev, staging, production).
- Support UAT and production rollout with minimal downtime.
Key Responsibilities:
- Operate and maintain Azure cloud infrastructure and Kubernetes clusters.
- Build and manage CI/CD pipelines with smooth environment promotion.
- Implement centralized observability using logs, metrics, and traces.
- Set up dashboards and alerts to track system reliability against SLOs.
- Manage secrets, RBAC, image scanning, and maintain audit trails.
- Ensure secure SDLC practices and maintain CVE hygiene.
- Maintain integration stability with platforms like ServiceNow, Splunk, and SharePoint.
- Keep integration schemas, credentials, and SLAs up to date.
- Maintain accurate documentation: runbooks, playbooks, and environment diagrams.
- Ensure model quality and performance via prompt/config tuning and regression testing.
- Maintain integration health and manage operational runbooks and post-mortems.
- Adhere to best practices in cloud security, access control, and auditing.
- Collaborate with cross-functional teams to support continuous improvement.
Job Requirements:
Experience:
- 5–9 years in DevOps, SRE, or MLOps roles.
- Proficient in Docker, Kubernetes, and Microsoft Azure.
- Experience with Infrastructure as Code (IaC) and CI/CD pipeline development.
- Strong background in monitoring, logging, and alerting for distributed systems.
- Experience supporting Python-based services and machine learning workloads.
- Familiarity with MLflow, feature flag management, and blue/green deployments.
- Experience building cost monitoring dashboards for cloud environments.
Please connect with Grace at grace.beulah@purviewservices.com to learn more about the opportunity.