Posted: 3 weeks ago | Hybrid | Full Time
Role & responsibilities

Job Title: Mainframe Site Reliability Engineer (SRE)
Location: Pune/Hyd
Employment Type: Full-Time

---

About the Role

We are seeking a visionary Mainframe Site Reliability Engineer (SRE) to redefine the reliability, automation, and efficiency of our mission-critical z/OS systems. This role combines deep mainframe expertise with modern SRE practices, focusing on observability, AI-driven operations, and DevOps integration to transform legacy workflows into self-healing systems. You will drive initiatives to eliminate manual toil, optimize performance, and ensure the platform's resilience aligns with business-critical service level objectives (SLOs).

---

Key Responsibilities

1. SRE-Centric Innovation & Automation

- Automation Engineering:
  - Design and deploy Infrastructure-as-Code (IaC) solutions using Ansible, Zowe CLI, and z/OSMF workflows to automate system provisioning, configuration management, and recovery processes.
  - Develop self-healing workflows for critical subsystems (CICS, Db2, IMS) to auto-resolve incidents such as JVM failures or transaction bottlenecks.
  - Convert legacy operational scripts (REXX, NCL) into modern, version-controlled pipelines integrated with Git and CI/CD tools such as Jenkins.
- AI-Driven Observability:
  - Implement predictive analytics tools (e.g., IBM Watson AIOps, Splunk ITSI) to detect anomalies in system metrics, logs, and message queues.
  - Build dashboards using Grafana or Prometheus to visualize the Four Golden Signals (latency, traffic, errors, saturation) across mainframe workloads.
  - Centralize alert management to reduce noise and prioritize actionable alerts using AI-driven correlation.

2. DevOps Integration & Modernization

- CI/CD for Mainframe:
  - Streamline software delivery pipelines for COBOL/PL/I applications using IBM Dependency-Based Build (DBB) and UrbanCode Deploy (UCD).
  - Integrate mainframe SDLC processes with enterprise Git repositories (GitHub, GitLab) to enable collaborative development and audit trails.
  - Enable automated testing and phased rollouts for z/OS middleware updates.
- Performance & Capacity Engineering:
  - Optimize CPU/MIPS utilization through runtime tuning (e.g., CICS Threadsafe, AT-TLS offloading) to reduce software licensing costs.
  - Forecast capacity demands using historical SMF/RMF data and propose dynamic hardware scaling strategies.
  - Conduct load testing for batch and OLTP workloads to validate system limits and error budgets.

3. Incident Management & Reliability

- Lead blameless postmortems for critical incidents, focusing on root cause analysis (RCA) and preventive actions (e.g., monitoring gaps, automation fixes).
- Reduce MTTR by implementing automated incident response playbooks (e.g., auto-restart failed subsystems, reroute traffic).
- Maintain 24/7 operational readiness through on-call rotations and cross-training in z/OS, CICS, Db2, and storage management.

4. Platform Hardening & Knowledge Sharing

- Enforce security best practices (RACF, TLS) and vulnerability remediation for z/OS and middleware.
- Develop reusable workbooks and runbooks to document system configurations, troubleshooting steps, and automation workflows.
- Mentor teams on SRE principles, fostering a T-shaped skill model (deep mainframe + DevOps/Agile practices).

5. Batch Optimization & Resource Management

- Design dynamic resource allocation strategies (e.g., WLM policies, enclaves) to prioritize critical batch jobs and minimize contention for CPU, memory, and I/O resources.
- Implement parallel processing (e.g., multi-task JCL, SYSAFF routing) to reduce runtime and avoid bottlenecks in long-running batch cycles.
- Streamline job dependencies using graph-based scheduling tools (e.g., IWS, CA7, Control-M) to eliminate idle wait times between interdependent jobs.

6. Proactive Batch Health Monitoring

- Develop automated checks for batch job SLAs, including real-time alerts for delays, resource starvation, or dataset contention.
- Integrate predictive analytics (e.g., historical SMF data analysis) to forecast and mitigate delays caused by seasonal peaks or data volume spikes.

---

Required Skills

- Technical Expertise:
  - xx+ years in z/OS system programming, performance tuning, or infrastructure support.
  - Proficiency in JCL, REXX, Python, and mainframe automation tools (IBM Z System Automation, Broadcom OPS/MVS).
  - Hands-on experience with Zowe, Ansible, Git, and CI/CD pipelines.
  - Mastery of SRE tenets: SLOs/SLIs, error budgets, and Infrastructure-as-Code (IaC).
- Innovation Focus:
  - Proven track record in implementing AI/ML-driven monitoring or auto-remediation for mainframe environments.
  - Experience modernizing legacy workflows (e.g., replacing CA Endevor with Git-based SDLC).
- Soft Skills:
  - Ability to lead cross-functional teams during high-severity incidents.
  - Strong communication skills to align technical execution with business objectives.
- Education:
  - Bachelor’s degree in Computer Science, Engineering, or a related field.

---

Preferred Qualifications

- Experience with AI-driven automation platforms (e.g., AMELIA AIOps) to standardize and migrate legacy workflows, integrate with event management systems (e.g., BigPanda), and orchestrate ITIL processes (Incident, Change) via ServiceNow.
- Certifications: IBM z/OS System Programming, Broadcom Mainframe SRE, or HashiCorp Terraform.
- Familiarity with Zowe Desktop for modern IDE-driven development or Dynatrace APM for CICS/Db2 monitoring.
- Knowledge of mainframe open-source ecosystems (Zowe, Feilong) or hybrid-cloud integrations.
Cognizant