What We'll Bring
We are seeking a highly skilled and motivated SRE Application Support Lead / Sr. Lead to join our 24x7 support team. This role is critical to ensuring the stability, performance, and reliability of mission-critical applications deployed across modern platforms including Docker, Kubernetes, and cloud environments. The ideal candidate will possess strong technical expertise, leadership capabilities, and a proactive mindset to drive operational excellence.
What You'll Bring
Key Responsibilities
Team Leadership & Management
- Lead and mentor a team of SRE/Application Support Engineers.
- Assign tasks, set goals, and ensure smooth day-to-day operations.
- Foster a culture of ownership, accountability, and continuous improvement.
 
Incident & Problem Management
- Own and manage critical incidents end-to-end.
- Perform root cause analysis and drive permanent resolutions.
- Collaborate with cross-functional teams and vendors for quick recovery.
 
Monitoring & Observability
- Use tools such as Splunk, Grafana, AppDynamics, and Spotfire to monitor application health.
- Set up proactive alerting and dashboards for performance tracking.
 
Automation & Tooling
- Develop scripts (Shell, Python) to automate routine tasks.
- Build and maintain internal tools to improve support efficiency.
 
Cloud & DevOps Integration
- Support applications deployed in Docker, Kubernetes, and cloud platforms.
- Collaborate with DevOps teams for CI/CD pipeline support and release validations.
 
Change & Release Management
- Perform pre- and post-release validations.
- Ensure production stability during deployments.
 
Documentation & Knowledge Management
- Maintain runbooks, SOPs, and knowledge base articles.
- Ensure onboarding materials and troubleshooting guides are up-to-date.
 
Stakeholder Communication
- Provide timely updates to leadership and business teams.
- Present metrics, incident summaries, and improvement plans.
 
SRE Mindset
- Apply SRE principles to improve reliability, scalability, and performance of supported applications through proactive monitoring and automation.
- Focus on reducing toil by automating repetitive tasks and improving operational efficiency.
- Participate in blameless postmortems and contribute to continuous improvement initiatives based on incident learnings.
- Drive observability enhancements by integrating metrics, logs, and traces into monitoring dashboards.
- Collaborate with engineering teams to define and measure SLIs/SLOs, ensuring alignment with business availability goals.
 
Required Skills
- Strong Incident Management (IM) expertise: Proven ability to lead and coordinate high-severity incidents, including real-time triaging, root cause identification, and resolution tracking.
- Bridge Call Management: Experience in initiating and leading bridge calls, ensuring timely updates, stakeholder alignment, and effective resolution.
- Stakeholder Communication & Coordination: Ability to interact with cross-functional teams, vendors, and leadership during incidents and planned changes.
- Monitoring & Observability Tools: Proficient in Splunk, Grafana, AppDynamics, Spotfire, and other monitoring platforms.
- Technical Proficiency: Strong hands-on experience in Linux, SQL, Shell scripting, and Python (preferred).
- Cloud & Containerization: Exposure to cloud platforms (AWS, Azure, GCP), Docker, and Kubernetes.
- Automation & Tooling: Experience in automating support tasks and building internal tools to improve operational efficiency.
- Change & Problem Management: Familiarity with ITIL processes, including change, incident, and problem management.
- Certifications: ITIL, AWS, Azure, Kubernetes, or other relevant technical/process certifications are a plus.
- Excellent Communication Skills: Strong verbal and written communication for effective collaboration and reporting.
- Team Leadership: Experience in managing and mentoring support teams, driving performance, and ensuring 24x7 operational readiness.
 
Impact You'll Make
- Lead 24x7 SRE/Application Support operations, ensuring high availability and performance of critical applications.
- Drive Incident Management processes, including triage, resolution, and post-incident reviews.
- Initiate and lead bridge calls during high-severity incidents, ensuring timely updates and coordination across teams.
- Act as the primary point of contact for stakeholder communication during incidents and planned changes.
- Oversee monitoring and observability using tools such as Splunk, Grafana, AppDynamics, and Spotfire.
- Support applications deployed in Docker, Kubernetes, and cloud platforms (AWS/Azure/GCP).
- Lead automation initiatives using Shell scripting and Python to improve operational efficiency.
- Collaborate with DevOps and Engineering teams for CI/CD and release management.
- Ensure compliance with ITIL processes (Incident, Problem, and Change Management).
- Maintain documentation, including runbooks, SOPs, and knowledge base articles.
- Tools & Technologies: Linux, SQL, Docker, Kubernetes, Splunk, Grafana, AppDynamics, Spotfire, Shell, Python
- Certifications Preferred: ITIL, AWS/Azure/GCP, Kubernetes, DevOps
- Work Mode: Hybrid (as per team policy)
- Shift Type: Rotational (24x7 coverage)
 
This is a hybrid position and involves regular performance of job responsibilities virtually as well as in-person at an assigned TU office location for a minimum of two days a week.
TransUnion Job Title: Sr Lead, Applications Support