Senior Site Reliability Engineer
We are seeking a Senior Site Reliability Engineer to support the infrastructure platforms powering our Supply Chain systems. This role blends traditional platform engineering with site reliability practices, ensuring that systems are stable, secure, and deployment-ready across on-prem and cloud environments. The ideal candidate will have strong systems administration experience, scripting ability, and an interest in driving reliability through automation, monitoring, and proactive platform support. Occasional on-call support is expected on a rotating basis.
Key Responsibilities:
Platform Configuration & Environment Readiness
- Install, configure, and maintain platform components (Windows/Linux servers, file systems, middleware, etc.) across development, test, and production environments.
- Prepare environments for application deployments and platform-level changes.
Incident Management & Root Cause Analysis
- Respond to service outages with urgency and lead post-incident reviews to prevent recurrence. Drive root cause analysis (RCA) and corrective and preventive action (CAPA).
- Develop incident playbooks and automate common response actions.
System Monitoring, Reliability & Uptime
- Monitor system health using tools like LogicMonitor and Splunk; respond to alerts and incidents with a root cause and resolution mindset.
- Proactively identify and address system bottlenecks and performance issues.
- Improve system performance and reliability through configuration tuning and monitoring enhancements.
- Help define SLAs, SLOs, and other critical KPIs, and work toward continuous improvement.
- Track backup and restore efficiency, and record RPO and RTO as metrics.
Scripting & Automation
- Develop and maintain scripts (e.g., PowerShell, Bash, Python) to automate health checks, administrative tasks, and environment validation.
- Contribute to efforts that reduce manual support and increase consistency across platforms.
Deployment & Change Coordination
- Collaborate with application teams and infrastructure engineers to validate system readiness for deployments and major changes.
- Ensure platform-level changes follow Medline's change control, documentation, and testing procedures.
Security & Compliance
- Apply system security best practices; ensure patching, access management, and configuration policies are in place and audit-ready.
- Participate in ITGC, SOX, and security reviews to maintain operational compliance.
Documentation & Knowledge Sharing
- Maintain accurate runbooks, technical documentation, and troubleshooting guides.
- Share knowledge across the team to support 24x7 platform operations and reduce key-person risk.
High Availability Testing
- Design and execute tests that simulate failures (e.g., node failures, network partitions) to verify system resilience.
- Collaborate with development and infrastructure teams to ensure redundancy and fault tolerance are in place.
Continuous Improvement
- Identify opportunities to improve observability, reduce noise, and increase system resilience.
- Collaborate with SREs and automation engineers to advocate for platform improvements, capacity management, and performance optimization.
Qualifications:
- Education: Bachelor's degree in Computer Science, Information Technology, Engineering, Supply Chain, or a related field, or equivalent work experience.
- Experience:
- 8+ years of overall experience in IT
- 5+ years of experience in platform support, systems administration, or infrastructure engineering
- Skills:
- Proficiency in both Windows and Linux system administration
- Scripting experience using PowerShell, Bash, or similar tools
- Experience with monitoring tools such as LogicMonitor and Splunk
- Familiarity with DevOps principles and automation practices
- Experience supporting enterprise applications and deployment processes
- Willingness to participate in rotating on-call support
- Experience performing high availability and disaster recovery drills, and with related tooling
- Knowledge of tracking, reporting on, and continuously improving SLAs and related metrics