Role overview
The SRE lead will oversee the reliability, performance, and operational excellence of cloud and on-premise infrastructure, applications, and services. This role combines deep technical expertise with leadership responsibility, ensuring stability, security, and scalability across environments. The SRE lead will manage a team of engineers providing support, and lead initiatives around automation, patching, monitoring, and FinOps optimization to ensure high availability and efficiency.
Key responsibilities
1. Infrastructure and VM management
- Oversee VM provisioning, patching, scaling, and performance management.
- Automate patching and log maintenance processes to minimize downtime.
- Ensure monthly updates, backups, and system health checks are completed.
- Coordinate with business teams to schedule patching for minimal impact.
2. Application and CI/CD service reliability
- Manage patching and updates for app services and associated components.
- Administer Jenkins pipelines, job management, and agent scaling.
- Implement secure access controls and perform regular reviews.
- Maintain Azure DevOps and pipeline governance for CI/CD stability.
3. Security and compliance
- Support CSPM and vulnerability management, prioritizing high-severity remediation.
- Respond to SOC/SIEM alerts, conducting incident triage and resolution.
- Manage PAM integrations, access controls, and compliance tracking.
- Maintain DNS and certificate lifecycle management, including renewals and secure updates.
4. Monitoring and observability
- Establish unified monitoring for infrastructure, applications, and performance metrics.
- Create dashboards and alerting systems to proactively detect anomalies.
- Provide incident response coverage and periodic service health reports.
- Conduct post-mortem analyses and implement corrective actions.
5. Cloud and FinOps operations
- Optimize cloud resource usage and cost through detailed FinOps reporting.
- Identify savings opportunities via rightsizing and unused resource cleanup.
- Generate monthly cost reports by application, service, and environment.
- Collaborate with business and finance teams for budget forecasting and cost governance.
6. Performance and scalability
- Continuously monitor infrastructure utilization and adjust resources dynamically.
- Analyze performance data to drive improvements in reliability and efficiency.
- Manage scaling of services and compute resources based on consumption trends.
7. Change and release management
- Facilitate CAB meetings and manage end-to-end change lifecycle.
- Review and prioritize change requests based on risk and business impact.
- Supervise production deployments and implement rollback strategies.
- Conduct post-implementation evaluations and report on success metrics.
8. Support and maintenance
- Lead the SRE team in providing L3 support for incidents and operational issues.
- Maintain documentation, knowledge bases, and troubleshooting guides.
- Implement preventive maintenance measures to enhance system stability.
Qualifications and experience
Essential
- Bachelor’s degree in computer science or engineering, or equivalent experience.
- 8+ years of IT operations experience, with at least 3 years in an SRE or DevOps leadership role.
- Expertise in cloud environments (Azure preferred), including infrastructure automation, monitoring, and FinOps.
- Hands-on experience with CI/CD tools (Jenkins, Azure DevOps).
- Strong knowledge of scripting (PowerShell, Python, or Bash).
- Deep understanding of networking, security, and system administration principles.
Preferred
- Experience with CSPM tools and vulnerability management platforms.
- Familiarity with SOC/SIEM tools (e.g., Microsoft Sentinel, Splunk).
- Strong communication and stakeholder management skills.
- ITIL, Azure Administrator, or DevOps Engineer certification.
Key competencies
- Reliability mindset: designs systems for fault tolerance and operational excellence.
- Automation-first approach: reduces manual effort through tooling and scripts.
- Leadership: mentors engineers and coordinates cross-functional initiatives.
- Analytical rigor: uses data-driven insights for optimization and cost control.
- Collaboration: works closely with security, development, and infrastructure teams to ensure seamless delivery.