Critical Skills to Possess:
- 5+ years of Site Reliability Engineering, DevOps, or Infrastructure Engineering experience
- SRE Principles: Deep understanding of SLOs, SLIs, error budgets, and reliability engineering practices
- Incident Management: Proven experience with incident response, on-call rotations, and post-mortem processes
- Automation: Strong scripting abilities in PowerShell, Python, or Bash for automation and tooling
Monitoring and Tools
- SolarWinds: Advanced experience with SolarWinds NPM, SAM, and custom monitoring setup
- Azure Monitor: Proficient in Azure Monitor, Log Analytics, and Application Insights
- Ivanti: Experience with Ivanti ITSM for incident and change management
- Backup Solutions: Enterprise backup strategy implementation and monitoring
Professional Skills
- Strong analytical and troubleshooting skills with a systematic problem-solving approach
- Excellent communication skills for incident coordination and stakeholder updates
- Experience working in 24/7 production environments with strict SLA requirements
- Ability to balance reliability with feature velocity and business requirements
Preferred Qualifications:
- BS degree in Computer Science or Engineering, or equivalent experience
Roles and Responsibilities:
Service Reliability and Availability
- Design and implement service level objectives (SLOs) and service level indicators (SLIs) for critical systems
- Monitor and maintain 99.9%+ uptime for production environments across hybrid infrastructure
- Develop and execute incident response procedures and post-incident reviews
- Implement chaos engineering practices to proactively identify system weaknesses
- Lead root cause analysis and implement permanent fixes to prevent recurring issues
Monitoring and Observability
- Design comprehensive monitoring strategies using SolarWinds, Azure Monitor, and custom solutions
- Implement alerting systems with appropriate escalation procedures and noise reduction
- Create and maintain dashboards for system health, performance metrics, and business KPIs
- Establish logging strategies and log aggregation across all platforms
- Develop automated health checks and synthetic monitoring for critical services
Automation and Infrastructure as Code
- Develop automation scripts and tools to reduce manual operational overhead
- Implement infrastructure as code practices for consistent environment provisioning
- Create self-healing systems and automated remediation procedures
- Build CI/CD pipelines for infrastructure changes and application deployments
- Automate backup, recovery, and disaster recovery procedures
Database Reliability Engineering
- Ensure high availability and performance of Oracle and SQL Server database systems
- Implement database monitoring, alerting, and automated maintenance procedures
- Manage database backup strategies and recovery time and recovery point objectives (RTO/RPO)
- Optimize database performance through query tuning and resource management
- Coordinate with the teams running Informatica ETL processes to ensure data pipeline reliability
Capacity Planning and Performance
- Conduct capacity planning for compute, storage, and network resources
- Tune performance across Windows, Linux, and Azure environments
- Implement auto-scaling solutions for cloud workloads
- Analyze system performance trends and proactively address bottlenecks
- Optimize cost efficiency while maintaining performance standards
Security and Compliance
- Implement security best practices across all infrastructure components
- Manage automated patching and vulnerability remediation
- Ensure compliance with security policies and regulatory requirements
- Implement security monitoring and incident response procedures
- Coordinate with security teams for threat detection and response