About The Company
Our Client is defining the future of cybersecurity through our XDR platform that automatically prevents, detects, and responds to threats in real time. Singularity XDR ingests data and leverages our patented AI models to deliver autonomous protection. With the Client, organizations gain full transparency into everything happening across the network at machine speed to defeat every attack at every stage of the threat lifecycle.
We are a values-driven team where names are known, results are rewarded, and friendships are formed. Trust, accountability, relentlessness, ingenuity, and our client-centric approach define the pillars of our collaborative and unified global culture. We're looking for people who will drive team success and collaboration across SentinelOne. If you're enthusiastic about innovative approaches to problem-solving, we would love to speak with you about joining our team!
What Are We Looking For?
We are seeking a Site Reliability Engineer (SRE) with extensive operational experience managing large-scale SaaS infrastructures. You will be responsible for designing and maintaining data infrastructure that emphasizes automation, self-service, and scalability. This role is vital to ensuring that we meet and exceed our Service Level Objectives (SLOs) and uptime commitments to customers.
You will partner closely with engineering teams to help them deliver software faster, safer, and with higher quality, while driving initiatives that enhance the reliability, stability, and cost efficiency of our production environments. You'll join a world-class team of like-minded SREs who manage complex, high-traffic systems that operate at global scale.
What Will You Do?
As a Site Reliability Engineer, you will play a critical role in ensuring the availability, scalability, and performance of SentinelOne's large-scale distributed systems. Working at the intersection of software development and operations, you'll focus on making our infrastructure more reliable, automated, and efficient, while empowering development teams to deliver at speed and with confidence.
In This Role, You Will
Drive Continuous Deployment & Delivery Excellence
- Design, implement, and optimize CI/CD pipelines for efficient, secure, and reliable software releases.
- Automate build, test, and deployment processes to enhance release velocity and reduce manual intervention.
Manage And Command Production Incidents
- Lead the response to production incidents, ensuring timely mitigation and root cause identification.
- Conduct post-incident reviews, define corrective actions, and drive continuous improvements to prevent recurrence.
Partner With Product Engineering Teams
- Collaborate with product, platform, and infrastructure teams to embed reliability and scalability into design and architecture.
- Provide technical guidance to improve system performance, fault tolerance, and observability.
Automate Operations And Streamline Processes
- Build automation tools and frameworks that eliminate repetitive tasks, standardize operational procedures, and support a self-service infrastructure model for development teams.
Monitor, Measure, And Optimize Reliability
- Establish metrics for system performance and reliability (availability, latency, throughput).
- Proactively identify and resolve potential issues using data-driven insights and continuous monitoring.
Eliminate Infrastructure Bottlenecks
- Analyze production systems to identify performance and scalability limitations.
- Implement architectural improvements to enhance throughput, reliability, and cost efficiency across AWS and GCP environments.
Enhance Observability & Incident Readiness
- Develop and maintain observability stacks with advanced monitoring, logging, and alerting systems (e.g., Prometheus, Grafana, Datadog).
- Conduct chaos engineering experiments to validate system resilience and ensure operational preparedness.
Ensure Security, Compliance & Resilience
- Work with security and compliance teams to enforce secure configurations, data integrity, and regulatory adherence.
- Participate in disaster recovery planning and capacity forecasting for high availability.
Mentor And Collaborate Across Teams
- Share best practices through documentation, technical discussions, and internal workshops.
- Foster a reliability-driven culture and promote continuous improvement across engineering functions.
What Skills and Experience Will You Need?
- 5+ years of experience managing large-scale SaaS operations or distributed systems
- Strong expertise in orchestration systems like Kubernetes, Nomad, or Mesos
- Proficiency in Python (preferred), Golang, or Java for automation and tooling
- Hands-on experience running and deploying Java and JavaScript applications
- Proven experience in AWS and GCP environments
- Practical knowledge of Infrastructure as Code (Terraform, CloudFormation, etc.)
- Experience with CI/CD tools such as Jenkins, GitHub Actions, or ArgoCD, and deployment strategies like blue-green, rolling, or canary deploys
- Familiarity with SRE principles: SLOs, SLIs, and error budgets
- Strong problem-solving, communication, and collaboration skills within distributed teams
- Self-starter attitude with a passion for automation, reliability, and continuous learning
- Prior product development or software engineering experience is a strong plus
What We Offer
- Flexible working format: remote, office-based, or hybrid
- Competitive salary and comprehensive compensation package
- Personalized career growth opportunities and mentorship programs
- Professional development tools: tech talks, training sessions, and centers of excellence
- Active technical communities with regular knowledge-sharing
- Education reimbursement for continued learning and certifications
- Memorable milestone celebrations and company-sponsored events
- Corporate gatherings and team-building initiatives