About Us
Zycus, recognized by leading analyst firms in procurement technology, empowers teams to unlock deep value through its comprehensive Source-to-Pay (S2P) solutions. At the heart of our S2P solution is the Merlin Agentic Platform, which orchestrates intelligent AI agents to deliver simplified, efficient, and compliant processes. The Merlin Intake Agent offers business users unparalleled ease of use, increasing adoption rates and significantly reducing non-compliant spending. For procurement teams, the Merlin Autonomous Negotiation Agent handles tail spend autonomously, securing additional savings; the Merlin Contract Agent helps draft compliant contracts and reduces risks by actively monitoring them; and the Merlin AP Agent further enhances efficiency by automating invoice processing with exceptional speed and accuracy.
We Are An Equal Opportunity Employer:
Zycus is committed to providing equal opportunities in employment and creating an inclusive work environment. We do not discriminate against applicants on the basis of race, color, religion, gender, sexual orientation, national origin, age, disability, or any other legally protected characteristic. All hiring decisions will be based solely on qualifications, skills, and experience relevant to the job requirements.
Zycus is looking for a Site Reliability Engineer (SRE) with deep expertise in Kubernetes, automation, and Linux systems. The ideal candidate will have hands-on experience in deploying, administering, and optimizing large-scale production systems, with a strong focus on microservices architecture, ensuring automation, performance, and reliability across our SaaS platform.
Roles And Responsibilities:
- System Reliability & Uptime: Ensure high availability, performance, and reliability of applications and infrastructure.
- Kubernetes & Cluster Management: Deploy, administer, and maintain Kubernetes clusters, managing scaling, upgrades, and troubleshooting.
- Microservices Management: Handle the deployment, monitoring, and scaling of microservices in distributed environments.
- Incident Management: Respond to production incidents, perform root cause analysis, and implement long-term fixes to prevent recurrence.
- Automation & Infrastructure as Code (IaC): Automate repetitive tasks, infrastructure provisioning, and deployment workflows using tools like Ansible and Terraform.
- Monitoring & Observability: Implement and maintain monitoring tools (e.g., Prometheus, Grafana, Datadog) to track system health and application performance.
- Performance Optimization: Analyze system performance, identify bottlenecks, and optimize resources for better efficiency.
- Disaster Recovery & Backup: Design and implement backup and disaster recovery (DR) strategies for business continuity.
- Capacity Planning: Forecast infrastructure needs based on performance trends and business growth to ensure scalability.
- Security & Compliance: Ensure infrastructure and applications meet security standards and compliance requirements.
- Collaboration with Dev & Ops Teams: Work closely with development and operations teams to improve deployment pipelines, release processes, and system reliability.
- Documentation: Maintain clear and detailed documentation of systems, processes, and incident reports for knowledge sharing and compliance.
- Continuous Improvement: Identify opportunities for improving system architecture, deployment strategies, and automation workflows.
- Cloud Infrastructure Management: Manage cloud services (AWS, GCP, Azure) for resource optimization, cost management, and automation.
- On-Call Support: Participate in on-call rotations to handle urgent production issues and ensure rapid recovery.
Job Requirements:
- Experience: 5 to 12 years
- Technical skills as mentioned below:
Must Have:
- Hands-on experience with installing and provisioning Kubernetes clusters.
- Deep understanding of core Kubernetes components such as CRI, CNI, etcd, CoreDNS, and kube-proxy.