About Calfus
At Calfus, we are known for delivering cutting-edge AI agents and products that transform businesses. We empower companies to harness the full potential of AI, unlocking opportunities that were out of reach before the AI era. Our software engineering teams are highly valued by customers, whether start-ups or established enterprises, because we consistently deliver solutions that drive revenue growth. Our ERP solution teams have successfully implemented cloud solutions and developed tools that integrate seamlessly with ERP systems, reducing manual work so teams can focus on high-impact tasks.
None of this would be possible without talent like you! Our global teams thrive on collaboration, and we’re actively looking for skilled professionals to strengthen our in-house expertise and help us deliver exceptional AI, software engineering, and enterprise application solutions.
As one of the fastest-growing companies in our industry, we take pride in fostering a culture of innovation where new ideas are always welcome. We are driven and expect the same commitment from our team members. Our speed, agility, and dedication set us apart, and we perform best when surrounded by high-energy, motivated individuals.
To continue our rapid growth and deliver an even greater impact, we invite you to apply for our open positions and become part of our journey!
About the Role:
The L3 Production Support Engineer is a backend-focused, full-stack incident subject-matter expert (SME) responsible for owning complex production incidents, driving root cause analysis (RCA), and implementing systemic improvements for our agentic on-call management platform. This role bridges incident command, deep backend engineering, and targeted frontend troubleshooting to ensure platform reliability at scale.
What You’ll Do:
Incident Management & Leadership
Own Sev-1/Sev-2 incident response as incident commander or lead resolver, driving swift diagnosis and resolution
Lead post-incident RCAs, identifying systemic issues and driving long-term fixes across backend, infrastructure, and UI
Establish and refine incident response playbooks, runbooks, and escalation procedures
Participate in the on-call rotation as a primary or secondary responder, with accountability for critical systems
Backend & Infrastructure Expertise
Perform deep production troubleshooting: log analysis, distributed tracing, metric correlation, and profiling under pressure
Diagnose and fix complex issues across microservices: scheduling engine, LLM orchestration, notification pipeline, and integrations
Optimize database queries, identify locking issues, and manage migrations in PostgreSQL under production constraints
Architect and implement Redis caching, rate limiting, and queue-based patterns for reliability and scale
Work with Kubernetes, container orchestration, and deployment pipelines; manage rollbacks and feature toggles during incidents
Full-Stack Incident Resolution
Resolve end-to-end incidents regardless of origin (backend API, database, LLM vendor, or React frontend)
Debug and ship targeted React fixes when a UI change is the fastest path to incident resolution
Drive code-level improvements in backend services (Python/FastAPI) to harden agent flows, retry logic, and error handling
Collaborate closely with dev teams on defects, performance bottlenecks, and architecture-level changes
Observability & Continuous Improvement
Design and tune monitoring, alerting, and SLO/SLI frameworks for the platform
Maintain and evolve critical runbooks, playbooks, and knowledge base entries as patterns emerge
Mentor L2 engineers on deep troubleshooting, escalation discipline, and incident best practices
Drive blameless post-mortems and systemic risk reduction across the platform