On your first day, we'll expect you to have:
Operational & Monitoring
4–8+ years in application support, operational support, or platform operations roles
Strong dashboard reading and alert interpretation skills; ability to spot anomalies and correlate signals
Proficiency with on-call and ticketing tools: PagerDuty, Jira, ServiceNow, or similar
Familiarity with observability platforms: Prometheus, Grafana, Datadog, New Relic, or equivalent
Comfortable reading structured logs (JSON format) and using log aggregation platforms (ELK, Datadog, etc.)
Platform & Backend Understanding
Solid working knowledge of the agentic on-call platform architecture: core services, job scheduler, LLM orchestration, notification pipeline
Basic understanding of microservices: how they communicate, common failure modes, and escalation paths
Comfortable with Linux command line basics: SSH, file navigation, process inspection, basic grep/awk for log parsing
Familiarity with containerization and orchestration: Docker and Kubernetes at an operational level (restart pods, check logs, review resource usage)
Basic SQL read-only skills: able to run safe SELECT queries to validate data, check state, and gather troubleshooting context under runbook guidance
Understanding of CI/CD basics: awareness of deployment pipelines, rollback procedures, and config toggle mechanics
LLM & Agentic Systems (Operational Level)
Exposure to LLM/agent usage patterns: understanding of tool-calling, context limits, rate limits, and vendor API quirks
Familiarity with common LLM failure modes: hallucinations, token exhaustion, timeouts, and vendor-specific rate-limiting
Ability to follow troubleshooting flows for agent-driven incidents (prompt tracing, tool execution validation, fallback behavior)
Incident & On-Call Workflows
Understanding of incident classification (Sev-1/2/3/4) and appropriate escalation criteria
Knowledge of on-call best practices: communication protocols, incident documentation, and post-mortem participation
Comfortable with asynchronous and shift-based work; reliable responder with good alert acknowledgment habits
Communication & Documentation
Required Soft Skills
Customer-focused mindset: empathy for users and urgency in resolving their issues
Detail-oriented: accurate note-taking during incidents and careful adherence to runbooks
Proactive learner: ability to absorb new technologies, platforms, and troubleshooting patterns quickly
Collaborative: works well with L3 engineers, dev teams, and other operational teams
Shift-friendly: reliable availability during on-call rotations, including nights/weekends as scheduled
Humble & curious: asks clarifying questions, escalates appropriately, and doesn't hesitate to ask for help
Experience Requirements
Minimum 2–4 years in application/production support, technical support, operations, or platform engineering roles
Proven experience with incident triage, ticket management, and on-call workflows
Prior exposure to on-call systems or incident management platforms (PagerDuty, Squadcast, or custom tooling)
Experience with at least one agentic AI or LLM-integrated product (customer-facing or internal tools) is a plus
Comfortable working a shift-based on-call rotation (evenings, nights, and weekends, as scheduled)
Nice-to-Have
Prior experience in a Global Capability Center or consulting firm environment
Familiarity with incident severity frameworks and SLO/SLI concepts
Exposure to multiple monitoring and observability tools
Basic scripting (Python or bash) for custom diagnostics and automation
Experience writing operational procedures or internal documentation
Operational Guide & Runbook Responsibilities (Detailed)
Creating New Runbooks
When a recurring issue is identified (by L2 or L3), collaborate to create a step-by-step runbook
Ensure runbooks are clear, actionable, and safe for L2 execution without requiring L3 escalation
Include decision trees: "if X, do Y; if Z, escalate to L3"
Test runbook accuracy by walking through it with a peer before publishing
Maintaining Existing Runbooks
Review runbooks quarterly for accuracy and relevance; update if processes or tool names have changed
Flag outdated runbooks during team syncs (e.g., "This runbook references an old dashboard URL")
Incorporate feedback from L3 after they resolve complex issues: distill each fix into runbook steps for future L2 use
Updating the Operational Guide
Maintain a single, authoritative Operational Guide covering:
Platform architecture overview (high-level, non-code)
Alert guide: what each alert means, typical causes, and first-response steps
Runbook index: list of all runbooks with quick-reference links
Troubleshooting decision tree: common symptoms mapped to the runbook to follow
Escalation criteria and process
On-call procedures and communication protocols
Known issues and workarounds
Update the guide when new features deploy, alerts change, or new runbooks are created
Conduct semi-annual reviews of the guide to ensure accuracy and completeness
Knowledge Base Management
Maintain a searchable knowledge base (wiki, Notion, Confluence, or similar) with:
Common user issues and L2-resolvable solutions
Frequently asked questions with step-by-step answers
Post-incident summaries (non-sensitive) to share learnings
Troubleshooting checklists organized by symptom
Encourage L2 team members to contribute findings and suggest improvements
Archive or deprecate outdated entries quarterly
Success Metrics
Incident response: Mean Time to Acknowledgment (MTTA) and Mean Time to Escalation (MTTE) for triage decisions
Runbook effectiveness: % of L2 team able to resolve tickets using runbooks without escalation; reduction in "unknown" escalations
Documentation quality: User and L3 feedback on runbook clarity and accuracy; reduced escalations due to missed troubleshooting steps
Operational guide updates: Guide reviewed and refreshed quarterly; 0 outdated procedures in active rotation
On-call reliability: response times, ticket accuracy, and team feedback on L2 availability and professionalism
Knowledge base engagement: number of contributions per quarter, search usage, and user satisfaction with knowledge base accuracy
1. Dashboard Reading: Given a dashboard screenshot, identify what metrics matter most and what they tell you about platform health.
2. Escalation Quality: Describe an escalation you wrote to L3. What context did you include? What would you improve?
3. LLM Failure Mode: An agent is timing out. Walk through your initial troubleshooting using logs and dashboards.
4. On-Call Shift: Describe your ideal on-call shift. What tools do you use? How do you stay alert and responsive?
5. Knowledge Sharing: How would you mentor a new L2 engineer on your team?
Benefits:
At Calfus, we value our employees and offer a strong benefits package. This includes medical, group, and parental insurance, coupled with gratuity and provident fund options. Further, we support employee wellness and provide birthday leave as a valued benefit.
Calfus Inc. is an Equal Opportunity Employer.
We believe diversity drives innovation. We’re committed to creating an inclusive workplace where everyone, regardless of background, identity, or experience, has the opportunity to thrive. We welcome all applicants!