EverAI Labs
We are seeking an experienced SRE/Production Support Engineer to join our dynamic team and ensure the seamless operation of our EverAI Suite products. In this role, you will provide 24/7 production support, troubleshoot issues, monitor system performance, and collaborate with development teams to maintain high availability and reliability. This position is ideal for problem-solvers who thrive in fast-paced environments and are passionate about AI technologies.
Key Responsibilities
- Monitor production environments for EverAI Suite products (EverAI Simulator, EverAI Recruiter, and EverAI Knowledgeminer) using tools like Splunk, Prometheus, Grafana, ELK Stack, or similar monitoring systems.
- Respond to incidents, alerts, and user-reported issues in a timely manner, performing root cause analysis and implementing fixes or workarounds.
- Collaborate with cross-functional teams (development, QA, and operations) to resolve complex production problems and prevent recurrence.
- Maintain and update documentation for support processes, troubleshooting guides, and knowledge bases.
- Perform routine maintenance tasks, such as patching, scaling resources, and optimizing performance in cloud-based infrastructures (e.g., AWS, Azure, or GCP).
- Participate in on-call rotations to provide after-hours support and ensure SLAs are met.
- Analyze logs, metrics, and traces to identify trends, potential bottlenecks, and areas for improvement.
- Assist in deployment activities, including CI/CD pipeline support and rollback procedures.
- Contribute to continuous improvement initiatives, such as automating support tasks and enhancing monitoring capabilities.
Required Qualifications
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
- 3+ years of experience in production support, DevOps, or site reliability engineering (SRE) roles.
- Strong troubleshooting skills with experience in debugging distributed systems, APIs, and microservices architectures.
- Proficiency in scripting languages such as Python, Bash, or PowerShell for automation.
- Hands-on experience with cloud platforms (AWS, Azure, GCP) and containerization tools (Docker, Kubernetes).
- Familiarity with monitoring and logging tools (e.g., Splunk, Datadog, New Relic).
- Knowledge of databases (SQL/NoSQL) and networking concepts.
- Excellent communication skills, with the ability to explain technical issues to non-technical stakeholders.
- Ability to work in a shift-based or on-call environment.
Preferred Qualifications
- Experience supporting AI/ML-based products or SaaS platforms.
- Certifications such as AWS Certified DevOps Engineer, Google Cloud Professional Cloud DevOps Engineer, or equivalent.
- Familiarity with incident management frameworks (e.g., ITIL) and tools like PagerDuty or Jira.
- Strong problem-solving mindset with a proactive approach to preventing issues.
If you've got the skills to succeed and the motivation to make it happen, we look forward to hearing from you.