Job Description
Role Overview:
As an Incident Manager, your primary responsibility will be to monitor, identify, and triage production incidents in cloud-based software systems. You will work closely with developers to diagnose defects, optimize system performance, and implement corrective actions. You will also take part in incident escalation, automated monitoring, and post-incident reviews to keep incident response efficient.

Key Responsibilities:
- Monitor, identify, and triage production incidents to assess impact, root cause, and potential resolution paths.
- Conduct detailed troubleshooting of cloud-based software systems to diagnose complex defects and implement corrective actions.
- Manage incident escalation processes, ensuring timely communication and coordination with relevant teams.
- Collaborate with developers to resolve bugs, optimize system performance, and deploy hotfixes as needed.
- Analyze logs, error reports, and monitoring data to identify patterns and proactively mitigate potential issues.
- Implement automated monitoring and alerting solutions to detect anomalies and streamline incident response.
- Document incident response processes, including root cause analysis and preventive measures.
- Participate in on-call rotation to provide 24/7 support for critical incidents.
- Develop and maintain knowledge base articles, playbooks, and incident runbooks for common issues.
- Contribute to post-incident reviews, identifying areas for improvement in monitoring, response, and resolution processes.

Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent work experience).
- 3+ years of experience in software engineering, with a focus on incident management and resolution in cloud environments.
- Strong proficiency in Node.js, including debugging, error handling, and performance optimization.
- Experience with cloud platforms (AWS, Azure, or GCP), including monitoring and troubleshooting cloud-native applications.
- Proficiency in logging frameworks (e.g., Winston, Bunyan) and monitoring tools (e.g., Datadog, ELK Stack, CloudWatch).
- Strong problem-solving skills and the ability to perform in high-pressure, time-sensitive scenarios.
- Experience with CI/CD pipelines and automated deployments (e.g., Jenkins, GitLab CI, AWS CodePipeline).
- Excellent communication and documentation skills, with a focus on clear incident reporting and knowledge transfer.
- Ability to work effectively in a cross-functional team, collaborating with developers, DevOps, and product owners.
- Written and spoken proficiency in English.

Preferred Skills:
- Experience with containerization (Docker, Kubernetes).
- Knowledge of REST APIs, WebSockets, and microservices architecture.
- Familiarity with incident management frameworks (e.g., ITIL, SRE practices).
- Understanding of security best practices in cloud-based systems.