As an MS CM Engineer Core, you will ensure end-to-end fault, incident, and problem management with a focus on meeting SLA/OLA targets. You will also handle high-level escalations and resolve complex network issues using in-depth technical and procedural expertise.
You have:
- A B.E./B.Tech. degree or equivalent, with 6-10 years of experience in OpenStack Cloud.
- Hands-on experience with OpenShift Container Platform, including installation, configuration, and administration.
- Experience performing daily health checks of the cluster and fixing issues based on observations, and implementing and customizing Grafana dashboards as required to monitor the cluster effectively.
- Experience helping application teams resolve pod issues such as CrashLoopBackOff, ImagePullBackOff, and other errors, and applying best practices for configuring readiness and liveness probes.
- Experience managing and modifying Security Context Constraints (SCCs) to meet custom requirements and assigning them to specific projects, and managing certificates when existing certificates expire or as needed.
- Experience checking cluster utilization and sharing reports with the business when needed, forecasting capacity growth requirements, and handling capacity increases.
- End-to-end architectural knowledge of ACM, ACS, and ODF.
It would be nice if you also had:
- Knowledge of RHOCP architecture and experience with microservices, containers, and orchestration (Kubernetes, Docker).
- Knowledge of ITSM ticketing tools and of incident, problem, and change management processes.
- Fault management and configuration of Red Hat OpenShift Platform, and coordination with project teams for HLD/LLD/TOL/design reviews.
- Troubleshooting worker node issues and performing maintenance and scale-out tasks.
- Performing maintenance activities on the cluster, such as patching, configuration changes, and installing operators.
- Close coordination with the project team on network integration activities, handling trouble tickets/CRs within SLA, and driving automation of tasks.
- Handling customer issues and ensuring that end-customer services are maintained.
- Preparing MOPs/WIs for activities and new learnings, and coordinating with the care team to identify and analyze root causes (RCAs).
- Preparing, implementing, and verifying the configuration and integration of a node/system, tracking tool-related issues, and escalating in a timely manner according to the predefined escalation matrix.