Job Description
Your role and responsibilities
As an Entry-Level Site Reliability Engineer (SRE) on the OneIT Lab Engineering Team, you will join a team dedicated to ensuring the reliability, scalability, and performance of IBM systems and infrastructure. This role is critical to advancing IBM Power System development initiatives, and you will gain hands-on experience with both physical hardware and software environments.

Your responsibilities will include, but are not limited to:
Assisting in the setup, configuration, and maintenance of IBM Power servers and related infrastructure.
Supporting software-related reliability initiatives such as automation, monitoring, performance tuning, and system optimization.
Participating in incident response, diagnostics, and root-cause analysis for both hardware and software issues.
Collaborating with cross-functional teams to ensure smooth integration between physical systems and application environments.
Supporting projects related to lab analytics: gathering, analyzing, and interpreting data to help guide better business and operational decisions.
Contributing to the deployment, scaling, and ongoing maintenance of production and test systems.
Writing clear, concise documentation for processes, configurations, and troubleshooting steps.
Learning and applying best practices in systems reliability, observability, and infrastructure operations.

You will be expected to grow into a well-rounded SRE capable of tackling challenges in both the physical, data-center-like environment and the software layer that powers our services. Mentorship and hands-on training will be provided to help you develop the skills to excel in both domains.

Required education
Bachelor's Degree

Preferred education
Bachelor's Degree

Required technical and professional expertise
2+ years of working experience as an SRE.
Passion for eliminating repetitive manual processes using automation.
Strong attention to detail and excellent analytical capabilities.
Excellent troubleshooting, problem-solving, and debugging skills.
Proficiency in programming concepts and frameworks.
Proficiency in scripting/coding for automation using Python, shell scripting (Bash, etc.), Ansible, and related tools and languages.
Familiarity with server operations, virtualization, and related infrastructure concepts.
Fundamental understanding of computer networks.
Fundamental understanding of data science/analytics frameworks.
An automation mindset: wherever possible, you should use scripting and automation.
Ability to work independently and as part of a team to achieve the SRE agenda.
Ability to complete project work, both supervised and unsupervised.
Ability to effectively prioritize and execute tasks in a high-pressure environment.
Good written, oral, and interpersonal communication skills.

Preferred technical and professional experience
Fundamental understanding of Linux/Unix systems is a plus.
Fundamental knowledge of Red Hat OpenShift and Kubernetes is a plus.
Automation/Scripting: In-depth experience with Ansible, Python, Terraform, and CI/CD tools is a plus, but a fundamental understanding is a must.
Hands-on experience crafting alerts and dashboards using Python or any other language.