Job Description
As an Engineering Manager - Site Reliability, your primary responsibility is to manage, mentor, and develop a team of Site Reliability Engineers. You will play a crucial role in ensuring that both individual team members and the team as a whole are aligned with the organizational objectives and direction. Your duties will involve overseeing all activities within scope, directing the design of new products, and enhancing existing designs to ensure timely delivery with satisfactory quality. It will be essential for you to analyze technology trends, assess human resource requirements, and understand market demands to plan projects that align with current needs and future aspirations. Additionally, you will collaborate with leaders, production teams, key stakeholders, and marketing departments to assess engineering feasibility, cost-effectiveness, scalability, and time-to-market for new and existing products.

Your responsibilities will include:

Managing People:
- Inspiring, nurturing, and developing individuals by assisting in the creation of personal development plans, utilizing available learning resources, and providing stretch opportunities.
- Taking ownership, being proactive, and fostering collaboration with business counterparts, peers, other craft managers, and stakeholders to ensure tasks are completed effectively.
- Monitoring team health metrics, tracking KPIs, overseeing roadmap progress, identifying and resolving blockers, and escalating issues when necessary.

End to End System Ownership:
- Assuming responsibility for a service from end to end by actively monitoring application health and performance, establishing and monitoring relevant metrics, and taking appropriate action when metrics are violated.
- Mitigating business continuity risks and bus factor by implementing cutting-edge practices and tools, and documenting procedures such as runbooks and OpDocs.
- Independently managing an application or service, overseeing deployment and operations in production, and guiding less experienced team members in these areas.

Technical Incident Management:
- Addressing and resolving live production issues to minimize customer impact within the SLA.
- Enhancing the overall reliability of systems by developing long-term solutions through root cause analysis.
- Contributing to postmortem processes, logging live issues, and tracking incidents effectively.

Building Software Applications:
- Developing software applications using relevant programming languages and leveraging knowledge of systems, services, and tools applicable to the business domain.
- Writing readable and reusable code by following standard patterns and utilizing standard libraries.
- Refactoring and simplifying code by introducing design patterns when necessary.
- Ensuring application quality by employing standard testing techniques and methods that align with the test strategy.
- Upholding data security, integrity, and quality by adhering to company standards and best practices.

Architectural Guidance:
- Offering guidance to product teams on technical solutions that meet functional, nonfunctional, and architectural requirements.
- Evaluating and aligning target architecture enhancements, reframing architectural designs and decisions, and providing context within the broader architectural landscape.

Key Skills:
- Strong people management skills
- Excellent communication and stakeholder management abilities
- Commercial awareness and technical vision
- Experience in software development
- Leadership in managing engineering teams in a fast-paced environment
- Proficiency in at least one programming language (Java, C/C++, Python, Go)
- Ability to devise software solutions from scratch
- Understanding of Service Oriented Architecture, Microservices, and OOP patterns
- Hands-on experience in Linux administration and troubleshooting
- Creative problem-solving skills
- Knowledge of defining SLIs and SLOs
- Strong analytical skills and data-driven mindset