Director, Application Operations - SRE

15 - 20 years

50.0 - 55.0 Lacs P.A.

Hyderabad

Posted: 3 weeks ago | Platform: Naukri

Skills Required

SRE, risk management, DevOps, Change management, ITSM, Problem management, information security

Work Mode

Work from Office

Job Type

Full Time

Job Description

The Role: Director, Application Operations, SRE (Site Reliability Engineering)

The Team: This team is part of the global SRE group that provides Site Reliability Engineering services for the critical applications analysts use to conduct business. The Application Operations team is responsible for the Stability (Uptime), Reliability (Quality & Performance) and Engineering of these applications to improve business outcomes, user experience and efficiency. The team operates at the intersection of IT operations and software development, ensuring that our services are not only robust but also agile enough to adapt to ever-evolving business needs.

Impact and Responsibilities: The impact of this role extends far beyond the immediate team. You will be instrumental in shaping the reliability and performance standards of our critical applications, ensuring they meet the highest benchmarks. By driving advancements in automation and cloud technologies, you will contribute significantly to the organization's strategic goals and toil reduction, enhancing both the user experience and operational efficiency. You will nurture team members to be best-in-class through upskilling and cross-skilling.

General & Team Management:
  • Ensure the team balances its focus between daily operational tasks and strategic long-term projects
  • Drive the adoption of new technologies and processes through training and mentoring
  • Lead, mentor, guide and coach the Application Operations team, and transform it into a team of SREs
  • Create and maintain documentation for systems and processes to ensure continuity and knowledge sharing within the team
  • Adopt Gen AI to leverage the knowledge repository
  • Collaborate with cross-functional teams to ensure seamless integration and support for new technologies and initiatives
  • Oversee daily operations and ensure shifts are adequately managed
  • Set the roadmap; derive goals for each team member; review, motivate and support team members to make them successful

Stability:
  • Build an SRE practice that improves system stability with Monitoring & AIOps
  • Avert P1/P2 incidents and minimize business impact
  • Analyze system vulnerabilities and SPOFs, and address them proactively to improve stability
  • Refactor monolithic apps and databases into containerized services to improve delivery and scale
  • Work with business users to understand needs and issues, develop root cause analyses, and work with cross-functional teams to address them permanently

Reliability:
  • Monitor system performance and create strategies to improve it
  • Reduce the number of incidents and the time taken to resolve them (MTTR)
  • Develop and implement disaster recovery plans to ensure business continuity
  • Lead DevOps transformation to improve the delivery of value to the business, reduce costs and manual errors, increase release velocity and improve configuration management

Engineering:
  • Take part in architecture and development design reviews (shift-left) for new implementation and integration projects to build SRE best practices into the SDLC
  • Continuously look for opportunities to automate tasks, simplify processes and enable self-service to reduce toil

Value Stream Alignment: While alignment as a horizontal lead is expected to begin with, it's expected that you also handle the role of an SRE value stream lead going forward.
  • Ensure smooth inter-working with value streams (VS) to meet objectives and realize value
  • Foster two-way knowledge sharing with VS and reduce dependency on SRE
  • Help shepherd VS to improve SRE maturity levels; implement and prioritize best practices such as monitoring, post-mortems, toil reduction and retrospectives
  • Application to User Journey orientation and transformation

What's in it for you: In this role, you will have the opportunity to collaborate with a diverse and talented team, working on cutting-edge technology solutions to drive efficiency and innovation within the organization. You will be at the forefront of implementing best practices in site reliability engineering, with a strong emphasis on automation, cloud technologies, and performance optimization. You will interface with the value stream leads to improve the SRE practices and maturity levels within the value streams.

What We're Looking For:

Basic Qualifications:
  • Bachelor's degree in computer science or equivalent is required, or in lieu, demonstrated equivalent work experience
  • 15+ years of experience in the Information Technology domain, including cloud, systems & database administration, networking, performance, and application operations
  • Proven experience in IT Operations and/or Site Reliability Engineering, with successful handling of Application Operations in a complex IT setup
  • Manage multi-cloud (AWS/Azure) environments
  • Engineering and implementing proactive monitoring of applications, infrastructure & databases
  • Engineering automation to self-heal and mature towards AIOps
  • Manage, innovate, and create processes, software and tools that continuously improve the availability, reliability, scalability, latency and efficiency of platforms
  • Engineer self-service portals, scalable platforms and repeatable processes that allow product teams to own the entire life cycle of their products, reducing the SRE dependency
  • Excellent communication skills with experience in managing, coaching, and building highly effective teams
  • Manage and inspire a team of full stack Site Reliability Engineers across regions and time zones, emphasizing collaboration and efficiency
  • Establish relationships with business teams & other IT partners
  • Identifying and measuring KPIs like CSAT/NPS scores, and establishing feedback channels which have a direct correlation to UX
  • Cost management through forecasting consumption, budgeting, tagging assets & tracking cost, disposing of unused allocations & right-sizing, optimizing usage & correlating cost to business value
  • Establish an incident & defect review process to help guide and continually improve the stability of applications
  • Shapes and leverages advanced conceptual thinking to solve complex and/or completely novel situations
  • Actively pursues innovative solutions that align with the company's tolerance for risk (business and reputational)
  • Looks at external companies, products and capabilities and how they may accelerate Ratings technology initiatives

Preferred Qualifications:
  • Experience in application & data architecture, system design, algorithms, data structures, complexity analysis, and software design
  • Ability to architect high availability applications and servers on cloud, adhering to best practices
  • Ability to perform technical deep-dives into code, networking, systems, databases and storage configuration
  • Experience working in Agile software product development
  • Experience working with stakeholders and collaborating across organizational boundaries
  • Configuration management, automation of patching, threat and vulnerability management, security monitoring, network security, endpoint security, cloud application and data security
  • Awareness of security frameworks like NIST to address technology, information and resilience risk, information security and risk management
  • Support & transform ITSM processes (Incident, Change & Problem management) to align with DevOps maturity

S&P Global Market Intelligence

Financial Services

New York

approximately 20,000 Employees

627 Jobs

Key People

  • Eddie Fishman (VP, Product Management)
  • J. P. O'Connor (Head of Product Management)
