6.0 - 10.0 years
12 - 17 Lacs
Hyderabad
Work from Office
Position Name: Lead SRE (Site Reliability Engineer)
Position Mode:
Experience: 6+ Years
Joining Location: Hyderabad (Work from office only)
Mode of Interview: 2-3 rounds (Virtual/In-person)
Notice: Immediate - 15 Days Max
Technical Skill Requirements: ServiceNow Business Analyst, ITIL, ITSM, Dashboard Creation, APM, Scripting, Datadog
Role and Responsibilities:
- 6+ years of experience as an SRE, with thorough knowledge of ITIL/ITSM processes
- Certification in the ITIL v4 framework and deep knowledge of ITSM platforms preferable
- Hands-on experience with the APM tool Datadog
- Demonstrable ability to implement complex process workflows and evidence performance through metrics-driven reporting
- Strong understanding of IT Operations
- Strong written and verbal communication skills, with the ability to understand and present complex technical information clearly and concisely to a variety of audiences, including executive leadership
- Ability to develop strategic relationships with other teams, departments, business stakeholders, and third parties
- Ability to understand business requirements and define KPIs that showcase the stability of the application in production and give meaningful insights to the business
- Proven troubleshooting experience and a strong focus on incident reduction
- Should be able to surface recurring issues and toil and suggest automations
- Strong problem-solving skills and the ability to think quickly and execute on short time frames
Posted 2 months ago
3.0 - 6.0 years
8 - 13 Lacs
Bengaluru
Work from Office
SRE - System Engineer
Experience: 4 to 7 years
Job description
- Knowledge of Linux/Unix administration
- Knowledge of networking
- Knowledge of a wide variety of open source technologies/tools and cloud services
- Knowledge of best practices and IT operations in an always-up, always-available service
- Participate in on-call rotation
- Good English communication skills
- Strong background in Linux networking (ip, iptables, ipsec)
- Knowledge of working with MySQL
- A working understanding of code and scripting (e.g. Perl/Golang preferred)
- Knowledge of automation/configuration management using either SaltStack or equivalent
- Hands-on experience in private and public cloud environments
Knowledge of the following will be a plus:
- Implementation of cloud services on Linux using KVM/QEMU
- DCOS (Mesos & Mesos frameworks)
- Aerospike (NoSQL)
- Perl/Golang
- Galera
- OpenBSD
- Data center related activities (rarely needed)
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Posted 2 months ago
3.0 - 6.0 years
9 - 13 Lacs
Bengaluru
Work from Office
SRE - System Engineer
Job description
- Knowledge of Linux/Unix administration
- Knowledge of networking
- Knowledge of a wide variety of open source technologies/tools and cloud services
- Knowledge of best practices and IT operations in an always-up, always-available service
- Participate in on-call rotation
- Good English communication skills
- Strong background in Linux networking (ip, iptables, ipsec)
- Knowledge of working with MySQL
- A working understanding of code and scripting (e.g. Perl/Golang preferred)
- Knowledge of automation/configuration management using either SaltStack or equivalent
- Hands-on experience in private and public cloud environments
Knowledge of the following will be a plus:
- Implementation of cloud services on Linux using KVM/QEMU
- DCOS (Mesos & Mesos frameworks)
- Aerospike (NoSQL)
- Perl/Golang
- Galera
- OpenBSD
- Data center related activities (rarely needed)
PhonePe Full Time Employee Benefits (Not applicable for Intern or Contract Roles)
Insurance Benefits - Medical Insurance, Critical Illness Insurance, Accidental Insurance, Life Insurance
Wellness Program - Employee Assistance Program, Onsite Medical Center, Emergency Support System
Parental Support - Maternity Benefit, Paternity Benefit Program, Adoption Assistance Program, Day-care Support Program
Mobility Benefits - Relocation benefits, Transfer Support Policy, Travel Policy
Retirement Benefits - Employee PF Contribution, Flexible PF Contribution, Gratuity, NPS, Leave Encashment
Other Benefits - Higher Education Assistance, Car Lease, Salary Advance Policy
Posted 2 months ago
2.0 - 6.0 years
5 - 9 Lacs
Bengaluru
Work from Office
Who We Are
Applied Materials is the global leader in materials engineering solutions used to produce virtually every new chip and advanced display in the world. We design, build and service cutting-edge equipment that helps our customers manufacture display and semiconductor chips, the brains of devices we use every day. As the foundation of the global electronics industry, Applied enables the exciting technologies that literally connect our world, like AI and IoT. If you want to work beyond the cutting-edge, continuously pushing the boundaries of science and engineering to make possible the next generations of technology, join us to Make Possible® a Better Future.
What We Offer
Location: Bangalore, IND
At Applied, we prioritize the well-being of you and your family and encourage you to bring your best self to work. Your happiness, health, and resiliency are at the core of our benefits and wellness programs. Our robust total rewards package makes it easier to take care of your whole self and your whole family. We're committed to providing programs and support that encourage personal and professional growth and care for you at work, at home, or wherever you may go. Learn more about our benefits.
You'll also benefit from a supportive work culture that encourages you to learn, develop and grow your career as you take on challenges and drive innovative solutions for our customers. We empower our team to push the boundaries of what is possible while learning every day in a supportive leading global company. Visit our Careers website to learn more about careers at Applied.
About Applied
Applied Materials is the leader in materials engineering solutions used to produce virtually every new chip and advanced display in the world. Our expertise in modifying materials at atomic levels and on an industrial scale enables customers to transform possibilities into reality. At Applied Materials, our innovations make possible the technology shaping the future.
Key Responsibilities: Take full end-to-end responsibility for DevOps tasks, predominantly in the Azure, Jenkins, CI/CD, and Python domains. Design, implement, and maintain CI/CD pipelines using Azure DevOps, GitHub Actions, and other relevant tools. Manage and troubleshoot Kubernetes architecture and containerized environments (Docker). Collaborate with development teams to ensure seamless integration and deployment of applications. Implement and manage system monitoring and centralized logging platforms (e.g., Prometheus, Loki). Utilize configuration management tools (e.g., Ansible) for automation and consistency. Maintain and optimize artifact and container registry tools (e.g., Artifactory). Provide technical support and guidance to junior team members. Skill & Experience: Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent experience. At least 8 years of experience in a DevOps or Infrastructure Support role. Strong Linux system administration and networking skills. Hands-on experience with CI/CD systems (e.g., Azure DevOps, GitLab CI, GitHub Actions). Solid understanding of Kubernetes architecture and hands-on troubleshooting experience. Experience with Helm for managing Kubernetes deployments. Proven scripting skills in Bash and Python. Proficiency with virtualized and containerized environments (Kubernetes and Docker). Excellent communication skills and ability to collaborate across teams. Additional Skills: SRE experience is an added advantage.
Experience with system monitoring and centralized logging tools (e.g., Prometheus, Loki) Familiarity with configuration management tools (e.g., Ansible) Knowledge of artifact and container registry tools (e.g., Artifactory) Strong customer service mindset and ability to collaborate across teams Self-motivated, adaptable, and able to prioritize in a fast-paced environment Ability to work under pressure with a sense of urgency and accountability Applied Materials is committed to diversity in its workforce including Equal Employment Opportunity for Minorities, Females, Protected Veterans and Individuals with Disabilities. Additional Information Time Type: Full time Employee Type: Assignee / Regular Travel: Yes, 10% of the Time Relocation Eligible: Yes Applied Materials is an Equal Opportunity Employer. Qualified applicants will receive consideration for employment without regard to race, color, national origin, citizenship, ancestry, religion, creed, sex, sexual orientation, gender identity, age, disability, veteran or military status, or any other basis prohibited by law.
Posted 2 months ago
8.0 - 13.0 years
25 - 30 Lacs
Bengaluru
Work from Office
Dreaming big is in our DNA. It's who we are as a company. It's our culture. It's our heritage. And more than ever, it's our future. A future where we're always looking forward. Always serving up new ways to meet life's moments. A future where we keep dreaming bigger. We look for people with passion, talent, and curiosity, and provide them with the teammates, resources and opportunities to unleash their full potential. The power we create together when we combine your strengths with ours is unstoppable. Are you ready to join a team that dreams as big as you do?
AB InBev GCC was incorporated in 2014 as a strategic partner for Anheuser-Busch InBev. The center leverages the power of data and analytics to drive growth for critical business functions such as operations, finance, people and technology. The teams are transforming operations through Tech and Analytics. Do you Dream Big? We Need You.
Job Description
Job Title: Junior Site Reliability Engineer
Location: Bangalore
Reporting to: Senior Manager
Purpose of the role: We are looking for a motivated Junior Site Reliability Engineer with a passion for technology and a product-focused mindset. The ideal candidate will play a key role in developing, maintaining, and optimizing infrastructure while collaborating with cross-functional teams to enhance and improve system performance.
Key tasks & accountabilities:
- Develop and maintain applications using Python or .NET and ReactJS, enhancing functionality.
- Perform database management tasks for MSSQL, including optimization and troubleshooting.
- Collaborate with the development and operations teams to manage source control tools, including Git, GitHub, and Azure DevOps.
- Support cloud-based deployments with a focus on Azure, with additional exposure to AWS and GCP as needed.
- Maintain and enhance system configurations across Windows and Linux environments.
- Learn and adapt to emerging technologies to support the team in achieving operational excellence.
Qualifications, Experience, Skills
Level of Educational Attainment Required: Bachelor's degree in Computer Science, Information Technology, or a related field.
Technical Expertise: Basic understanding of cloud computing concepts and SRE practices.
And above all of this, an undying love for beer! We dream big to create a future with more cheers.
Posted 2 months ago
7.0 - 12.0 years
30 - 35 Lacs
Pune
Work from Office
About The Role:
Job Title: Production Specialist, AVP
Location: Pune, India
Role Description
Our organization within Deutsche Bank is AFC Production Services. We are responsible for providing technical L2 application support for business applications. The AFC (Anti-Financial Crime) line of business has a current portfolio of 25+ applications. The organization is in the process of transforming itself using Google Cloud and many new technology offerings. As an Assistant Vice President, your role will include hands-on production support, and you will be actively involved in resolving technical issues across multiple applications. You will also work as application lead and will be responsible for technical and operational processes for all applications you support.
Deutsche Bank's Corporate Bank division is a leading provider of cash management, trade finance and securities finance. We complete green-field projects that deliver the best Corporate Bank - Securities Services products in the world. Our team is diverse, international, and driven by a shared focus on clean code and valued delivery. At every level, agile minds are rewarded with competitive pay, support, and opportunities to excel. You will work as part of a cross-functional agile delivery team. You will bring an innovative approach to software development, focusing on using the latest technologies and practices, as part of a relentless focus on business value. You will be someone who sees engineering as a team activity, with a predisposition to open code, open discussion and creating a supportive, collaborative environment. You will be ready to contribute to all stages of software delivery, from initial analysis right through to production support.
What we'll offer you
As part of our flexible scheme, here are just some of the benefits that you'll enjoy: Best in class leave policy.
Gender neutral parental leaves.
100% reimbursement under childcare assistance benefit (gender neutral).
Sponsorship for industry-relevant certifications and education.
Employee Assistance Program for you and your family members.
Comprehensive Hospitalization Insurance for you and your dependents.
Accident and Term Life Insurance.
Complimentary health screening for 35 yrs. and above.
Your key responsibilities
Provide technical support by handling and consulting on BAU, Incidents/emails/alerts for the respective applications. Perform post-mortem, root cause analysis using ITIL standards of Incident Management, Service Request fulfillment, Change Management, Knowledge Management, and Problem Management. Manage regional L2 team and vendor teams supporting the application. Ensure the team is up to speed and picks up the support duties. Build up technical subject matter expertise on the applications being supported including business flows, application architecture, and hardware configuration. Define and track KPIs, SLAs and operational metrics to measure and improve application stability and performance. Conduct real time monitoring to ensure application SLAs are achieved and maximum application availability (up time) using an array of monitoring tools. Build and maintain effective and productive relationships with the stakeholders in business, development, infrastructure, and third-party systems / data providers & vendors. Assist in the process to approve application code releases as well as tasks assigned to support to perform. Keep key stakeholders informed using communication templates. Approach support with a proactive attitude, desire to seek root cause, in-depth analysis, and strive to reduce inefficiencies and manual efforts. Mentor and guide junior team members, fostering technical upskill and knowledge sharing.
Provide strategic input into disaster recovery planning, failover strategies and business continuity procedures. Collaborate and deliver on initiatives and install these initiatives to drive stability in the environment. Perform reviews of all open production items with the development team and push for updates and resolutions to outstanding tasks and recurring issues. Drive service resilience by implementing SRE (site reliability engineering) principles, ensuring proactive monitoring, automation and operational efficiency. Ensure regulatory and compliance adherence, managing audits, access reviews, and security controls in line with organizational policies. The candidate will have to work in shifts as part of a rota covering APAC and EMEA hours between 07:00 IST and 09:00 PM IST (2 shifts). In the event of major outages or issues we may ask for flexibility to help provide appropriate cover. Weekend on-call coverage needs to be provided on a rotational/need basis.
Your skills and experience
9-15 years of experience in providing hands-on IT application support. Experience in managing vendor teams providing 24x7 support. Preferred: team lead role experience; experience in an investment bank or financial institution. Bachelor's degree from an accredited college or university with a concentration in Computer Science or an IT-related discipline (or equivalent work experience/diploma/certification). Preferred: ITIL v3 foundation certification or higher. Knowledgeable in cloud products like Google Cloud Platform (GCP) and hybrid applications. Strong understanding of ITIL/SRE/DevOps best practices for supporting a production environment. Understanding of KPIs, SLOs, SLAs and SLIs. Monitoring Tools: Knowledge of Elasticsearch, Control-M, Grafana, Geneos, OpenShift, Prometheus, Google Cloud Monitoring, Airflow, Splunk.
Working Knowledge of creation of Dashboards and reports for senior management Red Hat Enterprise Linux (RHEL) professional skill in searching logs, process commands, start/stop processes, use of OS commands to aid in tasks needed to resolve or investigate issues. Shell scripting knowledge a plus. Understanding of database concepts and exposure in working with Oracle, MS SQL, Big Query etc. databases. Ability to work across countries, regions, and time zones with a broad range of cultures and technical capability. Skills That Will Help You Excel Strong written and oral communication skills, including the ability to communicate technical information to a non-technical audience and good analytical and problem-solving skills. Proven experience in leading L2 support teams, including managing vendor teams and offshore resources. Able to train, coach, and mentor and know where each technique is best applied. Experience with GCP or another public cloud provider to build applications. Experience in an investment bank, financial institution or large corporation using enterprise hardware and software. Knowledge of Actimize, Mantas, and case management software is good to have. Working knowledge of Big Data Hadoop/Secure Data Lake is a plus. Prior experience in automation projects is great to have. Exposure to python, shell, Ansible or other scripting language for automation and process improvement Strong stakeholder management skills ensuring seamless coordination between business, development, and infrastructure teams. Ability to manage high-pressure issues, coordinating across teams to drive swift resolution. Strong negotiation skills with interface teams to drive process improvements and efficiency gains. How we'll support you Training and development to help you excel in your career. Coaching and support from experts in your team A culture of continuous learning to aid progression. A range of flexible benefits that you can tailor to suit your needs.
Posted 2 months ago
5.0 - 8.0 years
16 - 25 Lacs
Noida, Jaipur, Delhi / NCR
Hybrid
As a Senior Site Reliability Engineer (SRE), you will play a critical role in ensuring the reliability, scalability, and performance of our cloud infrastructure on AWS. You will collaborate with cross-functional teams to design, implement, and manage systems and processes that enable continuous availability and seamless operation of our applications and services. The ideal candidate will have extensive experience in AWS cloud technologies, strong problem-solving skills, and a passion for building resilient and efficient systems.
Responsibilities:
- Design, implement, and maintain highly available and scalable cloud infrastructure on the AWS platform.
- Develop and implement automated monitoring, alerting, and incident response mechanisms to ensure proactive identification and resolution of system issues.
- Collaborate with software engineering teams to establish Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure system reliability and performance.
- Integrate security practices into the DevOps pipeline, ensuring the implementation of security controls at every stage of the software development lifecycle.
- Architect, deploy, and manage cloud infrastructure at scale, with a focus on security best practices and compliance requirements.
- Monitor security alerts and incidents and respond promptly to security breaches and incidents.
- Conduct regular performance analysis and capacity planning to anticipate and address scaling requirements.
- Implement and maintain disaster recovery and failover strategies to mitigate service disruptions and ensure business continuity.
- Lead incident response and post-mortem analysis to identify root causes and implement preventive measures.
- Continuously improve system reliability through automation, optimization, and implementation of best practices.
- Stay updated with the latest AWS services and technologies and evaluate their applicability to enhance our infrastructure and operations.
- Mentor junior team members and foster a culture of collaboration, learning, and continuous improvement.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field. Master's degree preferred.
- AWS Certified Solutions Architect - Professional or AWS Certified DevOps Engineer - Professional certification is required.
- 6 - 7 years of experience in Site Reliability Engineering, DevOps, or related roles, with a focus on AWS cloud technologies.
- Strong understanding of cloud architecture principles and experience with AWS services such as EC2, S3, RDS, Lambda, DynamoDB, etc.
- Proficiency in scripting and automation using languages such as Python, Bash, or PowerShell.
- Experience with infrastructure as code (IaC) tools such as Terraform or CloudFormation for provisioning and configuration management.
- Hands-on experience with monitoring, logging, and observability tools such as CloudWatch, Prometheus, Grafana, ELK stack, etc.
- Solid understanding of CI/CD principles and experience with related tools like Jenkins, GitLab CI/CD, or AWS CodePipeline.
- Excellent problem-solving skills and the ability to troubleshoot complex issues in distributed systems.
- Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and influence stakeholders at all levels.
Posted 2 months ago
10.0 - 12.0 years
40 - 50 Lacs
Pune
Work from Office
Key Responsibilities: Collaborate with U.S.-based counterparts to define and monitor service SLOs, SLAs, and key performance indicators. Lead root cause analysis, blameless postmortems, and reliability improvements across environments. Review application code (primarily Java/Spring) to assist in identifying defects and systemic performance issues. Automate deployment pipelines, recovery workflows, and runbook processes to minimize manual intervention. Build and manage dashboards, alerts, and health checks using tools like DataDog, Azure Monitor, Prometheus, and Grafana. Contribute to architectural decisions with a lens on performance and operability. Guide and mentor offshore team members in incident response and production readiness. Participate in 24x7 support rotations aligned with EST coverage expectations. Required Experience & Skills: 10+ years in SRE, DevOps, or platform engineering experience, ideally supporting U.S. enterprise systems. Strong hands-on experience with Java/Spring Boot applications, with the ability to assist in code-level troubleshooting. Cloud infrastructure knowledge (Azure preferred) and container orchestration (Kubernetes). Proficient with logging/monitoring stacks (DataDog, ELK, Azure Monitor, Dynatrace, Splunk). Experience with ServiceNow (SNOW) for ITSM processes. Experience with Terraform or ARM templates, CI/CD automation, and scripting (Python, Bash). Familiarity with Salesforce systems highly preferred. Excellent communication skills and outstanding problem-solving ability in distributed environments. Demonstrated history of improving stability, availability, and delivery velocity for large-scale platforms.
Posted 2 months ago
10.0 - 15.0 years
12 - 17 Lacs
Mumbai, Kandivali, Kurla
Work from Office
We are seeking a highly skilled Operational Manager to lead and manage our Cloud Managed Services team. The ideal candidate will oversee day-to-day operations, ensure service delivery excellence, optimize cloud environments (with focus on GCP,Azure & AWS), and drive operational efficiencies. You will be responsible for managing a diverse team, collaborating with stakeholders, and ensuring compliance with service-level agreements (SLAs). Key Responsibilities: Operational Leadership: Manage and lead a team of cloud engineers and support staff in delivering managed services across GCP and other cloud platforms (AWS, Azure). Oversee day-to-day operations, including incident management, problem resolution, and change management. Ensure adherence to ITIL practices, industry standards, and best practices. Service Delivery & Performance Management: Ensure compliance with SLAs, OLAs, and customer satisfaction metrics. Monitor, analyze, and report on cloud service performance, identifying areas for improvement. Implement proactive monitoring, automation, and operational improvements to enhance service delivery. Stakeholder Engagement: Collaborate with internal teams and external clients to understand business needs and deliver customized cloud solutions. Act as the primary point of escalation for operational issues and ensure timely resolution. Operational Efficiency & Automation: Drive automation initiatives to reduce manual intervention and improve operational efficiency. Implement cost optimization strategies and ensure cloud resources are used effectively. Team Management & Development: Recruit, mentor, and develop high-performing technical teams. Conduct regular performance reviews and provide professional growth opportunities. Compliance & Security: Ensure adherence to security and compliance policies across cloud environments. Manage audits, disaster recovery (DR), and business continuity (BC) processes. 
Required Skills & Qualifications: Bachelor's degree in Computer Science, Information Technology, or a related field. 10+ years of experience in IT operations, with at least 3 years in a managerial capacity. Strong knowledge of Google Cloud Platform (GCP) with hands-on experience in managing services like GCE (VMs), GKE (Kubernetes), BigQuery, and IAM. Proven expertise in cloud operations, monitoring, automation, and incident management. Experience managing hybrid and multi-cloud environments (GCP, AWS, Azure). Solid understanding of ITIL processes, service management frameworks, and SLA/OLA management. Strong leadership, communication, and problem-solving skills. Experience with Infrastructure as Code (IaC) tools (Terraform, Deployment Manager) and automation frameworks (Ansible, Chef, or Puppet). Knowledge of networking, security, and compliance in cloud environments. Good-to-Have Skills (GCP Focus): Google Professional Cloud Architect or Google Cloud Operations certification. Familiarity with FinOps principles for cost management and optimization. Understanding of SRE (Site Reliability Engineering) methodologies.
Posted 2 months ago
8.0 - 12.0 years
18 - 27 Lacs
Bengaluru
Work from Office
Are you an experienced Platform Engineer looking for a new opportunity to showcase your skills and expertise? If so, then Torry Harris is looking for you! We are currently seeking a skilled and motivated individual to join our team and play a critical role in streamlining and automating our cloud infrastructure. As a Lead Platform Engineer, you will architect, build and lead the development of our internal platform infrastructure. In this role, you will guide a team of engineers to deliver scalable, secure, and developer-friendly platforms that accelerate product delivery. You will collaborate across engineering, security, and operations to define best practices and drive platform strategy.
Roles and Responsibilities
• 8+ years of experience in DevOps, SRE, or platform engineering roles.
• Lead the design and implementation of platform architecture, ensuring scalability, reliability, and security.
• Mentor and guide a team of platform and DevOps engineers.
• Define and enforce best practices for infrastructure automation, CI/CD, observability, and cloud operations.
• Deep expertise in Kubernetes, containerization, and cloud-native technologies.
• Strong proficiency in cloud platforms (AWS, GCP, or Azure).
• Collaborate with software engineering teams to improve developer experience and productivity.
• Own the lifecycle of Kubernetes clusters, cloud infrastructure, and internal tooling.
• Drive adoption of GitOps, infrastructure as code (IaC), and platform-as-a-product principles.
• Monitor platform performance and lead incident response and root cause analysis.
• Evaluate and integrate new technologies to improve platform capabilities.
• Strong scripting or programming skills (Python, Go, Bash).
• Solid understanding of networking, security, and system design.
Posted 2 months ago
9.0 - 14.0 years
20 - 35 Lacs
Bengaluru
Work from Office
Lead automation and expense management initiatives across global network platforms. Ensure secure, cost-effective operations, enhance reliability via SRE practices, and oversee vendor TEM performance, reporting, and billing accuracy. Required Candidate profile Exp in network automation, CI/CD, and cost governance. Skilled in SRE, telecom expense management, circuit cleanup, vendor coordination, and performance reporting using Power BI and Microsoft 365.
Posted 2 months ago
10 - 15 years
20 - 30 Lacs
Pune
Work from Office
Role & responsibilities Assessment and Planning: Evaluate existing systems (On-premises, AWS, GCP, etc.), and associated enabling capabilities (identity, security, HA/DR, monitoring, backup/restore, reporting, integrations, etc.). Design and develop comprehensive migration strategies and plans. Evaluate, recommend, and implement 7 Rs cloud migration strategies - rehost, replatform, refactor, repurchase, retire, retain, and relocate. Migration Execution: Manage and execute the migration process, ensuring minimal downtime and data integrity, and using tools like Azure Migrate. Cloud Infrastructure Management: Configure, optimize, and monitor Azure resources, including but not limited to virtual machines, AKS, storage, networking, and other services. Technical Expertise: Provide technical guidance to project teams, troubleshoot issues, and ensure compliance with cloud security best practices. Technical Leadership: Develop, train, and build internal teams with Azure skills and build a practice/Center of Excellence Post-Migration Support: Provide documentation, training, and ongoing support to internal teams and clients. Optimization and Cost Efficiency: Continuously monitor and optimize cloud infrastructure performance and cost-efficiency. Collaboration: Work with cross-functional teams (developers, IT, security, compliance) to ensure seamless integration and alignment.
Posted 2 months ago
10 - 13 years
18 - 25 Lacs
Bengaluru
Hybrid
Hiring, Lead Site Reliability Engineer with following skills and expertise. What will this person do? Provide leadership in designing and implementing reliable, scalable, and secure infrastructure solutions. Develop and maintain observability solutions, ensuring visibility into system performance using native Azure Cloud solutions. Define and track SLIs, ensuring compliance with SLOs and SLAs. Lead incident response efforts, conduct root cause analysis, and implement preventive measures to minimize downtime. Automate infrastructure provisioning, configuration and management using Terraform & Ansible. Build and maintain robust Observability pipelines to support automated deployments and continuous monitoring practices. Continuously analyze system health and optimize performance by identifying and resolving bottlenecks. Work with our BCDR team to minimize business impact during failures and measure the quality of services. Work with Cloud Governance team to monitor cloud infrastructure spending and implement cost-saving strategies. Implement centralized logging, metric collection, and distributed tracing for troubleshooting and debugging. Deploy, Manage and Monitor containerized workloads. Maintain configuration consistency and compliance across cloud environments using tools like Ansible. Partner with software development teams to integrate reliability best practices into the application development lifecycle. Conduct detailed post-mortems, document learnings, and drive improvements to reduce future incidents. Develop automation scripts in Python, Bash, or other languages to reduce manual efforts and improve efficiency. Provide mentorship to junior engineers, fostering a culture of learning and continuous technical growth. Research and evaluate new technologies, tools, and methodologies to improve system reliability and efficiency. Maintain detailed documentation on infrastructure, monitoring setups, incident responses, and best practices. 
Qualifications Bachelor's degree in Computer Science, Engineering, or a related field. 10+ years in Observability, DevOps, and Site Reliability Engineering (SRE). At least 2 years of experience in defining Observability KPIs for both on-premises and cloud environments. Strong experience with cloud platforms (AWS, Azure, GCP) and cloud-native technologies. Passion for automation, reducing toil and implementing reliability-focused best practices. Deep knowledge of services/tools like Grafana, PowerBI, Prometheus, Azure Monitor, Application Insights & Azure Metrics. Expertise in Terraform, Ansible, Chef, and CI/CD pipeline tools like GitHub Actions, Jenkins, and GitOps methodologies. Working understanding of load balancing, authentication (AAA), encryption, and network parameters monitoring. Strong troubleshooting skills and experience handling on-call incidents and post-mortem analysis. Ability to work cross-functionally, drive technical discussions, and mentor junior engineers. Ability to work in a dynamic team environment and possess time management skills to meet deadlines. Sense of ownership and pride in your performance and its impact on the company's success. Critical thinker with problem-solving skills. Good interpersonal and communication skills.
Posted 2 months ago
7 - 10 years
10 - 15 Lacs
Kochi
Work from Office
Job Title - Site Reliability Engineer + Specialist + Global Song Management Level: 9, Specialist Location: Kochi Must have skills: Python, Go, or Java Good to have skills: Expertise with cloud platforms (AWS, Azure, GCP) and tools. Job Summary: As a Site Reliability Engineer (SRE), you'll bring together your software engineering expertise and systems knowledge to ensure our systems are scalable, reliable, and efficient. You'll be instrumental in automating operations, solving complex infrastructure challenges, and driving continuous improvement to deliver seamless and resilient services. Your responsibilities will include: Design, build, and maintain scalable infrastructure and systems. Automate operational tasks to improve efficiency and reliability. Implement application monitoring and continuous improvement of application performance and stability. Develop and implement disaster recovery and incident management strategies. Collaborate with developers to improve application architecture and deployment. Optimize system availability, latency, and performance metrics. Manage CI/CD pipelines for seamless software delivery. Perform root cause analysis and lead detailed post-mortems. Consult with software development teams to implement reliability best practices. Write and maintain infrastructure and operational documentation. Operational responsibility for a number of distributed applications, including on-call shifts. Roles & Responsibilities: Strong experience in software engineering and systems architecture. Multiple years of experience programming in languages such as Python, Go, or Java. Expertise with cloud platforms (AWS, Azure, GCP) and tools. Hands-on experience with infrastructure as code (Terraform, Ansible, etc.). Familiarity with Linux/Unix systems and networking fundamentals. Familiarity with containerization and orchestration tools like Docker and Kubernetes. Proven ability to monitor, debug, and optimize distributed systems. 
Experience managing CI/CD pipelines and automation frameworks. Strong problem-solving skills and attention to detail. Excellent communication and collaboration skills for cross-functional teamwork. Ability to analyze and improve complex systems for reliability and scalability. Self-motivated with a passion for continuous learning and improvement. Qualifications Experience: Minimum 7-10 years of experience is required. Educational Qualification: Any graduation / BE / B Tech
Posted 2 months ago
5 - 6 years
7 - 8 Lacs
Gurugram
Work from Office
Site Reliability Engineer Job Description: Requirements: We are seeking a proactive and technically strong Site Reliability Engineer (SRE) to ensure the stability, performance, and scalability of our Data Engineering Platform. You will work on cutting-edge technologies including Cloudera Hadoop, Spark, Airflow, NiFi, and Kubernetes, ensuring high availability and driving automation to support massive-scale data workloads, especially in the telecom domain. Key Responsibilities Ensure platform uptime and application health as per SLOs/KPIs Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc. Debug and resolve complex production issues, performing root cause analysis Automate routine tasks and implement self-healing systems Design and maintain dashboards, alerts, and operational playbooks Participate in incident management, problem resolution, and RCA documentation Own and update SOPs for repeatable processes Collaborate with L3 and Product teams for deeper issue resolution Support and guide L1 operations team Conduct periodic system maintenance and performance tuning Respond to user data requests and ensure timely resolution Address and mitigate security vulnerabilities and compliance issues Technical Skillset Hands-on with Spark, Hive, Cloudera Hadoop, Kafka, Ranger Strong Linux fundamentals and scripting (Python, Shell) Experience with Apache NiFi, Airflow, Yarn, and Zookeeper Proficient in monitoring and observability tools: ELK Stack, Prometheus, Loki Working knowledge of Kubernetes, Docker, Jenkins CI/CD pipelines Strong SQL skills (Oracle/Exadata preferred) Familiarity with DataHub, DataMesh, and security best practices is a plus Strong problem-solving and debugging mindset Ability to work under pressure in a fast-paced environment. Excellent communication and collaboration skills. Ownership, customer orientation, and a bias for action
Posted 2 months ago
5 - 8 years
6 - 10 Lacs
Pune
Work from Office
About The Role Role Purpose The purpose of this role is to provide solutions and bridge the gap between technology and business know-how to deliver any client solution. Do 1. Bridging the gap between project and support teams through techno-functional expertise For a new business implementation project, drive the end-to-end process from business requirement management to integration & configuration and production deployment Check the feasibility of the new change requirements and provide an optimal solution to the client with clear timelines Provide techno-functional solution support for all the new business implementations while building the entire system from scratch Support the solutioning team from architectural design, coding, testing and implementation Understand the functional design as well as technical design and architecture to be implemented on the ERP system Customize, extend, modify, localize or integrate to the existing product by virtue of coding, testing & production Implement the business processes, requirements and the underlying ERP technology to translate them into ERP solutions Write code as per the developmental standards to decide upon the implementation methodology Provide product support and maintenance to the clients for a specific ERP solution and resolve the day-to-day queries/technical problems which may arise Create and deploy automation tools/solutions to ensure process optimization and increase in efficiency Sync between technical and functional requirements of the project and provide solutioning/advice to the client or internal teams accordingly Support on-site manager with the necessary details with respect to any change and off-site support 2. Skill upgradation and competency building Clear Wipro exams and internal certifications from time to time to upgrade the skills Attend trainings, seminars to sharpen the knowledge in functional/technical domain Write papers, articles, case studies and publish them on the intranet
Deliver (No. / Performance Parameter / Measure):
1. Contribution to customer projects: Quality, SLA, ETA, no. of tickets resolved, problems solved, # of change requests implemented, zero customer escalation, CSAT
2. Automation: Process optimization, reduction in process/steps, reduction in no. of tickets raised
3. Skill upgradation: # of trainings & certifications completed, # of papers, articles written in a quarter
Mandatory Skills: SRE Operations. Experience: 5-8 Years. Reinvent your world. We are building a modern Wipro. We are an end-to-end digital transformation partner with the boldest ambitions. To realize them, we need people inspired by reinvention. Of yourself, your career, and your skills. We want to see the constant evolution of our business and our industry. It has always been in our DNA - as the world around us changes, so do we. Join a business powered by purpose and a place that empowers you to design your own reinvention. Come to Wipro. Realize your ambitions. Applications from people with disabilities are explicitly welcome.
Posted 2 months ago
15 - 20 years
50 - 55 Lacs
Hyderabad
Work from Office
The Role : Director, Application Operations, SRE (Site Reliability Engineering) The Team : This team is part of the global SRE group that provides Site Reliability Engineering Services for the critical applications used by the analysts for conducting the business. Application Operations team is responsible for the Stability (Uptime), Reliability (Quality & Performance) and Engineering of these applications to improve business outcomes, user experience and efficiencies. The Team operates at the intersection of IT operations and software development, ensuring that our services are not only robust but also agile enough to adapt to the ever-evolving business needs. Impact and Responsibilities : The Impact of this role extends far beyond the immediate team. You will be instrumental in shaping the reliability and performance standards of our critical applications, ensuring they meet the highest benchmarks. By driving advancements in automation and cloud technologies, you will contribute significantly to the organization's strategic goals and toil reduction, enhancing both the user experience and operational efficiency. You will nurture the team members to be the best-in-class by upskilling and cross-skilling. General & Team management: Ensure the team balances its focus between daily operational tasks and strategic long-term projects Drive the adoption of new technologies and processes through training and mentoring Lead/Mentor/Guide/Coach and transform a team of Application Operations to SREs Create/maintain documentation for systems and processes to ensure continuity and knowledge sharing within the team. 
Adoption of Gen AI to leverage knowledge repository Collaborate with cross-functional teams to ensure seamless integration and support for new technologies and initiatives Oversee daily operations and ensure the shifts are adequately managed Set the roadmap; derive goals for each team member; review, motivate and support to make them successful Stability: Build an SRE practice that improves system stability with Monitoring & AIOps. Avert P1/P2 incidents and minimize business impact Analyze system vulnerabilities, SPOFs and address them proactively to improve stability Refactor monolithic apps and databases to containerized services to improve delivery/scale Work with business users to understand needs, issues, develop root cause analysis and work with the cross-functional teams to address them permanently Reliability: Monitor system performance and create strategies to improve it Reduce the number of incidents and the time taken to resolve them (MTTR) Develop and implement disaster recovery plans to ensure business continuity Lead DevOps transformation to improve the delivery of value to business, reduction of costs & manual errors, increased velocity of releases and improved config management Engineering: Involvement in Architecture and Development design reviews (Shift-left) for new implementation and integration projects to build SRE best practices into the SDLC Continuously look for opportunities to automate tasks, simplify processes, and enable self-service to reduce toil Value Stream Alignment: While alignment as horizontal lead is expected to begin with, it is expected that you also handle the role of an SRE value stream lead going forward. Ensure smooth inter-working with value streams (VS) to meet the objectives & realize value Foster a 2-way knowledge sharing with VS and reduce dependency on SRE Help shepherd VS to improve SRE maturity levels; implement & prioritize best practices like monitoring, post-mortem, toil reduction, retrospectives etc. 
Application to User Journey orientation and transformation What's in it for you: In this role, you will have the opportunity to collaborate with a diverse and talented team, working on cutting-edge technology solutions to drive efficiency and innovation within the organization. You will be at the forefront of implementing best practices in site reliability engineering, with a strong emphasis on automation, cloud technologies, and performance optimization. You will interface with the value stream leads to improve the SRE practices and maturity levels within the value streams. What We're Looking For: Basic Qualifications: Bachelor's degree in computer science or equivalent is required, or in lieu, a demonstrated equivalence in work experience 15+ years of experience in Information Technology domain including cloud, systems & database administration, networking, performance, and application operations Proven experience in IT Operations and/or Site Reliability Engineering, successful handling of Application Operations in a complex IT setup Manage Multi-cloud (AWS/Azure) environments Engineering and implementing proactive monitoring of applications, infrastructure & databases. Engineering automation to self-heal and mature towards AIOps Manage, innovate, and create processes, software and tools that continuously improve the availability, reliability, scalability, latency and efficiency of platforms Engineer Self-service portals, Scalable platforms and repeatable processes that allow product teams to own the entire life cycle of their products, reducing the SRE dependency Excellent communication skills with experience in managing, coaching, and building highly effective teams. Manage and inspire a team of full stack Site Reliability Engineers across regions and time zones, emphasizing collaboration and efficiency. Establish relationships with business teams & other IT partners. 
Identifying and measuring KPIs like CSAT/NPS scores, establishing feedback channels which have a direct correlation to UX Cost management through forecasting consumption, budgeting, tagging assets & tracking cost, disposing unused allocations & right sizing, optimizing usage & correlating cost to business value Establish incident & defect review process to help guide and continually improve stability of applications Shapes and leverages advanced conceptual thinking to solve complex and/or completely new or novel situations that have never been dealt with before. Actively pursues innovative solutions that align with the company's tolerance for risk (business and reputational) Looks at external companies, products and capabilities and how they may accelerate Ratings technology initiatives Preferred Qualifications: Experience in application & data architecture, system design, algorithms, data structures, complexity analysis, and software design Ability to architect high availability applications and servers on cloud, adhering to best practices. Ability to perform technical deep-dives into code, networking, systems, databases and storage configuration Experience working in Agile software product development Experience working with stakeholders and collaborating across organizational boundaries. Configuration management, automation of patching, threat and vulnerability management, security monitoring, network security, endpoint security, cloud application and data security Awareness of security frameworks like NIST to address technology, information and resilience risk, information security and risk management Support & transform ITSM processes (Incident, Change & Problem management) to align with DevOps maturity
Posted 2 months ago
12 - 21 years
12 - 22 Lacs
Hyderabad, Ahmedabad
Hybrid
Summary: The SRE Manager at Techblocks India will lead the reliability engineering function, ensuring infrastructure resiliency and optimal operational performance. This hybrid role blends technical leadership with team mentorship and cross-functional coordination. Experience Required: 10+ years total experience, with 3+ years in a leadership role in SRE or Cloud Operations. Technical Knowledge and Skills: Mandatory: Deep understanding of Kubernetes, GKE, Prometheus, Terraform Cloud: Advanced GCP administration CI/CD: Jenkins, Argo CD, GitHub Actions Incident Management: Full lifecycle, tools like OpsGenie Nice to Have: Knowledge of service mesh and observability stacks Strong scripting skills (Python, Bash) BigQuery/Dataflow exposure for telemetry Scope: Build and lead a team of SREs Standardize practices for reliability, alerting, and response Engage with Engineering and Product leaders If you are interested, please call 9701923036.
Posted 2 months ago
7 - 12 years
9 - 15 Lacs
Coimbatore
Work from Office
* Expertise in Linux / Windows environments
* Exposure to Cloud (AWS / Azure)
* Proficiency in Docker/Kubernetes
* Hands-on with CI/CD Tools
* Exposure to Scripting
* Infrastructure as Code – Terraform or CloudFormation (preferred)
Required Candidate profile We are seeking an experienced Senior DevOps Engineer to lead and optimize our infrastructure and deployment processes.
Posted 2 months ago
4 - 6 years
6 - 8 Lacs
Bengaluru
Work from Office
We are looking for a Site Reliability Engineer! You'll make a difference by: SRE L1 Commander is responsible for ensuring the stability, availability, and performance of critical systems and services. As the first line of defense in incident management and monitoring, the role requires real-time response, proactive problem solving, and strong coordination skills to address production issues efficiently. Monitoring and Alerting: Proactively monitor system health, performance, and uptime using monitoring tools like Datadog, Prometheus. Serving as the primary responder for incidents to troubleshoot and resolve issues quickly, ensuring minimal impact on end-users. Accurately categorizing incidents, prioritizing them based on severity, and escalating to L2/L3 teams when necessary. Ensuring systems meet Service Level Objectives (SLOs) and maintain uptime as per SLAs. Collaborating with DevOps and L2 teams to automate manual processes for incident response and operational tasks. Performing root cause analysis (RCA) of incidents using log aggregators and observability tools to identify patterns and recurring issues. Following predefined runbooks/playbooks to resolve known issues and document fixes for new problems. You'd describe yourself as: Experienced professional with 4 to 6 years of relevant experience in SRE, DevOps, or Production Support with monitoring tools (e.g., Prometheus, Datadog). Working knowledge of Linux/Unix operating systems, basic scripting skills (Python, GitLab actions) and cloud platforms (AWS, Azure, or GCP). Familiarity with container orchestration (Kubernetes, Docker, Helm charts) and CI/CD pipelines. Exposure to ArgoCD for implementing GitOps workflows and automated deployments for containerized applications. Possessing experience in Monitoring: Datadog, Infrastructure: AWS EC2, Lambda, ECS/EKS, RDS, Networking: VPC, Route 53, ELB and Storage: S3, EFS, Glacier. Strong troubleshooting and analytical skills to resolve production incidents effectively. 
Basic understanding of networking concepts (DNS, Load Balancers, Firewalls). Good communication and interpersonal skills for incident communication and escalation. Having preferred certifications: AWS Certified SysOps Administrator Associate, AWS Certified Solutions Architect Associate or AWS Certified DevOps Engineer Professional
Posted 2 months ago
5 - 10 years
20 - 25 Lacs
Pune
Work from Office
In the ever-changing SaaS landscape there are a few persistent items that contribute to developing quality solutions with speed. Namely, ensuring operational activities are treated as software development enhancements, manual tasks are remediated through automation, risk reduction through compartmentalization of services/code and consumption of readily available provider services. Product/development teams require an accountable partner to advance on these topics. The SRE (Site Reliability Engineering) team will be this partner. The SRE team will support the Siemens Xcelerator platform and will be responsible for identifying, managing, improving, and reporting on availability, resiliency, reliability, and stability efficiencies. This includes providing technical guidance and leadership to drive solutions, create & enhance processes that deliver excellence. A strong relationship with the various product teams of the Xcelerator platform is necessary to support core objectives. This role's success will be defined by product teams meeting their SLOs with healthy product adoption and operational excellence. This position will be responsible for supporting technology and culture through an enterprise ecosystem to ensure developers and products exceed product SLOs (Service Level Objectives) and clearly, without dispute, benefit from every interaction with the SRE team. 
Responsibilities Incident Management, Game Day coordination, Create and drive Metric/observability solutions and reviews Support production readiness reviews Cross division role model to advance the SRE practice in Siemens Complete technological control over methods of automation, codifying optional activities, microservice architecture, platform engineering to ensure changes, updates or technical advancements are in place for a product Ensure the team can provide the design, deployment, automation, and scripting solutions to drive new capabilities, visibility, and efficiency Simplify highly complex ideas, architectures and concepts to encourage achievable adoption Collaborate with other technical platforms and partners to engineer automated and integrated solutions between tools, services, teams that increase availability, reliability, and performance Own and ensure the internal and external SLAs meet and exceed expectations Be part of maintaining a 24x7, global, highly available SaaS environment Participate in an on-call rotation that supports our production infrastructure Troubleshoot production availability incidents that often span across multiple teams and services Ensure the SRE team can coordinate production incident post-mortems, and contribute to solutions to prevent problem recurrence; with the goal of automated response to all non-exceptional service conditions Communicate to business and technical partners on incidents as they occur when they impact system performance or availability at a critical level Required Knowledge/Skills, Education, and Experience Bachelor's Degree or equivalent experience; Proven experience as a Site Reliability Engineer or equivalent role; Experience working in a large organization through an SRE transformation where existing applications were adapted to contemporary targets Proven experience with automation via scripting & API development Experience with software development in the cloud Experience with monitoring 
tools (Datadog, CloudWatch, CloudTrail, Cloudability, or equivalent tools) Proven experience with containerization, specifically Kubernetes Experience with Amazon Web Services (AWS) services and Terraform, CloudFormation, Ansible, or equivalent tools Preferred Knowledge/Skills, Education, and Experience Desired certifications include Datadog, Kubernetes, Security, AWS certification Understanding of ITIL Deep understanding of SRE and Incident management strategies Experience with issue/incident tracking tools (ServiceNow, ServiceDesk, Jira or equivalent tools) and open source tools (Linux, Python, Git, Ansible) Experience in an enterprise IT environment with distributed environments Networking concepts, including firewalls, VPN, routing, load balancers, security and DNS Senior level system administration experience, including troubleshooting, support, mentorship/training, and oversight Why us? Working at Siemens Software means flexibility - Choosing between working at home and the office at other times is the norm here. We offer great benefits and rewards, as you'd expect from a world leader in industrial software. A collection of over 377,000 minds building the future, one day at a time in over 200 countries. We're dedicated to equality, and we welcome applications that reflect the diversity of the communities we work in. All employment decisions at Siemens are based on qualifications, merit, and business need. Bring your curiosity and creativity and help us shape tomorrow! Siemens Software. Transform the Everyday
Posted 2 months ago
5 - 8 years
15 - 25 Lacs
Chennai, Bengaluru
Work from Office
We are looking for a Senior Platform Engineer (Airflow & Control-M) with 5-10 years of experience to join our team in Bangalore or Chennai. The ideal candidate will have strong expertise in Airflow, Control-M, Kubernetes, Observability (OpenTelemetry), Python, and Bash scripting. The role involves managing critical data workflows, enhancing platform automation, and ensuring system reliability and scalability. Excellent communication skills and hands-on experience in stabilizing production environments are essential.
Posted 2 months ago
5 - 10 years
20 - 25 Lacs
Bengaluru
Work from Office
As an SRE Lead, you will oversee system health, manage escalations, track and ensure ticket closures, follow up on issues, and enhance support processes to deliver a seamless operation. Define and uphold Service Level Indicators (SLIs) and SLOs. Required Candidate profile Prior experience in an SRE, IT operations, or support leadership role. • Knowledge of ticketing and ITSM tools (ServiceNow, Jira Service). Java, microservice architecture, Kubernetes, and cloud are a must.
Posted 2 months ago
2 - 7 years
6 - 8 Lacs
Bengaluru
Work from Office
Looking for a Junior Site Reliability Engineer (SRE) with strong Java coding and debugging skills. This role is ideal for candidates passionate about Java, DevOps, cloud technologies, and automation in a fast-paced environment. Required Candidate profile Strong Java programming and debugging skills (must-have). • Experience with Linux systems, networking, and cloud platforms (AWS, Azure, or GCP). • Familiarity with Prometheus, Grafana, or New Relic
Posted 2 months ago
7 - 12 years
25 - 30 Lacs
Mumbai
Work from Office
Minimum of 6+ years of professional experience in software development, Site Reliability Engineering (SRE), DevOps, or Release Engineering. Proven experience in designing, automating, and managing CI/CD pipelines and cloud infrastructure using IaC tools in both Azure and AWS environments.
Posted 2 months ago