15.0 - 20.0 years
50 - 60 Lacs
Pune
Hybrid
Principal, IT Resiliency Architect. The Principal, IT Resiliency Architect will be responsible for evaluating and designing complex IT infrastructure solutions across a vast range of technologies within the IT Resiliency Team. The IT Resiliency Team is dedicated to ensuring the resilience and continuity of our operations in the face of unforeseen disasters. The architect must have strong program management skills to work with cross-functional infrastructure and application owners to assess, design, and implement technical solutions for the Disaster Recovery (DR) program. The architect should be able to troubleshoot, identify root causes, and seek a formal solution with other teams as required. This hands-on role requires experience with data center, networking, security, storage, virtualization, database, and middleware technologies. Responsibilities: As an expert, the IT Resiliency Architect will be on a team that delivers end-to-end technical resiliency solutions to the organization, utilizing the latest technologies and leveraging automation mechanisms to reduce recovery times. The Resiliency Architect should have solid program management and leadership skills to engage various teams. Disaster Recovery Orchestration Tools: Configure and integrate interfaces between existing systems and the DR Orchestration tool. Integrate project management, disaster recovery and functional business expertise to create a superior customized solution for the Resiliency Team. Work with teams to assess and implement high availability and seamless failover resiliency mechanisms across multiple layers of the application and infrastructure stacks. Develop documentation for onboarding applications onto this toolset, create training modules, and apply project management expertise to ensure project milestones are met. Apply new technologies to the design of highly complex infrastructure and software solutions.
Posted 1 month ago
7.0 - 12.0 years
13 - 17 Lacs
Hyderabad
Work from Office
YOUR IMPACT: We are seeking a highly skilled and experienced Level 3 Site Reliability Engineer (SRE) to join our Cloud Operations team. This role is critical in driving advanced engineering initiatives to ensure infrastructure reliability, scalability, and automation across multi-cloud environments. As an L3 SRE, you will lead complex cloud support operations, troubleshoot infrastructure as code, implement observability frameworks, and guide junior SREs while helping shape future architectural direction. This role demands hands-on expertise in AWS, Azure, or GCP, advanced scripting, and deep observability integration, contributing directly to uptime, automation maturity, and strategic improvements to cloud infrastructure. WHAT THE ROLE OFFERS: Cloud Infrastructure & Architecture: Architect and maintain scalable, resilient systems across AWS, Azure, and GCP. Lead cloud adoption and migration strategies while ensuring minimal disruption and high reliability. Implement security and governance controls including VPC, Security Groups, Route53, ACM, and Security Hub. Perform deep infrastructure troubleshooting and root cause analysis, especially with IaC-based deployments. Infrastructure as Code (IaC) & Configuration Management: Design and manage infrastructure using Terraform, Terragrunt, and CloudFormation. Oversee configuration management using tools like AWS SSM, SaltStack, and Packer. Review and remediate issues within Git-based CI/CD workflows for IaC and service deployment. Observability & Monitoring: Build and maintain monitoring/alerting pipelines using CloudWatch, EventBridge, SNS, and Hund.io. Develop custom observability tooling for end-to-end visibility and proactive issue detection. Lead incident response and contribute to post-incident reviews and reliability reports. Automation, Scripting & CI/CD: Develop and maintain automation tools using Bash, Python, Ruby, or PHP. Integrate deployment pipelines into secure, scalable CI/CD processes.
Automate vulnerability assessments and compliance scans with ISO 27001 standards. Containerization & Microservices Support Lead container platform deployments using EKS, ECS, ECR, and Fargate. Guide engineering teams in Kubernetes resource optimization and troubleshooting. Database & Storage Management Provide advanced operational support for RDS, PostgreSQL, and Elasticsearch. Monitor database performance and ensure availability across distributed systems. Mentorship & Strategy Mentor L1 and L2 SREs on technical tasks and troubleshooting best practices. Contribute to cloud architecture planning, operational readiness, and process improvements. Help define and track Key Performance Indicators (KPIs) related to system uptime, MTTR, and automation coverage. WHAT YOU NEED TO SUCCEED: 7-12 years of experience in Site Reliability Engineering or DevOps roles. Advanced expertise in multi-cloud environments (AWS, Azure, GCP). Strong Linux and Windows administration background (Fedora, Debian, Microsoft). Proficiency in Terraform, Terragrunt, CloudFormation, and config management tools. Hands-on with monitoring tools like CloudWatch, SNS, EventBridge, and third-party integrations. Advanced scripting skills in Python, Bash, Ruby, or PHP. Knowledge of container platforms including EKS, ECS, and Fargate. Familiarity with Vulnerability Management, ISO 27001, and audit-readiness practices.
Posted 1 month ago
10.0 - 14.0 years
15 - 20 Lacs
Bengaluru
Work from Office
Site Reliability Engineer - Wintel, Linux, Vmware, Redhat Devops CI/CD AWS Design, implement, and maintain scalable and reliable compute infrastructure, with a focus on Wintel, Linux, VMWare, and Redhat KVM environments. Collaborate with development teams to ensure applications are designed for reliability and performance across different operating systems and virtualization platforms. Automate repetitive tasks to improve efficiency and reduce manual intervention, specifically within Wintel and Linux systems. Monitor system performance, identify bottlenecks, and implement solutions to improve overall system reliability in VMWare and Redhat KVM environments. Develop and maintain tools for deployment, monitoring, and operations tailored to Wintel, Linux, VMWare, and Redhat KVM. Troubleshoot and resolve issues in development, test, and production environments, focusing on compute-related challenges. Participate in on-call rotations and respond to incidents promptly, ensuring high availability of compute resources. Implement best practices for security, compliance, and data protection within Wintel, Linux, VMWare, and Redhat KVM systems. Document processes, procedures, and system configurations specific to the compute infrastructure. Primary Skills Site Reliability Engineer SRE Compute Infrastructure Wintel Administration Linux Administration VMWare Administration Redhat Proficiency in scripting languages Python, Java, C/C++, Bash Infrastructure tools Terraform, Ansible Experience with monitoring and logging tools Prometheus, Grafana, ELK stack Solid understanding of networking, security, and system administration within Wintel and Linux environments. Experience with CI/CD pipelines and tools Jenkins, GitLab CI Knowledge of database management systems MySQL, PostgreSQL
Posted 1 month ago
6.0 - 10.0 years
12 - 16 Lacs
Pune
Work from Office
We are on the lookout for a hands-on DevOps / SRE expert who thrives in a dynamic, cloud-native environment! Join a high-impact project where your infrastructure and reliability skills will shine. Key Responsibilities: Design & implement resilient deployment strategies (Blue-Green, Canary, GitOps). Manage observability tools: logs, metrics, traces, and alerts. Tune backend services & GKE workloads (Node.js, Django, Go, Java). Build & manage Terraform infra (VPC, CloudSQL, Pub/Sub, Secrets). Lead incident responses & perform root cause analyses. Standardize secrets, tagging & infra consistency across environments. Enhance CI/CD pipelines & collaborate on better rollout strategies. Must-Have Skills: 5-10 years in DevOps / SRE / Infra roles. Kubernetes (GKE preferred). IaC with Terraform & Helm. CI/CD: GitHub Actions + GitOps (ArgoCD / Flux). Cloud architecture expertise (IAM, VPC, Secrets). Strong scripting/coding & backend debugging skills (Node.js, Django, etc.). Incident management with tools like Datadog & PagerDuty. Excellent communicator & documenter. Tech Stack: GKE, Kubernetes, Terraform, Helm. GitHub Actions, ArgoCD / Flux. Datadog, PagerDuty. CloudSQL, Cloudflare, IAM, Secrets. You're: A proactive team player & strong individual contributor. Confident yet humble. Curious, driven & always learning. Not afraid to solve deep infrastructure challenges. (ref:hirist.tech)
Posted 1 month ago
8.0 - 10.0 years
11 - 15 Lacs
Thane
Work from Office
People & Organization (HR) - Business Partner Professional. Can you help us craft the future? We're looking for dedicated people with the skills and vision to build a better tomorrow. Join our P&O BP team and help us build the technology that will transform entire industries, cities and even countries. We need a People Business Partner anchored out of Mumbai to enhance the current team, as our business has grown multifold over the past few years and is continuing to grow. The incumbent will play a pivotal role in shaping the business with People interventions. Come and change the future with us! Your primary role involves partnering with Managers, Leaders and employees to implement emerging people-related topics, and playing a key role in orchestrating and facilitating changes to organizational culture, design and structure. You'll partner with the location P&O leader and Business teams at relevant levels to help implement programs and processes related to organizational capability, partnering with Specialist Teams to execute related activities. You'll conduct regular connect sessions with employees, facilitating Employee pulse check discussions and planning actions to enable Employee engagement & Experience. You'll provide the perspective of business needs and requirements to Professional teams in P&O and work with them to optimize results and guide the business on the appropriate People-oriented programs, processes, and policies based on the business environment / strategy. You'll orchestrate and connect business with experts across the P&O value chain to develop new or use existing solutions, and also support and participate in interviewing for the Key Critical Roles. You'll keep an overall pulse of the organization and manage regular ongoing HR tasks: transfers, E-list corrections, compensation interventions, HR operations, Employee programs, Employee benefits management, etc.
You will work closely with the Location P&O on topics of Compliance and Retirals. You will partner with the specific teams to build policies specific to business requirements. You have a minimum of 8-10 years of experience in P&O business partnering. Independent, self-motivated, with strong communication, interpersonal and negotiation skills. Drive the PMP process using the Workday tool: Target setting, FIT ratings, RTs, Global resource alignments, in line with country. Work closely with the Country C&B team on annual and mid-year salary calculations (component-wise for letters and payroll input) and end-to-end management of the Revision Letter tool. We don't need superheroes, just super minds with a winning attitude! We've got quite a lot to offer. How about you? This role is based in Mumbai, where you'll get the chance to work with teams impacting entire cities, countries, and the shape of things to come. We're Siemens. A collection of over 379,000 minds building the future, one day at a time in over 200 countries. We're dedicated to equality, and we encourage applications that reflect the diversity of the communities we work in across Gender, LGBTQ+, Abilities & Ethnicity. All employment decisions at Siemens are based on qualifications, merit and business need. Bring your curiosity and imagination and help us shape tomorrow. Find out more about us at https://new.siemens.com and about Siemens careers at Feel free to apply for this role. On finding your candidature suitable, our recruiter will contact you for the preliminary discussion for this role. All the very best.
Posted 1 month ago
5.0 - 10.0 years
20 - 25 Lacs
Pune
Work from Office
In the ever-changing SaaS landscape there are a few persistent items that contribute to developing quality solutions with speed. Namely: ensuring operational activities are treated as software development enhancements, remediating manual tasks through automation, reducing risk through compartmentalization of services/code, and consuming readily available provider services. Product/development teams require an accountable partner to advance on these topics. The SRE (Site Reliability Engineering) team will be this partner. The SRE team will support the Siemens Xcelerator platform and will be responsible for identifying, managing, improving, and reporting on availability, resiliency, reliability, and stability efficiencies. This includes providing technical guidance and leadership to drive solutions and to create and enhance processes that deliver excellence. A strong relationship with the various product teams of the Xcelerator platform is necessary to support core objectives. This role's success will be defined by product teams meeting their SLOs with healthy product adoption and operational excellence. This position will be responsible for supporting technology and culture through an enterprise ecosystem to ensure developers and products exceed product SLOs (Service Level Objectives) and clearly, without dispute, benefit from every interaction with the SRE team.
Responsibilities: Incident Management, Game Day coordination. Create and drive metric/observability solutions and reviews. Support production readiness reviews. Act as a cross-division role model to advance the SRE practice in Siemens. Complete technological control over methods of automation, codifying optional activities, microservice architecture, and platform engineering to ensure changes, updates or technical advancements are in place for a product. Ensure the team can provide the design, deployment, automation, and scripting solutions to drive new capabilities, visibility, and efficiency. Simplify highly complex ideas, architectures and concepts to encourage achievable adoption. Collaborate with other technical platforms and partners to engineer automated and integrated solutions between tools, services, and teams that increase availability, reliability, and performance. Own and ensure the internal and external SLAs meet and exceed expectations. Be part of maintaining a 24x7, global, highly available SaaS environment. Participate in an on-call rotation that supports our production infrastructure. Troubleshoot production availability incidents that often span multiple teams and services. Ensure the SRE team can coordinate production incident post-mortems and contribute to solutions to prevent problem recurrence, with the goal of automated response to all non-exceptional service conditions. Communicate to business and technical partners on incidents as they occur when they impact system performance or availability at a critical level. Required Knowledge/Skills, Education, and Experience: Bachelor's degree or equivalent experience. Proven experience as a Site Reliability Engineer or equivalent role. Experience working in a large organization through an SRE transformation where existing applications were adapted to contemporary targets. Proven experience with automation via scripting & API development. Experience with software development in the cloud. Experience with monitoring tools (Datadog, CloudWatch, CloudTrail, Cloudability, or equivalent tools). Proven experience with containerization, specifically Kubernetes. Experience with Amazon Web Services (AWS) services and Terraform, CloudFormation, Ansible, or equivalent tools. Preferred Knowledge/Skills, Education, and Experience: Desired certifications include Datadog, Kubernetes, Security, and AWS certification. Understanding of ITIL. Deep understanding of SRE and incident management strategies. Experience with issue/incident tracking tools (ServiceNow, ServiceDesk, Jira or equivalent tools) and open source tools (Linux, Python, Git, Ansible). Experience in enterprise IT with distributed environments. Networking concepts, including firewalls, VPN, routing, load balancers, security and DNS. Senior-level system administration experience, including troubleshooting, support, mentorship/training, and oversight. Why us: Working at Siemens Software means flexibility - Choosing between working at home and the office at other times is the norm here. We offer great benefits and rewards, as you'd expect from a world leader in industrial software. A collection of over 377,000 minds building the future, one day at a time in over 200 countries. We're dedicated to equality, and we welcome applications that reflect the diversity of the communities we work in. All employment decisions at Siemens are based on qualifications, merit, and business need. Bring your curiosity and creativity and help us shape tomorrow! Siemens Software. Transform the Everyday #LI-PLM #LI-HYBRID
Posted 1 month ago
4.0 - 6.0 years
17 - 22 Lacs
Bengaluru
Work from Office
Hello talented techie! We know that the only way a business thrives is if our people are growing. That's why we always put our people first. Our global, diverse team would be happy to support you and challenge you to grow in new ways. Who knows where our shared journey will take you? We are looking for a Senior DevOps Engineer. You'll make a difference by: Being an SRE L1 Commander, who is responsible for ensuring the stability, availability, and performance of critical systems and services. As the first line of defense in incident management and monitoring, the role requires real-time response, proactive problem solving, and strong coordination skills to address production issues efficiently. Monitoring and Alerting: Proactively supervising system health, performance, and uptime using monitoring tools like Datadog and Prometheus. Serving as the primary responder for incidents to tackle and resolve issues quickly, ensuring minimal impact on end-users. Accurately categorizing incidents, prioritizing them based on severity, and raising them to L2/L3 teams when needed. Ensuring systems meet Service Level Objectives (SLOs) and maintain uptime as per SLAs. Collaborating with DevOps and L2 teams to automate manual processes for incident response and operational tasks. Performing root cause analysis (RCA) of incidents using log aggregators and observability tools to identify patterns and recurring issues. Following predefined runbooks/playbooks to resolve known issues and documenting fixes for new problems. You'd describe yourself as: An experienced professional with 4 to 6 years of validated experience in SRE, DevOps, or Production Support with monitoring tools (e.g., Prometheus, Datadog). Proven understanding of Linux/Unix operating systems, basic scripting skills (Python, GitLab actions), and cloud platforms (AWS, Azure, or GCP). Familiarity with container orchestration (Kubernetes, Docker, Helm charts) and CI/CD pipelines.
Exposure to ArgoCD for implementing GitOps workflows and automated deployments for containerized applications. Possessing experience in Monitoring: Datadog; Infrastructure: AWS EC2, Lambda, ECS/EKS, RDS; Networking: VPC, Route 53, ELB; and Storage: S3, EFS, Glacier. Strong analytical skills to resolve production incidents efficiently. Basic understanding of networking concepts (DNS, Load Balancers, Firewalls). Good communication and interpersonal skills for incident communication and issue. Having preferred certifications: AWS Certified SysOps Administrator - Associate, AWS Certified Solutions Architect - Associate or AWS Certified DevOps Engineer - Professional. Create a better #TomorrowWithUs! This role, based in Bangalore, is an individual contributor position. You may be required to visit other locations within India and internationally. In return, you'll have the opportunity to work with teams shaping the future. At Siemens, we are a collection of over 312,000 minds building the future, one day at a time, worldwide. We are dedicated to equality and welcome applications that reflect the diversity of the communities we serve. All employment decisions at Siemens are based on qualifications, merit, and business need. Bring your curiosity and imagination, and help us shape tomorrow. Find out more about Siemens careers at
Posted 1 month ago
5.0 - 10.0 years
12 - 17 Lacs
Bengaluru
Work from Office
Educational: Master of Science, Bachelor of Engineering, Bachelor of Technology, Master of Technology, Master of Engineering, Bachelor of Science. Service Line: Application Development and Maintenance. Responsibilities: Improve reliability, quality, and time-to-market of our suite of products/applications. Define suitable metrics for the system with SLO/SLI and set up an observability mechanism to track it. Define the error budget as per the SLO. Define strategy and set up High Availability and Load Balancer based architecture. Drive a metrics-driven culture and software delivery process using data to measure overall system quality and reliability. Balance feature development speed and reliability with well-defined service level objectives. Provide primary operational support and engineering for products/applications. Partner with solution architect and development teams to improve services reliability. Participate in system design, infra management and capacity planning. Participate in optimizing code, automating operational tasks and toil reduction. Provide solutions for performance management, disaster recovery, monitoring and observability. Work with business users to understand issues, develop root cause analysis and work with the development team for enhancements/fixes. Work on distributed traces to visualize the entire workflow and analyze the cause of problems/incidents. Improve security and performance of infrastructure and applications. Provide support for, improve, and implement infrastructure as code. Define, evangelize, and maintain SRE best practices. Solutionize and implement DevSecOps best practices. Improve automation, including the system's self-healing capability. Additional Responsibilities: Ability to develop value-creating strategies and models that enable clients to innovate, drive growth and increase their business profitability. Good knowledge of software configuration management systems. Awareness of latest technologies and industry trends. Logical thinking and problem-solving skills along with an ability to collaborate. Understanding of the financial processes for various types of projects and the various pricing models available. Ability to assess the current processes, identify improvement areas and suggest technology solutions. Knowledge of one or two industry domains. Client interfacing skills. Project and team management. Technical and Professional: As a Site Reliability Engineer, you will play a critical role in supporting application developers by providing expert guidance on application and infrastructure best practices from a reliability perspective. Your role covers the entire life cycle of a product/application. Your primary focus will be automation, observability, reliability and release management with CICD, with an emphasis on solving operations issues. At least 3+ years of SRE experience in large programs with focus on release engineering, observability tasks and reliability. Must have a good understanding of Site Reliability Engineering (SRE) and release management processes. Should possess strong analytical and troubleshooting skills. Should be a strong team player who enjoys collaborating with different people and profiles, shares knowledge and strives for continuous development and learning. Excellent communication skills along with leadership skills. Preferred Skills: Technology-Agile Testing-Agile Testing - ALL-CD/CI Technology-Cloud Platform-Amazon Webservices DevOps-AWS DevOps Technology-DevOps-Continuous integration - Java-Jenkins Technology-DevOps-Continuous integration - Others Technology-Container Platform-Docker Technology-Container Platform-Kubernetes
Posted 1 month ago
9.0 - 13.0 years
11 - 15 Lacs
Bengaluru
Work from Office
Educational: Bachelor of Engineering, BTech, Bachelor of Science, Master of Engineering, Master of Technology. Service Line: Infosys Cobalt Unit. Responsibilities: A day in the life of an Infoscion: As a Senior Site Reliability Engineer, you will play a critical role in supporting application developers by providing expert guidance on application and infrastructure best practices from a reliability perspective. Improve reliability, quality, and time-to-market of our suite of products/applications. Define suitable metrics for the system with SLO/SLI and set up an observability mechanism to track it. Define the error budget as per the SLO. Define strategy and set up High Availability and Load Balancer based architecture. Drive a metrics-driven culture and software delivery process using data to measure overall system quality and reliability. Balance feature development speed and reliability with well-defined service level objectives. Provide primary operational support and engineering for products/applications. Partner with solution architect and development teams to improve services reliability. Participate in system design. Participate in optimizing code, automating operational tasks and toil reduction. Provide solutions for performance management, monitoring and observability. Work with business users to understand issues, develop root cause analysis and work with the development team for enhancements/fixes. Work on distributed traces to visualize the entire workflow and analyze the cause of problems/incidents. Improve security and performance of applications. Define, evangelize, and maintain SRE best practices. Solutionize and implement DevSecOps best practices. Improve automation, including the system's self-healing capability. Manage and participate in on-call incidents, if required (Priority Incident). If you think you fit right in to help our clients navigate their next in their digital transformation journey, this is the place for you!
Additional Responsibilities: AIOps and related tools Experience in container orchestration and practices, including Kubernetes, Docker Swarm Experience in infrastructure automation tools like Terraform, Cloud Formation, Ansible, and Puppet (Any one) Knowledge on SQL, NoSQL (Oracle, Couchbase) Experience working on ITSM tools like Remedy, ServiceNow, Confluence, Jira Experience with Cloud cost optimization / FinOps Technical and Professional : Must have at least 5+ years of SRE experience in large programs with focus on release engineering, observability tasks and reliability Reliability practices Chaos engineering Strong experience on one or more Observability tools like New Relic, AppDynamics, Prometheus, Dynatrace, DataDog, Splunk, Experience in event correlation using observability or other tools like BigPanda Experience in Observability Dashboard creation, custom metrics, Synthetic Monitoring and Real User Monitoring (RUM) Good experience in scripting or development languages, including expertise in Python, Ruby, JSON, Java, and Node.JS, PHP (anyone) Experience with scripting in PowerShell(M) and Bash/Shell/Perl (anyone) Strong knowledge of application design and architecture including microservices architecture Experience in CICD tooling and best practices Experience of Cloud platforms such as AWS, Azure, and Google Preferred Skills: Foundational-Configuration Management-Configuration Management-Ansible Technology-Infra_ToolAdministration-Others-Splunk Admin Technology-Infra_ToolAdministration-PerformanceManagement-AppDynamics Technology-Infra_ToolAdministration-PerformanceManagement-Dynatrace Generic Skills: Technology-Infra_ToolAdministration-ITSM-ServiceNow Technology-OpenSystem-Python - OpenSystem-Python
Posted 1 month ago
9.0 - 14.0 years
13 - 17 Lacs
Pune
Work from Office
We need a strong profile with good experience in stakeholder and SRE team management. Good understanding of production engineering / production support projects is a must, which includes handling teams working in a 24/7 model. Incident, change, and service request management is a daily routine, so the candidate should know how to manage the workload and rotate FTEs as and when required. Awareness of managing ad hoc activities such as vulnerability fixes and patching is required. Should be able to lead BAU governance activities on a daily, weekly and monthly cadence with the necessary reporting data. GCP cloud infra management knowledge, basic Postgres DB knowledge and banking domain experience are a big advantage for the role. Mandatory experience in SRE (not Traditional Production Support) covering integration platforms on cloud-based deployments. Knowledge of applying SRE practices to daily operations is key. Ability to manage teams in shifts from office is mandatory; this is a 24x7 on-desk operation. Computer Science and/or Engineering degrees are preferred. Having domain experience in Banking will be a great advantage. Working Experience / Awareness: 24x7 operations support model for mission-critical applications and infrastructure using ServiceNow as the ITSM ticketing tool. GCP and private-cloud operational support / administration activities such as provisioning, capacity management, reliability management, monitoring, restoration, etc. Working knowledge of AppDynamics and Splunk for monitoring and setting up observability is key. CI/CD tool chains, setting up and running deployment pipelines and propagating changes on different environments. Maintaining middleware such as Kafka (open source) and MQ as well as application servers (Tomcat). Maintain Hazelcast Data storage platform clusters and Control M job schedulers. Kubernetes cluster management, monitoring, and remediation. Knowledge of Docker is important.
Automating deployments and scripting self-healing workflows based on telemetry. Work closely with the team to define SLIs and configure SLOs, respond to threshold alerts and optimize monitoring capability. Work closely with the team to understand the code as well as configuration artifacts to debug and fix issues that may arise. Must be inclined to work on proof of concepts solutions to optimize reliability such as those incorporating AI models for event correlation and assisted triaging. Able to lead & drive SRE team to parallelly work on Service or Change Requests, Defect management board, backlog management in agile manner. Good to have SRE Foundation certification by DevOps Institute or any other equivalent certification on SRE by a recognized body is mandatory. CKA certification. GCP Cloud Digital Leader certification at a minimum is mandatory; Cloud Engineer level is a bonus. Hazelcast Platform Operations certification badge
Posted 1 month ago
7.0 - 12.0 years
8 - 12 Lacs
Hyderabad
Work from Office
Experience Minimum 7 years of work experience as an SRE (not Traditional Production Support) covering integration platforms on cloud-based deployments Coding / automation scripting experience in any programming language, particularly for integration tier and middleware Working as a DevOps Engineer or SRE in mission critical applications and infrastructure Working experience with GCP (Google Cloud), particularly with GKE is important Working with AppDynamics and Splunk for monitoring and setting up observability is key CI CD tool chains, setting up and running deployment pipelines and propagating changes on different environments Core Capabilities Maintaining middleware such as Kafka (open source) and MQ as well as application servers (Tomcat) Maintain Hazelcast Data storage platform clusters and Control M job schedulers GCP and private-cloud operational support / administration activities such as provision, capacity management, reliability management, monitoring, restoration, etc Kubernetes cluster management, monitoring and remediation. 
Knowledge of Docker is important. Automating deployments and scripting self-healing workflows based on telemetry. Define SLIs and configure SLOs, respond to threshold alerts and optimize monitoring capability. Work with code as well as configuration artifacts to debug and fix issues that may arise. Knowledge of applying SRE practices to daily operations is key. Must be inclined to work on proof-of-concept solutions to optimize reliability, such as those incorporating AI models for event correlation and assisted triaging. Ability to work in shifts in office is mandatory; this is a 24/7 on-desk operation. Qualification: Computer Science and/or Engineering degrees are preferred. SRE Foundation certification by DevOps Institute or any other equivalent certification on SRE by a recognized body is mandatory. CKA certification. GCP Cloud Digital Leader certification at a minimum is mandatory; Cloud Engineer level is a bonus. Hazelcast Platform Operations certification badge. Role & Responsibilities: Work as part of a 24/7 on-desk team in shifts that will manage middleware and associated applications that are being consumed globally: incident, change, event, problem management. Debugging integrations and consumers at the code level. Work with CI/CD pipelines and automate new change rollouts. Change deployment and sanity testing is part of the scope. Set up and configure an observability product, preferably AppDynamics or Splunk, for end-to-end traceability and log analytics. Be the guardian to ensure high reliability of the applications, middleware, storage platforms, scheduler (and its jobs) and underlying cloud infrastructure. Define and set up SLIs as well as SLOs while continuously refining thresholds. Set up anomaly detection and auto-remediation workflows. Ensure all alerts and incidents within scope are actioned before breaching SLOs.
Posted 1 month ago
6.0 - 11.0 years
12 - 16 Lacs
Hyderabad
Work from Office
SRE is part of an application team matrixed to the Cloud Services Team to perform a specialized function that focuses on the automation of availability, performance, maintainability and optimization of business applications on the platform. To be effective in the position, an SRE must have strong AWS, Terraform and GitHub skills as the platform is 100% automated. All changes being applied to the environment must be automated with Terraform and checked into GitHub version control. A matrixed SRE will be provided the Reliability Engineering role in the accounts they are responsible for. This role includes the rights to perform all the necessary functions required to support the applications in the IaaS environment. An SRE is required to adhere to all Enterprise processes and controls (i.e., ChgMgt, Incident and Problem Mgmt, etc.) and ensure alignment to Cloud standards and best practices. Ability to write and implement infrastructure as code and platform automation. Experience implementing Infrastructure as Code (Terraform). Collaborate with Cloud Services and Application teams to deliver projects. Deploy infrastructure as code (IaC) releases to QA, staging, and production environments. Responsible for building the automation for any account customizations required by the application: custom roles, policies, security groups, etc. DevOps Engineer: Should have a minimum of 5 years of working experience, especially as a DevOps Engineer/SRE. Should be working in an IC role with very good communication skills (verbal and written). OS Knowledge: Should have 3 years of hands-on working experience on Linux. SCM: Should have 3 years of hands-on working experience in Git, preferably GitHub Enterprise. Cloud Experience: Should have a thorough knowledge of AWS; certification is preferred. CICD Tool: 4 years of hands-on working experience in Jenkins or, if not, any other CICD tool. EKS CICD: Working experience with Jenkins or, if not, any other CICD tool for EKS.
Jenkins Pipeline script: Hands-on experience with pipeline script is preferred. Containers: Minimum 1 year of hands-on working experience in Docker/Kubernetes. Preferred if candidate is a certified CKA (Certified Kubernetes Administrator). Mulesoft Runtime Fabric: Install and configure the Anypoint Runtime Fabric environment and deploy applications on Runtime Fabric. Cloud Infra Provisioning Tool: 2 years of hands-on working experience in Terraform/Terraform Enterprise/CloudFormation. Application Provisioning Tool: 2 years of hands-on working experience in Puppet/Ansible/Chef. Data Components: Should have good knowledge and a minimum of 1 year of working experience with ELK, Kafka, Zookeeper; HDF knowledge is an added advantage. Tools: Consul and Vault knowledge is an added advantage. Scripting Knowledge: 3 years of hands-on working experience in any scripting language (Shell/Python/Ruby, etc.). Very good troubleshooting skills and hands-on working experience in production deployments and incidents. Mulesoft knowledge: added advantage. Java Spring Boot knowledge: added advantage.
Posted 1 month ago
9.0 - 11.0 years
15 - 20 Lacs
Bengaluru
Work from Office
Educational: Bachelor of Engineering, Bachelor of Technology (Integrated), Bachelor of Comp. Applications, Bachelor of Science (Tech), Bachelor of Technology, Master of Technology, Master of Comp. Applications, Master of Science (Technology), Master of Engineering, Master of Technology (Integrated). Service Line: Application Development and Maintenance. Responsibilities: A day in the life of an Infoscion: As part of the Infosys consulting team, your primary role would be to lead the engagement effort of providing high-quality and value-adding consulting solutions to customers at different stages, from problem definition to diagnosis to solution design, development and deployment. You will review the proposals prepared by consultants, provide guidance, and analyze the solutions defined for the client business problems to identify any potential risks and issues. You will identify change management requirements and propose a structured approach to the client for managing the change using multiple communication mechanisms. You will also coach and create a vision for the team, provide subject matter training for your focus areas, and motivate and inspire team members through effective and timely feedback and recognition for high performance. You would be a key contributor in unit-level and organizational initiatives with an objective of providing high-quality, value-adding consulting solutions to customers adhering to the guidelines and processes of the organization. If you think you fit right in to help our clients navigate their next in their digital transformation journey, this is the place for you!
Additional Responsibilities: Good knowledge of software configuration management systems. Strong business acumen, strategy and cross-industry thought leadership. Awareness of latest technologies and industry trends. Logical thinking and problem-solving skills along with an ability to collaborate. Knowledge of two or three industry domains. Understanding of the financial processes for various types of projects and the various pricing models available. Client interfacing skills. Knowledge of SDLC and agile methodologies. Project and team management. Technical and Professional: Primary skills: Technology-DevOps-DevOps Architecture Consultancy. Preferred Skills: Technology-DevOps-DevOps Architecture Consultancy.
Posted 1 month ago
3.0 - 5.0 years
14 - 19 Lacs
Bengaluru
Work from Office
Educational: Bachelor of Engineering, Bachelor of Technology (Integrated), Bachelor of Science (Tech), Bachelor of Comp. Applications, Integrated course BCA+MCA, Master of Science (Technology), Master of Technology, Master of Tech (Integrated). Service Line: Application Development and Maintenance. Responsibilities: A day in the life of an Infoscion: As part of the Infosys consulting team, your primary role would be to actively aid the consulting team in different phases of the project including problem definition, effort estimation, diagnosis, solution generation and design and deployment. You will explore the alternatives to the recommended solutions based on research that includes literature surveys, information available in public domains, vendor evaluation information, etc., and build POCs. You will create requirement specifications from the business needs, and define the to-be processes and detailed functional designs based on requirements. You will support configuring solution requirements on the products; understand any issues, diagnose the root cause of such issues, seek clarifications, and then identify and shortlist solution alternatives. You will also contribute to unit-level and organizational initiatives with an objective of providing high-quality, value-adding solutions to customers. If you think you fit right in to help our clients navigate their next in their digital transformation journey, this is the place for you!
Additional Responsibilities: Ability to work with clients to identify business challenges and contribute to client deliverables by refining, analyzing, and structuring relevant data. Awareness of latest technologies and trends. Logical thinking and problem-solving skills along with an ability to collaborate. Ability to assess the current processes, identify improvement areas and suggest technology solutions. Knowledge of one or two industry domains. Technical and Professional: Primary skills: Technology-DevOps-DevOps Architecture Consultancy. Preferred Skills: Technology-DevOps-DevOps Architecture Consultancy.
Posted 1 month ago
7.0 - 12.0 years
37 - 40 Lacs
Pune
Work from Office
Job Title: Senior Engineer, AVP. Location: Pune, India. Role Description: As a senior engineer, you will oversee and be directly involved in the creation of scalable microservices utilizing Java and Spring Boot. You will work closely with technical stakeholders to guarantee that development adheres to established architectural patterns and guidelines. You will guide the team through mentoring and coaching to help them reach their technical objectives and foster a culture of technical excellence. What we'll offer you: 100% reimbursement under childcare assistance benefit (gender neutral). Sponsorship for industry-relevant certifications and education. Accident and term life insurance. Your key responsibilities: Oversee the design, development, and implementation of microservices utilizing Java, Spring Boot, and associated technologies. Work in conjunction with product managers, architects, and DevOps to provide high-quality solutions. Uphold and advocate for best practices and standards. Facilitate code reviews, establish coding standards, and mentor junior team members. Your skills and experience: Must have: More than 8 years of comprehensive experience, featuring practical coding and engineering skills predominantly in Java technologies and microservices. Significant expertise in microservices architecture, including various patterns and practices. Profound proficiency in Spring Boot, Spring Cloud, and the development of REST APIs. Desirable skills that will help you excel: Previous experience in an Agile/Scrum environment. Solid understanding of containerization technologies (Docker/Kubernetes) and build tools (Maven/Gradle). Demonstrated experience with databases including Oracle, SQL, and various NoSQL databases. Familiarity with architecture and design principles, algorithms and data structures, as well as user interface design. Experience with cloud platforms is advantageous (preferably GCP).
Knowledge of messaging systems such as Kafka and RabbitMQ would be beneficial. Previous experience working with Python. Strong problem-solving skills. Excellent communication abilities. Proficient in Git, Jenkins, CI/CD, Gradle, DevOps, and SRE methodologies. Prior experience in team leadership and mentoring is a plus. How we'll support you. About us and our teams: Please visit our company website for further information: https://www.db.com/company/company.htm
Posted 1 month ago
7.0 - 12.0 years
9 - 14 Lacs
Hyderabad
Work from Office
Looking for an Infra DevOps engineer with 7+ years of experience and knowledge of Linux environments, shell scripting, Python scripting, Ansible automation, and CI/CD tools such as Helm, Git, Jira, Jenkins, Tekton, UrbanCode Deploy, and Harness. Skill Set: Experience in Linux environments. Experience in shell scripting and Python scripting. Experience in Ansible automation. Experience in CI/CD tools such as Helm, Git, Jira, Jenkins, Tekton, UrbanCode Deploy, Harness. Familiarity with container orchestration services like Docker, Kubernetes, OpenShift. Experience in supporting WebSphere/Tomcat/Apache-based MW technologies. Knowledge of foundation infrastructure services (Compute, Storage, Network and Database). Excellent oral and written communication skills. Active involvement in and ownership of Support Project items, covering Stability, Efficiency, and Effectiveness initiatives. Leverage the standard CI/CD tools and process to enhance the automation framework and transform the engineering qualification efforts into continuous integration and delivery. Participate in application releases related to infrastructure and CI/CD setups, from development and testing to deployment into production. Participate in different types of migration activities: analyze application impact, strategize, coordinate and execute. Be flexible about shift work or weekend activities if required. Perform post-release checkouts after application releases and infrastructure updates. Develop and maintain technical support documentation. Analyze applications to identify risks, vulnerabilities and security issues. Identify, analyze and remediate reported VA issues by coordinating with different teams. Build, test and maintain the infrastructure and tools to allow for the reliable, fast development and release of software. Use coding/scripting to solve problems.
Participation in architecture and software development activities.
Posted 1 month ago
1.0 - 3.0 years
8 - 12 Lacs
Bengaluru
Work from Office
As a DevOps + Site Reliability Engineer you will work in an agile, collaborative environment to build, deploy, configure, and support services in the IBM Cloud. Your responsibilities will encompass the design and implementation of innovative features/automation, fine-tuning and sustaining existing code for optimal performance, uncovering efficiencies, supporting adopters globally, and driving to deliver a highly available cloud offering within IBM Cloud Security Services. In this role, you will be implementing and consuming APIs in the IBM Cloud infrastructure environment while configuring and integrating services. You will be a motivated self-starter who loves to solve challenging problems and feels comfortable managing multiple and changing priorities, and meeting deadlines in an entrepreneurial environment. Your primary responsibilities include: Contributing to new features and improving existing capabilities or processes while relentlessly troubleshooting problems to deliver. Practice secure development principles supporting continuous integration and delivery, leveraging tools such as Tekton, Ansible, and Terraform. Orchestrate and maintain Kubernetes/OpenShift clusters to ensure high availability and resilience. Collaborate across teams in activities including code reviews, testing, audit support, and mitigating issues. Continuously improve code, automation, testing, monitoring and alerting processes to ensure proactive identification and resolution of potential issues. Participate in on-call rotation and lead or contribute to the problem resolution process for our clients, from analysis and troubleshooting to deploying workarounds or fixes. Required education: Bachelor's Degree. Preferred education: Master's Degree. Required technical and professional expertise: 1-3 years of experience delivering code and debugging problems.
1-3 years of experience in an SRE, DevOps or similar role. A strong preference for collaborative teamwork. A rigorous approach to problem-solving. Experience with cloud computing technologies. Programming skills: scripting, Go, Python, or similar. Hands-on experience with container technologies: Kubernetes (IKS), RedHat OpenShift, Docker, Rancher, Podman. Proficient with automation tools and CI/CDs. Preferred technical and professional experience: Strongly preferred experience in working with production Kubernetes/OpenShift environments. Excellent Git skills (merges, rebase, branching, forking, submodules). Experience with Tekton, Ansible, Terraform, Jenkins. Experience with Rust, C/C++, or Java. Experience using, configuring and troubleshooting CI/CDs. Excellent record of improving solutions through automation. Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, Kibana, Sysdig, LogDNA). SQL or PostgreSQL experience.
Posted 1 month ago
7.0 - 12.0 years
9 - 14 Lacs
Hyderabad
Work from Office
Experience: Minimum 7 years of work experience as an SRE (not Traditional Production Support) covering integration platforms on cloud-based deployments. Coding / automation scripting experience in any programming language, particularly for integration tier and middleware. Working as a DevOps Engineer or SRE in mission-critical applications and infrastructure. Working experience with GCP (Google Cloud), particularly with GKE, is important. Working with AppDynamics and Splunk for monitoring and setting up observability is key. CI/CD tool chains, setting up and running deployment pipelines and propagating changes on different environments. Core Capabilities: Maintaining middleware such as Kafka (open source) and MQ as well as application servers (Tomcat). Maintain Hazelcast Data storage platform clusters and Control M job schedulers. GCP and private-cloud operational support / administration activities such as provisioning, capacity management, reliability management, monitoring, restoration, etc. Kubernetes cluster management, monitoring and remediation.
Knowledge of Docker is important. Automating deployments and scripting self-healing workflows based on telemetry. Define SLIs and configure SLOs, respond to threshold alerts and optimize monitoring capability. Work with code as well as configuration artifacts to debug and fix issues that may arise. Knowledge of applying SRE practices to daily operations is key. Must be inclined to work on proof-of-concept solutions to optimize reliability, such as those incorporating AI models for event correlation and assisted triaging. Ability to work in shifts in office is mandatory; this is a 24/7 on-desk operation. Qualification: Computer Science and/or Engineering degrees are preferred. SRE Foundation certification by DevOps Institute or any other equivalent certification on SRE by a recognized body is mandatory. CKA certification. GCP Cloud Digital Leader certification at a minimum is mandatory; Cloud Engineer level is a bonus. Hazelcast Platform Operations certification badge. Role & Responsibilities: Work as part of a 24/7 on-desk team in shifts that will manage middleware and associated applications that are being consumed globally: incident, change, event, problem management. Debugging integrations and consumers at the code level. Work with CI/CD pipelines and automate new change rollouts. Change deployment and sanity testing is part of the scope. Set up and configure an observability product, preferably AppDynamics or Splunk, for end-to-end traceability and log analytics. Be the guardian to ensure high reliability of the applications, middleware, storage platforms, scheduler (and its jobs) and underlying cloud infrastructure. Define and set up SLIs as well as SLOs while continuously refining thresholds. Set up anomaly detection and auto-remediation workflows. Ensure all alerts and incidents within scope are actioned before breaching SLOs.
Posted 1 month ago
6.0 - 11.0 years
10 - 14 Lacs
Mumbai
Hybrid
Greetings from #IDESLABS. We have an immediate opening for an SRE. JD: SRE Platform. Skill Profile, SRE Client Platform: 7+ years of relevant experience as an SRE/DevOps Engineer. Have a background in either Systems Administration or Software Engineering. Strong experience with major public cloud providers (ideally GCP, but this is not a must-have). Strong experience with Docker and Kubernetes. Strong experience with IaC (Terraform). Strong understanding of GitOps concepts and tools (ideally Flux). Excellent knowledge of technical architecture and modern design patterns, including micro-services, serverless functions, NoSQL, RESTful APIs, etc. Ability to set up and support CI/CD pipelines and tooling using GitLab. Proficiency in a high-level programming language such as Python, Ruby or Go. Experience with monitoring, log aggregation and alerting tooling (GCP Logging, Prometheus, Grafana). Additional Job Description, SRE Data Platform: Linux administration skills and a deep understanding of networking and TCP/IP. Experience with the major cloud providers and Terraform. Knowledge of technical architecture and modern-day design patterns, including micro-services, serverless functions, NoSQL, RESTful APIs, etc. Demonstrable skills in a configuration management tool like Ansible. Experience in setting up and supporting CI/CD pipelines and tooling such as GitHub or GitLab CI. Proficiency in a high-level programming language such as Python or Go. Experience with monitoring, log aggregation, and alerting tooling (ELK, Prometheus, Grafana, etc.). Experience with Docker and Kubernetes. Experience with secret management tools like HashiCorp Vault is deemed a plus. Proficient in applying SRE core tenets, including SLI/SLO/SLA measurement, toil elimination, and reliability modeling for optimizing system performance and resilience.
Experience with cloud-native tools like Cluster API, service mesh, KEDA, OPA, Kubernetes Operators. Experience with big data technologies such as NoSQL/RDBMS (PostgreSQL, Oracle, MongoDB), Redis, Spark, Rabbit, Kafka, etc. Experience in troubleshooting and monitoring large-scale distributed systems.
Posted 1 month ago
4.0 - 6.0 years
7 - 11 Lacs
Telangana
Work from Office
Resource should have a minimum of 3.5 years of experience in SRE and production support. Participate in on-call rotations to provide 24/7 support for critical systems. Strong knowledge of creating incidents or change requests using ServiceNow. Experience with DevOps, ServiceNow, Dynatrace, Ansible, GitHub, UCD and Helios. Utilize Dynatrace for application performance monitoring and troubleshooting. Experience handling production incidents and troubleshooting production issues. Strong knowledge of REST APIs and their implementations. Proven experience as a Site Reliability Engineer or similar role. Need to work in shifts (24/7 support model) and be ready to provide extra support when the team requires it. Good background/knowledge in managing and supporting microservices, with expertise in REST APIs and dashboard management. Good in scripting languages such as PowerShell, Ansible/YAML, or Shell Scripting; Python is nice to have. Experience in Kibana/Splunk to monitor the logs for several platforms like Apigee, PingFed, PCF or OCP. Experience in resolving compliance issues for Linux servers. Added advantage: Good to have knowledge of monitoring jobs and verifying job status in JCL. Knowledge of monitoring the logs in JCL. Good to have knowledge of Mainframe.
Posted 1 month ago
4.0 - 9.0 years
10 - 20 Lacs
Hyderabad, Pune, Bengaluru
Work from Office
Role & responsibilities Preferred candidate profile
Posted 1 month ago
6.0 - 11.0 years
6 - 15 Lacs
Pune, Bengaluru
Work from Office
This is a FULL TIME POSITION with Infosys. F2F interview must for these roles. Multiple roles - 8-10 Positions including Architect level Location - Bangalore or Pune Are you an SRE or Observability Enthusiast? Do you thrive on turning complex systems into transparent ones? Are you passionate about diving deep into metrics, logs, and traces to uncover insights and optimize performance? We're seeking experienced professionals in the following roles (with minimum 2-3 years of relevant experience in any of the below skills) : SRE Engineer / Architect / Consultant - Design and implement SRE practices - Design and implement robust monitoring and alerting systems - Automate routine tasks and streamline operations - Ensure system reliability, scalability, and performance - Strong understanding of cloud platforms and containerization technologies Observability Engineer / Lead - Design and implement effective observability strategies - Analyze logs, metrics, and traces to identify performance bottlenecks - Set up alerts and notifications for critical issues - Experience in tools like Datadog, Dynatrace, New Relic, Splunk, Prometheus, and Grafana We'd love to hear from you, if you think you fit into any of the above roles. Let's build the future of technology together! Abhishek.Sharma@ZentekInfosoft.com
Posted 1 month ago
5.0 - 8.0 years
5 - 8 Lacs
Pune, Maharashtra, India
On-site
As a QlikSense Engineer, you will be responsible for overseeing, designing, building, and maintaining multiple dashboards and data pipelines within the IT Service Management and Site Reliability Engineering (SRE) environment. Responsibilities: Design, build, deliver, and maintain multiple data metrics and dashboards. Create and maintain dashboards and visualizations that enable efficient data retrieval and analysis at scale. Evaluate and recommend new technologies, frameworks, and tools that can improve the efficiency and effectiveness of the IT Service Management and Site Reliability Engineering (SRE) team. Work closely with highly capable teams of data analysts, overseeing their development and ensuring it is closely aligned to our best-practice development standards. Essential Skillset/Experience: Experience in building and maintaining complex dashboards using Qlik Sense as the major platform visualisation tool. Knowledge of SCRUM, Kanban, or PMP/Prince2 program management methods. Experience of IT Service Management and Site Reliability Engineering (SRE).
Posted 1 month ago
4.0 - 6.0 years
6 - 8 Lacs
Pune
Work from Office
Develop and optimize cloud infrastructure using Python and SRE principles. Ensure system reliability and automation.
Posted 1 month ago
5.0 - 10.0 years
15 - 30 Lacs
Gurugram, Delhi / NCR
Work from Office
JOB PURPOSE: The Site Reliability Engineer reports to the Sr Manager, DevSecOps & SRE. Site reliability engineers (SREs) are responsible for improving system reliability and resilience to make it faster and easier to develop and deploy new software capabilities. SREs focus especially on building automation to reduce manual effort and prevent operations incidents. JOB RESPONSIBILITIES: Work with stakeholders such as product owners and Engineering to define service level objectives (SLOs) for system operations. Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time. Create dashboards and reports to communicate key metrics. Create software to improve performance, scalability, and stability of systems. Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals. Design, code, test, and deliver infrastructure software to automate manual operational work (i.e., "toil"). Participate in operational support and on-call rotation shifts for supported systems and products. Conduct blameless postmortems to troubleshoot priority incidents. Perform analytics on previous incidents to understand root causes and better predict and prevent future issues. Use automation to reduce the probability and/or impact of problem recurrence. Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability. Participate in system design consulting, platform management, capacity planning and launch reviews. Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams. Participate in communities of practice to share knowledge and foster continuous improvement.
Remain current on site reliability engineering methods and trends such as observability-driven development and chaos engineering. Drive continuous improvement in software quality and infrastructure reliability and resilience. Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation. The SRE engineer will focus on Application Performance Monitoring (APM), including design, solution, POC, and profiling and tuning of application compute and data nodes and resources. Some key duties of this role are: Assist in defining SRE and Observability architecture and design. Analyze and implement new features of the SRE and Observability Platform. Full-stack monitoring across all layers (Infrastructure/Network/Database/Application/Services/Third Party). Provide hands-on technical leadership in open-source and commercial monitoring tool selection and implementation. Implement an SRE-driven flow: automated incident detection -> automated engagement -> triage/mitigate -> RCA/postmortems -> problem task remediation. AI-driven correlation, de-duplication, noise reduction and auto-remediation. Provide weekly monitoring and alert analysis and continuous improvement. Create a model of the run-time environment (discovery). Profile the performance and behavior of user-defined transactions. Establish performance metrics from each of the applications'/systems' technical components (web server, app server, database, etc.). Application performance management database. APM tool administration and support. Monitoring tool design and implementation. APM setup/usage policies and guidelines. Capacity planning and monitoring. Monitor selected application performance. Report vital statistics of application performance in production. Make recommendations for improvements with Service Desk. Make recommendations for adjustments to runtime resources to improve overall performance profile. KEY QUALIFICATION & EXPERIENCES: Strong problem-solving and analytical skills.
Strong interpersonal and written and verbal communication skills. Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies. Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell). Experience with incident and response management. Experience with Agile and DevOps development methodologies. Experience with container technologies and supporting tools (e.g. Docker Swarm, Podman, Kubernetes, Mesos). Experience with working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform). Experience with monitoring and observability tools (e.g. Splunk, CloudWatch, AppDynamics, New Relic, ELK, Prometheus, OpenTelemetry). Experience with configuration management systems (e.g. Puppet, Ansible, Chef, Salt, Terraform). Experience working with continuous integration/continuous deployment tools (e.g. Git, TeamCity, Jenkins, Artifactory). Experience in GitOps-based automation is a plus. Bachelor's degree (or equivalent years of experience). 5+ years of relevant work experience. SRE experience preferred. Background in manufacturing or platform/tech companies is preferred. Must have public cloud provider certifications (Azure, GCP or AWS). Having a CNCF certification is a plus.
Posted 1 month ago