652 SRE Jobs - Page 21

JobPe aggregates results for easy access, but you apply directly on the employer's job portal.

8.0 - 12.0 years

11 - 16 Lacs

Hyderabad

Work from Office

Overview: Responsible for infrastructure engineering and DevOps tasks for PepsiCo e-commerce. The person will also lead the four-member capability team in India.
Responsibilities: Deploy infrastructure in Azure and AWS cloud using Terraform and infrastructure-as-code best practices. Participate in developing CI/CD workflows that take applications from build to deployment using modern DevOps tools such as Kubernetes, ArgoCD/Flux, Terraform, and Helm. Ensure the highest possible uptime for our Kubernetes-based developer productivity platforms. Partner with development teams to recommend best practices for application uptime and for cloud-native infrastructure architecture. Collaborate in infrastructure and application architecture discussions and in the decision-making that is part of continually improving and expanding these platforms. Automate everything. Focus on creating tools that make your life easier and benefit the entire organization and business. Evaluate and support onboarding of third-party SaaS applications, or work with teams to integrate new tools and services into existing apps. Create documentation, runbooks, disaster recovery plans, and processes. Collaborate with application development teams to perform RCA. Implement and manage threat detection protocols, processes, and systems. Conduct regular vulnerability assessments and ensure timely remediation of flagged incidents. Ensure compliance with internal security policies and external regulations such as PCI. Lead the integration of security tools such as Wiz, Snyk, Datadog, and others within the PepsiCo infrastructure. Coordinate with PepsiCo's broader security teams to align Digital Commerce security practices with corporate standards. Provide security expertise and support to various teams within the organization. Advocate and enforce security best practices, such as RBAC and the principle of least privilege. Continuously review, improve, and document security policies and procedures. Participate in the on-call rotation to support our NOC and incident management teams.
Qualifications: A BSc/MSc in computer science, software engineering, or a related field is a plus; alternatively, completion of a DevOps or infrastructure training course or bootcamp is also acceptable. 8+ years of Kubernetes experience, ideally running production workloads on AKS or EKS. 4+ years of creating CI/CD pipelines in any templatized format in GitHub, GitLab, or Azure DevOps (ADO). 3+ years of Python, Bash, and any other OOP language (please be prepared for a coding assessment in your language of choice). 5+ years of experience deploying infrastructure to Azure. 3+ years of experience using Terraform or writing Terraform modules. 3+ years of experience with Git, GitLab, or GitHub. 2+ years of experience as an SRE or supporting microservices in a containerized environment such as Nomad, Docker Swarm, or K8s. Kubernetes certifications such as KCNA, KCSA, CKA, CKAD, or CKS are preferred. Good understanding of the software development lifecycle. Familiarity with: Site Reliability Engineering; AWS, Azure, or similar cloud platforms; automated build processes and tools; service meshes such as Istio or Linkerd; monitoring tools such as Datadog and Splunk. Able to administer and run basic SQL queries in Postgres, MySQL, or any relational database. Current skills in the following technologies: Kubernetes, Terraform, AWS or Azure (Azure preferred), and GitHub Actions or GitLab workflows. Familiar with Agile processes and tools such as Jira; experience as part of Agile teams, with continuous integration, automated testing, and test-driven development, is good to have.

Posted 2 months ago

8.0 - 13.0 years

20 - 35 Lacs

Bengaluru

Work from Office

Role & responsibilities: Help build a Site Reliability Engineering culture by sharing best practices, approaches, documentation, and code with other engineering teams. Apply automation and software to any tasks or parts of the system that are performed manually. Able to troubleshoot complicated, cross-platform issues spanning OS, networking, and databases in a cloud-based SaaS environment, and to handle live production incidents. Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation. Conduct system analysis and configuration management, and develop improvements to system software performance, availability, and reliability. Design, write, ship, and motivate the creation of software and systems that increase observability, product reliability, and organizational efficiency. Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure. Develop runbooks/standard operating procedures for recurring production issues, while also working toward a permanent fix. Perform incident analysis regularly with the aim of preventing incidents and finding long-term solutions for them.
Preferred candidate profile: Experience monitoring and analyzing infrastructure performance using standard performance monitoring tools. Demonstrable experience with containerization (Docker) and orchestration (Kubernetes). Experience with infrastructure as code (Terraform, CloudFormation, Ansible). Knowledge of and proven hands-on experience with large-scale databases and distributed technologies such as Kafka and Confluent Platform Kafka. Basic programming and scripting skills. Proficient in AWS, Azure, GCP, and Docker; the role leans toward cloud, Azure, and DevOps engineering.

Posted 2 months ago

6.0 - 10.0 years

15 - 25 Lacs

Gurugram, Bengaluru

Hybrid

What you will be doing: The Site Reliability Engineer (SRE) operates and maintains production systems in the cloud. Their primary goal is to make sure the systems are up and running and deliver the expected performance. This involves daily operational tasks such as monitoring, deployment, and incident management, as well as strategic tasks such as capacity planning, provisioning, and continuous process improvement. A major part of the role is also designing for reliability, scalability, and efficiency, and automating everyday system operations tasks. SREs work closely with technical support teams, application developers, and DevOps engineers, both on incident resolution and on the long-term evolution of systems. Employees will primarily create Terraform, shell, and Ansible scripts and will take part in application deployments using Azure Kubernetes Service (AKS). Employees will work with a cybersecurity client/company.
Monitor production systems' health, usage, and performance using dashboards and monitoring tools. Track provisioned resources, infrastructure, and their configuration. Perform regular maintenance on databases, services, and infrastructure. Respond to alerts and incidents: investigate, resolve, or dispatch according to SLAs. Respond to emergencies: recover systems and restore services with minimal downtime. Coordinate with customer success and engineering teams on incident resolution. Perform postmortems after major incidents. Change management: perform rollouts, rollbacks, patching, and configuration changes. Drive demand forecasting and capacity planning with engineering and customer success teams, considering projected growth and demand spikes. Provision production resources according to capacity demands. Work with the engineering teams on design and testing for reliability, scalability, performance, efficiency, and security. Track resource utilization and cost efficiency of production services.
What we're looking for: A BSc/MSc or B.Tech degree in STEM and 3+ years of relevant industry experience. Technical skills: Terraform, Docker Swarm/K8s, Python, Unix/Linux shell scripting, DevOps, GitHub Actions, Azure Active Directory, Azure Monitor, and Log Analytics. Experience integrating Grafana with Prometheus is an added advantage. Strong verbal and written communication skills. Ability to perform on-call duties.
Regards, Kajal Khatri, Kajal@beanhr.com

Posted 2 months ago

2.0 - 7.0 years

12 - 17 Lacs

Bengaluru

Work from Office

Cloud Engineer
With 36 facilities worldwide, Eurofins BioPharma Product Testing (BPT) is the largest network of bio/pharmaceutical GMP product testing laboratories, providing comprehensive laboratory services for the world's largest pharmaceutical, biopharmaceutical, and medical device companies. BPT is enabled by global engineering teams working on next-generation applications and Laboratory Information Management Systems (LIMS). As a Site Reliability Engineer, you will be a key part of our cloud strategy, ensuring our IT systems operate effectively on the Azure cloud. As a technology leader, BPT wants to give you the opportunity not just to accept new challenges and opportunities, but to impress with your ingenuity, focus, attention to detail, and collaboration with a global team of professionals. This role reports to an SRE Manager.
Primary Responsibilities: We are looking for a skilled Site Reliability Engineer (SRE) to join our Cloud Engineering and Operations team. The ideal candidate will be responsible for ensuring high availability, performance, and reliability of our cloud-hosted systems, particularly in Microsoft Azure environments. This role combines software engineering practices with operational excellence to build scalable, automated, and resilient infrastructure. You'll work closely with developers, security teams, and platform engineers to implement best practices, reduce toil, and proactively manage incidents and risks across production environments.
Key Responsibilities
Infrastructure as Code (IaC): Automate deployment and configuration of resources using Bicep, PowerShell, and Azure CLI. Build repeatable, version-controlled infrastructure aligned with the Azure Well-Architected Framework.
Cloud Operations & Monitoring: Manage and monitor Azure cloud resources, including VMs, App Services, AKS clusters, and storage solutions. Ensure platform health using tools such as Azure Monitor, Log Analytics, and custom alerting frameworks. Optimize system performance, plan capacity, and proactively identify reliability risks.
Access & Identity Lifecycle Management: Administer access controls and identity provisioning using Azure Active Directory. Implement RBAC policies and maintain secure access patterns across the environment.
Incident Response & Troubleshooting: Respond to incidents and performance alerts, ensuring rapid resolution with minimal impact. Collaborate with engineering teams to analyze root causes and implement preventive solutions.
Security & Compliance: Enforce best practices for cloud security, including encryption, key management via Azure Key Vault, and access control. Align infrastructure with compliance standards such as GDPR, HIPAA, or internal policies.
Cost Optimization & Resource Management: Monitor and manage cloud spending using Azure Cost Management tools. Recommend and implement strategies for efficient resource usage and scaling.
Disaster Recovery & Availability Planning: Design and maintain robust backup and DR plans across Azure regions. Ensure recovery objectives are met and tested regularly.
Collaboration & Documentation: Work closely with DevOps, Security, and Architecture teams to align goals. Maintain clear documentation on infrastructure design, operations, and procedures.
Specific Deliverables: Daily operations support for Azure-hosted workloads. Timely provisioning and access management for end users. Effective triage and closure of cloud incidents. Secure and scalable infrastructure design and automation. Support and drive adoption of SRE practices, including monitoring, incident postmortems, and continuous improvement.
Skills required: Strong knowledge of Azure services (Compute, Networking, Storage, Identity). Hands-on experience with IaC tools (Bicep), scripting (PowerShell, Python), and deployment automation. Solid foundation in networking concepts and cloud-native security. Experience with cloud observability and logging tools. Proficiency in managing access control using Azure AD and RBAC.
Preferred Qualities: Strong analytical and problem-solving mindset. A passion for automation and eliminating manual work. Documentation-oriented: writes things down to scale learning and onboarding. Proactive attitude toward identifying and fixing reliability gaps. Collaborative, self-driven, and adaptable to changing priorities.
Stack:
Cloud: Microsoft Azure
Languages: PowerShell, Python, YAML
Tools: Azure DevOps, Azure Monitor, Bicep
Practices: Infrastructure as Code, CI/CD, RBAC, Zero Trust Security, Postmortems
Preferred Qualifications: Bachelor's degree in Engineering, Computer Science, or equivalent. At least 2 years of professional experience with Azure cloud in a multi-region environment.

Posted 2 months ago

6.0 - 9.0 years

18 - 20 Lacs

Pune

Work from Office

Notice Period: Immediate joiners only
Duration: 6 months (possible extension)
Shift Timing: 11:30 AM to 9:30 PM IST
About the Role: We are looking for a highly skilled and experienced DevOps / Site Reliability Engineer to join on a contract basis. The ideal candidate will be hands-on with Kubernetes (preferably GKE), Infrastructure as Code (Terraform/Helm), and cloud-based deployment pipelines. This role demands deep system understanding, proactive monitoring, and infrastructure optimization skills.
Key Responsibilities: Design and implement resilient deployment strategies (Blue-Green, Canary, GitOps). Configure and maintain observability tools (logs, metrics, traces, alerts). Optimize backend service performance through code and infrastructure reviews (Node.js, Django, Go, Java). Tune and troubleshoot GKE workloads, HPA configs, ingress setups, and node pools. Build and manage Terraform modules for infrastructure (VPC, CloudSQL, Pub/Sub, Secrets). Lead or participate in incident response and root cause analysis using logs, traces, and dashboards. Reduce configuration drift and standardize secrets, tagging, and infrastructure consistency across environments. Collaborate with engineering teams to enhance CI/CD pipelines and rollout practices.
Required Skills & Experience: 5-10 years in DevOps, SRE, Platform, or Backend Infrastructure roles. Strong coding/scripting skills and the ability to review production-grade backend code. Hands-on experience with Kubernetes in production, preferably on GKE. Proficient in Terraform, Helm, GitHub Actions, and GitOps tools (ArgoCD or Flux). Deep knowledge of cloud architecture (IAM, VPCs, Workload Identity, CloudSQL, Secret Management). Systems thinking: understands failure domains, cascading issues, timeout limits, and recovery strategies. Strong communication and documentation skills; capable of driving improvements through PRs and design reviews.
Tech Stack & Tools:
Cloud & Orchestration: GKE, Kubernetes
IaC & CI/CD: Terraform, Helm, GitHub Actions, ArgoCD/Flux
Monitoring & Alerting: Datadog, PagerDuty
Databases & Networking: CloudSQL, Cloudflare
Security & Access Control: Secret Management, IAM
Driving Results: A good individual contributor and a good team player. Flexible attitude toward work, as needed. Proactively identifies and communicates issues and risks.
Other Personal Characteristics: Dynamic, engaging, self-reliant developer. Able to deal with ambiguity. Takes a collaborative and analytical approach. Self-confident and humble. Open to continuous learning. An intelligent, rigorous thinker who can operate successfully among bright people.

Posted 2 months ago

3.0 - 4.0 years

5 - 15 Lacs

Bengaluru

Work from Office

As our first Site Reliability Engineer (SRE), you'll take ownership of the reliability, observability, and resilience of our systems across development, staging, and production. You'll bring stability to our infrastructure, implement proactive monitoring, lead incident response, optimize costs, and collaborate cross-functionally with developers, QA, and security teams. This is a hands-on role with both strategic and tactical responsibilities, ideal for someone who thrives in early-stage environments.
Key Responsibilities
Monitoring & Observability: Define and enforce monitoring standards across services (metrics, logs, traces). Consolidate and manage monitoring tools (Elastic, Sentry, Slack, Azure Monitor, etc.). Build actionable dashboards and configure alerting for RabbitMQ, APIs, databases, third-party integrations, and data pipelines on Databricks. Establish SLIs, SLOs, and error budgets to guide operational priorities.
Incident Management & Response: Implement on-call rotations and escalation policies. Develop and maintain incident response runbooks and post-incident reviews (RCAs). Reduce MTTR (Mean Time to Recovery) by automating detection and remediation where possible.
Infrastructure & Reliability Engineering: Own availability and scalability of our services on Microsoft Azure. Optimize performance and memory usage of services such as RabbitMQ, APIs, and analytics pipelines running in Databricks. Build fault-tolerant systems: retries, backoff, circuit breakers, etc. Collaborate with developers to implement resilience patterns in the codebase.
Cost Optimization & Efficiency: Track, analyze, and report on cloud infrastructure costs. Configure budgets, alerts, and resource tagging to prevent surprises. Lead right-sizing and cleanup initiatives to remove unused or overprovisioned assets.
Security & Compliance Collaboration: Work with the security team to maintain infrastructure diagrams and data flow diagrams. Participate in threat modeling and define trust boundaries. Ensure systems and tooling are audit-ready for compliance (e.g., ISO 27001, GDPR, PDPA).
Tooling & Automation: Build internal tools to improve deployment reliability, diagnostics, and rollback safety. Implement and manage Infrastructure-as-Code using Terraform, Bicep, or similar. Improve CI/CD pipelines for safer and faster releases.
Tech Stack You'll Work With:
Cloud: Microsoft Azure (App Services, VMs, Cosmos DB, Monitor, etc.)
Monitoring & Logs: ELK, Sentry, Azure Monitor, Prometheus, Grafana
Queueing: RabbitMQ, Kafka
Languages: Node.js, Python (mostly reading/debugging)
Infra as Code: Terraform, Bicep, GitHub Actions
Requirements
Must-Have: 3+ years of experience in DevOps, SRE, or infrastructure engineering roles. Experience managing high-availability systems and debugging production issues under pressure. Proven track record with cloud infrastructure (Azure preferred) and observability tooling. Strong understanding of distributed systems, incident response, and cost management. Comfortable collaborating across functions, including developers, QA, and security.
Nice-to-Have: Experience with compliance/regulatory frameworks (ISO 27001, GDPR, etc.). Familiarity with customer engagement or loyalty platforms. Contributions to infra/tooling culture in an early-stage startup.
What You'll Get: The opportunity to shape the reliability strategy of a fast-growing product from the ground up. A strong voice in infra design, tooling choices, and culture. A globally distributed, high-caliber team that's customer-obsessed and product-driven.

Posted 2 months ago

12.0 - 20.0 years

40 - 45 Lacs

Pune, Bengaluru

Work from Office

RTB Lead with payments exposure. Skills: Java, Spring Boot, MQ, Kafka, Hazelcast. Production support experience (24x7, on-call support). Immediate joiner or 15-day notice. Good communication skills. Banking/payments domain experience is mandatory. Contact: Kashif@d2nsolutions.com

Posted 2 months ago

3.0 - 6.0 years

9 - 16 Lacs

Bengaluru

Work from Office

Role & responsibilities: Drive and execute infrastructure/platform migration activities. Develop and maintain automation scripts in Python or shell to support migration processes. Work closely with SRE, DevOps, and platform teams to plan, test, and validate migration activities. Troubleshoot and resolve issues during migrations with minimal downtime and risk. Ensure proper documentation and monitoring during and after migrations. Participate in post-migration verification and support activities.
Preferred candidate profile: Python, Linux, shell scripting, migration, SRE.
If interested, contact shravani.m@genxhire.in or 7710889351.

Posted 2 months ago

12.0 - 14.0 years

35 - 50 Lacs

Pune

Work from Office

Job Summary: The Architect role is pivotal in designing and implementing robust solutions using advanced DevOps and cloud technologies. With a focus on SRE, Jenkins, and Azure, the candidate will drive innovation and efficiency in software development processes. This position requires a deep understanding of API management and container environments to optimize system performance and scalability.
Responsibilities: Design and implement scalable and efficient solutions using DevOps tools and concepts to enhance software development processes. Collaborate with cross-functional teams to integrate Jenkins and GitHub into continuous integration and delivery pipelines. Use Apigee API Management to streamline API development and ensure secure and reliable communication between services. Develop DevSecOps strategies that incorporate security measures into the DevOps lifecycle, ensuring compliance and data protection. Leverage Azure Platform capabilities to deploy and manage applications in the cloud, optimizing resource utilization. Set up and manage container environments using Docker and Container Registry to ensure consistent and reliable application deployment. Implement API Gateway solutions to manage and monitor API traffic, ensuring high availability and performance. Drive automation initiatives to reduce manual intervention and increase efficiency in deployment processes. Use Azure DevOps for project management and collaboration, ensuring timely delivery of projects and tasks. Employ Terraform for infrastructure as code, enabling automated and repeatable infrastructure deployment. Oversee the setup and configuration of container environments to support application scalability and resilience. Provide technical guidance and support to development teams, fostering a culture of innovation and continuous improvement. Monitor system performance and implement proactive measures to ensure reliability and uptime. Collaborate with stakeholders to understand business requirements and translate them into technical solutions.
Qualifications: Extensive experience with SRE and DevOps tools and a strong ability to optimize software development processes. A deep understanding of Jenkins and GitHub for effective continuous integration and delivery. Proficiency in Apigee API Management for secure and efficient API development. Expertise in DevSecOps practices to ensure security integration within the DevOps lifecycle. Strong knowledge of the Azure Platform for cloud-based application deployment and management. Hands-on experience with Docker and Container Registry for reliable container environment setup. Proficiency in API Gateway solutions to manage API traffic effectively. Capability in using Azure DevOps for project management and collaboration. Expertise in Terraform for infrastructure automation and deployment.
Certifications Required: Certified Kubernetes Administrator (CKA); Microsoft Certified: Azure Solutions Architect Expert.

Posted 2 months ago

5.0 - 9.0 years

20 - 25 Lacs

Chennai

Work from Office

Skills required: Python, Java, C/C++, Ruby, and JavaScript; J2EE; NoSQL/SQL datastores; Spring Boot; GCP/AWS/Azure and Docker/K8s; RESTful APIs and microservices platforms; experience with any APM and other monitoring tools.
Experience: 5+ years. CTC: 28 LPA. Location: Chennai.

Posted 2 months ago

3.0 - 8.0 years

8 - 15 Lacs

Chennai

Work from Office

Role & responsibilities: Proven experience as a Site Reliability Engineer (SRE) or in a similar role. Strong knowledge of machine learning concepts and workflows. Proficiency in programming languages such as Python, Java, or Go. Experience with cloud platforms such as AWS, Azure, or Google Cloud. Familiarity with containerization technologies such as Docker and Kubernetes. Experience with CI/CD tools such as Jenkins, GitLab CI, or CircleCI. Strong problem-solving skills and the ability to troubleshoot complex issues. Excellent communication and collaboration skills.
Preferred candidate profile: Experience with machine learning frameworks such as TensorFlow, PyTorch, or Scikit-learn. Knowledge of data engineering and data pipeline tools such as Apache Spark, Apache Kafka, or Airflow. Experience with monitoring and logging tools such as Prometheus, Grafana, or the ELK stack. Familiarity with infrastructure as code (IaC) tools such as Terraform or Ansible. Experience with automated testing frameworks for machine learning models. Knowledge of security best practices for machine learning systems and data.
Education: Bachelor's degree in Computer Science, Engineering, or a related field. Master's degree in Computer Science, Engineering, or a related field.

Posted 2 months ago

7.0 - 10.0 years

20 - 27 Lacs

Noida, Gurugram

Hybrid

• Oversee and report on project status, assemble project teams, and help define assignments against defined schedules and milestones.
• Continuously work to improve the reliability, stability, and performance of the digital platforms by overseeing the implementation of fully automated telemetry, observation, and applied intelligence systems.
• Continuously work to improve problem identification and service restoration for digital platforms by leading and overseeing efforts to define, enhance, and deliver automated alerting and response systems with intelligent, self-healing capabilities.
• Provide periodic on-call escalation support based on established 24/7/365 support schedules.
• Communicate and provide timely status and incident reports to leadership.
• Collaborate with admins and platform engineers through implementation decisions to achieve highly reliable infrastructure, systems, and integrations.
• Provide advanced Incident Management and Problem Management support to teams, to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
• Document all changes following controls, procedures, and documentation standards, and raise issues and concerns with recommendations for follow-up action.
Software engineering is the application of engineering to the design, development, implementation, testing, and maintenance of software in a systematic way. The roles in this function cover all primary development activity across all technology functions, ensuring we deliver high-quality code for our applications, products, and services, understand customer needs, and develop product roadmaps. These roles include, but are not limited to, analysis, design, coding, engineering, testing, debugging, standards, methods, tools analysis, documentation, research and development, maintenance, new development, operations, and delivery. Every position in the company carries a requirement to build quality into every output. This also includes evaluating new tools, techniques, and strategies; automating common tasks; building common utilities to drive organizational efficiency, with a passion for technology and solutions; and influencing thought and leadership on future capabilities and opportunities to apply technology in new and innovative ways.
- Analyzes and investigates.
- Provides explanations and interpretations within area of expertise.
Qualifications (internal):
Undergraduate degree or equivalent experience.
5-7 years of experience working in global organizations, with the ability to communicate effectively with executives, leaders, and individual contributors across the organization.
• 1+ years of SRE experience working with telemetry, observation, self-healing solutions, and platform automation.
• Experience with monitoring, logging, and telemetry tools such as Dynatrace, Splunk, Prometheus, AWS CloudWatch, Datadog, etc.
• Advanced understanding of networking, content delivery networks (CDNs, e.g., Akamai, Cloudflare), and cloud platforms.
• Experience with automation tools such as (but not limited to) Jenkins, Chef, Terraform, Ansible, etc.
• Expert in designing, creating, and supporting automation (PowerShell, Python, Ruby, AWK, SED, etc.) to run health checks and self-healing capabilities for the platforms.
• Advanced experience with the following platforms and tools:
  o Cloud: MS Azure/AWS Cloud
  o Collaboration & change management tools: Jira, ServiceNow, Cherwell, etc.
  o Databases: MongoDB, SQL, etc.
• Bachelor's degree in Computer Science or equivalent
• Azure/AWS, Microsoft, or Red Hat certifications

Posted 2 months ago

5.0 - 10.0 years

5 - 14 Lacs

Pune

Hybrid

Minimum of 3 years of experience in a DevOps, SRE, or Infrastructure Engineering role. Solid understanding of Terraform and experience maintaining reusable module libraries. Hands-on experience managing workloads on Kubernetes (preferably GKE). Working knowledge of CI/CD tools such as GitHub Actions and Helm. Familiarity with Google Cloud services, including networking, CloudSQL (Postgres), and container security. Competence in observability tooling, especially Datadog dashboards and alert configurations. Strong operational mindset with attention to detail in release processes and deployment integrity.

Posted 2 months ago

6.0 - 9.0 years

8 - 11 Lacs

Hyderabad, Bengaluru

Work from Office

Responsibilities: A bachelor's degree in computer science, software engineering, or a related discipline. An advanced degree (e.g., MS) is preferred but not required; experience is the most relevant factor. Strong software engineering foundation, with an understanding of OOP, data structures, algorithms, code instrumentation, clean coding practices, etc. 5+ years of proven experience with most of the following: Angular, React, Node.js, Python, C#, .NET Core, Java, Golang, SQL/NoSQL. 5+ years of experience with cloud-native engineering using FaaS/PaaS/microservices on cloud hyperscalers such as Azure, AWS, and GCP. Strong understanding of methodologies and tools such as XP, Lean, SAFe, DevSecOps, SRE, ADO, GitHub, SonarQube, etc. to deliver high-quality products rapidly. Strong preference will be given to candidates with experience in AI/ML and GenAI. Excellent interpersonal and organizational skills, with the ability to handle diverse situations, complex projects, and changing priorities while behaving with passion, empathy, and care.

Posted 2 months ago

5.0 - 8.0 years

20 - 32 Lacs

Pune, Gurugram

Hybrid

About the role: Site Reliability Engineer is one of the critical role in the technology team and the person working in this team will be responsible for application performance, availability, reliability and system uptime. Candidate is responsible to provide consultation and strategic recommendations by quickly assessing and remediating complex platform availability issues. Site Reliability Engineer LEAD will dive head-first into creating or applying innovative solutions and techniques that advance the reliability of Digital products. Role & responsibilities Key responsibilities: • Installation/deployment of new releases , environments for applications. • Build and maintain highly scalable, large scale deployments globally • Co-Create and maintain architecture for 100% uptime. E.g. creating alternate connectivity. • Practice sustainable incident response/management and blameless post-mortems. • Monitor and maintain production environment stability. • Own entire platforms (prod environments) Deploying, automating, maintaining and managing production systems, to ensure the availability, performance, scalability and security of productions systems • Engage in and improve the whole lifecycle of services from inception and design, through deployment, operation and refinement. • Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews. • Maintain services once they are live by measuring and monitoring availability, latency and overall system health. • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity. 
• Collaborate with Agile teams in defining technical requirements and best practices for containerized and cloud-native applications.
• Represent production support and site reliability in stand-ups, planning sessions, code reviews, and architecture reviews.
• Help evolve our configuration management (CM) efforts and our move to containers.
• Help the operations head select an enthusiastic and technically knowledgeable team, and guide the existing team members.
Preferred candidate profile
• Good know-how of applications, middleware, databases (Postgres, MongoDB, MySQL, etc.), infrastructure and OS.
• Good understanding of Docker and Kubernetes.
• Understanding of CI/CD and DevOps tools such as Jenkins, Ansible, shell scripting, etc.
• Monitoring and logging: experience with monitoring and logging tools (e.g. Nagios/AppDynamics, ELK, Prometheus).
• Good experience with distributed systems such as RabbitMQ, Kafka, Redis, etc.
• Experience working on Linux, WebLogic/Tomcat, JBoss and middleware technology.
• Should have worked on high-traffic, highly scalable systems in the past.
• Knowledge of the fundamental aspects of release automation (packaging, dependencies, promotion, deployment, compliance).
• Experience with project management tools such as JIRA, and insight into quality analysis as well.

Posted 2 months ago

7.0 - 12.0 years

10 - 20 Lacs

Hyderabad, Chennai, Mumbai (All Areas)

Work from Office

Skills: SRE, AWS DevOps, Azure DevOps. Education: B.Tech, B.Sc, BCA. Years of Experience: 3-15 years. Location: Pan India.

Posted 2 months ago

3.0 - 7.0 years

4 - 8 Lacs

Mumbai

Work from Office

Requirements: Lead and manage the RCA process for all SRE incidents, ensuring a thorough and timely investigation. Facilitate RCA workshops, guiding teams through a structured analysis to identify the root cause of incidents. Document RCA findings and recommendations in a clear and concise manner. Work with SRE engineers and developers to implement corrective actions and preventative measures based on RCA findings. Analyze trends in incident data to identify areas for improvement in system design, monitoring, and automation. Develop and implement best practices for RCA within the SRE organization. Stay up-to-date on the latest SRE practices and incident response methodologies. Collaborate with other teams (e.g., security, product) to ensure a holistic approach to incident management. Mentor and coach SRE engineers on effective RCA techniques. Track and report on key metrics related to incident management and RCA effectiveness.

Posted 2 months ago

6.0 - 10.0 years

7 - 17 Lacs

Hyderabad, Pune, Bengaluru

Work from Office

Role & responsibilities: We are hiring for a Production Support role involving L2/L3 support, the Java/.NET stack, Splunk monitoring, and SRE practices. 24/7 rotational shifts are mandatory. Interested candidates can share their resume at sarvani.j@ifinglobalgroup.com

Posted 2 months ago

7.0 - 11.0 years

12 - 13 Lacs

Gurugram

Work from Office

Execute the DevSecOps community strategy, drive content and events, moderate forums, and boost engagement. Collaborate across teams to grow an inclusive, trusted, and vibrant technical community. Mail: kowsalya.k@srsinfoway.com

Posted 2 months ago

10.0 - 13.0 years

35 - 50 Lacs

Chennai

Work from Office

Job Summary: Site Reliability Engineer. Responsibilities: Ensure security automation across our entire platform, collaborating with developers, security and operations teams to ensure platform integrity. Have a passion for security, Agile and DevOps, and promote a shift-left and shift-right culture that integrates security analysis into each CI/CD stage. Implement new tools and processes to enable security in the cloud environment. Automatically audit and implement security controls in the DevOps CI/CD pipeline, ensuring processes are followed, maintained, reviewed and updated regularly. Contribute to SRE operations (production support, incident response and on-call rota). Passion for observability. The skills you will need: Strong experience in SRE practice, with knowledge of conducting security checks and mitigation (static and dynamic code analysis, SAST, DAST, IAST, vulnerability analysis/penetration tests, security component analysis). Hands-on experience with Azure DevOps is a must, including Repos, advanced pipelines and package management. Must have knowledge of Azure Cloud and its solutions. Hands-on experience in IaC (JSON/YAML, Azure Bicep), Azure Policies, Azure DevOps, OpenTelemetry, Azure Monitoring, Azure Sentinel, Azure Defender, Grafana, Kusto queries, Kubernetes (AKS), Azure Arc, Bicep, Azure Function Apps, Azure Synapse, Power BI, Azure Data Factory, Dynamics 365, AzureML and MLflow. Programming skills in PowerShell. Knowledge of building and testing .NET and C# applications and APIs. Experience with cloud networking (TCP/IP, SSL, SMTP, HTTP, FTP, DNS), WAF, IPS/IDS and Azure Front Door. Experience working on large-scale distributed systems, with a deep understanding of design impacts on performance, reliability, operations and security. Working experience with monitoring tools and their implementation, preferably with the Azure Monitoring suite.
Knowledge of securing APIs and of security in microservices is beneficial. Should have demonstrated ability to work in an Agile environment. Strong communication and teamwork skills. Certifications Required: Azure DevOps

Posted 2 months ago

6.0 - 10.0 years

27 - 42 Lacs

Chennai

Work from Office

Job Summary: We are seeking an experienced Infra Dev Specialist with 6 to 10 years of experience to join our team. The ideal candidate will have expertise in SRE, Grafana, ELK, Dynatrace AppMon and Splunk. This role involves working in a hybrid model with day shifts. The candidate will play a crucial role in ensuring the reliability and performance of our infrastructure, contributing to the overall success of our projects and the positive impact on society. Responsibilities: Lead the design, implementation and maintenance of infrastructure solutions to ensure high availability and performance. Oversee the monitoring and alerting systems using tools like Grafana, ELK, Dynatrace AppMon and Splunk. Provide expertise in Site Reliability Engineering (SRE) to enhance system reliability and scalability. Collaborate with cross-functional teams to identify and resolve infrastructure issues promptly. Develop and maintain automation scripts to streamline infrastructure management tasks. Implement best practices for infrastructure security and compliance. Conduct regular performance tuning and optimization of infrastructure components. Monitor system health and performance, and proactively address potential issues. Create and maintain detailed documentation of infrastructure configurations and procedures. Participate in on-call rotations to provide 24/7 support for critical infrastructure components. Drive continuous improvement initiatives to enhance infrastructure reliability and efficiency. Mentor and guide junior team members in best practices and technical skills. Contribute to the overall success of the company by ensuring the reliability and performance of our infrastructure. Qualifications: Possess strong expertise in SRE principles and practices. Have extensive experience with monitoring and alerting tools such as Grafana, ELK, Dynatrace AppMon and Splunk. Demonstrate proficiency in scripting languages for automation purposes.
Exhibit strong problem-solving skills and the ability to work under pressure. Show excellent communication and collaboration skills. Have a solid understanding of infrastructure security and compliance requirements. Display a proactive approach to identifying and addressing potential issues. Hold a relevant certification in SRE or related fields. Possess a strong commitment to continuous learning and improvement. Demonstrate the ability to mentor and guide junior team members. Have a proven track record of successfully managing and optimizing infrastructure components. Show a strong commitment to contributing to the overall success of the company. Exhibit a passion for ensuring the reliability and performance of infrastructure solutions. Certifications Required: Certified SRE Practitioner, Grafana Certified, ELK Stack Certification, Dynatrace Certified Associate, Splunk Core Certified User

Posted 2 months ago

12.0 - 17.0 years

16 - 20 Lacs

Bengaluru

Work from Office

Locations: Bangalore, India. Posted 9 days ago. Job requisition ID: 30604. FICO (NYSE: FICO) is a leading global analytics software company, helping businesses in 100+ countries make better decisions. Join our world-class team today and fulfill your career potential! The Opportunity: "We are seeking an experienced DevOps Engineer to join our development team to assist in the continuing evolution of our Platform Orchestration product. You will be able to demonstrate the required potential and technical curiosity to work on software that utilizes a range of leading-edge technologies and integration frameworks. Staff training, investment and career growth form an important part of our team ethos. Consequently, you will gain exposure to different software validation techniques supported by industry-standard engineering processes that will help to grow your skills and experience." - VP, Software Engineering. What You'll Contribute: Build and maintain CI/CD pipelines for multi-tenant deployments using Jenkins and GitOps practices. Manage Kubernetes infrastructure (AWS EKS), Helm charts, and service mesh configurations (Istio). Use kubectl, Lens, or other dashboards for real-time workload inspection and troubleshooting. Evaluate the security, stability, compatibility, scalability, interoperability, monitorability, resilience, and performance of our software. Support development and QA teams with code merge, build, install, and deployment environments. Ensure continuous improvement of the software automation pipeline to increase build and integration efficiency. Oversee and maintain the health of software repositories and build tools, ensuring successful and continuous software builds. Verify final software release configurations, ensuring integrity against specifications, architecture, and documentation. Perform fulfillment and release activities, ensuring timely and reliable deployments.
What We're Seeking: A Bachelor's or Master's degree in Computer Science, Engineering, or a related field. 8-12 years of hands-on experience in DevOps or SRE roles for cloud-native Java-based platforms. Deep knowledge of AWS Cloud Services (EKS, IAM, CloudWatch, S3, Secrets Manager), including networking and security components. Strong experience with Kubernetes, Helm, ConfigMaps, Secrets, and Kustomize. Expertise in authoring and maintaining Jenkins pipelines integrated with security and quality scanning tools. Hands-on experience with infrastructure provisioning tools such as Docker and CloudFormation. Familiarity with CI/CD pipeline tools and build systems, including Jenkins and Maven. Experience administering software repositories such as Git or Bitbucket. Proficient in scripting/programming languages such as Ruby, Groovy, and Java. Proven ability to analyze and resolve issues related to performance, scalability, and reliability. Solid understanding of DNS, load balancing, SSL, TCP/IP, and general networking and security best practices. Our Offer to You: An inclusive culture strongly reflecting our core values: Act Like an Owner, Delight Our Customers and Earn the Respect of Others. The opportunity to make an impact and develop professionally by leveraging your unique strengths and participating in valuable learning experiences. Highly competitive compensation, benefits and rewards programs that encourage you to bring your best every day and be recognized for doing so. An engaging, people-first work environment offering work/life balance, employee resource groups, and social events to promote interaction and camaraderie. Why Make a Move to FICO: At FICO, you can develop your career with a leading organization in one of the fastest-growing fields in technology today: Big Data analytics. You'll play a part in our commitment to help businesses use data to improve every choice they make, using advances in artificial intelligence, machine learning, optimization, and much more.
FICO makes a real difference in the way businesses operate worldwide. Credit Scoring: FICO Scores are used by 90 of the top 100 US lenders. Fraud Detection and Security: 4 billion payment cards globally are protected by FICO fraud systems. Lending: 3/4 of US mortgages are approved using the FICO Score. Global trends toward digital transformation have created tremendous demand for FICO's solutions, placing us among the world's top 100 software companies by revenue. We help many of the world's largest banks, insurers, retailers, telecommunications providers and other firms reach a new level of success. Our success is dependent on really talented people just like you who thrive on the collaboration and innovation that's nurtured by a diverse and inclusive environment. We'll provide the support you need, while ensuring you have the freedom to develop your skills and grow your career. Join FICO and help change the way business thinks! Learn more about how you can fulfil your potential at FICO. FICO promotes a culture of inclusion and seeks to attract a diverse set of candidates for each job opportunity. We are an equal employment opportunity employer and we're proud to offer employment and advancement opportunities to all candidates without regard to race, color, ancestry, religion, sex, national origin, pregnancy, sexual orientation, age, citizenship, marital status, disability, gender identity or Veteran status. Research has shown that women and candidates from underrepresented communities may not apply for an opportunity if they don't meet all stated qualifications. While our qualifications are clearly related to role success, each candidate's profile is unique and strengths in certain skill and/or experience areas can be equally effective. If you believe you have many, but not necessarily all, of the stated qualifications, we encourage you to apply. Information submitted with your application is subject to the FICO Privacy Policy.

Posted 2 months ago

8.0 - 10.0 years

27 - 42 Lacs

Bengaluru

Work from Office

Job Summary: We are seeking a Senior Cloud Engineer with 8 to 10 years of experience to join our team. The ideal candidate will have expertise in AWS, SRE and DevOps, and a strong understanding of the SRE work model. This hybrid role requires a proactive individual who can contribute to our cloud infrastructure and ensure its reliability and scalability. Responsibilities: Lead the design and implementation of cloud infrastructure solutions using AWS. Oversee the development and deployment of SRE and DevOps practices to enhance system reliability. Provide technical guidance and mentorship to junior engineers. Collaborate with cross-functional teams to define and implement cloud strategies. Monitor and optimize cloud infrastructure performance and cost. Develop and maintain automation scripts to streamline cloud operations. Ensure compliance with security policies and best practices in cloud environments. Troubleshoot and resolve complex cloud infrastructure issues. Implement and manage CI/CD pipelines to support continuous integration and delivery. Conduct regular reviews and audits of cloud infrastructure to identify areas for improvement. Stay updated with the latest cloud technologies and trends to drive innovation. Contribute to the development of disaster recovery and business continuity plans. Document cloud infrastructure designs, processes and procedures for future reference. Qualifications: Possess a strong understanding of AWS services and architecture. Demonstrate expertise in SRE and DevOps practices and principles. Have experience with automation tools and scripting languages. Show proficiency in monitoring and logging tools for cloud environments. Exhibit excellent problem-solving and troubleshooting skills. Have strong communication and collaboration abilities. Display a proactive approach to learning and staying current with industry trends.

Posted 2 months ago

10.0 - 14.0 years

35 - 50 Lacs

Chennai

Work from Office

Job Summary: We are seeking a highly skilled Principal Infra Developer with 10 to 14 years of experience to join our team. The ideal candidate will have expertise in SRE, Grafana, EKS and JBoss, and in managing teams with client interaction. Experience in Property & Casualty insurance is a plus. This hybrid role involves rotational shifts and does not require travel. Responsibilities: Strong experience in AWS EKS. Good knowledge of creating Kubernetes clusters, pods, namespaces, replicas, DaemonSets and replication controllers, and of setting up kubectl. Working knowledge of AWS EKS, EC2, IAM and MSK. Good working knowledge of Docker and GitHub, setting up pipelines and troubleshooting related issues. Working knowledge of monitoring tools such as AppDynamics, ELK, Grafana and Nagios. Working knowledge of Rancher, Vault and Argo CD. Good knowledge of networking concepts. Strong troubleshooting skills for triaging and fixing application issues on k8s clusters. Hands-on experience installing, configuring and maintaining JBoss EAP 6.x/7.x in various environments, in both domain-based and standalone setups. Strong experience in configuring and administering connection pools for JDBC connections and JMS queues in JBoss EAP. Strong experience in deploying applications (JAR, WAR, EAR) and maintaining load balancing, high availability and failover functionality in clustered environments through the command line in JBoss EAP. Extensive experience in troubleshooting JBoss server issues using thread dumps and heap dumps. Good experience creating SSL certificates for JBoss 5.x/6.x/7.x. Experience providing technical assistance with performance tuning and troubleshooting of Java applications. Knowledge of deployment procedures for J2EE applications and code to JBoss Application Server is good to have. Good knowledge of the installation, maintenance and integration of web servers such as Apache Web Server, OHS and Nginx. Good knowledge of scripting and automation using Ansible, Bash and Terraform.

Posted 2 months ago

4.0 - 9.0 years

0 - 3 Lacs

Visakhapatnam, Hyderabad

Work from Office

Key Responsibilities:
Cloud Platform: GCP
Infrastructure Automation: Design, implement, and manage infrastructure as code using Terraform to provision and manage GCP resources.
Container Orchestration: Deploy and manage Kubernetes clusters, ensuring efficient operation of containerized applications.
Continuous Integration/Continuous Deployment (CI/CD): Develop and maintain CI/CD pipelines using Jenkins to automate application build, test, and deployment processes.
Containerization: Collaborate with development teams to containerize applications using Docker and manage deployments with Helm Charts.
Code Quality Assurance: Integrate and manage SonarQube to ensure code quality and security standards are met.
Monitoring and Logging: Implement and manage monitoring solutions using Datadog to ensure system health, performance, and security.
Collaboration: Work closely with cross-functional teams, including developers, QA, and operations, to streamline processes and improve productivity.
Requirements:
Experience: 5+ years in DevOps or cloud engineering roles, with at least 3 years of relevant experience in the specified technologies.
Technical Proficiency: Hands-on experience with GCP services and architecture. Proficiency in Terraform for infrastructure as code implementations. Strong understanding of and experience with Kubernetes and Docker. Experience in setting up and managing CI/CD pipelines using Jenkins. Familiarity with Helm Charts for application deployment. Experience with SonarQube for code quality analysis. Proficiency in monitoring and logging tools, particularly Datadog.
Scripting Skills: Proficiency in scripting languages such as Bash or Python is an added advantage.
Strong problem-solving abilities and analytical thinking. Excellent communication skills, both verbal and written. Ability to work collaboratively in a team environment. Strong organizational and time management skills.

Posted 2 months ago

Start Your Job Search Today

Browse through a variety of job opportunities tailored to your skills and preferences. Filter by location, experience, salary, and more to find your perfect fit.



Featured Companies