646 SRE Jobs

JobPe aggregates listings for easy access; you submit your application directly on the original job portal.

1.0 - 4.0 years

4 - 8 Lacs

Hyderabad, Telangana, India

On-site

Job description

- You are experienced with infrastructure-as-code practices.
- You consistently use your programming skills to automate tasks.
- You are comfortable working in a CLI environment.
- You think of software and infrastructure coming together to form a larger system.
- You dig deep into incidents and problems and come up with unique solutions.
- You are enthusiastic about learning new technologies and spreading your knowledge.
- You battle ruthlessly to fix what's broken and protect the customer experience.
- You are compelled to leave a situation better than you found it.

Your work as an SRE at Bottomline will involve:

- Increasing the observability of our various applications, services, and infrastructure using OpenTelemetry, the Grafana ecosystem (Grafana, Loki, Mimir, Tempo), and Fluentd
- Automating our applications and infrastructure using Terraform, Kubernetes, and Puppet
- Creating CI/CD pipelines for these services using GitLab, ArgoCD, and Kustomize
- Working with our Product teams and helping them capture the user experience in SLOs
- Reducing the impact of service disruptions through our incident, problem, and change management programs

Posted 1 day ago


1.0 - 4.0 years

4 - 8 Lacs

Kolkata, West Bengal, India

On-site

Job description

- You are experienced with infrastructure-as-code practices.
- You consistently use your programming skills to automate tasks.
- You are comfortable working in a CLI environment.
- You think of software and infrastructure coming together to form a larger system.
- You dig deep into incidents and problems and come up with unique solutions.
- You are enthusiastic about learning new technologies and spreading your knowledge.
- You battle ruthlessly to fix what's broken and protect the customer experience.
- You are compelled to leave a situation better than you found it.

Your work as an SRE at Bottomline will involve:

- Increasing the observability of our various applications, services, and infrastructure using OpenTelemetry, the Grafana ecosystem (Grafana, Loki, Mimir, Tempo), and Fluentd
- Automating our applications and infrastructure using Terraform, Kubernetes, and Puppet
- Creating CI/CD pipelines for these services using GitLab, ArgoCD, and Kustomize
- Working with our Product teams and helping them capture the user experience in SLOs
- Reducing the impact of service disruptions through our incident, problem, and change management programs

Posted 1 day ago


5.0 - 9.0 years

0 Lacs

Karnataka

On-site

You are a talented Site Reliability Engineering Manager with a passion for distributed storage systems. You will be part of a focused team at Apple, bringing distributed storage technologies to Apple's infrastructure. Your role is crucial as Apple operates at a huge scale and your impact will be enormous. The mission is to power storage behind many of Apple's most popular services, and with your passion and dedication, there are no limits to what you can achieve. As the Storage SRE organization seeks a strong engineering leader to manage Storage focused SRE teams, you will work closely with peer SRE teams and development partners. Your responsibilities include building and optimizing the Storage stack from the bare metal to the top of the application. This involves designing provisioning systems, code deployment, monitoring, alerting, and performance improvements. Together with your team, you will help run the storage used by some of Apple's largest teams. Minimum Qualifications for this role include a Bachelor's or Master's degree in Computer Science, Engineering, or a related field. You should have proven experience in a leadership role within an SRE or DevOps team, with a specific passion for distributed storage. A strong background in distributed systems, storage architectures, and data management is essential. Deep knowledge of SRE principles, including monitoring, alerting, error budgets, fault analysis, and other common reliability engineering concepts is required. Leading initiatives to enhance the scalability and performance of distributed storage systems is also part of the role, along with collaborating with engineering teams to design and implement robust and scalable storage solutions. Preferred Qualifications include experience with Kubernetes, Docker, and containerization, as well as proficiency in at least one of these programming languages: Golang, Java, or Rust. 
Knowledge of distributed storage (block storage) or similar large-scale distributed databases is beneficial. Familiarity with CI/CD pipelines and infrastructure as code (Terraform, Ansible), knowledge of security best practices, and compliance requirements in storage systems are also advantageous. An understanding of data durability, consistency models, and storage performance optimization techniques is a plus. Education & Experience requirements are not specified in the job description.

Posted 1 day ago


4.0 - 8.0 years

3 - 12 Lacs

Pune, Maharashtra, India

On-site

Responsibilities

- Full stack development: Design, develop, and maintain applications, including the front end and back end.
- Infrastructure management: Design and maintain cloud-based infrastructure, configure servers, and manage databases.
- Automation: Use scripting and automation tools to automate tasks and improve efficiency.
- Continuous integration and delivery: Implement and manage processes to ensure that code changes are continuously integrated, tested, and deployed.
- System optimization: Design and implement scalable systems to handle increasing loads and user demands.
- Disaster recovery planning: Develop and test plans to restore services quickly in case of critical incidents.
- Monitoring: Collect and visualize critical information about the system's performance to identify issues.
- Risk mitigation: Identify, assess, and implement measures to eliminate potential risks that could impact the system's performance.

Skills

- Strong problem-solving skills
- Proficiency in scripting and automation tools
- Knowledge of infrastructure management, automation, and collaboration
- Experience with DevOps tools
- Experience with cloud technologies

Posted 1 day ago


3.0 - 7.0 years

3 - 12 Lacs

Hyderabad, Telangana, India

On-site

Are you passionate, driven, and ready for an exciting new challenge? We're looking for talented individuals to join our team!

Primary skills

- Must have 3+ years of experience in SRE, observability tools, and automation.
- Experience in transforming traditional IT Ops into SRE ops, with a deep understanding of SLIs, SLOs, toil, error budgets, etc.
- Experience in automating and improving the reliability, performance, and availability of IT infrastructure and networks.
- Create scalable, resilient, and reliable IT infrastructure and networks to minimize downtime and incidents and ensure the availability of critical applications and services.
- Use automation tools such as Ansible to automate IT infrastructure tasks, such as system management and application monitoring, and to auto-heal incidents.

Secondary skills

- Coding or automation experience using Ansible or other automation tools
- Experience working with US-based clients
- Excellent communication skills
- Technology: experience in at least one IT infrastructure technology such as Windows, Linux/Unix, VMware, network routing and switching, backup/storage technologies, cloud/container technologies, or database/app support

Posted 1 day ago


5.0 - 8.0 years

2 - 6 Lacs

Pune, Maharashtra, India

On-site

Responsibilities

- Implement and maintain AWS infrastructure components (RDS, EventBridge, Lambda, FIS) using best practices.
- Author and extend IaC modules (Terraform or CloudFormation) for reproducible environments.
- Develop and enhance CI/CD pipelines (Jenkins, GitHub Actions) for automated builds, deployments, and chaos drills.
- Write automation scripts and CLI tools for snapshot validation, load-test orchestration, and smoke-test execution.
- Contribute to monitoring, alerting, and incident-response workflows; support on-call rotations and update runbooks.
- Assist with compliance tasks in our FedRAMP and SOC 2 environments.
- Collaborate with identity, automation, and chaos-engineering teams to share libraries and utilities.
- Debug complex issues spanning infrastructure, code, and third-party services.

Requirements

- 4–7 years in SRE, DevOps, or Cloud Engineering roles.
- Solid experience with AWS services, especially RDS, Lambda, EventBridge, and Fault Injection Service.
- Proficiency in Terraform and/or CloudFormation for infrastructure automation.
- Strong scripting skills in Python or Bash; familiarity with Node.js is a plus.
- Hands-on experience with CI/CD tools (Jenkins, GitHub Actions) and pipeline design.
- Working knowledge of monitoring and observability tools (CloudWatch, Prometheus, Grafana).
- Good understanding of FedRAMP compliance.
- Strong troubleshooting and collaboration skills.

Preferred

- Exposure to chaos-engineering frameworks like AWS FIS.
- Basic familiarity with container orchestration (Kubernetes or ECS).
- Experience with database backup/restore and disaster-recovery patterns.
- Prior involvement in load-testing or performance-tuning projects.

Education

- B.E./B.Tech or M.E./M.Tech in Computer Science, Software Engineering, or a related field (or equivalent practical experience).

Posted 1 day ago


12.0 - 16.0 years

0 Lacs

Noida, Uttar Pradesh

On-site

As a Principal Site Reliability Engineer, you will be responsible for leading all infrastructure aspects of a new cloud-native, microservice-based security platform. This platform is fully multi-tenant, operates on Kubernetes, and utilizes the latest cloud-native CNCF technologies such as Istio, Envoy, NATS, Fluent, Jaeger, and Prometheus. Your role will involve technically leading an SRE team to ensure high-quality SLA for a global solution running in multiple regions. Your responsibilities will include building tools and frameworks to enhance developer efficiency on the platform and abstracting infrastructure complexities. Automation and utilities will be developed to streamline service operation and monitoring. The platform handles large amounts of machine-generated data daily and is designed to manage terabytes of data from numerous customers. You will actively participate in platform design discussions with development teams, providing infrastructure insights and managing technology and business tradeoffs. Collaboration with global engineering teams will be crucial as you contribute to shaping the future of Cybersecurity. At GlobalLogic, we prioritize a culture of caring, where people come first. You will experience an inclusive environment promoting acceptance, belonging, and meaningful connections with collaborative teammates, supportive managers, and compassionate leaders. Continuous learning and development are essential at GlobalLogic. You will have access to numerous opportunities to expand your skills, advance your career, and grow personally and professionally. Our commitment to your growth includes programs, training curricula, and hands-on experiences. GlobalLogic is recognized for engineering impactful solutions worldwide. Joining our team means working on projects that make a difference, stimulating your curiosity and problem-solving skills. You will engage in cutting-edge solutions that shape the world today. 
We value balance and flexibility, offering various career paths, roles, and work arrangements to help you achieve a harmonious work-life balance. At GlobalLogic, integrity is key, and we uphold a high-trust environment focused on ethics and reliability. You can trust us to provide a safe, honest, and ethical workplace dedicated to both employees and clients. GlobalLogic, a Hitachi Group Company, is a leading digital engineering partner to top global companies. With a history of digital innovation since 2000, we collaborate with clients to create innovative digital products and experiences, driving business transformation and industry redefinition through intelligent solutions.

Posted 2 days ago


5.0 - 9.0 years

0 Lacs

Udupi, Karnataka

On-site

As a Full Stack Team Lead at Ordrio, a dynamic SaaS e-commerce platform, you will play a pivotal role in empowering D2C brands and traditional retailers to thrive online. By providing a comprehensive and user-friendly solution coupled with expert guidance, you will assist businesses in navigating the complexities of e-commerce and achieving significant growth. Join our team and become a key player in fostering a positive and productive work environment that drives our success. In this role, you will lead and mentor the full stack team, promoting continuous learning and fostering a collaborative, growth-oriented culture. You will be responsible for owning the Software Development Life Cycle (SDLC) from inception to deployment, handling sprint planning, resource allocation, progress tracking, and delivery. Your technical expertise will be crucial as you architect and implement scalable web and mobile solutions. You will guide code reviews, enforce standards, and shape the technical direction of the projects. Additionally, you will scout, evaluate, and integrate AI tools and automation into workflows and products to drive innovation. Your responsibilities will include driving architectural decisions, designing microservices/serverless solutions, and engaging in technical deep-dives. You will oversee responsive UI/UX development with technologies such as React, Next.js, Tailwind CSS, HTML, CSS, JavaScript, and TypeScript. Furthermore, you will build secure and reliable APIs and microservices using Node.js (Nest.JS, Express), TypeScript, and JavaScript. Database management will be a key aspect of your role, where you will design, maintain, and optimize Postgres and MongoDB databases, lead ORM usage (Prisma), and handle ETL flows. You will also oversee RESTful APIs, WebSockets, and API Gateway (NGINX, Swagger/Open API), as well as manage CI/CD, cloud, containers, orchestration, observability, and infrastructure as code. 
Championing automated/unit testing, monitoring system health, resolving incidents swiftly, and ensuring robust security, privacy, and compliance practices throughout development will be part of your day-to-day responsibilities. You will collaborate with product, design, QA, and business teams to translate business needs into technical solutions. Your role will also involve triaging, troubleshooting, and rapidly resolving production incidents while ensuring minimal downtime and clear post-mortems. Staying updated on emerging technologies, driving process improvements, and working under pressure in high-growth, agile environments will be essential for success in this position. If you have 5+ years of hands-on full stack and leadership experience, proven fluency in backend, frontend, and mobile frameworks, in-depth knowledge of ORM and database management, good DevOps skills, and strong communication and collaboration abilities, we encourage you to apply. Join us at Ordrio and lead end-to-end architecture/technology decisions in an AI-first, high-ownership environment, influencing and mentoring a skilled and ambitious team while driving innovation in the future of commerce.

Posted 2 days ago


3.0 - 10.0 years

5 - 7 Lacs

Pune, Maharashtra, India

On-site

Key Responsibilities:

- Design, deploy, and manage highly available and scalable infrastructure on AWS.
- Automate infrastructure provisioning and configuration using tools like Terraform and Ansible.
- Develop and implement monitoring and alerting systems to proactively identify and troubleshoot incidents.
- Optimize infrastructure costs on AWS through resource management and utilization analysis.
- Collaborate with development teams to implement DevOps practices and ensure smooth deployments.
- Participate in on-call rotations and diligently respond to incidents to minimize downtime.
- Continuously improve infrastructure reliability and performance through automation and best practices.
- Stay up to date with the latest trends and technologies in cloud computing and SRE principles.

Qualifications:

- 3+ years of experience in Site Reliability Engineering or a related field (DevOps)
- Proven expertise in deploying and managing infrastructure on AWS (EC2, S3, VPC, etc.)
- Experience with Linux is a must; prior experience as a Linux administrator is a plus
- Strong understanding of networking fundamentals is a must
- Strong knowledge of infrastructure automation tools like Terraform and Ansible
- Experience with DevOps methodologies and CI/CD pipelines
- A keen understanding of cost optimization principles in AWS
- Excellent problem-solving and analytical skills
- Ability to work independently and as part of a cross-functional team
- Diligent and proactive approach to incident response
- Willingness to participate in on-call rotations

Posted 3 days ago


5.0 - 9.0 years

0 Lacs

Karnataka

On-site

As a Site Reliability Engineering Manager at Apple, you will be a part of a dynamic team dedicated to bringing distributed storage technologies to Apple's infrastructure. Your role will involve managing Storage-focused SRE teams, collaborating closely with peer SRE teams, and development partners. You will play a pivotal role in building and optimizing the Storage stack, ranging from bare metal to application layers. This includes designing provisioning systems, code deployment strategies, monitoring, alerting, and performance enhancements. Your contributions will be instrumental in running the storage infrastructure utilized by some of Apple's largest teams. To excel in this role, you must possess a Bachelor's or Master's degree in Computer Science, Engineering, or a related field. Additionally, you should have proven experience in a leadership position within an SRE or DevOps team, with a specific focus on distributed storage systems. A strong background in distributed systems, storage architectures, and data management is essential. Deep knowledge of SRE principles, such as monitoring, alerting, error budgets, fault analysis, and other reliability engineering concepts, will be beneficial in this role. Your responsibilities will also include leading initiatives to enhance the scalability and performance of distributed storage systems and collaborating with engineering teams to implement robust and scalable storage solutions. Preferred qualifications for this role include experience with Kubernetes, Docker, and containerization, proficiency in programming languages like Golang, Java, or Rust, and knowledge of distributed storage or large-scale distributed databases. Familiarity with CI/CD pipelines and infrastructure as code tools like Terraform and Ansible, along with an understanding of security best practices and compliance requirements in storage systems, will be advantageous. 
Moreover, a grasp of data durability, consistency models, and storage performance optimization techniques will further enhance your effectiveness in this role. Join Apple's Storage SRE organization and be part of a team that is revolutionizing the storage solutions behind some of Apple's most popular services. Your passion and dedication can make a significant impact on the scale and efficiency of Apple's infrastructure. Embrace this opportunity to contribute to innovative products, services, and customer experiences that define Apple's commitment to excellence.

Posted 3 days ago


6.0 - 10.0 years

5 - 7 Lacs

Bengaluru, Karnataka, India

On-site

Key Responsibilities:

- Design and manage scalable infrastructure on Azure and AWS
- Automate infrastructure using Terraform and Ansible
- Set up monitoring and incident response systems
- Optimize cloud performance and cost
- Support CI/CD pipelines and DevOps practices
- Participate in on-call rotations

Requirements:

- 6 to 10 years of experience in SRE/DevOps roles
- Mandatory hands-on Azure experience (VMs, Networking, Monitoring, etc.)
- Strong experience with AWS (EC2, S3, VPC)
- Proficiency in Linux
- Solid knowledge of networking fundamentals
- Experience with Terraform, Ansible, and CI/CD tools
- Strong problem-solving and analytical skills

Good-to-have skills:

- Experience with container orchestration platforms like Kubernetes
- Familiarity with monitoring tools like Prometheus, Grafana, Datadog, or Azure Monitor
- Knowledge of SRE principles such as SLAs, SLOs, and error budgets
- Scripting knowledge in Python, Shell, or similar

Posted 3 days ago


4.0 - 8.0 years

0 Lacs

Hyderabad, Telangana

On-site

The Architect Site Reliability Engineering plays a crucial role in providing technical leadership to support initiatives in cloud computing at Inspire. With a primary focus on enhancing efficiency, reducing toil, and increasing uptime and availability of Inspire's cloud platforms, you will collaborate with peers to influence cloud application and infrastructure design, improve production readiness reviews, streamline build/test/release automation, elevate observability practices, and fortify platform resiliency, scalability, and recovery capabilities. Your success in this role will stem from your ability to engage with diverse technical partners, employ data-driven problem-solving approaches, demonstrate self-motivation, and exhibit a commitment to continuous improvement. In this position, your responsibilities will include: - Involvement in the entire application and cloud services development lifecycle, from inception to refinement, ensuring well-designed and monitored software releases in collaboration with application and platform teams. - Designing, motivating, guiding, and supporting the development of software, systems, and processes to enhance product reliability, organizational efficiency, and resource optimization. - Advocating for reliability practices across the software development lifecycle through activities like architecture reviews, code reviews, platform creation, and capacity planning. - Collaborating with senior engineering and testing team members to develop tools and recommend testing strategies for problem prevention, detection, and chaos testing. - Enhancing SRE practices by establishing error budgets, refining SRE dashboards, and improving anomaly detection capabilities. - Providing design recommendations for platform enhancements based on production incident analysis and root cause investigations. - Improving service reliability through blameless post-incident reviews and leveraging automation tools to respond to or prevent future issues. 
- Identifying automation opportunities, designing tools, and supporting their implementation to automate routine, time-consuming, or manual tasks. - Periodically evaluating current SRE practices and tools to suggest enhancements and improvements. - Training, guiding, and mentoring teammates on SRE practices and principles. - Developing strategies to ensure infrastructure scalability and elasticity, along with code-level debugging for escalated issues. To be successful in this role, you should have: - A minimum of 8 years of experience as a platform architect with expertise in containers, deployment architecture, benchmarking, design, and network engineering. - At least 4 years of combined experience in DevOps, SRE, Systems, and/or software development roles. - Hands-on experience in establishing and maturing SRE practices, programs, and roadmaps. - Extensive knowledge of public cloud technologies, particularly Azure, and cloud-native architectures. - Proficiency in Infrastructure-as-Code (IAC), DevOps, and CI/CD practices and tools like Terraform, Gitlab, ArgoCD, and Jenkins. - Familiarity with configuration management tools such as Ansible, Chef, and Packer. - Expertise in container technology and orchestration, including Kubernetes and Docker. - Experience with Observability and Monitoring practices and tools like OpenTelemetry, New Relic, Prometheus, Grafana, and more. - Deep understanding of microservice architectures, application servers, networks, and databases. - Excellent grasp of scalability processes and techniques. - Strong communication and collaboration skills, with the ability to understand and improve complex systems. In summary, this role requires a dedicated professional with a strong technical background, a proactive approach to problem-solving, and a commitment to enhancing reliability and efficiency across cloud platforms. 
If you are someone who thrives in a dynamic and collaborative environment, excels in technical challenges, and is passionate about driving continuous improvement, this opportunity at Inspire may be the perfect fit for you.

Posted 3 days ago


3.0 - 7.0 years

0 Lacs

Navi Mumbai, Maharashtra

On-site

As a Junior AWS SRE at our company located in Mumbai, you will be responsible for setting up a world-class observability platform for multi-cloud infrastructure services. Your role will involve reviewing and contributing to the establishment of observability for the infrastructure of new and existing cloud applications. You will analyze, troubleshoot, and design critical services, platforms, and infrastructure with a focus on reliability, scalability, resilience, automation, security, and performance. Your duties will also include continuously improving cloud product reliability, availability, maintainability, and cost benefits, including the development of fault-tolerant tools to ensure the general robustness of the cloud infrastructure. You will play a key role in ensuring the availability, performance, monitoring, and incident response of the platforms and services of the cloud Landing zone. Managing capacity across public and private cloud resource pools, including automating the scale down/up of environments, will be part of your responsibilities. It will be your responsibility to ensure that all production deployments comply with a set of general requirements such as diagrams, documents, security compliance, dependencies of other services, monitoring and logging plans, backups, and high availability setups. You will need to ensure the efficient functioning of cloud resources and functions in alignment with the company's security policies and best practices in cloud security. As a Junior AWS SRE, you will be expected to employ exceptional problem-solving skills to proactively identify and resolve issues before they impact business productivity. You will also provide support to developers in optimizing and automating cloud engineering activities, such as real-time migration, provisioning, and deployment. 
Monitoring and taking action on hardware degradation, networking problems, resource usage, and slow responses on the cloud Landing zone will be part of your daily tasks. You will be responsible for preparing and managing runbooks containing procedures necessary for quickly restoring services in case of any issues. Enabling automation for key functions like CI/CD across SDLC phases, monitoring, alerting, incident response, infrastructure provisioning, and patching will be essential to your role. As a Junior AWS SRE, you will focus on system reliability to reduce operational expenses, mitigate failure points, and automate time-consuming tasks, resulting in significant cost savings. Your proactive approach to failure resolution will involve identifying failure causes early and mitigating faults holistically. You will be involved in developing and maintaining cloud solutions in accordance with best practices and performing regular incident analysis to prevent incidents and find long-term solutions for them. If you are interested in this challenging and rewarding position, please send your CV to riddhi.joshi@blazeclan.com.

Posted 3 days ago


5.0 - 10.0 years

8 - 12 Lacs

Hyderabad

Work from Office

Senior Site Reliability Engineer - JD

As a Senior Site Reliability Engineer (SRE), you will collaborate closely with our Development and IT teams to ensure the reliability, scalability, and performance of our applications. You will take ownership of setting and maintaining service-level objectives (SLOs), building robust monitoring and alerting, and continually improving our infrastructure and processes to maximize uptime and deliver an exceptional customer experience. This role operates at the intersection of development and operations, reinforcing best practices, automating solutions, and reducing toil across systems and platforms.

About QualMinds: QualMinds is a global technology company dedicated to empowering clients on their digital transformation journey. We help our clients design and develop world-class digital products, custom software, and platforms. Our primary focus is delivering enterprise-grade interactive software applications across web, desktop, mobile, and embedded platforms.

Responsibilities:

1. Ensure Reliability & Performance: Own the observability of our systems, ensuring they meet established service-level objectives (SLOs) and maintain high availability.
2. Cloud & Container Orchestration: Deploy, configure, and manage resources on Google Cloud Platform (GCP) and Google Kubernetes Engine (GKE), focusing on secure and scalable infrastructures.
3. Infrastructure Automation & Tooling: Set up and maintain automated build and deployment pipelines; drive continuous improvements to reduce manual work and risks.
4. Monitoring & Alerting: Develop and refine comprehensive monitoring solutions (performance, uptime, error rates, etc.) to detect issues early and minimize downtime.
5. Incident Management & Troubleshooting: Participate in on-call rotations; manage incidents through resolution, investigate root causes, and create blameless postmortems to prevent recurrences.
6. Collaboration with Development: Partner with development teams to design and release services that are production-ready from day one, emphasizing reliability, scalability, and performance.
7. Security & Compliance: Integrate security best practices into system design and operations; maintain compliance with SOC 2 and other relevant standards.
8. Performance & Capacity Planning: Continuously assess system performance and capacity; propose and implement improvements to meet current and future demands.
9. Technical Evangelism: Contribute to cultivating a culture of reliability through training, documentation, and mentorship across the organization.

Requirements:

- Bachelor's degree in Computer Science, Business Administration, or relevant work experience.
- A minimum of 5 years in an SRE, DevOps, or similar role in an IT environment (required).
- Hands-on experience with Microsoft SQL Clusters, Elasticsearch, and Kubernetes (required).
- Deep familiarity with Windows or Linux environments and .NET or PHP stack applications, including IIS/Apache, SQL Server/MySQL, etc.
- Strong understanding of networking, firewalls, intrusion detection, and security best practices.
- Proven administrative experience with tools like Git, TFS, Bitbucket, and Bamboo for continuous integration, delivery, and deployment.
- Knowledge of automation testing tools such as SonarQube, Selenium, or comparable technologies.
- Experience with performance profiling, logging, metrics collection, and alerting tools.
- Competence in debugging solutions across diverse environments.
- Hands-on experience with GCP, AWS, or Azure, container orchestration (Kubernetes), and microservices-based architectures.
- Understanding of authentication, authorization, OAuth, SAML, encryption (public/private key, symmetric, asymmetric), token validation, and SSO.
- Familiarity with security strategies to optimize performance while maintaining compliance (e.g., SOC 2).
- Willingness to participate in an on-call rotation and respond to system emergencies 24/7 when necessary.
- Monthly weekend rotation for production patching.
- A+, MCP, and Dell certifications and Microsoft Office expertise are a plus.

Posted 3 days ago


6.0 - 10.0 years

10 - 15 Lacs

Bengaluru

Work from Office

We are looking for a Senior Site Reliability Engineer to join our Service Reliability and Operation group. We provide innovative team collaboration and an opportunity to build, operate and support scalable and reliable services that underpin Thomson Reuters products.

About the Role
In this opportunity as a Senior Site Reliability Engineer, you will:

Be a professional SRE:
  • Implement site reliability engineering and DevOps best practices.
  • Feed non-functional requirements into the product backlog, such as, but not limited to, high availability, scalability, self-healing, observability, continuous delivery and security.
  • Build and maintain monitoring for all aspects of infrastructure, microservices and the platform, and implement alerting mechanisms using cloud-native solutions.
  • Provide primary operational support and engineering for distributed platforms.
  • Act as the go-to person for any production issue: troubleshoot and monitor until successful mitigation, communicate effectively, run the postmortem and implement the learnings.
  • Maintain IaC and CI/CD, and promote best practices for our CI/CD processes.
  • Focus on continuous improvement and technical standards: drive improvements in productivity, monitoring and tooling, and set industry best practices.
  • On-call rotation: participate in on-call/shift rotations. When on call, you are expected to drive the troubleshooting and mitigation activities while working on the incident.

Be innovative and curious:
  • Maintain end-to-end security, ensuring that we meet best-practice standards.
  • Keep up to date with emerging cloud technology trends, especially around DevOps, Service Reliability and Security.
  • Adopt pan-TR operation principles to ensure consistency and efficiency.
  • Document tribal knowledge. Constant upkeep of documentation and runbooks ensures that teams get the information they need right when they need it.

Be collaborative:
  • Extreme collaboration within our teams across Canada, the US, Mexico and India.

About You
You're a fit for the role of Senior Site Reliability Engineer if you have:
  • A Bachelor's degree in computer science or a related field (a must).
  • 6-10 years of experience as a DevOps/SRE engineer and cloud engineer, with hands-on experience in AWS cloud technologies.
  • Strong skills in UNIX/Linux-based systems.
  • Proven experience in building and operating production cloud-native infrastructure, applications, and services on AWS.
  • Experience or knowledge of container technology such as Docker, Kubernetes and the Istio service mesh.
  • Experience using AWS services such as CloudFront, EKS, ECS, RDS, threat detection and other security controls (a must).
  • 2+ years of scripting and programming experience in PowerShell or Bash (a must).
  • Experience or knowledge of observability tools: Datadog, ELK, Sumo Logic, CloudWatch.
  • Experience or knowledge of version control and CI/CD (Git / Azure DevOps / JFrog Artifactory).
  • Experience or knowledge of writing Infrastructure as Code (IaC) (Terraform / CloudFormation / other).
  • A team-player mindset with a can-do attitude.

What's in it For You
  • Hybrid Work Model: We've adopted a flexible hybrid working environment (2-3 days a week in the office depending on the role) for our office-based roles while delivering a seamless experience that is digitally and physically connected.
  • Flexibility & Work-Life Balance: Flex My Way is a set of supportive workplace policies designed to help manage personal and professional responsibilities, whether caring for family, giving back to the community, or finding time to refresh and reset. This builds upon our flexible work arrangements, including work from anywhere for up to 8 weeks per year, empowering employees to achieve a better work-life balance.
  • Career Development and Growth: By fostering a culture of continuous learning and skill development, we prepare our talent to tackle tomorrow's challenges and deliver real-world solutions. Our Grow My Way programming and skills-first approach ensures you have the tools and knowledge to grow, lead, and thrive in an AI-enabled future.
  • Industry Competitive Benefits: We offer comprehensive benefit plans that include flexible vacation, two company-wide Mental Health Days off, access to the Headspace app, retirement savings, tuition reimbursement, employee incentive programs, and resources for mental, physical, and financial wellbeing.
  • Culture: Globally recognized, award-winning reputation for inclusion and belonging, flexibility, work-life balance, and more. We live by our values: Obsess over our Customers, Compete to Win, Challenge (Y)our Thinking, Act Fast / Learn Fast, and Stronger Together.
  • Social Impact: Make an impact in your community with our Social Impact Institute. We offer employees two paid volunteer days off annually and opportunities to get involved with pro-bono consulting projects and Environmental, Social, and Governance (ESG) initiatives.
  • Making a Real-World Impact: We are one of the few companies globally that helps its customers pursue justice, truth, and transparency. Together, with the professionals and institutions we serve, we help uphold the rule of law, turn the wheels of commerce, catch bad actors, report the facts, and provide trusted, unbiased information to people all over the world.

Thomson Reuters informs the way forward by bringing together the trusted content and technology that people and organizations need to make the right decisions. We serve professionals across legal, tax, accounting, compliance, government, and media. Our products combine highly specialized software and insights to empower professionals with the data, intelligence, and solutions needed to make informed decisions, and to help institutions in their pursuit of justice, truth, and transparency. Reuters, part of Thomson Reuters, is a world-leading provider of trusted journalism and news.

We are powered by the talents of 26,000 employees across more than 70 countries, where everyone has a chance to contribute and grow professionally in flexible work environments. At a time when objectivity, accuracy, fairness, and transparency are under attack, we consider it our duty to pursue them. Sound exciting? Join us and help shape the industries that move society forward.

As a global business, we rely on the unique backgrounds, perspectives, and experiences of all employees to deliver on our business goals. To ensure we can do that, we seek talented, qualified employees in all our operations around the world regardless of race, color, sex/gender, including pregnancy, gender identity and expression, national origin, religion, sexual orientation, disability, age, marital status, citizen status, veteran status, or any other protected classification under applicable law. Thomson Reuters is proud to be an Equal Employment Opportunity Employer providing a drug-free workplace. We also make reasonable accommodations for qualified individuals with disabilities and for sincerely held religious beliefs in accordance with applicable law.

More information about Thomson Reuters can be found on thomsonreuters.com.

Posted 3 days ago

Apply

8.0 - 13.0 years

12 - 16 Lacs

Bengaluru

Work from Office

We are seeking a visionary Principal DevOps Engineer to lead the transformation of our DevOps strategy, driving innovation, automation, and cloud excellence with a strong focus on AWS technologies. You will spearhead efforts to modernize CI/CD pipelines, enhance cloud security and observability, and lead a DevOps culture that accelerates business growth and technical excellence.

Key Responsibilities
  • Architect and evolve scalable, secure AWS infrastructure for performance and efficiency.
  • Lead DevOps engineers to implement cloud-native, fully automated environments.
  • Define and enforce IaC best practices using Terraform, CloudFormation, or CDK.
  • Enable self-service deployments for development teams.
  • Modernize CI/CD pipelines using Jenkins, GitHub Actions, AWS CodePipeline, etc.
  • Champion serverless and event-driven architecture using AWS Lambda and Step Functions.
  • Implement Kubernetes (EKS/ECS) and microservices infrastructure for agility and resilience.
  • Enhance observability using CloudWatch, Prometheus, Grafana, and AI-based tools.
  • Embed DevSecOps into CI/CD and cloud workflows, ensuring robust security posture.
  • Lead disaster recovery, high availability, and business continuity planning.
  • Optimize AWS spend using FinOps principles without compromising performance.
  • Build and scale AI/ML pipelines using Spark, Kafka, TensorFlow, PyTorch, Kubeflow, MLflow, Airflow, and AWS SageMaker.
  • Research and adopt emerging DevOps and cloud technologies to drive innovation.

Required Qualifications
  • 8+ years in DevOps, Cloud Infrastructure, or SRE with proven transformation leadership.
  • 6+ years of AWS hands-on experience (EC2, S3, RDS, Lambda, VPC, IAM, etc.).
  • Expertise in IaC with Terraform, CloudFormation, or CDK.
  • Advanced scripting skills in Python, Bash, or Go.
  • Mastery of Kubernetes (EKS or custom clusters), Docker, and microservices architecture.
  • Extensive CI/CD experience with Jenkins, GitHub Actions, GitLab CI/CD, and AWS tools.
  • Deep knowledge of observability tools (CloudWatch, Prometheus, Grafana, ELK).
  • Strong understanding of cloud security, IAM policies, and compliance.
  • Proven leadership in DevOps cultural and technological transformation.
  • Hands-on AI/ML pipeline experience using Spark, Kafka, TensorFlow, Kubeflow, SageMaker, etc.

Preferred Qualifications
  • AWS Certified DevOps Engineer Professional or AWS Solutions Architect Professional.
  • Experience building large-scale serverless systems.
  • Strong FinOps and AWS cost management knowledge.
  • Experience with SRE practices and site reliability principles.
  • Familiarity with AI-driven automation and self-healing infrastructure approaches.

Posted 3 days ago

Apply

5.0 - 8.0 years

11 - 21 Lacs

Hyderabad, Pune, Bengaluru

Hybrid

Hi,

Please share your updated resume with the details below:
  • Total Exp in DevOps:
  • SRE:
  • Current CTC:
  • Expected CTC:
  • Notice period:
  • Current Location:
  • Pref Location: PAN India

Thanks & Regards,
Prangyaparamit Padhy
Talent Acquisition Team
prangyaparamit.padhy@ltimindtree.com

Posted 3 days ago

Apply

10.0 - 20.0 years

15 - 30 Lacs

Hyderabad, Chennai, Bengaluru

Work from Office

Role & responsibilities

Job Title: DevOps Lead with Site Reliability Engineering (SRE)
Experience: 6-15 Years
Location: Any
Employment Type: Full-Time

Job Summary: We are seeking an experienced and proactive DevOps Lead with strong Site Reliability Engineering (SRE) capabilities to lead the design, implementation, and maintenance of our production systems. The ideal candidate will have deep experience in observability, automation, performance monitoring, and incident management, using tools such as Splunk, AppDynamics, IPSoft (Amelia), Python, BigPanda, Ansible, and ThousandEyes.

Key Responsibilities:
  • Lead a cross-functional SRE/DevOps team responsible for the reliability, scalability, and performance of production systems.
  • Develop and implement monitoring and observability strategies using tools like Splunk, AppDynamics, and ThousandEyes.
  • Automate infrastructure deployment and configuration using Ansible and other IaC tools.
  • Utilize BigPanda for intelligent alerting and incident correlation.
  • Integrate and manage IPSoft/Amelia for automated L1 incident handling and resolution.
  • Design, develop, and maintain Python scripts for automation, data processing, and tool integration.
  • Define and track SLOs, SLIs, and SLAs in collaboration with product and operations teams.
  • Lead incident management processes and conduct postmortems to ensure continuous improvement.
  • Collaborate with development teams to enhance application reliability, deployment, and CI/CD processes.
  • Drive operational excellence, cost optimization, and security best practices.

Technical Skills Required:
  • Monitoring & Observability: Splunk, AppDynamics, ThousandEyes
  • Incident Management & Correlation: BigPanda, IPSoft/Amelia
  • Automation & Scripting: Python, Ansible, Shell scripting
  • DevOps Practices: CI/CD, version control (Git), infrastructure as code
  • Cloud Platforms: AWS, Azure, or GCP (preferred)
  • Containerization & Orchestration: Docker, Kubernetes (added advantage)
  • ITSM & Collaboration Tools: ServiceNow, Jira, Confluence

Required Qualifications:
  • Bachelor’s or Master’s degree in Computer Science, Engineering, or related discipline.
  • 11+ years of experience in DevOps/SRE, with at least 3 years in a leadership role.
  • Proven experience managing 24x7 production systems with high availability and performance requirements.
  • Strong analytical, problem-solving, and incident response skills.
  • Excellent communication and leadership abilities.

Preferred Certifications (Optional):
  • AWS / Azure Certified DevOps Engineer
  • Splunk Core Certified Power User / Admin
  • AppDynamics Certified Associate Performance Analyst
  • Red Hat Certified Specialist in Ansible Automation
  • Python Certification (e.g., PCEP, PCAP)

Posted 3 days ago

Apply

6.0 - 10.0 years

0 - 2 Lacs

Bengaluru

Work from Office

SRE/Production Support
  • Should have a good understanding of ITSM processes: Incident, Problem & Change Management.
  • Should have experience in driving outage calls.
  • Should be able to use tools like Splunk, AppDynamics, Prometheus, Grafana & Loki during a triage to isolate the problem and provide metrics like total FCI during the outage call.
  • Should adhere to the Change Review process and prepare the team for operational readiness by setting up the required monitoring and Traffic/Triage dashboards.
  • Should understand SRE concepts such as SLI/SLO and Error Budget.
  • Should have good knowledge of ServiceNow and Jira.
  • Should be able to reduce toil through automation ideas.

Programming
  • Should be well versed in one programming language: Java (Spring Boot applications).
  • Should understand web services: RESTful services, SOAP-based services, JSON/XML.
  • Should have a basic understanding of UNIX systems.
  • Should have an understanding of Kubernetes & DevOps.
  • Basic UNIX commands for file processing (cat, grep, cp, chmod, etc.).
  • Knowledge of Python, Jupyter notebook, Shell scripting and awk is an added advantage.

UI
  • Should have an understanding of UI technologies: HTML, CSS, JavaScript.
  • Should be able to interpret the code for analysis.
  • Should understand the concept of cache and cookies.

Web Servers
  • Should have a good understanding of infrastructure (Apache/Tomcat, IIS Server, Kubernetes clusters/pods) and how to interpret and analyze metrics, logs and traces to detect any issues.
  • Should understand the concept of a load balancer.
  • Should have knowledge of CPU utilization, disk space, memory and network latency as they relate to a server.

SQL Knowledge
  • Should be able to write complex queries with joins/temp tables for various requirements.
  • Should understand the query execution path and be able to write optimized queries.
  • Should have knowledge of indexes, constraints, etc.
  • Should be able to interpret stored procedures for day-to-day analysis.
  • Good working knowledge of DB2.

Posted 3 days ago

Apply

3.0 - 8.0 years

0 Lacs

Kochi, Kerala

On-site

As a Tech Lead Full Stack at Qubryx, a US-based Product Consulting and Development company, you will play a crucial role in leading the development, implementation, and maintenance of software solutions and applications for both client and company web-based products. This is a full-time remote role with the opportunity to work on designing and developing user interfaces, testing, and debugging code. While the role is primarily located in Kochi, there is flexibility for remote work as well.

To be considered for this position, you should have at least 8 years of experience in full-cycle software development projects and a minimum of 3 years of experience as a Tech Lead. You should have a proven track record of designing and developing software applications from scratch. Proficiency in JavaScript, TypeScript, and Node.js, and strong skills in designing and querying NoSQL databases (MongoDB), are essential for this role. Additionally, you should possess strong SQL skills and experience working with SQL Server and Postgres. Experience with AWS Lambda, S3, RDS, and API Gateway, as well as familiarity with front-end UI frameworks such as React and React Native, is highly desirable. Knowledge of Scrum methodologies, sprint planning, project planning, estimation, and product feature management is crucial for success in this role. You should have experience managing teams of developers, providing technical guidance, and fostering a collaborative team environment.

Preferred qualifications include AWS Certifications, experience with Docker, containers, Kubernetes, and microservices, as well as proficiency in Python with past Java or .NET experience. Experience with serverless coding on AWS Lambda or Azure Functions, Azure DevOps, and working in Scrum teams is advantageous. A deep understanding of DevOps and SRE principles, along with experience implementing DevOps best practices, is also preferred.

As a Tech Lead Full Stack, you should be a self-starter with excellent problem-solving skills and strong verbal and written communication abilities. You should be comfortable working independently as well as collaborating closely with other team members, both offshore and onsite. You should have the ability to code new features, troubleshoot problems, and identify areas for improvement. If you are a highly motivated individual with a passion for software development and a willingness to learn and grow with the team, we encourage you to apply for this exciting opportunity. Join us at Qubryx and be part of a dynamic team that values innovation, collaboration, and continuous improvement. Benefits include competitive compensation and the opportunity to work on cutting-edge projects with a talented team. To qualify for this role, you should have a bachelor's degree and a minimum of 8 years of experience in relevant technologies.

Posted 4 days ago

Apply

10.0 - 14.0 years

0 Lacs

Karnataka

On-site

Join us as an Infrastructure Engineer. You will collaborate in building the best possible solutions for public and private cloud environments and engineer infrastructure technology to comply with security, resilience, sustainability, and operational requirements with observability and guardrails built-in. You will also use automation to provide testing and a route to live for the product, identifying ways to use new and existing technology tools to enhance performance, removing inefficiencies. This is a chance to work with colleagues across the bank to share engineering best practices, allowing you to provide thought leadership while developing solutions. We're offering this role at the vice president level. As an Infrastructure Engineer, you will contribute to and manage the selection, creation, and maintenance of technologies required to meet the needs of our customers, strategic targets, and architecture outcomes, along with developing products using modern engineering practices and tools. We'll look to you to collaborate with Product Owners to develop product roadmaps and manage the lifecycle of the team's products and support engineered products to respond to customer feedback, new feature requests, resolve production issues, and help customers consume our products. Additionally, you'll take a lead role within a team to design and engineer intuitive, self-service infrastructure products, develop technical skills through continuous learning and development, contribute to the delivery of infrastructure as code solutions, build an awareness of design thinking tools and techniques with users in order to improve your product, provide operational support for pattern or product-related issues, and work with key vendors in the delivery of the infrastructure services and technology for the product. 
To thrive in this role, you'll need more than ten years of experience with various monitoring tools, along with extensive automation, DevOps, and cloud adoption (AWS) experience, to support the Splunk platform. You'll also have to define, create, and provide oversight and governance of engineering and design solutions with a focus on end-to-end automation, simplification, resilience, security, performance, scalability, and reusability for onboarding customers (SRE), and help them develop alerting and monitoring solutions. This role also requires incident and change management activities. Furthermore, you'll need experience and a strong understanding of implementing DevOps/CI/CD pipelines with tools like GitLab, Jira, and Confluence; Python, JavaScript, and general scripting; Infrastructure as Code tools like Puppet, Terraform, and Ansible; public cloud vendor knowledge covering cloud adoption/migration (AWS); experience of working with technology deployed to an on-premise data center; and strong collaborative communication skills for articulating technical concepts clearly to stakeholders.

Posted 4 days ago

Apply

5.0 - 9.0 years

0 Lacs

Karnataka

On-site

As a Senior Site Reliability Engineer for the Operational Readiness team at HashiCorp, you play a crucial role in enhancing the scalability, performance, and reliability of our cloud products. With over 5 years of experience in site reliability engineering or a related field, you lead efforts to identify performance bottlenecks, address operational challenges proactively, and ensure our services meet the highest standards of operational excellence. Your expertise in load testing, performance analysis, and system hardening is instrumental in maintaining the operational resilience of our enterprise and cloud-based products. You focus on ensuring high availability and performance across all of HashiCorp's offerings, with a holistic view of enterprise and cloud systems.

In this role, you define and execute test plans, develop system-wide strategies for product load and performance testing, and explore new avenues to meet essential operational readiness criteria. You utilize troubleshooting techniques like chaos engineering to identify and provide novel solutions for complex system issues that may impact customers.

Key Responsibilities:
  • Implement best practices for system reliability, including proactive identification of potential failure points and automated mitigations.
  • Design and execute comprehensive load testing strategies to identify performance bottlenecks and scalability limits.
  • Improve system resilience by implementing best practices and technologies for high availability and fault tolerance.
  • Collaborate with engineering and product teams to integrate operational readiness into the development lifecycle.
  • Build tools and frameworks for automated testing, environment simulation, and incident reproduction to increase test coverage.
  • Analyze testing results, document findings, and make actionable recommendations for system enhancements.
  • Drive systemic improvements through chaos testing and work closely with product development teams.
  • Share knowledge and expertise with team members, promoting a culture of learning and continuous improvement.
  • Develop and implement disaster recovery and backup strategies to ensure data integrity and system resilience.

Ideal Candidate:
  • 5+ years of experience in SRE, systems engineering, or non-functional testing roles with a focus on operational readiness and performance testing.
  • Proficiency in high-level programming languages or scripting.
  • Track record of leading successful load testing and performance optimization initiatives in cloud and on-prem environments.
  • Experience in creating and managing test environments for automated testing.
  • Strong understanding of CI/CD processes and maintaining quality pipelines.
  • Familiarity with version control systems (e.g., Git) and agile project management methodologies.
  • Knowledge of monitoring and alerting systems, with the ability to develop metrics and alarms reflecting system health and operational risks.
  • Technical foundation in cloud technologies (AWS, Azure, or GCP) and container technologies like Nomad or Kubernetes.
  • Experience with performance testing tools like k6, Artillery, Vegeta, Locust, etc.
  • Effective communication and collaboration skills with cross-functional teams and diverse audiences.
  • Familiarity with HashiCorp products and tools is a plus.
  • Exposure to the disaster recovery domain is also a plus.

Posted 4 days ago

Apply

2.0 - 8.0 years

0 Lacs

Karnataka

On-site

NTT DATA is looking to hire an Azure platform SRE to join their team in Bengaluru, Karnataka, India. As an Azure platform SRE, you will need to have 6 to 8 years of overall experience with a strong working knowledge of SRE, including expertise in Databricks and Terraform. Specifically, you should have at least 3 years of experience with Azure infrastructure, 2+ years of experience with Databricks (including Unity Catalog), and 2+ years of experience with Terraform. Additionally, experience with SQL and DevOps pipelines would be a plus.

NTT DATA is a trusted global innovator of business and technology services with a commitment to helping clients innovate, optimize, and transform for long-term success. As a Global Top Employer, they have diverse experts in more than 50 countries and a robust partner ecosystem. Their services include business and technology consulting, data and artificial intelligence, industry solutions, as well as the development, implementation, and management of applications, infrastructure, and connectivity. NTT DATA is one of the leading providers of digital and AI infrastructure globally. As part of the NTT Group, they invest significantly in R&D to support organizations and society in moving confidently and sustainably into the digital future. Visit us at us.nttdata.com.

Posted 4 days ago

Apply

7.0 - 12.0 years

15 - 30 Lacs

Bengaluru

Hybrid

Job Description: SRE

Infrastructure Platform Engineering (IPE), part of the LSEG Infrastructure & Cloud organisation, is searching for a senior Associate to drive Site Reliability Engineering (SRE) and a professional, best-in-class approach to service operations across the Production infrastructure environment. IPE operates globally with around 600 people in functionally aligned teams across Data Centres, Storage, Platforms, Database, Middleware and the virtualized Private Cloud. The senior Associate will work with teams across IPE and drive Site Reliability culture during APAC hours, partnering with other regional squads to drive improvements across the infrastructure estate and service centricity across all teams.

As an Infrastructure SRE, the candidate is required to have a sound understanding of the ITSM methodology, specifically Service Operations, including incident, problem and change management. This role champions a culture of continuous service improvement using the policies as a framework and reports on our service performance in terms of improvements to SRE areas of focus. The role also entails driving Site Reliability principles and opportunities for service resilience, scalability and performance across our critical infrastructure, working with our teams across IPE.

Topics within your remit will include: assurance of service data quality; compliance with policy, hygiene/performance metrics and SLAs; championing best practices in infrastructure management, including driving proactive monitoring and capacity planning; collaborating with Security professionals to enhance infrastructure security; vendor engagement; scenario and business continuity readiness testing; continuous improvement; and training/upskilling counterparts in aspects of service management.

KEY RESPONSIBILITIES:
  • Drive high levels of stability and availability of services, driving Site Reliability Engineering as a practice across IPE.
  • Grow partnership with Product Engineering owners and drive initiatives which benefit the team in accordance with SRE.
  • Be available 24x7 as an escalation point for the operational teams.
  • Reduce MTTR and service impact.
  • Address technical debt across IPE to remove risk.
  • Reduce recovery time on incidents.
  • Aid in major incidents which are owned by IPE.
  • Validate service communications from a technical perspective during major incidents.
  • Drive standard process and continual improvement for incident recovery, problem management, service resilience and availability.
  • Bring in best ITSM practices to evaluate and update existing practice, for example by creating knowledge articles, runbooks, and process documents.
  • Take responsibility for IPE Technical Recovery and Problem Management response, ensuring cross-coordination across Technology Teams for complex, IPE-owned issues.
  • Be accountable for technical decisions and communications on service recovery during live incidents.
  • Reduce recovery time on incidents and act as the main contact point for Major Incidents.
  • Collaborate with stakeholders to meet business objectives in Group IT initiatives by utilising in-depth knowledge of operations, processes and applications.
  • Identify trends and possible opportunities for Service Improvement Programs (cross-domain/divisional), gain support and sponsorship, then track and drive those programs through to conclusion, providing regular service updates on progress.
  • Take responsibility for oversight and governance of key resilience requirements for applications within IPE and address technical debt across IPE to remove risk.

MINIMUM REQUIREMENTS:
  • Bachelor's degree or equivalent experience in an IT-related discipline preferred.
  • Technical knowledge of SRE areas of focus: implementations with Datadog as an observability focus, capacity management, etc.
  • Outstanding communication and influencing skills.
  • Experience of industry best-practice processes and the ability to drive approach and process changes.
  • Initiative-taking, focused, and resilient, with a cheerful outlook.
  • Good negotiation and influencing skills, able to overcome resistance and reach consensus and compromise to attain the required objective.
  • Demonstrated ability to manage time-critical incident and recovery (crisis) situations, including communication and liaison with internal stakeholders.
  • ITIL Foundation certificate (a must).
  • Extensive experience with monitoring tools (e.g. Datadog, ITRS, etc.).

Posted 4 days ago

Apply

10.0 - 20.0 years

25 - 35 Lacs

Pune

Work from Office

Role & responsibilities:
  • Develop automation scripts and integrations using Python, Node.js, and Bash to streamline operations and improve observability.
  • Monitor application and infrastructure performance using Splunk and Dynatrace.
  • Participate in incident response and root cause analysis.
  • Implement and manage Akamai configurations for performance optimization and bot mitigation.

Required Skills:
  • 5+ years of experience in a Site Reliability Engineering, DevOps, or related role.
  • Experience developing scripts using Python, Node.js, and Bash.
  • Understanding of REST APIs, data serialization (JSON, YAML), and HTTP protocols.
  • Hands-on experience building pipelines with Jenkins or similar tools.
  • Proficiency with monitoring and observability tools, especially Splunk and Dynatrace.
  • Experience with Jira and agile development.
  • Experience with Salesforce Commerce Cloud is a plus.

Posted 4 days ago

Apply

Exploring SRE Jobs in India

Site Reliability Engineering (SRE) is a rapidly growing field in India, with numerous job opportunities available for skilled professionals. SREs play a crucial role in ensuring the reliability, performance, and availability of software systems. If you are considering a career in SRE in India, here is some valuable information to help you navigate the job market.

Top Hiring Locations in India

  1. Bangalore
  2. Hyderabad
  3. Pune
  4. Mumbai
  5. Chennai

These cities have a high demand for SRE professionals, with numerous tech companies actively hiring for SRE roles.

Average Salary Range

The average salary range for SRE professionals in India varies based on experience level. Entry-level SREs can expect to earn around INR 6-8 lakhs per annum, while experienced professionals with 5+ years of experience can earn upwards of INR 15 lakhs per annum.

Career Path

In the field of SRE, a typical career progression may include roles such as Junior SRE, SRE, Senior SRE, SRE Team Lead, and SRE Manager. As professionals gain experience and expertise, they may move into more strategic and leadership roles within the organization.

Related Skills

In addition to expertise in site reliability engineering, SRE professionals are often expected to have skills in areas such as:

  • Cloud computing
  • Automation tools
  • Monitoring and alerting systems
  • Scripting languages like Python or Shell scripting

Interview Questions

  • What is the difference between uptime and availability? (basic)
  • How would you handle a sudden increase in traffic to a website? (medium)
  • What is the purpose of a Service Level Objective (SLO) in SRE? (basic)
  • Explain the concept of error budget in SRE. (medium)
  • How do you ensure the security of a system as an SRE? (medium)
  • What is the role of chaos engineering in SRE practices? (advanced)
  • Describe your experience with incident management and postmortems. (medium)
  • How do you prioritize tasks when multiple incidents occur simultaneously? (medium)
  • What are the key metrics you would monitor in a production system? (medium)
  • How would you approach capacity planning for a system? (advanced)
  • Explain the concept of "toil" in an SRE context. (basic)
  • How do you ensure high availability in a distributed system? (advanced)
  • Describe your experience with implementing CI/CD pipelines. (medium)
  • How do you handle configuration management in SRE? (medium)
  • What is the role of automation in SRE practices? (basic)
  • How do you stay updated with the latest trends and technologies in SRE? (basic)
  • Explain the concept of "SLI, SLO, SLA" in the context of SRE. (medium)
  • Describe a challenging incident you resolved and the steps you took to mitigate it. (medium)
  • How do you perform load testing for a system? (medium)
  • What is the importance of monitoring in SRE? (basic)
  • How do you ensure disaster recovery in a system? (advanced)
  • Describe your experience with containerization technologies like Docker. (medium)
  • How do you handle database failures in a production environment? (medium)
  • What steps would you take to optimize the performance of a slow application? (medium)
  • How do you approach on-call rotations and incident response? (medium)
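
For the SLO and error budget questions above, it can help to have the arithmetic ready. The sketch below is illustrative only: the function names, the 30-day window, and the 99.9% target are assumptions chosen for the example, not values from any listing on this page. It treats the error budget as the share of a time window in which an availability SLO still permits the service to be unavailable.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime, in minutes, that an availability SLO allows.

    slo_target is a fraction such as 0.999 (99.9%); window_days is the
    length of the SLO window. Both names are hypothetical, for illustration.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Return the fraction of the error budget still unspent.

    A negative result means the budget is exhausted and the SLO is breached.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes of budget")
    # After 21.6 minutes of downtime, about half of the budget is left.
    print(f"{budget_remaining(0.999, 30, 21.6):.0%} of budget remaining")
```

In an interview, the same calculation is a useful way to explain why teams often slow down releases once the remaining budget approaches zero.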

Closing Remark

As you explore opportunities in the field of SRE in India, it is essential to stay updated with industry trends, continuously upskill yourself, and be prepared for challenging technical interviews. With the right skills and preparation, you can confidently apply for SRE roles and embark on a rewarding career in this dynamic field. Good luck!

Featured Companies