2.0 - 4.0 years
4 - 6 Lacs
Hyderabad
Work from Office
Summary: We are looking for a Site Reliability Engineer (SRE), initially focused on production AppOps, who can manage scalable systems using automation best practices that improve reliability and velocity, and who can enable monitoring of the operational health of services throughout their lifecycle, including metrics collection, aggregation, and visualization.

As a member of the SRE team, you will support NCR's Financial Services business unit, product, and technology teams to improve the design and operation of systems, focusing on making them scalable, reliable, and efficient while ensuring production performance and high availability of products and services primarily deployed and running in the cloud. You will influence the development and implementation of reliable production systems and services to address emerging business needs (such as cloud-based SaaS). SREs pride themselves on the resiliency and stability of production systems, yet at the same time are committed to innovation and operational improvement through the application of software engineering practices to operations. You will make our products easier to adopt and use by making improvements to the product, tools, processes, and documentation. You are someone who strives for six 9s or better in availability/uptime!

Key Areas of Responsibility (or where we need your support):
Maintain and scale production services and servers for complex, high-throughput cloud services.
Bridge and own the union between development, quality, security, and operations.
Improve the scalability, service reliability, capacity, and performance of the SaaS services.
Write automation code for provisioning and operating infrastructure at a massive scale.
Be an experienced software engineer focused on application reliability and scalability.
Contribute to the continuous improvement of our software delivery processes and practices in a multi-location, multidisciplinary team to empower and accelerate product development.
Design, configure, manage, and monitor systems in support of our product development teams.
Participate in disaster recovery planning and execution.
Maintain and patch servers supporting SaaS products, including Windows and Linux servers running in private data centers and/or using cloud PaaS providers (Azure).
Collaborate with other teams to promote code using CI/CD and AppSec tooling.
Collaborate with development, support, and dependent teams, using intuition, experience, and understanding to create SLIs, SLOs, and SLAs.
Implement monitoring alerts, build dashboards, and manage escalation paths.
Provide prompt support during critical incidents and prepare the PIR/RCA for them, helping not only to resolve the problem but also to minimize the downtime window.
Participate in on-call rota/schedules; this may require providing assistance during off-hours for production outage scenarios.

IDEAL TECHNICAL AND PROFESSIONAL SKILLS:
BS degree in Computer Science or a related technical field, or 5 years of prior relevant experience.
Extensive experience in a DevOps/SRE role with demonstrable experience in deploying and managing large-scale production environments in Azure, AWS, GCP, and multi-data center environments.
Experience developing and debugging code (e.g., one or more of the following: Ansible, Python, Shell, Perl, Golang, JavaScript, Java, C, C++, .NET).
2+ years deploying and supporting high-traffic, scalable web applications/services.
2+ years with Azure/GCP/AWS.
2+ years with Docker, Kubernetes, and an early version of OpenShift.
Experience with Linux, shell scripting, PKI, TLS/SSL, networking, firewalls, load balancers, and backup.
Experience in designing, analyzing, and running large-scale distributed systems.
Experience in hosting and solving problems in public-facing services securely in Azure, AWS, or GCP.
Experience with orchestration, automation, and configuration management tools like Ansible (or Puppet, Chef, Terraform, Helm, or related technology), git, and Fabric.
Excellent analysis, debugging, root-cause identification, and troubleshooting skills.
Experience with Kubernetes, system virtualization, on-prem and/or hybrid cloud computing, cloud identity, security systems, cloud monitoring and logging, and/or local/cloud storage.
Experience with one or more CI/CD and related tools like Azure DevOps/Jenkins/GitHub Actions, Artifactory, Harness, CloudBuild.
Experience with application disaster recovery, migration, roll-back plans, expansion, routine deployments, and system upgrades.
Experience with log management, including monitoring, aggregation, alerting, and graphing (e.g., NagiosXI/Prometheus/ELK/Sensu/StackDriver/TICK stacks).
Bonus points for experience with Kafka, Elasticsearch, or Cassandra.
Extra bonus points for cloud certifications and exposure to Harness.
Posted 1 week ago