On-site
Full Time
We are looking for a dedicated and skilled Operations Engineer (SRE) to ensure the reliability, scalability, and performance of our enterprise systems and applications. In this role, you will blend software engineering and IT operations to build automated solutions for operational challenges, improve system health, minimize manual effort, and support continuous delivery. You will play a key role in monitoring, maintaining, and improving production infrastructure, and in ensuring stable, high-quality service delivery.
Ensure high availability, performance, and stability of applications and infrastructure: servers, services, databases, network and other core components.
Design, build, and maintain fault-tolerant, highly-available, and scalable infrastructure.
Define, implement, and monitor Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) to measure reliability, performance, latency, error rates, uptime, etc.
Implement and maintain robust monitoring, logging and alerting systems for infrastructure and applications to proactively detect issues before they impact users.
Build dashboards and observability tooling to track system health metrics (latency, error rates, resource usage, throughput, etc.).
Set alert thresholds and alerting workflows for critical infrastructure components and services.
Lead incident response for system outages or performance degradation: triage issues, coordinate with relevant teams, mitigate impact, and restore service.
Perform root-cause analysis (RCA) and post-incident reviews to understand failures and identify permanent fixes/preventive measures.
Maintain incident runbooks, playbooks and documentation to support consistent and efficient incident handling.
Automate routine operational tasks (deployment, configuration, infrastructure provisioning, scaling, backups, recovery, etc.) to minimize manual intervention and reduce errors.
Develop and maintain Infrastructure-as-Code (IaC), configuration management, and automated deployment/CI-CD pipelines.
Build internal tools or scripts to streamline operations, monitoring, alerting, deployments, and recovery.
Monitor system performance, resource utilization, load, and growth trends to plan capacity and scaling requirements proactively.
Optimize infrastructure, services, and configurations for performance, cost-efficiency, fault tolerance, and scalability.
Collaborate with development teams to design and deploy systems with reliability and scalability in mind.
Work closely with development, QA, and product teams to support deployments, ensure operability of applications, and incorporate reliability practices into the development lifecycle.
Provide feedback on system design, performance, and operational best practices to help build reliable, maintainable systems.
Contribute to documentation: system architecture, runbooks, troubleshooting guides, and standard operating procedures (SOPs).
Ensure infrastructure security and compliance, and follow best practices in configuration, access control, backups, and disaster-recovery planning.
Plan and test disaster recovery and backup strategies to ensure business continuity.
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field (or equivalent experience).
Proven experience in SRE, system operations, infrastructure engineering, or related roles managing production-grade systems.
Strong scripting/programming skills (e.g., Python, Bash, Go, etc.) to build automation tools and operational scripts.
Experience with cloud platforms (AWS, GCP, Azure) or on-prem infrastructure; familiarity with containerization/orchestration (e.g., Docker, Kubernetes) is a plus.
Familiarity with monitoring/observability tools, logging, metrics, dashboards, and alerting frameworks.
Strong understanding of Linux/Unix systems, networking, load balancing, redundancy, failover, and system architecture.
Good problem-solving, troubleshooting, and root-cause analysis skills, with the ability to diagnose, mitigate, and resolve critical production issues.
Experience or comfort with CI/CD pipelines, Infrastructure-as-Code (IaC), configuration management, and automated deployments.
Excellent collaboration and communication skills, with the ability to work across teams (development, QA, operations) and coordinate under pressure.
Proactive mindset and commitment to reliability, operational excellence, automation, and continuous improvement.
Taggd
Bangalore
0.00021 - 0.00025 Lacs P.A.