Technical Support Engineer

3 - 7 years

10 - 20 Lacs

Posted: 1 day ago | Platform: Naukri

Work Mode

Hybrid

Job Type

Full Time

Job Description

Job Overview:

We are looking for a Site Reliability Engineer (SRE) Tech Support engineer to support our Cloud Site Reliability operations and ensure the smooth functioning of cloud infrastructure powered by OpenStack and Kubernetes. In this role, you will focus on monitoring, basic troubleshooting, and incident response, helping to maintain high system availability, reliability, and performance. You will be responsible for identifying and addressing simple issues, and for escalating more complex problems to the development team when needed.

The ideal candidate should have a good understanding of cloud infrastructure (especially OpenStack and Kubernetes), containerized environments, and system monitoring. This position offers an excellent opportunity for someone looking to grow into a more advanced SRE or DevOps role.

Key Responsibilities:

  • Incident Monitoring, Triage & Resolution:
      • Respond to system alerts and monitor infrastructure health for both OpenStack and Kubernetes using tools like Prometheus, Grafana, and other observability tools.
      • Identify low-level issues and follow runbooks or predefined scripts to perform first-level triage.
      • Investigate and resolve issues more complex than those handled at L0, such as Kubernetes pod crashes, network misconfigurations in OpenStack, and minor service disruptions.
      • Use tools like kubectl to troubleshoot Kubernetes pods and nodes, and the OpenStack CLI to diagnose problems with VMs, storage, and networks.
  • System Health Checks:
      • Perform daily health checks for Kubernetes pods, nodes, and OpenStack instances.
      • Verify functionality of VMs, containers, and network services within the environment.
  • Ticket Management:
      • Log incidents and issues into a ticketing system (e.g., JIRA, ServiceNow) for tracking and escalation.
      • Update incident tickets and provide relevant information for ongoing resolution efforts.
      • Work closely with L2 and L3 engineers on complex troubleshooting or advanced system issues that require in-depth knowledge.
  • Automation & Scripting:
      • Automate routine tasks, such as VM provisioning, pod deployments, or status checks, using basic scripting languages (Python, Bash).
      • Improve automation workflows based on feedback and frequently encountered issues.
  • Log Aggregation & Monitoring:
      • Review logs and metrics collected from ELK Stack, Prometheus, Grafana, or other logging tools to detect trends and potential issues.
      • Analyze logs and metrics from OpenStack and Kubernetes clusters to pinpoint underlying problems (e.g., high CPU usage, memory leaks).

Skills & Qualifications:

  • Familiarity with OpenStack architecture (e.g., Nova, Neutron, Cinder).
  • Good understanding of Kubernetes components, including pods, services, deployments, and namespaces.
  • Knowledge of Linux/Unix-based operating systems (e.g., Ubuntu, CentOS, Red Hat).
  • Understanding of networking concepts like DNS, IP routing, and VLANs in cloud environments.
  • Familiarity with monitoring tools like Prometheus, Grafana, Zabbix, or CloudWatch for alert management and system health monitoring.
  • Troubleshooting & Incident Response: Experience in using log aggregation tools (ELK stack, Splunk) and interpreting logs for incident detection.
  • Ability to perform basic troubleshooting steps (e.g., restarting services, running basic shell commands) to resolve issues.
  • Strong communication skills to collaborate effectively.
  • Ability to document incidents, solutions, and troubleshooting steps clearly.
  • Basic scripting, preferably in Python, for automation.

Certifications:

  • Basic certifications such as CompTIA Linux+, AWS Certified Solutions Architect, Certified Kubernetes Administrator (CKA), or Certified OpenStack Administrator (COA) are a plus.

Work Environment:

  • Requires working in rotating shifts, covering evenings/nights, weekends, and holidays, to ensure 24x7x365 availability of critical systems.

Truminds

Technology Consulting

San Francisco
