Site Reliability Engineer

5 - 10 years

25 - 35 Lacs

Hyderabad

Posted: 16 hours ago | Platform: Naukri

Skills Required

Site Reliability Engineering, Terraform, Infrastructure Management, Cloud, AWS, Kubernetes

Work Mode

Work from Office

Job Type

Full Time

Job Description

TriNet is a leading provider of comprehensive human resources solutions for small to midsize businesses (SMBs). We enhance business productivity by enabling our clients to outsource their HR function to one strategic partner, allowing them to focus on operating and growing their core businesses. Our full-service HR solutions include payroll processing, human capital consulting, employment law compliance, and employee benefits, including health insurance, retirement plans, and workers' compensation insurance. TriNet has a nationwide presence and an experienced executive team. Our stock is publicly traded on the NYSE under the ticker symbol TNET. If you're passionate about innovation and making an impact on the large SMB market, come join us as we power our clients' business success with extraordinary HR.

Don't meet every single requirement? Studies have shown that many potential applicants discourage themselves from applying to jobs unless they meet every single requirement. TriNet always strives to hire the most qualified candidate for a particular role, ensuring we deliver outstanding results for our small and medium-size customers. So if you're excited about this role but your past experience doesn't align perfectly with every single qualification in the job description, nobody's perfect and we encourage you to apply. You may just be the right candidate for this or other roles.

JOB SUMMARY

The SRE will work with engineering development teams, the Analytics organization, architects, and IT organizations to implement best practices for reliability and performance in the applications and services they support. Our ideal candidate is well-versed in modern cloud-based and on-prem architecture and experienced in designing systems for reliability as well as implementing monitoring, alerting, and ops automation to reliably operate and maintain the services they build.

Essential Duties/Responsibilities

Collaborates with Engineering teams to support services before they go live through activities such as system design consulting, developing secure, reliable, and highly available software platforms and frameworks, monitoring/alerting, capacity planning, and production readiness and reliability reviews.
Guides reliability practices through activities including architecture reviews, code reviews, capacity/scaling planning, and security vulnerability remediation.
Conducts, coordinates, and oversees post-incident Root Cause Analysis / Reviews and drives product improvements.
Participates with other SRE leaders in setting the enterprise strategy for designing and developing resiliency in the application code.
Participates in the on-call rotation for the services owned by the SRE team, effectively triaging and resolving production and development issues.
Performs code-level debugging on issues escalated to the team.
Mentors junior engineers and developers to help them grow and refine their SRE skills.
Performs other duties as assigned.
Complies with all policies and standards.

QUALIFICATIONS

Education
Bachelor's degree in Computer Science, Engineering, or a related field (preferred)

Work Experience
Typically 8+ years of experience in Site Reliability Engineering, infrastructure management, or a related field (required)
Typically 5+ years of experience in public cloud (AWS, Azure, etc.) and container technologies (preferred)

Licenses and Certifications
Cloud Architect certifications (AWS preferred)
Kubernetes certifications (preferred)

Knowledge, Skills and Abilities
Strong experience with programming languages such as Java and Python.
Strong experience with high availability planning, capacity planning, and disaster recovery is required.
Technical proficiency: strong hands-on experience with Ansible or Terraform and building services in AWS, and a strong understanding of in-memory data stores such as Redis and Memcached.
Deep understanding of REST API fundamentals.
Hands-on experience with container technologies such as Docker and Kubernetes.
Knowledge of network protocols such as IPv4/IPv6, TCP/IP, UDP, FTP, SMTP, SSL/TLS, and HTTP/HTTPS.
Practical understanding of messaging technologies such as ActiveMQ and RabbitMQ.
Ability to leverage monitoring and logging analytics tools such as Prometheus, Grafana, Splunk, and AppDynamics.
Ability to architect applications and solutions that are highly available, scalable, and fault tolerant.
A problem-solver mindset.

Work Environment
Work in a clean, pleasant, and comfortable office setting. The work environment characteristics described here are representative of those an employee encounters while performing the essential functions of this job. Reasonable accommodations may be made to enable persons with disabilities to perform the essential functions. This position is 100% in office.

Prescreening Questions

Q: What strategies would you use to scale a web application handling millions of requests per second?
A: Global CDN, caching, database query optimization and indexing, autoscaling.

Q: What is the difference between TCP and UDP?
A: TCP is connection-oriented and reliable, with ordered, acknowledged delivery. Encryption comes from TLS layered on top, not from TCP itself. Use cases: web (HTTP/HTTPS), email, file transfers. UDP is connectionless, with little or no reliability guarantee. Use cases: video streaming, gaming.

Q: What strategies do you use to improve Mean Time to Detect (MTTD) and Mean Time to Resolve (MTTR)?
A: Enhance monitoring and observability.
Automate alerts and incident detection (for example, integrate AI with monitoring tools to diagnose problems quickly).
Improve runbooks and incident response (clear, thorough runbooks that let anyone on the team resolve issues quickly).

Q: How would you handle DDoS attacks on a critical service?
A: This is a broad topic with no short answer. Key steps:
Scale infrastructure temporarily to handle the surge.
Identify attack vectors using monitoring tools and logs.
Enable DDoS protection services such as AWS Shield, Azure DDoS Protection, or Google Cloud Armor.
Apply rate limiting through request throttling (AWS WAF, Cloudflare Rate Limiting).
Put a CDN in front of the service to absorb traffic surges.
Use geo-blocking temporarily to deny requests from the regions where the attack originates.

Q: What is the difference between authentication and authorization?
A: Authentication (authn) confirms who you are. Authorization (authz) determines what you can access once authenticated.

Q: Which k8s object is used to expose a pod or deployment to external traffic?
A: Ingress (Kubernetes Ingress), which routes external HTTP/HTTPS traffic to a Service. A Service of type NodePort or LoadBalancer also exposes workloads externally.

Q: Which k8s component is responsible for maintaining the desired state of the cluster?
A: kube-controller-manager

Q: What is the primary purpose of AWS KMS (Key Management Service)?
A: AWS KMS is used to manage encryption keys for AWS services and applications, enabling secure data encryption and decryption.

Q: What are Terraform state files, and why are they important?
A:

Q: How do you manage Terraform state in a team environment?
A: 1) Instead of storing the terraform.tfstate file locally, use a remote state backend so that multiple team members can access and update the state safely. Common backends include Amazon S3 (with DynamoDB for state locking), Azure Storage, and Google Cloud Storage.
2) Enable state locking.
3) Avoid manual edits to the state file.

Other questions
=============

Can you walk me through a recent project where you used tools like Terraform or Ansible? What was your role in that project?
Look for hands-on involvement, not just team exposure.

Have you worked with AWS in your previous roles? Which services did you use most often and why?
Listen for services like EC2, S3, IAM, VPC, etc.

How do you typically manage application deployments in your current or past roles?
Look for mentions of Kubernetes, Docker, Helm, or CI/CD pipelines.

Can you describe a situation where you had to respond to a production issue or outage? What steps did you take?
This reveals their calmness under pressure and troubleshooting mindset.

What tools have you used for monitoring and logging system performance?
Expect names like Prometheus, Grafana, Splunk, AppDynamics.

Have you worked with any caching or in-memory data stores like Redis or Memcached? What were they used for?
Look for use cases like session storage, caching, or pub/sub.

Have you ever worked with secure login systems or APIs? What technologies or protocols were involved?
Listen for OAuth, OpenID Connect (OIDC), or SAML.

Can you give an example of how you've helped make a system more reliable or scalable?
Look for strategies like load balancing, redundancy, or failover.

Have you used any messaging systems like RabbitMQ or ActiveMQ? What were they used for in your project?
Expect use in asynchronous processing or service communication.

How do you stay calm and focused when something goes wrong in production?
This helps assess their mindset and soft skills under pressure.

Behavioral

"Can you tell me about a time when you had to work closely with a team to solve a problem?"
Look for: Collaboration, communication, shared responsibility, and how they describe others.

"How do you handle disagreements or conflicts with teammates?"
Look for: Constructive resolution, empathy, and willingness to compromise.

"Have you ever had to support a teammate who was struggling? What did you do?"
Look for: Initiative, empathy, and team-first mindset.

"How do you ensure smooth communication when working with cross-functional teams (e.g., developers, QA, product)?"
Look for: Proactive communication, adaptability, and clarity.
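
The DDoS prescreening answer above lists rate limiting through request throttling as one mitigation. As a rough illustration of the idea, a token-bucket limiter might look like the Python sketch below. This is not how AWS WAF or Cloudflare implement throttling; the class name `TokenBucket`, its parameters, and the injectable `clock` are hypothetical names chosen for this example.

```python
import time


class TokenBucket:
    """Hypothetical token-bucket limiter (illustrative only).

    Allows bursts of up to `capacity` requests, then a sustained
    rate of `refill_rate` requests per second.
    """

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_rate = float(refill_rate)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)  # start with a full bucket
        self.last_refill = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be throttled."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Credit tokens for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A candidate discussing this might note that a real deployment would typically keep one bucket per client key (for example, per source IP or API key) and answer throttled requests with HTTP 429.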
