Architect - SOC

7 years

0 Lacs

Posted: 3 days ago | Platform: SimplyHired

Work Mode

On-site

Job Type

Full Time

Job Description

Chennai, Tamil Nadu, India


Department: Information Systems Security
Job posted on: Dec 02, 2025
Employment type: Full Time Employee

A Senior Site Reliability Engineer (SRE) applies advanced software engineering practices to ensure the reliability, scalability, performance, and operational efficiency of large-scale production systems. This role typically involves a high degree of ownership and strategic thinking, along with developing automation scripts and disaster recovery (DR) strategies.

Design, implement, and maintain highly available, scalable, and resilient systems and infrastructure on cloud platforms (AWS, Azure). Requires 7 years of experience.

Define, configure, measure, and report on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to meet user-facing Service Level Agreements (SLAs).

Conduct performance analysis, capacity planning, and load testing to proactively identify and resolve bottlenecks.

Improve systems and tooling by designing and implementing new monitoring stacks, tracing tools, deployment pipelines (CI/CD), and other platform tooling.

Develop and maintain advanced automation frameworks and tools using high-level languages (Python, Go, Java) to eliminate manual operational tasks (toil).

Manage infrastructure through Infrastructure as Code (IaC) using tools like Terraform or Ansible.

Enhance and manage CI/CD pipelines to ensure fast, reliable, and secure deployments.

Lead major incident response efforts, minimizing mean time to detect (MTTD) and mean time to resolve (MTTR).

Drive blameless postmortems and Root Cause Analyses (RCAs), ensuring effective preventative and corrective actions are implemented.

Participate in an on-call rotation (with appropriate compensation) to respond to and resolve critical production issues.

Collaborate closely with development teams to influence system design and architecture for improved reliability, operability, and security from the outset.

Configure and manage the monitoring and observability stack (e.g., Prometheus, Grafana, ELK/Loki) to provide deep insights into system health (metrics, logs, traces).

Develop and configure error budgets and dashboards.

Develop scripts to reduce the manual, repetitive, tactical, and non-durable operational work (e.g., manually restarting servers, running shell scripts to fix a common issue) that SREs and Operations teams perform.

Skillset

Deep knowledge of Linux/Unix system administration, troubleshooting, and performance tuning; understanding of networking fundamentals (TCP/IP, DNS, load balancing).
Experience with major cloud platforms like AWS and Azure; expertise in Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Experience with Docker for containerizing applications and Kubernetes for managing, scaling, and deploying containerized workloads.
Proficiency with monitoring tools (Prometheus, Grafana), logging solutions (ELK Stack: Elasticsearch, Logstash, Kibana), and tracing.
Familiarity with Continuous Integration/Continuous Delivery (CI/CD) pipelines (e.g., Azure DevOps, Jenkins, GitLab CI) and using Git for version control.
