Key Responsibilities: System Reliability & Performance: Design, implement, and maintain highly available, scalable, and resilient systems on Azure. Proactively monitor system health, performance, and availability using Azure Monitor, Application Insights, Log Analytics, and other monitoring tools (e.g., Grafana, Prometheus, Splunk). Define, track, and report on Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to ensure adherence to service availability and performance targets. Conduct root cause analysis (RCA) for incidents and implement preventive measures to avoid recurrence. Participate in on-call rotation to provide 24/7 support for production systems, diagnosing and resolving critical issues promptly. Automation & Infrastructure as Code (IaC): Develop and maintain automation scripts and tools using PowerShell, Python, Bash, or Go to automate repetitive tasks, deployments, and infrastructure provisioning. Implement and manage infrastructure using IaC principles with tools like Terraform or Azure Bicep. Contribute to the design and implementation of robust CI/CD pipelines using Azure DevOps, GitHub Actions, or similar tools to ensure efficient and reliable application deployments. Azure Ecosystem Management: Hands-on experience deploying, configuring, and managing a wide range of Azure services, including: Compute: Azure Virtual Machines, Azure Kubernetes Service (AKS), Azure Functions, Azure App Service Networking: Azure Virtual Networks, Load Balancers, Azure Front Door, DNS Storage: Azure Storage Accounts (Blob, File, Queue, Table), Azure SQL Database, Azure Cosmos DB Monitoring & Logging: Azure Monitor, Application Insights, Log Analytics, Kusto Query Language (KQL) Security: Azure Active Directory (AAD), Azure Security Center, Azure Policy, Key Vault, Network Security Groups (NSGs) Optimize Azure resource utilization for cost efficiency and performance. Collaboration & Best Practices: Collaborate closely with development teams (DevOps culture) to integrate reliability practices into the software development lifecycle ("shift-left"). Promote and implement SRE best practices, including error budgets, blameless post-mortems, and continuous improvement. Contribute to documentation of system architecture, operational procedures, and troubleshooting guides. Stay up-to-date with emerging Azure technologies and SRE trends, proposing and adopting relevant innovations. Required Skills & Qualifications: Bachelor's degree in Computer Science, Information Technology, or a related field, or equivalent practical experience. 3-5 years of hands-on experience in a Site Reliability Engineering, DevOps, or similar role with a strong focus on Microsoft Azure. Proficiency in at least one scripting or programming language (e.g., Python, PowerShell, Go, Bash). Solid understanding of Infrastructure as Code (IaC) principles and experience with tools like Terraform or Azure Bicep. Demonstrated experience with CI/CD pipelines (Azure DevOps preferred). Strong experience with Azure monitoring and logging solutions (Azure Monitor, Application Insights, Log Analytics, KQL). Experience with containerization and orchestration technologies, particularly Azure Kubernetes Service (AKS). Good understanding of networking concepts (TCP/IP, DNS, Load Balancing). Familiarity with database systems (SQL and NoSQL). Strong problem-solving, analytical, and troubleshooting skills. Excellent communication and collaboration skills, with the ability to work effectively in a team environment. Ability to work independently and manage multiple priorities in a fast-paced environment. Preferred Skills & Certifications: Microsoft Certified: Azure Administrator Associate (AZ-104) Microsoft Certified: Azure DevOps Engineer Expert (AZ-400) Certified Kubernetes Administrator (CKA) Experience with other monitoring tools like Grafana, Prometheus, Splunk, Datadog. Familiarity with security best practices in cloud environments. Experience with Git and version control systems.