Job Description
SN | Required Information | Details
1 | Role | SRE (Application and Infrastructure Monitoring)
2 | Required Technical Skill Set | Knowledge of Infrastructure
3 | No. of Requirements | 2
4 | Desired Experience Range | 4-6 years
5 | Location of Requirement | Pune, Indore, Kochi
Desired Competencies (Technical/Behavioral Competency)
Must-Have
- Detect incidents and proactively escalate them using the built-in dashboards.
- Hands-on experience using Dynatrace dashboards and creating customized dashboards.
- Hands-on experience using ServiceNow for analytics, reporting, knowledge management, CMDB, and ITOM modules (Event Management, Operator Workspace, etc.).
- Basic knowledge of other monitoring tools (SolarWinds, Nimsoft, SCOM, Redgate, etc.) would be an advantage.
- Basic understanding of application architecture and its infrastructure components, so that user impact can be understood.
- Troubleshooting/communication skills:
  - Hands-on experience troubleshooting basic issues in Windows and Unix
  - Able to write and understand basic command lines and scripts
  - Able to communicate effectively during incident and problem management calls
Good-to-Have
- Dynatrace Admin skills
- Knowledge of integration platforms such as MQ, APIC, etc.
- ServiceNow Fundamentals, Admin knowledge, or Developer certification
- Knowledge of data visualization tools (e.g., Power BI)
- Knowledge of IaC (Ansible or Terraform)
- Scripting: Python or PowerShell
- Basic understanding of AutoSys batches and their components
- Experience working with batch management (monitoring and incident resolution)
- Basic knowledge of AWS/Azure to support access management tasks.
- Understanding of a relational database (any one)
- Able to write and understand basic SQL queries.
- Understanding of DevOps practices and tools
Type
Details of The Role (For Candidate Briefing)
Reporting To Which Role | SRE Engineer
Size of the Team, if any Reporting to this Role | 6-8 Years
On-site Opportunity | NA
Unique Selling Proposition (USP) of The Role | Banking Domain
Details of The Project (a short briefing on the project may be attached to this document for candidate briefing; it may be shared with external stakeholders such as job agencies)
We are looking for a ServiceNow Developer to work closely with Product Owners, Solution Architects, and Analysts to refine epics and user stories, translate solution architecture into technical design and working software, and lead technical teams by setting high development standards and applying industry best practices.
Experience:
As mentioned above
Sample Questions:
- Monitoring & Observability Strategy
How would you design a comprehensive monitoring strategy for a large-scale distributed application?
Expected Response: Knowledge of key monitoring pillars (metrics, logs, traces), selecting the right tools (Prometheus, Grafana, Datadog, New Relic, ELK, Splunk), and defining SLIs, SLOs, and SLAs.
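For interviewers who want a concrete reference point on SLIs and SLOs, the relationship can be sketched roughly as below. This is only an illustrative Python sketch: the function names and the 99.9% target used in the usage note are hypothetical and not tied to any of the tools listed above.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the fraction of requests that succeeded."""
    if total_requests == 0:
        # Assumption for this sketch: no traffic counts as fully available.
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent, floored at 0.0.

    The error budget is the failure rate the SLO tolerates (1 - target).
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        # A 100% target leaves no budget to spend.
        return 0.0
    actual_failure = 1.0 - sli
    return max(0.0, 1.0 - actual_failure / allowed_failure)
```

For example, 9,995 successful requests out of 10,000 against a hypothetical 99.9% SLO target would leave about half of the error budget unspent.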
- Incident Response & Root Cause Analysis
An application is experiencing intermittent latency issues. How would you troubleshoot and identify the root cause?
Expected Response: Use of APM tools, log correlation, distributed tracing, dependency mapping, anomaly detection, and defining runbooks for incident response.
- Automation & Self-Healing Systems
How can you leverage automation to improve monitoring efficiency and reduce MTTR (Mean Time to Recovery)?
Expected Response: Experience with auto-remediation via Terraform/Ansible, self-healing scripts, predictive monitoring (AI/ML-based alerts), and implementing auto-scaling mechanisms.
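When discussing MTTR, it can help to agree on how the figure is computed. A minimal Python sketch follows, assuming each incident is recorded as a hypothetical (detected_at, resolved_at) pair; a real incident export (for example from an ITSM tool) would need its own field mapping.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average time from detection to resolution.

    `incidents` is a list of (detected_at, resolved_at) datetime pairs;
    this record shape is an assumption made for the sketch.
    """
    if not incidents:
        return timedelta(0)
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(0),
    )
    return total / len(incidents)
```

Tracking this value before and after an automation change gives a simple way to check whether the change actually shortened recovery.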
- Infrastructure & Cloud Monitoring
How do you monitor and optimize cloud infrastructure in an AWS/Azure/GCP environment?
Expected Response: Expertise in CloudWatch, Azure Monitor, GCP Operations Suite, cost monitoring, setting up synthetic monitoring, and best practices for alerting thresholds.
- Performance Optimization & Capacity Planning
How do you ensure high availability and optimal performance in a large-scale production environment?
Expected Response: Understanding of load balancing, caching strategies, capacity planning using historical trends, chaos engineering for resilience, and observability-driven scaling.
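As one possible illustration of capacity planning from historical trends, a candidate might fit a straight line to recent usage and estimate when capacity is reached. The Python sketch below assumes evenly spaced daily samples and linear growth; the function name and inputs are hypothetical, and real workloads often need seasonality-aware models instead.

```python
def days_until_capacity(daily_usage, capacity):
    """Estimate days from the latest sample until usage reaches capacity.

    Fits a least-squares line to `daily_usage` (one sample per day).
    Returns None when there are too few samples or usage is not growing.
    """
    n = len(daily_usage)
    if n < 2:
        return None
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_usage) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_usage)
    ) / denom
    if slope <= 0:
        return None  # flat or shrinking usage: no projected exhaustion
    intercept = mean_y - slope * mean_x
    day_full = (capacity - intercept) / slope
    # Days remaining, measured from the most recent sample (index n - 1).
    return max(0.0, day_full - (n - 1))
```

With usage samples of 10, 20, 30, and 40 units and a capacity of 100, the fitted trend reaches capacity about six days after the last sample.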