Cloud Ops & Monitoring Engineer
Job Title: Cloud Ops & Monitoring Engineer
Location: Bangalore
Department: Technology
Reporting To: Cloud Infra Director
Position Overview
Tookitaki is seeking a Cloud Ops & Monitoring Engineer to ensure the stability, performance, and security of our cloud-based infrastructure across all product offerings. This role is crucial in maintaining high availability, optimizing cloud operations, and proactively monitoring our cloud environments. The ideal candidate will have deep expertise in cloud platforms, automation, and observability tools to drive incident response, cost optimization, and operational efficiency.
Position Purpose
The Cloud Ops & Monitoring Engineer is responsible for monitoring, optimizing, and maintaining Tookitaki's cloud infrastructure. This role ensures high system reliability, proactive incident management, and efficient resource utilization. By leveraging automation and advanced monitoring tools, the engineer will drive operational excellence, minimize downtime, and enhance cloud security.
Key Responsibilities
Cloud Operations Management
- Monitor and manage cloud infrastructure (AWS, GCP, Azure) for performance, availability, and security.
- Ensure 99.99% uptime of mission-critical systems through proactive maintenance and incident resolution.
- Implement best practices for cloud governance, cost optimization, and capacity planning.
Monitoring & Incident Response
- Set up and maintain observability tools (Prometheus, Grafana, ELK stack, Datadog, New Relic).
- Develop real-time monitoring and alerting mechanisms to detect anomalies before they impact operations.
- Act as the first responder for production incidents, ensuring swift issue resolution and root cause analysis.
Automation & Infrastructure Optimization
- Develop and maintain Infrastructure as Code (IaC) scripts (Terraform, CloudFormation) for cloud automation.
- Automate cloud scaling, log management, and incident resolution workflows.
- Optimize cloud environments for performance, security, and cost efficiency.
Security & Compliance Enforcement
- Implement security best practices, including IAM policies, encryption, and vulnerability management.
- Work closely with security teams to detect and mitigate threats in cloud environments.
- Ensure compliance with global financial regulatory standards (GDPR, PCI-DSS, SOC 2).
Cross-Team Collaboration & Reporting
- Collaborate with DevOps, Security, and Development teams to enhance cloud performance.
- Provide operational insights and reports on cloud system health, trends, and optimization opportunities.
- Document incident reports, troubleshooting steps, and operational playbooks for continuous learning.
Key OKRs
- Maintain 99.99% system uptime by proactively monitoring and resolving cloud incidents.
- Reduce cloud operational costs by 20% through optimization and automation.
- Automate 80% of cloud monitoring and alerting processes within six months.
- Ensure 100% compliance with cloud security policies and regulatory standards.
- Reduce MTTR (Mean Time to Resolution) by 30% for critical incidents.
Qualifications and Skills
Education
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- Certifications in AWS, Azure, Google Cloud, or Kubernetes (preferred).
Experience
- 5+ years of experience in cloud operations, monitoring, or DevOps roles.
- Proven experience in managing highly available, production-grade cloud environments.
Technical Expertise
- Proficiency in AWS, GCP, or Azure cloud services.
- Strong hands-on experience with monitoring tools (Prometheus, Grafana, ELK, Datadog, New Relic).
- Expertise in Infrastructure as Code (IaC) tools (Terraform, CloudFormation).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Knowledge of cloud security, IAM policies, encryption, and threat detection.
- Familiarity with CI/CD pipelines, scripting (Python, Bash), and cloud automation.
Soft Skills
- Analytical mindset with strong troubleshooting and problem-solving abilities.
- Excellent communication skills to work cross-functionally with multiple teams.
- Proactive and detail-oriented, with a focus on continuous improvement.
- Ability to work in a fast-paced, dynamic environment with tight deadlines.
Key Competencies
- Cloud Monitoring & Performance Optimization: Ensures system health and efficiency through real-time observability.
- Incident Management & Troubleshooting: Rapidly diagnoses and resolves production issues with minimal downtime.
- Automation & Infrastructure Management: Implements self-healing and scalable cloud solutions.
- Security & Compliance Awareness: Ensures adherence to regulatory standards and cloud security best practices.
- Cross-Functional Collaboration: Works closely with engineering, security, and DevOps teams to enhance cloud operations.
Success Metrics
- Maintain 99.99% system uptime, ensuring minimal service disruption.
- Reduce MTTR (Mean Time to Resolution) for critical incidents by 30%.
- Automate 80% of cloud monitoring and incident response workflows.
- Optimize cloud resource utilization, achieving a 20% cost reduction.
- Implement a fully operational cloud observability framework within six months.
Benefits
- Competitive Salary: Aligned with industry standards and experience.
- Professional Development: Access to training in big data, cloud computing, and data integration tools.
- Comprehensive Benefits: Health insurance and flexible working options.
- Growth Opportunities: Career progression within Tookitaki's rapidly expanding Services Delivery team.