We exist to wow our customers. We know we're doing the right thing when we hear our customers say, "How did we ever live without Coupang?" Born out of an obsession to make shopping, eating, and living easier than ever, we are collectively disrupting the multi-billion-dollar commerce industry from the ground up and establishing an unparalleled reputation for being a leading and reliable force in South Korean commerce.
We are proud to have the best of both worlds: a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the pace we have kept since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what's possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role Overview:
The ICT Reliability Engineering team is dedicated to maintaining the continuity and stability of Coupang's enterprise IT services. The team operates and continuously improves monitoring systems for both IT infrastructure and applications, ensuring high visibility and rapid incident detection. In the event of service disruptions, the team collaborates closely with engineering and operations teams to resolve issues efficiently and manage key performance metrics. Additionally, the team leads regular disaster recovery (DR) tests to validate system resilience and ensure business continuity.
Key Responsibilities:
- Identify operational inefficiencies and automation opportunities within monitoring workflows and infrastructure.
- Design and implement automated solutions for deployment, configuration, and scaling of monitoring tools using Infrastructure-as-Code (IaC) technologies such as Terraform, Ansible, Puppet, or similar.
- Leverage REST APIs of platforms like Zabbix, SolarWinds, Prometheus, and Grafana to streamline and standardize monitoring setup and management.
- Develop reusable automation assets (scripts, templates, and modules) to ensure consistent monitoring practices across diverse environments.
- Automate Grafana dashboard creation and management, including templating, data source integration, and role-based access control.
- Integrate monitoring systems with alerting, ticketing, and reporting platforms to enable seamless incident management and visibility.
- Establish tagging strategies and observability standards to ensure uniform data collection and traceability across services.
- Support incident response by building automated diagnostics and enriching telemetry data for faster root cause analysis.
- Collaborate cross-functionally with DevOps and SRE teams to align monitoring automation with CI/CD pipelines and operational goals.
Tech Skills:
- Scripting languages: Python, Bash, PowerShell, SSH
- Monitoring & Observability Tools: Grafana (including dashboard templating, provisioning, and API-based automation); Datadog or Dynatrace (as alternatives or complementary tools)
- Experience working with REST APIs for automation and integration
- Familiarity with JSON, YAML, and HTTP methods (GET, POST, PUT, etc.)
- Jenkins, GitLab CI, GitHub Actions, or similar
- Docker and Kubernetes (for containerized environments)
- ServiceNow, Jira, VictorOps, xMatters, or similar
- Knowledge of event correlation and automated diagnostics
- AWS, Azure, or Google Cloud Platform
- Cloud-native monitoring tools like CloudWatch, Azure Monitor, or GCP Operations Suite
Preferred Qualifications:
Soft Skills & Operational Mindset:
- Strong problem-solving and gap analysis capabilities
- Ability to identify low-hanging fruit for automation
- Experience in cross-functional collaboration (DevOps, SRE, IT Ops)
- Understanding of observability principles and tagging strategies
Coupang's hybrid work model is designed to enable a culture of collaboration that acts as a catalyst to enrich the employee experience. Employees are required to work at least 3 days in the office per week, with the flexibility to work from home 2 days a week, depending on the role requirement. Some businesses may require more time in the office due to the nature of the work.