Fynd is India’s largest omnichannel platform and a multi-platform tech company specializing in retail technology and in products spanning AI, ML, big data, image editing, and learning. It provides a unified platform for businesses to seamlessly manage online and offline sales, store operations, inventory, and customer engagement. Serving over 2,300 brands, Fynd is at the forefront of retail technology, transforming customer experiences and business processes across various industries.

Are you passionate about building ultra-reliable systems at scale? Join our team as a Site Reliability Engineer (SRE) and be the driving force behind our site’s performance and uptime. Embrace a culture of end-to-end ownership, collaboration, and engineering excellence. Site reliability engineering combines software engineering and systems engineering to ensure the scalability, performance, and reliability of large-scale systems, which is exactly what it takes to delight millions of online shoppers. In this role, you’ll blend both skill sets to keep our platform massively scalable, fault-tolerant, and lightning-fast. You’ll work from our Mumbai headquarters, taking ownership of product reliability from day one and working across teams to keep our services robust and customers happy.
What will you do at Fynd?
- Influence technical direction by evaluating change requests and participating in cross-team architectural discussions to uphold best practices and choose appropriate technologies.
- Lead incident response and root cause analysis to rapidly resolve issues and implement preventive measures, ensuring we never fail for the same reason twice.
- Identify bottlenecks in current processes and build or improve tooling to support incident management.
- Go on-call, respond to automated alerts, and execute playbooks.
- Continuously monitor and fine-tune our infrastructure using industry-standard observability tools, ensuring high performance even under heavy load.
- Conduct rigorous load tests for critical sales events and optimise system capacity to handle peak demand seamlessly.
- Own the availability and performance of key products, ensuring that each product's architecture, changes, incident response, and technology choices support its target availability and performance levels.
- Reduce noise in our monitoring signals to get a clearer picture of the platform and enable more effective debugging.
- Develop production tooling and services to improve our platform’s resilience.
Minimum Qualifications
- Bachelor's degree (B.E./B.Tech.) in Computer Science or a related technical field, or equivalent practical experience.
- 2+ years of experience in an SRE or DevOps role, preferably within the e-commerce sector.
- 2+ years of programming experience in languages such as Go, Python, or JavaScript, with a solid understanding of data structures and algorithms.
- Experience with containerisation technologies such as Docker and Kubernetes.
- Experience with cloud platforms like AWS, GCP, or Azure.
- Experience with monitoring and alerting tools such as Grafana, Prometheus, Sentry, PagerDuty, New Relic, AWS CloudWatch, etc.
- Proficiency in Unix/Linux shell environments.
Some Specific Requirements
- 3+ years of experience in an SRE or DevOps role, preferably within the e-commerce sector.
- 3+ years of experience managing production infrastructure. Prior experience leading or managing a team is a strong advantage.
- Experience with message queues like Kafka or RabbitMQ and a strong understanding of event-driven architectures.
- Experience with infrastructure-as-code and deployment tools such as Terraform, Pulumi, or AWS CloudFormation.
- Hands-on experience with configuration management systems such as Ansible, Chef, Puppet, or SaltStack.
- Understanding of load testing methodologies and tools such as Grafana k6, Gatling, Locust, Apache JMeter, etc.
What do we offer?
Growth
Growth knows no bounds, as we foster an environment that encourages creativity, embraces challenges, and cultivates a culture of continuous expansion. We are looking at new product lines, international markets, and brilliant people to grow even further. We teach, groom, and nurture our people to become leaders. You get to grow with a company that is growing exponentially.
- Flex University: We help you upskill by organising in-house courses on important subjects.
- Learning Wallet: You can also take an external course to upskill and grow, and we reimburse it for you.
Culture
- Community and team-building activities
- We host weekly, quarterly, and annual events and parties.
Wellness
- Mediclaim policy covering you, your parents, your spouse, and your kids
- An experienced therapist to support better mental health, improved productivity, and work-life balance

We work from the office 5 days a week to promote collaboration and teamwork. Join us to make an impact in an engaging, in-person environment!