About the role
ThoughtSpot is seeking an experienced Staff Engineer to lead the architecture and evolution of our Cloud Infrastructure and Observability control plane. You will design a multi-cloud control plane (AWS, GCP, Azure) that powers our Business Intelligence (BI) application, ensuring it is resilient, cost-efficient, and deeply observable. This role is ideal for a distributed systems expert who wants to solve complex challenges like Multi-Cloud Disaster Recovery, AI-Driven Operations, and FinOps-as-Code, while enabling engineering velocity through self-service platforms.
What you will do
Architect the Next-Gen Observability Stack
-
Build the "Single Pane of Glass": Design and operationalize a cutting-edge observability pipeline (Logs, Metrics, Traces) using Prometheus, ELK/EFK, Kafka, and OpenTelemetry.
-
AI-Powered Operations: Lead the development of a customer-facing Operations Portal that incorporates AI agents and analytics to provide real-time health insights, automated root cause analysis, and QoS visibility to our customers.
-
No-Touch Operations: Drive the platform toward "no-touch/low-touch" operations by implementing self-healing mechanisms and symptom-based alerting.
-
Control Plane Engineering: Architect scalable microservices that orchestrate tenancy, feature flags, and configuration across AWS, GCP, and Azure.
Multi-Cloud & Hybrid Cloud Strategy
-
Drive the architecture and implementation of multi-cloud disaster recovery (DR) frameworks for both multi-tenant and single-tenant SaaS offerings.
-
Create SDLC frameworks that allow for seamless deployment across multiple clouds without requiring redundant testing.
-
Develop an app modernisation framework to migrate applications from legacy infrastructure to modern Kubernetes-based platforms.
Automation & Infrastructure as a Service
-
Implement Infrastructure-as-Code (IaC) solutions using tools such as Terraform, Ansible, and CloudFormation to automate provisioning and deployments.
-
Provide automation and tools for both customer workflows and internal software development lifecycle (SDLC) processes.
-
Integrate open-source technologies and custom-developed modules to build a state-of-the-art infrastructure stack.
-
Customer-Obsessed Engineering: Ensure our observability isn't just watching servers, but watching the Customer Experience. You will instrument key user journeys (Login, Search, Checkout) to detect customer pain before customers file a ticket.
Leadership & Collaboration
-
Provide technical leadership to a team of developers by conducting architecture and code reviews and sharing best practices in cloud-native software development.
-
Lead cross-functional collaborations to ensure infrastructure is built for scalability, performance, and security.
-
Mentor and develop team members, driving a culture of technical excellence.
What you will bring
-
Experience: 10+ years of engineering experience, with at least 5 years in a Staff/Principal role scaling enterprise SaaS platforms.
-
Cloud Native Mastery: Deep hands-on expertise with Kubernetes and Docker. You have built and operated large-scale infrastructure on AWS, GCP, or Azure.
-
Coding Proficiency: Expert-level skills in Go (Golang) (preferred), Java, or Python. You can write production-grade microservices and K8s operators.
-
Observability Deep Dive: You understand the internals of monitoring frameworks. You have scaled Prometheus federation, tuned Elasticsearch/Kafka for massive log ingestion, and implemented distributed tracing.
-
IaC Expert: You treat infrastructure as software. Advanced proficiency with Terraform and Ansible is required.
-
Distributed Systems Knowledge: You have a strong grasp of CAP theorem, consensus algorithms (Raft/Paxos), distributed storage, and networking fundamentals.
-
Strategic Thinking: Experience building "Single Pane of Glass" solutions and managing the trade-offs between speed, cost, and reliability in a multi-cloud environment.
Hybrid Work at ThoughtSpot
This office-assigned role is available as a hybrid position. Spotters assigned to an office are encouraged to experience the energy of their local office with an in-office requirement of at least three days per week. This approach balances the benefits of in-person collaboration and peer learning with the flexibility needed by individuals and teams.