Posted: 1 day ago | Platform: LinkedIn

Work Mode

On-site

Job Type

Full Time

Job Description

The Global Data Insight & Analytics organization is looking for a Principal Software Engineer to build, and drive the strategy for, our internal Data Science and AI/ML products and platform. You will work in a small, cross-functional team, collaborating directly and continuously with other engineers, business partners, product managers, and designers, and releasing early and often. The team is focused on building the Mach1ML platform, an AI/ML enablement platform that democratizes machine learning across the Ford enterprise.

Responsibilities

  • Engineer for Scale: Design and build our core Generative AI and ML products and platform end to end. You will own the technical blueprints for highly available, scalable, and modular systems, ensuring our infrastructure can handle petabyte-scale data and millions of API requests. You will design a decoupled system of microservices (using FastAPI, gRPC) and event-driven workflows (using Google Pub/Sub, Kafka) that allows teams to self-serve jobs, manage model lifecycles, and deploy inference endpoints with minimal friction.
  • Lead Hands-On Development: Be the lead developer and subject matter expert for our most complex technical challenges. You will write production-grade Python code for both backend services and frontend tooling, creating a seamless experience for our users. This includes developing robust APIs, data processing pipelines, and core platform components.
  • Productionize Cutting-Edge AI: Go beyond the notebook. You will take advanced AI/ML models (including LLMs, RAG systems, and agentic workflows) and productionize them in a robust, repeatable, and monitored fashion, so that our AI is not just smart but also reliable and performant. This includes:
    • Serving: Containerizing models with Docker and deploying them on Google Kubernetes Engine (GKE) using sophisticated serving frameworks like KServe or NVIDIA Triton Inference Server, configured for auto-scaling on GPU node pools (H100s).
    • Optimization: Implementing advanced model optimization techniques like quantization, pruning, and knowledge distillation to reduce latency and cost for LLM inference.
    • Monitoring & Reliability: Building a comprehensive monitoring stack using Prometheus, Grafana, and OpenTelemetry. You will implement observability for everything: GPU utilization, inference latency, token costs, and model-specific metrics like data drift and output quality using tools like LangSmith or Arize AI.
    • CI/CD for Models: Designing and implementing robust CI/CD (Continuous Integration/Continuous Delivery) pipelines using Cloud Build or Jenkins/GitLab CI, enabling automated RAG/agentic evaluation, A/B testing, and canary deployments.
  • Master the Deployment Ecosystem: Implement and manage the deployment of applications and services on our hybrid infrastructure, leveraging the best of Google Cloud Platform (GCP) or another cloud platform (including GKE, Vertex AI, and BigQuery) and our on-premises High-Performance Computing (HPC) clusters for large-scale model training and inference. You will use Terraform to provision and manage a sophisticated infrastructure that seamlessly blends
  • Champion Technical Excellence & Mentorship: As our most senior technical expert, you will lead through influence. Mentor other engineers, conduct rigorous design and code reviews, and establish the patterns and best practices that define our engineering culture. You will be the go-to person for our hardest technical problems.
  • Amplify Productivity with GenAI: Actively use and pioneer Generative AI productivity tools (e.g., GitHub Copilot, internal code generation models, automated testing agents) to accelerate our development lifecycle and foster a culture of hyper-efficiency across the enterprise. You will also build the developer experience: a clean, well-documented Python SDK and a set of REST/gRPC APIs that become the "paved road" for data scientists and application developers to interact with the platform.
  • Software Engineering with Agentic AI: You will design and build sophisticated multi-agent systems from the ground up to automate complex segments of our SDLC. Your primary toolkit will include state-of-the-art frameworks like LangGraph for building stateful, cyclical agentic architectures, alongside CrewAI for multi-agent collaboration. Your work will involve:
    • Implementing Advanced Agentic Patterns: You will move beyond simple ReAct loops to design complex systems involving dynamic planning, self-reflection, and hierarchical tool use. You will build a robust library of custom tools (e.g., functions to interact with our codebase, databases, and internal APIs) that agents can intelligently select and execute.
    • Building Goal-Oriented AI Agents: You will design, build, and deploy autonomous systems such as:
      • A "DevOps Agent" that can autonomously diagnose production alerts by querying logs via the Splunk/Datadog API, inspect infrastructure state using the GCP API, and execute remediation plans like rolling back a deployment on GKE.
      • A "Code Generation & Refactoring Agent" that takes a Jira ticket as input, writes the initial Python implementation using FastAPI, generates corresponding unit tests with Pytest, runs the tests, and then iteratively refactors the code based on feedback from static analysis tools.

Qualifications

  • Experience: 10+ years of professional software engineering experience, with a proven track record of designing, building, and operating large-scale, distributed systems in a production environment.
  • Technical Leadership: Demonstrated experience as a technical lead, principal engineer, or staff engineer, where you were responsible for the architectural direction of a team or major project and mentored fellow engineers.
  • Expert-Level Python: Deep, authoritative knowledge of Python and its ecosystem, including the nuances of the language. You write clean, performant, and testable code, and you have extensive experience building high-performance backend services and data-intensive applications with modern backend frameworks (FastAPI, gRPC, Pydantic), testing tools (Pytest), and high-performance data libraries (Pandas 2.0+).
  • Architectural Depth: You are a systems thinker who deeply understands architectural trade-offs. You have a proven history of architecting for resilience, scalability, and maintainability, and of designing modular, decoupled systems (e.g., microservices, event-driven architecture). You are a vocal advocate for best practices in API design, data modeling, and clean code, and you think in terms of APIs, data contracts, and long-term maintainability.
  • Production AI/ML Expertise: Proven, hands-on experience in productionizing machine learning systems (MLOps). You have deep familiarity with the challenges of deploying and monitoring ML models, especially Large Language Models (LLMs), and expertise across the MLOps lifecycle, including hands-on experience with tools like MLflow, Kubeflow, and KServe/KFServing.
  • Cloud & Infrastructure Proficiency: Extensive experience with cloud platforms, preferably Google Cloud Platform (GCP), though experience with another cloud platform is also acceptable. You are an expert in containerization (Docker, Kubernetes/GKE) and infrastructure-as-code (Terraform). Experience with HPC environments (e.g., Slurm, MPI) is a significant plus.

Education

A Bachelor's or Master's degree in Computer Science, Engineering, or a related field, or equivalent industry experience.
