We are seeking a skilled and motivated Data Engineer to join our data team, with a strong focus on building and managing the data ingestion layer of our Databricks Lakehouse Platform. You will be responsible for creating reliable, scalable, and automated pipelines to pull data from a wide variety of sources, including third-party APIs and analytics platforms such as Google Analytics 4 (GA4), streaming platforms, relational databases, and file-based systems, ensuring it lands accurately and efficiently in our Bronze layer.
This role requires hands-on expertise in Python (PySpark), SQL, and modern ingestion tools like Databricks Auto Loader and Structured Streaming. You will be the expert on connecting to new data sources, ensuring our data lakehouse has the raw data it needs to power analytics and business insights across the organization.
What you'll be doing:
- Design, build, and maintain robust data ingestion pipelines to collect data from diverse sources such as APIs, streaming sources (e.g., Kafka, Event Hubs), relational databases (via JDBC), and cloud storage.
- Heavily utilize Databricks Auto Loader and COPY INTO for the efficient, incremental, and scalable ingestion of files into Delta Lake (see the ingestion sketch after this list).
- Develop and manage Databricks Structured Streaming jobs to process near-real-time data feeds.
- Ensure the reliability, integrity, and freshness of the Bronze layer in our Medallion Architecture, which serves as the single source of truth for all raw data.
- Perform initial data cleansing, validation, and structuring to prepare data for further transformation in the Silver layer.
- Monitor, troubleshoot, and optimize ingestion pipelines for performance, cost, and stability.
- Develop Python scripts and applications to automate data extraction and integration processes.
- Work closely with platform architects and other data engineers to implement best practices for data ingestion and management.
- Document data sources, ingestion patterns, and pipeline configurations.
- Conform to agile development practices, including version control (Git), CI/CD, and automated testing.
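For illustration, here is a minimal sketch of the kind of Auto Loader ingestion pipeline this role owns. The landing path, checkpoint location, and `bronze.events_raw` table name are hypothetical, and the code assumes the ambient `spark` session available in a Databricks notebook or job:

```python
from pyspark.sql import functions as F

# Hypothetical locations and table name, for illustration only.
SOURCE_PATH = "abfss://landing@storageacct.dfs.core.windows.net/events/"
CHECKPOINT_PATH = "abfss://bronze@storageacct.dfs.core.windows.net/_checkpoints/events_raw"
TARGET_TABLE = "bronze.events_raw"

# Auto Loader (cloudFiles) incrementally discovers new files and records progress
# in the checkpoint, so each run only picks up data it has not processed yet.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT_PATH)  # schema inference + evolution
    .load(SOURCE_PATH)
)

# Light structuring only: add lineage columns, leave the payload untouched for Silver.
bronze_df = (
    raw_stream
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
)

(
    bronze_df.writeStream
    .option("checkpointLocation", CHECKPOINT_PATH)
    .trigger(availableNow=True)   # incremental batch-style run; remove for continuous streaming
    .toTable(TARGET_TABLE)        # Delta table in the Bronze layer
)
```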
What you'll need:
- Education: Bachelor's degree in Computer Science, Engineering, Mathematics, or a related technical field preferred.
- Experience: 4-6+ years of relevant experience in data engineering, with a strong focus on data ingestion and integration.
Core Skills:
- Databricks Platform Expertise:
- Data Ingestion Mastery: Deep, practical experience with Databricks Auto Loader, COPY INTO, and Structured Streaming.
- Apache Spark: Strong hands-on experience with Spark architecture, writing and optimizing PySpark and Spark SQL jobs for ingestion and basic transformation.
- Delta Lake: Solid understanding of Delta Lake as a reliable landing zone for raw data, including writing data to Delta tables and core concepts such as ACID transactions and schema enforcement.
- Core Engineering & Cloud Skills:
- Programming: 4+ years of strong, hands-on experience in Python, with an emphasis on PySpark and libraries for API interaction (e.g., requests); see the API ingestion sketch after this list.
- SQL: 4+ years of strong SQL experience for data validation and querying.
- Cloud Platforms: 3+ years working with a major cloud provider (Azure, AWS, or GCP), with specific knowledge of cloud storage (ADLS Gen2, S3), security, and messaging/streaming services.
- Diverse Data Sources: Proven experience ingesting data from a variety of sources (e.g., REST APIs, SFTP, relational databases, message queues).
- CI/CD & DevOps: Experience with version control (Git) and CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) for automating deployments.
- Data Modeling: Familiarity with data modeling concepts (e.g., star schema) to understand the downstream use of the data you are ingesting.
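As a further illustration of the Python and API side of the role, here is a minimal sketch of pulling records from a paginated REST endpoint with `requests` and landing them in a Bronze Delta table. The endpoint URL, secret scope, pagination scheme, and table name are all hypothetical assumptions, not a prescribed implementation:

```python
import json
import requests
from pyspark.sql import functions as F

# Hypothetical endpoint and table name, for illustration only.
API_URL = "https://api.example.com/v1/orders"
TARGET_TABLE = "bronze.orders_api_raw"

def fetch_pages(url: str, token: str, page_size: int = 500):
    """Yield raw JSON records from a paginated REST API (assumed page/per_page scheme)."""
    page = 1
    while True:
        resp = requests.get(
            url,
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "per_page": page_size},
            timeout=30,
        )
        resp.raise_for_status()
        records = resp.json()
        if not records:
            break
        yield from records
        page += 1

# Keep the payload as raw JSON strings in Bronze; parsing and typing happen in Silver.
token = dbutils.secrets.get(scope="ingestion", key="orders-api-token")  # assumes a Databricks secret scope
rows = [(json.dumps(rec),) for rec in fetch_pages(API_URL, token)]

(
    spark.createDataFrame(rows, "payload string")
    .withColumn("_ingested_at", F.current_timestamp())
    .write.mode("append")
    .saveAsTable(TARGET_TABLE)
)
```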
Tools & Technologies:
- Primary Data Platform: Databricks
- Cloud Platforms: Azure (Preferred), GCP, AWS
- Data Warehouses (Integration): Snowflake, Google BigQuery
- Orchestration: Databricks Workflows
- Version Control: Git/GitHub or similar repositories
- Infrastructure as Code (Bonus): Terraform