The Senior Data Engineer will be responsible for the architecture, design, development, and maintenance of our data platforms, with a strong focus on leveraging Python and PySpark for data processing and transformation. This role requires a seasoned technical leader who can work both independently and as part of a team, contributing to the overall data strategy and helping to drive data-driven decision-making across the organization.
Key Responsibilities
- Data Architecture & Design: Design, develop, and optimize data architectures, pipelines, and data models to support various business needs, including analytics, reporting, and machine learning.
- ETL/ELT Development (Python/PySpark Focus): Build, test, and deploy highly scalable and efficient ETL/ELT processes in Python and PySpark to ingest, transform, and load data from diverse sources into data warehouses and data lakes, including developing and optimizing complex PySpark transformations.
- Data Quality & Governance: Implement best practices for data quality, data governance, and data security to ensure the integrity, reliability, and privacy of our data assets.
- Performance Optimization: Monitor, troubleshoot, and optimize data pipeline performance, ensuring data availability and timely delivery, particularly for PySpark jobs.
- Infrastructure Management: Collaborate with DevOps and MLOps teams to manage and optimize data infrastructure, including cloud resources (AWS, Azure, GCP), databases, and data processing frameworks, ensuring efficient operation of PySpark clusters.
- Mentorship & Leadership: Provide technical guidance, mentorship, and code reviews to junior data engineers, particularly in Python and PySpark best practices, fostering a culture of excellence and continuous improvement.
- Collaboration: Work closely with data scientists, analysts, product managers, and other stakeholders to understand data requirements and deliver solutions that meet business objectives.
- Innovation: Research and evaluate new data technologies, tools, and methodologies to enhance our data capabilities and stay ahead of industry trends.
- Documentation: Create and maintain comprehensive documentation for data pipelines, data models, and data infrastructure.
Qualifications
Education
- Bachelor's or Master's degree in Computer Science, Software Engineering, Data Science, or a related quantitative field.
Experience
- 5+ years of professional experience in data engineering, with a strong emphasis on building and maintaining large-scale data systems.
- Extensive hands-on experience with Python for data engineering tasks.
- Proven experience with PySpark for big data processing and transformation.
- Proven experience with cloud data platforms (e.g., AWS Redshift, S3, EMR, Glue; Azure Data Lake, Databricks, Synapse; Google BigQuery, Dataflow).
- Strong experience with SQL and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB, Cassandra).
- Extensive experience with distributed data processing frameworks, especially Apache Spark.
Technical Skills
- Programming Languages: Expert proficiency in Python is mandatory. Mastery of SQL is essential. Familiarity with Scala or Java is a plus.
- Big Data Technologies: In-depth knowledge and hands-on experience with Apache Spark (PySpark) for data processing, including Spark SQL, Spark Streaming, and the DataFrame API. Experience with Apache Kafka, Apache Airflow, Delta Lake, or similar technologies.
- Data Warehousing: In-depth knowledge of data warehousing concepts, dimensional modeling, and ETL/ELT processes.
- Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, GCP) and their data services, particularly those supporting Spark/PySpark workloads.
- Containerization: Familiarity with Docker and Kubernetes is a plus.
- Version Control: Proficient with Git and CI/CD pipelines.
Soft Skills
- Excellent problem-solving and analytical abilities.
- Strong communication and interpersonal skills, with the ability to explain complex technical concepts to non-technical stakeholders.
- Ability to work effectively in a fast-paced, agile environment.
- Proactive and self-motivated with a strong sense of ownership.
Preferred Qualifications
- Experience with real-time data streaming and processing using Spark Structured Streaming in PySpark.
- Knowledge of machine learning concepts and MLOps practices, especially integrating ML workflows with PySpark.
- Familiarity with data visualization tools (e.g., Tableau, Power BI).
- Contributions to open-source data projects.
Job Family Group: Technology
Job Family: Data Analytics
Time Type: Full time
Most Relevant Skills: Please see the requirements listed above.
Other Relevant Skills: For complementary skills, please see above and/or contact the recruiter.
Citi is an equal opportunity employer, and qualified candidates will receive consideration without regard to their race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other characteristic protected by law.
If you are a person with a disability and need a reasonable accommodation to use our search tools and/or apply for a career opportunity, review Accessibility at Citi.
View Citi’s EEO Policy Statement and the Know Your Rights poster.