Posted: 2 weeks ago | Platform: Foundit

Skills Required: Linux/Unix
Work Mode: On-site
Job Type: Full Time

Job Description

Key Responsibilities:

PySpark Development:

  • Design, implement, and optimize PySpark solutions for large-scale data processing and analysis.
  • Develop data pipelines using Spark to handle data transformations, aggregations, and other complex operations efficiently.
  • Write and optimize Spark SQL queries for big data analytics and reporting.
  • Handle data extraction, transformation, and loading (ETL) processes from various sources into a unified data warehouse or data lake (a sketch of such an ETL job follows this list).
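
A minimal sketch of the kind of ETL job described above, combining the DataFrame API and Spark SQL. All paths, column names, and the target layout are illustrative assumptions, not part of this role's actual stack:

    # Illustrative sketch only; source path, schema, and target location are assumptions.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("orders-etl").getOrCreate()

    # Extract: read a raw CSV landing zone (hypothetical path)
    orders = spark.read.option("header", True).csv("s3://raw-bucket/orders/")

    # Transform: cast types, drop bad rows, aggregate per day
    daily = (
        orders
        .withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("amount").isNotNull())
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"),
             F.count("*").alias("order_count"))
    )

    # The same transformation expressed as a Spark SQL query
    orders.createOrReplaceTempView("orders")
    daily_sql = spark.sql("""
        SELECT order_date,
               SUM(CAST(amount AS DOUBLE)) AS total_amount,
               COUNT(*)                    AS order_count
        FROM orders
        WHERE amount IS NOT NULL
        GROUP BY order_date
    """)

    # Load: write to a partitioned Parquet dataset in the data lake (hypothetical)
    (daily.write.mode("overwrite")
          .partitionBy("order_date")
          .parquet("s3://lake-bucket/analytics/daily_orders/"))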

Data Pipeline Design & Optimization:

  • Build and maintain ETL pipelines using PySpark, ensuring high scalability and performance.
  • Implement batch and streaming processing to handle both real-time and historical data.
  • Optimize the performance of PySpark applications by applying best practices and techniques such as partitioning, caching, and broadcast joins (see the sketch after this list).
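
A brief sketch of the three optimization techniques named above. The table paths, the partition count of 200, and the column names are assumptions for illustration:

    # Illustrative sketch of partitioning, caching, and broadcast joins.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("pipeline-tuning").getOrCreate()

    events = spark.read.parquet("s3://lake-bucket/events/")        # large fact table
    countries = spark.read.parquet("s3://lake-bucket/countries/")  # small dimension

    # Partitioning: repartition by the join/aggregation key so work is spread evenly
    events = events.repartition(200, "country_code")

    # Caching: persist a DataFrame that several downstream actions will reuse
    events.cache()

    # Broadcast join: ship the small dimension table to every executor,
    # avoiding a full shuffle of the large fact table
    enriched = events.join(broadcast(countries), on="country_code", how="left")

    enriched.groupBy("country_name").count().show()
    events.unpersist()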

Data Storage & Management:

  • Work with large datasets and integrate them into storage solutions such as HDFS, S3, Azure Blob Storage, or Google Cloud Storage.
  • Ensure efficient data storage, access, and retrieval through Spark and columnar file formats (e.g., Parquet, ORC); see the sketch after this list.
  • Maintain data quality, consistency, and integrity throughout the pipeline lifecycle.
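
A short sketch of writing and reading columnar formats; bucket names and columns are placeholders. Parquet and ORC store data column-by-column, which compresses well and lets Spark scan only the columns and row groups a query actually needs:

    # Illustrative sketch only; paths and columns are assumptions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-formats").getOrCreate()

    df = spark.read.json("s3://raw-bucket/clickstream/")  # row-oriented source

    # Convert to columnar formats for efficient storage and retrieval
    df.write.mode("overwrite").parquet("s3://lake-bucket/clickstream_parquet/")
    df.write.mode("overwrite").orc("s3://lake-bucket/clickstream_orc/")

    # Reading back: only the selected columns are scanned, and the filter
    # can be pushed down to skip row groups entirely
    recent = (spark.read.parquet("s3://lake-bucket/clickstream_parquet/")
                   .select("user_id", "event_time")
                   .where("event_time >= '2024-01-01'"))
    recent.show(5)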

Cloud Platforms & Big Data Frameworks:

  • Deploy Spark-based applications on cloud platforms such as AWS (Amazon EMR), Azure HDInsight, or Google Dataproc (a deployment sketch follows this list).
  • Work with cloud-native services such as AWS Lambda, S3, Google Cloud Storage, and Azure Data Lake to handle and process big data.
  • Leverage cloud data processing tools and frameworks to scale and optimize PySpark jobs.
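
One common deployment pattern on AWS is to submit a PySpark script as a step on a running EMR cluster. A sketch using boto3 follows; the cluster ID, region, and script location are placeholders, and equivalent flows exist on Dataproc and HDInsight:

    # Illustrative sketch: submitting a PySpark script as an EMR step with boto3.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",  # ID of a running EMR cluster (placeholder)
        Steps=[{
            "Name": "daily-orders-etl",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",  # EMR's built-in command runner
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "s3://lake-bucket/jobs/daily_orders_etl.py",
                ],
            },
        }],
    )
    print("Submitted step:", response["StepIds"][0])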

Collaboration & Integration:

  • Collaborate with cross-functional teams (data scientists, analysts, product managers) to understand business requirements and develop appropriate data solutions.
  • Integrate data from multiple sources and platforms (e.g., databases, external APIs, flat files) into a unified system (see the sketch after this list).
  • Provide support for downstream applications and data consumers by ensuring timely and accurate delivery of data.
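
A sketch of that multi-source integration pattern; the JDBC connection details, file paths, and common schema are all assumptions chosen for illustration:

    # Illustrative sketch of unifying several sources into one dataset.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-source-integration").getOrCreate()

    # Relational database via JDBC (placeholder URL; use a secret store in practice)
    db_customers = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/crm")
        .option("dbtable", "customers")
        .option("user", "etl_user")
        .option("password", "***")
        .load())

    # Flat files landed in object storage
    csv_customers = spark.read.option("header", True).csv("s3://raw-bucket/customers/*.csv")

    # External API responses previously dumped as JSON files
    api_customers = spark.read.json("s3://raw-bucket/api_dumps/customers/")

    # Normalize to a common schema, union, and deduplicate on the business key
    common_cols = ["customer_id", "name", "email"]
    unified = (db_customers.select(*common_cols)
               .unionByName(csv_customers.select(*common_cols))
               .unionByName(api_customers.select(*common_cols))
               .dropDuplicates(["customer_id"]))

    unified.write.mode("overwrite").parquet("s3://lake-bucket/customers_unified/")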

Performance Tuning & Troubleshooting:

  • Identify bottlenecks and optimize Spark jobs to improve performance.
  • Conduct performance tuning of both the cluster and individual Spark jobs, leveraging Spark's built-in monitoring tools (see the sketch after this list).
  • Troubleshoot and resolve issues related to data processing, application failures, and cluster resource utilization.
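
A sketch of routine diagnostics for a slow job; the input path is a placeholder. The physical plan, the partition count, and the Spark UI are usually the first things to check:

    # Illustrative sketch of common Spark diagnostics.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("job-diagnostics")
             # Adaptive Query Execution lets Spark coalesce shuffle partitions
             # and mitigate skewed joins at runtime
             .config("spark.sql.adaptive.enabled", "true")
             .getOrCreate())

    df = spark.read.parquet("s3://lake-bucket/events/")

    # Inspect the physical plan for shuffles, scans, and join strategies
    df.groupBy("country_code").count().explain(mode="formatted")

    # Check partitioning: too few partitions underuses the cluster,
    # too many adds scheduling overhead
    print("partitions:", df.rdd.getNumPartitions())

    # Label the job so it is easy to find in the Spark UI (port 4040 by default)
    spark.sparkContext.setJobDescription("daily events rollup")
    df.groupBy("country_code").count().collect()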

Documentation & Reporting:

  • Maintain clear and comprehensive documentation of data pipelines, architectures, and processes.
  • Create technical documentation to guide future enhancements and troubleshooting.
  • Provide regular updates on the status of ongoing projects and data processing tasks.

Continuous Improvement:

  • Stay up to date with the latest trends, technologies, and best practices in big data processing and PySpark.
  • Contribute to improving development processes, testing strategies, and code quality.
  • Share knowledge and provide mentoring to junior team members on PySpark best practices.

Required Qualifications:

  • 2-4 years of professional experience working with PySpark and big data technologies.
  • Strong expertise in Python programming with a focus on data processing and manipulation.
  • Hands-on experience with Apache Spark, particularly with PySpark for distributed computing.
  • Proficiency in Spark SQL for data querying and transformation.
  • Familiarity with cloud platforms like AWS, Azure, or Google Cloud, and experience with cloud-native big data tools.
  • Knowledge of ETL processes and tools.
  • Experience with data storage technologies like HDFS, S3, or Google Cloud Storage.
  • Knowledge of data formats such as Parquet, ORC, Avro, or JSON.
  • Experience with distributed computing and cluster management.
  • Familiarity with Linux/Unix and command-line operations.
  • Strong problem-solving skills and ability to troubleshoot data processing issues.

Teamware Solutions
IT Services and IT Consulting
Chennai, Tamil Nadu
