Senior PySpark Developer

5 - 10 years

8 - 14 Lacs

Hyderabad

Posted:2 weeks ago| Platform: Naukri logo

Apply

Skills Required

Data Pipeline PySpark Hadoop Spark Distributed Computing Java Scala Cloud Big Data Databricks AWS

Work Mode

Work from Office

Job Type

Full Time

Job Description

Key Responsibilities : - Design and develop scalable PySpark pipelines to ingest, parse, and process XML datasets with extreme hierarchical complexity. - Implement efficient XPath expressions, recursive parsing techniques, and custom schema definitions to extract data from nested XML structures. - Optimize Spark jobs through partitioning, caching, and parallel processing to handle terabytes of XML data efficiently. - Transform raw hierarchical XML data into structured DataFrames for analytics, machine learning, and reporting use cases. - Collaborate with data architects and analysts to define data models for nested XML schemas. - Troubleshoot performance bottlenecks and ensure reliability in distributed environments (e.g., AWS, Databricks, Hadoop). - Document parsing logic, data lineage, and optimization strategies for maintainability. Qualifications : - 5+ years of hands-on experience with PySpark and Spark XML libraries (e.g., `spark-xml`) in production environments. - Proven track record of parsing XML data with 20+ levels of nesting using recursive methods and schema inference. - Expertise in XPath, XQuery, and DataFrame transformations (e.g., `explode`, `struct`, `selectExpr`) for hierarchical data. - Strong understanding of Spark optimization techniques: partitioning strategies, broadcast variables, and memory management. - Experience with distributed computing frameworks (e.g., Hadoop, YARN) and cloud platforms (AWS, Azure, GCP). - Familiarity with big data file formats (Parquet, Avro) and orchestration tools (Airflow, Luigi). - Bachelor's degree in Computer Science, Data Engineering, or a related field. Preferred Skills : - Experience with schema evolution and versioning for nested XML/JSON datasets. - Knowledge of Scala or Java for extending Spark XML libraries. - Exposure to Databricks, Delta Lake, or similar platforms. - Certifications in AWS/Azure big data technologies.

Mock Interview

Practice Video Interview with JobPe AI

Start Data Pipeline Interview Now
Victrix Systems And Labs
Victrix Systems And Labs

Defense Technology

Tech City

50 Employees

117 Jobs

    Key People

  • John Doe

    CEO
  • Jane Smith

    CTO

RecommendedJobs for You