Pandas Interview Questions
Comprehensive pandas interview questions and answers for Python. Prepare for your next job interview with expert guidance.
Questions Overview
1. How do you handle missing data in a pandas DataFrame? (Basic)
2. What is the difference between `loc` and `iloc` in pandas? (Moderate)
3. How can you merge two DataFrames in pandas? (Moderate)
4. Explain the use of the `groupby` function in pandas. (Moderate)
5. How do you apply functions to DataFrame columns or rows? (Moderate)
6. What is the purpose of the `pivot_table` method in pandas? (Advanced)
7. How can you read and write different file formats using pandas? (Basic)
8. Explain how to perform data filtering and selection in pandas. (Moderate)
9. What are multi-indexes in pandas and how are they used? (Advanced)
10. How can you optimize performance when working with large pandas DataFrames? (Advanced)
1. How do you handle missing data in a pandas DataFrame?
Difficulty: Basic
Detect missing values with `df.isna()` or `df.notna()` (aliases: `df.isnull()` and `df.notnull()`), remove them with `df.dropna()`, or fill them with `df.fillna()`, passing either a constant or a computed statistic such as the column mean or median. `df.interpolate()` can also estimate missing values from neighboring data points.
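The methods named above can be sketched on a small frame. The column names and values here are made up for illustration, and the fill choices (column mean, the string "unknown") are just one reasonable option.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "temp": [20.0, np.nan, 24.0, np.nan, 30.0],
    "city": ["Oslo", "Rome", None, "Lima", "Cairo"],
})

# Detect: count missing values per column.
missing_per_column = df.isna().sum()

# Remove: keep only rows with no missing values.
dropped = df.dropna()

# Fill: use a per-column strategy (mean for numbers, a placeholder for text).
filled = df.fillna({"temp": df["temp"].mean(), "city": "unknown"})

# Estimate: linear interpolation between neighboring values (the default method).
interpolated = df["temp"].interpolate()
```

Which strategy fits depends on the data: dropping rows loses information, while filling with a mean can hide real variation.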
2. What is the difference between `loc` and `iloc` in pandas?
Difficulty: Moderate
`loc` is label-based: it selects rows and columns by label, label slice, or boolean array. `iloc` is position-based: it selects by integer position, counting from 0. For example, `df.loc[2, 'A']` selects the value in the row labeled 2 and column 'A', while `df.iloc[2, 0]` selects the value in the third row and first column. Slices also differ: `loc` includes both endpoints, whereas `iloc` excludes the stop position, like standard Python slicing.
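The difference is easiest to see when integer labels do not match positions. The sketch below uses a hypothetical frame whose index is deliberately reversed.

```python
import pandas as pd

# Hypothetical frame: integer labels run 3, 2, 1, 0, so labels != positions.
df = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=[3, 2, 1, 0],
)

by_label = df.loc[2, "A"]          # row LABELED 2, which is the second row
by_position = df.iloc[2, 0]        # THIRD row, first column

label_slice = df.loc[3:1, "A"]     # label slice: both endpoints included
position_slice = df.iloc[0:2, 0]   # position slice: stop position excluded
```

Here `by_label` and `by_position` return different values even though both use the number 2.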
3. How can you merge two DataFrames in pandas?
Difficulty: Moderate
You can merge two DataFrames with `pd.merge()`, specifying the key columns to join on (`on`, or `left_on` and `right_on`) and the type of join with `how` (e.g., inner, outer, left, right). Alternatively, `df1.join(df2)` joins on the index by default, and `pd.concat([df1, df2])` stacks DataFrames along rows (`axis=0`) or columns (`axis=1`) rather than matching on keys.
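A minimal sketch of the three approaches, using two invented tables (`customers` and `orders`) that share a `customer_id` key:

```python
import pandas as pd

# Hypothetical tables; customer 4 has an order but no customer record.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 4], "amount": [50, 25, 90]})

# Key-based joins: `how` controls which unmatched rows survive.
inner = pd.merge(customers, orders, on="customer_id", how="inner")
left = pd.merge(customers, orders, on="customer_id", how="left")
outer = pd.merge(customers, orders, on="customer_id", how="outer", indicator=True)

# join() aligns on the index, so the key is moved into the index first.
totals = orders.groupby("customer_id")["amount"].sum().to_frame("total")
joined = customers.set_index("customer_id").join(totals)

# concat() stacks frames; it does not match rows on a key.
stacked = pd.concat([customers, customers], ignore_index=True)
```

The `indicator=True` option adds a `_merge` column showing whether each row came from the left table, the right table, or both, which is handy when debugging unexpected row counts.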
4. Explain the use of the `groupby` function in pandas.
Difficulty: Moderate
The `groupby` method follows a split-apply-combine pattern: it splits a DataFrame into groups based on one or more keys, applies a function to each group independently, and combines the results. It is commonly used for aggregation (one value per group, such as group-wise statistics), transformation (a result aligned with the original rows), and filtration (keeping or dropping whole groups).
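The three kinds of operation can be sketched on a hypothetical `sales` table; the 200 threshold in the filter is arbitrary.

```python
import pandas as pd

# Hypothetical sales data.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "revenue": [100, 300, 50, 70, 60],
})
grouped = sales.groupby("region")["revenue"]

# Aggregation: one value (or one row of statistics) per group.
totals = grouped.sum()
summary = grouped.agg(["mean", "max"])

# Transformation: result has the same length and index as the input.
share = sales["revenue"] / grouped.transform("sum")

# Filtration: keep only the rows of groups that pass a test.
big_regions = sales.groupby("region").filter(lambda g: g["revenue"].sum() > 200)
```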
5. How do you apply functions to DataFrame columns or rows?
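Difficulty: Moderate
Use `df.apply()` with `axis=0` (the default) to call a function once per column, or `axis=1` to call it once per row. For element-wise operations, use `Series.map()` or `DataFrame.map()` (named `df.applymap()` before pandas 2.1, where that name was deprecated). All of these run Python-level code on each call, so prefer vectorized operations on whole columns when efficiency matters.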
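A short sketch comparing the options on an invented `width`/`height` frame. The element-wise step uses `Series.map()` so that it runs on both older and newer pandas versions.

```python
import pandas as pd

# Hypothetical measurements.
df = pd.DataFrame({"width": [2, 3, 4], "height": [5, 6, 7]})

# axis=0: the function receives each column as a Series.
col_ranges = df.apply(lambda col: col.max() - col.min(), axis=0)

# axis=1: the function receives each row as a Series.
row_areas = df.apply(lambda row: row["width"] * row["height"], axis=1)

# Vectorized equivalent of the row-wise apply, without a Python-level loop.
vectorized_areas = df["width"] * df["height"]

# Element-wise function on a single column.
labels = df["width"].map(lambda w: "wide" if w >= 3 else "narrow")
```

`row_areas` and `vectorized_areas` hold the same values; on large frames the vectorized form is usually much faster.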
6. What is the purpose of the `pivot_table` method in pandas?
Difficulty: Advanced
The `pivot_table` method creates a spreadsheet-style pivot table: it summarizes data by the specified `index`, `columns`, and `values`, combining rows that share the same index/column pair with an aggregation function (`aggfunc`, the mean by default). Unlike `pivot`, which raises an error on duplicate index/column pairs, `pivot_table` aggregates them.
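As an illustration, the sketch below pivots a hypothetical `sales` table by region and product. The `fill_value` and `margins` arguments are optional extras, not requirements.

```python
import pandas as pd

# Hypothetical sales data; ("north", "a") appears twice and gets aggregated.
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south", "north"],
    "product": ["a", "b", "a", "a", "a"],
    "revenue": [100, 200, 50, 70, 30],
})

# Rows = region, columns = product, cells = summed revenue (0 where no data).
table = pd.pivot_table(
    sales, index="region", columns="product",
    values="revenue", aggfunc="sum", fill_value=0,
)

# margins=True adds an "All" row and column holding the totals.
with_totals = pd.pivot_table(
    sales, index="region", columns="product",
    values="revenue", aggfunc="sum", fill_value=0, margins=True,
)
```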
7. How can you read and write different file formats using pandas?
Difficulty: Basic
Pandas provides reader functions such as `pd.read_csv()`, `pd.read_excel()`, `pd.read_json()`, `pd.read_sql()`, and `pd.read_html()` to load data from files, databases, and HTML tables. The matching writer methods, such as `df.to_csv()`, `df.to_excel()`, `df.to_json()`, and `df.to_sql()`, export a DataFrame to those formats.
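A round-trip sketch for three of these formats. To stay self-contained it writes to in-memory buffers and an in-memory SQLite database instead of real files; the table name `people` is invented. Excel is left out because it needs an extra engine package.

```python
import io
import sqlite3

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})

# CSV: with no path, to_csv() returns the text instead of writing a file.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one object per row.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

# SQL: write to and read from an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
df.to_sql("people", conn, index=False)
from_sql = pd.read_sql("SELECT * FROM people", conn)
conn.close()
```

With real files, pass a path such as `"data.csv"` in place of the buffer.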
8. Explain how to perform data filtering and selection in pandas.
Difficulty: Moderate
Data filtering and selection can be done with boolean indexing, the `query()` method, and `loc` or `iloc` for label-based or position-based selection. For example, `df[df['age'] > 30]` keeps the rows where the 'age' column is greater than 30.
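The same filters can be written several ways. The sketch below uses a hypothetical `people` frame; note that combined conditions need `&` or `|` with parentheses, not `and` or `or`.

```python
import pandas as pd

# Hypothetical data.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di"],
    "age": [25, 35, 45, 31],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# Boolean indexing with one condition.
over_30 = people[people["age"] > 30]

# Combined conditions: element-wise & with parentheses around each test.
oslo_over_30 = people[(people["age"] > 30) & (people["city"] == "Oslo")]

# The same filter expressed as a query string.
same_with_query = people.query("age > 30 and city == 'Oslo'")

# loc: filter rows and pick a column in one step.
names = people.loc[people["city"].isin(["Oslo", "Lima"]), "name"]
```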
9. What are multi-indexes in pandas and how are they used?
Difficulty: Advanced
Multi-indexes (`MultiIndex`) give a DataFrame multiple levels of indexing on its rows, its columns, or both. They represent hierarchical data in a two-dimensional structure and make it easier to group, reshape, and select subsets of data based on multiple keys.
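A small sketch of a two-level row index and the operations mentioned above. The `region` and `year` levels and their values are illustrative only.

```python
import pandas as pd

# Hypothetical two-level index: region (outer) and year (inner).
index = pd.MultiIndex.from_tuples(
    [("north", 2023), ("north", 2024), ("south", 2023), ("south", 2024)],
    names=["region", "year"],
)
revenue = pd.DataFrame({"revenue": [100, 120, 80, 90]}, index=index)

# Select everything under one outer key; the result is indexed by year.
north = revenue.loc["north"]

# Select a single cell with a tuple of keys.
one_cell = revenue.loc[("south", 2024), "revenue"]

# Cross-section on the inner level.
year_2023 = revenue.xs(2023, level="year")

# Group by one level of the index.
by_region = revenue.groupby(level="region").sum()

# Reshape: move the year level from the rows to the columns.
wide = revenue["revenue"].unstack("year")
```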
10. How can you optimize performance when working with large pandas DataFrames?
Difficulty: Advanced
Performance can be improved by using memory-efficient data types (e.g., `category` for repeated strings, smaller numeric types), avoiding unnecessary copies, replacing Python loops with vectorized operations and built-in pandas functions, reading large files in chunks, and setting an index on columns used for frequent lookups. For data that does not fit in memory, parallel processing or a library such as Dask can help.
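A few of these techniques in a minimal sketch. The data is synthetic, the chunk size is arbitrary, and the CSV lives in an in-memory buffer as a stand-in for a large file on disk.

```python
import io

import pandas as pd

# Synthetic data: a repetitive text column and an integer column.
n = 10_000
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Lima", "Oslo"] * (n // 4),
    "value": range(n),
})

# Efficient dtypes: categories store each distinct string only once.
before_bytes = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
after_bytes = df["city"].memory_usage(deep=True)

# Downcast integers to the smallest type that holds the values.
df["value"] = pd.to_numeric(df["value"], downcast="integer")

# Vectorized arithmetic instead of a Python loop over rows.
df["doubled"] = df["value"] * 2

# Chunked reading: process a large CSV piece by piece.
csv_buffer = io.StringIO(df.to_csv(index=False))
total = 0
for chunk in pd.read_csv(csv_buffer, chunksize=2_000):
    total += int(chunk["value"].sum())
```

Measuring with `memory_usage(deep=True)` before and after a change is a simple way to confirm that a dtype conversion actually helps.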