
25 Pandas Interview Questions and Answers

Prepare for your next data science interview with this guide on Pandas, featuring common questions and detailed answers to enhance your skills.

Pandas is a powerful and flexible open-source data manipulation and analysis library for Python. It provides data structures like DataFrames and Series, which are essential for handling structured data efficiently. Pandas is widely used in data science, finance, and various other fields that require robust data analysis capabilities.

This article offers a curated selection of interview questions designed to test your understanding and proficiency with Pandas. By working through these questions and their detailed answers, you will be better prepared to demonstrate your expertise and problem-solving skills in any technical interview setting.

Pandas Interview Questions and Answers

1. How do you create a DataFrame from a dictionary of lists?

In Pandas, a DataFrame can be created from a dictionary of lists using the pd.DataFrame() constructor. Each key in the dictionary represents a column name, and the corresponding list contains the data for that column.

Example:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

2. How would you read a CSV file into a DataFrame?

To read a CSV file into a DataFrame, use the read_csv function. This function allows you to specify various parameters to handle different CSV formats.

Example:

import pandas as pd

df = pd.read_csv('path/to/your/file.csv')
print(df.head())

The read_csv function supports parameters like delimiter, header, names, index_col, and dtype to customize the reading process.
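
As a rough sketch (the separator, header layout, and column names below are hypothetical), several of these parameters can be combined in one call:

import pandas as pd

df = pd.read_csv(
    'path/to/your/file.csv',
    delimiter=';',                      # non-default field separator
    header=None,                        # the file has no header row
    names=['id', 'region', 'amount'],   # supply column names manually
    index_col='id',                     # use the id column as the index
    dtype={'amount': 'float64'}         # force a dtype for one column
)
print(df.head())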

3. How do you select rows and columns using loc and iloc?

In Pandas, loc and iloc are used to select rows and columns from a DataFrame.

loc is label-based: you select rows and columns by their labels, and label slices include the end point.
iloc is integer position-based: you select rows and columns by their integer positions, and positional slices exclude the end point.

Example:

import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

loc_result = df.loc[0:1, ['A', 'B']]   # label slice 0:1 includes row 1
iloc_result = df.iloc[0:2, 0:2]        # positional slice 0:2 stops before row 2

4. How would you identify and handle missing data in a DataFrame?

Missing data can be identified using isnull() (True where a value is missing) or notnull() (True where a value is present), both of which return a boolean DataFrame of the same shape. Once identified, missing data can be removed with dropna() or filled with a specific value using fillna().

Example:

import pandas as pd

data = {
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, None, None, 4]
}
df = pd.DataFrame(data)

missing_data = df.isnull()
df_dropped = df.dropna()
df_filled = df.fillna(0)

5. How would you merge two DataFrames on a common column?

To merge two DataFrames on a common column, use the merge function. This function allows you to specify the column(s) on which to merge and the type of join to perform (inner, outer, left, or right).

Example:

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})

merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
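
Using the same df1 and df2, an outer join keeps the IDs that appear in only one frame and fills the missing side with NaN:

outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)  # IDs 1 and 2 have NaN Age; IDs 5 and 6 have NaN Name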

6. How would you sort a DataFrame by multiple columns?

To sort a DataFrame by multiple columns, use the sort_values method. This method allows you to specify the columns to sort by and the order for each column.

Example:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 23, 22, 21],
    'Score': [88, 92, 85, 90]
}

df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_df)

7. How would you perform string operations on a column in a DataFrame?

To perform string operations on a column, use the str accessor, which provides vectorized string functions.

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

df['Name_lower'] = df['Name'].str.lower()
df['Name_upper'] = df['Name'].str.upper()
df['Name_replaced'] = df['Name'].str.replace('a', '@', case=False, regex=True)  # case-insensitive replace; explicit regex=True avoids relying on the default
df['Name_contains_b'] = df['Name'].str.contains('b', case=False)

print(df)

8. How do you convert a column to datetime and extract specific date parts?

To convert a column to datetime, use the pd.to_datetime function. Once converted, use the dt accessor to extract specific date parts.

Example:

import pandas as pd

data = {'date': ['2023-01-01', '2023-02-15', '2023-03-20']}
df = pd.DataFrame(data)

df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

print(df)

9. How would you create and manipulate a MultiIndex DataFrame?

A MultiIndex DataFrame allows for hierarchical indexing, useful for high-dimensional data. Create a MultiIndex using pd.MultiIndex.from_tuples or pd.MultiIndex.from_product.

Example:

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['letter', 'number'])
data = {'value': [10, 20, 30, 40]}
df = pd.DataFrame(data, index=index)

print(df.loc['A'])
print(df.xs(1, level='number'))
print(df.unstack())

10. How would you calculate summary statistics for a DataFrame?

To calculate summary statistics, use the describe() method. For numeric columns it reports the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

Example:

import pandas as pd

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}

df = pd.DataFrame(data)
summary_stats = df.describe()
print(summary_stats)

11. How do you apply a custom function to each element in a DataFrame?

To apply a custom function to each element, use the applymap method. Note that in pandas 2.1 and later, applymap has been renamed to DataFrame.map and applymap is deprecated.

Example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

def increment(x):
    return x + 1

df = df.applymap(increment)  # on pandas 2.1+, prefer df.map(increment)
print(df)

12. How would you convert a column to a categorical type and why would you do it?

Converting a column to a categorical type can save memory and speed up operations such as groupby, especially when the column has a small number of unique values relative to its length. Use the astype method or pd.Categorical.

Example:

import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
})

df['fruit'] = df['fruit'].astype('category')
print(df.dtypes)
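
To see why this helps, you can compare the memory footprint of the same column stored as plain object strings versus category codes using memory_usage (exact numbers vary by pandas version):

as_object = df['fruit'].astype('object')    # back to plain Python strings
as_category = df['fruit']                   # already categorical after the conversion above

print(as_object.memory_usage(deep=True))    # stores every string
print(as_category.memory_usage(deep=True))  # stores small integer codes plus one lookup table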

13. How do you perform rolling window calculations on a DataFrame?

Rolling window calculations are used for operations on a sliding window of a specified size. Use the rolling method to specify the window size and apply aggregation functions.

Example:

import pandas as pd

data = {'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

df['rolling_mean'] = df['value'].rolling(window=3).mean()
print(df)

14. How would you perform expanding window calculations on a DataFrame?

Expanding window calculations apply a function over a progressively larger subset of data. Use the expanding() function for cumulative calculations.

Example:

import pandas as pd

data = {'values': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

df['cumulative_sum'] = df['values'].expanding().sum()
print(df)

15. How do you create a line plot from a DataFrame?

To create a line plot, use the plot method, which leverages Matplotlib.

Example:

import pandas as pd
import matplotlib.pyplot as plt

data = {
    'Year': [2015, 2016, 2017, 2018, 2019],
    'Sales': [200, 300, 400, 500, 600]
}
df = pd.DataFrame(data)

df.plot(x='Year', y='Sales', kind='line', title='Yearly Sales')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

16. How would you perform custom aggregation functions using groupby?

The groupby method is used to split data into groups. Custom aggregation functions can be applied using the agg method.

Example:

import pandas as pd

data = {
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 30, 40],
    'Value2': [100, 200, 300, 400]
}
df = pd.DataFrame(data)

custom_agg = {
    'Value1': 'sum',
    'Value2': lambda x: x.max() - x.min()
}

result = df.groupby('Category').agg(custom_agg)
print(result)
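
A related pattern worth knowing is named aggregation, which labels the output columns directly; the names total_value1 and range_value2 below are just illustrative:

named_agg = df.groupby('Category').agg(
    total_value1=('Value1', 'sum'),
    range_value2=('Value2', lambda x: x.max() - x.min())
)
print(named_agg)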

17. How do you use window functions like rank and percent_rank?

Window functions such as rank and percent rank assign each row a relative position based on its value. Series.rank() returns each value's position in the sorted order, and rank(pct=True) returns that position as a fraction between 0 and 1 (the percent rank).

Example:

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'Score': [85, 95, 75, 90, 80]}
df = pd.DataFrame(data)

df['Rank'] = df['Score'].rank()
df['Percent_Rank'] = df['Score'].rank(pct=True)

print(df)

18. How would you handle sparse data in a DataFrame?

Sparse data can be managed efficiently using Pandas' sparse dtypes, which store only the values that differ from a specified fill value (such as 0 or NaN), together with their positions.

Example:

import pandas as pd

data = {'A': [0, 0, 1, 0, 0], 'B': [0, 2, 0, 0, 0], 'C': [0, 0, 0, 3, 0]}
df = pd.DataFrame(data)

sparse_df = df.astype(pd.SparseDtype("int", 0))  # treat 0 as the fill value
print(sparse_df)
print(sparse_df.sparse.density)  # fraction of values that are not the fill value

19. How can you integrate Pandas with other libraries like NumPy or SQL?

Pandas integrates smoothly with NumPy and with SQL databases. NumPy arrays can be used to build DataFrames, and NumPy's vectorized functions can be applied to columns directly, while read_sql_query and to_sql read from and write to a database.

Example:

import pandas as pd
import numpy as np
import sqlite3

# NumPy integration: build a DataFrame from an ndarray and apply a vectorized function
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
df['A'] = np.log(df['A'])

# SQL integration: read from and write to a SQLite database
conn = sqlite3.connect('example.db')
df = pd.read_sql_query("SELECT * FROM my_table", conn)
df.to_sql('my_table_copy', conn, if_exists='replace', index=False)

20. How do you handle time series data?

Handling time series data involves parsing dates, setting the datetime column as the index, and resampling data.

Example:

import pandas as pd

data = {
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'value': [10, 20, 30, 40]
}

df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

monthly_data = df.resample('M').sum()  # 'M' = month-end; pandas 2.2+ prefers the 'ME' alias
print(monthly_data)

21. What are the differences between join, merge, and concat?

Join, merge, and concat are used to combine DataFrames.

join: combines two DataFrames on their index (or on a key column of the caller).
merge: combines DataFrames on one or more common columns or indexes, SQL-style.
concat: stacks DataFrames along a particular axis (rows or columns).

Example:

import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=[0, 1, 2])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}, index=[0, 1, 2])

joined_df = df1.join(df2)

df3 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'E': ['E0', 'E1', 'E2']})
df4 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'F': ['F0', 'F1', 'F2']})
merged_df = pd.merge(df3, df4, on='key')

concat_df = pd.concat([df1, df2], axis=1)

22. How do you use the apply function for row-wise and column-wise operations?

The apply method applies a function along an axis of the DataFrame: axis=0 (the default) passes each column to the function, while axis=1 passes each row.

Example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

row_sum = df.apply(lambda x: x.sum(), axis=1)
col_sum = df.apply(lambda x: x.sum(), axis=0)

print("Row-wise sum:\n", row_sum)
print("Column-wise sum:\n", col_sum)

23. How do you work with hierarchical indexing (MultiIndex)?

Hierarchical indexing, or MultiIndex, allows for more complex data structures by enabling multiple levels of indexing.

Example:

import pandas as pd

index = pd.MultiIndex.from_tuples([('A', 1), ('A', 2), ('B', 1), ('B', 2)], names=['letter', 'number'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(df.loc['A'])
print(df.loc[('A', 1)])

24. How do you handle large datasets that don’t fit into memory?

Datasets that don't fit into memory can be handled with libraries like Dask, which provides a Pandas-like API that splits the data into smaller partitions and processes them out of core and in parallel.

Example:

import dask.dataframe as dd

df = dd.read_csv('large_dataset.csv')
result = df.groupby('column_name').mean().compute()
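
If Dask is not an option, the same chunking idea can be sketched in plain pandas using read_csv's chunksize parameter (the column names below are hypothetical), accumulating partial sums and counts to recover a per-group mean:

import pandas as pd

sums, counts = {}, {}
# stream the file in 100,000-row chunks instead of loading it all at once
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    grouped = chunk.groupby('column_name')['value_column'].agg(['sum', 'count'])
    for key, row in grouped.iterrows():
        sums[key] = sums.get(key, 0) + row['sum']
        counts[key] = counts.get(key, 0) + row['count']

means = {key: sums[key] / counts[key] for key in sums}
print(means)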

25. How do you perform groupby operations and aggregate data?

The groupby method is used to split data into groups and apply aggregation functions.

Example:

import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('Category').sum()
print(grouped_df)