25 Pandas Interview Questions and Answers
Prepare for your next data science interview with this guide on Pandas, featuring common questions and detailed answers to enhance your skills.
Pandas is a powerful and flexible open-source data manipulation and analysis library for Python. It provides data structures like DataFrames and Series, which are essential for handling structured data efficiently. Pandas is widely used in data science, finance, and various other fields that require robust data analysis capabilities.
This article offers a curated selection of interview questions designed to test your understanding and proficiency with Pandas. By working through these questions and their detailed answers, you will be better prepared to demonstrate your expertise and problem-solving skills in any technical interview setting.
In Pandas, a DataFrame can be created from a dictionary of lists using the pd.DataFrame() constructor. Each key in the dictionary represents a column name, and the corresponding list contains the data for that column.
Example:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
To read a CSV file into a DataFrame, use the read_csv function. This function allows you to specify various parameters to handle different CSV formats.
Example:
import pandas as pd

df = pd.read_csv('path/to/your/file.csv')
print(df.head())
The read_csv function supports parameters such as delimiter, header, names, index_col, and dtype to customize the reading process.
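For instance, a sketch combining several of these options (the file path and the column names here are placeholders, not part of the original example):

import pandas as pd

# All names below are illustrative; adjust them to your file's layout
df = pd.read_csv(
    'path/to/your/file.csv',
    delimiter=';',                  # field separator, if not a comma
    header=None,                    # the file has no header row
    names=['id', 'name', 'age'],    # supply column names explicitly
    index_col='id',                 # use the 'id' column as the index
    dtype={'age': 'int64'}          # force a column's dtype on read
)
print(df.head())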
In Pandas, loc and iloc are used to select rows and columns from a DataFrame.
– loc is label-based, meaning you specify the names of the rows and columns.
– iloc is integer position-based, meaning you specify the positions of the rows and columns.
Example:
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

loc_result = df.loc[0:1, ['A', 'B']]   # label slices are inclusive: rows 0 and 1
iloc_result = df.iloc[0:2, 0:2]        # position slices are exclusive: rows 0 and 1
Missing data can be identified using isnull() or notnull(), which return a DataFrame of boolean values indicating where data is missing. Once identified, missing data can be handled by removing it with dropna() or filling it with a specific value using fillna().
Example:
import pandas as pd

data = {
    'A': [1, 2, None, 4],
    'B': [None, 2, 3, 4],
    'C': [1, None, None, 4]
}
df = pd.DataFrame(data)

missing_data = df.isnull()   # boolean mask marking missing values
df_dropped = df.dropna()     # drop rows containing any missing value
df_filled = df.fillna(0)     # replace missing values with 0
To merge two DataFrames on a common column, use the merge function. This function allows you to specify the column(s) on which to merge and the type of join to perform (inner, outer, left, or right).
Example:
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Age': [25, 30, 35, 40]
})

merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
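Changing the how argument switches the join type; for example, an outer join on the same frames keeps every ID from either side:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David']})
df2 = pd.DataFrame({'ID': [3, 4, 5, 6], 'Age': [25, 30, 35, 40]})

# Outer join keeps all IDs from both frames; unmatched cells become NaN
outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)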
To sort a DataFrame by multiple columns, use the sort_values method. It accepts a list of columns to sort by and a matching list of sort orders.
Example:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 23, 22, 21],
    'Score': [88, 92, 85, 90]
}
df = pd.DataFrame(data)

sorted_df = df.sort_values(by=['Age', 'Score'], ascending=[True, False])
print(sorted_df)
To perform string operations on a column, use the str accessor, which provides vectorized string functions.
Example:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

df['Name_lower'] = df['Name'].str.lower()
df['Name_upper'] = df['Name'].str.upper()
# case=False requires regex=True in pandas 2.x, where regex defaults to False
df['Name_replaced'] = df['Name'].str.replace('a', '@', case=False, regex=True)
df['Name_contains_b'] = df['Name'].str.contains('b', case=False)
print(df)
To convert a column to datetime, use the pd.to_datetime function. Once converted, use the dt accessor to extract specific date parts.
Example:
import pandas as pd

data = {'date': ['2023-01-01', '2023-02-15', '2023-03-20']}
df = pd.DataFrame(data)

df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
print(df)
A MultiIndex DataFrame allows for hierarchical indexing, which is useful for high-dimensional data. Create a MultiIndex using pd.MultiIndex.from_tuples or pd.MultiIndex.from_product.
Example:
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['letter', 'number']
)
data = {'value': [10, 20, 30, 40]}
df = pd.DataFrame(data, index=index)

print(df.loc['A'])               # all rows under outer label 'A'
print(df.xs(1, level='number'))  # cross-section at number == 1
print(df.unstack())              # pivot the inner level into columns
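pd.MultiIndex.from_product, mentioned above, builds the equivalent index from the Cartesian product of its inputs; a minimal sketch:

import pandas as pd

# Same four (letter, number) pairs, built as a Cartesian product
index = pd.MultiIndex.from_product([['A', 'B'], [1, 2]], names=['letter', 'number'])
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)
print(df)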
To calculate summary statistics, use the describe() method. It reports metrics such as count, mean, standard deviation, minimum, quartiles, and maximum for each numeric column.
Example:
import pandas as pd

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1],
    'C': [2, 3, 4, 5, 6]
}
df = pd.DataFrame(data)

summary_stats = df.describe()
print(summary_stats)
To apply a custom function to each element, use the DataFrame.map method (named applymap before pandas 2.1; the old name is deprecated).
Example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

def increment(x):
    return x + 1

# On pandas versions before 2.1, use df.applymap(increment) instead
df = df.map(increment)
print(df)
Converting a column to a categorical type can save memory and improve performance, especially when the column has a limited number of unique values. Use the astype method or pd.Categorical.
Example:
import pandas as pd

df = pd.DataFrame({
    'fruit': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple']
})
df['fruit'] = df['fruit'].astype('category')
print(df.dtypes)
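pd.Categorical offers the same conversion while letting you declare the category set explicitly; a brief sketch (the repeated list is only there to make the memory difference visible):

import pandas as pd

fruits = ['apple', 'banana', 'apple', 'orange', 'banana', 'apple'] * 1000
df = pd.DataFrame({'fruit': fruits})

# Declare the allowed categories up front
df['fruit_cat'] = pd.Categorical(df['fruit'], categories=['apple', 'banana', 'orange'])

# Compare memory footprints of the object and categorical columns
print(df.memory_usage(deep=True))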
Rolling window calculations apply an operation over a sliding window of a specified size. Use the rolling method to set the window size, then chain an aggregation function.
Example:
import pandas as pd

data = {'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

df['rolling_mean'] = df['value'].rolling(window=3).mean()
print(df)
Expanding window calculations apply a function over a progressively larger subset of the data. Use the expanding() method for cumulative calculations.
Example:
import pandas as pd

data = {'values': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

df['cumulative_sum'] = df['values'].expanding().sum()
print(df)
To create a line plot, use the plot method, which uses Matplotlib under the hood.
Example:
import pandas as pd
import matplotlib.pyplot as plt

data = {
    'Year': [2015, 2016, 2017, 2018, 2019],
    'Sales': [200, 300, 400, 500, 600]
}
df = pd.DataFrame(data)

df.plot(x='Year', y='Sales', kind='line', title='Yearly Sales')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
The groupby method is used to split data into groups. Custom aggregation functions can then be applied using the agg method.
Example:
import pandas as pd

data = {
    'Category': ['A', 'A', 'B', 'B'],
    'Value1': [10, 20, 30, 40],
    'Value2': [100, 200, 300, 400]
}
df = pd.DataFrame(data)

custom_agg = {
    'Value1': 'sum',
    'Value2': lambda x: x.max() - x.min()   # custom aggregation: per-group range
}
result = df.groupby('Category').agg(custom_agg)
print(result)
Ranking functions such as rank, and its percentage form rank(pct=True), assign each row a position relative to the other rows in the column.
Example:
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Edward'],
        'Score': [85, 95, 75, 90, 80]}
df = pd.DataFrame(data)

df['Rank'] = df['Score'].rank()
df['Percent_Rank'] = df['Score'].rank(pct=True)
print(df)
Sparse data can be efficiently managed using Pandas’ sparse data structures, which store only non-zero elements and their positions.
Example:
import pandas as pd

data = {'A': [0, 0, 1, 0, 0],
        'B': [0, 2, 0, 0, 0],
        'C': [0, 0, 0, 3, 0]}
df = pd.DataFrame(data)

# Use 0 as the fill value so only the non-zero entries are stored
sparse_df = df.astype(pd.SparseDtype("int", 0))
print(sparse_df)
print(sparse_df.sparse.density)   # fraction of entries actually stored
Pandas can integrate with libraries like NumPy and SQL for enhanced data manipulation. With NumPy, you can perform efficient array operations. For SQL, Pandas provides functions to read from and write to databases.
Example:
import pandas as pd
import numpy as np

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

df['A'] = np.log(df['A'])   # vectorized NumPy operation applied to a column
import pandas as pd
import sqlite3

# Read from and write back to a SQLite database
conn = sqlite3.connect('example.db')
df = pd.read_sql_query("SELECT * FROM my_table", conn)
df.to_sql('my_table_copy', conn, if_exists='replace', index=False)
Handling time series data involves parsing dates, setting the datetime column as the index, and resampling data.
Example:
import pandas as pd

data = {
    'date': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'value': [10, 20, 30, 40]
}
df = pd.DataFrame(data)

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

# 'M' = month-end frequency (spelled 'ME' from pandas 2.2 onward)
monthly_data = df.resample('M').sum()
print(monthly_data)
Join, merge, and concat are used to combine DataFrames.
– join: joins two DataFrames based on their index.
– merge: combines DataFrames based on one or more columns.
– concat: concatenates DataFrames along a particular axis.
Example:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}, index=[0, 1, 2])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}, index=[0, 1, 2])
joined_df = df1.join(df2)                  # align on the shared index

df3 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'E': ['E0', 'E1', 'E2']})
df4 = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'F': ['F0', 'F1', 'F2']})
merged_df = pd.merge(df3, df4, on='key')   # align on the 'key' column

concat_df = pd.concat([df1, df2], axis=1)  # stack side by side along columns
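The example above concatenates along columns (axis=1); a minimal sketch of the row-wise case, with illustrative frame names:

import pandas as pd

df_a = pd.DataFrame({'key': ['K0', 'K1'], 'E': ['E0', 'E1']})
df_b = pd.DataFrame({'key': ['K2', 'K3'], 'E': ['E2', 'E3']})

# axis=0 stacks rows; ignore_index=True rebuilds a clean 0..n-1 index
stacked = pd.concat([df_a, df_b], axis=0, ignore_index=True)
print(stacked)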
The apply function is used to apply a function along an axis of a DataFrame (axis=0 for columns, axis=1 for rows).
Example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

row_sum = df.apply(lambda x: x.sum(), axis=1)   # one sum per row
col_sum = df.apply(lambda x: x.sum(), axis=0)   # one sum per column

print("Row-wise sum:\n", row_sum)
print("Column-wise sum:\n", col_sum)
Hierarchical indexing, or MultiIndex, allows for more complex data structures by enabling multiple levels of indexing.
Example:
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('A', 1), ('A', 2), ('B', 1), ('B', 2)],
    names=['letter', 'number']
)
df = pd.DataFrame({'value': [10, 20, 30, 40]}, index=index)

print(df.loc['A'])        # all rows under outer label 'A'
print(df.loc[('A', 1)])   # a single row addressed by the full tuple
Handling large datasets that don’t fit into memory can be managed using libraries like Dask, which extends Pandas to operate on larger-than-memory datasets by breaking them into smaller chunks.
Example:
import dask.dataframe as dd

df = dd.read_csv('large_dataset.csv')
result = df.groupby('column_name').mean().compute()   # compute() triggers execution
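Plain Pandas can also stream a large CSV in fixed-size pieces through read_csv's chunksize parameter; a minimal sketch, where the file and column names are placeholders:

import pandas as pd

# Process the file 100,000 rows at a time instead of loading it all at once
totals = {}
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    for key, value in chunk.groupby('column_name')['value_column'].sum().items():
        totals[key] = totals.get(key, 0) + value
print(totals)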
The groupby method is used to split data into groups and apply aggregation functions.
Example:
import pandas as pd

data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)

grouped_df = df.groupby('Category').sum()
print(grouped_df)