dplyr is an R package for data manipulation and transformation and a core part of the tidyverse, which makes it a standard tool for data analysts and statisticians. It expresses common operations such as filtering, sorting, grouping, summarizing, and joining through a small set of consistently named functions that compose with the pipe operator. It also works alongside other R packages, including ggplot2 for visualization and dbplyr for database backends.
This article offers a curated selection of dplyr-related interview questions and answers to help you prepare effectively. By familiarizing yourself with these questions, you can enhance your understanding of dplyr’s capabilities and demonstrate your proficiency in data manipulation during your interview.
dplyr Interview Questions and Answers
1. Write a command to create a new column “grade” with values “Pass” if “score” is greater than 60, otherwise “Fail”.
Use the mutate function with ifelse, which returns “Pass” where the condition score > 60 holds and “Fail” otherwise.
Example:
library(dplyr)

# Sample data frame
df <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  score = c(70, 50, 80)
)

# Add new column 'grade'
df <- df %>%
  mutate(grade = ifelse(score > 60, "Pass", "Fail"))

print(df)
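A common follow-up is how the same column behaves with missing scores or more than two outcomes. The sketch below is an illustration, not part of the original answer: it uses dplyr's case_when(), and the extra row with an NA score and the "Distinction" band above 75 are assumptions made for the example.

```r
library(dplyr)

# Hypothetical sample data; the NA score and the grade bands are assumptions
df <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "Dana"),
  score = c(70, 50, 80, NA)
)

graded <- df %>%
  mutate(
    grade = case_when(
      is.na(score) ~ NA_character_,  # keep missing scores missing
      score > 75   ~ "Distinction",
      score > 60   ~ "Pass",
      TRUE         ~ "Fail"          # everything else
    )
  )

print(graded)
```

case_when() evaluates the conditions in order and uses the first one that matches, so the more specific conditions need to come first.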
2. Write a command to calculate the average “score” for each “class”.
Use group_by to define the groups and summarize to collapse each group to a single row containing the mean of “score”.
Example:
library(dplyr)

# Sample data frame
data <- data.frame(
  class = c('A', 'A', 'B', 'B', 'C', 'C'),
  score = c(85, 90, 78, 82, 88, 91)
)

# Calculate average score for each class
average_scores <- data %>%
  group_by(class) %>%
  summarize(average_score = mean(score))

print(average_scores)
3. How would you group data by “department” and then summarize the total “sales” for each group?
Use group_by to split the rows by “department”, then summarize with sum to total “sales” within each group.
Example:
library(dplyr)

# Sample data frame
data <- data.frame(
  department = c('A', 'B', 'A', 'C', 'B', 'A'),
  sales = c(100, 200, 150, 300, 250, 100)
)

# Group by department and summarize total sales
result <- data %>%
  group_by(department) %>%
  summarize(total_sales = sum(sales))

print(result)
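The same pattern extends to several summaries at once. The following sketch, offered as one possible variation rather than the required answer, adds a row count per department, drops the grouping afterwards, and sorts by total sales. The column name n_orders is an assumption, and the .groups argument requires dplyr 1.0.0 or later.

```r
library(dplyr)

# Same sample data as in the answer above
data <- data.frame(
  department = c("A", "B", "A", "C", "B", "A"),
  sales = c(100, 200, 150, 300, 250, 100)
)

summary_tbl <- data %>%
  group_by(department) %>%
  summarize(
    total_sales = sum(sales, na.rm = TRUE),  # ignore missing sales, if any
    n_orders    = n(),                       # rows per department
    .groups     = "drop"                     # return an ungrouped result
  ) %>%
  arrange(desc(total_sales))                 # largest total first

print(summary_tbl)
```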
4. How would you perform an inner join between two dataframes on the “id” column?
Use the inner_join function, which keeps only the rows whose “id” value appears in both dataframes.
Example:
library(dplyr)

# Sample dataframes
df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))
df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

# Perform inner join on the "id" column
result <- inner_join(df1, df2, by = "id")

print(result)
#   id value1 value2
# 1  2      B      X
# 2  3      C      Y
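Interviewers often ask how the result changes with a different join type. As a rough comparison on the same two dataframes, the sketch below shows left_join(), full_join(), and anti_join(), plus an inner join on keys with different names. The dataframe df3 and its column key are hypothetical, added only to illustrate the named by argument.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))
df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

# Keep every row of df1; value2 is NA where df2 has no match (id 1)
left_result <- left_join(df1, df2, by = "id")

# Keep every id found in either dataframe (ids 1 to 4)
full_result <- full_join(df1, df2, by = "id")

# Keep only the rows of df1 that have no match in df2 (id 1)
anti_result <- anti_join(df1, df2, by = "id")

# Hypothetical dataframe whose key column has a different name
df3 <- data.frame(key = c(2, 3, 4), value2 = c("X", "Y", "Z"))
renamed_result <- inner_join(df1, df3, by = c("id" = "key"))
```

When the key names differ, the output keeps the key name from the first dataframe, here id.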
5. How would you chain together filter(), mutate(), and arrange() to filter rows where “age” > 30, create a new column “age_group”, and sort by “age_group”?
Chain the three verbs with the pipe operator (%>%): filter() keeps the rows where “age” > 30, mutate() adds the “age_group” column, and arrange() sorts the result by it.
library(dplyr)

data <- data.frame(
  name = c("John", "Jane", "Doe", "Smith"),
  age = c(25, 35, 45, 28)
)

result <- data %>%
  filter(age > 30) %>%
  mutate(age_group = ifelse(age > 40, "Senior", "Adult")) %>%
  arrange(age_group)

print(result)
6. Describe some techniques to optimize performance when working with large datasets.
To optimize performance with large datasets, consider these techniques:
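– Use a faster backend such as data.table (for example through the dtplyr package) instead of plain data frames.
– Filter early to reduce dataset size before expensive steps such as joins and grouped summaries.
– Select only relevant columns.
– Use vectorized operations instead of row-by-row loops.
– Leverage parallel processing when the work can be split into independent chunks.
– Use database backends (for example through the dbplyr package) for datasets that do not fit in memory.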
Example:
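library(dplyr)
library(dtplyr)  # translates dplyr verbs into data.table code

# large_dataset, condition, and column1..column3 are placeholders for your own data.
# dplyr verbs applied directly to a data.table treat it as an ordinary data frame,
# so wrap the data with lazy_dt() to have the pipeline run on data.table.
dt <- lazy_dt(large_dataset)

# Filter early and select relevant columns
optimized_data <- dt %>%
  filter(condition) %>%
  select(column1, column2, column3)

# Perform vectorized operations, then collect the result as a tibble
result <- optimized_data %>%
  mutate(new_column = column1 + column2) %>%
  summarise(mean_value = mean(new_column)) %>%
  as_tibble()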
7. Write a command to fill missing values in the “salary” column with the mean salary.
Use mutate with ifelse and is.na: where “salary” is missing, substitute the mean of the non-missing values (computed with na.rm = TRUE); otherwise keep the original value.
library(dplyr)

# Sample data frame
df <- data.frame(
  name = c("John", "Jane", "Doe", "Smith"),
  salary = c(50000, NA, 60000, NA)
)

# Fill missing values in the salary column with the mean salary
df <- df %>%
  mutate(salary = ifelse(is.na(salary), mean(salary, na.rm = TRUE), salary))

print(df)
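An equivalent and slightly shorter form uses dplyr's coalesce(), which returns the first non-missing value at each position. The sketch below also fills within groups rather than across the whole column. The dept column is a hypothetical addition for the example and is not part of the original question.

```r
library(dplyr)

# Hypothetical data: the dept column is an assumption added for illustration
df <- data.frame(
  name   = c("John", "Jane", "Doe", "Smith"),
  dept   = c("Eng", "Eng", "Ops", "Ops"),
  salary = c(50000, NA, 60000, NA)
)

filled <- df %>%
  group_by(dept) %>%
  # Replace each NA with the mean of the non-missing salaries in the same dept
  mutate(salary = coalesce(salary, mean(salary, na.rm = TRUE))) %>%
  ungroup()

print(filled)
```

If every salary in a group is missing, mean(..., na.rm = TRUE) returns NaN, so that case needs separate handling.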
8. How would you use a window function to calculate the cumulative sum of the “sales” column?
Use the cumsum function within a mutate call. The running total follows the current row order, so sort the data first (for example with arrange) if the order matters.
Example:
library(dplyr)

# Sample data frame
df <- data.frame(
  id = 1:5,
  sales = c(100, 200, 150, 300, 250)
)

# Calculate cumulative sum of sales
df <- df %>%
  mutate(cumulative_sales = cumsum(sales))

print(df)
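Window functions are most often asked about per group. As a sketch of that case, the example below computes a running total within each region, ordered by month. The region and month columns and the name sales_df are assumptions introduced for the illustration.

```r
library(dplyr)

# Hypothetical data: region and month are assumptions added for illustration
sales_df <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  month  = c(2, 1, 1, 2, 3),
  sales  = c(200, 150, 100, 300, 250)
)

running <- sales_df %>%
  group_by(region) %>%
  arrange(month, .by_group = TRUE) %>%       # order rows within each region
  mutate(cumulative_sales = cumsum(sales)) %>%  # restarts for every region
  ungroup()

print(running)
```

Because the data is grouped, cumsum() restarts at the first row of each region instead of running across the whole column.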
9. How would you use dplyr to sample 10% of the rows from a dataframe randomly?
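Use the sample_frac function with a fraction of 0.1. Since dplyr 1.0.0, sample_frac is superseded by slice_sample(prop = 0.1), which is preferred in new code; sample_frac still works.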
Example:
library(dplyr)

# Assuming df is your dataframe
sampled_df <- df %>%
  sample_frac(0.1)
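For reproducible results, it helps to set the random seed before sampling. The sketch below uses slice_sample(), available in dplyr 1.0.0 and later, on a hypothetical 100-row dataframe. It also shows a stratified variant that draws 10% from each group. The data and the seed value are assumptions made for the example.

```r
library(dplyr)

# Hypothetical data frame with 100 rows in two groups of 50
df <- data.frame(id = 1:100, group = rep(c("a", "b"), each = 50))

set.seed(42)  # arbitrary seed so the same rows are drawn on every run

# Simple random sample of 10% of all rows
sampled_df <- df %>%
  slice_sample(prop = 0.1)

# Stratified sample: 10% of the rows within each group
stratified_df <- df %>%
  group_by(group) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()
```

Sampling is without replacement by default; pass replace = TRUE to sample with replacement.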
10. Write a command to create a summary table that shows the count of unique values in the “category” column for each “region”.
Use group_by on “region” and summarize with n_distinct, which counts the distinct values of “category” within each group.
library(dplyr)

# Sample data frame
data <- data.frame(
  region = c('North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'),
  category = c('A', 'B', 'A', 'C', 'A', 'B', 'D', 'C')
)

# Create summary table
summary_table <- data %>%
  group_by(region) %>%
  summarize(unique_count = n_distinct(category))

print(summary_table)