10 dplyr Interview Questions and Answers
Prepare for your data science interview with this guide on dplyr, covering key concepts and practical examples to enhance your data manipulation skills.
dplyr is a powerful R package designed for data manipulation and transformation, making it an essential tool for data analysts and statisticians. Known for its intuitive syntax and efficient handling of large datasets, dplyr simplifies complex data operations through a set of straightforward functions. Its ability to seamlessly integrate with other R packages and data visualization tools makes it a cornerstone in the data science ecosystem.
This article offers a curated selection of dplyr-related interview questions and answers to help you prepare effectively. By familiarizing yourself with these questions, you can enhance your understanding of dplyr’s capabilities and demonstrate your proficiency in data manipulation during your interview.
To create a new column “grade” with values “Pass” if “score” is greater than 60, otherwise “Fail”, use the mutate function.
Example:
    library(dplyr)

    # Sample data frame
    df <- data.frame(
      name = c("Alice", "Bob", "Charlie"),
      score = c(70, 50, 80)
    )

    # Add new column 'grade'
    df <- df %>%
      mutate(grade = ifelse(score > 60, "Pass", "Fail"))

    print(df)
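Interviewers often follow up by asking about missing scores or more than two grade levels. The sketch below is one way to handle both with dplyr's if_else() and case_when(); the extra row, the "band" column, and its cut-offs are made up for illustration.

```r
library(dplyr)

# Hypothetical data; the NA score shows how missing values are handled
df <- data.frame(
  name  = c("Alice", "Bob", "Charlie", "Dana"),
  score = c(70, 50, 80, NA)
)

graded <- df %>%
  mutate(
    # if_else() is a stricter ifelse(): both branches must share a type,
    # and `missing` sets the value used when the condition is NA
    grade = if_else(score > 60, "Pass", "Fail", missing = "No score"),
    # case_when() checks conditions in order; the first match wins
    band = case_when(
      is.na(score) ~ "Unknown",
      score >= 80  ~ "High",
      score > 60   ~ "Medium",
      TRUE         ~ "Low"
    )
  )

print(graded)
```

With plain ifelse(), the row with a missing score would get NA as its grade, so an explicit `missing` value makes the intent clearer.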
To calculate the average “score” for each “class”, use the group_by and summarize functions.
Example:
    library(dplyr)

    # Sample data frame
    data <- data.frame(
      class = c('A', 'A', 'B', 'B', 'C', 'C'),
      score = c(85, 90, 78, 82, 88, 91)
    )

    # Calculate average score for each class
    average_scores <- data %>%
      group_by(class) %>%
      summarize(average_score = mean(score))

    print(average_scores)
To group data by “department” and summarize the total “sales” for each group, use group_by and summarize.
Example:
    library(dplyr)

    # Sample data frame
    data <- data.frame(
      department = c('A', 'B', 'A', 'C', 'B', 'A'),
      sales = c(100, 200, 150, 300, 250, 100)
    )

    # Group by department and summarize total sales
    result <- data %>%
      group_by(department) %>%
      summarize(total_sales = sum(sales))

    print(result)
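A single summarize call can compute several statistics per group. The following sketch reuses the department data and assumes dplyr 1.0.0 or later for the `.groups` argument; the column names mean_sales and n_orders are illustrative.

```r
library(dplyr)

data <- data.frame(
  department = c("A", "B", "A", "C", "B", "A"),
  sales = c(100, 200, 150, 300, 250, 100)
)

dept_summary <- data %>%
  group_by(department) %>%
  summarize(
    total_sales = sum(sales),   # total per department
    mean_sales  = mean(sales),  # average per department
    n_orders    = n(),          # number of rows in each group
    .groups = "drop"            # return an ungrouped result
  ) %>%
  arrange(desc(total_sales))    # largest total first

print(dept_summary)
```

Dropping the grouping explicitly avoids surprises when later verbs are chained onto the summary.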
To perform an inner join between two dataframes on the “id” column, use the inner_join function.
Example:
    library(dplyr)

    # Sample dataframes
    df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))
    df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

    # Perform inner join on the "id" column
    result <- inner_join(df1, df2, by = "id")

    print(result)
    #   id value1 value2
    # 1  2      B      X
    # 2  3      C      Y
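An inner join drops rows that have no match on either side. As a rough contrast, the sketch below applies dplyr's other common joins to the same two dataframes; the result variable names are illustrative.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 3), value1 = c("A", "B", "C"))
df2 <- data.frame(id = c(2, 3, 4), value2 = c("X", "Y", "Z"))

# left_join() keeps every row of df1; unmatched rows get NA in value2
left_result <- left_join(df1, df2, by = "id")

# full_join() keeps rows from both sides
full_result <- full_join(df1, df2, by = "id")

# anti_join() returns the rows of df1 that have no match in df2
unmatched <- anti_join(df1, df2, by = "id")

print(left_result)
print(full_result)
print(unmatched)
```

anti_join() is a handy way to check which keys an inner join would discard.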
How do you chain filter(), mutate(), and arrange() to filter rows where “age” > 30, create a new column “age_group”, and sort by “age_group”?
To filter rows where “age” > 30, create a new column “age_group”, and sort by “age_group”, chain filter(), mutate(), and arrange() with the pipe operator.
    library(dplyr)

    data <- data.frame(
      name = c("John", "Jane", "Doe", "Smith"),
      age = c(25, 35, 45, 28)
    )

    result <- data %>%
      filter(age > 30) %>%
      mutate(age_group = ifelse(age > 40, "Senior", "Adult")) %>%
      arrange(age_group)

    print(result)
To optimize performance with large datasets, consider these techniques:
– Use efficient data structures like data.table.
– Filter early to reduce dataset size.
– Select only relevant columns.
– Use vectorized operations.
– Leverage parallel processing.
– Use database backends for very large datasets.
Example:
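    library(dplyr)
    library(data.table)

    # large_dataset, condition, and column1..column3 are placeholders
    # for your own data, filter expression, and column names.

    # Convert the data frame to a data.table. Note that the dplyr verbs below
    # still dispatch to dplyr's data frame methods; the dtplyr package
    # (lazy_dt()) translates dplyr code into data.table operations.
    dt <- as.data.table(large_dataset)

    # Filter early and select relevant columns
    optimized_data <- dt %>%
      filter(condition) %>%
      select(column1, column2, column3)

    # Perform vectorized operations
    result <- optimized_data %>%
      mutate(new_column = column1 + column2) %>%
      summarize(mean_value = mean(new_column))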
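Because the example above relies on placeholder names, here is a runnable sketch of the same ideas (filter early, select only what is needed, use vectorized arithmetic) on synthetic data. The dataset, its column names, and the "North" filter are all assumptions made for illustration.

```r
library(dplyr)

# Synthetic stand-in for a large dataset (hypothetical column names)
set.seed(42)
n <- 100000
large_dataset <- data.frame(
  id      = seq_len(n),
  region  = sample(c("North", "South", "East", "West"), n, replace = TRUE),
  price   = runif(n, 1, 100),
  qty     = sample(1:10, n, replace = TRUE),
  comment = "unused text column"
)

result <- large_dataset %>%
  filter(region == "North") %>%      # filter early: fewer rows downstream
  select(region, price, qty) %>%     # keep only the columns needed
  mutate(revenue = price * qty) %>%  # vectorized arithmetic, no row loop
  summarize(mean_revenue = mean(revenue))

print(result)
```

Whether these steps make a measurable difference depends on the data size and backend, so it is worth timing both versions on your own data.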
To fill missing values in the “salary” column with the mean salary, use mutate and ifelse.
    library(dplyr)

    # Sample data frame
    df <- data.frame(
      name = c("John", "Jane", "Doe", "Smith"),
      salary = c(50000, NA, 60000, NA)
    )

    # Fill missing values in the salary column with the mean salary
    df <- df %>%
      mutate(salary = ifelse(is.na(salary), mean(salary, na.rm = TRUE), salary))

    print(df)
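A common variation is to fill each missing value with the mean of its own group rather than the overall mean. The sketch below uses dplyr's coalesce() for this; the "dept" column and the extra row are hypothetical additions to the sample data.

```r
library(dplyr)

df <- data.frame(
  name   = c("John", "Jane", "Doe", "Smith", "Lee"),
  dept   = c("IT", "IT", "HR", "HR", "HR"),
  salary = c(50000, NA, 60000, NA, 70000)
)

filled <- df %>%
  group_by(dept) %>%
  # coalesce() takes the first non-missing value element-wise; the
  # length-1 group mean is recycled across the rows of each group
  mutate(salary = coalesce(salary, mean(salary, na.rm = TRUE))) %>%
  ungroup()

print(filled)
```

Here Jane receives the IT mean and Smith receives the HR mean, while existing salaries are left unchanged.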
To calculate the cumulative sum of the “sales” column, use the cumsum function within a mutate call.
Example:
    library(dplyr)

    # Sample data frame
    df <- data.frame(
      id = 1:5,
      sales = c(100, 200, 150, 300, 250)
    )

    # Calculate cumulative sum of sales
    df <- df %>%
      mutate(cumulative_sales = cumsum(sales))

    print(df)
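cumsum() depends on row order, and a running total is often wanted per group. The following sketch sorts first and then restarts the total for each group; the "region" and "month" columns are assumed for illustration.

```r
library(dplyr)

df <- data.frame(
  region = c("East", "West", "East", "West", "East"),
  month  = c(1, 1, 2, 2, 3),
  sales  = c(100, 200, 150, 300, 250)
)

running <- df %>%
  arrange(region, month) %>%                   # fix the order before accumulating
  group_by(region) %>%                         # restart the total in each region
  mutate(cumulative_sales = cumsum(sales)) %>%
  ungroup()

print(running)
```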
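To randomly sample 10% of the rows from a dataframe, use the sample_frac function. In dplyr 1.0.0 and later, sample_frac is superseded by slice_sample(prop = 0.1), although it still works.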
Example:
    library(dplyr)

    # Assuming df is your dataframe
    sampled_df <- df %>% sample_frac(0.1)
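Random samples differ from run to run unless a seed is set. The sketch below assumes dplyr 1.0.0 or later and uses slice_sample() for a reproducible 10% sample, both overall and within each group; the sample dataframe and the seed value are arbitrary.

```r
library(dplyr)

# Hypothetical dataframe: 100 rows split evenly into two groups
df <- data.frame(id = 1:100, group = rep(c("a", "b"), each = 50))

set.seed(123)  # makes the random draw repeatable

# 10% of all rows
sampled_df <- df %>% slice_sample(prop = 0.1)

# 10% of the rows within each group (stratified sample)
stratified <- df %>%
  group_by(group) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

print(nrow(sampled_df))
print(table(stratified$group))
```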
To create a summary table showing the count of unique values in the “category” column for each “region”, use group_by and summarize.
    library(dplyr)

    # Sample data frame
    data <- data.frame(
      region = c('North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'),
      category = c('A', 'B', 'A', 'C', 'A', 'B', 'D', 'C')
    )

    # Create summary table
    summary_table <- data %>%
      group_by(region) %>%
      summarize(unique_count = n_distinct(category))

    print(summary_table)
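The same summary can be cross-checked with distinct() and count(), which also makes it easy to see how often each region/category pair occurs. This is a sketch on the sample data above; the output column name n_rows is illustrative.

```r
library(dplyr)

data <- data.frame(
  region   = c("North", "South", "East", "West", "North", "South", "East", "West"),
  category = c("A", "B", "A", "C", "A", "B", "D", "C")
)

# Number of rows for each region/category pair
pair_counts <- data %>% count(region, category, name = "n_rows")

# Equivalent to n_distinct(): drop duplicate pairs, then count per region
unique_counts <- data %>%
  distinct(region, category) %>%
  count(region, name = "unique_count")

print(pair_counts)
print(unique_counts)
```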