Data Manipulation

This chapter is incomplete and still a work in progress.

dplyr is an R package that provides a grammar of data manipulation: a small set of functions (verbs) for working with data in a data frame or tibble. Here are some commonly used functions in dplyr:

select(): Selects specific columns from a data frame or tibble.

filter(): Filters rows based on a specified condition.

mutate(): Creates new columns based on calculations or transformations of existing columns.

arrange(): Sorts rows based on one or more columns.

group_by(): Groups the data by one or more columns.

summarize(): Calculates summary statistics for each group.

distinct(): Keeps only unique rows, optionally judged on a specific column or columns.

rename(): Renames specific columns in a data frame or tibble.

left_join(), right_join(), inner_join(), full_join(): Joins two data frames or tibbles based on a common column or columns.

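case_when(): Evaluates a sequence of conditions in order and returns the value for the first one that matches; typically used inside mutate() to generate a new column.

if_else(): A vectorized if-else that returns one of two values depending on a logical expression.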
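Several functions in this list (distinct(), rename(), the joins, case_when(), and if_else()) do not appear in the mtcars illustration, so a brief sketch of them may help. This is a minimal illustration, not a definitive recipe: the names `cars`, `miles_per_gallon`, `size`, `efficient`, and the lookup table `cyl_labels`, as well as the cut-offs used for the labels, are assumptions made up for the example.

```r
library(dplyr)

# Keep the car names by moving the row names into a column called "model".
cars <- as_tibble(mtcars, rownames = "model")

# distinct(): one row per unique combination of cyl and gear.
cyl_gear <- distinct(cars, cyl, gear)

# rename(): the syntax is new_name = old_name.
renamed <- rename(cars, miles_per_gallon = mpg)

# case_when() and if_else() inside mutate().
# The labels and thresholds are arbitrary, chosen only for illustration.
labelled <- mutate(
  cars,
  size = case_when(
    cyl == 4 ~ "small",
    cyl == 6 ~ "medium",
    TRUE     ~ "large"
  ),
  efficient = if_else(mpg >= 20, "yes", "no")
)

# left_join(): keep every row of cars and add matching columns from a
# hypothetical lookup table keyed on cyl. Rows with no match get NA.
cyl_labels <- tibble(cyl = c(4, 6), label = c("four", "six"))
joined <- left_join(cars, cyl_labels, by = "cyl")
```

Because `cyl_labels` has no entry for 8-cylinder cars, those rows have a missing `label` after the left join; an inner_join() would drop them instead.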

Illustration on mtcars data

Here’s an illustration of how to use dplyr to manipulate the mtcars data:

Load the dplyr package by running library(dplyr).

Create a tibble from the mtcars data frame:

# Create a tibble from the mtcars data frame
mtcars_tbl <- as_tibble(mtcars)

Select specific columns from the mtcars tibble using the select() function.

For example, select the mpg, cyl, and hp columns:

# Select specific columns from the mtcars tibble
selected_cols <- select(mtcars_tbl, mpg, cyl, hp)
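Beyond listing columns one by one, select() also accepts ranges, exclusions, and helper functions. The following is a small sketch of those forms; the result names `first_four`, `no_carb`, and `d_cols` are invented for the example.

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Select a contiguous range of columns by name (mpg through hp).
first_four <- select(mtcars_tbl, mpg:hp)

# Drop a column with a minus sign and keep everything else.
no_carb <- select(mtcars_tbl, -carb)

# Select columns whose names start with a given prefix.
d_cols <- select(mtcars_tbl, starts_with("d"))
```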

Filter rows based on a condition using the filter() function.

For example, filter the mtcars tibble to include only cars with an mpg greater than or equal to 20:

# Filter rows based on a condition
filtered_tbl <- filter(mtcars_tbl, mpg >= 20)
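filter() can also combine several conditions. As a hedged sketch (the thresholds and the result names are arbitrary choices for illustration), conditions separated by commas must all hold, while `|` and `%in%` express alternatives:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Comma-separated conditions are combined with AND.
efficient_4cyl <- filter(mtcars_tbl, mpg >= 20, cyl == 4)

# Use | for OR.
light_or_strong <- filter(mtcars_tbl, wt < 2 | hp > 200)

# Use %in% to match any of several values.
four_or_six <- filter(mtcars_tbl, cyl %in% c(4, 6))
```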

Create new columns based on existing columns using the mutate() function.

For example, add a new column called kmpl that contains the mpg column converted to kilometers per liter:

# Create new columns based on existing columns
mutated_tbl <- mutate(mtcars_tbl, kmpl = mpg * 0.425144)
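The factor 0.425144 can be derived rather than memorized, assuming the mpg column is in miles per US gallon (as the mtcars documentation states). The sketch below computes it from the definitions of the mile and the US gallon, and also shows that a column created in mutate() can be used by a later expression in the same call. The names `mpg_to_kmpl` and `l_per_100km` are made up for this example.

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

km_per_mile <- 1.609344              # kilometers in one mile
liters_per_us_gallon <- 3.785411784  # liters in one US gallon

# Conversion factor from miles per gallon to kilometers per liter
# (approximately 0.425144).
mpg_to_kmpl <- km_per_mile / liters_per_us_gallon

converted_tbl <- mutate(
  mtcars_tbl,
  kmpl = mpg * mpg_to_kmpl,
  l_per_100km = 100 / kmpl  # uses the kmpl column created just above
)
```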

Sort rows based on one or more columns using the arrange() function.

For example, sort the mtcars tibble by descending mpg:

# Sort rows based on one or more columns
arranged_tbl <- arrange(mtcars_tbl, desc(mpg))
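When several columns are given, arrange() sorts by the first and uses the later ones to break ties. A short sketch, with the choice of columns and the name `sorted_tbl` being illustrative assumptions:

```r
library(dplyr)

mtcars_tbl <- as_tibble(mtcars)

# Sort by cyl (ascending), then by mpg (descending) within each cyl value.
sorted_tbl <- arrange(mtcars_tbl, cyl, desc(mpg))
```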

Group the data by one or more columns using the group_by() function and calculate summary statistics for each group using the summarise() function (summarise() and summarize() are synonyms).

For example, group the mtcars tibble by the cyl column and calculate the mean mpg for each group:

# Group the data by one or more columns and calculate summary statistics for each group
grouped_tbl <- group_by(mtcars_tbl, cyl) %>%
  summarise(mean_mpg = mean(mpg))
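The individual steps can also be chained into a single pipeline with %>%, so that no intermediate tibbles need to be named. The following is one possible arrangement, not the only sensible one; the summary columns `n_cars`, `mean_kmpl`, and `max_hp` are names invented for the example.

```r
library(dplyr)

summary_tbl <- mtcars %>%
  as_tibble() %>%
  select(mpg, cyl, hp) %>%             # keep only the columns needed
  mutate(kmpl = mpg * 0.425144) %>%    # miles per gallon to km per liter
  group_by(cyl) %>%
  summarise(
    n_cars = n(),                      # number of cars in each group
    mean_kmpl = mean(kmpl),
    max_hp = max(hp)
  ) %>%
  arrange(desc(mean_kmpl))             # most economical group first
```

Each verb takes a tibble as its first argument and returns a tibble, which is what makes this kind of chaining possible.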