Welcome to the dplyr Package!
Data wrangling is one of the most important steps in data science. In R, a wide range of wrangling tasks can be accomplished with dplyr, a core tidyverse package. This tutorial walks you through the package's main tools so that you can use them with confidence.
First, you’ll learn the pipe operator `%>%`, which lets you chain function calls so that the result of one step is passed directly to the next. Most of the functions in this tutorial are written in a piped style.
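As a rough illustration of the piped style, the sketch below writes the same two-step operation twice: once as nested calls and once with `%>%`. The built-in `mtcars` dataset and the object names `nested` and `piped` are assumptions chosen for this example, not part of the tutorial itself.

```r
library(dplyr)

# Nested style: must be read from the inside out
nested <- head(arrange(mtcars, desc(mpg)), 3)

# Piped style: reads top to bottom; each result feeds the next call
piped <- mtcars %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both styles perform the same computation
identical(nested, piped)
```

The piped version keeps each step on its own line, which tends to be easier to read and edit as the chain grows.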
Then, you’ll study the six most commonly used functions. Combined in different ways, these six functions handle the majority of data wrangling tasks.
- Select columns with `select()`. You’ll also learn many related techniques, including selection helper functions and the purrr style. These techniques are helpful for `select()` and many other dplyr functions.
- Filter rows with `filter()`.
- Modify or create new columns with `mutate()`.
- Create summary statistics with `summarize()`.
- Divide a dataset into groups with `group_by()`.
- Arrange rows with `arrange()`.
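To give a feel for how the six functions combine, here is a minimal sketch of a single pipeline that uses each of them once. It assumes the built-in `mtcars` dataset; the column choices, the weight threshold, and the new column names (`wt_kg`, `mean_mpg`, `mean_wt_kg`, `n`) are illustrative only.

```r
library(dplyr)

result <- mtcars %>%
  select(mpg, cyl, wt) %>%              # keep three columns
  filter(wt < 4) %>%                    # keep cars lighter than 4000 lbs
  mutate(wt_kg = wt * 453.592) %>%      # wt is in units of 1000 lbs; convert to kg
  group_by(cyl) %>%                     # one group per cylinder count
  summarize(
    mean_mpg   = mean(mpg),             # average fuel economy per group
    mean_wt_kg = mean(wt_kg),           # average weight per group
    n          = n()                    # number of cars per group
  ) %>%
  arrange(desc(mean_mpg))               # most economical group first

result
```

Each function is covered in detail in its own section; the point here is only that the steps compose naturally through the pipe.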
Next, you’ll learn additional functions, organized into the following sections. These functions help you handle more complicated tasks efficiently and flexibly.
- Functions that operate on rows. You’ll learn `distinct()` to select unique, non-duplicated rows, and the `slice_*()` family of functions to select rows (complementary to `filter()`).
- Functions that operate on columns. You’ll learn `glimpse()` to quickly inspect the structure and content of a dataset, `pull()` to extract values from a single column, `rename()` and `rename_with()` to rename columns, and `relocate()` to change column order.
- Column-wise and row-wise repeated operations. You’ll use `across()` to apply the same operation to multiple columns, and `rowwise()` and `c_across()` to perform operations row by row.
- Functions for paired datasets. This section covers a wide variety of techniques to merge (or subtract) two datasets, including mutating joins (`inner_join()`, `left_join()`, `right_join()`, and `full_join()`), filtering joins (`semi_join()` and `anti_join()`), `nest_join()`, and `cross_join()`. In addition, you’ll learn how to bind two or more datasets by columns or rows with `bind_cols()` and `bind_rows()`, and how to perform a variety of row-based set operations.
Finally, you’ll learn the advanced feature of data masking, which lets you incorporate these dplyr tools into your own functions.
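As a preview of what data masking looks like inside a self-defined function, the sketch below uses the embrace operator `{{ }}` to pass unquoted column names through to dplyr verbs. The helper name `mean_by`, its argument names, and the use of the built-in `mtcars` dataset are hypothetical choices made for this example; the tutorial may present the technique differently.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
# {{ }} forwards the caller's unquoted column names into dplyr verbs.
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(mean_value = mean({{ value_col }}, na.rm = TRUE))
}

# Column names are passed unquoted, just as in a direct dplyr call
by_cyl <- mean_by(mtcars, cyl, mpg)
by_cyl
```

Without `{{ }}`, dplyr would look for columns literally named `group_col` and `value_col`, which is why wrapping dplyr code in functions needs this extra step.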