22  R Reordering Exercise

22.1 Create some example data

Here we set up a data dictionary dd and some corresponding data ds. However, it is better if the order of the rows in the data dictionary dd match the order of the columns in the data ds.

Let’s examine what we’ve created:

22.2 Task: Reorder rows in dd in the order of ds’s columns

This assumes that every row of dd is in colnames(ds) and every colnames(ds) value is represented in dd. Perhaps that should be checked first.

22.3 Assumption Check Question

How would you check that every variable listed in the data dictionary dd is named in colnames(ds) and every colnames(ds) value is represented in the data dictionary dd?

Note that we should also check to see if the VARNAME’s are unique and the colnames of ds are unique.

22.4 Task: Reorder rows in dd to match the order of the columns in ds

Task: Reorder rows in the data dictionary dd to match the order of the columns in the data ds

  • What are various ways you could rearrange the rows of a data frame?

This works by assigning VARNAME to be the row names of dd, and then rearranging dd by row names.

We can use match also:

To better understand how this works, let’s look at the match function’s output:

The first element of the match output is 27. This means that the first element in colnames(ds), ID is the 27th element in dd$VARNAME. So when we do dd[match(colnames(ds), dd$VARNAME),], we’re moving the 27th row of dd into the first position, so as to put the ID row first. And so on. Thus dd[match(colnames(ds), dd$VARNAME),] rearranges the rows of dd into the order of the colnames(ds).

22.5 Question: use arrange?

Question: Is there a way to do this using arrange?

This does not work, because tidyverse wants to work on columns of data within dd:

You can’t use colnames(ds) directly in arrange because arrange wants to work on columns of data within dd, and colnames(ds) is not a column of dd. In the Tidyverse, it is generally required that you work with columns of data contained within the input data frame, and not with external data that lies outside of the input data frame such as, in this case, the column names of ds.

22.6 Question: use arrange?

Question: Is there a way to do this using arrange?

arrange() orders the rows of a data frame by the values of selected columns.

This works by creating a new column neworder that contains the position of the VARNAME in the colnames(ds). We then use arrange to sort by this new column, and then remove the new column.

22.7 Question: use slice

Question: Is there a way to do this using the slice command?

The slice() function lets you index rows by their (integer) locations.

Hint: Combine slice with match.

The match function returns a vector of the positions of (first) matches of its first argument in its second. So if you have a data frame with row names, you can use match to get the positions of the row names you want to select. You can then use slice to select the rows by position.

According to its documentation, slice lets you index rows by their (integer) locations. It allows you to select, remove, and duplicate rows. Here’s an example combining slice and match where we have a data frame with row names and we want to select the rows in the order “A”, “B”, “C”:

> df
  a b
B 1 3
C 2 2
A 3 1
> match(c("A","B","C"),rownames(df))
[1] 3 1 2
> df %>% slice(match(c("A","B","C"),rownames(df)))
  a b
A 3 1
B 1 3
C 2 2

Here the match vector of 3 1 2 tells us that row “A” is in position 3 in the rownames(df) vector, row “B” is in position 1, and row “C” is in position 2. So when we do slice(3,1,2) that moves row 3 into position 1, row 1 into position 2, and row 2 into position 3, as desired.

In our case, we can use match to get the positions of the VARNAME’s, and then use slice to rearrange the rows by position.

22.8 Question: use select?

Question: Is there a way to do this by transposing and then using select?

The t() function transposes a matrix or data frame. It generates a new data frame where the rows are the columns of the original data frame and the columns are the rows of the original data frame.

Here is an example of using t() to transpose a data frame:

> df
  a b
B 1 3
C 2 2
A 3 1
> t(df)
  B C A
a 1 2 3
b 3 2 1

When we use t(df) we get a new data frame where the rows are the columns of the original data frame and the columns are the rows of the original data frame.

We can then use select to rearrange the columns, and then transpose back to get the original data frame with the columns in the desired order.

So here we apply this to our data dictionary dd:

This works by transposing the data frame so that the rows become columns. We then use select to rearrange the columns, and then transpose back to get the original data frame with the columns in the desired order. We then rename the columns as needed.

22.9 Question: use row names

Question: What about using row names?

“While a tibble can have row names (e.g., when converting from a regular data frame), they are removed when subsetting with the [ operator. A warning will be raised when attempting to assign non-NULL row names to a tibble. Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column.”

From: https://tibble.tidyverse.org/reference/rownames.html