Tibbles

Mon, Jul 2, 2018 11 min read R

Creating Tibbles
Tibbles Versus data.frame
Interacting with Older Code
- Turn a tibble into dataframe using as.data.frame()
- Frustrations caused by subsetting default data.frame

**This post is heavily based on R for Data Science. Please consider buying that book if you find this post useful.**

The tibble package is part of the core tidyverse, so loading tidyverse also loads tibble.

library(tidyverse)

Creating Tibbles

Coerce into tibble using `as_tibble()`

Most other R packages use regular data frames, so you might want to coerce a data frame to a tibble. You can do that with as_tibble() (the older spelling as.tibble() is deprecated in current releases of tibble):

as_tibble(iris)

## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.10        3.50         1.40       0.200 setosa 
##  2         4.90        3.00         1.40       0.200 setosa 
##  3         4.70        3.20         1.30       0.200 setosa 
##  4         4.60        3.10         1.50       0.200 setosa 
##  5         5.00        3.60         1.40       0.200 setosa 
##  6         5.40        3.90         1.70       0.400 setosa 
##  7         4.60        3.40         1.40       0.300 setosa 
##  8         5.00        3.40         1.50       0.200 setosa 
##  9         4.40        2.90         1.40       0.200 setosa 
## 10         4.90        3.10         1.50       0.100 setosa 
## # ... with 140 more rows

New tibble from individual vectors using `tibble()`

tibble(
  x = 1:5,
  y = 1,
  z = x ^ 2 + y
)

## # A tibble: 5 x 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1    1.    2.
## 2     2    1.    5.
## 3     3    1.   10.
## 4     4    1.   17.
## 5     5    1.   26.

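Observe that the length-1 input for column y is recycled to fill every row, and that z can refer to x and y, which were defined earlier in the same call.

Compared with data.frame(), tibble() never changes the type of its inputs (e.g., it never converts strings to factors), never changes the names of the variables, and never creates row names.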
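
The difference is easy to check side by side. The snippet below is a small sketch; the column name `my col` is made up for illustration, and stringsAsFactors = TRUE is passed explicitly because data.frame() stopped converting strings to factors by default in R 4.0.0.

```r
library(tibble)

# data.frame() repairs the name (check.names = TRUE is its default)
# and, when asked to, converts the string column to a factor.
df <- data.frame(`my col` = "a", stringsAsFactors = TRUE)

# tibble() keeps both the name and the type exactly as given.
tb <- tibble(`my col` = "a")

names(df)           # the space is replaced by a dot
names(tb)           # the name is unchanged
class(df[[1]])      # factor
class(tb[[1]])      # character
```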

Nonsyntactic column names in tibble

It’s possible for a tibble to have column names that are not valid R variable names, aka nonsyntactic names. To refer to these variables, you need to surround them with backticks, `:

tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb

## # A tibble: 1 x 3
##   `:)`  ` `   `2000`
##   <chr> <chr> <chr> 
## 1 smile space number

You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.

Rename column names using `rename()`

The general form is new_tb <- rename(tb, new_col_name = old_col_name, ...), where the new name goes on the left of each = sign. For example:

tb_new <- rename(tb, smile = `:)`, space = ` `)
tb_new

## # A tibble: 1 x 3
##   smile space `2000`
##   <chr> <chr> <chr> 
## 1 smile space number

Data Entry with transposed tibble: `tribble()`

`tribble()` is customized for data entry in code: column headings are defined by formulas (i.e., they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy-to-read form:

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

## # A tibble: 2 x 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a        2.  3.60
## 2 b        1.  8.50

The comment line (the one starting with #) is added only to make it clear where the column header ends.

Convert vectors or lists to two-column data frames using `enframe`

enframe(x, name = "name", value = "value")

Example:

# natural sequence, unnamed vectors
enframe(1:3)

## # A tibble: 3 x 2
##    name value
##   <int> <int>
## 1     1     1
## 2     2     2
## 3     3     3

## named sequence
enframe(c(a = 5, b = 7))

## # A tibble: 2 x 2
##   name  value
##   <chr> <dbl>
## 1 a        5.
## 2 b        7.

enframe(c(a = 1, b = 2, c = 3))

## # A tibble: 3 x 2
##   name  value
##   <chr> <dbl>
## 1 a        1.
## 2 b        2.
## 3 c        3.
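
The name and value arguments in the signature above set the column names of the result. A short sketch, with the names "letter" and "count" chosen only for illustration:

```r
library(tibble)

# Same named vector as before, but with custom column names
counts <- enframe(c(a = 5, b = 7), name = "letter", value = "count")

names(counts)   # "letter" "count"
counts$letter   # the vector's names become the first column
counts$count    # the vector's values become the second column
```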

Tibbles Versus data.frame

There are two main differences in the usage of a tibble versus a classic data.frame: printing and subsetting.

Printing

Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. Printing a tibble shows only the first 10 rows and all the columns that fit on screen. In addition to its name, each column reports its type.

tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)

## # A tibble: 1,000 x 5
##    a                   b              c     d e    
##    <dttm>              <date>     <int> <dbl> <chr>
##  1 2018-07-10 01:58:26 2018-08-07     1 0.751 m    
##  2 2018-07-09 14:09:06 2018-07-17     2 0.503 c    
##  3 2018-07-10 05:34:49 2018-07-27     3 0.388 z    
##  4 2018-07-09 20:07:05 2018-08-01     4 0.520 n    
##  5 2018-07-10 04:54:28 2018-07-24     5 0.351 g    
##  6 2018-07-09 21:16:36 2018-08-04     6 0.504 r    
##  7 2018-07-09 10:23:43 2018-07-13     7 0.419 i    
##  8 2018-07-09 10:19:01 2018-07-23     8 0.260 c    
##  9 2018-07-09 19:16:31 2018-08-02     9 0.216 v    
## 10 2018-07-10 07:01:25 2018-07-19    10 0.341 e    
## # ... with 990 more rows

Display n rows and all columns using `print()`

Call print() with n set to the number of rows you want and width = Inf to display all columns:

nycflights13::flights %>%
  print(n = 10, width = Inf)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515        2.      830
##  2  2013     1     1      533            529        4.      850
##  3  2013     1     1      542            540        2.      923
##  4  2013     1     1      544            545       -1.     1004
##  5  2013     1     1      554            600       -6.      812
##  6  2013     1     1      554            558       -4.      740
##  7  2013     1     1      555            600       -5.      913
##  8  2013     1     1      557            600       -3.      709
##  9  2013     1     1      557            600       -3.      838
## 10  2013     1     1      558            600       -2.      753
##    sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
##             <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
##  1            819       11. UA        1545 N14228  EWR    IAH       227.
##  2            830       20. UA        1714 N24211  LGA    IAH       227.
##  3            850       33. AA        1141 N619AA  JFK    MIA       160.
##  4           1022      -18. B6         725 N804JB  JFK    BQN       183.
##  5            837      -25. DL         461 N668DN  LGA    ATL       116.
##  6            728       12. UA        1696 N39463  EWR    ORD       150.
##  7            854       19. B6         507 N516JB  EWR    FLL       158.
##  8            723      -14. EV        5708 N829AS  LGA    IAD        53.
##  9            846       -8. B6          79 N593JB  JFK    MCO       140.
## 10            745        8. AA         301 N3ALAA  LGA    ORD       138.
##    distance  hour minute time_hour          
##       <dbl> <dbl>  <dbl> <dttm>             
##  1    1400.    5.    15. 2013-01-01 05:00:00
##  2    1416.    5.    29. 2013-01-01 05:00:00
##  3    1089.    5.    40. 2013-01-01 05:00:00
##  4    1576.    5.    45. 2013-01-01 05:00:00
##  5     762.    6.     0. 2013-01-01 06:00:00
##  6     719.    5.    58. 2013-01-01 05:00:00
##  7    1065.    6.     0. 2013-01-01 06:00:00
##  8     229.    6.     0. 2013-01-01 06:00:00
##  9     944.    6.     0. 2013-01-01 06:00:00
## 10     733.    6.     0. 2013-01-01 06:00:00
## # ... with 3.368e+05 more rows

Control default printing behaviours by setting `options()`

You can also control the default print behavior by setting options:

- options(tibble.print_max = n, tibble.print_min = m): if a tibble has more than n rows, print only the first m rows. Use options(tibble.print_min = Inf) to always show all rows.
- options(tibble.width = Inf): always print all columns, regardless of the width of the screen.
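
As a rough sketch of how the two row options interact, the snippet below sets a threshold of 5 rows and a reduced display of 3 rows, prints a 10-row tibble, and then restores the previous settings. The specific numbers are arbitrary.

```r
library(tibble)

# Remember the previous values so they can be restored afterwards
old <- options(tibble.print_max = 5, tibble.print_min = 3)

tb <- tibble(x = 1:10)

# 10 rows exceeds print_max = 5, so only print_min = 3 rows are shown
# and the footer reports how many rows were left out.
out <- capture.output(print(tb))
cat(out, sep = "\n")

# Restore the earlier settings
options(old)
```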

Use RStudio’s Viewer

nycflights13::flights %>%
  View()

Control number of column names printed at the footer of a tibble using `print`

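According to the help page for tibble’s print method, the n_extra argument sets the number of extra columns for which abbreviated information is printed when the width is too small for the entire tibble. (Newer releases of tibble deprecate n_extra in favor of max_extra_cols.)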

print(as_tibble(mtcars), n_extra = 5, width = 20)

## # A tibble: 32 x 11
##      mpg   cyl
##  * <dbl> <dbl>
##  1  21.0    6.
##  2  21.0    6.
##  3  22.8    4.
##  4  21.4    6.
##  5  18.7    8.
##  6  18.1    6.
##  7  14.3    8.
##  8  24.4    4.
##  9  22.8    4.
## 10  19.2    6.
## # ... with 22 more
## #   rows, and 9
## #   more variables:
## #   disp <dbl>,
## #   hp <dbl>,
## #   drat <dbl>,
## #   wt <dbl>,
## #   qsec <dbl>, ...

The narrow width limits the number of columns displayed, so that the footer listing the remaining column names is visible.

Subsetting tibble using $ and [[.]]

Use $ and [[.]] to pull out a single variable.

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
# Extract by name
df$x

## [1] 0.51711995 0.05153991 0.63379396 0.58501986 0.65745106

df[["x"]]

## [1] 0.51711995 0.05153991 0.63379396 0.58501986 0.65745106

# Extract by position
df[[1]]

## [1] 0.51711995 0.05153991 0.63379396 0.58501986 0.65745106

Extract reference variables for subsetting tibbles

If the name of a variable is stored in an object, e.g. var <- "mpg", you can extract that column from a tibble with double brackets: df[[var]] (not df[["var"]], which looks for a column literally named var). You cannot use the dollar sign, because df$var would also look for a column named var.
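
A small sketch of the difference, using mtcars converted to a tibble (the object name cars is chosen here only for illustration):

```r
library(tibble)

cars <- as_tibble(mtcars)
var <- "mpg"

# [[ evaluates var, so this extracts the mpg column
by_object <- cars[[var]]

# $ takes the name literally and looks for a column called "var";
# there is none, so the result is NULL (tibble also warns about it).
by_dollar <- suppressWarnings(cars$var)

head(by_object)
is.null(by_dollar)
```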

Subsetting in a pipe using .

To use these in a pipe, you’ll need to use the special placeholder . :

df %>% .$x

## [1] 0.51711995 0.05153991 0.63379396 0.58501986 0.65745106

df %>% .[["x"]]

## [1] 0.51711995 0.05153991 0.63379396 0.58501986 0.65745106
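
If dplyr is loaded, pull() is another way to extract a single column inside a pipe without the placeholder. This is only a sketch of an alternative, with fixed values in place of random ones so the result is predictable:

```r
library(tibble)
library(dplyr)

df_fixed <- tibble(
  x = c(0.1, 0.2, 0.3),
  y = c(1, 2, 3)
)

# pull() returns the column as a plain vector, like $ and [[ ]]
x_values <- df_fixed %>% pull(x)
x_values
```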

Subset tibbles with nonsyntactic column names

An annoying tibble is defined below:

annoying <- tibble(
  `1` = 1:10,
  `2` = `1` * 2 + rnorm(length(`1`))
)

Extracting the variable called 1.

annoying$`1` # or

##  [1]  1  2  3  4  5  6  7  8  9 10

annoying[["1"]]

##  [1]  1  2  3  4  5  6  7  8  9 10

Note that backticks are not needed inside [[ ]], because the column name is passed there as a quoted string.

Plotting a scatterplot of 1 vs 2.

ggplot(data = annoying) +
  geom_point(mapping = aes(x = `1`, y = `2`))

A new column 3 which is 2 divided by 1:

annoying3 <- annoying %>%
  mutate(`3` = `2` / `1`)  # inside mutate(), backticks refer to the columns directly

## or an easier way: insert straight away

annoying4 = annoying
annoying4[["3"]] = annoying4[["2"]]/ annoying4[["1"]]
annoying4

## # A tibble: 10 x 3
##      `1`   `2`   `3`
##    <int> <dbl> <dbl>
##  1     1  3.97  3.97
##  2     2  5.72  2.86
##  3     3  8.75  2.92
##  4     4 10.5   2.63
##  5     5  9.77  1.95
##  6     6 11.6   1.93
##  7     7 15.1   2.16
##  8     8 15.0   1.88
##  9     9 17.3   1.92
## 10    10 19.7   1.97

## or

annoying5 = annoying
annoying5$`3` = annoying5[["2"]] / annoying5[["1"]]
annoying5

## # A tibble: 10 x 3
##      `1`   `2`   `3`
##    <int> <dbl> <dbl>
##  1     1  3.97  3.97
##  2     2  5.72  2.86
##  3     3  8.75  2.92
##  4     4 10.5   2.63
##  5     5  9.77  1.95
##  6     6 11.6   1.93
##  7     7 15.1   2.16
##  8     8 15.0   1.88
##  9     9 17.3   1.92
## 10    10 19.7   1.97

Check if an object is a tibble using `class()`

mtcars is a data.frame object:

class(mtcars)

## [1] "data.frame"

You can recognize a tibble when it is printed, because it shows only the first 10 rows and reports the type of each column. You can also check the class directly: tibbles have the classes "tbl_df" and "tbl" in addition to "data.frame":

class(as_tibble(mtcars))

## [1] "tbl_df"     "tbl"        "data.frame"
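
Comparing class vectors by eye works, but tibble also provides is_tibble(), which returns a single TRUE or FALSE. A brief sketch:

```r
library(tibble)

# A plain data frame is not a tibble
plain_check <- is_tibble(mtcars)

# The converted object is
tibble_check <- is_tibble(as_tibble(mtcars))

# inherits() gives the same answer from base R
inherits_check <- inherits(as_tibble(mtcars), "tbl_df")

c(plain_check, tibble_check, inherits_check)
```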

Interacting with Older Code

Turn a tibble into dataframe using `as.data.frame()`

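Some older functions don’t work with tibbles. The main reason is the [ function: with base R data frames, [ sometimes returns a data frame and sometimes returns a vector, whereas with tibbles [ always returns another tibble. (R for Data Science rarely uses [ at all, because dplyr::filter() and dplyr::select() solve the same problems with clearer code.) If you encounter one of these functions, use as.data.frame() to turn a tibble back into a data.frame: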

class(as.data.frame(tb))

## [1] "data.frame"

Frustrations caused by subsetting default `data.frame`

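Let’s define a data frame df and perform some subsetting operations on it.

# stringsAsFactors = TRUE reproduces the factor output below on R >= 4.0.0
df <- data.frame(abc = 1, xyz = "a", stringsAsFactors = TRUE)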
df$x

## [1] a
## Levels: a

df[, "xyz"]

## [1] a
## Levels: a

df[, c("abc", "xyz")]

##   abc xyz
## 1   1   a

First, a data frame partially matches column names, so df$x is treated the same as df$xyz. This makes it easy to pick up an unintended variable without noticing.

Second, [ returns a vector when one column is selected but a data frame when more than one is selected (a tibble always returns a tibble). This is problematic if you are passing df[, vars] and the number of variables in vars is unknown: you have to write extra code to handle both cases.

Let’s run the same code on a tibble instead:
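
For plain data frames, the usual base R workaround is the drop = FALSE argument of [, which keeps the result a data frame even when a single column is selected. A minimal sketch (vars is a stand-in for a selection whose length is not known in advance):

```r
df <- data.frame(abc = 1, xyz = "a")
vars <- "xyz"

# Default behaviour: a single column is simplified to a vector
simplified <- df[, vars]

# drop = FALSE keeps the data frame structure regardless of length(vars)
kept <- df[, vars, drop = FALSE]

is.data.frame(simplified)
is.data.frame(kept)
```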

tb <- as_tibble(df)
tb$x

## NULL

tb[, "xyz"]

## # A tibble: 1 x 1
##   xyz  
##   <fct>
## 1 a

tb[, c("abc", "xyz")]

## # A tibble: 1 x 2
##     abc xyz  
##   <dbl> <fct>
## 1    1. a
