timeplyr - Fast Tidy Functions for Date and Time Manipulation

Nick Christofides

I am..

  • Healthcare Analyst primarily focused on working with health and social care data

  • R programmer (duh)

  • Someone that loves open source, learning new things and solving problems

  • Rock and heavy metal drummer

What is timeplyr?

A package designed to make it easier to manipulate messy, large longitudinal and time-based datasets with potentially many groups.

It was inspired by my work analysing COVID-19 data consisting of millions of individuals.

It’s a package of functions I find useful!

It can for example:

✅ Aggregate dates and date-times into any time interval, such as days, weeks, months and years

✅ Identify and complete missing gaps in time

✅ Apply rolling calculations for large numbers of groups, with respect to a time index

✅ Calculate key episodic information based on events separated by a specified threshold of time

It cannot:

❌Perform advanced time-series analysis

❌Advanced data visualisation

❌Modelling or machine-learning

Why timeplyr?

  • Fast - Uses a mix of collapse, data.table, dplyr and custom C++ to be as fast and efficient as possible

  • Consistent - Uses tidy principles where possible with emphasis on code safety

  • Developing - It is continuously being built and improved upon

Some simple examples

Using a well-known tidyverse dataset of flights departing from 3 NYC airports in 2013.

library(tidyverse)
library(timeplyr)
library(nycflights13)
library(bench)

Gaps in time

We can easily check for missing gaps in time with time_num_gaps()

# Adding a date variable (day of flight)
flights <- flights |>
  mutate(date = as_date(time_hour))

# Check for hours with no flights
(n_gaps <- time_num_gaps(flights$time_hour))
[1] 1819

We can also use time_num_gaps() with an additional argument g to find the number of gaps for each group, in this case, each airport.

time_num_gaps(flights$time_hour, g = flights$origin)
 EWR  JFK  LGA 
2489 1820 2468 

Completing gaps

hourly_flights <- flights |>
  count(time_hour)
hourly_flights
# A tibble: 6,936 × 2
   time_hour               n
   <dttm>              <int>
 1 2013-01-01 05:00:00     6
 2 2013-01-01 06:00:00    52
 3 2013-01-01 07:00:00    49
 4 2013-01-01 08:00:00    58
 5 2013-01-01 09:00:00    56
 6 2013-01-01 10:00:00    39
 7 2013-01-01 11:00:00    37
 8 2013-01-01 12:00:00    56
 9 2013-01-01 13:00:00    54
10 2013-01-01 14:00:00    48
# ℹ 6,926 more rows

Here we complete the hourly gaps using time_complete()

completed <- hourly_flights |>
  time_complete(time_hour)

# Sanity check
(nrow(completed) - nrow(hourly_flights) ) == n_gaps
[1] TRUE

Monthly summaries

Using time_count() we can very quickly count the number of flights each month.

flights |> 
  time_count(date, time_by = "month")
# A tibble: 12 × 2
   date           n
   <date>     <int>
 1 2013-01-01 27004
 2 2013-02-01 24951
 3 2013-03-01 28834
 4 2013-04-01 28330
 5 2013-05-01 28796
 6 2013-06-01 28243
 7 2013-07-01 29425
 8 2013-08-01 29327
 9 2013-09-01 27574
10 2013-10-01 28889
11 2013-11-01 27268
12 2013-12-01 28135

We can make our implicit month interval into a more explicit one with the as_interval argument.

flights |> 
  time_count(date, time_by = "month", as_interval = TRUE)
# A tibble: 12 × 2
                       date     n
                  <tm_intv> <int>
 1 [2013-01-01, 2013-02-01) 27004
 2 [2013-02-01, 2013-03-01) 24951
 3 [2013-03-01, 2013-04-01) 28834
 4 [2013-04-01, 2013-05-01) 28330
 5 [2013-05-01, 2013-06-01) 28796
 6 [2013-06-01, 2013-07-01) 28243
 7 [2013-07-01, 2013-08-01) 29425
 8 [2013-08-01, 2013-09-01) 29327
 9 [2013-09-01, 2013-10-01) 27574
10 [2013-10-01, 2013-11-01) 28889
11 [2013-11-01, 2013-12-01) 27268
12 [2013-12-01, 2014-01-01) 28135

year_months

timeplyr has it’s own year-month class, inspired by the excellent zoo and tsibble packages

year_month(today())
[1] "2024 Jan"

It is very easy to create a sequence of year_months

year_month(today()) + 0:11
 [1] "2024 Jan" "2024 Feb" "2024 Mar" "2024 Apr" "2024 May" "2024 Jun"
 [7] "2024 Jul" "2024 Aug" "2024 Sep" "2024 Oct" "2024 Nov" "2024 Dec"

time_by

The time_by() function is similar to dplyr’s group_by() but with an emphasis on dates and date-times.

Let’s work with time_intervals throughout by setting the timeplyr.use_intervals option to true.

options(timeplyr.use_intervals = TRUE)
options(timeplyr.interval_sub_formatter = \(x) format(x, "%b-%d"))

Number of flights and mean departure delay per quarter

flights |>
  time_by(date, "3 months") |>
  summarise(n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE))
# A tibble: 4 × 3
              date     n mean_dep_delay
         <tm_intv> <int>          <dbl>
1 [Jan-01, Apr-01) 80789          11.4 
2 [Apr-01, Jul-01) 85369          15.9 
3 [Jul-01, Oct-01) 86326          13.8 
4 [Oct-01, Jan-01) 84292           9.36

Diving into the arguments of time_count

This is just the time variable we want to specify.

flights |> 
  time_count(time = date)
# A tibble: 365 × 2
               date     n
          <tm_intv> <int>
 1 [Jan-01, Jan-02)   842
 2 [Jan-02, Jan-03)   943
 3 [Jan-03, Jan-04)   914
 4 [Jan-04, Jan-05)   915
 5 [Jan-05, Jan-06)   720
 6 [Jan-06, Jan-07)   832
 7 [Jan-07, Jan-08)   933
 8 [Jan-08, Jan-09)   899
 9 [Jan-09, Jan-10)   902
10 [Jan-10, Jan-11)   932
# ℹ 355 more rows

These are additional variables to count which use dplyr style data-masking

flights |>
  time_count(date, across(where(is.character)))
# A tibble: 316,497 × 6
               date carrier tailnum origin dest      n
          <tm_intv> <chr>   <chr>   <chr>  <chr> <int>
 1 [Jan-01, Jan-02) 9E      N602LR  JFK    CVG       1
 2 [Jan-01, Jan-02) 9E      N604LR  JFK    MSP       1
 3 [Jan-01, Jan-02) 9E      N836AY  JFK    PHL       1
 4 [Jan-01, Jan-02) 9E      N8409N  JFK    SYR       1
 5 [Jan-01, Jan-02) 9E      N8444F  JFK    CLE       1
 6 [Jan-01, Jan-02) 9E      N8444F  JFK    IAD       1
 7 [Jan-01, Jan-02) 9E      N8515F  JFK    RDU       1
 8 [Jan-01, Jan-02) 9E      N8598B  JFK    BWI       1
 9 [Jan-01, Jan-02) 9E      N8611A  JFK    RDU       1
10 [Jan-01, Jan-02) 9E      N8631E  JFK    ROC       1
# ℹ 316,487 more rows

This argument allows us to control which time unit we will aggregate our time variable to.

flights |> 
  time_count(date, time_by = "2 weeks")
# A tibble: 27 × 2
               date     n
          <tm_intv> <int>
 1 [Jan-01, Jan-15) 12208
 2 [Jan-15, Jan-29) 12078
 3 [Jan-29, Feb-12) 12173
 4 [Feb-12, Feb-26) 12649
 5 [Feb-26, Mar-12) 13009
 6 [Mar-12, Mar-26) 13100
 7 [Mar-26, Apr-09) 13145
 8 [Apr-09, Apr-23) 13239
 9 [Apr-23, May-07) 13080
10 [May-07, May-21) 13019
# ℹ 17 more rows

If left unspecified, most of the time the time_ functions will try to guess the time granularity through the greatest common divisor of time differences.

abs(gcd_diff(flights$time_hour))
[1] 3600
time_diff_gcd(flights$time_hour)
$numeric
[1] 3600

We see gcd_diff() returns 3600, which is the number of seconds in an hour.

This allows us to specify start and end times

flights |> 
  time_count(date, time_by = "2 weeks",
             from = dmy("01-December-2013"),
             .by = origin)
# A tibble: 12 × 3
   origin             date      n
   <chr>         <tm_intv>  <int>
 1 EWR    [Dec-01, Dec-15)   4596
 2 EWR    [Dec-15, Dec-29)   4398
 3 EWR    [Dec-29, Jan-12)    928
 4 EWR    NA               110913
 5 JFK    [Dec-01, Dec-15)   4049
 6 JFK    [Dec-15, Dec-29)   4187
 7 JFK    [Dec-29, Jan-12)    910
 8 JFK    NA               102133
 9 LGA    [Dec-01, Dec-15)   4271
10 LGA    [Dec-15, Dec-29)   4002
11 LGA    [Dec-29, Jan-12)    794
12 LGA    NA                95595

time_type controls whether we use lubridate style durations or periods.
By default, durations are used for anything more granular than a day, and periods otherwise.

options(timeplyr.interval_sub_formatter = identity)
flights |> 
  time_count(date, time_by = "3 months",
             time_type = "period")
# A tibble: 4 × 2
                      date     n
                 <tm_intv> <int>
1 [2013-01-01, 2013-04-01) 80789
2 [2013-04-01, 2013-07-01) 85369
3 [2013-07-01, 2013-10-01) 86326
4 [2013-10-01, 2014-01-01) 84292
flights |> 
  time_count(date, time_by = "3 months",
             time_type = "duration")
# A tibble: 4 × 2
                                        date     n
                                   <tm_intv> <int>
1 [2013-01-01, 2013-04-02 07:30:00)          82742
2 [2013-04-02 07:30:00, 2013-07-02 15:00:00) 85327
3 [2013-07-02 15:00:00, 2013-10-01 22:30:00) 85380
4 [2013-10-01 22:30:00, 2014-01-01 06:00:00) 83327

time_floor Takes the minimum time and floors it to the nearest specified unit (excluding the multiple). If for example you specified time_by = "3 months", time_floor = TRUE would round the earliest time to the nearest month.

flights |> 
    time_count(date, time_by = "3 months",
               from = dmy("22-August-2013"),
               time_floor = FALSE)
# A tibble: 3 × 2
                      date      n
                 <tm_intv>  <int>
1 [2013-08-22, 2013-11-22)  85195
2 [2013-11-22, 2014-02-22)  35871
3 NA                       215710
flights |> 
    time_count(date, time_by = "3 months",
               from = dmy("22-August-2013"),
               time_floor = TRUE)
# A tibble: 3 × 2
                      date      n
                 <tm_intv>  <int>
1 [2013-08-01, 2013-11-01)  65663
2 [2013-11-01, 2014-02-01)  55403
3 NA                       215710

This argument controls whether the start of the implicit time interval is returned, or whether an explicit time_interval is returned.

flights |> 
  time_count(date, time_by = "months", as_interval = FALSE)
# A tibble: 12 × 2
   date           n
   <date>     <int>
 1 2013-01-01 27004
 2 2013-02-01 24951
 3 2013-03-01 28834
 4 2013-04-01 28330
 5 2013-05-01 28796
 6 2013-06-01 28243
 7 2013-07-01 29425
 8 2013-08-01 29327
 9 2013-09-01 27574
10 2013-10-01 28889
11 2013-11-01 27268
12 2013-12-01 28135
flights |> 
  time_count(date, time_by = "months", as_interval = TRUE)
# A tibble: 12 × 2
                       date     n
                  <tm_intv> <int>
 1 [2013-01-01, 2013-02-01) 27004
 2 [2013-02-01, 2013-03-01) 24951
 3 [2013-03-01, 2013-04-01) 28834
 4 [2013-04-01, 2013-05-01) 28330
 5 [2013-05-01, 2013-06-01) 28796
 6 [2013-06-01, 2013-07-01) 28243
 7 [2013-07-01, 2013-08-01) 29425
 8 [2013-08-01, 2013-09-01) 29327
 9 [2013-09-01, 2013-10-01) 27574
10 [2013-10-01, 2013-11-01) 28889
11 [2013-11-01, 2013-12-01) 27268
12 [2013-12-01, 2014-01-01) 28135

Sometimes the former is more useful, especially for plotting and formatting, for example:

flights |> 
  time_count(date, time_by = "months", as_interval = FALSE) |> 
  mutate(month = format(date, "%B"))
# A tibble: 12 × 3
   date           n month    
   <date>     <int> <chr>    
 1 2013-01-01 27004 January  
 2 2013-02-01 24951 February 
 3 2013-03-01 28834 March    
 4 2013-04-01 28330 April    
 5 2013-05-01 28796 May      
 6 2013-06-01 28243 June     
 7 2013-07-01 29425 July     
 8 2013-08-01 29327 August   
 9 2013-09-01 27574 September
10 2013-10-01 28889 October  
11 2013-11-01 27268 November 
12 2013-12-01 28135 December 

Time series

We can easily convert time series to a (long) tibble

eu_stock <- EuStockMarkets |>
  ts_as_tibble()
eu_stock
# A tibble: 7,440 × 3
   group  time value
   <chr> <dbl> <dbl>
 1 DAX   1991. 1629.
 2 DAX   1992. 1614.
 3 DAX   1992. 1607.
 4 DAX   1992. 1621.
 5 DAX   1992. 1618.
 6 DAX   1992. 1611.
 7 DAX   1992. 1631.
 8 DAX   1992. 1640.
 9 DAX   1992. 1635.
10 DAX   1992. 1646.
# ℹ 7,430 more rows

ts_as_tibble() currently supports ts, xts, zoo and timeSeries objects.

Quick plotting

We can also plot univariate and multivariate time-series data using time_ggplot() which is just a ggplot wrapper that applies some sensible axes and formatting.

# #| output-location: column
eu_stock |> 
  time_ggplot(time, value, group = group)

Rolling functions

timeplyr has a set of optimised functions, utilising data.table’s excellent frollmean() and friends. It builds upon these by incorporating a fast method for grouped calculations, as well as incorporating time-based windows.

Combining timeplyr and data.table

Benchmark (unsorted data, 10,000 groups, 1m rows)

library(data.table)
df <- data.table(x = rnorm(10^6),
                 g = sample.int(10^4, 10^6, TRUE))
mark(df[, dt_mean := frollmean(x, n = 7, 
                               align = "right", na.rm = FALSE),
        by = g])
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 "df[, `:=`(dt_mean, frollmean(x, n … 425ms  469ms      2.13     179MB     21.3
mark(df[, tp_mean := roll_mean(x, window = 7, 
                               na.rm = FALSE, partial = FALSE, 
                               g = g)])
#> # A tibble: 1 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 df[, `:=`(tp_mean, roll_mean(x, win…  27ms 27.3ms      36.7      47MB     202.
all.equal(df$dt_mean, df$tp_mean)
#> [1] TRUE

Combining timeplyr and data.table

We can use window_sequence() with data.table’s adaptive argument to easily calculate rolling stats with partial windows.

x <- 1:10
w <- window_sequence(length(x), k = length(x))
data.table::frollsum(x, n = w, adaptive = TRUE)
 [1]  1  3  6 10 15 21 28 36 45 55

More rolling madness

An example of a rolling sum, by group, with respect to a time index

x <- 1:20
g <- seq_id(rep(5, 4))
t <- today() + days(0:19)
df <- tibble(x, g, t)
df |> 
  mutate(sum = 
           time_roll_sum(x, time = t, g = g, 
                         window = "3 days", partial = FALSE))
# A tibble: 20 × 4
       x     g t            sum
   <int> <int> <date>     <dbl>
 1     1     1 2024-01-26    NA
 2     2     1 2024-01-27    NA
 3     3     1 2024-01-28     6
 4     4     1 2024-01-29     9
 5     5     1 2024-01-30    12
 6     6     2 2024-01-31    NA
 7     7     2 2024-02-01    NA
 8     8     2 2024-02-02    21
 9     9     2 2024-02-03    24
10    10     2 2024-02-04    27
11    11     3 2024-02-05    NA
12    12     3 2024-02-06    NA
13    13     3 2024-02-07    36
14    14     3 2024-02-08    39
15    15     3 2024-02-09    42
16    16     4 2024-02-10    NA
17    17     4 2024-02-11    NA
18    18     4 2024-02-12    51
19    19     4 2024-02-13    54
20    20     4 2024-02-14    57

Events and Episodes

It is surprisingly common to analyse time-based episodes in health datasets.
Examples include identifying disease reinfection, or hospital readmission. The data may contain many records in quick succession, and so certain rules are applied to identify which ones are distinct episodes. One such criteria is setting a time threshold between events.

In this example we will show how to find reinfections in fictitious data of 1000 individuals receiving a test for an unknown (to us) disease, on average every 2 weeks. A third of these will be positive results. The aim is to classify these positives as new episodes when 42 days pass between the previous positive result.

Events and Episodes

Data setup

set.seed(912398123)
id <- 1:1000
initial_date <- today()
time_bw_tests <- rexp(10^3 * 10, 1/14)

initial_df <- tibble(id, test_date = initial_date)

df <- initial_df |> 
    bind_rows(
        tibble(id = rep(id, 10), time_bw_tests) 
    )

df <- df |> 
    mutate(time_bw_tests = replace_na(time_bw_tests, 0))

df <- df |> 
    group_by(id) %>%
    mutate(test_date = today() + cumsum(as.integer(time_bw_tests))) %>%
    mutate(test_date = replace_na(test_date, today()))

events <- df %>%
    slice_sample(prop = 1/3)

Events and Episodes

Calculation

# 1000 individuals, 10 tests each
# On average they take 1 test every 2 weeks

events <- events %>%
    time_episodes(test_date, time_by = "days", window = 42)
events
# A tibble:        3,000 x 6
# Groups:          id [1,000]
# Episodes:        N: 1,784, Median: 2, Mean: 1.78 ▅▁▁▇▁▁▂
# Time b/w events: Pooled mean: ~1.33 months
# Threshold:       ~1.38 months
      id test_date  t_elapsed ep_start   ep_id ep_id_new
 * <int> <date>         <dbl> <date>     <int>     <int>
 1     1 2024-02-11         0 2024-02-11     1         1
 2     1 2024-03-13        31 2024-02-11     1         0
 3     1 2024-05-04        52 2024-05-04     2         2
 4     2 2024-01-28         0 2024-01-28     1         1
 5     2 2024-03-26        58 2024-03-26     2         2
 6     2 2024-04-09        14 2024-03-26     2         0
 7     3 2024-03-01        18 2024-02-12     1         0
 8     3 2024-03-23        22 2024-02-12     1         0
 9     3 2024-02-12         0 2024-02-12     1         1
10     4 2024-07-24       180 2024-07-24     2         2
# ℹ 2,990 more rows
reinfections <- events %>%
    filter(ep_id_new > 1)
reinfections
# A tibble:        784 x 6
# Groups:          id [666]
# Episodes:        N: 784, Median: 1, Mean: 1.18 ▇▁▁▁▁▁▂
# Time b/w events: Pooled mean: ~2.5 months
# Threshold:       ~1.38 months
      id test_date  t_elapsed ep_start   ep_id ep_id_new
   <int> <date>         <dbl> <date>     <int>     <int>
 1     1 2024-05-04        52 2024-05-04     2         2
 2     2 2024-03-26        58 2024-03-26     2         2
 3     4 2024-07-24       180 2024-07-24     2         2
 4     5 2024-04-15        52 2024-04-15     2         2
 5     6 2024-04-16        51 2024-04-16     2         2
 6     8 2024-05-02        97 2024-05-02     2         2
 7     9 2024-03-25        42 2024-03-25     2         2
 8    12 2024-05-08        46 2024-05-08     2         2
 9    13 2024-06-01        53 2024-06-01     2         2
10    17 2024-05-20        90 2024-05-20     2         2
# ℹ 774 more rows

More helpers

time_seq() is like seq() but can accept a mix of dates and date-times

today <- today()
next_week <- as_datetime(today + dweeks(1))
time_seq(today, next_week, "days")
[1] "2024-01-26 UTC" "2024-01-27 UTC" "2024-01-28 UTC" "2024-01-29 UTC"
[5] "2024-01-30 UTC" "2024-01-31 UTC" "2024-02-01 UTC" "2024-02-02 UTC"
seq(today, next_week, "days")
Error in seq.Date(today, next_week, "days"): 'to' must be a "Date" object
# Vectorised version
time_seq_v(today, next_week, list(days = 1:3)) # from/to/by 
 [1] "2024-01-26 UTC" "2024-01-27 UTC" "2024-01-28 UTC" "2024-01-29 UTC"
 [5] "2024-01-30 UTC" "2024-01-31 UTC" "2024-02-01 UTC" "2024-02-02 UTC"
 [9] "2024-01-26 UTC" "2024-01-28 UTC" "2024-01-30 UTC" "2024-02-01 UTC"
[13] "2024-01-26 UTC" "2024-01-29 UTC" "2024-02-01 UTC"
time_seq_v2(1:3, from = today, time_by = "days") |> # length/from/by
  setNames(seq_id(1:3))
           1            2            2            3            3            3 
"2024-01-26" "2024-01-26" "2024-01-27" "2024-01-26" "2024-01-27" "2024-01-28" 

A convenience function to calculate time differences

today <- today()
next_year <- today + years(1)
time_diff(today, next_year, time_by = "days")
[1] 366
# Compared to lubridate
mark(time_diff(flights$time_hour, today(), "years"),
     interval(flights$time_hour, today()) / years(1), 
     min_iterations = 3)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                        <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 "time_diff(flights$time_hour, to… 15.43ms 16.16ms    50.7      20.9MB     7.80
#> 2 "interval(flights$time_hour, tod…   1.32s   1.32s     0.758   654.9MB     6.57

group_id() assigns each group a unique id and can be useful to use interactively

iris |> 
  mutate(id = group_id(Species)) |>
  distinct(Species, id)
     Species id
1     setosa  1
2 versicolor  2
3  virginica  3

An enhanced diff()

x <- 1:10
y <- time_seq(today, len = 20, time_by = "weeks")
diff(x)
[1] 1 1 1 1 1 1 1 1 1
time_elapsed(x)
 [1] NA  1  1  1  1  1  1  1  1  1
diff(y)
Time differences in days
 [1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
time_elapsed(y, "weeks")
 [1] NA  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
time_elapsed(y, "days")
 [1] NA  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7
time_elapsed(y, "weeks", rolling = FALSE)
 [1]  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

A function that prioritises pretty-looking breaks

time_breaks(flights$time_hour)
[1] "2013-01-01 05:00:00 EST" "2013-04-01 05:00:00 EDT"
[3] "2013-07-01 05:00:00 EDT" "2013-10-01 05:00:00 EDT"
time_breaks(flights$time_hour, n = 10)
[1] "2013-01-01 05:00:00 EST" "2013-03-01 05:00:00 EST"
[3] "2013-05-01 05:00:00 EDT" "2013-07-01 05:00:00 EDT"
[5] "2013-09-01 05:00:00 EDT" "2013-11-01 05:00:00 EDT"
time_cut(flights$date, n = 10, as_interval = TRUE) |> 
  interval_count()
# A tibble: 6 × 2
                  interval     n
                 <tm_intv> <int>
1 [2013-01-01, 2013-03-01) 51955
2 [2013-03-01, 2013-05-01) 57164
3 [2013-05-01, 2013-07-01) 57039
4 [2013-07-01, 2013-09-01) 58752
5 [2013-09-01, 2013-11-01) 56463
6 [2013-11-01, 2014-01-01) 55403

Identify breaks in a sequence using e.g. time_seq_id time_is_regular, time_num_gaps

hours <- sort(unique(flights$time_hour))
time_is_regular(hours)
[1] TRUE
time_is_regular(hours, allow_gaps = FALSE, allow_dups = FALSE)
[1] FALSE
id <- time_seq_id(hours, time_by = "hours", threshold = 5) # >5 hours
head(hours[id != lead(id)], n = 10)
 [1] "2013-01-01 23:00:00 EST" "2013-01-02 23:00:00 EST"
 [3] "2013-01-03 23:00:00 EST" "2013-01-04 23:00:00 EST"
 [5] "2013-01-05 23:00:00 EST" "2013-01-06 23:00:00 EST"
 [7] "2013-01-07 23:00:00 EST" "2013-01-08 23:00:00 EST"
 [9] "2013-01-09 23:00:00 EST" "2013-01-10 23:00:00 EST"
time_num_gaps(hours)
[1] 1819

Fast dplyr alternatives

timeplyr has a set of fast dplyr alternative functions, which will likely be moved to a package of their own

This is very similar to dplyr::group_by except for an attribute sorted that specifies whether the groups are sorted normally or sorted in order-of-first appearance. This is analogous to the difference between the by and keyby construct in data.table.

flights |> 
  fgroup_by(month) |> 
  summarise(number_of_flights = n())
# A tibble: 12 × 2
   month number_of_flights
   <int>             <int>
 1     1             27004
 2     2             24951
 3     3             28834
 4     4             28330
 5     5             28796
 6     6             28243
 7     7             29425
 8     8             29327
 9     9             27574
10    10             28889
11    11             27268
12    12             28135
flights |> 
  fgroup_by(month, order = FALSE) |> 
  summarise(number_of_flights = n())
# A tibble: 12 × 2
   month number_of_flights
   <int>             <int>
 1     1             27004
 2    10             28889
 3    11             27268
 4    12             28135
 5     2             24951
 6     3             28834
 7     4             28330
 8     5             28796
 9     6             28243
10     7             29425
11     8             29327
12     9             27574
# Same as..
flights |> 
  summarise(number_of_flights = n(), .by = month)
# A tibble: 12 × 2
   month number_of_flights
   <int>             <int>
 1     1             27004
 2    10             28889
 3    11             27268
 4    12             28135
 5     2             24951
 6     3             28834
 7     4             28330
 8     5             28796
 9     6             28243
10     7             29425
11     8             29327
12     9             27574
flights |> 
    fcount(across(where(is_time)))
# A tibble: 6,936 × 3
   time_hour           date           n
   <dttm>              <date>     <int>
 1 2013-01-01 05:00:00 2013-01-01     6
 2 2013-01-01 06:00:00 2013-01-01    52
 3 2013-01-01 07:00:00 2013-01-01    49
 4 2013-01-01 08:00:00 2013-01-01    58
 5 2013-01-01 09:00:00 2013-01-01    56
 6 2013-01-01 10:00:00 2013-01-01    39
 7 2013-01-01 11:00:00 2013-01-01    37
 8 2013-01-01 12:00:00 2013-01-01    56
 9 2013-01-01 13:00:00 2013-01-01    54
10 2013-01-01 14:00:00 2013-01-01    48
# ℹ 6,926 more rows
flights |> 
  fcount(.cols = "origin", order = FALSE)
# A tibble: 3 × 2
  origin      n
  <chr>   <int>
1 EWR    120835
2 LGA    104662
3 JFK    111279
# 1st row of each destination
flights |> 
  fslice(1, .by = dest)
# A tibble: 105 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     1     1955           2001        -6     2213           2248
 2  2013    10     1     1149           1159       -10     1245           1259
 3  2013     1     1     1315           1317        -2     1413           1423
 4  2013     7     6     1629           1615        14     1954           1953
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      905            905         0     1309           1229
 7  2013     1     1      946            959       -13     1146           1202
 8  2013     1     1     1318           1322        -4     1358           1416
 9  2013    10     1     1338           1105       153     1446           1245
10  2013     1     2     2044           2005        39     2229           2158
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
flights |> 
  fslice_head(n = 1, .by = dest)
# A tibble: 105 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     1     1955           2001        -6     2213           2248
 2  2013    10     1     1149           1159       -10     1245           1259
 3  2013     1     1     1315           1317        -2     1413           1423
 4  2013     7     6     1629           1615        14     1954           1953
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      905            905         0     1309           1229
 7  2013     1     1      946            959       -13     1146           1202
 8  2013     1     1     1318           1322        -4     1358           1416
 9  2013    10     1     1338           1105       153     1446           1245
10  2013     1     2     2044           2005        39     2229           2158
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# Last row of each destination
flights |> 
  fslice_tail(n = 1, .by = dest)
# A tibble: 105 × 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     9    30     1959           2001        -2     2209           2248
 2  2013     9    30     1156           1159        -3     1253           1259
 3  2013     9    24     2151           2159        -8     2240           2303
 4  2013     8    24     1633           1625         8     1959           2003
 5  2013     9    30     2053           1815       158     2310           2054
 6  2013     9    30     2052           2054        -2     2306           2358
 7  2013     9    30     1641           1638         3     1833           1829
 8  2013     9    28      741            738         3      823            835
 9  2013     9    30     2203           2205        -2     2339           2331
10  2013     9    30     1939           1950       -11     2112           2141
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# Random 10% of flights by origin airport
flights |> 
  fgroup_by(origin, order = FALSE) |> 
  fslice_sample(prop = 0.1)
# A tibble: 33,676 × 20
# Groups:   origin [3]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     8    28     1818           1805        13     2013           2033
 2  2013     2     3     2221           2159        22     2345           2315
 3  2013    10     3     1159           1200        -1     1259           1315
 4  2013     5    15      620            629        -9      812            825
 5  2013     9    19     1942           1945        -3     2128           2156
 6  2013     4    26     2106           2000        66       13           2259
 7  2013     2     9     1856           1859        -3     2034           2104
 8  2013     3    16     2201           2155         6       43             48
 9  2013    11     1     1832           1742        50     2045           1940
10  2013     6     8     1014            959        15     1223           1207
# ℹ 33,666 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>

fdistinct has a native sort option

flights |> 
  distinct(month) |> 
  arrange(month)
# A tibble: 12 × 1
   month
   <int>
 1     1
 2     2
 3     3
 4     4
 5     5
 6     6
 7     7
 8     8
 9     9
10    10
11    11
12    12
flights |> 
  fdistinct(month, sort = TRUE)
# A tibble: 12 × 1
   month
   <int>
 1     1
 2     2
 3     3
 4     4
 5     5
 6     6
 7     7
 8     8
 9     9
10    10
11    11
12    12

This is sort of the inverse of fdistinct and allows us to find duplicate rows very quickly

flights |> 
  duplicate_rows() # No duplicate rows
# A tibble: 0 × 20
# ℹ 20 variables: year <int>, month <int>, day <int>, dep_time <int>,
#   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
flights |> 
  duplicate_rows(origin, dest)
# A tibble: 336,552 × 2
   origin dest 
   <chr>  <chr>
 1 LGA    ATL  
 2 EWR    MIA  
 3 EWR    ORD  
 4 LGA    IAH  
 5 LGA    MSP  
 6 JFK    SJU  
 7 LGA    ORD  
 8 LGA    DFW  
 9 JFK    BOS  
10 EWR    PBI  
# ℹ 336,542 more rows
flights |> 
  duplicate_rows(origin, dest, sort = TRUE)
# A tibble: 336,552 × 2
   origin dest 
   <chr>  <chr>
 1 EWR    ALB  
 2 EWR    ALB  
 3 EWR    ALB  
 4 EWR    ALB  
 5 EWR    ALB  
 6 EWR    ALB  
 7 EWR    ALB  
 8 EWR    ALB  
 9 EWR    ALB  
10 EWR    ALB  
# ℹ 336,542 more rows

Final thoughts

There is still a lot of room to develop tools centred around time analysis with timeplyr offering a fast, clean and user-friendly interface to dealing with large longitudinal data.

timeplyr, which was initially my personal playground to learn more about R and package development, will hopefully continue to develop and become a routine part of other analysts’ toolkits

Credits