Healthcare Analyst primarily focused on working with health and social care data
R programmer (duh)
Someone that loves open source, learning new things and solving problems
Rock and heavy metal drummer
A package designed to make it easier to manipulate messy, large longitudinal and time-based datasets with potentially many groups.
It was inspired by my work analysing COVID-19 data consisting of millions of individuals.
It’s a package of functions I find useful!
✅ Aggregate dates and date-times into any time interval, such as days, weeks, months and years
✅ Identify and complete missing gaps in time
✅ Apply rolling calculations for large numbers of groups, with respect to a time index
✅ Calculate key episodic information based on events separated by a specified threshold of time
❌Perform advanced time-series analysis
❌Advanced data visualisation
❌Modelling or machine-learning
Fast - Uses a mix of collapse, data.table, dplyr and custom C++ to be as fast and efficient as possible
Consistent - Uses tidy principles where possible with emphasis on code safety
Developing - It is continuously being built and improved upon
Using a well-known tidyverse dataset of flights departing from 3 NYC airports in 2013.
We can easily check for missing gaps in time with time_num_gaps()
# Adding a date variable (day of flight)
flights <- flights |>
mutate(date = as_date(time_hour))
# Check for hours with no flights
(n_gaps <- time_num_gaps(flights$time_hour))
[1] 1819
We can also use time_num_gaps()
with an additional argument g
to find the number of gaps for each group, in this case, each airport.
# A tibble: 6,936 × 2
time_hour n
<dttm> <int>
1 2013-01-01 05:00:00 6
2 2013-01-01 06:00:00 52
3 2013-01-01 07:00:00 49
4 2013-01-01 08:00:00 58
5 2013-01-01 09:00:00 56
6 2013-01-01 10:00:00 39
7 2013-01-01 11:00:00 37
8 2013-01-01 12:00:00 56
9 2013-01-01 13:00:00 54
10 2013-01-01 14:00:00 48
# ℹ 6,926 more rows
Here we complete the hourly gaps using time_complete()
Using time_count()
we can very quickly count the number of flights each month.
# A tibble: 12 × 2
date n
<date> <int>
1 2013-01-01 27004
2 2013-02-01 24951
3 2013-03-01 28834
4 2013-04-01 28330
5 2013-05-01 28796
6 2013-06-01 28243
7 2013-07-01 29425
8 2013-08-01 29327
9 2013-09-01 27574
10 2013-10-01 28889
11 2013-11-01 27268
12 2013-12-01 28135
We can make our implicit month interval into a more explicit one with the as_interval
argument.
# A tibble: 12 × 2
date n
<tm_intv> <int>
1 [2013-01-01, 2013-02-01) 27004
2 [2013-02-01, 2013-03-01) 24951
3 [2013-03-01, 2013-04-01) 28834
4 [2013-04-01, 2013-05-01) 28330
5 [2013-05-01, 2013-06-01) 28796
6 [2013-06-01, 2013-07-01) 28243
7 [2013-07-01, 2013-08-01) 29425
8 [2013-08-01, 2013-09-01) 29327
9 [2013-09-01, 2013-10-01) 27574
10 [2013-10-01, 2013-11-01) 28889
11 [2013-11-01, 2013-12-01) 27268
12 [2013-12-01, 2014-01-01) 28135
timeplyr has it’s own year-month class, inspired by the excellent zoo and tsibble packages
It is very easy to create a sequence of year_months
The time_by()
function is similar to dplyr’s group_by()
but with an emphasis on dates and date-times.
Let’s work with time_intervals throughout by setting the timeplyr.use_intervals
option to true.
Number of flights and mean departure delay per quarter
time_count
This is just the time variable we want to specify.
# A tibble: 365 × 2
date n
<tm_intv> <int>
1 [Jan-01, Jan-02) 842
2 [Jan-02, Jan-03) 943
3 [Jan-03, Jan-04) 914
4 [Jan-04, Jan-05) 915
5 [Jan-05, Jan-06) 720
6 [Jan-06, Jan-07) 832
7 [Jan-07, Jan-08) 933
8 [Jan-08, Jan-09) 899
9 [Jan-09, Jan-10) 902
10 [Jan-10, Jan-11) 932
# ℹ 355 more rows
These are additional variables to count which use dplyr style data-masking
# A tibble: 316,497 × 6
date carrier tailnum origin dest n
<tm_intv> <chr> <chr> <chr> <chr> <int>
1 [Jan-01, Jan-02) 9E N602LR JFK CVG 1
2 [Jan-01, Jan-02) 9E N604LR JFK MSP 1
3 [Jan-01, Jan-02) 9E N836AY JFK PHL 1
4 [Jan-01, Jan-02) 9E N8409N JFK SYR 1
5 [Jan-01, Jan-02) 9E N8444F JFK CLE 1
6 [Jan-01, Jan-02) 9E N8444F JFK IAD 1
7 [Jan-01, Jan-02) 9E N8515F JFK RDU 1
8 [Jan-01, Jan-02) 9E N8598B JFK BWI 1
9 [Jan-01, Jan-02) 9E N8611A JFK RDU 1
10 [Jan-01, Jan-02) 9E N8631E JFK ROC 1
# ℹ 316,487 more rows
This argument allows us to control which time unit we will aggregate our time variable to.
# A tibble: 27 × 2
date n
<tm_intv> <int>
1 [Jan-01, Jan-15) 12208
2 [Jan-15, Jan-29) 12078
3 [Jan-29, Feb-12) 12173
4 [Feb-12, Feb-26) 12649
5 [Feb-26, Mar-12) 13009
6 [Mar-12, Mar-26) 13100
7 [Mar-26, Apr-09) 13145
8 [Apr-09, Apr-23) 13239
9 [Apr-23, May-07) 13080
10 [May-07, May-21) 13019
# ℹ 17 more rows
If left unspecified, most of the time the time_
functions will try to guess the time granularity through the greatest common divisor of time differences.
We see gcd_diff()
returns 3600, which is the number of seconds in an hour.
This allows us to specify start and end times
# A tibble: 12 × 3
origin date n
<chr> <tm_intv> <int>
1 EWR [Dec-01, Dec-15) 4596
2 EWR [Dec-15, Dec-29) 4398
3 EWR [Dec-29, Jan-12) 928
4 EWR NA 110913
5 JFK [Dec-01, Dec-15) 4049
6 JFK [Dec-15, Dec-29) 4187
7 JFK [Dec-29, Jan-12) 910
8 JFK NA 102133
9 LGA [Dec-01, Dec-15) 4271
10 LGA [Dec-15, Dec-29) 4002
11 LGA [Dec-29, Jan-12) 794
12 LGA NA 95595
time_type
controls whether we use lubridate style durations or periods.
By default, durations are used for anything more granular than a day, and periods otherwise.
options(timeplyr.interval_sub_formatter = identity)
flights |>
time_count(date, time_by = "3 months",
time_type = "period")
# A tibble: 4 × 2
date n
<tm_intv> <int>
1 [2013-01-01, 2013-04-01) 80789
2 [2013-04-01, 2013-07-01) 85369
3 [2013-07-01, 2013-10-01) 86326
4 [2013-10-01, 2014-01-01) 84292
# A tibble: 4 × 2
date n
<tm_intv> <int>
1 [2013-01-01, 2013-04-02 07:30:00) 82742
2 [2013-04-02 07:30:00, 2013-07-02 15:00:00) 85327
3 [2013-07-02 15:00:00, 2013-10-01 22:30:00) 85380
4 [2013-10-01 22:30:00, 2014-01-01 06:00:00) 83327
time_floor
Takes the minimum time and floors it to the nearest specified unit (excluding the multiple). If for example you specified time_by = "3 months"
, time_floor = TRUE
would round the earliest time to the nearest month.
# A tibble: 3 × 2
date n
<tm_intv> <int>
1 [2013-08-22, 2013-11-22) 85195
2 [2013-11-22, 2014-02-22) 35871
3 NA 215710
# A tibble: 3 × 2
date n
<tm_intv> <int>
1 [2013-08-01, 2013-11-01) 65663
2 [2013-11-01, 2014-02-01) 55403
3 NA 215710
This argument controls whether the start of the implicit time interval is returned, or whether an explicit time_interval
is returned.
# A tibble: 12 × 2
date n
<date> <int>
1 2013-01-01 27004
2 2013-02-01 24951
3 2013-03-01 28834
4 2013-04-01 28330
5 2013-05-01 28796
6 2013-06-01 28243
7 2013-07-01 29425
8 2013-08-01 29327
9 2013-09-01 27574
10 2013-10-01 28889
11 2013-11-01 27268
12 2013-12-01 28135
# A tibble: 12 × 2
date n
<tm_intv> <int>
1 [2013-01-01, 2013-02-01) 27004
2 [2013-02-01, 2013-03-01) 24951
3 [2013-03-01, 2013-04-01) 28834
4 [2013-04-01, 2013-05-01) 28330
5 [2013-05-01, 2013-06-01) 28796
6 [2013-06-01, 2013-07-01) 28243
7 [2013-07-01, 2013-08-01) 29425
8 [2013-08-01, 2013-09-01) 29327
9 [2013-09-01, 2013-10-01) 27574
10 [2013-10-01, 2013-11-01) 28889
11 [2013-11-01, 2013-12-01) 27268
12 [2013-12-01, 2014-01-01) 28135
Sometimes the former is more useful, especially for plotting and formatting, for example:
flights |>
time_count(date, time_by = "months", as_interval = FALSE) |>
mutate(month = format(date, "%B"))
# A tibble: 12 × 3
date n month
<date> <int> <chr>
1 2013-01-01 27004 January
2 2013-02-01 24951 February
3 2013-03-01 28834 March
4 2013-04-01 28330 April
5 2013-05-01 28796 May
6 2013-06-01 28243 June
7 2013-07-01 29425 July
8 2013-08-01 29327 August
9 2013-09-01 27574 September
10 2013-10-01 28889 October
11 2013-11-01 27268 November
12 2013-12-01 28135 December
We can easily convert time series to a (long) tibble
# A tibble: 7,440 × 3
group time value
<chr> <dbl> <dbl>
1 DAX 1991. 1629.
2 DAX 1992. 1614.
3 DAX 1992. 1607.
4 DAX 1992. 1621.
5 DAX 1992. 1618.
6 DAX 1992. 1611.
7 DAX 1992. 1631.
8 DAX 1992. 1640.
9 DAX 1992. 1635.
10 DAX 1992. 1646.
# ℹ 7,430 more rows
ts_as_tibble()
currently supports ts
, xts
, zoo
and timeSeries
objects.
We can also plot univariate and multivariate time-series data using time_ggplot()
which is just a ggplot wrapper that applies some sensible axes and formatting.
timeplyr has a set of optimised functions, utilising data.table’s excellent frollmean()
and friends. It builds upon these by incorporating a fast method for grouped calculations, as well as incorporating time-based windows.
library(data.table)
df <- data.table(x = rnorm(10^6),
g = sample.int(10^4, 10^6, TRUE))
mark(df[, dt_mean := frollmean(x, n = 7,
align = "right", na.rm = FALSE),
by = g])
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 "df[, `:=`(dt_mean, frollmean(x, n … 425ms 469ms 2.13 179MB 21.3
mark(df[, tp_mean := roll_mean(x, window = 7,
na.rm = FALSE, partial = FALSE,
g = g)])
#> # A tibble: 1 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch> <bch:> <dbl> <bch:byt> <dbl>
#> 1 df[, `:=`(tp_mean, roll_mean(x, win… 27ms 27.3ms 36.7 47MB 202.
all.equal(df$dt_mean, df$tp_mean)
#> [1] TRUE
We can use window_sequence()
with data.table’s adaptive
argument to easily calculate rolling stats with partial windows.
An example of a rolling sum, by group, with respect to a time index
x <- 1:20
g <- seq_id(rep(5, 4))
t <- today() + days(0:19)
df <- tibble(x, g, t)
df |>
mutate(sum =
time_roll_sum(x, time = t, g = g,
window = "3 days", partial = FALSE))
# A tibble: 20 × 4
x g t sum
<int> <int> <date> <dbl>
1 1 1 2024-01-26 NA
2 2 1 2024-01-27 NA
3 3 1 2024-01-28 6
4 4 1 2024-01-29 9
5 5 1 2024-01-30 12
6 6 2 2024-01-31 NA
7 7 2 2024-02-01 NA
8 8 2 2024-02-02 21
9 9 2 2024-02-03 24
10 10 2 2024-02-04 27
11 11 3 2024-02-05 NA
12 12 3 2024-02-06 NA
13 13 3 2024-02-07 36
14 14 3 2024-02-08 39
15 15 3 2024-02-09 42
16 16 4 2024-02-10 NA
17 17 4 2024-02-11 NA
18 18 4 2024-02-12 51
19 19 4 2024-02-13 54
20 20 4 2024-02-14 57
It is surprisingly common to analyse time-based episodes in health datasets.
Examples include identifying disease reinfection, or hospital readmission. The data may contain many records in quick succession, and so certain rules are applied to identify which ones are distinct episodes. One such criteria is setting a time threshold between events.
In this example we will show how to find reinfections in fictitious data of 1000 individuals receiving a test for an unknown (to us) disease, on average every 2 weeks. A third of these will be positive results. The aim is to classify these positives as new episodes when 42 days pass between the previous positive result.
set.seed(912398123)
id <- 1:1000
initial_date <- today()
time_bw_tests <- rexp(10^3 * 10, 1/14)
initial_df <- tibble(id, test_date = initial_date)
df <- initial_df |>
bind_rows(
tibble(id = rep(id, 10), time_bw_tests)
)
df <- df |>
mutate(time_bw_tests = replace_na(time_bw_tests, 0))
df <- df |>
group_by(id) %>%
mutate(test_date = today() + cumsum(as.integer(time_bw_tests))) %>%
mutate(test_date = replace_na(test_date, today()))
events <- df %>%
slice_sample(prop = 1/3)
# 1000 individuals, 10 tests each
# On average they take 1 test every 2 weeks
events <- events %>%
time_episodes(test_date, time_by = "days", window = 42)
events
# A tibble: 3,000 x 6
# Groups: id [1,000]
# Episodes: N: 1,784, Median: 2, Mean: 1.78 ▅▁▁▇▁▁▂
# Time b/w events: Pooled mean: ~1.33 months
# Threshold: ~1.38 months
id test_date t_elapsed ep_start ep_id ep_id_new
* <int> <date> <dbl> <date> <int> <int>
1 1 2024-02-11 0 2024-02-11 1 1
2 1 2024-03-13 31 2024-02-11 1 0
3 1 2024-05-04 52 2024-05-04 2 2
4 2 2024-01-28 0 2024-01-28 1 1
5 2 2024-03-26 58 2024-03-26 2 2
6 2 2024-04-09 14 2024-03-26 2 0
7 3 2024-03-01 18 2024-02-12 1 0
8 3 2024-03-23 22 2024-02-12 1 0
9 3 2024-02-12 0 2024-02-12 1 1
10 4 2024-07-24 180 2024-07-24 2 2
# ℹ 2,990 more rows
# A tibble: 784 x 6
# Groups: id [666]
# Episodes: N: 784, Median: 1, Mean: 1.18 ▇▁▁▁▁▁▂
# Time b/w events: Pooled mean: ~2.5 months
# Threshold: ~1.38 months
id test_date t_elapsed ep_start ep_id ep_id_new
<int> <date> <dbl> <date> <int> <int>
1 1 2024-05-04 52 2024-05-04 2 2
2 2 2024-03-26 58 2024-03-26 2 2
3 4 2024-07-24 180 2024-07-24 2 2
4 5 2024-04-15 52 2024-04-15 2 2
5 6 2024-04-16 51 2024-04-16 2 2
6 8 2024-05-02 97 2024-05-02 2 2
7 9 2024-03-25 42 2024-03-25 2 2
8 12 2024-05-08 46 2024-05-08 2 2
9 13 2024-06-01 53 2024-06-01 2 2
10 17 2024-05-20 90 2024-05-20 2 2
# ℹ 774 more rows
time_seq()
is like seq()
but can accept a mix of dates and date-times
[1] "2024-01-26 UTC" "2024-01-27 UTC" "2024-01-28 UTC" "2024-01-29 UTC"
[5] "2024-01-30 UTC" "2024-01-31 UTC" "2024-02-01 UTC" "2024-02-02 UTC"
Error in seq.Date(today, next_week, "days"): 'to' must be a "Date" object
[1] "2024-01-26 UTC" "2024-01-27 UTC" "2024-01-28 UTC" "2024-01-29 UTC"
[5] "2024-01-30 UTC" "2024-01-31 UTC" "2024-02-01 UTC" "2024-02-02 UTC"
[9] "2024-01-26 UTC" "2024-01-28 UTC" "2024-01-30 UTC" "2024-02-01 UTC"
[13] "2024-01-26 UTC" "2024-01-29 UTC" "2024-02-01 UTC"
1 2 2 3 3 3
"2024-01-26" "2024-01-26" "2024-01-27" "2024-01-26" "2024-01-27" "2024-01-28"
A convenience function to calculate time differences
[1] 366
# Compared to lubridate
mark(time_diff(flights$time_hour, today(), "years"),
interval(flights$time_hour, today()) / years(1),
min_iterations = 3)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl>
#> 1 "time_diff(flights$time_hour, to… 15.43ms 16.16ms 50.7 20.9MB 7.80
#> 2 "interval(flights$time_hour, tod… 1.32s 1.32s 0.758 654.9MB 6.57
group_id()
assigns each group a unique id and can be useful to use interactively
An enhanced diff()
[1] 1 1 1 1 1 1 1 1 1
[1] NA 1 1 1 1 1 1 1 1 1
Time differences in days
[1] 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[1] NA 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[1] NA 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
[1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
A function that prioritises pretty-looking breaks
[1] "2013-01-01 05:00:00 EST" "2013-04-01 05:00:00 EDT"
[3] "2013-07-01 05:00:00 EDT" "2013-10-01 05:00:00 EDT"
[1] "2013-01-01 05:00:00 EST" "2013-03-01 05:00:00 EST"
[3] "2013-05-01 05:00:00 EDT" "2013-07-01 05:00:00 EDT"
[5] "2013-09-01 05:00:00 EDT" "2013-11-01 05:00:00 EDT"
# A tibble: 6 × 2
interval n
<tm_intv> <int>
1 [2013-01-01, 2013-03-01) 51955
2 [2013-03-01, 2013-05-01) 57164
3 [2013-05-01, 2013-07-01) 57039
4 [2013-07-01, 2013-09-01) 58752
5 [2013-09-01, 2013-11-01) 56463
6 [2013-11-01, 2014-01-01) 55403
Identify breaks in a sequence using e.g. time_seq_id
time_is_regular
, time_num_gaps
[1] TRUE
[1] FALSE
id <- time_seq_id(hours, time_by = "hours", threshold = 5) # >5 hours
head(hours[id != lead(id)], n = 10)
[1] "2013-01-01 23:00:00 EST" "2013-01-02 23:00:00 EST"
[3] "2013-01-03 23:00:00 EST" "2013-01-04 23:00:00 EST"
[5] "2013-01-05 23:00:00 EST" "2013-01-06 23:00:00 EST"
[7] "2013-01-07 23:00:00 EST" "2013-01-08 23:00:00 EST"
[9] "2013-01-09 23:00:00 EST" "2013-01-10 23:00:00 EST"
[1] 1819
timeplyr has a set of fast dplyr alternative functions, which will likely be moved to a package of their own
This is very similar to dplyr::group_by
except for an attribute sorted
that specifies whether the groups are sorted normally or sorted in order-of-first appearance. This is analogous to the difference between the by
and keyby
construct in data.table.
# A tibble: 12 × 2
month number_of_flights
<int> <int>
1 1 27004
2 2 24951
3 3 28834
4 4 28330
5 5 28796
6 6 28243
7 7 29425
8 8 29327
9 9 27574
10 10 28889
11 11 27268
12 12 28135
# A tibble: 12 × 2
month number_of_flights
<int> <int>
1 1 27004
2 10 28889
3 11 27268
4 12 28135
5 2 24951
6 3 28834
7 4 28330
8 5 28796
9 6 28243
10 7 29425
11 8 29327
12 9 27574
# A tibble: 12 × 2
month number_of_flights
<int> <int>
1 1 27004
2 10 28889
3 11 27268
4 12 28135
5 2 24951
6 3 28834
7 4 28330
8 5 28796
9 6 28243
10 7 29425
11 8 29327
12 9 27574
# A tibble: 6,936 × 3
time_hour date n
<dttm> <date> <int>
1 2013-01-01 05:00:00 2013-01-01 6
2 2013-01-01 06:00:00 2013-01-01 52
3 2013-01-01 07:00:00 2013-01-01 49
4 2013-01-01 08:00:00 2013-01-01 58
5 2013-01-01 09:00:00 2013-01-01 56
6 2013-01-01 10:00:00 2013-01-01 39
7 2013-01-01 11:00:00 2013-01-01 37
8 2013-01-01 12:00:00 2013-01-01 56
9 2013-01-01 13:00:00 2013-01-01 54
10 2013-01-01 14:00:00 2013-01-01 48
# ℹ 6,926 more rows
# A tibble: 3 × 2
origin n
<chr> <int>
1 EWR 120835
2 LGA 104662
3 JFK 111279
# A tibble: 105 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 10 1 1955 2001 -6 2213 2248
2 2013 10 1 1149 1159 -10 1245 1259
3 2013 1 1 1315 1317 -2 1413 1423
4 2013 7 6 1629 1615 14 1954 1953
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 905 905 0 1309 1229
7 2013 1 1 946 959 -13 1146 1202
8 2013 1 1 1318 1322 -4 1358 1416
9 2013 10 1 1338 1105 153 1446 1245
10 2013 1 2 2044 2005 39 2229 2158
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# A tibble: 105 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 10 1 1955 2001 -6 2213 2248
2 2013 10 1 1149 1159 -10 1245 1259
3 2013 1 1 1315 1317 -2 1413 1423
4 2013 7 6 1629 1615 14 1954 1953
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 905 905 0 1309 1229
7 2013 1 1 946 959 -13 1146 1202
8 2013 1 1 1318 1322 -4 1358 1416
9 2013 10 1 1338 1105 153 1446 1245
10 2013 1 2 2044 2005 39 2229 2158
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# A tibble: 105 × 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 9 30 1959 2001 -2 2209 2248
2 2013 9 30 1156 1159 -3 1253 1259
3 2013 9 24 2151 2159 -8 2240 2303
4 2013 8 24 1633 1625 8 1959 2003
5 2013 9 30 2053 1815 158 2310 2054
6 2013 9 30 2052 2054 -2 2306 2358
7 2013 9 30 1641 1638 3 1833 1829
8 2013 9 28 741 738 3 823 835
9 2013 9 30 2203 2205 -2 2339 2331
10 2013 9 30 1939 1950 -11 2112 2141
# ℹ 95 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# Random 10% of flights by origin airport
flights |>
fgroup_by(origin, order = FALSE) |>
fslice_sample(prop = 0.1)
# A tibble: 33,676 × 20
# Groups: origin [3]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 8 28 1818 1805 13 2013 2033
2 2013 2 3 2221 2159 22 2345 2315
3 2013 10 3 1159 1200 -1 1259 1315
4 2013 5 15 620 629 -9 812 825
5 2013 9 19 1942 1945 -3 2128 2156
6 2013 4 26 2106 2000 66 13 2259
7 2013 2 9 1856 1859 -3 2034 2104
8 2013 3 16 2201 2155 6 43 48
9 2013 11 1 1832 1742 50 2045 1940
10 2013 6 8 1014 959 15 1223 1207
# ℹ 33,666 more rows
# ℹ 12 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
fdistinct
has a native sort
option
This is sort of the inverse of fdistinct
and allows us to find duplicate rows very quickly
# A tibble: 0 × 20
# ℹ 20 variables: year <int>, month <int>, day <int>, dep_time <int>,
# sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
# sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>, date <date>
# A tibble: 336,552 × 2
origin dest
<chr> <chr>
1 LGA ATL
2 EWR MIA
3 EWR ORD
4 LGA IAH
5 LGA MSP
6 JFK SJU
7 LGA ORD
8 LGA DFW
9 JFK BOS
10 EWR PBI
# ℹ 336,542 more rows
# A tibble: 336,552 × 2
origin dest
<chr> <chr>
1 EWR ALB
2 EWR ALB
3 EWR ALB
4 EWR ALB
5 EWR ALB
6 EWR ALB
7 EWR ALB
8 EWR ALB
9 EWR ALB
10 EWR ALB
# ℹ 336,542 more rows
There is still a lot of room to develop tools centred around time analysis with timeplyr offering a fast, clean and user-friendly interface to dealing with large longitudinal data.
timeplyr, which was initially my personal playground to learn more about R and package development, will hopefully continue to develop and become a routine part of other analysts’ toolkits
The R core team https://www.r-project.org/
The tidyverse team https://www.tidyverse.org/
The data.table Devs https://rdatatable.gitlab.io/data.table/
The collapse Devs https://sebkrantz.github.io/collapse/