Object classes and characteristics

bupaR knows two main object classes: eventlog and activitylog. Both are special types of data.frames. In addition, there is the overarching object class log, which is used by functions where the distinction between the two classes is not relevant. It only serves as a higher-level classification of eventlog and activitylog objects and cannot stand on its own. That is, objects that have only the class “log” cannot exist; they must have one of the subclasses as well.

The defining characteristics of a log are stored in regular variables, whose names can be obtained with the mapping function.

mapping(patients)
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type
mapping(patients_act)
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Timestamps:      start, complete

Note that eventlogs and activitylogs have some mapping-elements in common:

  • case identifier
  • activity identifier
  • resource identifier

Others differ slightly:

  • the activity instance identifier only exists for eventlogs. In an activitylog, each row is an activity instance.
  • the lifecycle identifier of an eventlog consists of a single column. For an activitylog, it consists of multiple columns. (At least start and complete are required, although they can contain NAs.)
  • the activitylog does not have a timestamp column. The timestamps are stored within the lifecycle columns.

Note that there are two classes for the mapping, one for eventlog and one for activitylog. (Note also that the eventlog_mapping has a dedicated print function, while the activitylog mapping does not (yet) and is printed as a regular list.)

Individual mapping-variables can be obtained with the dedicated id functions. They work both on the logs themselves and on the mappings.

activity_id(patients)
## [1] "handling"
activity_id(patients_act)
## [1] "handling"
mapping_event <- mapping(patients)
mapping_act <- mapping(patients_act)
activity_id(mapping_event)
## [1] "handling"
activity_id(mapping_act)
## [1] "handling"
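The other identifiers in the mapping can be retrieved in the same way. The following is a minimal sketch, assuming the patients event log from eventdataR and the bupaR accessors case_id, resource_id, activity_instance_id, timestamp and lifecycle_id. The values in the comments are the column names listed in the mapping printed earlier.

```r
library(bupaR)
library(eventdataR)  # assumed source of the `patients` event log

# Each accessor returns the column name as a character string.
case_id(patients)               # "patient" according to the mapping
resource_id(patients)           # "employee"
activity_instance_id(patients)  # "handling_id"
timestamp(patients)             # "time"
lifecycle_id(patients)          # "registration_type"
```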

During data manipulation, it can sometimes happen (or be necessary) that the log is at some point converted to a regular data.frame for some operations. If the ultimate output of the function should once again be a log object (and not a visual or summary table), the mapping can be used to restore the original log. This is done with the re_map function.

patients_df <- as.data.frame(patients)
class(patients_df)
## [1] "data.frame"
patients_log <- re_map(patients_df, mapping_event)
class(patients_log)
## [1] "eventlog"   "log"        "tbl_df"     "tbl"        "data.frame"

The re_map function recognizes the class of the mapping, and thus works for both activitylog and eventlog mappings. It always returns an object of the original type, i.e. if the mapping originates from an activitylog object, the result is once again an activitylog object. It can never be used to convert an activitylog to an eventlog, or vice versa.
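The same round trip can be sketched for an activity log. This example assumes that patients_act was created from the patients event log with to_activitylog(); how the object was actually built is not shown in this manual, so treat that step as an assumption.

```r
library(bupaR)
library(eventdataR)  # assumed source of the `patients` event log

# Assumption: the activity log is derived from the event log.
patients_act <- to_activitylog(patients)
mapping_act <- mapping(patients_act)

# Drop the log class, then restore it from the stored mapping.
patients_act_df <- as.data.frame(patients_act)
patients_act_log <- re_map(patients_act_df, mapping_act)

# The result should again be an activitylog, not an eventlog.
inherits(patients_act_log, "activitylog")
```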

While the re_map function is exported by bupaR, it is primarily intended for internal use. It is only useful to the end-user in more advanced applications of bupaR.

Note that functions that are not exported can always be called with the ::: operator instead of the :: operator. For instance, we can use the non-exported activity_id_ function outside of bupaR as follows.

bupaR:::activity_id_(patients)
## handling

While you should typically not need these functions outside of bupaR, except perhaps when developing or testing code interactively, we will use the ::: notation in this manual whenever we refer to internal functions.

The activity_id_ function is a variant of the activity_id function: instead of returning a character object, it returns a symbol. This symbol is useful when you want to use the eventlog mapping in programming.

For example, suppose you want to filter the patients event log to keep only patient == 1, but you don’t know that the case_id is patient, so you use the case_id function to look it up.

The following won’t work.

patients %>%
    filter(case_id(patients) == 1)
## EMPTY EVENT LOG
## # A tibble: 0 x 7
## # ... with 7 variables: handling <fct>, patient <chr>, employee <fct>,
## #   handling_id <chr>, registration_type <fct>, time <dttm>, .order <int>

Nor will the following.

patients %>%
    filter("patient" == 1)
## EMPTY EVENT LOG
## # A tibble: 0 x 7
## # ... with 7 variables: handling <fct>, patient <chr>, employee <fct>,
## #   handling_id <chr>, registration_type <fct>, time <dttm>, .order <int>
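Both attempts fail for the same reason, which can be reproduced in base R without any log. In this small sketch, column_name is an illustrative variable standing in for the value returned by case_id(patients). The comparison yields a single FALSE, which filter applies to every row, so the result is an empty log.

```r
# case_id(patients) returns the column name as a character string.
column_name <- "patient"

# Comparing that string to 1 tests the name itself, not the column values:
# R coerces 1 to "1" and evaluates "patient" == "1".
column_name == 1
```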

In order to successfully do this, we could use the symbol:

patients %>%
    filter(!!bupaR:::case_id_(patients) == 1)
## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-09 19:45:45 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 x 7
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Triage and~ 1       r2       501         start            2017-01-02 12:40:20
##  3 Blood test  1       r3       1001        start            2017-01-05 08:59:04
##  4 MRI SCAN    1       r4       1238        start            2017-01-05 21:37:12
##  5 Discuss Re~ 1       r6       1735        start            2017-01-07 07:57:49
##  6 Check-out   1       r7       2230        start            2017-01-09 17:09:43
##  7 Registrati~ 1       r1       1           complete         2017-01-02 12:40:20
##  8 Triage and~ 1       r2       501         complete         2017-01-02 22:32:25
##  9 Blood test  1       r3       1001        complete         2017-01-05 14:34:27
## 10 MRI SCAN    1       r4       1238        complete         2017-01-06 01:54:23
## 11 Discuss Re~ 1       r6       1735        complete         2017-01-07 10:18:08
## 12 Check-out   1       r7       2230        complete         2017-01-09 19:45:45
## # ... with 1 more variable: .order <int>

More on symbols and !!: https://adv-r.hadley.nz/quasiquotation.html

Alternatively, the following notation works as well.

patients %>%
    filter(.data[[case_id(patients)]] == 1)
## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-09 19:45:45 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 x 7
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Triage and~ 1       r2       501         start            2017-01-02 12:40:20
##  3 Blood test  1       r3       1001        start            2017-01-05 08:59:04
##  4 MRI SCAN    1       r4       1238        start            2017-01-05 21:37:12
##  5 Discuss Re~ 1       r6       1735        start            2017-01-07 07:57:49
##  6 Check-out   1       r7       2230        start            2017-01-09 17:09:43
##  7 Registrati~ 1       r1       1           complete         2017-01-02 12:40:20
##  8 Triage and~ 1       r2       501         complete         2017-01-02 22:32:25
##  9 Blood test  1       r3       1001        complete         2017-01-05 14:34:27
## 10 MRI SCAN    1       r4       1238        complete         2017-01-06 01:54:23
## 11 Discuss Re~ 1       r6       1735        complete         2017-01-07 10:18:08
## 12 Check-out   1       r7       2230        complete         2017-01-09 19:45:45
## # ... with 1 more variable: .order <int>

Here, .data is a pronoun that can be used inside dplyr functions to refer to the data being processed. More information here: https://adv-r.hadley.nz/quasiquotation.html

In bupaR, the latter notation is preferred. It has the advantage that it can be used in scripts both inside and outside bupaR (whereas the !! notation only works with the bupaR::: prefix). It is also slightly easier to understand than the workings of !!.
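As an illustration of how the .data notation carries over to your own code, the sketch below wraps it in a small function. The name filter_single_case is hypothetical and not part of bupaR; the example assumes the patients event log from eventdataR.

```r
library(bupaR)
library(eventdataR)  # assumed source of the `patients` event log
library(dplyr)

# Hypothetical helper (not part of bupaR): keep the events of one case
# without hard-coding the name of the case identifier column.
filter_single_case <- function(log, case) {
    log %>%
        filter(.data[[case_id(log)]] == case)
}

patient_1 <- filter_single_case(patients, "1")
n_cases(patient_1)
```

Because the column name is looked up from the log itself, the same helper works for any event log, whatever its case identifier is called.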

That said, the use of case_id_() and symbol(case_id()) is still widespread in bupaR, but the goal is to phase out this usage.

dplyr verbs

The following dplyr verbs have received methods for activity logs and event logs.

  • filter
  • group_by
  • arrange
  • mutate
  • select

They will all return a proper log object, i.e. there is no risk that the mapping is lost.

Special attention has to be given to the following:

Select

A conventional select does not ensure that the log retains the variables it needs to be considered a log. The select methods for logs therefore keep both the listed variables and the variables that define the log.

The following code returns an eventlog with the attribute oligurie, as well as the 6 variables needed to define the eventlog (plus the .order variable, see further).

sepsis %>%
    select(oligurie)
## # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 x 8
##    oligurie case_id activity       activity_instan~ timestamp           resource
##    <lgl>    <chr>   <fct>          <chr>            <dttm>              <fct>   
##  1 FALSE    A       ER Registrati~ 1                2014-10-22 11:15:41 A       
##  2 NA       A       Leucocytes     2                2014-10-22 11:27:00 B       
##  3 NA       A       CRP            3                2014-10-22 11:27:00 B       
##  4 NA       A       LacticAcid     4                2014-10-22 11:27:00 B       
##  5 NA       A       ER Triage      5                2014-10-22 11:33:37 C       
##  6 NA       A       ER Sepsis Tri~ 6                2014-10-22 11:34:00 A       
##  7 NA       A       IV Liquid      7                2014-10-22 14:03:47 A       
##  8 NA       A       IV Antibiotics 8                2014-10-22 14:03:47 A       
##  9 NA       A       Admission NC   9                2014-10-22 14:13:19 D       
## 10 NA       A       CRP            10               2014-10-24 09:00:00 B       
## # ... with 15,204 more rows, and 2 more variables: lifecycle <fct>,
## #   .order <int>

This behaviour can be turned off using the force_df = TRUE argument. In that case, select will work just like a traditional select, and the result will be a normal data.frame, no longer an eventlog.

sepsis %>%
    select(oligurie, force_df = TRUE)
## # A tibble: 15,214 x 1
##    oligurie
##    <lgl>   
##  1 FALSE   
##  2 NA      
##  3 NA      
##  4 NA      
##  5 NA      
##  6 NA      
##  7 NA      
##  8 NA      
##  9 NA      
## 10 NA      
## # ... with 15,204 more rows

Because the mapping variables are kept by default, you can select just the eventlog mapping by calling select() without arguments.

sepsis %>%
    select()
## # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 x 7
##    case_id activity      activity_instan~ timestamp           resource lifecycle
##    <chr>   <fct>         <chr>            <dttm>              <fct>    <fct>    
##  1 A       ER Registrat~ 1                2014-10-22 11:15:41 A        complete 
##  2 A       Leucocytes    2                2014-10-22 11:27:00 B        complete 
##  3 A       CRP           3                2014-10-22 11:27:00 B        complete 
##  4 A       LacticAcid    4                2014-10-22 11:27:00 B        complete 
##  5 A       ER Triage     5                2014-10-22 11:33:37 C        complete 
##  6 A       ER Sepsis Tr~ 6                2014-10-22 11:34:00 A        complete 
##  7 A       IV Liquid     7                2014-10-22 14:03:47 A        complete 
##  8 A       IV Antibioti~ 8                2014-10-22 14:03:47 A        complete 
##  9 A       Admission NC  9                2014-10-22 14:13:19 D        complete 
## 10 A       CRP           10               2014-10-24 09:00:00 B        complete 
## # ... with 15,204 more rows, and 1 more variable: .order <int>

If you want to select only specific eventlog classifiers, you can use the internal select_ids function. Because you will typically not select all ids (otherwise you could use select()), this will by default turn your object into a data.frame.

sepsis %>%
    bupaR:::select_ids(activity_id, case_id)
## # A tibble: 15,214 x 2
##    activity         case_id
##    <fct>            <chr>  
##  1 ER Registration  A      
##  2 Leucocytes       A      
##  3 CRP              A      
##  4 LacticAcid       A      
##  5 ER Triage        A      
##  6 ER Sepsis Triage A      
##  7 IV Liquid        A      
##  8 IV Antibiotics   A      
##  9 Admission NC     A      
## 10 CRP              A      
## # ... with 15,204 more rows

Note how the different classifiers are specified: using the id functions, but without the brackets, and not as character strings.

Group_by

While the group_by function is defined for logs, it should be noted that it requires special methods for each function before that function is “compatible” with grouped logs. Some utility functions for this do however exist (see further).

There are some shortcuts for typical groupings when programming in bupaR.

  • group_by_case
  • group_by_activity
  • group_by_activity_instance
  • group_by_resource
  • group_by_resource_activity

patients %>%
    group_by_case()
## # Groups: [patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 x 7
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Registrati~ 2       r1       2           start            2017-01-02 11:41:53
##  3 Registrati~ 3       r1       3           start            2017-01-04 01:34:05
##  4 Registrati~ 4       r1       4           start            2017-01-04 01:34:04
##  5 Registrati~ 5       r1       5           start            2017-01-04 16:07:47
##  6 Registrati~ 6       r1       6           start            2017-01-04 16:07:47
##  7 Registrati~ 7       r1       7           start            2017-01-05 04:56:11
##  8 Registrati~ 8       r1       8           start            2017-01-05 04:56:11
##  9 Registrati~ 9       r1       9           start            2017-01-06 05:58:54
## 10 Registrati~ 10      r1       10          start            2017-01-06 05:58:54
## # ... with 5,432 more rows, and 1 more variable: .order <int>

is equivalent to

patients %>%
    group_by(.data[[case_id(patients)]])
## # Groups: [patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 x 7
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Registrati~ 2       r1       2           start            2017-01-02 11:41:53
##  3 Registrati~ 3       r1       3           start            2017-01-04 01:34:05
##  4 Registrati~ 4       r1       4           start            2017-01-04 01:34:04
##  5 Registrati~ 5       r1       5           start            2017-01-04 16:07:47
##  6 Registrati~ 6       r1       6           start            2017-01-04 16:07:47
##  7 Registrati~ 7       r1       7           start            2017-01-05 04:56:11
##  8 Registrati~ 8       r1       8           start            2017-01-05 04:56:11
##  9 Registrati~ 9       r1       9           start            2017-01-06 05:58:54
## 10 Registrati~ 10      r1       10          start            2017-01-06 05:58:54
## # ... with 5,432 more rows, and 1 more variable: .order <int>

Not all relevant combinations of groupings are provided as a shortcut; only the common resource-activity combination is. However, the internal group_by_ids function allows the use of any combination of _id functions. For example:

patients %>%
    bupaR:::group_by_ids(activity_id, case_id)
## # Groups: [handling, patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 x 7
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Registrati~ 2       r1       2           start            2017-01-02 11:41:53
##  3 Registrati~ 3       r1       3           start            2017-01-04 01:34:05
##  4 Registrati~ 4       r1       4           start            2017-01-04 01:34:04
##  5 Registrati~ 5       r1       5           start            2017-01-04 16:07:47
##  6 Registrati~ 6       r1       6           start            2017-01-04 16:07:47
##  7 Registrati~ 7       r1       7           start            2017-01-05 04:56:11
##  8 Registrati~ 8       r1       8           start            2017-01-05 04:56:11
##  9 Registrati~ 9       r1       9           start            2017-01-06 05:58:54
## 10 Registrati~ 10      r1       10          start            2017-01-06 05:58:54
## # ... with 5,432 more rows, and 1 more variable: .order <int>

Note that the notation is analogous to select_ids: specify the id functions, without quotation marks or brackets.

Warning: grouping on classifier variables should currently not be combined with other bupaR functions, as this might create a conflict. For example, the following will not work as intended. The activities() function needs the case identifier to work, but cannot access it because the data is grouped on that variable. This might be improved in the future, either by ignoring the grouping in that case or by making sure there is no conflict.

patients %>%
    group_by_case() %>%
    activities()
## # A tibble: 2,721 x 4
##    patient handling              absolute_frequency relative_frequency
##    <chr>   <fct>                              <int>              <dbl>
##  1 1       Blood test                             1              0.167
##  2 1       Check-out                              1              0.167
##  3 1       Discuss Results                        1              0.167
##  4 1       MRI SCAN                               1              0.167
##  5 1       Registration                           1              0.167
##  6 1       Triage and Assessment                  1              0.167
##  7 10      Check-out                              1              0.2  
##  8 10      Discuss Results                        1              0.2  
##  9 10      Registration                           1              0.2  
## 10 10      Triage and Assessment                  1              0.2  
## # ... with 2,711 more rows

Nonetheless, the group_by_x functions are useful in other situations, for instance for data preprocessing. The following code creates a new variable that indicates whether the patient had an MRI scan. (Because of the grouping on case, the any() function looks at each case individually to check whether any activity was an MRI SCAN.)

patients %>%
    group_by_case() %>%
    mutate(had_mri = any(handling == "MRI SCAN"))
## # Groups: [patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 x 8
##    handling    patient employee handling_id registration_ty~ time               
##    <fct>       <chr>   <fct>    <chr>       <fct>            <dttm>             
##  1 Registrati~ 1       r1       1           start            2017-01-02 11:41:53
##  2 Registrati~ 2       r1       2           start            2017-01-02 11:41:53
##  3 Registrati~ 3       r1       3           start            2017-01-04 01:34:05
##  4 Registrati~ 4       r1       4           start            2017-01-04 01:34:04
##  5 Registrati~ 5       r1       5           start            2017-01-04 16:07:47
##  6 Registrati~ 6       r1       6           start            2017-01-04 16:07:47
##  7 Registrati~ 7       r1       7           start            2017-01-05 04:56:11
##  8 Registrati~ 8       r1       8           start            2017-01-05 04:56:11
##  9 Registrati~ 9       r1       9           start            2017-01-06 05:58:54
## 10 Registrati~ 10      r1       10          start            2017-01-06 05:58:54
## # ... with 5,432 more rows, and 2 more variables: .order <int>, had_mri <lgl>
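The same pattern applies to other per-case aggregates. As a minimal sketch, assuming the patients event log from eventdataR and dplyr's n() function, the following adds the number of events in each case as a new variable. The column name n_events is only illustrative.

```r
library(bupaR)
library(eventdataR)  # assumed source of the `patients` event log
library(dplyr)

# Because of the grouping on case, n() counts the events per case
# rather than the events in the whole log.
patients_counted <- patients %>%
    group_by_case() %>%
    mutate(n_events = n())
```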