Next to the specifically supported preprocessing steps (aggregating, subsetting and enriching), event logs can also be manipulated in a more generic way. In order to do so, well-known dplyr-verbs have been adapted to event log objects and other extensions have been made.

group_by

Using the group_by function, event logs can be grouped according to a (set of) variables, such that all further computations happen for each of these different groups.

In the next example, the number of cases are computed for each value of “vehicleclass”.

traffic_fines %>%
    group_by(vehicleclass) %>%
    n_cases()
## # A tibble: 4 x 2
##   vehicleclass n_cases
##   <chr>          <int>
## 1 A               8569
## 2 C                 17
## 3 M                  5
## 4 <NA>            8591

Predefined grouping functions

For specific groupings, some auxiliary functions are available.

  • group_by_case - group by cases
  • group_by_activity - group by activity types
  • group_by_resource - group by resources
  • group_by_activity_resource - group by activity resource pair
  • group_by_activity_instance - group by activity instances.

For example, the number of cases in which a specific resource occurs, can be computed as follows.

sepsis %>%
    group_by_resource %>%
    n_cases
## # A tibble: 26 x 2
##    resource n_cases
##    <fct>      <int>
##  1 ?            294
##  2 A            985
##  3 B           1013
##  4 C           1050
##  5 D             46
##  6 E            782
##  7 F            200
##  8 G            147
##  9 H             50
## 10 I            118
## # ... with 16 more rows

Note that each of the descriptive metrics discussed here can be rewritten using these lower-level functions. The example above is equal to the resource_involvement metric at case level.

Remove grouping

When a grouping is no longer needed, it can be removed using the ungroup_eventlog function.

mutate

You can use mutate to add new variable to an event log, possibly by using existing variables. In the next example, the total amount of lacticacid is computed for each case.

sepsis %>%
    group_by_case() %>%
    mutate(total_lacticacid = sum(lacticacid, na.rm = T))
## # Groups: [case_id]
## Grouped Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 x 35
##    case_id activity lifecycle resource timestamp             age   crp
##    <chr>   <fct>    <fct>     <fct>    <dttm>              <int> <dbl>
##  1 A       ER Regi~ complete  A        2014-10-22 11:15:41    85    NA
##  2 A       Leucocy~ complete  B        2014-10-22 11:27:00    NA    NA
##  3 A       CRP      complete  B        2014-10-22 11:27:00    NA   210
##  4 A       LacticA~ complete  B        2014-10-22 11:27:00    NA    NA
##  5 A       ER Tria~ complete  C        2014-10-22 11:33:37    NA    NA
##  6 A       ER Seps~ complete  A        2014-10-22 11:34:00    NA    NA
##  7 A       IV Liqu~ complete  A        2014-10-22 14:03:47    NA    NA
##  8 A       IV Anti~ complete  A        2014-10-22 14:03:47    NA    NA
##  9 A       Admissi~ complete  D        2014-10-22 14:13:19    NA    NA
## 10 A       CRP      complete  B        2014-10-24 09:00:00    NA  1090
## # ... with 15,204 more rows, and 28 more variables: diagnose <chr>,
## #   diagnosticartastrup <chr>, diagnosticblood <chr>, diagnosticecg <chr>,
## #   diagnosticic <chr>, diagnosticlacticacid <chr>,
## #   diagnosticliquor <chr>, diagnosticother <chr>, diagnosticsputum <chr>,
## #   diagnosticurinaryculture <chr>, diagnosticurinarysediment <chr>,
## #   diagnosticxthorax <chr>, disfuncorg <chr>, hypotensie <chr>,
## #   hypoxie <chr>, infectionsuspected <chr>, infusion <chr>,
## #   lacticacid <dbl>, leucocytes <chr>, oligurie <chr>,
## #   sirscritheartrate <chr>, sirscritleucos <chr>,
## #   sirscrittachypnea <chr>, sirscrittemperature <chr>,
## #   sirscriteria2ormore <chr>, activity_instance_id <chr>, .order <int>,
## #   total_lacticacid <dbl>

filter

Generic filtering of events can be done using the filter function, which takes an event log and any number of logical conditions. The example below filters events which have vehicleclas “C” and amount greater than 300. More process-specific filtering methods can be found here.

traffic_fines %>%
    filter(vehicleclass == "C", amount > 300)
## Log of 17 events consisting of:
## 1 trace 
## 17 cases 
## 17 instances of 1 activity 
## 9 resources 
## Events occurred from 2006-08-10 until 2008-02-09 
##  
## Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 17 x 18
##    case_id activity lifecycle resource timestamp           amount article
##    <chr>   <fct>    <fct>     <fct>    <dttm>               <dbl>   <int>
##  1 A10060  Create ~ complete  541      2007-03-08 00:00:00    360     157
##  2 A10497  Create ~ complete  558      2007-03-30 00:00:00    360     157
##  3 A10818  Create ~ complete  561      2007-04-08 00:00:00    360     157
##  4 A11707  Create ~ complete  550      2007-04-24 00:00:00    360     157
##  5 A1408   Create ~ complete  559      2006-08-20 00:00:00    350     157
##  6 A14883  Create ~ complete  561      2007-06-29 00:00:00    360     157
##  7 A17130  Create ~ complete  541      2007-07-15 00:00:00    360     157
##  8 A1815   Create ~ complete  563      2006-08-10 00:00:00    350     157
##  9 A19109  Create ~ complete  556      2007-07-17 00:00:00    360     157
## 10 A23000  Create ~ complete  550      2007-12-29 00:00:00    360     157
## 11 A24247  Create ~ complete  561      2007-12-03 00:00:00    360     157
## 12 A24366  Create ~ complete  541      2008-02-09 00:00:00    360     157
## 13 A24634  Create ~ complete  537      2007-11-21 00:00:00    360     157
## 14 A24942  Create ~ complete  561      2007-12-30 00:00:00    360     157
## 15 A25581  Create ~ complete  559      2007-11-23 00:00:00    360     157
## 16 A26099  Create ~ complete  559      2007-12-09 00:00:00    360     157
## 17 A26277  Create ~ complete  538      2008-01-07 00:00:00    360     157
## # ... with 11 more variables: dismissal <chr>, expense <dbl>,
## #   lastsent <chr>, matricola <chr>, notificationtype <chr>,
## #   paymentamount <dbl>, points <int>, totalpaymentamount <chr>,
## #   vehicleclass <chr>, activity_instance_id <chr>, .order <int>

select

Variables on a event log can be selected using select. By default, select will always make sure that the mapping-variables are retained. Otherwise, it would no longer function as an eventlog.

traffic_fines %>%
    select(vehicleclass)
## Log of 27001 events consisting of:
## 3 traces 
## 8591 cases 
## 27001 instances of 6 activities 
## 15 resources 
## Events occurred from 2006-06-17 until 2012-03-26 
##  
## Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 27,001 x 8
##    vehicleclass case_id activity activity_instan~ timestamp          
##    <chr>        <chr>   <fct>    <chr>            <dttm>             
##  1 A            A1      Create ~ 1                2006-07-24 00:00:00
##  2 <NA>         A1      Send Fi~ 2                2006-12-05 00:00:00
##  3 A            A100    Create ~ 3                2006-08-02 00:00:00
##  4 <NA>         A100    Send Fi~ 4                2006-12-12 00:00:00
##  5 <NA>         A100    Insert ~ 5                2007-01-15 00:00:00
##  6 <NA>         A100    Add pen~ 6                2007-03-16 00:00:00
##  7 <NA>         A100    Send fo~ 7                2009-03-30 00:00:00
##  8 A            A10004  Create ~ 19               2007-03-20 00:00:00
##  9 <NA>         A10004  Send Fi~ 20               2007-07-17 00:00:00
## 10 <NA>         A10004  Insert ~ 21               2007-07-24 00:00:00
## # ... with 26,991 more rows, and 3 more variables: resource <fct>,
## #   lifecycle <fct>, .order <int>

By setting the argument force_df = TRUE, the mapping-variables will not be retained, and the output will be a data.frame, and not an event log.

traffic_fines %>%
    select(case_id, vehicleclass, amount, force_df = TRUE)
## # A tibble: 27,001 x 3
##    case_id vehicleclass amount
##    <chr>   <chr>         <dbl>
##  1 A1      A               350
##  2 A1      <NA>            350
##  3 A100    A               350
##  4 A100    <NA>            350
##  5 A100    <NA>            350
##  6 A100    <NA>            715
##  7 A100    <NA>            715
##  8 A10004  A               360
##  9 A10004  <NA>            360
## 10 A10004  <NA>            360
## # ... with 26,991 more rows

arrange

Event data can be sorted using the arrange function. desc can be used to sort descendingly on an attribute.

#sort descending on time
patients %>%
    arrange(desc(time))
## Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Triage ~ 500     r2       1000        complete        
##  2 Discuss~ 495     r6       2229        complete        
##  3 X-Ray    498     r5       1734        complete        
##  4 Triage ~ 500     r2       1000        start           
##  5 Triage ~ 499     r2       999         complete        
##  6 Discuss~ 495     r6       2229        start           
##  7 Discuss~ 489     r6       2223        complete        
##  8 X-Ray    498     r5       1734        start           
##  9 X-Ray    497     r5       1733        complete        
## 10 Discuss~ 489     r6       2223        start           
## # ... with 5,432 more rows, and 2 more variables: time <dttm>,
## #   .order <int>

slice

An eventlog can be sliced, which mean returning a slice, i.e. a subset, from the eventlog, based on row number. There are three ways to slice event logs

  • Using slice: take a slice of cases
  • Using slice_activities: take a slice of activity instances
  • Using slice_events: take a slice of events

The next piece of code returns the first 10 cases. Note that first here is defined by the current order of the data set, not by time.

patients %>%
    slice(1:10)
## Log of 110 events consisting of:
## 2 traces 
## 10 cases 
## 55 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-11 11:39:30 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 110 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Registr~ 3       r1       3           start           
##  4 Registr~ 4       r1       4           start           
##  5 Registr~ 5       r1       5           start           
##  6 Registr~ 6       r1       6           start           
##  7 Registr~ 7       r1       7           start           
##  8 Registr~ 8       r1       8           start           
##  9 Registr~ 9       r1       9           start           
## 10 Registr~ 10      r1       10          start           
## # ... with 100 more rows, and 2 more variables: time <dttm>, .order <int>

slice_activities

The next piece of code returns the first 10 activity instances.

patients %>%
    slice_activities(1:10)
## Log of 20 events consisting of:
## 1 trace 
## 10 cases 
## 10 instances of 1 activity 
## 1 resource 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-06 09:13:28 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 20 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Registr~ 3       r1       3           start           
##  4 Registr~ 4       r1       4           start           
##  5 Registr~ 5       r1       5           start           
##  6 Registr~ 6       r1       6           start           
##  7 Registr~ 7       r1       7           start           
##  8 Registr~ 8       r1       8           start           
##  9 Registr~ 9       r1       9           start           
## 10 Registr~ 10      r1       10          start           
## 11 Registr~ 1       r1       1           complete        
## 12 Registr~ 2       r1       2           complete        
## 13 Registr~ 3       r1       3           complete        
## 14 Registr~ 4       r1       4           complete        
## 15 Registr~ 5       r1       5           complete        
## 16 Registr~ 6       r1       6           complete        
## 17 Registr~ 7       r1       7           complete        
## 18 Registr~ 8       r1       8           complete        
## 19 Registr~ 9       r1       9           complete        
## 20 Registr~ 10      r1       10          complete        
## # ... with 2 more variables: time <dttm>, .order <int>

slice_events

The next piece of code returns the first 10 events.

patients %>% 
    slice_events(1:10)
## Log of 10 events consisting of:
## 1 trace 
## 10 cases 
## 10 instances of 1 activity 
## 1 resource 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-06 05:58:54 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 10 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Registr~ 3       r1       3           start           
##  4 Registr~ 4       r1       4           start           
##  5 Registr~ 5       r1       5           start           
##  6 Registr~ 6       r1       6           start           
##  7 Registr~ 7       r1       7           start           
##  8 Registr~ 8       r1       8           start           
##  9 Registr~ 9       r1       9           start           
## 10 Registr~ 10      r1       10          start           
## # ... with 2 more variables: time <dttm>, .order <int>

first_n, last_n

The slice function select events, cases or activity instances based on their current position in the event data. As such, the result can be changed using the arrange function. More often, we want to select the first n activity instances, or the last ones. This is achieved with the first_n or last_n functions, which return the first, resp. last, n activity instances of a log based on time, not on position.

patients %>% 
    first_n(n = 5)
## Log of 10 events consisting of:
## 2 traces 
## 3 cases 
## 5 instances of 2 activities 
## 2 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-04 04:25:06 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 10 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Triage ~ 1       r2       501         start           
##  4 Registr~ 1       r1       1           complete        
##  5 Registr~ 2       r1       2           complete        
##  6 Triage ~ 2       r2       502         start           
##  7 Triage ~ 1       r2       501         complete        
##  8 Triage ~ 2       r2       502         complete        
##  9 Registr~ 4       r1       4           start           
## 10 Registr~ 4       r1       4           complete        
## # ... with 2 more variables: time <dttm>, .order <int>

This is not impacted by a different ordering of the data since it will take the time aspect into account.

patients %>%
    arrange(desc(time)) %>%
    first_n(n = 5)
## Log of 10 events consisting of:
## 2 traces 
## 3 cases 
## 5 instances of 2 activities 
## 2 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-04 04:25:06 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 10 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 1       r1       1           start           
##  2 Registr~ 2       r1       2           start           
##  3 Triage ~ 1       r2       501         start           
##  4 Registr~ 1       r1       1           complete        
##  5 Registr~ 2       r1       2           complete        
##  6 Triage ~ 2       r2       502         start           
##  7 Triage ~ 1       r2       501         complete        
##  8 Triage ~ 2       r2       502         complete        
##  9 Registr~ 4       r1       4           start           
## 10 Registr~ 4       r1       4           complete        
## # ... with 2 more variables: time <dttm>, .order <int>

Incombination with group_by_case, it is very easy to select the heads or tails of each case. Below, we explore the 95% most common first 3 activities in the sepsis log.

sepsis %>%
    group_by_case() %>%
    first_n(3) %>%
    trace_explorer(coverage = 0.95)
## Warning: `cols` is now required.
## Please use `cols = c(data)`

sample_n

The sample_n function allows to take a sample of the event log containing n cases. The code below returns a sample of 10 patients.

patients %>%
    sample_n(size = 10)
## Log of 108 events consisting of:
## 2 traces 
## 10 cases 
## 54 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-05 04:56:11 until 2018-03-22 22:55:15 
##  
## Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 108 x 7
##    handling patient employee handling_id registration_ty~
##    <fct>    <chr>   <fct>    <chr>       <fct>           
##  1 Registr~ 8       r1       8           start           
##  2 Registr~ 16      r1       16          start           
##  3 Registr~ 46      r1       46          start           
##  4 Registr~ 64      r1       64          start           
##  5 Registr~ 101     r1       101         start           
##  6 Registr~ 105     r1       105         start           
##  7 Registr~ 245     r1       245         start           
##  8 Registr~ 266     r1       266         start           
##  9 Registr~ 408     r1       408         start           
## 10 Registr~ 460     r1       460         start           
## # ... with 98 more rows, and 2 more variables: time <dttm>, .order <int>

Note that this function can also be used with a sample size bigger than the number of cases in the event log, if you allow for the replacements of drawn cases.

A more extensive list of subsetting methods is provided by edeaR. Look here for more information.