This page will guide you in creating your own event log objects. First, the data model for events is introduced. Second, it is shown how to create eventlog objects. Finally, some common data transformations are shown which might be useful for reshaping event data.
The notion of an event log in bupaR refers to a set of events which are recorded in the context of a process. For instance, suppose the process under consideration takes place in the emergency department of a hospital. A general representation of the data model is shown below. Firstly, each event belongs to a case. A case is, in general, an instance of the process. In the emergency department example, a case would be a visit by a patient.
Each event relates to the coarser concept of an activity. For instance, activities in our example might be: check-in, surgery, treatment, etc. When an activity is performed, this means that an activity instance is created. While the label surgery refers to an activity, one specific surgery for a specific patient at a specific point in time is an activity instance.
patient | activity |
---|---|
John Doe | check-in |
John Doe | surgery |
John Doe | treatment |
John Doe | surgery |
John Doe | check-out |
The table above shows a fictitious example of a patient who went through 4 different activities. Note that there are 5 different activity instances, as there were two instances of the “surgery” activity.
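The difference between activities and activity instances can be checked with a few lines of base R. The sketch below rebuilds the table above as a data.frame and counts both; the object name visits is chosen here for illustration only and is not part of bupaR.

```r
# Hypothetical data.frame mirroring the table above: one row per activity instance.
visits <- data.frame(
  patient  = rep("John Doe", 5),
  activity = c("check-in", "surgery", "treatment", "surgery", "check-out"),
  stringsAsFactors = FALSE
)

# Number of distinct activity labels
n_activities <- length(unique(visits$activity))

# Number of activity instances (each row is one instance in this table)
n_instances <- nrow(visits)

# How often each activity was instantiated
instances_per_activity <- table(visits$activity)

print(n_activities)
print(n_instances)
print(instances_per_activity)
```

The counts differ because the surgery activity occurs twice for this patient.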
An event is an atomic registration related to an activity instance. It thus contains one (and only one) timestamp. Additionally, the event should include a reference to a life cycle transition. More specifically, multiple events can describe different life cycle transitions of a single activity instance. For example, one event might record when a surgery is scheduled, another when it is started, yet another when it is completed, etc. The figure below shows the standard transactional life cycle. While it supports a wide variety of life cycle stages, users are free to define their own life cycle transitions.
Standard transactional life cycle.
In the table below, the earlier example is extended: for each activity instance, different statuses of the transactional life cycle can be seen, each of them with its own timestamp. At this point, each row refers to a single specific event. Note that not all activity instances have the same life cycle transitions, and even different instances of the same activity might have different recorded transitions. For example, in contrast with the first surgery, the second started without being scheduled, probably for reasons of urgency.
patient | activity | timestamp | status |
---|---|---|---|
John Doe | check-in | 2017-05-10 08:33:26 | complete |
John Doe | surgery | 2017-05-10 08:38:21 | schedule |
John Doe | surgery | 2017-05-10 08:53:16 | start |
John Doe | surgery | 2017-05-10 09:25:19 | complete |
John Doe | treatment | 2017-05-10 10:01:25 | start |
John Doe | treatment | 2017-05-10 10:35:18 | complete |
John Doe | surgery | 2017-05-10 10:41:35 | start |
John Doe | surgery | 2017-05-10 11:05:56 | complete |
John Doe | check-out | 2017-05-11 14:52:36 | complete |
In order to correlate events which belong to the same activity instance, an activity instance identifier is required. For example, it is possible that a patient has gone through different surgeries, each with its own schedule, start and complete events. The activity instance identifier then makes it possible to distinguish which events belong together and which do not. The activity instance identifier is always required and is especially important in case of concurrent activity instances. In the table below, the event-activity instance correlation is formally defined by the activity_instance column. It is important to note that this instance identifier should be unique across the whole log, that is, also among different cases and activities.
patient | activity | timestamp | status | activity_instance |
---|---|---|---|---|
John Doe | check-in | 2017-05-10 08:33:26 | complete | 1 |
John Doe | surgery | 2017-05-10 08:38:21 | schedule | 2 |
John Doe | surgery | 2017-05-10 08:53:16 | start | 2 |
John Doe | surgery | 2017-05-10 09:25:19 | complete | 2 |
John Doe | treatment | 2017-05-10 10:01:25 | start | 3 |
John Doe | treatment | 2017-05-10 10:35:18 | complete | 3 |
John Doe | surgery | 2017-05-10 10:41:35 | start | 4 |
John Doe | surgery | 2017-05-10 11:05:56 | complete | 4 |
John Doe | check-out | 2017-05-11 14:52:36 | complete | 5 |
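The uniqueness requirement can be verified before an event log is built. The following is a minimal base R sketch, not a bupaR function: it rebuilds the relevant columns of the table above in a data.frame called events (a name assumed here for illustration) and checks that no activity instance identifier is shared by more than one combination of case and activity.

```r
# Hypothetical event table mirroring the table above (timestamps omitted for brevity).
events <- data.frame(
  patient = rep("John Doe", 9),
  activity = c("check-in", "surgery", "surgery", "surgery", "treatment",
               "treatment", "surgery", "surgery", "check-out"),
  status = c("complete", "schedule", "start", "complete", "start",
             "complete", "start", "complete", "complete"),
  activity_instance = c(1, 2, 2, 2, 3, 3, 4, 4, 5),
  stringsAsFactors = FALSE
)

# Each activity instance id should map to exactly one (case, activity) pair.
pairs <- unique(events[, c("activity_instance", "patient", "activity")])
ambiguous_ids <- pairs$activity_instance[duplicated(pairs$activity_instance)]

# Group the recorded life cycle transitions by activity instance.
events_per_instance <- split(events$status, events$activity_instance)

print(ambiguous_ids)               # ids used for more than one case/activity
print(events_per_instance[["2"]])  # transitions recorded for the first surgery
```

An empty ambiguous_ids vector indicates that every identifier refers to a single activity instance.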
Finally, each event can also contain the notion of a resource. For instance, the administrative clerk who checked in a patient or scheduled their surgery, or the doctor who performed a treatment.
patient | activity | timestamp | status | activity_instance | resource |
---|---|---|---|---|---|
John Doe | check-in | 2017-05-10 08:33:26 | complete | 1 | Samantha |
John Doe | surgery | 2017-05-10 08:38:21 | schedule | 2 | Danny |
John Doe | surgery | 2017-05-10 08:53:16 | start | 2 | Richard |
John Doe | surgery | 2017-05-10 09:25:19 | complete | 2 | Richard |
John Doe | treatment | 2017-05-10 10:01:25 | start | 3 | Danny |
John Doe | treatment | 2017-05-10 10:35:18 | complete | 3 | Danny |
John Doe | surgery | 2017-05-10 10:41:35 | start | 4 | William |
John Doe | surgery | 2017-05-10 11:05:56 | complete | 4 | William |
John Doe | check-out | 2017-05-11 14:52:36 | complete | 5 | Samantha |
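To sum up, each row in the data should be an event with at least 6 different pieces of required information: a case identifier, an activity label, an activity instance identifier, a life cycle transition (status), a timestamp, and a resource.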
Additionally, any number of custom event attributes can be added, e.g. cost.
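As an illustration, the last table above could be stored in a data.frame as sketched below, using base R only. This is a reconstruction from the table, not the original data object: the helper parse_time is a hypothetical name, and the UTC time zone is an assumption, since the table does not state one.

```r
# Hypothetical helper: parse the timestamp strings of the table into POSIXct.
# The time zone (UTC) is an assumption; the source table does not specify one.
parse_time <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Reconstruction of the table above, one row per event.
example_log_1 <- data.frame(
  patient = "John Doe",
  activity = c("check-in", "surgery", "surgery", "surgery", "treatment",
               "treatment", "surgery", "surgery", "check-out"),
  timestamp = parse_time(c(
    "2017-05-10 08:33:26", "2017-05-10 08:38:21", "2017-05-10 08:53:16",
    "2017-05-10 09:25:19", "2017-05-10 10:01:25", "2017-05-10 10:35:18",
    "2017-05-10 10:41:35", "2017-05-10 11:05:56", "2017-05-11 14:52:36"
  )),
  status = c("complete", "schedule", "start", "complete", "start",
             "complete", "start", "complete", "complete"),
  activity_instance = c(1, 2, 2, 2, 3, 3, 4, 4, 5),
  resource = c("Samantha", "Danny", "Richard", "Richard", "Danny",
               "Danny", "William", "William", "Samantha"),
  stringsAsFactors = FALSE
)

# Check that the six required columns are all present.
required <- c("patient", "activity", "activity_instance",
              "status", "timestamp", "resource")
missing_columns <- setdiff(required, names(example_log_1))

str(example_log_1)
print(missing_columns)
```

A custom attribute such as cost would simply be an additional column in this data.frame.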
Given that the data is in the format discussed above and stored as a data.frame, an event log object can be created with the eventlog function from bupaR, as shown below.
library(bupaR)
example_log_1 %>% # a data.frame with the information in the table above
    eventlog(
        case_id = "patient",
        activity_id = "activity",
        activity_instance_id = "activity_instance",
        lifecycle_id = "status",
        timestamp = "timestamp",
        resource_id = "resource"
    )
## Event log consisting of:
## 9 events
## 1 traces
## 1 cases
## 4 activities
## 5 activity instances
##
## # A tibble: 9 x 7
## patient activity timestamp status activity_instan~ resource
## <chr> <fct> <dttm> <fct> <chr> <fct>
## 1 John Doe check-in 2017-05-10 08:33:26 comple~ 1 Samantha
## 2 John Doe surgery 2017-05-10 08:38:21 schedu~ 2 Danny
## 3 John Doe surgery 2017-05-10 08:53:16 start 2 Richard
## 4 John Doe surgery 2017-05-10 09:25:19 comple~ 2 Richard
## 5 John Doe treatment 2017-05-10 10:01:25 start 3 Danny
## 6 John Doe treatment 2017-05-10 10:35:18 comple~ 3 Danny
## 7 John Doe surgery 2017-05-10 10:41:35 start 4 William
## 8 John Doe surgery 2017-05-10 11:05:56 comple~ 4 William
## 9 John Doe check-out 2017-05-11 14:52:36 comple~ 5 Samantha
## # ... with 1 more variable: .order <int>
Often, data will not come in the format defined above, or will not include all the required values. A few such situations, and how to handle them, are given below.
Often, data is not recorded at the low level of transactions; instead, only a single timestamp is recorded for each activity instance. In that case, an event is equivalent to an activity instance. For instance, consider the example above, but now we only have the following information.
example_log_2
## # A tibble: 5 x 4
## patient activity timestamp resource
## <chr> <chr> <dttm> <chr>
## 1 John Doe check-in 2017-05-10 08:33:26 Samantha
## 2 John Doe surgery 2017-05-10 09:25:19 Richard
## 3 John Doe treatment 2017-05-10 10:35:18 Danny
## 4 John Doe surgery 2017-05-10 11:05:56 William
## 5 John Doe check-out 2017-05-11 14:52:36 Samantha
When this is the case, it requires domain knowledge to know which transition of the life cycle is recorded. However, most of the time it will be the completion of a task which is recorded. As such, the life cycle transition can be added manually, as well as the activity instance id, which is unique for each row.
library(dplyr) # provides mutate()
example_log_2 %>%
    # every row is a completed activity instance with its own identifier
    mutate(status = "complete",
           activity_instance = 1:nrow(.)) %>%
    eventlog(
        case_id = "patient",
        activity_id = "activity",
        activity_instance_id = "activity_instance",
        lifecycle_id = "status",
        timestamp = "timestamp",
        resource_id = "resource"
    )
## Event log consisting of:
## 5 events
## 1 traces
## 1 cases
## 4 activities
## 5 activity instances
##
## # A tibble: 5 x 7
## patient activity timestamp resource status activity_instan~
## <chr> <fct> <dttm> <fct> <fct> <chr>
## 1 John Doe check-in 2017-05-10 08:33:26 Samantha comple~ 1
## 2 John Doe surgery 2017-05-10 09:25:19 Richard comple~ 2
## 3 John Doe treatment 2017-05-10 10:35:18 Danny comple~ 3
## 4 John Doe surgery 2017-05-10 11:05:56 William comple~ 4
## 5 John Doe check-out 2017-05-11 14:52:36 Samantha comple~ 5
## # ... with 1 more variable: .order <int>
Since many of the functions in bupaR are targeted towards organizational and performance issues, they expect the presence of the resource attribute. However, in certain cases this information will not be available, such as for the data in example_log_3.
example_log_3
## # A tibble: 9 x 5
## patient activity timestamp status activity_instance
## <chr> <chr> <dttm> <chr> <dbl>
## 1 John Doe check-in 2017-05-10 08:33:26 complete 1.
## 2 John Doe surgery 2017-05-10 08:38:21 schedule 2.
## 3 John Doe surgery 2017-05-10 08:53:16 start 2.
## 4 John Doe surgery 2017-05-10 09:25:19 complete 2.
## 5 John Doe treatment 2017-05-10 10:01:25 start 3.
## 6 John Doe treatment 2017-05-10 10:35:18 complete 3.
## 7 John Doe surgery 2017-05-10 10:41:35 start 4.
## 8 John Doe surgery 2017-05-10 11:05:56 complete 4.
## 9 John Doe check-out 2017-05-11 14:52:36 complete 5.
In order to work around this problem, the easiest solution is to include an empty resource variable.
example_log_3 %>%
    mutate(resource = NA) %>%
    eventlog(
        case_id = "patient",
        activity_id = "activity",
        activity_instance_id = "activity_instance",
        lifecycle_id = "status",
        timestamp = "timestamp",
        resource_id = "resource"
    )
## Event log consisting of:
## 9 events
## 1 traces
## 1 cases
## 4 activities
## 5 activity instances
##
## # A tibble: 9 x 7
## patient activity timestamp status activity_instan~ resource
## <chr> <fct> <dttm> <fct> <chr> <fct>
## 1 John Doe check-in 2017-05-10 08:33:26 comple~ 1 <NA>
## 2 John Doe surgery 2017-05-10 08:38:21 schedu~ 2 <NA>
## 3 John Doe surgery 2017-05-10 08:53:16 start 2 <NA>
## 4 John Doe surgery 2017-05-10 09:25:19 comple~ 2 <NA>
## 5 John Doe treatment 2017-05-10 10:01:25 start 3 <NA>
## 6 John Doe treatment 2017-05-10 10:35:18 comple~ 3 <NA>
## 7 John Doe surgery 2017-05-10 10:41:35 start 4 <NA>
## 8 John Doe surgery 2017-05-10 11:05:56 comple~ 4 <NA>
## 9 John Doe check-out 2017-05-11 14:52:36 comple~ 5 <NA>
## # ... with 1 more variable: .order <int>
Another possibility is that, instead of a list of events, a list of activity instances is available. This is the case in example_log_4.
example_log_4
## # A tibble: 5 x 5
## patient activity schedule start
## <chr> <chr> <dttm> <dttm>
## 1 John Doe check-in NA NA
## 2 John Doe check-out NA NA
## 3 John Doe surgery 2017-05-10 08:38:21 2017-05-10 08:53:16
## 4 John Doe surgery NA 2017-05-10 10:41:35
## 5 John Doe treatment NA 2017-05-10 10:01:25
## # ... with 1 more variable: complete <dttm>
When this is the case, we proceed by first adding a unique id to define the activity instances, and subsequently by gathering the different timestamp columns using tidyr::gather.
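library(dplyr) # provides mutate() and filter()
library(tidyr) # provides gather()
example_log_4 %>%
    mutate(activity_instance = 1:nrow(.)) %>%
    # one row per (activity instance, life cycle transition)
    gather(status, timestamp, schedule, start, complete) %>%
    mutate(resource = NA) %>%
    # drop transitions which were never recorded
    filter(!is.na(timestamp)) %>%
    eventlog(
        case_id = "patient",
        activity_id = "activity",
        activity_instance_id = "activity_instance",
        lifecycle_id = "status",
        timestamp = "timestamp",
        resource_id = "resource"
    )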
## Event log consisting of:
## 9 events
## 1 traces
## 1 cases
## 4 activities
## 5 activity instances
##
## # A tibble: 9 x 7
## patient activity activity_instan~ status timestamp resource
## <chr> <fct> <chr> <fct> <dttm> <fct>
## 1 John Doe surgery 3 schedu~ 2017-05-10 08:38:21 <NA>
## 2 John Doe surgery 3 start 2017-05-10 08:53:16 <NA>
## 3 John Doe surgery 4 start 2017-05-10 10:41:35 <NA>
## 4 John Doe treatment 5 start 2017-05-10 10:01:25 <NA>
## 5 John Doe check-in 1 comple~ 2017-05-10 08:33:26 <NA>
## 6 John Doe check-out 2 comple~ 2017-05-11 14:52:36 <NA>
## 7 John Doe surgery 3 comple~ 2017-05-10 09:25:19 <NA>
## 8 John Doe surgery 4 comple~ 2017-05-10 11:05:56 <NA>
## 9 John Doe treatment 5 comple~ 2017-05-10 10:35:18 <NA>
## # ... with 1 more variable: .order <int>
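In tidyr 1.0.0 and later, gather is superseded by pivot_longer. The sketch below shows the same reshaping step with pivot_longer, assuming dplyr and tidyr are installed. The activity_instances tibble is a reconstruction of example_log_4 from the output above, the helper parse_time is a hypothetical name, and the UTC time zone is an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper; the time zone (UTC) is an assumption.
parse_time <- function(x) as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Reconstruction of example_log_4: one row per activity instance.
activity_instances <- tibble(
  patient  = "John Doe",
  activity = c("check-in", "check-out", "surgery", "surgery", "treatment"),
  schedule = parse_time(c(NA, NA, "2017-05-10 08:38:21", NA, NA)),
  start    = parse_time(c(NA, NA, "2017-05-10 08:53:16",
                          "2017-05-10 10:41:35", "2017-05-10 10:01:25")),
  complete = parse_time(c("2017-05-10 08:33:26", "2017-05-11 14:52:36",
                          "2017-05-10 09:25:19", "2017-05-10 11:05:56",
                          "2017-05-10 10:35:18"))
)

events_long <- activity_instances %>%
  # one identifier per activity instance (per row)
  mutate(activity_instance = row_number()) %>%
  # one row per recorded life cycle transition
  pivot_longer(
    cols = c(schedule, start, complete),
    names_to = "status",
    values_to = "timestamp",
    values_drop_na = TRUE  # takes the place of filter(!is.na(timestamp))
  ) %>%
  # placeholder resource column, typed as character
  mutate(resource = NA_character_)

print(events_long)
```

The resulting events_long can be passed to eventlog with the same arguments as before. Its rows are ordered per activity instance rather than per status, which differs from the gather output above but describes the same events.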