Introduction

Despite the extensive opportunities that process mining techniques provide, the "garbage in, garbage out" principle still applies. Data quality issues are widespread in real-life data and can lead to misleading results when that data is analysed. DaQAPO (Data Quality Assessment for Process-Oriented data) provides a set of assessment functions to identify a wide array of quality issues.

The table below summarizes the data quality assessment tests available in daqapo. Each test is then briefly demonstrated.

| Function name | Description | Output |
|---|---|---|
| detect_activity_frequency_violations | Function that detects activity frequency anomalies per case | Summary in console + Returns activities in cases which are executed too many times |
| detect_activity_order_violations | Function detecting violations in activity order | Summary in console + Returns detected orders which violate the specified order |
| detect_attribute_dependencies | Function detecting violations of dependencies between attributes (i.e. condition(s) that should hold when (an)other condition(s) hold(s)) | Summary in console + Returns rows with dependency violations |
| detect_case_id_sequence_gaps | Function detecting gaps in the sequence of case identifiers | Summary in console + Returns case IDs which should be expected to be present |
| detect_conditional_activity_presence | Function detecting violations of conditional activity presence (i.e. activity/activities that should be present when (a) particular condition(s) hold(s)) | Summary in console + Returns cases violating conditional activity presence |
| detect_duration_outliers | Function detecting duration outliers for a particular activity | Summary in console + Returns rows with outliers |
| detect_inactive_periods | Function detecting inactive periods, i.e. periods of time in which no activity executions/arrivals are recorded | Summary in console + Returns periods of inactivity |
| detect_incomplete_cases | Function detecting incomplete cases in terms of the activities that need to be recorded for a case | Summary in console + Returns traces in which the mentioned activities are not present |
| detect_incorrect_activity_names | Function returning the incorrect activity labels in the log | Summary in console + Returns rows with incorrect activities |
| detect_missing_values | Function detecting missing values at different levels of aggregation | Summary in console + Returns rows with NAs |
| detect_multiregistration | Function detecting the registration of a series of events in a short time period for the same case or by the same resource | Summary in console + Returns rows with multiregistration on resource or case level |
| detect_overlaps | Checks if a resource has performed two activities in parallel | Data frame containing the activities, the number of overlaps and average overlap in minutes |
| detect_related_activities | Function detecting missing related activities, i.e. activities that should be registered because another activity is registered for a case | Summary in console + Returns cases violating related activities |
| detect_similar_labels | Function detecting potential spelling mistakes | Table showing similarities for each label |
| detect_time_anomalies | Function detecting activity executions with negative or zero duration | Summary in console + Returns rows with negative or zero durations |
| detect_unique_values | Function listing all distinct combinations of the given log attributes | Summary in console + Returns all unique combinations of values in given columns |
| detect_value_range_violations | Function detecting violations of the range of acceptable values | Summary in console + Returns rows with value range infringements |

In the examples below, we use the dataset hospital_actlog, which is an artificial event log with data quality issues provided by daqapo.

Activity Frequency Violations

hospital_actlog %>%
  detect_activity_frequency_violations("Registration" = 1,
                                       "Clinical exam" = 1)
## *** OUTPUT ***
## For 3 cases in the activity log (13.6363636363636%) an anomaly is detected.
## The anomalies are spread over the following cases:
## # A tibble: 3 x 3
##   patient_visit_nr activity          n
##              <dbl> <chr>         <int>
## 1              518 Registration      3
## 2              512 Clinical exam     2
## 3              535 Registration      2
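
Conceptually, this check counts how often each activity occurs per case and compares that count with an allowed maximum. The base R sketch below illustrates that idea only; the data frame toy_log and the vector max_allowed are hypothetical, and daqapo's own implementation may differ.

```r
# Hypothetical toy log: one row per activity execution.
toy_log <- data.frame(
  case     = c(1, 1, 1, 2, 2, 3, 3, 3),
  activity = c("Registration", "Registration", "Triage",
               "Registration", "Triage",
               "Registration", "Triage", "Triage"),
  stringsAsFactors = FALSE
)

# Allowed maximum number of executions per case (hypothetical thresholds).
max_allowed <- c("Registration" = 1, "Triage" = 1)

# Count executions per case/activity combination.
counts <- aggregate(n ~ case + activity,
                    data = transform(toy_log, n = 1L),
                    FUN = sum)

# Keep the combinations that exceed the allowed maximum.
too_many   <- counts$n > max_allowed[as.character(counts$activity)]
violations <- counts[too_many, ]
violations <- violations[order(violations$case), ]
print(violations)
```

In this toy log, case 1 registers twice and case 3 is triaged twice, so both combinations are reported.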

Activity Order Violations

hospital_actlog %>%
  detect_activity_order_violations(activity_order = c("Registration", "Triage", "Clinical exam",
                                                      "Treatment", "Treatment evaluation"))
## Warning in detect_activity_order_violations.activitylog(., activity_order =
## c("Registration", : Some activity instances within the same case overlap. Use
## detect_overlaps to investigate further.
## Warning in detect_activity_order_violations.activitylog(., activity_order
## = c("Registration", : Not all specified activities occur in each case. Use
## detect_incomplete_cases to investigate further.
## Selected timestamp parameter value: both
## *** OUTPUT ***
## It was checked whether the activity order Registration - Triage - Clinical exam - Treatment - Treatment evaluation is respected.
## This activity order is respected for 18 (81.82%) of the cases and not for4 (18.18%) of the cases.
## For cases for which the aformentioned activity order is not respected, the following order is detected (ordered by decreasing frequeny of occurrence):
## # A tibble: 4 x 3
##   activity_list                                                       n case_ids
##   <chr>                                                           <int> <chr>   
## 1 Registration - Registration - Registration                          1 518     
## 2 Registration - Registration - Triage - Clinical exam - Treatme~     1 535     
## 3 Registration - Triage - Clinical exam - Clinical exam               1 512     
## 4 Triage - Registration                                               1 521

Attribute Dependencies

hospital_actlog %>% 
  detect_attribute_dependencies(antecedent = activity == "Registration",
                                consequent = startsWith(originator,"Clerk"))
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~activity == "Registration" hold(s), then ~startsWith(originator, "Clerk") should also hold.
## This statement holds for 12 (85.71%) of the rows in the activity log for which the first condition(s) hold and does not hold for 2 (14.29%) of these rows.
## For the following rows, the first condition(s) hold(s), but the second condition does not:
## # A tibble: 2 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              528 Registr~ Nurse 6    2017-11-21 18:10:17 2017-11-21 18:15:04
## 2              534 Registr~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>

Case ID Sequence Gaps

hospital_actlog %>%
  detect_case_id_sequence_gaps()
## *** OUTPUT ***
## It was checked whether there are gaps in the sequence of case IDs
## From the 27 expected cases in the activity log, ranging from 510 to 536, 5 (18.52%) are missing.
## These case numbers are:
##   case present
## 1  511   FALSE
## 2  513   FALSE
## 3  514   FALSE
## 4  515   FALSE
## 5  516   FALSE
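
The logic of this check can be approximated in base R: every integer between the smallest and largest observed case ID is expected, and IDs that are absent are reported. The sketch below reuses the case ID range and the missing IDs from the output above; the variable names are illustrative and this is not necessarily how daqapo implements it.

```r
# Case IDs as in the output above: 510 to 536, with five IDs absent.
observed_ids <- setdiff(510:536, c(511, 513:516))

# Every integer between the smallest and largest observed ID is expected.
expected_ids <- seq(min(observed_ids), max(observed_ids))

# IDs that are expected but not observed form the gaps.
missing_ids <- setdiff(expected_ids, observed_ids)
pct_missing <- round(100 * length(missing_ids) / length(expected_ids), 2)

print(missing_ids)
print(pct_missing)
```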

Conditional Activity Presence

hospital_actlog %>%
  detect_conditional_activity_presence(condition = specialization == "TRAU",
                                       activities = "Clinical exam")
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~specialization == "TRAU" hold(s), then activity/activities Clinical exam should be recorded
## The condition(s) hold(s) for 2 cases. From these cases:
## - the specified activity/activities is/are recorded for 2 case(s) (100%)
## - the specified activity/activities is/are not recorded for 0 case(s) (0%)

Duration Outliers

hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(bound_sd = 1))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment     Lower bound: 5.06   Upper bound: 22.2
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # A tibble: 1 x 13
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              523 Treatme~ Nurse 17   2017-11-21 18:26:04 2017-11-21 18:55:00
## # ... with 8 more variables: triagecode <dbl>, specialization <chr>,
## #   duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>,
## #   upper_bound <dbl>

Explicit bounds can be specified instead of a number of standard deviations:

hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(lower_bound = 0, upper_bound = 15))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment     Lower bound: 0      Upper bound: 15
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # A tibble: 1 x 13
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              523 Treatme~ Nurse 17   2017-11-21 18:26:04 2017-11-21 18:55:00
## # ... with 8 more variables: triagecode <dbl>, specialization <chr>,
## #   duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>,
## #   upper_bound <dbl>
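
When bound_sd is used, the bounds appear to be derived from the mean and standard deviation of the activity's durations (the returned columns mean, sd, lower_bound and upper_bound point in that direction). A minimal base R sketch of such a rule, using hypothetical durations, could look as follows; it is an illustration of the idea, not daqapo's code.

```r
# Hypothetical activity durations in minutes.
durations <- c(10, 12, 14, 11, 13, 40)

# Number of standard deviations around the mean that is still acceptable.
bound_sd <- 1

lower_bound <- mean(durations) - bound_sd * sd(durations)
upper_bound <- mean(durations) + bound_sd * sd(durations)

# Durations outside [lower_bound, upper_bound] are flagged as outliers.
outliers <- durations[durations < lower_bound | durations > upper_bound]
print(outliers)
```

With a small bound_sd, a single long execution is enough to be flagged; larger values make the check more tolerant.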

Inactive Periods

hospital_actlog %>%
  detect_inactive_periods(threshold = 30)
## Selected timestamp parameter value: both
## Selected inactivity type:arrivals
## *** OUTPUT ***
## Specified threshold of 30 minutes is violated 9 times.
## Threshold is violated in the following periods:
## # A tibble: 9 x 3
##   period_start        period_end          time_gap
##   <dttm>              <dttm>                 <dbl>
## 1 2017-11-20 10:20:06 2017-11-21 11:35:16   1515. 
## 2 2017-11-21 11:22:16 2017-11-21 11:59:41     37.4
## 3 2017-11-21 12:05:52 2017-11-21 13:43:16     97.4
## 4 2017-11-21 14:06:09 2017-11-21 15:12:17     66.1
## 5 2017-11-21 15:18:19 2017-11-21 16:42:08     83.8
## 6 2017-11-21 17:06:10 2017-11-21 18:02:10     56  
## 7 2017-11-21 18:15:04 2017-11-22 10:04:57    950. 
## 8 2017-11-22 10:32:56 2017-11-22 16:30:00    357. 
## 9 2017-11-22 17:00:00 2017-11-22 18:00:00     60
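
The principle behind this check is to sort the relevant timestamps and report every gap between consecutive timestamps that exceeds the threshold in minutes. The base R sketch below shows this with hypothetical timestamps; the UTC time zone and all variable names are assumptions made for the example, and which timestamps daqapo compares depends on its type and timestamp settings.

```r
# Hypothetical event timestamps; UTC is assumed for the example.
timestamps <- as.POSIXct(
  c("2017-11-21 10:00:00", "2017-11-21 10:10:00",
    "2017-11-21 11:00:00", "2017-11-21 11:20:00",
    "2017-11-21 13:00:00"),
  tz = "UTC"
)
threshold <- 30  # minutes

timestamps <- sort(timestamps)
n <- length(timestamps)

# Gap in minutes between each timestamp and the next one.
gaps <- data.frame(
  period_start = timestamps[-n],
  period_end   = timestamps[-1],
  time_gap     = as.numeric(difftime(timestamps[-1], timestamps[-n],
                                     units = "mins"))
)

# Periods in which the gap exceeds the threshold.
inactive <- gaps[gaps$time_gap > threshold, ]
print(inactive)
```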

Incomplete Cases

hospital_actlog %>%
  detect_incomplete_cases(activities = c("Registration", "Triage", "Clinical exam",
                                         "Treatment", "Treatment evaluation"))
## *** OUTPUT ***
## It was checked whether the activities Clinical exam, Registration, Treatment, Treatment evaluation, Triage are present for cases.
## These activities are present for 4 (39.62%) of the cases and are not present for 18 (60.38%) of the cases.
## Note: this function only checks the presence of activities for a particular case, not the completeness of these entries in the activity log or the order of activities.
## For cases for which the aforementioned activities are not all present, the following activities are recorded (ordered by decreasing frequeny of occurrence):
## # A tibble: 9 x 3
##   activity               n case_ids                                             
##   <chr>              <int> <chr>                                                
## 1 Triage                11 510 - 512 - 517 - 521 - 524 - 525 - 526 - 527 - 528 ~
## 2 Registration           9 512 - 518 - 518 - 518 - 521 - 522 - 527 - 528 - 534  
## 3 Clinical exam          5 512 - 510 - 527 - 528 - 512                          
## 4 Treatment evaluat~     2 529 - 532                                            
## 5 0                      1 533                                                  
## 6 registration           1 510                                                  
## 7 Trage                  1 520                                                  
## 8 Treatment              1 532                                                  
## 9 Triaga                 1 522

Incorrect Activity Names

hospital_actlog %>%
  detect_incorrect_activity_names(allowed_activities = c("Registration", "Triage", "Clinical exam",
                                                         "Treatment", "Treatment evaluation"))
## *** OUTPUT ***
## 4 out of 9 (44.44% ) activity labels are identified to be incorrect.
## These activity labels are:
## registration - Trage - Triaga - 0
## Given this information, 4 of 53 (7.55%) rows in the activity log are incorrect. These are the following:
## # A tibble: 4 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              510 registr~ Clerk 9    2017-11-20 10:18:17 2017-11-20 10:20:06
## 2              520 Trage    Nurse 17   2017-11-21 13:43:16 2017-11-21 13:39:00
## 3              522 Triaga   Nurse 5    2017-11-21 15:15:25 2017-11-21 15:18:04
## 4              533 0        <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>

Missing Values

hospital_actlog %>%
  detect_missing_values()
## Selected level of aggregation:overview
## *** OUTPUT ***
## Absolute number of missing values per column:
## Relative number of missing values per column (expressed as percentage):
## Overview of activity log rows which are incomplete:
##                   
## patient_visit_nr 0
## activity         0
## originator       2
## start            1
## complete         0
## triagecode       1
## specialization   0
##                          
## patient_visit_nr 0.000000
## activity         0.000000
## originator       3.773585
## start            1.886792
## complete         0.000000
## triagecode       1.886792
## specialization   0.000000
## # A tibble: 4 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              510 Clinica~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
## 2              533 0        <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## 3              534 Registr~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## 4              512 Clinica~ Doctor 7   NA                  2017-11-20 11:33:57
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>

Missing values can also be summarized per activity:

hospital_actlog %>%
  detect_missing_values(level_of_aggregation = "activity")
## Selected level of aggregation:activity
## *** OUTPUT ***
## Absolute number of missing values per column (per activity):
## Relative number of missing values per column (per activity, expressed as percentage):
## Overview of activity log rows which are incomplete:
## # A tibble: 9 x 7
##   activity  patient_visit_nr originator start complete triagecode specialization
##   <chr>                <int>      <int> <int>    <int>      <int>          <int>
## 1 0                        0          1     0        0          0              0
## 2 Clinical~                0          0     1        0          1              0
## 3 registra~                0          0     0        0          0              0
## 4 Registra~                0          1     0        0          0              0
## 5 Trage                    0          0     0        0          0              0
## 6 Treatment                0          0     0        0          0              0
## 7 Treatmen~                0          0     0        0          0              0
## 8 Triaga                   0          0     0        0          0              0
## 9 Triage                   0          0     0        0          0              0
## # A tibble: 9 x 7
##   activity  patient_visit_nr originator start complete triagecode specialization
##   <chr>                <dbl>      <dbl> <dbl>    <dbl>      <dbl>          <dbl>
## 1 0                        0     1      0            0      0                  0
## 2 Clinical~                0     0      0.111        0      0.111              0
## 3 registra~                0     0      0            0      0                  0
## 4 Registra~                0     0.0714 0            0      0                  0
## 5 Trage                    0     0      0            0      0                  0
## 6 Treatment                0     0      0            0      0                  0
## 7 Treatmen~                0     0      0            0      0                  0
## 8 Triaga                   0     0      0            0      0                  0
## 9 Triage                   0     0      0            0      0                  0
## # A tibble: 4 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              510 Clinica~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
## 2              533 0        <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## 3              534 Registr~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## 4              512 Clinica~ Doctor 7   NA                  2017-11-20 11:33:57
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>

The check can also be restricted to a single column:

hospital_actlog %>%
  detect_missing_values(level_of_aggregation = "column",
                        column = "triagecode")
## Selected level of aggregation:column
## *** OUTPUT ***
## Absolute number of missing values in columntriagecode:1
## Relative number of missing values in columntriagecode(expressed as percentage):1.88679245283019
## 
## Overview of activity log rows in whichtriagecodeis missing:
## # A tibble: 1 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              510 Clinica~ Doctor 7   2017-11-20 11:35:01 2017-11-20 11:36:09
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
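
The overview level reports three things: the absolute number of missing values per column, the same number as a percentage, and the rows that are incomplete. These quantities can be reproduced for any data frame with base R, as the sketch below shows for a hypothetical log fragment (toy_log is not part of daqapo).

```r
# Hypothetical log fragment with some missing entries.
toy_log <- data.frame(
  activity   = c("Registration", "Triage", "Clinical exam", "Treatment"),
  originator = c("Clerk 9", NA, "Doctor 7", NA),
  triagecode = c(3, 2, NA, 4),
  stringsAsFactors = FALSE
)

# Absolute and relative (percentage) number of NAs per column.
n_missing   <- colSums(is.na(toy_log))
pct_missing <- 100 * colMeans(is.na(toy_log))

# Rows with at least one missing value.
incomplete_rows <- toy_log[!complete.cases(toy_log), ]

print(n_missing)
print(pct_missing)
print(incomplete_rows)
```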

Multiregistration

hospital_actlog %>%
  detect_multiregistration(threshold_in_seconds = 10)
## Selected level of aggregation: resource
## Selected timestamp parameter value: complete
## *** OUTPUT ***
## Multi-registration is detected for 4 of the 12 resources (33.33%). These resources are:
## Doctor 7 - Nurse 5 - Nurse 27 - NA
## For the following rows in the activity log, multi-registration is detected:
## 
## # A tibble: 9 x 7
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              512 Clinica~ Doctor 7   2017-11-20 11:27:12 2017-11-20 11:33:57
## 2              512 Clinica~ Doctor 7   NA                  2017-11-20 11:33:57
## 3              524 Triage   Nurse 5    2017-11-21 17:04:03 2017-11-21 17:06:05
## 4              525 Triage   Nurse 5    2017-11-21 17:04:13 2017-11-21 17:06:08
## 5              526 Triage   Nurse 5    2017-11-21 17:04:15 2017-11-21 17:06:10
## 6              536 Triage   Nurse 27   2017-11-22 15:15:39 2017-11-22 15:25:01
## 7              536 Treatme~ Nurse 27   2017-11-22 15:15:41 2017-11-22 15:25:03
## 8              533 0        <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## 9              534 Registr~ <NA>       2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>

Overlaps

hospital_actlog %>%
  detect_overlaps()
## # A tibble: 7 x 4
##   activity_a    activity_b        n avg_overlap_mins
##   <chr>         <chr>         <int>            <dbl>
## 1 Clinical exam Treatment         2            8.17 
## 2 Registration  Clinical exam     1            1.9  
## 3 Registration  Triaga            1            2.65 
## 4 Registration  Triage            1            1.93 
## 5 Triage        Clinical exam     2            5.63 
## 6 Triage        Registration      1            0.817
## 7 Triage        Treatment         1            9.33

Similar Labels

hospital_actlog %>%
  detect_similar_labels(column_labels = "activity", max_edit_distance = 3)
## # A tibble: 5 x 3
##   column_labels labels       similar_to     
##   <chr>         <chr>        <chr>          
## 1 activity      registration Registration   
## 2 activity      Registration registration   
## 3 activity      Triage       Trage - Triaga 
## 4 activity      Trage        Triage - Triaga
## 5 activity      Triaga       Triage - Trage
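
Label similarity of this kind is usually based on an edit distance between strings. The sketch below uses base R's utils::adist (generalized Levenshtein distance) as a stand-in; whether daqapo uses the same distance measure internally is an assumption. The labels are taken from the example log.

```r
# Activity labels taken from the example log.
labels <- c("registration", "Registration", "Triage", "Trage", "Triaga",
            "Clinical exam", "Treatment")
max_edit_distance <- 3

# Pairwise generalized Levenshtein distances between all labels.
d <- utils::adist(labels)
dimnames(d) <- list(labels, labels)

# For each label, list the other labels within the maximum edit distance.
similar_to <- lapply(labels, function(l) {
  close <- d[l, ] > 0 & d[l, ] <= max_edit_distance
  labels[close]
})
names(similar_to) <- labels

# Show only labels that have at least one similar label.
print(similar_to[lengths(similar_to) > 0])
```

Because the comparison is case-sensitive, registration and Registration are one edit apart and are therefore reported as similar.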

Time Anomalies

hospital_actlog %>%
  detect_time_anomalies()
## Selected anomaly type: both
## *** OUTPUT ***
## For 5 rows in the activity log (9.43%), an anomaly is detected.
## The anomalies are spread over the activities as follows:
## Anomalies are found in the following rows:
## # A tibble: 3 x 3
## # Groups:   activity [3]
##   activity      type                  n
##   <chr>         <chr>             <int>
## 1 Registration  negative duration     3
## 2 Clinical exam zero duration         1
## 3 Trage         negative duration     1
## # A tibble: 5 x 9
##   patient_visit_nr activity originator start               complete           
##              <dbl> <chr>    <chr>      <dttm>              <dttm>             
## 1              518 Registr~ Clerk 12   2017-11-21 11:45:16 2017-11-21 11:22:16
## 2              518 Registr~ Clerk 6    2017-11-21 11:45:16 2017-11-21 11:22:16
## 3              518 Registr~ Clerk 9    2017-11-21 11:45:16 2017-11-21 11:22:16
## 4              520 Trage    Nurse 17   2017-11-21 13:43:16 2017-11-21 13:39:00
## 5              528 Clinica~ Doctor 1   2017-11-21 19:00:00 2017-11-21 19:00:00
## # ... with 4 more variables: triagecode <dbl>, specialization <chr>,
## #   duration <dbl>, type <chr>
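
A negative duration means that the complete timestamp lies before the start timestamp, and a zero duration means that both coincide. The base R sketch below recomputes this for three start/complete pairs taken from the outputs in this document (cases 510, 518 and 528); the UTC time zone is an assumption made for parsing, and the classification code is illustrative rather than daqapo's own.

```r
# Start/complete pairs for cases 510, 518 and 528; UTC is assumed.
toy_log <- data.frame(
  patient_visit_nr = c(510, 518, 528),
  start    = as.POSIXct(c("2017-11-20 10:18:17", "2017-11-21 11:45:16",
                          "2017-11-21 19:00:00"), tz = "UTC"),
  complete = as.POSIXct(c("2017-11-20 10:20:06", "2017-11-21 11:22:16",
                          "2017-11-21 19:00:00"), tz = "UTC")
)

# Duration in minutes; a complete timestamp before start gives a negative value.
toy_log$duration <- as.numeric(difftime(toy_log$complete, toy_log$start,
                                        units = "mins"))

# Label anomalies; regular rows get NA.
toy_log$type <- ifelse(toy_log$duration < 0, "negative duration",
                       ifelse(toy_log$duration == 0, "zero duration", NA))

anomalies <- toy_log[!is.na(toy_log$type), ]
print(anomalies)
```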

Unique Values

hospital_actlog %>%
  detect_unique_values(column_labels = "activity")
## *** OUTPUT ***
## Distinct entries are computed for the following columns: 
## activity
## # A tibble: 9 x 1
##   activity            
##   <chr>               
## 1 registration        
## 2 Registration        
## 3 Triage              
## 4 Clinical exam       
## 5 Trage               
## 6 Treatment           
## 7 Triaga              
## 8 Treatment evaluation
## 9 0

Several columns can be combined to list their distinct value combinations:

hospital_actlog %>%
  detect_unique_values(column_labels = c("activity", "originator"))
## *** OUTPUT ***
## Distinct entries are computed for the following columns: 
## activity - originator
## # A tibble: 22 x 2
##    activity      originator
##    <chr>         <chr>     
##  1 registration  Clerk 9   
##  2 Registration  Clerk 12  
##  3 Triage        Nurse 27  
##  4 Clinical exam Doctor 7  
##  5 Triage        Nurse 17  
##  6 Registration  Clerk 6   
##  7 Registration  Clerk 9   
##  8 Trage         Nurse 17  
##  9 Clinical exam Doctor 4  
## 10 Registration  Clerk 3   
## # ... with 12 more rows

Value Range Violations

hospital_actlog %>%
  detect_value_range_violations(triagecode = domain_numeric(from = 0, to = 5))
## *** OUTPUT ***
## The domain range for column triagecode is checked.
## Values allowed between 0 and 5
## The values fall within the specified domain range for 46 (86.79%) of the rows in the activity log and outside the domain range for 7 (13.21%) of these rows.
## 
## The following rows fall outside the specified domain range for indicated column:
## # A tibble: 7 x 8
##   column_checked patient_visit_nr activity originator start              
##   <chr>                     <dbl> <chr>    <chr>      <dttm>             
## 1 triagecode                  510 Clinica~ Doctor 7   2017-11-20 11:35:01
## 2 triagecode                  529 Treatme~ Doctor 1   2017-11-22 16:30:00
## 3 triagecode                  530 Triage   Nurse 17   2017-11-22 18:00:00
## 4 triagecode                  531 Triage   Nurse 17   2017-11-22 18:05:00
## 5 triagecode                  532 Treatme~ Nurse 17   2017-11-22 18:15:00
## 6 triagecode                  532 Treatme~ Doctor 7   2017-11-22 18:27:00
## 7 triagecode                  533 0        <NA>       2017-11-22 18:35:00
## # ... with 3 more variables: complete <dttm>, triagecode <dbl>,
## #   specialization <chr>
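
The output above includes the row for case 510, whose triagecode is reported as missing in the Missing Values section, which suggests that missing values are also counted as violations. The base R sketch below makes that assumption explicit for a numeric range check; the vector and bounds are hypothetical.

```r
# Hypothetical triage codes; NA marks a missing value.
triagecode <- c(3, 2, NA, 7, 0, 5, -1)
from <- 0
to   <- 5

# A value violates the range when it is missing or lies outside [from, to].
violation <- is.na(triagecode) | triagecode < from | triagecode > to

# Positions of the violating values.
print(which(violation))
```

The boundary values 0 and 5 are treated as valid here, which matches an inclusive reading of "between 0 and 5".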