mycsg SASnR2 2300_Handling-duplicates

Lesson Description

Sometimes, we want to work with the concept of "Handling duplicates" in a clear, repeatable way.
This lesson walks through a simple example and shows the key steps.
We will see one approach on how we can do it in SAS and R.

SAS (Base SAS)

data AE;
 infile datalines dlm='|' dsd missover; 
input subjid : $10. term : $20. stdtc : $10.; 
label ;
 format ;
 datalines4; 
1001|Headache|2010-01-05
1001|Headache|2010-01-08
1001|Nausea|2010-01-08
1002|Headache|2010-02-15
;;;;
 run; 

data AE1;
 infile datalines dlm='|' dsd missover; 
input subjid : $10. term : $20. stdtc : $10.; 
label ;
 format ;
 datalines4; 
1001|Headache|2010-01-05
1001|Headache|2010-01-08
1001|Nausea|2010-01-08
1002|Headache|2010-02-15
1002|Headache|2010-02-15
;;;;
 run; 

*------------------------------------------------------------------------------;
*nodupkey;
*------------------------------------------------------------------------------;

proc sort data=ae;
     by subjid term stdtc;
 run; 

proc sort data=ae out=nodupkey nodupkey; 
    by subjid term;
 run; 


*------------------------------------------------------------------------------;
*noduprec;
*------------------------------------------------------------------------------;

proc sort data=ae1 out=noduprec noduprec;
     by _all_;
 run;

These SAS code snippets showcase techniques to eliminate duplicate observations in a dataset using different methods: nodupkey and noduprec.
nodupkey:
The nodupkey option is used in the proc sort statement to remove duplicate observations based on specific variables. The dataset "ae" is sorted by the variables "subjid," "term," and "stdtc" using the by statement. Then, another proc sort statement is used with the nodupkey option to create a new dataset called "nodupkey," eliminating duplicate observations based on the "subjid" and "term" variables.
noduprec:
The noduprec option is used in the proc sort statement to remove duplicate observations by considering all variables in the dataset. The dataset "ae1" is sorted by _all_, which includes all variables in the dataset. The noduprec option is used to create a new dataset called "x," eliminating duplicate observations based on all variables.
These SAS code snippets demonstrate how to handle duplicate observations in a dataset by sorting the data and removing duplicates based on specific variables or considering all variables. The resulting datasets "nodupkey" and "x" will contain unique observations based on the defined criteria.

R (tidyverse)

library(tidyverse)

ae <- tribble(
  ~subjid, ~term,       ~stdtc,
  "1001",  "Headache",  "2010-01-05",
  "1001",  "Headache",  "2010-01-08",
  "1001",  "Nausea",    "2010-01-08",
  "1002",  "Headache",  "2010-02-15"
)

ae1 <- tribble(
  ~subjid, ~term,       ~stdtc,
  "1001",  "Headache",  "2010-01-05",
  "1001",  "Headache",  "2010-01-08",
  "1001",  "Nausea",    "2010-01-08",
  "1002",  "Headache",  "2010-02-15",
  "1002",  "Headache",  "2010-02-15"
)

#---------------------------------------
# Duplicate based on key variables (nodupkey)
#---------------------------------------

nodupkey <- ae %>%
  arrange(subjid, term, stdtc) %>%
  group_by(subjid, term) %>%
  slice(1)

#---------------------------------------
# Duplicate record (noduprec / nodup)
#---------------------------------------

noduprec <- ae1 %>%
  distinct()

These R Tidyverse code snippets demonstrate techniques to eliminate duplicate observations in a dataset using different approaches.
nodupkey:
In the first code snippet, the dataset "ae" is processed using the %>% pipe operator. The data is arranged by the variables "subjid," "term," and "stdtc" using the arrange() function. Then, the data is grouped by "subjid" and "term" using the group_by() function. Finally, the slice(1) function is applied to keep only the first observation within each group, effectively removing duplicates based on the key variables. The resulting dataset contains unique observations based on the defined key variables.
noduprec/nodup:
In the second code snippet, the dataset "ae1" is processed. The distinct() function is applied to the dataset, which removes duplicate records based on all variables. The resulting dataset contains unique observations without considering any specific key variables.
These R Tidyverse code snippets demonstrate how to handle duplicate observations in a dataset by arranging, grouping, and selecting unique observations based on specific key variables or considering all variables. The resulting datasets will contain the unique observations based on the defined criteria.

R (base)

ae <- data.frame(
  subjid = c(1001, 1001, 1001, 1002),
  term = c("Headache", "Headache", "Nausea", "Headache"),
  stdtc = c("2010-01-05", "2010-01-08", "2010-01-08", "2010-02-15")
  , stringsAsFactors = FALSE
)

ae1 <- data.frame(
  subjid = c(1001, 1001, 1001, 1002, 1002),
  term = c("Headache", "Headache", "Nausea", "Headache", "Headache"),
  stdtc = c("2010-01-05", "2010-01-08", "2010-01-08", "2010-02-15", "2010-02-15")
  , stringsAsFactors = FALSE
)

# duplicate based on key variables (nodupkey)

nodupkey <- ae[order(ae$subjid, ae$term, ae$stdtc), ]

nodupkey <- nodupkey[!duplicated(nodupkey[c("subjid", "term")]), ]

# duplicate record (noduprec/nodup)

noduprec <- ae1[!duplicated(ae1), ]

Duplicate handling based on key variables (nodupkey concept)

This pattern is used when uniqueness is defined by a subset of variables rather than the entire row. In clinical data, this is very common when we say “one record per subject per term”, or “one record per subject per visit”.

The data is first sorted using order() so that, within each key combination, records are arranged in a meaningful sequence. Here, sorting by stdtc ensures that earlier dates appear first.

The duplicated() function is then applied only to the key variables (subjid, term). This tells R to look for repeated key combinations, not full-row duplicates.

By default, duplicated() keeps the first occurrence and flags later ones. Because we sorted beforehand, “first” now has a clear definition.

Duplicate record removal (nodup / noduprec concept)

This pattern is used when we want to remove rows that are exact copies across all variables, not just selected keys.

Here, duplicated(ae1) checks whether an entire row has appeared before, comparing every column.

No sorting is required because order does not matter when removing exact duplicate rows.