*Copyright @ www.mycsg.in;
What is the word meaning of unique
Being the only one of its kind or different from all others.
What do we call a unique record in a dataset
A record is unique when at least one value makes it different from every other observation in the dataset
When specific variables are treated as key variables, we judge uniqueness only on the combination of those key variables
So a record can be unique under one key definition and non-unique under another key definition
Create a sample dataset
This dataset contains exact duplicates for Alfred and multiple records for Carol with different ages
That mix lets us study uniqueness using different BY variable combinations
data class; infile cards truncover; input name $ sex $ age height weight; cards; Alfred M 14 69 112.5 Alfred M 14 69 112.5 Carol F 14 62.8 102.5 Carol F 12 55 84 Carol F 11 51 79 Jane F 12 59.8 84.5 ; run;
Copy Code
View Log
SAS Log
data class; infile cards truncover; input name $ sex $ age height weight; cards; NOTE: The data set WORK.CLASS has 6 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds cpu time 0.00 seconds ; run;
If all variables are considered together, only the two Alfred rows are non-unique exact duplicates
If `name` alone is considered as the key, then Alfred and Carol are non-unique because their names repeat
If `name` and `age` are considered together, then only Alfred remains non-unique in this example
View Data
Dataset View
Where do we need to identify non-unique records
We may need to review duplicate subject records, duplicate event records, or repeated lab records within a defined key
Separating unique and non-unique observations helps with data cleaning and issue investigation
`proc sort` provides a convenient way to split these records into separate datasets
Separate unique and non-unique records based on key variables
The `nouniquekeys` option sends non-unique observations to the dataset specified by `out=`
The `uniqueout=` option stores observations whose BY value combination occurs only once
The BY statement defines the key used to judge uniqueness
Create a dataset named class_nouniq_names to hold records with non-unique names
Using `by name;` means uniqueness is checked only on the value of `name`
Jane appears only once and therefore goes to `class_uniq_names`
Alfred and Carol appear more than once and therefore go to `class_nouniq_names`
proc sort data=class out=class_nouniq_names uniqueout=class_uniq_names nouniquekeys; by name; run;
Copy Code
View Log
SAS Log
proc sort data=class out=class_nouniq_names uniqueout=class_uniq_names nouniquekeys; by name; run; NOTE: There were 6 observations read from the data set WORK.CLASS. NOTE: 1 observations with unique key values were deleted. NOTE: The data set WORK.CLASS_NOUNIQ_NAMES has 5 observations and 5 variables. NOTE: The data set WORK.CLASS_UNIQ_NAMES has 1 observations and 5 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.01 seconds cpu time 0.00 seconds
View Data
Dataset View
Create a dataset named class_nouniq_name_age to hold records with non-unique name and age combinations
Using `by name age;` changes the uniqueness rule
Carol now has unique rows because each age value differs within the name Carol
Only the Alfred rows remain non-unique because both `name` and `age` match exactly
proc sort data=class out=class_nouniq_name_age uniqueout=class_uniq_name_age nouniquekeys; by name age; run;
Copy Code
View Log
SAS Log
proc sort data=class out=class_nouniq_name_age uniqueout=class_uniq_name_age nouniquekeys; by name age; run; NOTE: There were 6 observations read from the data set WORK.CLASS. NOTE: 4 observations with unique key values were deleted. NOTE: The data set WORK.CLASS_NOUNIQ_NAME_AGE has 2 observations and 5 variables. NOTE: The data set WORK.CLASS_UNIQ_NAME_AGE has 4 observations and 5 variables. NOTE: PROCEDURE SORT used (Total process time): real time 0.01 seconds cpu time 0.01 seconds
View Data
Dataset View
Key points to remember
Uniqueness depends on the BY variables used as the key
`nouniquekeys` helps separate non-unique observations from unique observations
`uniqueout=` stores records that occur only once for the specified BY key
This technique is very useful in duplicate record review and data cleaning workflows