First dot and last dot concept

Lesson Description

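Sometimes we need to pick out rows based on their position within a group, which is what the SAS automatic variables FIRST.variable and LAST.variable ("first dot" and "last dot") are for.
This lesson walks through a simple example: keeping only the rows whose group contains exactly one observation.
We will see one way to do this in SAS and two ways in R.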

SAS (Base SAS)

data class;
    infile datalines dlm='|' dsd missover;
    input Name : $8. Sex : $1. Age : best32. Height : best32. Weight : best32.;
    datalines4;
Alfred|M|14|69|112.5
Alice|F|13|56.5|84
Barbara|F|13|65.3|98
Carol|F|14|62.8|102.5
Henry|M|14|63.5|102.5
James|M|12|57.3|83
Jane|F|12|59.8|84.5
Janet|F|15|62.5|112.5
Jeffrey|M|13|62.5|84
John|M|12|59|99.5
Joyce|F|11|51.3|50.5
Judy|F|14|64.3|90
Louise|F|12|56.3|77
Mary|F|15|66.5|112
Philip|M|16|72|150
Robert|M|12|64.8|128
Ronald|M|15|67|133
Thomas|M|11|57.5|85
William|M|15|66.5|112
;;;;
 run; 

proc sort data=class;
    by age;
run;

data only_one_in_group;
    set class;
    by age;
    * Keep a row only when it is both the first and the last row of its age group;
    if first.age and last.age;
run;

This SAS code keeps only the observations whose BY group contains a single row; here the groups are defined by the "age" variable.
First, the PROC SORT step sorts the "class" dataset in ascending order by "age". BY-group processing in the DATA step requires the input to be sorted (or indexed) by the BY variable.
Then, in the DATA step:
The SET statement reads the sorted "class" dataset.
The BY statement names "age" as the grouping variable, which makes SAS create the temporary variables FIRST.age and LAST.age. FIRST.age is 1 on the first observation of each age group and 0 otherwise; LAST.age is 1 on the last observation of each age group and 0 otherwise.
The subsetting IF statement, IF FIRST.AGE AND LAST.AGE, keeps an observation only when both flags are 1, that is, when the observation is simultaneously the first and the last in its group. This can only happen when the group has exactly one observation.
The resulting dataset, "only_one_in_group", therefore contains only the ages that occur once. With the data above, that is the single row for Philip (age 16); ages 11 through 15 each appear more than once and are dropped.
This pattern is useful for isolating records that are unique on a key. Negating the condition, IF NOT (FIRST.AGE AND LAST.AGE), keeps the rows whose key value is shared with other rows instead.

R (tidyverse)

library(tidyverse) 
class<-tribble(
~Name,~Sex,~Age,~Height,~Weight, 
"Alfred","M",14,69,112.5, 
"Alice","F",13,56.5,84, 
"Barbara","F",13,65.3,98, 
"Carol","F",14,62.8,102.5, 
"Henry","M",14,63.5,102.5, 
"James","M",12,57.3,83,
 "Jane","F",12,59.8,84.5, 
"Janet","F",15,62.5,112.5, 
"Jeffrey","M",13,62.5,84, 
"John","M",12,59,99.5, 
"Joyce","F",11,51.3,50.5, 
"Judy","F",14,64.3,90,
 "Louise","F",12,56.3,77, 
"Mary","F",15,66.5,112, 
"Philip","M",16,72,150, 
"Robert","M",12,64.8,128, 
"Ronald","M",15,67,133, 
"Thomas","M",11,57.5,85, 
"William","M",15,66.5,112, 
) 
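only_one_in_group <- class %>%
  group_by(Age) %>%           # column is "Age" (capital A) in the tribble above
  mutate(nrows = n()) %>%     # number of rows in each Age group
  filter(nrows == 1)          # keep rows that are alone in their group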

This tidyverse code keeps only the rows whose "Age" value occurs exactly once in the data. Note that the column is named "Age" with a capital A in the tribble, and R is case-sensitive.
Using the pipe operator %>%, the following operations are performed:
The group_by function groups the "class" data frame by "Age".
The mutate function adds a column named "nrows" holding the number of rows in each group, as returned by n().
The filter function keeps only the rows where "nrows" equals 1, meaning the row is the only one in its group.
The resulting data frame "only_one_in_group" contains only rows from single-member groups; with the data above, that is the single row for Philip (Age 16). The result is still grouped by "Age" and carries the helper column "nrows".
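
The helper column is not strictly needed. As a minimal sketch, assuming the dplyr package is installed, the count can be tested directly inside filter(). The data frame "toy" and the names in it are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data: ages 11 and 13 occur once, age 12 occurs twice
toy <- data.frame(
  name = c("Ann", "Bob", "Cy", "Dee"),
  age  = c(12, 11, 12, 13),
  stringsAsFactors = FALSE
)

singletons <- toy %>%
  group_by(age) %>%
  filter(n() == 1) %>%   # keep groups that contain exactly one row
  ungroup()              # drop the grouping so later steps see an ungrouped result

print(singletons)
```

Calling ungroup() at the end is a design choice: it avoids surprising behavior if more dplyr verbs are chained onto the result later.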

R (base)

class <- data.frame(
  name = c("Alfred", "Alice", "Barbara", "Carol", "Henry", "James", "Jane", "Janet", "Jeffrey", "John", "Joyce", "Judy", "Louise", "Mary", "Philip", "Robert", "Ronald", "Thomas", "William"),
  sex = c("M", "F", "F", "F", "M", "M", "F", "F", "M", "M", "F", "F", "F", "F", "M", "M", "M", "M", "M"),
  age = c(14, 13, 13, 14, 14, 12, 12, 15, 13, 12, 11, 14, 12, 15, 16, 12, 15, 11, 15),
  height = c(69, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5, 62.5, 59, 51.3, 64.3, 56.3, 66.5, 72, 64.8, 67, 57.5, 66.5),
  weight = c(112.5, 84, 98, 102.5, 102.5, 83, 84.5, 112.5, 84, 99.5, 50.5, 90, 77, 112, 150, 128, 133, 85, 112)
  , stringsAsFactors = FALSE
)


only_one_in_group_tmp <- class

 
only_one_in_group_tmp$nrows <- ave(
  only_one_in_group_tmp$age,
  only_one_in_group_tmp$age,
  FUN = length
)

only_one_in_group_tmp <- only_one_in_group_tmp[only_one_in_group_tmp$nrows == 1, ]

only_one_in_group <- only_one_in_group_tmp

We start by creating a temporary copy of the dataset so the original data remains unchanged.

ave(only_one_in_group_tmp$age, only_one_in_group_tmp$age, FUN = length)
The ave() function is used to perform a group-wise calculation and return the result back to each row.
Here, the first argument supplies the values being processed, and the second argument defines the groups. Both are the age column because only the size of each group matters, not the values themselves.

Within each age group, length counts how many rows share that age value.
This count is then repeated for every row belonging to the same age group.

The resulting vector is assigned to a new column named nrows, which now holds the group size for each record.
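
To see what ave() returns, it may help to run it on a very small input. The sketch below uses a hypothetical vector "ages" that is not part of the lesson data.

```r
# Hypothetical toy vector of ages: 12 appears twice, 11 and 13 once each
ages <- c(12, 11, 12, 13)

# For every element, count how many elements share its value.
# The result has the same length and order as the input.
group_size <- ave(ages, ages, FUN = length)

print(group_size)  # 2 1 2 1
```

Because the result lines up with the input row by row, it can be stored as a new column and used directly in a row filter.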

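only_one_in_group_tmp[only_one_in_group_tmp$nrows == 1, ]
Row subsetting keeps only those records where nrows is 1, that is, where the age value appears exactly once in the dataset. The column must be referenced through the data frame (only_one_in_group_tmp$nrows), and the empty position after the comma keeps all columns.

Finally, the filtered dataset is assigned to a new object, which contains only the records whose age is not shared with any other record. The helper column nrows is still present in the result.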

Overall logic
Compute group size → attach it to each row → retain only groups with a single observation.
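
The R versions above count group sizes instead of using first and last flags. As a rough sketch of a closer analogue to the SAS approach, base R's duplicated() can produce flags that behave like FIRST.age and LAST.age. The data frame "toy" and the column names first_age and last_age are hypothetical names chosen for this illustration.

```r
# Hypothetical toy data; names and ages are made up for illustration
toy <- data.frame(
  name = c("Ann", "Bob", "Cy", "Dee"),
  age  = c(12, 11, 12, 13),
  stringsAsFactors = FALSE
)

# Sort by age to mirror the PROC SORT step
toy <- toy[order(toy$age), ]

# First row of each age group: this age has not appeared on an earlier row
toy$first_age <- !duplicated(toy$age)
# Last row of each age group: this age does not appear on a later row
toy$last_age <- !duplicated(toy$age, fromLast = TRUE)

# Counterpart of the SAS statement: if first.age and last.age;
singletons <- toy[toy$first_age & toy$last_age, ]

print(singletons)
```

Keeping the two flags as columns also makes it easy to select only the first row per group (first_age) or only the last row per group (last_age), which is the more common use of FIRST. and LAST. in SAS.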
