mycsg SASnR2 2200_First-dot-concept

Lesson Description

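During BY-group processing in a DATA step, SAS sets the automatic variables FIRST.variable and LAST.variable, often called "first dot" and "last dot."
This lesson uses FIRST.variable in two simple examples: numbering the observations within each group and keeping the first observation of each group.
We will see one way to do each in SAS and in R.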

SAS (Base SAS)

data CLASS;
 infile datalines dlm='|' dsd missover; 
input Name : $8. Sex : $1. Age : best32. Height : best32. Weight : best32.; 
label ;
 format ;
 datalines4; 
Alfred|M|14|69|112.5
Alice|F|13|56.5|84
Barbara|F|13|65.3|98
Carol|F|14|62.8|102.5
Henry|M|14|63.5|102.5
James|M|12|57.3|83
Jane|F|12|59.8|84.5
Janet|F|15|62.5|112.5
Jeffrey|M|13|62.5|84
John|M|12|59|99.5
Joyce|F|11|51.3|50.5
Judy|F|14|64.3|90
Louise|F|12|56.3|77
Mary|F|15|66.5|112
Philip|M|16|72|150
Robert|M|12|64.8|128
Ronald|M|15|67|133
Thomas|M|11|57.5|85
William|M|15|66.5|112
;;;;
 run; 

*------------------------------------------------------------------------------;
*counter;
*------------------------------------------------------------------------------;

proc sort data=class;
     by sex height weight name;
 run; 

data counter;
     set class;
     by sex height weight name;
     if first.sex then counter=1; 
    else counter+1; 
run; 

*------------------------------------------------------------------------------;
*subset of lowest height;
*------------------------------------------------------------------------------;

proc sort data=class;
     by sex height;
 run; 

data lowestheight;
     set class;
     by sex height;
     if first.sex;
     keep sex height name;
 run; 

 

These SAS code snippets use the first. automatic variable to work with the first observation of each by group. The first snippet adds a within-group counter, and the second keeps the observation with the lowest height in each group.
In the first code snippet:
The proc sort procedure sorts the "class" dataset in place by the variables "sex," "height," "weight," and "name."
The by statement in the data step requires its input to be in this order.
The data step creates a new dataset named "counter" by using the set statement to read in the sorted "class" dataset.
The by statement names the grouping variables and makes SAS create the automatic first. and last. variables for each of them.
The first.sex variable is 1 on the first observation of each "sex" group and 0 otherwise.
On the first observation of a group, "counter" is set to 1. On every other observation, the sum statement counter+1 adds 1 to the value retained from the previous observation.
After executing the first code snippet, the "counter" dataset numbers the observations 1, 2, 3, ... within each value of "sex," in order of "height," "weight," and "name."
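For readers who think in R, the row-by-row logic of this data step can be pictured with an explicit loop. This is only a sketch on made-up values, not how the counter would normally be written in R; the names sex, running and first_sex are illustrative and do not come from the lesson's code.

```r
# Hypothetical toy input, already sorted by sex (as proc sort would leave it).
sex <- c("F", "F", "F", "M", "M")

counter <- integer(length(sex))
running <- 0L  # plays the role of the retained counter variable

for (i in seq_along(sex)) {
  # TRUE on row 1 or whenever sex differs from the previous row,
  # which is how first.sex behaves on sorted input.
  first_sex <- i == 1L || sex[i] != sex[i - 1L]
  if (first_sex) {
    running <- 1L            # if first.sex then counter=1;
  } else {
    running <- running + 1L  # else counter+1;
  }
  counter[i] <- running
}

print(counter)
```

The value of running survives from one iteration to the next, which is the same role the sum statement gives to counter in the data step.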
In the second code snippet:
The proc sort procedure sorts the "class" dataset by the variables "sex" and "height."
The data step creates a new dataset named "lowestheight" by using the set statement to read in the sorted "class" dataset.
The by statement names the grouping variables.
The statement if first.sex; is a subsetting if: only the first observation of each "sex" group continues to the output.
The keep statement restricts the output variables to "sex," "height," and "name." It does not select observations.
Because the data is sorted by ascending "height" within "sex," the "lowestheight" dataset contains the observation with the lowest height for each value of "sex."
If several observations tie for the lowest height, only the first of them in sort order is kept.

R (tidyverse)

library(tidyverse) 
class<-tribble( 
~Name,~Sex,~Age,~Height,~Weight, 
"Alfred","M",14,69,112.5, 
"Alice","F",13,56.5,84, 
"Barbara","F",13,65.3,98, 
"Carol","F",14,62.8,102.5, 
"Henry","M",14,63.5,102.5, 
"James","M",12,57.3,83, 
"Jane","F",12,59.8,84.5, 
"Janet","F",15,62.5,112.5, 
"Jeffrey","M",13,62.5,84, 
"John","M",12,59,99.5, 
"Joyce","F",11,51.3,50.5, 
"Judy","F",14,64.3,90, 
"Louise","F",12,56.3,77, 
"Mary","F",15,66.5,112, 
"Philip","M",16,72,150, 
"Robert","M",12,64.8,128, 
"Ronald","M",15,67,133, 
"Thomas","M",11,57.5,85, 
"William","M",15,66.5,112, 
) 

counter<-class %>% 
arrange(Sex,Height,Weight,Name) %>% 
group_by(Sex) %>% 
mutate(counter=row_number()) 

lowestheight<-class %>% 
arrange(Sex,Height) %>% 
group_by(Sex) %>% 
slice(1) %>% 
select(Name,Sex,Height)

These tidyverse snippets reproduce both SAS results: a within-group counter and the subset of observations with the lowest height in each group.
In the first code snippet:
The arrange function is used to sort the "class" dataframe in ascending order by the variables "Sex," "Height," "Weight," and "Name."
The group_by function is used to group the observations by the variable "Sex."
The mutate function is used to create a new variable named "counter" using the row_number function, which assigns a sequential number to each observation within each group.
After executing the first code snippet, the "counter" dataframe will contain a "counter" variable holding the sequential number of each observation within its "Sex" group. The result stays grouped by "Sex" until ungroup() is called.
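The same grouped mutate can also produce explicit flags that play the role of first.sex and last.sex. The sketch below assumes the dplyr package is installed and uses made-up rows; toy, flagged, first_sex and last_sex are illustrative names, not part of the lesson's code.

```r
library(dplyr)

# Hypothetical toy data; column names mirror the lesson's data frame.
toy <- data.frame(
  Sex    = c("M", "F", "F", "M", "F"),
  Height = c(57.5, 56.5, 51.3, 57.3, 56.3),
  stringsAsFactors = FALSE
)

flagged <- toy %>%
  arrange(Sex, Height) %>%
  group_by(Sex) %>%
  mutate(
    counter   = row_number(),         # restarts at 1 in each Sex group
    first_sex = row_number() == 1,    # analogue of first.sex
    last_sex  = row_number() == n()   # analogue of last.sex
  ) %>%
  ungroup()                           # drop the grouping for later steps

print(flagged)
```

Calling ungroup() at the end is a design choice: it keeps later operations from silently running per group.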
In the second code snippet:
The arrange function is used to sort the "class" dataframe in ascending order by the variables "Sex" and "Height."
The group_by function is used to group the observations by the variable "Sex."
The slice function is used to select the first observation within each group, which corresponds to the observation with the lowest height.
The select function is used to choose specific variables ("Name," "Sex," and "Height") to include in the resulting dataframe.
After executing the second code snippet, the "lowestheight" dataframe will contain the subset of observations with the lowest height for each unique value of "Sex."
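When two observations share the lowest height, arrange() followed by slice(1) keeps whichever comes first. If that matters, one possible alternative is slice_min(), available in dplyr 1.0.0 and later, which makes the tie handling explicit. The sketch below uses made-up rows with a deliberate tie; toy, all_lowest and one_lowest are illustrative names.

```r
library(dplyr)

# Hypothetical toy data with a tie for the lowest height among "F".
toy <- data.frame(
  Name   = c("Ann", "Bea", "Cal", "Dan"),
  Sex    = c("F", "F", "M", "M"),
  Height = c(56.3, 56.3, 57.3, 60.0),
  stringsAsFactors = FALSE
)

# with_ties = TRUE (the default) keeps every row tied for the minimum.
all_lowest <- toy %>%
  group_by(Sex) %>%
  slice_min(Height, n = 1) %>%
  ungroup()

# with_ties = FALSE returns exactly one row per group.
one_lowest <- toy %>%
  group_by(Sex) %>%
  slice_min(Height, n = 1, with_ties = FALSE) %>%
  ungroup()

print(all_lowest)
print(one_lowest)
```

The with_ties = FALSE form is the closer match to the SAS subsetting if, which always outputs one observation per group.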

R (base)

class <- data.frame(
  Name = c("Alfred", "Alice", "Barbara", "Carol", "Henry", "James", "Jane", "Janet", "Jeffrey", "John", "Joyce", "Judy", "Louise", "Mary", "Philip", "Robert", "Ronald", "Thomas", "William"),
  Sex = c("M", "F", "F", "F", "M", "M", "F", "F", "M", "M", "F", "F", "F", "F", "M", "M", "M", "M", "M"),
  Age = c(14, 13, 13, 14, 14, 12, 12, 15, 13, 12, 11, 14, 12, 15, 16, 12, 15, 11, 15),
  Height = c(69, 56.5, 65.3, 62.8, 63.5, 57.3, 59.8, 62.5, 62.5, 59, 51.3, 64.3, 56.3, 66.5, 72, 64.8, 67, 57.5, 66.5),
  Weight = c(112.5, 84, 98, 102.5, 102.5, 83, 84.5, 112.5, 84, 99.5, 50.5, 90, 77, 112, 150, 128, 133, 85, 112)
  , stringsAsFactors = FALSE
)

#==============================================================================;
#Counter;
#==============================================================================;

 
counter_tmp <- class
counter_tmp <- counter_tmp[order(counter_tmp$Sex, counter_tmp$Height, counter_tmp$Weight, counter_tmp$Name), ]
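# ave() returns a vector of the same type as its first argument, so pass an integer vector to keep the counter numeric
counter_tmp$counter <- ave(seq_along(counter_tmp$Sex), counter_tmp$Sex, FUN = seq_along)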
counter <- counter_tmp

 
#==============================================================================;
#Subset lowest height student within each sex;
#==============================================================================;

 
lowestheight_tmp <- class
lowestheight_tmp <- lowestheight_tmp[order(lowestheight_tmp$Sex, lowestheight_tmp$Height), ]
lowestheight_tmp <- lowestheight_tmp[!duplicated(lowestheight_tmp$Sex), ]
lowestheight_tmp <- lowestheight_tmp[, c("Name", "Sex", "Height")]
lowestheight <- lowestheight_tmp

Counter variable creation

counter_tmp <- class
We begin by creating a working copy of the dataset so that all transformations are applied on a temporary object.

order(counter_tmp$Sex, counter_tmp$Height, counter_tmp$Weight, counter_tmp$Name)
The order() function is used to sort the data by multiple variables.
Here, records are ordered first by Sex, then by Height, then by Weight, and finally by Name.
This establishes a deterministic and reproducible ordering within each Sex group.
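It can help to look at what order() actually returns: a vector of row positions, not sorted data. The sketch below uses made-up vectors; sex, height, idx and idx_desc are illustrative names.

```r
# Hypothetical toy vectors.
sex    <- c("M", "F", "M", "F")
height <- c(57.5, 56.5, 57.3, 51.3)

# Positions that sort by sex, then by ascending height within sex.
idx <- order(sex, height)
print(idx)
print(height[idx])

# Negating a numeric key sorts that key in descending order.
idx_desc <- order(sex, -height)
print(idx_desc)
```

Negating the key works only for numeric variables; the decreasing argument of order() covers other cases.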

counter_tmp <- counter_tmp[order(...), ]
Row subsetting with [ , ] applies the sorting indices, physically reordering the rows of the data frame.

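ave(seq_along(counter_tmp$Sex), counter_tmp$Sex, FUN = seq_along)
The ave() function applies a calculation separately within each group defined by Sex.
For each Sex group, seq_along generates a running sequence starting from 1.
The first argument only supplies the length and type of the result, so it should be numeric: ave() writes the group results back into a copy of its first argument, and passing the character column counter_tmp$Sex there would turn the counter into character values ("1", "2", ...).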

counter_tmp$counter <- ave(...)
The resulting sequence is assigned to a new column named counter, creating a within-group row number for each Sex.
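One detail worth checking is the type of the value that ave() returns, because it follows the type of the first argument. The sketch below compares a character and an integer first argument on a made-up grouping vector; grp, as_char and as_int are illustrative names.

```r
# Hypothetical grouping vector.
grp <- c("F", "F", "M", "M", "M")

# Character first argument: the sequence is coerced to character.
as_char <- ave(grp, grp, FUN = seq_along)

# Integer first argument: the sequence stays numeric.
as_int <- ave(seq_along(grp), grp, FUN = seq_along)

print(class(as_char))
print(class(as_int))
print(as_int)
```

A character counter sorts as text ("10" before "2") and cannot be used in arithmetic without conversion, so the integer form is usually the safer choice.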

counter <- counter_tmp
The final dataset is assigned to a new object, clearly indicating that this version includes the derived counter variable.

Lowest Height record selection

lowestheight_tmp <- class
We begin by creating a temporary copy of the dataset. This is a common practice so that we can freely manipulate the data without modifying the original object.

order(lowestheight_tmp$Sex, lowestheight_tmp$Height)
The order() function generates row indices that sort the data first by Sex and then, within each Sex, by Height in ascending order.

lowestheight_tmp <- lowestheight_tmp[order(...), ]
Using those indices inside square brackets reorders the rows of the data frame according to the specified sort order.

!duplicated(lowestheight_tmp$Sex)
The duplicated() function flags repeated values of Sex. Applying the logical negation (!) keeps only the first occurrence of each Sex.

lowestheight_tmp <- lowestheight_tmp[!duplicated(...), ]
Because the data is already sorted by Height within Sex, keeping the first record per Sex effectively selects the row with the lowest Height for that group.
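The same idea extends to the last record of each group, the counterpart of last.sex in SAS, through the fromLast argument of duplicated(). The sketch below uses made-up rows; toy, lowest and highest are illustrative names.

```r
# Hypothetical toy data.
toy <- data.frame(
  Name   = c("Eli", "Bea", "Cal", "Ann", "Dan"),
  Sex    = c("M", "F", "M", "F", "M"),
  Height = c(72, 56.5, 57.3, 51.3, 57.5),
  stringsAsFactors = FALSE
)

# Sort by Sex, then by ascending Height within Sex.
toy <- toy[order(toy$Sex, toy$Height), ]

# First row of each Sex group: the lowest height.
lowest <- toy[!duplicated(toy$Sex), ]

# Last row of each Sex group: the highest height.
highest <- toy[!duplicated(toy$Sex, fromLast = TRUE), ]

print(lowest)
print(highest)
```

Note that duplicated() looks at the whole vector rather than at consecutive runs, so it matches the SAS first. and last. flags only when the data is sorted by the grouping variable.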

lowestheight_tmp <- lowestheight_tmp[, c("Name", "Sex", "Height")]
Column subsetting with [ , ] retains only the variables needed for the final output.
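A related point to keep in mind: selecting several columns with [ , ] returns a data frame, but selecting a single column returns a plain vector unless drop = FALSE is given. The sketch below uses made-up rows; toy, several, one_vec and one_df are illustrative names.

```r
# Hypothetical toy data.
toy <- data.frame(
  Name   = c("Ann", "Cal"),
  Sex    = c("F", "M"),
  Height = c(51.3, 57.3),
  stringsAsFactors = FALSE
)

several <- toy[, c("Name", "Height")]    # still a data frame
one_vec <- toy[, "Name"]                 # collapses to a character vector
one_df  <- toy[, "Name", drop = FALSE]   # stays a one-column data frame

print(class(several))
print(class(one_vec))
print(class(one_df))
```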