Descriptive statistics for numeric variables

Lesson Description

This lesson shows how to compute descriptive statistics (count, mean, and standard deviation) for a numeric variable, first for the whole dataset and then by group.
It walks through a small example using five rows of data.
We will see one approach in SAS and two in R (tidyverse and base R).

SAS (Base SAS)

/* Build a five-row CLASS dataset from pipe-delimited inline data */
data CLASS;
  infile datalines dlm='|' dsd missover;
  input Name : $8. Sex : $1. Age : best32. Height : best32. Weight : best32.;
  datalines4;
Alfred|M|14|69|112.5
Alice|F|13|56.5|84
Barbara|F|13|65.3|98
Henry|M|14|63.5|102.5
James|M|12|57.3|83
;;;;
run;

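/* Overall statistics for height: one output row for the whole dataset */
proc summary data=class;
  var height;
  output out=stats01 n=n mean=mean std=sd;
run;

/* Statistics for height by sex; NWAY drops the overall (_TYPE_=0) row */
proc summary data=class nway;
  class sex;
  var height;
  output out=stats02 n=n mean=mean std=sd;
run;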

 

Both snippets use PROC SUMMARY to compute descriptive statistics for the numeric variable "height" in the "class" dataset. PROC SUMMARY prints nothing by default, so the results are written to the output datasets "stats01" and "stats02".
In the first code snippet:
The var statement names "height" as the analysis variable.
The output statement writes the count of non-missing values (n), the mean, and the standard deviation (std) to "stats01", stored in the variables n, mean, and sd.
With no class statement, "stats01" contains a single row for the whole dataset. Alongside the requested statistics, SAS adds the automatic variables _TYPE_ and _FREQ_.
In the second code snippet:
The class statement names "sex" as the grouping variable, so the statistics are computed separately for each value of "sex".
The nway option keeps only the rows for the individual groups; without it, the output would also include an overall row (_TYPE_=0) for the whole dataset.
The var and output statements work as in the first snippet, this time writing to "stats02".
After the second snippet runs, "stats02" contains one row per value of "sex" (F and M), each with n, mean, and sd for "height".
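As a check on what the first snippet should produce, the mean and standard deviation of the five heights can be worked out by hand. This sketch assumes the default divisor of n - 1 for the standard deviation, which is what PROC SUMMARY uses unless VARDEF= is changed, and what R's sd() uses as well.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
         = \frac{69 + 56.5 + 65.3 + 63.5 + 57.3}{5}
         = \frac{311.6}{5} = 62.32 \\
s &= \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}
   = \sqrt{\frac{113.968}{4}}
   = \sqrt{28.492} \approx 5.338
\end{align*}
```

So "stats01" should show n = 5, mean = 62.32, and sd of about 5.338.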

R (tidyverse)

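library(tidyverse)  # provides tribble() (tibble) and summarize(), group_by(), %>% (dplyr)

# Build the sample data row by row
class <- tribble(
  ~Name,     ~Sex, ~Age, ~Height, ~Weight,
  "Alfred",  "M",  14,   69,      112.5,
  "Alice",   "F",  13,   56.5,    84,
  "Barbara", "F",  13,   65.3,    98,
  "Henry",   "M",  14,   63.5,    102.5,
  "James",   "M",  12,   57.3,    83
)

# Overall statistics for Height: one row for the whole data frame
stats01 <- summarize(class,
  n    = n(),
  mean = mean(Height),
  sd   = sd(Height)
)

# Statistics for Height by Sex: one row per group
stats02 <- class %>%
  group_by(Sex) %>%
  summarize(
    n    = n(),
    mean = mean(Height),
    sd   = sd(Height)
  )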

Both snippets use dplyr functions from the tidyverse to compute descriptive statistics for "Height" in the "class" data frame and store the results in "stats01" and "stats02".
In the first code snippet:
The summarize function collapses the data frame to a single row of statistics.
The first argument is the input data frame, "class".
The remaining arguments are name = expression pairs: n() counts the rows, while mean(Height) and sd(Height) compute the mean and standard deviation of "Height".
After the first snippet runs, "stats01" is a one-row data frame with the columns n, mean, and sd.
In the second code snippet:
The %>% operator pipes "class" into group_by and then pipes the grouped result into summarize.
The group_by function groups the rows by "Sex".
The summarize function then evaluates the same three expressions once per group.
After the second snippet runs, "stats02" has one row per value of "Sex", with the columns Sex, n, mean, and sd.
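One difference is worth keeping in mind when comparing these results with SAS: the N statistic in PROC SUMMARY counts non-missing values of the analysis variable, whereas n() counts rows, and mean() and sd() return NA if any value is missing. The lesson data has no missing values, so the results agree. The sketch below uses a hypothetical variant of the data (Barbara's height set to NA, and the names class_na and stats_na are made up for illustration) to show one way to reproduce the SAS behavior.

```r
library(dplyr)

# Hypothetical variant of the lesson data: Barbara's height is missing
class_na <- tibble::tribble(
  ~Name,     ~Sex, ~Height,
  "Alfred",  "M",  69,
  "Alice",   "F",  56.5,
  "Barbara", "F",  NA,
  "Henry",   "M",  63.5,
  "James",   "M",  57.3
)

stats_na <- class_na %>%
  group_by(Sex) %>%
  summarize(
    n_rows = n(),                        # row count, missing or not
    n      = sum(!is.na(Height)),        # non-missing count, like SAS N
    mean   = mean(Height, na.rm = TRUE), # ignore missing values
    sd     = sd(Height, na.rm = TRUE)    # NA when only one value remains
  )
```

For the F group this gives n_rows = 2 but n = 1, and the standard deviation is NA because it cannot be computed from a single value.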

R (base)

# Build the sample data
class <- data.frame(
  Name = c("Alfred", "Alice", "Barbara", "Henry", "James"),
  Sex = c("M", "F", "F", "M", "M"),
  Age = c(14, 13, 13, 14, 12),
  Height = c(69, 56.5, 65.3, 63.5, 57.3),
  Weight = c(112.5, 84, 98, 102.5, 83),
  stringsAsFactors = FALSE
)

# Overall statistics for Height: one row for the whole data frame
stats01 <- data.frame(
  n = length(class$Height),
  mean = mean(class$Height),
  sd = sd(class$Height)
)

# Statistics for Height by Sex; the three values come back as one matrix column
stats02 <- aggregate(Height ~ Sex, data = class, FUN = function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
})

# Unpack the matrix column into separate columns, then rename them
stats02 <- do.call(data.frame, stats02)
names(stats02) <- c("Sex", "n", "mean", "sd")

aggregate() splits the data by Sex and applies the summary function once per group, much as PROC SUMMARY computes statistics for each level of a CLASS variable.
Because the function returns several statistics at once (n, mean, sd), aggregate() stores them as a single matrix inside one column, named Height, instead of creating separate columns.
RStudio’s data viewer makes this matrix look like multiple columns (Height[, "n"], Height[, "mean"], Height[, "sd"]), but structurally it is still just one column.
We use do.call(data.frame, …) to unpack that matrix into real, independent columns that R can work with normally.
Only after this step does the data frame have four columns (Sex plus one each for n, mean, and sd), so assigning four names is valid.
This extra step is needed because aggregate() keeps whatever object the function returns, whereas summarize() creates one column per named expression.
Overall, this flow mirrors PROC SUMMARY with a CLASS variable, followed by reshaping the output into a clean, report-ready dataset.
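The matrix-in-a-column structure is easy to confirm directly. The following is a minimal sketch using only the two columns needed here; the names agg and flat are illustrative and not part of the lesson code.

```r
# Same heights and sexes as the lesson data, reduced to the columns used here
class <- data.frame(
  Sex    = c("M", "F", "F", "M", "M"),
  Height = c(69, 56.5, 65.3, 63.5, 57.3),
  stringsAsFactors = FALSE
)

agg <- aggregate(Height ~ Sex, data = class, FUN = function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
})

ncol(agg)              # 2: Sex and Height
is.matrix(agg$Height)  # TRUE: the three statistics sit in one matrix column
dim(agg$Height)        # 2 3: two groups by three statistics

# Unpacking turns each matrix column into a data frame column of its own
flat <- do.call(data.frame, agg)
ncol(flat)             # 4
names(flat)            # "Sex" "Height.n" "Height.mean" "Height.sd"
```

Assigning four names before the unpacking step would fail, because the data frame still has only two columns at that point.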
