
Summarize Age by Treatment Group with Descriptive Statistics


Lesson Description
  • This lesson shows how to summarize age by treatment group with descriptive statistics (N, number missing, mean, standard deviation, minimum, quartiles, and maximum) in a clear, repeatable way.
  • It walks through a simple example and shows the key steps.
  • We will see one approach in SAS and two in R (tidyverse and base R).
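data adsl;
  /* dsd lets list input read the quoted trt01p values that contain spaces */
  infile datalines dsd dlm=' ';
  /* :$10. keeps the full subject ID; a plain $ would truncate it to 8 characters */
  input studyid $ usubjid :$10. trt01pn age fasfl $ trt01p :$20.;
  datalines;
MYCSG MYCSG-1001 1 23 Y "Dose level 1"
MYCSG MYCSG-1002 3 68 Y "Dose level 3"
MYCSG MYCSG-1003 3 . Y "Dose level 3"
MYCSG MYCSG-1004 3 35 Y "Dose level 3"
MYCSG MYCSG-1006 3 54 Y "Dose level 3"
MYCSG MYCSG-1007 1 63 N "Dose level 1"
;
run;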

proc format;
    value treatment
        1=1
        2=2
        3=3
        4=4
    ;
run;
 
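data adsl02;
    set adsl;
    where fasfl="Y";        /* keep the full analysis set only */
    treatment=trt01pn;
    output;                 /* record for the subject's own treatment group */
    treatment=4;
    output;                 /* duplicate record for the overall (all treatments) group */
run;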

proc summary data=adsl02 nway completetypes;
    class treatment / preloadfmt;
    var age;
    output out=stats01 n= nmiss= mean= std= min= q1= median= q3= max= / autoname;
    format treatment treatment.;
run;
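For readers who think in R, the logic of the SAS steps above (restrict to `fasfl = "Y"`, output each record a second time under treatment 4, and keep every formatted treatment level even when it has no subjects) can be sketched in base R roughly as follows. This is a minimal illustration rather than an exact port of PROC SUMMARY; the object names `fas`, `total`, `by_trt`, and `stats_by_trt` are chosen here for illustration, and only N, the missing count, and the mean are shown.

```r
# Same records as the SAS datalines (only the columns needed here)
adsl <- data.frame(
  usubjid = c("MYCSG-1001", "MYCSG-1002", "MYCSG-1003",
              "MYCSG-1004", "MYCSG-1006", "MYCSG-1007"),
  trt01pn = c(1, 3, 3, 3, 3, 1),
  age     = c(23, 68, NA, 35, 54, 63),
  fasfl   = c("Y", "Y", "Y", "Y", "Y", "N")
)

# Equivalent of: where fasfl = "Y"
fas <- adsl[adsl$fasfl == "Y", ]

# Equivalent of the two output statements: one copy under the subject's own
# treatment, one copy under treatment 4 (the overall group)
fas$treatment   <- fas$trt01pn
total           <- fas
total$treatment <- 4
adsl02 <- rbind(fas, total)

# A factor with all four levels keeps empty groups (here treatment 2),
# which is roughly what preloadfmt + completetypes achieve in SAS
adsl02$treatment <- factor(adsl02$treatment, levels = 1:4)

# One vector of ages per treatment level, including the empty level
by_trt <- split(adsl02$age, adsl02$treatment)

stats_by_trt <- data.frame(
  treatment = names(by_trt),
  n         = sapply(by_trt, function(x) sum(!is.na(x))),
  nmiss     = sapply(by_trt, function(x) sum(is.na(x))),
  mean      = sapply(by_trt, function(x) mean(x, na.rm = TRUE)),
  row.names = NULL
)

print(stats_by_trt)
```

Treatment 2 appears with `n = 0` and a mean of `NaN`, which mirrors the empty row that SAS produces for a preloaded level with no data.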

 

 

library(tidyverse)

# Subject-level data (same records as the SAS example)
adsl <- tribble(
  ~studyid, ~usubjid,     ~trt01pn, ~age, ~fasfl, ~trt01p,
  "MYCSG",  "MYCSG-1001", 1,        23,   "Y",    "Dose level 1",
  "MYCSG",  "MYCSG-1002", 3,        68,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1003", 3,        NA,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1004", 3,        35,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1006", 3,        54,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1007", 1,        63,   "N",    "Dose level 1"
)

# Descriptive statistics of age within each treatment group
stats01 <- adsl %>%
  group_by(trt01pn, trt01p) %>%
  summarize(
    nrecs   = n(),                # records in the group
    nmiss   = sum(is.na(age)),    # records with missing age
    n       = nrecs - nmiss,      # records with non-missing age
    mean    = mean(age, na.rm = TRUE),
    stddev  = sd(age, na.rm = TRUE),
    min     = min(age, na.rm = TRUE),
    q1      = quantile(age, 0.25, type = 2, na.rm = TRUE),
    median  = median(age, na.rm = TRUE),
    q3      = quantile(age, 0.75, type = 2, na.rm = TRUE),
    max     = max(age, na.rm = TRUE),
    .groups = "drop"              # return an ungrouped tibble
  )
  • The dataset `adsl` includes subject IDs, treatment info, and age.
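  • `group_by(trt01pn, trt01p)` defines the treatment groups, so the summaries that follow are calculated within each group.
  • `summarize()` computes the record count, the number of missing and non-missing values, the mean, the standard deviation, and the five-number summary (minimum, Q1, median, Q3, maximum).
  • Quartiles are computed with `quantile(..., type = 2)` to match the default percentile definition in SAS (`QNTLDEF=5`).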
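To see why the `type` argument matters, the short sketch below compares `type = 2` with R's default (`type = 7`) on the three non-missing ages of the "Dose level 3" group from the lesson data. The variable names `q_type2` and `q_type7` are illustrative only.

```r
# Non-missing ages in the "Dose level 3" group of the lesson data
age <- c(68, 35, 54)

# type = 2: empirical distribution function with averaging at discontinuities
q_type2 <- quantile(age, c(0.25, 0.75), type = 2, names = FALSE)

# type = 7: the R default, which interpolates linearly between order statistics
q_type7 <- quantile(age, c(0.25, 0.75), type = 7, names = FALSE)

print(q_type2)  # 35 68
print(q_type7)  # 44.5 61.0
```

With so few observations the two definitions give visibly different quartiles, so leaving `type` at its default would not reproduce the SAS output.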
adsl <- data.frame(
  studyid = c("MYCSG", "MYCSG", "MYCSG", "MYCSG", "MYCSG", "MYCSG"),
  usubjid = c("MYCSG-1001", "MYCSG-1002", "MYCSG-1003", "MYCSG-1004", "MYCSG-1006", "MYCSG-1007"),
  trt01pn = c(1, 3, 3, 3, 3, 1),
  age = c(23, 68, NA, 35, 54, 63),
  fasfl = c("Y", "Y", "Y", "Y", "Y", "N"),
  trt01p = c("Dose level 1", "Dose level 3", "Dose level 3", "Dose level 3", "Dose level 3", "Dose level 1"),
  stringsAsFactors = FALSE
)

#==============================================================================
# Base R summary (aggregate) with intermediate names + final tidyverse-style names
#==============================================================================

stats01 <- aggregate(age ~ trt01pn + trt01p, data = adsl, FUN = function(x) {
  # x holds the age values of one treatment group
  tmp_nrecs  <- length(x)
  tmp_nmiss  <- sum(is.na(x))
  tmp_n      <- tmp_nrecs - tmp_nmiss

  c(
    tmp_nrecs  = tmp_nrecs,
    tmp_nmiss  = tmp_nmiss,
    tmp_n      = tmp_n,
    tmp_mean   = mean(x, na.rm = TRUE),
    tmp_stddev = sd(x, na.rm = TRUE),
    tmp_min    = min(x, na.rm = TRUE),
    tmp_q1     = unname(quantile(x, 0.25, type = 2, na.rm = TRUE)),
    tmp_median = median(x, na.rm = TRUE),
    tmp_q3     = unname(quantile(x, 0.75, type = 2, na.rm = TRUE)),
    tmp_max    = max(x, na.rm = TRUE)
  )
}, na.action = na.pass)
# na.action = na.pass keeps rows with missing age; the formula interface
# defaults to na.omit, which would drop them and force nmiss to 0

 
# Flatten the matrix column to real columns
stats01 <- do.call(data.frame, stats01)

 
# Remove analysis-variable prefix added by aggregate() (age.)
names(stats01) <- sub("^age\\.", "", names(stats01))

 
# Map intermediate names to final tidyverse-style names
final_map <- c(
  tmp_nrecs  = "nrecs",
  tmp_nmiss  = "nmiss",
  tmp_n      = "n",
  tmp_mean   = "mean",
  tmp_stddev = "stddev",
  tmp_min    = "min",
  tmp_q1     = "q1",
  tmp_median = "median",
  tmp_q3     = "q3",
  tmp_max    = "max"
)

 
names(stats01) <- ifelse(
  names(stats01) %in% names(final_map),
  final_map[names(stats01)],
  names(stats01)  # keep trt01pn, trt01p as-is
)
  • Base R aggregate() is used to compute statistics separately for each treatment group, similar to PROC MEANS with CLASS variables.
  • The function receives a vector of age values for one group at a time, which allows us to compute multiple statistics in one place.
  • Temporary names like tmp_nrecs, tmp_mean, and tmp_q1 are used intentionally to indicate these are intermediate labels, not final report column names.
  • All statistics are returned together as a single named numeric vector, which base R stores as a matrix inside one result column.
  • do.call(data.frame, …) is used to unpack that matrix into real, independent columns that R can work with normally.
  • Flattening prefixes each summary column with the analysis variable name (age.), so we explicitly remove that prefix to clean up the column names.
  • A mapping table (final_map) defines how intermediate names should be converted into final tidyverse-style names.
  • For each column name, we check whether it appears in the mapping table and replace it only if a match is found.
  • Grouping variables (trt01pn, trt01p) are left unchanged, while summary columns are renamed cleanly.
  • The final output has the same columns and column names as the tidyverse summarise() result, while using only base R.
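One detail worth checking when using the formula interface of aggregate(): its `na.action` argument defaults to `na.omit`, so rows with a missing analysis value are removed before the summary function ever sees them. The sketch below uses a reduced copy of the lesson data and a hypothetical helper `count_missing()` to show the difference between the default and `na.action = na.pass`.

```r
# Reduced copy of the lesson data: treatment code and age only
df <- data.frame(
  trt01pn = c(1, 3, 3, 3, 3, 1),
  age     = c(23, 68, NA, 35, 54, 63)
)

# Hypothetical helper: number of missing values in a vector
count_missing <- function(x) sum(is.na(x))

# Default na.action = na.omit: the NA row is dropped before FUN runs
default_res <- aggregate(age ~ trt01pn, data = df, FUN = count_missing)

# na.action = na.pass: the NA row reaches FUN and is counted
pass_res <- aggregate(age ~ trt01pn, data = df, FUN = count_missing,
                      na.action = na.pass)

print(default_res$age)  # 0 0
print(pass_res$age)     # 0 1
```

Any statistic that depends on the missing rows (such as a record count or a missing count) therefore needs `na.action = na.pass` to agree with a `group_by()` plus `summarize()` calculation on the same data.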