SAS (Base SAS)

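data adsl;
  /* DSD strips the quotes around trt01p; DLM=' ' keeps the blank as delimiter */
  infile datalines dsd dlm=' ' truncover;
  /* Explicit widths: the default length of 8 would truncate usubjid */
  input studyid :$5. usubjid :$10. trt01pn age fasfl :$1. trt01p :$20.;
  datalines;
MYCSG MYCSG-1001 1 23 Y "Dose level 1"
MYCSG MYCSG-1002 3 68 Y "Dose level 3"
MYCSG MYCSG-1003 3 . Y "Dose level 3"
MYCSG MYCSG-1004 3 35 Y "Dose level 3"
MYCSG MYCSG-1006 3 54 Y "Dose level 3"
MYCSG MYCSG-1007 1 63 N "Dose level 1"
;
run;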

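/* PRELOADFMT needs a format that lists every treatment level, used or not */
proc format;
    value treatment
    1='1'
    2='2'
    3='3'
    4='4'   /* all arms combined, created in adsl02 */
    ;
run;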
 
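/* Keep the full analysis set and output each record twice:
   once under its own arm and once under treatment 4 (all arms combined) */
data adsl02;
    set adsl;
    where fasfl="Y";
    treatment=trt01pn;
    output;
    treatment=4;
    output;
run;

/* NWAY keeps only the by-treatment rows; COMPLETETYPES with PRELOADFMT
   also returns format levels that have no subjects */
proc summary data=adsl02 nway completetypes;
    class treatment/preloadfmt;
    var  age;
    output out=stats01 n= nmiss= mean= std= min= q1= median= q3= max= /autoname;
    format treatment treatment.;
run;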

 

R (tidyverse)

library(tidyverse)
adsl <- tribble(
  ~studyid, ~usubjid,     ~trt01pn, ~age, ~fasfl, ~trt01p,
  "MYCSG",  "MYCSG-1001", 1,        23,   "Y",    "Dose level 1",
  "MYCSG",  "MYCSG-1002", 3,        68,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1003", 3,        NA,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1004", 3,        35,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1006", 3,        54,   "Y",    "Dose level 3",
  "MYCSG",  "MYCSG-1007", 1,        63,   "N",    "Dose level 1"
)

# One row of descriptive statistics per treatment group
stats01 <- adsl %>%
  group_by(trt01pn, trt01p) %>%
  summarize(
    nrecs  = n(),                 # records in the group
    nmiss  = sum(is.na(age)),     # records with missing age
    n      = nrecs - nmiss,       # records with non-missing age
    mean   = mean(age, na.rm = TRUE),
    stddev = sd(age, na.rm = TRUE),
    min    = min(age, na.rm = TRUE),
    q1     = quantile(age, 0.25, type = 2, na.rm = TRUE),
    median = median(age, na.rm = TRUE),
    q3     = quantile(age, 0.75, type = 2, na.rm = TRUE),
    max    = max(age, na.rm = TRUE),
    .groups = "drop"              # return an ungrouped tibble
  )

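The dataset `adsl` holds one row per subject: the subject ID, the planned treatment (`trt01pn`, `trt01p`), the age, and the full analysis set flag (`fasfl`).
`group_by(trt01pn, trt01p)` defines the treatment groups; the statistics themselves are computed by `summarize()`, once per group.
`summarize()` returns the record count, the number of missing and non-missing ages, the mean, the standard deviation, and the five-number summary (minimum, Q1, median, Q3, maximum).
Quartiles use `quantile(..., type = 2)` because that definition corresponds to the SAS default (`QNTLDEF=5`); R's own default, `type = 7`, interpolates and can return different quartiles.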
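
The SAS program does two things that the R code above leaves out: it keeps only subjects with `fasfl = "Y"`, and it adds a combined group coded as treatment 4. The following is a minimal sketch of how those steps might look in dplyr, not a definitive port. The object names `fas` and `stats_total` are illustrative, `adsl02` is borrowed from the SAS step, and using a factor with levels 1 to 4 together with `.drop = FALSE` is an assumed stand-in for `PRELOADFMT` and `COMPLETETYPES`. Only a subset of the statistics is shown.

```r
library(dplyr)
library(tibble)

adsl <- tribble(
  ~usubjid,     ~trt01pn, ~age, ~fasfl,
  "MYCSG-1001", 1,        23,   "Y",
  "MYCSG-1002", 3,        68,   "Y",
  "MYCSG-1003", 3,        NA,   "Y",
  "MYCSG-1004", 3,        35,   "Y",
  "MYCSG-1006", 3,        54,   "Y",
  "MYCSG-1007", 1,        63,   "N"
)

# Keep the full analysis set only, like the SAS WHERE clause
fas <- adsl %>% filter(fasfl == "Y")

# Stack each record twice: once under its own arm, once under treatment 4
adsl02 <- bind_rows(
  fas %>% mutate(treatment = trt01pn),
  fas %>% mutate(treatment = 4)
) %>%
  # Declaring all four levels lets the unused level 2 appear in the output
  mutate(treatment = factor(treatment, levels = 1:4))

stats_total <- adsl02 %>%
  group_by(treatment, .drop = FALSE) %>%   # keep empty factor levels
  summarize(
    n      = sum(!is.na(age)),
    nmiss  = sum(is.na(age)),
    mean   = mean(age, na.rm = TRUE),
    q1     = quantile(age, 0.25, type = 2, na.rm = TRUE, names = FALSE),
    median = median(age, na.rm = TRUE),
    q3     = quantile(age, 0.75, type = 2, na.rm = TRUE, names = FALSE)
  )

print(stats_total)
```

With the sample data, the combined group (treatment 4) has four non-missing ages with a mean of 45, and treatment 2 appears with `n = 0`. `min()` and `max()` are left out of the sketch because they warn on an empty group.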

R (base)

adsl <- data.frame(
  studyid = c("MYCSG", "MYCSG", "MYCSG", "MYCSG", "MYCSG", "MYCSG"),
  usubjid = c("MYCSG-1001", "MYCSG-1002", "MYCSG-1003", "MYCSG-1004", "MYCSG-1006", "MYCSG-1007"),
  trt01pn = c(1, 3, 3, 3, 3, 1),
  age = c(23, 68, NA, 35, 54, 63),
  fasfl = c("Y", "Y", "Y", "Y", "Y", "N"),
  trt01p = c("Dose level 1", "Dose level 3", "Dose level 3", "Dose level 3", "Dose level 3", "Dose level 1")
  , stringsAsFactors = FALSE
)

#==============================================================================
# Base R summary (aggregate) with intermediate names + final tidyverse-style names
#==============================================================================

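# na.action = na.pass keeps rows with missing age; the formula interface
# otherwise drops them before FUN runs, so nmiss would always be 0
stats01 <- aggregate(age ~ trt01pn + trt01p, data = adsl, na.action = na.pass, FUN = function(x) {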

  tmp_nrecs  <- length(x)
  tmp_nmiss  <- sum(is.na(x))
  tmp_n      <- tmp_nrecs - tmp_nmiss

  c(
    tmp_nrecs  = tmp_nrecs,
    tmp_nmiss  = tmp_nmiss,
    tmp_n      = tmp_n,
    tmp_mean   = mean(x, na.rm = TRUE),
    tmp_stddev = sd(x, na.rm = TRUE),
    tmp_min    = min(x, na.rm = TRUE),
    tmp_q1     = unname(quantile(x, 0.25, type = 2, na.rm = TRUE)),
    tmp_median = median(x, na.rm = TRUE),
    tmp_q3     = unname(quantile(x, 0.75, type = 2, na.rm = TRUE)),
    tmp_max    = max(x, na.rm = TRUE)
  )
})

 
# Flatten the matrix column to real columns
stats01 <- do.call(data.frame, stats01)

 
# Remove analysis-variable prefix added by aggregate() (age.)
names(stats01) <- sub("^age\\.", "", names(stats01))

 
# Map intermediate names to final tidyverse-style names
final_map <- c(
  tmp_nrecs  = "nrecs",
  tmp_nmiss  = "nmiss",
  tmp_n      = "n",
  tmp_mean   = "mean",
  tmp_stddev = "stddev",
  tmp_min    = "min",
  tmp_q1     = "q1",
  tmp_median = "median",
  tmp_q3     = "q3",
  tmp_max    = "max"
)

 
names(stats01) <- ifelse(
  names(stats01) %in% names(final_map),
  final_map[names(stats01)],
  names(stats01)  # keep trt01pn, trt01p as-is
)

Base R `aggregate()` computes the statistics separately for each treatment group, much like PROC MEANS with a CLASS statement.
The function passed as `FUN` receives the vector of age values for one group at a time, so all statistics can be computed in one place.
Missing ages reach that function only when `na.action = na.pass` is supplied; by default the formula interface of `aggregate()` drops rows with missing values first, which would make `nmiss` zero in every group.
Temporary names such as `tmp_nrecs`, `tmp_mean`, and `tmp_q1` mark these as intermediate labels, not final report column names.
All statistics are returned together as a single named numeric vector, which `aggregate()` stores as a matrix inside one result column.
`do.call(data.frame, ...)` unpacks that matrix into separate, ordinary columns.
The unpacked columns carry the analysis variable name as a prefix (`age.`), so the prefix is removed explicitly.
A mapping vector (`final_map`) defines how the intermediate names are converted into the final tidyverse-style names.
Each column name is looked up in the mapping and replaced only if a match is found.
The grouping variables (`trt01pn`, `trt01p`) therefore keep their names, while the summary columns are renamed.
The resulting column names match those of the tidyverse `summarise()` output, using base R only. The result is a plain data frame, not a tibble, and its counts are stored as doubles.
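
The effect of `na.action` on the formula interface of `aggregate()` is easy to miss, so here is a small sketch that runs the same call twice on the sample ages. The names `ages`, `count_fun`, `dropped`, and `kept` are illustrative and not part of the lesson code.

```r
ages <- data.frame(
  trt01pn = c(1, 3, 3, 3, 3, 1),
  age     = c(23, 68, NA, 35, 54, 63)
)

# Count records and missing values within one group
count_fun <- function(x) c(nrecs = length(x), nmiss = sum(is.na(x)))

# Default behaviour: rows with a missing age are removed before FUN is called
dropped <- aggregate(age ~ trt01pn, data = ages, FUN = count_fun)

# na.pass hands the missing values through to FUN
kept <- aggregate(age ~ trt01pn, data = ages, FUN = count_fun,
                  na.action = na.pass)

# Flatten the matrix column so the counts are ordinary columns
dropped <- do.call(data.frame, dropped)
kept    <- do.call(data.frame, kept)

print(dropped)
print(kept)
```

For treatment 3, `dropped` reports 3 records and 0 missing, while `kept` reports 4 records and 1 missing, which is the count a summary table normally needs.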

Summarize Age by Treatment Group with Descriptive Statistics
