`gap()` computes the Gap statistic of Tibshirani et al. (2001) for a dataset `X` over candidate cluster numbers `K = 1, …, Kmax`.
The procedure evaluates:

* `W_k`: the within-cluster dispersion for the observed dataset,
* `W^*_k`: the reference dispersions from `B` Monte Carlo samples generated according to the reference distribution (`ref.gen`),
* `gap_k = mean(log W^*_k) − log W_k`,
* the standard error `sk`, and
* the Gap selection criterion: select the smallest `k` such that `gap_k ≥ gap_{k+1} − sk_{k+1}`.
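The quantities listed above can be illustrated with a minimal R sketch that works on precomputed log-dispersions. The helper `gap_rule()` and its arguments `logW` and `logWstar` are hypothetical names introduced here for illustration; they are not part of the package. The standard error follows the definition in Tibshirani et al. (2001), `sk = sd_k * sqrt(1 + 1/B)`, and the sketch assumes, without confirmation from the source, that `Kmax` is returned when no `k` satisfies the criterion.

```r
# Hypothetical helper (not part of the package): applies the Gap rule
# to precomputed log-dispersions.
#   logW:     numeric vector of length Kmax, log(W_k) for the observed data
#   logWstar: B x Kmax matrix, log(W*_k) for each of the B reference samples
gap_rule <- function(logW, logWstar) {
  B    <- nrow(logWstar)
  Kmax <- length(logW)
  mu   <- colMeans(logWstar)
  gap  <- mu - logW
  # Standard deviation with divisor B, then inflated for simulation error
  sdk  <- sqrt(colMeans(sweep(logWstar, 2, mu)^2))
  sk   <- sdk * sqrt(1 + 1 / B)
  # Smallest k with gap_k >= gap_{k+1} - sk_{k+1}
  ok   <- which(gap[-Kmax] >= gap[-1] - sk[-1])
  # Assumption: fall back to Kmax when no k qualifies
  hatK <- if (length(ok) > 0) ok[1] else Kmax
  list(gap = gap, sk = sk, hatK = hatK)
}

# Toy values (made up for illustration): the gap jumps between k = 1 and k = 2
logW     <- c(5.0, 3.5, 3.3, 3.2)
logWstar <- matrix(rep(c(5.1, 4.6, 4.3, 4.1), each = 4), nrow = 4) +
  c(-0.01, 0.01, -0.01, 0.01)
res <- gap_rule(logW, logWstar)
res$hatK
```

With these toy values the gap is roughly 0.1 at `k = 1` and 1.1 at `k = 2`, so the criterion first holds at `k = 2`.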
The function returns both the selected number of clusters `hatK` and the corresponding cluster labels.
The function can be called internally by PART—where parameters are passed via `fixed.par`—or used independently, in which case arguments in `...` override default values directly.
`gap(X, Kmax = 10, B = 100, ref.gen = "PC", cl.lab = NULL, ...)`

* `X`: A numeric data matrix where rows are observations and columns are features.
* `Kmax`: Maximum number of clusters to evaluate. If `Kmax` exceeds the number of observations, it is reduced appropriately.
* `B`: Number of Monte Carlo reference datasets to generate for estimating the reference dispersion `W^*_k`.
* `ref.gen`: Method used to generate reference datasets. The default is `"PC"` (principal components); other options supported by `getReferenceW()` may also be used.
* `cl.lab`: Optional list of cluster label vectors for `k = 1, …, Kmax`. If `NULL`, labels are computed by `findPartition()`.
* `...`: Additional tuning parameters for clustering and distance computation. These may be supplied as a `fixed.par` list (when called from PART) or as individual named arguments (for stand-alone usage). Relevant fields include `cl.method`, `dist.method`, `linkage`, `cor.method`, and others.
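A list holding the Gap results, including the selected number of clusters `hatK` and the corresponding cluster labels.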
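A stand-alone call might look like the sketch below. It uses only the arguments shown in the signature, builds a made-up toy dataset, and assumes the selected number of clusters is available as the element `hatK` of the returned list. The call is guarded so the snippet also runs when no function named `gap()` is available in the session.

```r
# Toy data: two well-separated Gaussian groups in two dimensions
set.seed(123)
X <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)

# Guarded call: runs only if a function named gap() is available,
# i.e. the package that documents it has been attached.
if (exists("gap", mode = "function")) {
  res <- gap(X, Kmax = 5, B = 50, ref.gen = "PC")
  print(res$hatK)  # assumed element name for the selected number of clusters
}
```

A smaller `B` than the default keeps the example fast; more reference datasets give a more stable estimate of `W^*_k` at higher cost.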