`gap()` computes the Gap statistic of Tibshirani et al. (2001) for a dataset `X` over candidate cluster numbers `K = 1, …, Kmax`.

The procedure evaluates:

* `W_k`: within-cluster dispersion for the observed dataset,
* `W^*_k`: reference dispersions from `B` Monte Carlo samples generated according to the reference distribution (`ref.gen`),
* `gap_k = mean(log W^*_k) − log W_k`,
* the standard error `sk_k`, and
* the Gap selection criterion: select the smallest `k` such that `gap_k ≥ gap_{k+1} − sk_{k+1}`.
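The quantities above can be illustrated with a minimal base-R sketch. This is not the implementation behind `gap()`: the helper name `gap_sketch` is hypothetical, clustering is done with `kmeans()` rather than `findPartition()`, and reference data are drawn uniformly over each feature's observed range rather than with the `"PC"` method. The standard error follows Tibshirani et al. (2001), `sk = sd * sqrt(1 + 1/B)`.

```r
# Illustrative sketch only; assumes k-means clustering and a uniform-box reference.
gap_sketch <- function(X, Kmax = 5, B = 20) {
  X <- as.matrix(X)
  stopifnot(Kmax >= 2, Kmax < nrow(X))

  # W_k: pooled within-cluster sum of squares for a k-cluster partition
  Wk <- function(M, k) kmeans(M, centers = k, nstart = 10)$tot.withinss
  ks <- seq_len(Kmax)

  # log W_k for the observed data
  logW <- sapply(ks, function(k) log(Wk(X, k)))

  # log W^*_k for B reference datasets (Kmax x B matrix)
  rng <- apply(X, 2, range)
  logWstar <- replicate(B, {
    ref <- apply(rng, 2, function(r) runif(nrow(X), r[1], r[2]))
    sapply(ks, function(k) log(Wk(ref, k)))
  })

  gap <- rowMeans(logWstar) - logW
  # Standard deviation with denominator B, inflated for simulation error
  sdk <- sqrt(rowMeans((logWstar - rowMeans(logWstar))^2))
  sk <- sdk * sqrt(1 + 1 / B)

  # Smallest k with gap_k >= gap_{k+1} - sk_{k+1}; fall back to Kmax if none
  ok <- which(gap[-Kmax] >= gap[-1] - sk[-1])
  hatK <- if (length(ok) > 0) ok[1] else Kmax

  list(gap = gap, sk = sk, hatK = hatK)
}

# Usage on three simulated, well-separated groups
set.seed(1)
X <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
           matrix(rnorm(60, 5, 0.3), ncol = 2),
           matrix(rnorm(60, 10, 0.3), ncol = 2))
res <- gap_sketch(X, Kmax = 5, B = 10)
res$hatK
```

Because the criterion can only be checked for `k < Kmax`, the sketch returns `Kmax` when no smaller `k` satisfies it; how `gap()` itself handles that case is not stated here.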

The function returns both the selected number of clusters `hatK` and the corresponding cluster labels.

The function can be called internally by PART, where parameters are passed via `fixed.par`, or used independently, in which case arguments supplied in `...` directly override the default values.

gap(X, Kmax = 10, B = 100, ref.gen = "PC", cl.lab = NULL, ...)

Arguments

X

A numeric data matrix where rows are observations and columns are features.

Kmax

Maximum number of clusters to evaluate. If `Kmax` exceeds the number of observations, it is reduced appropriately.

B

Number of Monte Carlo reference datasets to generate for estimating the reference dispersion `W^*_k`.

ref.gen

Method used to generate reference datasets. Typically `"PC"` (principal components) but may include other options supported by `getReferenceW()`.
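As an illustration of what a principal-components reference can look like, the sketch below follows the approach described by Tibshirani et al. (2001): rotate the centered data onto its principal axes, draw uniform samples over the range of each rotated coordinate, and rotate back. The helper name `pc_reference` is hypothetical, and `getReferenceW()` may differ in its details.

```r
# Illustrative sketch only; not the code used by getReferenceW().
pc_reference <- function(X) {
  X <- as.matrix(X)
  mu <- colMeans(X)
  Xc <- sweep(X, 2, mu)                 # center each column
  V <- svd(Xc)$v                        # principal axes (right singular vectors)
  Xrot <- Xc %*% V                      # data in principal-component coordinates
  # Uniform draws over the observed range of each rotated coordinate
  Zrot <- apply(Xrot, 2, function(col) runif(length(col), min(col), max(col)))
  sweep(Zrot %*% t(V), 2, mu, "+")      # rotate back and restore the column means
}

set.seed(2)
X <- matrix(rnorm(200), ncol = 4)
Z <- pc_reference(X)
dim(Z)
```

Aligning the sampling box with the principal axes lets the reference data follow the overall shape of `X` more closely than a box aligned with the original features.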

cl.lab

Optional list of cluster label vectors for `k = 1, …, Kmax`. If `NULL`, labels are computed by `findPartition()`.
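A precomputed `cl.lab` could be built along the following lines. This is a sketch that assumes element `k` of the list holds an integer label vector for the `k`-cluster partition, with one label per row of `X`; the use of `hclust()` with average linkage is only an example choice.

```r
# Illustrative sketch: one label vector per k = 1, ..., Kmax from a single dendrogram
set.seed(1)
X <- rbind(matrix(rnorm(40, 0), ncol = 2),
           matrix(rnorm(40, 6), ncol = 2))
Kmax <- 4

hc <- hclust(dist(X), method = "average")
cl.lab <- lapply(seq_len(Kmax), function(k) cutree(hc, k = k))

# Assumed usage: gap(X, Kmax = Kmax, cl.lab = cl.lab)
lengths(cl.lab)
```

Supplying the labels this way avoids re-clustering the observed data inside `gap()`, which can be useful when the same partitions are reused elsewhere.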

...

Additional tuning parameters for clustering and distance computation. These may be supplied as a `fixed.par` list (when called from PART) or as individual named arguments (for stand-alone usage). Relevant fields include `cl.method`, `dist.method`, `linkage`, `cor.method`, and others.

Value

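A list containing the Gap results, including the selected number of clusters `hatK` and the corresponding cluster labels.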