Manual

Three apps, layered on dot: tree, nb, acquire.

Header sigils

Sigils in column names tell data.awk what each column is for: the case of the first letter sets the type, a trailing mark sets the role.

Pattern   Type      Role
^[A-Z]    numeric   Age, Mpg
^[a-z]    symbolic  class, color
name+     y-goal    maximize, ykind=num
name-     y-goal    minimize, ykind=num
name!     klass     classification target, ykind=sym
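Restated as a plain-AWK sketch (the function name and return strings are illustrative, not data.awk's actual internals):

# Illustrative restatement of the sigil table; data.awk's real
# parser may differ. Trailing marks are checked before letter case.
function sigil(name) {
  if (name ~ /!$/)     return "klass"   # classification target, ykind=sym
  if (name ~ /\+$/)    return "max"     # y-goal: maximize, ykind=num
  if (name ~ /-$/)     return "min"     # y-goal: minimize, ykind=num
  if (name ~ /^[A-Z]/) return "num"     # uppercase first letter: numeric
  return "sym" }                        # lowercase first letter: symbolic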

data type

Header parse + row table. Pure ingestion. No y/target logic, no splits. Apps add what they need.

function data_init(it) {
  arr(.it.cols); arr(.it.rows); arr(.it.all)   # column stats; row tables
  arr(.it.y);    arr(.it.hdr);  arr(.it.nump)  # y-col indexes; names; is-numeric flags
  .it.nc = 0; .it.nrows = 0; .it.klass = 0; .it.ykind = ""  # counts; klass col; y type
  return it }

Public: data_head(d, line) parses CSV header. data_read(d, training) reads one row from $0 and feeds each cell into its column (if training).
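A minimal driver in the usual AWK shape (the BEGIN setup and object wiring are assumptions about how a dot program is composed, not copied from an app):

# Illustrative driver; setup details are assumptions.
BEGIN    { FS = ","; data_init(D) }
FNR == 1 { data_head(D, $0); next }   # header line: parse sigils
         { data_read(D, 1) }          # each later line: one training row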

data_clone / data_feed

Used by the active learner to avoid information leak. data_clone(d) returns a fresh data with the same headers but empty column stats. data_feed(dest, row_ids, src) copies the selected rows from src into dest, updating dest's column stats and row table.

DT = data_clone(D)
data_feed(DT, train_ids, D)   # only train rows feed into DT.cols
# now disty/distx using DT match what ezr's d_train sees

disty / distx / aha

Minkowski distance functions. THE.p sets the exponent (default 2 = Euclidean).

Function            Returns
disty(d, row)       distance to heaven across y-cols; each y-col contributes (num_norm(c, x) - heaven)^p
distx(d, r1, r2)    distance between two rows on x-cols; sums aha(c, u, v)^p
aha(c, u, v)        distance between two values on one column; sym: 0/1, num: normalised diff with missing-value swing
mids(d, cols, row)  centroid: write each col's mid into row[i]
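The combine step, as a plain-AWK sketch. Whether the real distx averages and takes the p-th root is an assumption here; the table above only states that it sums aha^p.

# Illustrative Minkowski combine; dist.awk's exact normalisation is
# an assumption. diffs[1..n] hold per-column aha() values.
function minkowski(diffs, n,    i, s) {
  for (i = 1; i <= n; i++) s += diffs[i] ^ THE["p"]
  return (s / n) ^ (1 / THE["p"]) }   # mean of p-th powers, p-th root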

tree

Decision/regression tree. Binary cuts for both numeric and symbolic columns (ezr-style). Numeric: split at median. Symbolic: try every distinct value as a binary cut, pick the best.

Function                         Purpose
tree_train(d, rows, dep, label)  recursive build; stores .n.label, .n.dep, .n.mu, .n.nrows, .n.ymids[name] at each node
tree_test(d, n, row)             walk to leaf, return .leaf.mu
tree_leaf(d, n, row)             walk to leaf, return the leaf node id
tree_cuts(d, c, rows, out)       candidate cuts: median for numeric, distinct values for symbolic
try_split / ycol / spread        helpers: build l/r partitions, score, evaluate spread

Stop conditions:

  • length(rows) < 2 * THE.leaf (need enough rows to consider splitting)
  • dep >= THE.maxd
  • no split with both sides ≥ THE.leaf (balance filter)

Selection: minimise n_l * spread(l) + n_r * spread(r).
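The scoring rule as a one-function sketch (spread's exact signature is an assumption):

# Illustrative scorer; lower is better. A cut qualifies only when
# nl >= THE.leaf and nr >= THE.leaf (the balance filter above).
function cut_score(d, l, nl, r, nr) {
  return nl * spread(d, l) + nr * spread(d, r) }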

treeshow

Pretty-printer. tree_train caches label, dep, mu, nrows and ymids on each node, so treeshow is just a recursive walk.

function treeshow_walk(d, n,    pad, i, key, name) {
  pad = ""
  for (i = 0; i < .n.dep; i++) pad = pad "|   "
  printf "%-40s %6.3f %5d", pad .n.label, .n.mu, .n.nrows
  for (i in .d.y) printf " %8.2f", .n.ymids[.d.hdr[i]]
  printf "\n"
  if (.n.kind == "branch")
    for (key in .n.kids) treeshow_walk(d, .n.kids[key]) }

bayes

Naive Bayes. Per-class column objects accumulate during training. At test time: log-likelihood from num_like / sym_like + class prior.

Smoothing: m-estimate (default 1) for both numeric and symbolic likelihoods, k-estimate (default 1) for class priors.

Functions: nb_train, nb_test, nb_pred, nb_safe_like.
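The two smoothed terms, sketched with hypothetical names (nb_train's real bookkeeping may differ):

# m-estimate and k-estimate, spelled out; names are illustrative.
function k_prior(nh, n, nclasses) {   # nh rows in class h, n rows total
  return (nh + THE["k"]) / (n + THE["k"] * nclasses) }
function m_sym_like(freq, nh, pr) {   # freq = count of value in class h
  return (freq + THE["m"] * pr) / (nh + THE["m"]) }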

wins

Percentile-based scorer (0..100, higher = better). Initialised once on a row set; computes lo, med, sd from sorted distys.

W = new("wins")
wins_init(d, .d.all, W)         # capture lo/med/sd from full data
score = wins_score(d, row, W)   # 100 = within 0.35*sd of lo
                                 # 0   = at median
                                 # negative = worse than median

The 0.35*sd cushion treats anything within about a third of a robust sd of the best as effectively best.
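One plausible realisation of that mapping, matching the three anchors in the snippet above (wins.awk's actual formula may differ):

# Hypothetical linear mapping: 100 at lo + 0.35*sd, 0 at med,
# negative past the median. Assumes med > lo + 0.35*sd.
function wins_score_sketch(dst, lo, med, sd) {
  return 100 * (med - dst) / (med - (lo + 0.35 * sd)) }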

active pipeline

Glues everything together. See Examples / acquire for the full step-by-step. Key choices (a sketch of one iteration follows the list):

  • Wins computed on FULL data (matches ezr's wins(d0)) — captured before any train/test split.
  • All other operations (tree, acquire, distx, disty during selection) use dt — a clone of d with only train rows fed. No info leak.
  • Acquire score: distx(row, best_centroid) - distx(row, rest_centroid). Pick row with min score (closest to best, farthest from rest).
  • Cap on best: sqrt(|labelled|). Worst by disty evicted to rest each iteration.
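One iteration, compressed into a hedged sketch. Treating best/rest as separate data objects, and the pool array, are illustrative stand-ins for active.awk's real bookkeeping.

# Illustrative iteration; variable names are hypothetical.
mids(best, .best.cols, bmid)   # centroid row of the "best" labelled set
mids(rest, .rest.cols, rmid)   # centroid row of the "rest" labelled set
lo = ""
for (i in pool) {              # unlabelled candidates
  s = distx(dt, pool[i], bmid) - distx(dt, pool[i], rmid)
  if (lo == "" || s < lo) { lo = s; pick = i } }   # min score wins
# label pick into best; if |best| > sqrt(labelled), move the row
# with the worst disty from best to rest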

Config

Key/value CSV fed as the first input file. Each app's CLI captures via FNR == NR { THE[$1] = $2 + 0; next }.

App      Keys                                            Defaults (Makefile)
tree     wait, leaf, maxd, p                             200, 4, 8, 2
nb       wait, m, k                                      200, 1, 1
acquire  seed, p, few, start, budget, check, leaf, maxd  1, 2, 128, 4, 50, 5, 3, 8
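Spelled out, the capture idiom and a matching config file look like this (the FS setting and file layout are assumptions consistent with "key/value CSV fed as the first input file"):

# Illustrative; assumes comma-separated config as the first file.
BEGIN     { FS = "," }
FNR == NR { THE[$1] = $2 + 0; next }   # cfg.csv: key,value -> THE
# later files have FNR != NR, so their rows fall through to the app

# cfg.csv might read:
#   leaf,3
#   maxd,8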

Files

File           Lines  Purpose
data.awk       ~50    CSV ingest, sigils, data_clone, data_feed
dist.awk       ~25    disty, distx, aha, mids
tree.awk       ~110   tree functions (binary cuts, balance filter)
tree-cli.awk   ~15    CLI driver for dotlearn --demo tree
treeshow.awk   ~15    tree pretty-printer
bayes.awk      ~35    Naive Bayes functions
bayes-cli.awk  ~12    CLI driver for dotlearn --demo nb
wins.awk       ~25    percentile scorer
active.awk     ~140   active learning pipeline
metrics.awk    ~30    per-class confusion matrix
shuf.awk       ~15    seeded row shuffle