Manual

Three apps, layered on dot: tree, nb, acquire.

Header sigils

Sigils in column names tell data.awk what each column is for: the case of the first letter sets the type, a trailing mark sets the role.

Pattern   Type      Role
^[A-Z]    numeric   Age, Mpg
^[a-z]    symbolic  class, color
name+     y-goal    maximize, ykind=num
name-     y-goal    minimize, ykind=num
name!     klass     classification target, ykind=sym
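Restated as a plain-AWK sketch (the function name and return strings are illustrative, not data.awk's actual internals):

# Illustrative restatement of the sigil table; data.awk's real
# parser may differ. Trailing marks are checked before letter case.
function sigil(name) {
  if (name ~ /!$/)     return "klass"   # classification target, ykind=sym
  if (name ~ /\+$/)    return "max"     # y-goal: maximize, ykind=num
  if (name ~ /-$/)     return "min"     # y-goal: minimize, ykind=num
  if (name ~ /^[A-Z]/) return "num"     # uppercase first letter: numeric
  return "sym" }                        # lowercase first letter: symbolic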

data type

Header parse + row table. Pure ingestion. No y/target logic, no splits. Apps add what they need.

function data_init(it) {
  arr(.it.cols); arr(.it.rows); arr(.it.all)   # column stats; row tables
  arr(.it.y);    arr(.it.hdr);  arr(.it.nump)  # y-col indexes; names; is-numeric flags
  .it.nc = 0; .it.nrows = 0; .it.klass = 0; .it.ykind = ""  # counts; klass col; y type
  return it }

Public: data_head(d, line) parses CSV header. data_read(d, training) reads one row from $0 and feeds each cell into its column (if training).
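A minimal driver in the usual AWK shape (the BEGIN setup and object wiring are assumptions about how a dot program is composed, not copied from an app):

# Illustrative driver; setup details are assumptions.
BEGIN    { FS = ","; data_init(D) }
FNR == 1 { data_head(D, $0); next }   # header line: parse sigils
         { data_read(D, 1) }          # each later line: one training row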

data_clone / data_feed

Used by the active learner to avoid information leak. data_clone(d) returns a fresh data with the same headers but empty column stats. data_feed(dest, row_ids, src) copies the selected rows from src into dest, updating dest's column stats and row table.

DT = data_clone(D)
data_feed(DT, train_ids, D)   # only train rows feed into DT.cols
# now disty/distx using DT match what ezr's d_train sees

disty / distx / aha

Minkowski distance functions. THE.p sets the exponent (default 2 = Euclidean).

Function            Returns
disty(d, row)       distance to heaven across y-cols; each y-col contributes (num_norm(c, x) - heaven)^p
distx(d, r1, r2)    distance between two rows on x-cols; sums aha(c, u, v)^p
aha(c, u, v)        distance between two values on one column; sym: 0/1, num: normalised diff with missing-value swing
mids(d, cols, row)  centroid: write each col's mid into row[i]
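The combine step, as a plain-AWK sketch. Whether the real distx averages and takes the p-th root is an assumption here; the table above only states that it sums aha^p.

# Illustrative Minkowski combine; dist.awk's exact normalisation is
# an assumption. diffs[1..n] hold per-column aha() values.
function minkowski(diffs, n,    i, s) {
  for (i = 1; i <= n; i++) s += diffs[i] ^ THE["p"]
  return (s / n) ^ (1 / THE["p"]) }   # mean of p-th powers, p-th root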

tree

Decision/regression tree. Binary cuts for both numeric and symbolic columns (ezr-style). Numeric: split at median. Symbolic: try every distinct value as a binary cut, pick the best.

Function                         Purpose
tree_train(d, rows, dep, label)  recursive build; stores .n.label, .n.dep, .n.mu, .n.nrows, .n.ymids[name] at each node
tree_test(d, n, row)             walk to leaf, return .leaf.mu
tree_leaf(d, n, row)             walk to leaf, return the leaf node id
tree_cuts(d, c, rows, out)       candidate cuts: median for numeric, distinct values for symbolic
try_split / ycol / spread        helpers: build l/r partitions, score, evaluate spread

Stop conditions:

  • length(rows) < 2 * THE.leaf (need enough rows to consider splitting)
  • dep >= THE.maxd
  • no split with both sides ≥ THE.leaf (balance filter)

Selection: minimise n_l * spread(l) + n_r * spread(r).
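The scoring rule as a one-function sketch (spread's exact signature is an assumption):

# Illustrative scorer; lower is better. A cut qualifies only when
# nl >= THE.leaf and nr >= THE.leaf (the balance filter above).
function cut_score(d, l, nl, r, nr) {
  return nl * spread(d, l) + nr * spread(d, r) }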

treeshow

Pretty-printer. tree_train caches label, dep, mu, nrows and ymids on each node, so treeshow is just a recursive walk.

function treeshow_walk(d, n,    pad, i, key, name) {
  pad = ""
  for (i = 0; i < .n.dep; i++) pad = pad "|   "
  printf "%-40s %6.3f %5d", pad .n.label, .n.mu, .n.nrows
  for (i in .d.y) printf " %8.2f", .n.ymids[.d.hdr[i]]
  printf "\n"
  if (.n.kind == "branch")
    for (key in .n.kids) treeshow_walk(d, .n.kids[key]) }

bayes

Naive Bayes. Per-class column objects accumulate during training. At test time: log-likelihood from num_like / sym_like + class prior.

Smoothing: m-estimate (default 1) for both numeric and symbolic likelihoods, k-estimate (default 1) for class priors.

Functions: nb_train, nb_test, nb_pred, nb_safe_like.
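The two smoothed terms, sketched with hypothetical names (nb_train's real bookkeeping may differ):

# m-estimate and k-estimate, spelled out; names are illustrative.
function k_prior(nh, n, nclasses) {   # nh rows in class h, n rows total
  return (nh + THE["k"]) / (n + THE["k"] * nclasses) }
function m_sym_like(freq, nh, pr) {   # freq = count of value in class h
  return (freq + THE["m"] * pr) / (nh + THE["m"]) }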

wins

Percentile-based scorer (0..100, higher = better). Initialised once on a row set; computes lo, med, sd from sorted distys.

W = new("wins")
wins_init(d, .d.all, W)         # capture lo/med/sd from full data
score = wins_score(d, row, W)   # 100 = within 0.35*sd of lo
                                 # 0   = at median
                                 # negative = worse than median

The 0.35*sd cushion treats anything within about a third of a robust sd of the best as effectively best.
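One plausible realisation of that mapping, matching the three anchors in the snippet above (wins.awk's actual formula may differ):

# Hypothetical linear mapping: 100 at lo + 0.35*sd, 0 at med,
# negative past the median. Assumes med > lo + 0.35*sd.
function wins_score_sketch(dst, lo, med, sd) {
  return 100 * (med - dst) / (med - (lo + 0.35 * sd)) }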

active pipeline

Glues everything together. See Examples / acquire for the full step-by-step. Key choices (a sketch of one iteration follows the list):

  • Wins computed on FULL data (matches ezr's wins(d0)) — captured before any train/test split.
  • All other operations (tree, acquire, distx, disty during selection) use dt — a clone of d with only train rows fed. No info leak.
  • Acquire score: distx(row, best_centroid) - distx(row, rest_centroid). Pick row with min score (closest to best, farthest from rest).
  • Cap on best: sqrt(|labelled|). Worst by disty evicted to rest each iteration.
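One iteration, compressed into a hedged sketch. Treating best/rest as separate data objects, and the pool array, are illustrative stand-ins for active.awk's real bookkeeping.

# Illustrative iteration; variable names are hypothetical.
mids(best, .best.cols, bmid)   # centroid row of the "best" labelled set
mids(rest, .rest.cols, rmid)   # centroid row of the "rest" labelled set
lo = ""
for (i in pool) {              # unlabelled candidates
  s = distx(dt, pool[i], bmid) - distx(dt, pool[i], rmid)
  if (lo == "" || s < lo) { lo = s; pick = i } }   # min score wins
# label pick into best; if |best| > sqrt(labelled), move the row
# with the worst disty from best to rest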

Config

Key/value CSV fed as the first input file. Each app's CLI captures via FNR == NR { THE[$1] = $2 + 0; next }.

App      Keys                                            Defaults (Makefile)
tree     wait, leaf, maxd, p                             200, 4, 8, 2
nb       wait, m, k                                      200, 1, 1
acquire  seed, p, few, start, budget, check, leaf, maxd  1, 2, 128, 4, 50, 5, 3, 8
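Spelled out, the capture idiom and a matching config file look like this (the FS setting and file layout are assumptions consistent with "key/value CSV fed as the first input file"):

# Illustrative; assumes comma-separated config as the first file.
BEGIN     { FS = "," }
FNR == NR { THE[$1] = $2 + 0; next }   # cfg.csv: key,value -> THE
# later files have FNR != NR, so their rows fall through to the app

# cfg.csv might read:
#   leaf,3
#   maxd,8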

Files

File           Lines  Purpose
data.awk       ~50    CSV ingest, sigils, data_clone, data_feed
dist.awk       ~25    disty, distx, aha, mids
tree.awk       ~110   tree functions (binary cuts, balance filter)
tree-cli.awk   ~15    CLI driver for dotlearn --demo tree
treeshow.awk   ~15    tree pretty-printer
bayes.awk      ~35    Naive Bayes functions
bayes-cli.awk  ~12    CLI driver for dotlearn --demo nb
wins.awk       ~25    percentile scorer
active.awk     ~140   active learning pipeline
metrics.awk    ~30    per-class confusion matrix
shuf.awk       ~15    seeded row shuffle