Manual
Three apps, layered on dot: tree, nb, acquire.
Header sigils
Sigils in column names tell data.awk what each column is for: the case of the first letter sets the type, a trailing sigil sets the role.
| Pattern | Type | Role |
|---|---|---|
| `^[A-Z]` | numeric | Age, Mpg |
| `^[a-z]` | symbolic | class, color |
| `name+` | y-goal | maximize, ykind=num |
| `name-` | y-goal | minimize, ykind=num |
| `name!` | klass | classification target, ykind=sym |
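For example, two hypothetical headers (goal sigils and a klass sigil imply different ykinds, so they are shown separately):

Age, color, Mpg+, Lbs-    # numeric x, symbolic x, maximize Mpg, minimize Lbs
Age, color, class!        # same x-columns, symbolic classification target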
data type
Header parse + row table. Pure ingestion. No y/target logic, no splits. Apps add what they need.
function data_init(it) {
  arr(.it.cols); arr(.it.rows); arr(.it.all)   # column stats, row table, all rows
  arr(.it.y); arr(.it.hdr); arr(.it.nump)      # y-col ids, header names, numeric flags
  .it.nc = 0; .it.nrows = 0; .it.klass = 0; .it.ykind = ""
  return it }
Public: data_head(d, line) parses CSV header. data_read(d, training) reads one row from $0 and feeds each cell into its column (if training).
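A minimal dot-style driver using the pair might look like this sketch (assuming `new("data")` dispatches to data_init the way `new("wins")` does below; the config rule is from the Config section):

BEGIN       { D = new("data") }           # fresh data object
FNR == NR   { THE[$1] = $2 + 0; next }    # first file: key/value config
FNR == 1    { data_head(D, $0); next }    # second file, header row: parse sigils
            { data_read(D, 1) }           # remaining rows: ingest as training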
data_clone / data_feed
Used by the active learner to avoid information leaks. data_clone(d) returns a fresh data object with the same headers but empty column stats. data_feed(dest, row_ids, src) copies the selected rows from src into dest, updating dest's column stats and row table.
DT = data_clone(D)
data_feed(DT, train_ids, D) # only train rows feed into DT.cols
# now disty/distx using DT match what ezr's d_train sees
disty / distx / aha
Minkowski distance functions. THE.p sets the exponent (default 2 = Euclidean).
| Function | Returns |
|---|---|
| disty(d, row) | distance to heaven across y-cols. Each y-col contributes (num_norm(c, x) - heaven)^p. |
| distx(d, r1, r2) | distance between two rows on x-cols. Sums aha(c, u, v)^p. |
| aha(c, u, v) | distance between two values on one column. Sym: 0/1. Num: normalised diff with missing-value swing. |
| mids(d, cols, row) | centroid: write each col's mid into row[i]. |
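A sketch of the numeric logic behind these, assuming num_norm maps raw values into 0..1, `?` marks a missing cell, and a hypothetical `.c.isSym` flag (dist.awk's own bookkeeping may differ):

function aha(c, u, v) {
  if (u == "?" && v == "?") return 1          # both missing: assume max distance
  if (.c.isSym) return u == v ? 0 : 1         # symbolic: match or mismatch
  if (u != "?") u = num_norm(c, u)            # normalise knowns to 0..1
  if (v != "?") v = num_norm(c, v)
  if (u == "?") u = v < 0.5 ? 1 : 0           # missing-value swing: put the
  if (v == "?") v = u < 0.5 ? 1 : 0           # unknown at the far extreme
  return u > v ? u - v : v - u }

function distx(d, r1, r2,    i, n, s) {
  for (i = 1; i <= .d.nc; i++) {
    if (i in .d.y || i == .d.klass) continue  # x-columns only
    s += aha(.d.cols[i], r1[i], r2[i]) ^ THE.p
    n++ }
  return (s / n) ^ (1 / THE.p) }              # Minkowski mean, exponent THE.p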
tree
Decision/regression tree. Binary cuts for both numeric and symbolic columns (ezr-style). Numeric: split at median. Symbolic: try every distinct value as a binary cut, pick the best.
| Function | Purpose |
|---|---|
| tree_train(d, rows, dep, label) | recursive build; stores .n.label, .n.dep, .n.mu, .n.nrows, .n.ymids[name] at each node |
| tree_test(d, n, row) | walk to leaf, return .leaf.mu |
| tree_leaf(d, n, row) | walk to leaf, return the leaf node id |
| tree_cuts(d, c, rows, out) | candidate cuts: median for numeric, distinct values for symbolic |
| try_split / ycol / spread | helpers: build l/r partitions, score, evaluate spread |
Stop conditions:
- length(rows) < 2 * THE.leaf (need enough rows to consider splitting)
- dep >= THE.maxd
- no split with both sides ≥ THE.leaf (balance filter)
Selection: minimise n_l * spread(l) + n_r * spread(r).
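Combined with the balance filter, the choice among candidate cuts amounts to this sketch (helper signatures are illustrative; try_split in tree.awk carries more bookkeeping):

function cut_score(d, l, r,    nl, nr) {
  nl = length(l); nr = length(r)
  if (nl < THE.leaf || nr < THE.leaf) return 1e32   # balance filter: reject
  return nl * spread(d, l) + nr * spread(d, r) }    # lower is better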
treeshow
Pretty-printer. tree_train caches label + dep + ymids on each node, so treeshow is just a recursive walk.
function treeshow_walk(d, n,    pad, i, key, name) {
  pad = ""
  for (i = 0; i < .n.dep; i++) pad = pad "| "
  printf "%-40s %6.3f %5d", pad .n.label, .n.mu, .n.nrows
  for (i in .d.y) printf " %8.2f", .n.ymids[.d.hdr[i]]
  printf "\n"
  if (.n.kind == "branch")
    for (key in .n.kids) treeshow_walk(d, .n.kids[key]) }
bayes
Naive Bayes. Per-class column objects accumulate during training. At test time: log-likelihood from num_like / sym_like + class prior.
Smoothing: m-estimate (default 1) for both numeric and symbolic likelihoods, k-estimate (default 1) for class priors.
Functions: nb_train, nb_test, nb_pred, nb_safe_like.
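The smoothing reduces to the standard m-estimate and k-prior forms; a sketch with illustrative argument shapes (bayes.awk's num_like/sym_like work on per-class column objects instead):

function sym_like(freq, n, prior) {              # m-estimate: freq of the value
  return (freq + THE.m * prior) / (n + THE.m) }  # in this class, pulled toward prior

function class_prior(nh, nall, nklasses) {       # k-estimate over class counts
  return (nh + THE.k) / (nall + THE.k * nklasses) }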
wins
Percentile-based scorer (up to 100, higher = better; scores below the median go negative). Initialised once on a row set; computes lo, med, sd from sorted distys.
W = new("wins")
wins_init(d, .d.all, W)         # capture lo/med/sd from full data
score = wins_score(d, row, W)   # 100 = within 0.35*sd of lo
                                #   0 = at median
                                #   negative = worse than median
The 0.35*sd cushion treats anything within roughly a third of a robust standard deviation of the best as effectively the best.
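A sketch consistent with those comments (the linear interpolation between lo and med is an assumption; wins.awk may round or clip differently):

function wins_score(d, row, W,    x) {
  x = disty(d, row)                                # lower disty = better
  if (x <= .W.lo + 0.35 * .W.sd) return 100        # effectively best
  return 100 * (.W.med - x) / (.W.med - .W.lo) }   # 0 at median, negative past it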
active pipeline
Glues everything together. See Examples / acquire for the full step-by-step. Key choices:
- Wins computed on FULL data (matches ezr's `wins(d0)`), captured before any train/test split.
- All other operations (tree, acquire, distx, disty during selection) use `dt`, a clone of `d` with only train rows fed. No info leak.
- Acquire score: `distx(row, best_centroid) - distx(row, rest_centroid)`. Pick the row with the minimum score (closest to best, farthest from rest).
- Cap on best: sqrt(|labelled|). The worst labelled row by disty is evicted to rest each iteration.
Config
Key/value CSV fed as the first input file. Each app's CLI captures via FNR == NR { THE[$1] = $2 + 0; next }.
| App | Keys | Defaults (Makefile) |
|---|---|---|
| tree | wait, leaf, maxd, p | 200, 4, 8, 2 |
| nb | wait, m, k | 200, 1, 1 |
| acquire | seed, p, few, start, budget, check, leaf, maxd | 1, 2, 128, 4, 50, 5, 3, 8 |
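For example, the tree defaults from the table would arrive as a file like:

wait,200
leaf,4
maxd,8
p,2

The `+ 0` in the capture rule coerces each value to a number.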
Files
| File | Lines | Purpose |
|---|---|---|
| data.awk | ~50 | CSV ingest, sigils, data_clone, data_feed |
| dist.awk | ~25 | disty, distx, aha, mids |
| tree.awk | ~110 | tree functions (binary cuts, balance filter) |
| tree-cli.awk | ~15 | CLI driver for dotlearn --demo tree |
| treeshow.awk | ~15 | tree pretty-printer |
| bayes.awk | ~35 | Naive Bayes functions |
| bayes-cli.awk | ~12 | CLI driver for dotlearn --demo nb |
| wins.awk | ~25 | percentile scorer |
| active.awk | ~140 | active learning pipeline |
| metrics.awk | ~30 | per-class confusion matrix |
| shuf.awk | ~15 | seeded row shuffle |