Tour

All three apps end-to-end: tree, nb, acquire.

Prerequisite: dotlearn loads files from the sibling dot directory. Take the dot tour first if you haven't seen the object sugar.

1

Clone & layout

$ git clone https://github.com/timm/awk
$ cd awk/dotlearn
$ ls
active.awk      bayes.awk       bayes-cli.awk   data.awk
demos           dist.awk        dotlearn        Makefile
tools           tree.awk        tree-cli.awk    treeshow.awk
wins.awk
2

Get sample data

$ ./dotlearn --get-data
$ head -2 data/regression/auto93.csv
Clndrs,Volume,HpX,Model,origin,Lbs-,Acc+,Mpg+
8,304,193,70,1,4732,18.5,10

Header sigils: a name starting with an UPPERCASE letter marks a numeric column; a ! suffix marks the class column; a + or - suffix marks a goal to maximize or minimize. Each demo also ships its own demos/NAME/sample.csv, used by default when no DATA argument is given.

3a

App 1: tree — build it and look at it

Five-line program: read CSV, train a tree on every row, pretty-print it. Loads the demo's config.csv (sets leaf, maxd, p) first, then the data.

$ cat > tour-tree.awk <<'EOF'
BEGIN { FS = " *, *" }
FNR == NR { THE[$1] = $2 + 0; next }
FNR == 1  { D = new("data"); data_head(D, $0); next }
          { data_read(D, 1) }
END       { T = tree_train(D, .D.all, 0, ""); treeshow(D, T) }
EOF
$ ./dotlearn tour-tree.awk demos/tree/config.csv demos/acquire/sample.csv | head -10
rule                                        d2h     n     Lbs-     Acc+     Mpg+
ROOT                                      0.529   398  2970.42    15.57    23.84
|   Clndrs >  4                           0.720   190  3693.56    14.51    17.84
|   |   HpX >  130                        0.852    91  4151.89    12.56    14.62
|   |   |   HpX >  150                    0.893    45  4322.40    11.89    12.67
|   |   |   |   Volume >  360             0.912    22  4441.41    10.85    13.18
|   |   |   |   Volume <= 360             0.874    23  4208.57    12.87    12.17
|   |   |   HpX <= 150                    0.812    46  4495.65    13.21    16.67
|   |   HpX <= 130                        0.594    99  3272.51    16.31    20.85
|   Clndrs <= 4                           0.343   208  3308.30    16.54    29.34

Each row: rule path, d2h (mean distance-to-heaven), n (rows in node), then per-y-goal column means. Reading top-down: Clndrs <= 4 is the better branch (lower d2h, higher Mpg+, lower Lbs-). The bundled treeshow walks the tree and prints; the work is all in tree_train.

Heads up: column names can look ambiguous in deep leaves. In auto93, Volume is engine displacement in cubic inches (real range 70–455); Model is model year (70–82, i.e. 1970–1982). They overlap numerically in the 70–100 range, so a deep split like Volume <= 91 (it shows up in the fuller tree in section 5) can read like a year filter at first glance. It isn't: it picks small-engine cars, which is why that leaf has Mpg+ ≈ 35.

3b

tree — get prediction stats from it

Now the same model, used differently: train on the first wait rows (wait is a setting), predict the rest, summarise per class:

$ ./dotlearn --demo tree | gawk -f tools/metrics.awk
    n     pd     pf   prec    acc  class
   48  0.625  0.145  0.789  0.748  >50_1
   55  0.855  0.375  0.723  0.748  <50

--demo tree emits raw pred,actual lines; tools/metrics.awk reduces them to per-class precision/recall/accuracy. So 3a inspects the model, 3b scores it.

4

App 2: nb (naive Bayes)

Streaming Naive Bayes, with m-estimate smoothing on the likelihoods and k-estimate smoothing on the priors:

$ ./dotlearn --demo nb | gawk -f tools/metrics.awk
    n     pd     pf   prec    acc  class
   48  0.729  0.091  0.875  0.825  >50_1
   55  0.909  0.271  0.794  0.825  <50

Same shape as tree: read CSV, train rows fed into per-class column objects, test rows scored by log-likelihood. acc=0.825 on heart.

5

App 3: acquire (active learning)

This is the interesting one. Shuffle. Split 50/50 train/test. Warm-start with 4 random labels. Acquire 50 more by centroid score (closer to best, farther from rest). Train a tree on the 54 labelled rows. Test, take top 5 by tree prediction, score the best by actual disty.

$ ./dotlearn --demo acquire
wins  lo=0.075  med=0.535  sd=0.266

=== TREE on 54 labelled rows ===
rule                                        d2h     n     Lbs-     Acc+     Mpg+
ROOT                                      0.348    54  2285.63    16.49    29.63
|   Clndrs >  4                           0.725     6  3348.50    13.47    18.33
|   |   Clndrs >  6                       0.842     3  3818.00    13.07    13.33
|   |   Clndrs <= 6                       0.608     3  2879.00    13.87    23.33
|   Clndrs <= 4                           0.300    48  2152.77    16.87    31.04
|   |   HpX >  70                         0.378    23  2314.83    15.84    28.70
|   |   |   origin != 1                   0.413    15  2295.00    15.59    27.33
|   |   |   |   Volume >  108             0.465     7  2455.29    15.97    24.29
|   |   |   |   |   Model >  72           0.479     3  2406.67    15.43    26.67
|   |   |   |   |   Model <= 72           0.454     4  2491.75    16.38    22.50
|   |   |   |   Volume <= 108             0.367     8  2154.75    15.26    30.00
|   |   |   |   |   origin != 2           0.245     3  2309.33    16.93    30.00
|   |   |   |   |   origin == 2           0.440     5  2062.00    14.26    30.00
|   |   |   origin == 1                   0.312     8  2352.00    16.31    31.25
|   |   |   |   Model >  76               0.349     4  2455.00    16.07    32.50
|   |   |   |   Model <= 76               0.275     4  2249.00    16.55    30.00
|   |   HpX <= 70                         0.229    25  2003.68    17.81    33.20
|   |   |   Volume >  91                  0.283    12  2120.67    17.71    30.83
|   |   |   |   Volume >  98              0.382     3  2220.00    15.00    30.00
|   |   |   |   Volume <= 98              0.250     9  2087.56    18.61    31.11
|   |   |   |   |   HpX >  66             0.237     4  2142.25    16.85    32.50
|   |   |   |   |   HpX <= 66             0.260     5  2043.80    20.02    30.00
|   |   |   Volume <= 91                  0.180    13  1895.69    17.91    35.38
|   |   |   |   Volume >  85              0.207     4  1871.00    16.92    35.00
|   |   |   |   Volume <= 85              0.168     9  1906.67    18.34    35.56
|   |   |   |   |   Volume >  81          0.113     4  2047.50    19.85    37.50
|   |   |   |   |   Volume <= 81          0.212     5  1794.00    17.14    34.00

=== RESULT ===
labelled  : 54   (start=4, budget=50)
test rows : 199
top 5 guess actual-disty=0.158  win=100/100

What that means: with 54 labels (~14% of 398 rows) plus 5 oracle calls at the end, we landed in the top-tier "win=100" band. Across 30 seeds the picks span 0.075–0.160 disty, all inside the 0.35*sd cushion of the global best (0.075).

6

Cross-validation

To average over N seeded shuffles, loop in shell using tools/shuf.awk:

$ for i in $(seq 1 20); do
    ./dotlearn --demo tree <(gawk -v seed=$i -f tools/shuf.awk demos/tree/sample.csv)
  done | gawk -f tools/metrics.awk
    n     pd     pf   prec    acc  class
  945  0.743  0.237  0.727  0.754  >50_1
 1115  0.763  0.257  0.778  0.754  <50
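What a seeded shuffle has to do is easy to see from a minimal stand-in for tools/shuf.awk (an assumed re-implementation, not the bundled file): keep the header first, Fisher-Yates the body, and let srand(seed) make each seed reproducible:

```shell
printf 'h\nr1\nr2\nr3\nr4\n' |
gawk -v seed=7 '
NR == 1 { print; next }              # the header line passes through first
        { a[++n] = $0 }
END     { srand(seed)                # same seed, same order
          for (i = n; i > 1; i--) {  # Fisher-Yates shuffle of the body
            j = int(rand() * i) + 1
            t = a[i]; a[i] = a[j]; a[j] = t }
          for (i = 1; i <= n; i++) print a[i] }'
```

Varying -v seed in the loop above gives a different, but repeatable, train/test ordering per iteration.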
7

Where to go next

Examples — full walkthrough of each app with code.
Manual — data sigils, config keys, file index, internals.
dot manual — the underlying object machinery.