All three apps end-to-end: tree, nb, acquire.
Prerequisite: dotlearn loads files from the sibling dot directory. Take the dot tour first if you haven't seen the object sugar.
$ git clone https://github.com/timm/awk
$ cd awk/dotlearn
$ ls
active.awk   bayes.awk  bayes-cli.awk  data.awk      dist.awk      Makefile
metrics.awk  shuf.awk   tree.awk       tree-cli.awk  treeshow.awk  wins.awk
$ ./dotlearn --get-data
$ head -2 data/regression/auto93.csv
Clndrs,Volume,HpX,Model,origin,Lbs-,Acc+,Mpg+
8,304,193,70,1,4732,18.5,10
Header sigils: a leading uppercase letter = numeric column, suffix ! = class label, suffix + / - = goal to maximise / minimise. Each demo also has its own demos/NAME/sample.csv, used by default when no DATA arg is given.
Five-line program: read CSV, train a tree on every row, pretty-print it. Loads the demo's config.csv (sets leaf, maxd, p) first, then the data.
$ cat > tour-tree.awk <<'EOF'
BEGIN     { FS = " *, *" }
FNR == NR { THE[$1] = $2 + 0; next }
FNR == 1  { D = new("data"); data_head(D, $0); next }
          { data_read(D, 1) }
END       { T = tree_train(D, .D.all, 0, ""); treeshow(D, T) }
EOF
$ ./dotlearn tour-tree.awk demos/tree/config.csv demos/acquire/sample.csv | head -10
rule                       d2h     n   Lbs-     Acc+   Mpg+
ROOT                      0.529   398  2970.42  15.57  23.84
|  Clndrs > 4             0.720   190  3693.56  14.51  17.84
|  |  HpX > 130           0.852    91  4151.89  12.56  14.62
|  |  |  HpX > 150        0.893    45  4322.40  11.89  12.67
|  |  |  |  Volume > 360  0.912    22  4441.41  10.85  13.18
|  |  |  |  Volume <= 360 0.874    23  4208.57  12.87  12.17
|  |  |  HpX <= 150       0.812    46  4495.65  13.21  16.67
|  |  HpX <= 130          0.594    99  3272.51  16.31  20.85
|  Clndrs <= 4            0.343   208  3308.30  16.54  29.34
Each row: rule path, d2h (mean distance-to-heaven), n (rows in node), then per-y-goal column means. Reading top-down: Clndrs <= 4 is the better branch (lower d2h, higher Mpg+, lower Lbs-). The bundled treeshow walks the tree and prints; the work is all in tree_train.
Heads up — column names look ambiguous in deep leaves.
In auto93, Volume is engine displacement in cubic inches (real range 70–455);
Model is model year (70–82, i.e. 1970–1982). They overlap numerically in the 70–100 range,
so a deep split like Volume <= 91 can read like a year filter at first glance.
It isn't — it's picking small-engine cars (which is why the leaf below has Mpg+ ≈ 35).
Now the same model, used differently: train on the first wait rows, predict the rest, and summarise per class:
$ ./dotlearn --demo tree | gawk -f tools/metrics.awk
   n     pd     pf   prec    acc  class
  48  0.625  0.145  0.789  0.748  >50_1
  55  0.855  0.375  0.723  0.748  <50
--demo tree emits raw pred,actual lines; tools/metrics.awk reduces them to per-class detection (pd), false alarm (pf), precision, and accuracy. The earlier run inspects the model; this one scores it.
Streaming Naive Bayes. m-estimate / k-estimate smoothing:
$ ./dotlearn --demo nb | gawk -f tools/metrics.awk
   n     pd     pf   prec    acc  class
  48  0.729  0.091  0.875  0.825  >50_1
  55  0.909  0.271  0.794  0.825  <50
Same shape as tree: read CSV; training rows feed per-class column objects; test rows are scored by log-likelihood. Accuracy is 0.825 on the heart data.
This is the interesting one. Shuffle. Split 50/50 train/test. Warm-start with 4 random labels. Acquire 50 more by centroid score (closer to best, farther from rest). Train a tree on the 54 labelled rows. Test, take top 5 by tree prediction, score the best by actual disty.
$ ./dotlearn --demo acquire
wins lo=0.075 med=0.535 sd=0.266

=== TREE on 54 labelled rows ===
rule                          d2h    n   Lbs-     Acc+   Mpg+
ROOT                         0.348   54  2285.63  16.49  29.63
|  Clndrs > 4                0.725    6  3348.50  13.47  18.33
|  |  Clndrs > 6             0.842    3  3818.00  13.07  13.33
|  |  Clndrs <= 6            0.608    3  2879.00  13.87  23.33
|  Clndrs <= 4               0.300   48  2152.77  16.87  31.04
|  |  HpX > 70               0.378   23  2314.83  15.84  28.70
|  |  |  origin != 1         0.413   15  2295.00  15.59  27.33
|  |  |  |  Volume > 108     0.465    7  2455.29  15.97  24.29
|  |  |  |  |  Model > 72    0.479    3  2406.67  15.43  26.67
|  |  |  |  |  Model <= 72   0.454    4  2491.75  16.38  22.50
|  |  |  |  Volume <= 108    0.367    8  2154.75  15.26  30.00
|  |  |  |  |  origin != 2   0.245    3  2309.33  16.93  30.00
|  |  |  |  |  origin == 2   0.440    5  2062.00  14.26  30.00
|  |  |  origin == 1         0.312    8  2352.00  16.31  31.25
|  |  |  |  Model > 76       0.349    4  2455.00  16.07  32.50
|  |  |  |  Model <= 76      0.275    4  2249.00  16.55  30.00
|  |  HpX <= 70              0.229   25  2003.68  17.81  33.20
|  |  |  Volume > 91         0.283   12  2120.67  17.71  30.83
|  |  |  |  Volume > 98      0.382    3  2220.00  15.00  30.00
|  |  |  |  Volume <= 98     0.250    9  2087.56  18.61  31.11
|  |  |  |  |  HpX > 66      0.237    4  2142.25  16.85  32.50
|  |  |  |  |  HpX <= 66     0.260    5  2043.80  20.02  30.00
|  |  |  Volume <= 91        0.180   13  1895.69  17.91  35.38
|  |  |  |  Volume > 85      0.207    4  1871.00  16.92  35.00
|  |  |  |  Volume <= 85     0.168    9  1906.67  18.34  35.56
|  |  |  |  |  Volume > 81   0.113    4  2047.50  19.85  37.50
|  |  |  |  |  Volume <= 81  0.212    5  1794.00  17.14  34.00

=== RESULT ===
labelled  : 54 (start=4, budget=50)
test rows : 199
top 5 guess  actual-disty=0.158  win=100/100
What that means: with 54 labels (~14% of 398 rows) plus 5 oracle calls at the end, we landed in the top-tier "win=100" band. Across 30 seeds the picks span 0.075–0.160 disty, all inside the 0.35*sd cushion of the global best (0.075).
To average over N seeded shuffles, loop in shell using tools/shuf.awk:
$ for i in $(seq 1 20); do
    ./dotlearn --demo tree <(gawk -v seed=$i -f tools/shuf.awk demos/tree/sample.csv)
  done | gawk -f tools/metrics.awk
    n     pd     pf   prec    acc  class
  945  0.743  0.237  0.727  0.754  >50_1
 1115  0.763  0.257  0.778  0.754  <50
Examples — full walkthrough of each app with code.
Manual — data sigils, config keys, file index, internals.
dot manual — the underlying object machinery.