Examples

One section per app. Each shows real input, real output.

tree

Decision tree on heart-disease data. Binary cuts, ezr-style.
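
An ezr-style binary-cut tree picks, at each node, the column/value split whose children are purest. A minimal Python sketch of that scoring for one numeric column with a numeric target (illustrative only; dotlearn's actual cut logic may differ, and a symbolic class target like num! would use entropy instead of stdev):

    import statistics

    def best_numeric_cut(rows, col, y):
        # Try each observed value v as a "<= v vs > v" cut; keep the cut
        # whose children have the lowest size-weighted spread in y(row).
        best, out = float("inf"), None
        for v in sorted({row[col] for row in rows}):
            left  = [y(r) for r in rows if r[col] <= v]
            right = [y(r) for r in rows if r[col] >  v]
            if len(left) < 2 or len(right) < 2:
                continue                       # skip tiny slivers
            spread = (len(left)  * statistics.stdev(left) +
                      len(right) * statistics.stdev(right)) / len(rows)
            if spread < best:
                best, out = spread, v
        return out, best                       # best cut value and its spread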

Input (ezr naming: uppercase first letter = numeric column, trailing ! = class target)

AGE,sex,cp,TRESTBPS,CHOL,fbs,restecg,THALACH,exang,OLDPEAK,slope,CA,thal,num!
63,male,typ_angina,145,233,t,left_vent_hyper,150,no,2.3,down,0,fixed_defect,<50
67,male,asympt,160,286,f,left_vent_hyper,108,yes,1.5,flat,3,normal,>50_1

Run

dotlearn --demo tree                                # uses bundled sample (heart.c.csv)
dotlearn --demo tree data/classify/heart.c.csv      # explicit data file
dotlearn --demo tree | gawk -f tools/metrics.awk    # add metrics summary

Output (raw, "pred,actual" lines)

<50,<50
<50,<50
>50_1,>50_1
<50,>50_1
... (one row per held-out test row)

Pipe to tools/metrics.awk for a per-class summary:

    n     pd     pf   prec    acc  class
   48  0.625  0.145  0.789  0.748  >50_1
   55  0.855  0.375  0.723  0.748  <50

n = test count, pd = recall (probability of detection), pf = false-alarm rate, prec = precision, acc = overall accuracy. The same metrics apply to both classification and regression (with continuous targets, the class column shows raw distance values).
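
tools/metrics.awk is the source of truth for these numbers; purely as a reading aid, here is the same per-class arithmetic in Python (assuming clean "pred,actual" lines on stdin):

    import sys
    from collections import Counter

    rows = [line.strip().split(",") for line in sys.stdin if "," in line]
    tp, fp, fn, tn = Counter(), Counter(), Counter(), Counter()
    classes = {actual for _, actual in rows}
    for pred, actual in rows:
        for k in classes:                      # one-vs-rest confusion counts
            if   actual == k and pred == k: tp[k] += 1
            elif actual != k and pred == k: fp[k] += 1
            elif actual == k and pred != k: fn[k] += 1
            else:                           tn[k] += 1

    acc = sum(tp.values()) / len(rows)         # overall accuracy, same every row
    for k in sorted(classes):
        n    = tp[k] + fn[k]                   # test rows of this class
        pd   = tp[k] / (n or 1)                # recall
        pf   = fp[k] / ((fp[k] + tn[k]) or 1)  # false-alarm rate
        prec = tp[k] / ((tp[k] + fp[k]) or 1)  # precision
        print(f"{n:5} {pd:6.3f} {pf:6.3f} {prec:6.3f} {acc:6.3f}  {k}")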

nb

Naive Bayes. Same shape as tree.

Run

dotlearn --demo nb                              # uses bundled sample (heart.c.csv)
dotlearn --demo nb | gawk -f tools/metrics.awk  # add metrics summary

Output (with metrics)

    n     pd     pf   prec    acc  class
   48  0.729  0.091  0.875  0.825  >50_1
   55  0.909  0.271  0.794  0.825  <50

acc=0.825 on heart.c (vs tree=0.748). Naive Bayes wins here because heart has many low-cardinality symbolic columns.
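
For intuition on that point, a sketch of an m-estimate naive Bayes log-likelihood over symbolic columns, in the style of ezr-family learners (illustrative: freq, count, k, m are stand-ins, not dotlearn's internals):

    import math

    def loglike(row, klass, freq, count, n_rows, n_classes, k=1, m=2):
        # freq[klass][col][value] = training count of value in col for klass;
        # count[klass] = number of training rows labelled klass.
        prior = (count[klass] + k) / (n_rows + k * n_classes)
        out   = math.log(prior)
        for col, value in row.items():
            out += math.log((freq[klass][col].get(value, 0) + m * prior)
                            / (count[klass] + m))
        return out                             # classify = argmax over classes

With few distinct values per column, the frequency tables stay dense even on small training samples, which is the low-cardinality advantage noted above.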

acquire

Active learning on auto93 (multi-objective: Lbs-, Acc+, Mpg+; trailing - = minimize, + = maximize).

Pipeline

  1. Read all 398 rows (column stats from full data).
  2. Wins percentiles: compute lo, med, sd of disty over full data. Used later to score the final pick.
  3. Shuffle + split half/half: train pool (199), test pool (199). Cap train pool at 128.
  4. Clone: build a fresh data dt with same column structure but only train rows fed in. All subsequent stats come from dt, not full data — matches ezr's setup, no info leak.
  5. Warm-start: label first 4 train rows. Sort by disty. Top sqrt(4)=2 go to best, the others go to rest.
  6. Acquire loop (50 iterations): score each unlabelled row with distx(best_centroid) - distx(rest_centroid). Pick lowest score. Add to best. Cap best at sqrt(|labelled|); evict worst to rest. (Sketched in code after this list.)
  7. Tree: train on the 54 labelled rows.
  8. Predict: walk each test row to its tree leaf. Take top 5 by leaf mu (the leaf's mean d2h).
  9. Score: of those 5, pick the one with min actual disty. Score against the wins percentiles from step 2.
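
A compact sketch of the acquire loop (step 6), with distx, disty, and centroid as stand-ins for dotlearn's internals (illustrative, not the shipped code):

    import math

    def acquire(unlabelled, best, rest, distx, disty, centroid, budget=50):
        for _ in range(budget):
            if not unlabelled:
                break
            b, r = centroid(best), centroid(rest)
            # Lower score = looks more like best than like rest.
            row = min(unlabelled, key=lambda x: distx(x, b) - distx(x, r))
            unlabelled.remove(row)
            best.append(row)                   # appending = paying to label it
            best.sort(key=disty)               # smallest disty first
            cap = round(math.sqrt(len(best) + len(rest)))
            while len(best) > cap:             # cap best at sqrt(|labelled|)
                rest.append(best.pop())        # evict worst to rest
        return best, rest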

Run

dotlearn --demo acquire                              # uses bundled sample (auto93.csv)
dotlearn --demo acquire data/regression/auto93.csv   # explicit data file

Output

wins  lo=0.075  med=0.535  sd=0.266

=== TREE on 54 labelled rows ===
rule                                        d2h     n     Lbs-     Acc+     Mpg+
ROOT                                      0.348    54  2285.63    16.49    29.63
|   Clndrs >  4                           0.725     6  3348.50    13.47    18.33
|   |   Clndrs >  6                       0.842     3  3818.00    13.07    13.33
|   |   Clndrs <= 6                       0.608     3  2879.00    13.87    23.33
|   Clndrs <= 4                           0.300    48  2152.77    16.87    31.04
|   |   HpX >  70                         0.378    23  2314.83    15.84    28.70
|   |   |   origin != 1                   0.413    15  2295.00    15.59    27.33
|   |   |   |   Volume >  108             0.465     7  2455.29    15.97    24.29
|   |   |   |   Volume <= 108             0.367     8  2154.75    15.26    30.00
|   |   |   origin == 1                   0.312     8  2352.00    16.31    31.25
|   |   HpX <= 70                         0.229    25  2003.68    17.81    33.20
|   |   |   Volume >  91                  0.283    12  2120.67    17.71    30.83
|   |   |   Volume <= 91                  0.180    13  1895.69    17.91    35.38

=== RESULT ===
labelled  : 54   (start=4, budget=50)
test rows : 199
top 5 guess actual-disty=0.158  win=100/100

Reading the tree

Each row of the tree shows: rule, d2h (mean distance-to-heaven for rows in the node), n (row count), then the per-y-goal column means. The deepest leaves with low d2h represent the "best" corner of the design space: light cars (Lbs ~1900), high acceleration (Acc 17-18), high MPG (35).
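
d2h is the distance-to-heaven that disty computes over the y goals; a sketch of the usual arithmetic, assuming min-max normalization from column stats (dotlearn's exact formula may differ):

    def d2h(row, goals, p=2):
        # goals: (index, lo, hi, heaven) per y column, where heaven is 0.0
        # for a "-" goal (Lbs-) and 1.0 for a "+" goal (Acc+, Mpg+).
        d = 0
        for i, lo, hi, heaven in goals:
            x = (row[i] - lo) / (hi - lo + 1e-32)  # normalize to 0..1
            d += abs(x - heaven) ** p
        return (d / len(goals)) ** (1 / p)         # p=2: Euclidean flavour

The Minkowski exponent here is the same p as in the config keys below.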

Cross-seed variance

30 seeds on auto93 (default config: start=4, budget=50, check=5):

all 30 seeds: win=100/100
pick disty: range [0.075, 0.160], all inside the lo + 0.35*sd cushion (0.075 + 0.35*0.266 ≈ 0.168)

Roughly: labelling 14% of the data (54 of 398 rows) plus 5 oracle calls at the end → consistently top-tier picks.

Config keys

key      default  meaning
seed     1        random seed (Fisher-Yates shuffle)
p        2        Minkowski exponent (1=Manhattan, 2=Euclidean)
few      128      cap on train pool size
start    4        warm-start labels
budget   50       acquire loop iterations
check    5        top-N predictions to evaluate at end
leaf     3        min rows per tree leaf
maxd     8        max tree depth