Tutorial

Zero to per-column running stats on a real CSV. Real prompts, real output.

Install gawk

$ brew install gawk        # macOS
$ sudo apt install gawk    # debian/ubuntu

Install the dotcols binary

$ curl -sL https://raw.githubusercontent.com/timm/awk/master/dotcols/dotcols -o dotcols
$ chmod +x dotcols
$ ./dotcols --demo stats | head -3
column                      n          mid       spread
AGE                       303       54.366        9.082
sex                       303         male        0.624

One file. It bundles dot's runtime, Num/Sym/Data, and the stats demo with its sample CSV.

Pull example data

$ ./dotcols --get-data
fetching 10 classify ->  data/classify/
  iris
  wine
  ...
fetching 10 regression -> data/regression/
  ...
done. 30 files in ./data/

Try Num + Sym in a short script

$ cat > tour1.awk <<'EOF'
BEGIN { N = new("num"); S = new("sym")
        printf "%-12s %5s %12s %12s\n", "column", "n", "mid", "spread" }
      { add(N, $1, 1); add(S, $2, 1) }
END   { printf "%-12s %5d %12.3f %12.3f\n", "AGE",   .N.n, mid(N), var(N)
        printf "%-12s %5d %12s %12.3f\n",   "color", .S.n, mid(S), var(S) }
EOF
$ printf "10 red\n20 blue\n30 blue\n40 red\n50 blue\n" | ./dotcols tour1.awk
column           n          mid       spread
AGE              5       30.000       15.811
color            5         blue        0.673

The same add, mid, and var calls dispatch to num_* or sym_* via the .it.is tag that new() sets. Here var() is the spread column: standard deviation for a Num, entropy for a Sym. The output uses the same column / n / mid / spread layout as the bundled stats demo.
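To see what that dispatch amounts to, here is a minimal sketch in plain POSIX awk, with no dot syntax. It is not dotcols' source: the arrays (Is, N, Mu, M2, Has, Most, Mode) and the helper names (tour1_plain, spread, num_sd, sym_ent) are made up for illustration. It assumes Num keeps a Welford-style running mean and sample standard deviation, and that Sym entropy is measured in nats, which is what the numbers above are consistent with.

```shell
tour1_plain() {
  awk '
    # new(kind) hands back an integer id and records its tag in Is[id].
    function new(kind)  { Is[++Ids] = kind; return Ids }

    # Generic verbs read the tag and forward to the num or sym version.
    function add(i, x)  { if (Is[i] == "num") num_add(i, x); else sym_add(i, x) }
    function mid(i)     { return Is[i] == "num" ? Mu[i] : Mode[i] }
    function spread(i)  { return Is[i] == "num" ? num_sd(i) : sym_ent(i) }

    function num_add(i, x,    d) {          # Welford update, O(1) per value
      N[i]++
      d = x - Mu[i]
      Mu[i] += d / N[i]
      M2[i] += d * (x - Mu[i]) }
    function num_sd(i)  { return N[i] < 2 ? 0 : sqrt(M2[i] / (N[i] - 1)) }

    function sym_add(i, x) {                # count symbols, track the mode
      N[i]++
      if (++Has[i, x] > Most[i]) { Most[i] = Has[i, x]; Mode[i] = x } }
    function sym_ent(i,    k, part, p, e) { # entropy in nats
      for (k in Has) {
        split(k, part, SUBSEP)
        if (part[1] == i) { p = Has[k] / N[i]; e -= p * log(p) } }
      return e }

    BEGIN { A = new("num"); C = new("sym") }
          { add(A, $1); add(C, $2) }
    END   { printf "%-12s %5d %12.3f %12.3f\n", "AGE",   N[A], mid(A), spread(A)
            printf "%-12s %5d %12s %12.3f\n",   "color", N[C], mid(C), spread(C) }
  '
}

printf "10 red\n20 blue\n30 blue\n40 red\n50 blue\n" | tour1_plain
```

On the five rows above this reports 30.000 and 15.811 for AGE, and blue and 0.673 for color, the same figures tour1.awk prints.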

Run the bundled stats demo on real data

$ ./dotcols --demo stats data/classify/iris.csv
column                      n          mid       spread
SEPALLENGTH               150        5.843        0.828
SEPALWIDTH                150        3.054        0.434
PETALLENGTH               150        3.759        1.764
PETALWIDTH                150        1.199        0.763
class                     150   Iris-setosa        1.099

UPPER columns get Num (mean + stdev); lowercase columns get Sym (mode + entropy). One pass, O(1) work per cell. Same shape as the Num + Sym step above, just done for every column at once.
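The per-column version can be sketched in plain POSIX awk as well. This is an illustration, not the bundled stats.awk: stats_plain and its arrays are hypothetical names, and it assumes the case rule only looks at the first letter of the header name. Each cell triggers one constant-time update, so the file is read once.

```shell
stats_plain() {
  awk -F, '
    function sd(c)  { return N[c] < 2 ? 0 : sqrt(M2[c] / (N[c] - 1)) }
    function ent(c,    k, part, p, e) {     # entropy (nats) of column c
      for (k in Has) {
        split(k, part, SUBSEP)
        if (part[1] == c) { p = Has[k] / N[c]; e -= p * log(p) } }
      return e }

    NR == 1 {                               # header row: choose a summary per column
      Nc = NF
      for (c = 1; c <= Nc; c++) { Name[c] = $c; IsNum[c] = ($c ~ /^[A-Z]/) }
      next }

    { for (c = 1; c <= Nc; c++) {           # one update per cell
        N[c]++
        if (IsNum[c]) { d = $c - Mu[c]; Mu[c] += d / N[c]; M2[c] += d * ($c - Mu[c]) }
        else if (++Has[c, $c] > Most[c]) { Most[c] = Has[c, $c]; Mode[c] = $c } } }

    END { for (c = 1; c <= Nc; c++) {
            if (IsNum[c]) { printf "%-12s %5d %12.3f %12.3f\n", Name[c], N[c], Mu[c], sd(c) }
            else          { printf "%-12s %5d %12s %12.3f\n",   Name[c], N[c], Mode[c], ent(c) } } }
  ' "$@"
}

# Same five rows as the Num + Sym step, now as a CSV with a header.
printf "AGE,color\n10,red\n20,blue\n30,blue\n40,red\n50,blue\n" | stats_plain
```

Fed the five rows from the earlier step, it gives the same mid and spread values (30.000 / 15.811 and blue / 0.673), with the column kinds picked from the header instead of being hard-coded.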

Use Data directly: ingest + inspect

$ cat > tour2.awk <<'EOF'
BEGIN { D = new("data"); FS = " *, *" }
NR==1 { data_head(D, $0); next }
      { data_read(D, 1) }
END   { printf "rows=%d  cols=%d  ykind=%s\n\n", .D.nrows, .D.nc, .D.ykind
        printf "%-3s %-14s %-5s %-3s\n", "i", "name", "kind", "y?"
        for (i=1; i<=.D.nc; i++)
          printf "%-3d %-14s %-5s %-3s\n",
                 i, .D.hdr[i],
                 (.D.nump[i] ? "num" : "sym"),
                 ((i in .D.y) ? "y" : "-") }
EOF
$ ./dotcols tour2.awk data/regression/housing.csv
rows=506  cols=14  ykind=num

i   name           kind  y?
1   CRIM           num   -
2   ZN             num   -
3   INDUS          num   -
4   CHAS           num   -
5   NOX            num   -
6   RM             num   -
7   AGE            num   -
8   DIS            num   -
9   RAD            num   -
10  TAX            num   -
11  PTRATIO        num   -
12  B              num   -
13  LSTAT          num   -
14  MEDV+          num   y

data_head parses the header, sigils included (+ marks a y-goal to maximize); each subsequent data_read feeds one row's cells into the matching column objects. After ingest, .D.cols[i] is a Num or Sym you can call mid()/var() on directly.
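As a rough picture of the header step, the sketch below classifies each header name in plain POSIX awk. It is not data_head itself: head_plain is a hypothetical helper, the kind test assumes the first-letter case rule from the stats demo, and the only sigil it knows is the trailing + described above. The sample header mixes names that appear earlier in this tutorial.

```shell
head_plain() {
  awk '
    BEGIN { FS = " *, *" }                  # same field separator as tour2.awk
    NR == 1 {
      for (i = 1; i <= NF; i++) {
        kind = ($i ~ /^[A-Z]/) ? "num" : "sym"                 # assumed case rule
        goal = (substr($i, length($i)) == "+") ? "y" : "-"     # trailing + marks a y-goal
        printf "%-3d %-14s %-5s %s\n", i, $i, kind, goal }
      exit }                                # only the header row is needed
  '
}

echo "AGE,sex,MEDV+" | head_plain
```

For that header it marks AGE as num, sex as sym, and MEDV+ as a num y-goal, matching the kind and y? columns tour2.awk prints.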

Where to go next

Example — stats.awk walkthrough with code.
Manual — Num, Sym, Data API; dispatch; conventions.
Tests — annotated test suite, doubles as 17 mini usage snippets.
dotlearn — ML on top: trees, naive Bayes, active learning.