Per-column running stats on any CSV. Mean/stdev for numeric, mode/entropy for symbolic. Same add() call, polymorphic dispatch.
FUNCTAB, SYMTAB: (name in FUNCTAB) checks that a function exists; rogues() scans SYMTAB to detect leaks.
Indirect calls: fn = "num_add"; @fn(it,x) calls num_add(it,x).
Process substitution: in <(cmd), cmd's output appears as a temporary file path.
No local keyword: locals are declared as extra args after the real ones, separated by extra spaces.

Built bottom-up, smallest first. The toolchain (3-line shell + 6-line preprocessor) and the runtime (3 functions, 9 LOC) are the entire foundation. On top of that sit two library layers, a generic helper library and a tiny type library, and then the main program. gawk concatenates everything via separate -f flags (see Run); no @include needed.
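The indirect-call and extra-arg-local idioms can be tried on their own before reading the rest. The sketch below is illustrative only: apply(), twice(), and square() are hypothetical names introduced here, not part of dot.

```awk
# Hypothetical helpers, only here to have something to call by name.
function twice(x)  { return 2 * x }
function square(x) { return x * x }

# apply(name, x): call the function called `name`, if one exists.
# `fn` is a local: it sits after the real args, set off by extra spaces.
function apply(name, x,    fn) {
  fn = name
  if (!(fn in FUNCTAB)) return "no such function: " name
  return @fn(x)
}

BEGIN {
  print apply("twice", 7)     # 14
  print apply("square", 7)    # 49
  print apply("cube", 7)      # no such function: cube
}
```

Checking FUNCTAB first matters: an indirect call to a name that does not exist is a fatal error in gawk, which is why new() guards its optional t_init() call the same way.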
dot + prep.awk (~9 LOC total)
The shell wrapper is three lines. It locates itself, then runs prep.awk on the source file you hand it; the rewritten code goes to stdout, ready for gawk -f <(dot foo.awk).
#!/usr/bin/env bash
DOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
gawk -f "$DOT_DIR/prep.awk" "$1"
The preprocessor itself is one rule with two regex sweeps per line. First sweep: any .field directly after a value-character (letter, digit, _, ], )) becomes ["field"] — a struct-field access. Second sweep: any leftover .name (bare, with no value-char before it) becomes HEAP[name] — an object reference.
# prep.awk -- two-pass regex rewrite. ~6 LOC.
# .f after value-char -> ["f"] (field access)
# .x bare -> HEAP[x] (object reference)
{ s = $0
  while (match(s, /([A-Za-z0-9_\]\)])\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s,1,RSTART-1) m[1] "[\"" m[2] "\"]" substr(s,RSTART+RLENGTH)
  while (match(s, /\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s,1,RSTART-1) "HEAP[" m[1] "]" substr(s,RSTART+RLENGTH)
  print s }
That is the whole compile pipeline. It uses a match()+substr() loop instead of gensub(...,"g") because the loop dodges a gawk 5.4.0 bug where the second match's captures come back empty.
What that does, in concrete terms — each input line gets rewritten line-for-line, no insertions, no deletions:
# source line -> after prep.awk
.it.n++ -> HEAP[it]["n"]++
.it.mu += d / .it.n -> HEAP[it]["mu"] += d / HEAP[it]["n"]
add(.col[i], $i, 1) -> add(HEAP[col][i], $i, 1)
.d.rows[r][i] -> HEAP[d]["rows"][r][i]
NAME[i] = $i -> NAME[i] = $i (no dots, unchanged)
Two cases per line. .x after a value-char (t, ], ), etc.) is a field access → ["x"]. .x bare is an object reference → HEAP[x]. Plain awk code with no dots passes through untouched, so most existing awk programs survive the preprocessor unchanged.
If you need a literal dot (for instance, joining a basename and an extension), split it across strings so the dot has a quote on each side. Both sweeps only fire when a letter or _ comes right after the dot, so a dot followed by a quote is skipped:
###############################################################
# DO THIS #
###############################################################
# source line -> after prep.awk
"fred" "." "csv" -> "fred" "." "csv" (safe: dot stands alone)
name "." ext -> name "." ext (safe)
###############################################################
# DON'T DO THIS #
###############################################################
"fred.csv" -> "fred["csv"]" (BROKEN: dot is inside a string,
d is a value-char, .csv matches)
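Rule of thumb: a literal dot is safe only when the character right after it is not a letter or _. What sits on its left (a letter, digit, _, ], or ), or anything else) only decides which rewrite fires: ["x"] after a value-char, HEAP[x] otherwise. Decimal numbers such as 3.14 are therefore safe, while ".csv" is not. The cleanest fix is the explicit-dot pattern above: "." as its own string, concatenated.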
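The literal-dot behaviour can be checked directly. The sketch below wraps the same two sweeps from prep.awk in a function and feeds it a few dot-bearing lines; rewrite() is a name introduced here for illustration. Note the last case: a dot that follows a quote is still rewritten when a letter comes right after it.

```awk
# rewrite(s): the two prep.awk sweeps, wrapped as a function for testing.
function rewrite(s,    m) {
  # sweep 1: .f after a value-char -> ["f"]
  while (match(s, /([A-Za-z0-9_\]\)])\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s, 1, RSTART-1) m[1] "[\"" m[2] "\"]" substr(s, RSTART+RLENGTH)
  # sweep 2: any leftover .x -> HEAP[x]
  while (match(s, /\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s, 1, RSTART-1) "HEAP[" m[1] "]" substr(s, RSTART+RLENGTH)
  return s
}

BEGIN {
  print rewrite("name \".\" ext")          # name "." ext         (unchanged)
  print rewrite("x = 3.14")                # x = 3.14             (unchanged)
  print rewrite("f = \"out\" \".csv\"")    # f = "out" "HEAP[csv]" (sweep 2 fires)
}
```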
dot.awk (3 functions, 9 LOC)
Everything sits on one global, HEAP. Each object is an integer id; HEAP[id] is its struct. Three functions: a constructor, an array-init helper, a slot zapper. That is the entire runtime.
# dot.awk -- the entire object runtime. Three functions.
# new(t): allocate a fresh object id, tag it, run optional t_init().
function new(t,    it, fn) {
  it = ++NID; .it.is = t                    # .it.is -> HEAP[it]["is"]
  fn = t "_init"
  return (fn in FUNCTAB) ? @fn(it) : it }   # call type_init if it exists
# arr(x): force x to be an array. Safe to "for k in x" when later empty.
function arr(x) { x[""] = 0; delete x[""] }
# zap(i): clear one HEAP slot. Use when caller is done with object i.
function zap(i) { delete HEAP[i] }
Fewest lines to make dot work: 3-line shell + 6-line prep.awk + 4-line new() = 13 LOC. Add arr and zap for the typical demos and the runtime is still under 10 lines.
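To see the runtime without the preprocessor in the way, here is a sketch of new() and zap() as they would read after prep.awk has expanded the dots, exercised with one made-up type. pair_init() and the "pair" and "blob" tags are assumptions for this demo; dot itself ships no such types.

```awk
# Plain-gawk (post-prep) form of the runtime: .it.is is now HEAP[it]["is"].
function new(t,    it, fn) {
  it = ++NID; HEAP[it]["is"] = t
  fn = t "_init"
  return (fn in FUNCTAB) ? @fn(it) : it
}
function zap(i) { delete HEAP[i] }

# Hypothetical initialiser: found by name because it is called "pair" + "_init".
function pair_init(it) { HEAP[it]["left"] = 0; HEAP[it]["right"] = 0; return it }

function demo(    a, b, live) {
  a = new("pair")                            # pair_init exists, so new() runs it
  b = new("blob")                            # no blob_init: tagged, returned as-is
  print a, HEAP[a]["is"], length(HEAP[a])    # 1 pair 3
  print b, HEAP[b]["is"], length(HEAP[b])    # 2 blob 1
  zap(a)
  live = (a in HEAP) ? "live" : "gone"
  print live                                 # gone
}
BEGIN { demo() }
```

Ids are never reused: NID only counts up, so a zapped id stays free for the rest of the run.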
dotlib.awk
Three functions, two jobs: a leak check, and one polymorphic printer o() that dispatches to a workhorse _oo for arrays. Scalars and arrays go through the same call.
# dotlib.awk -- generic helpers (no HEAP knowledge).
# rogues(): warn on lowercase globals at end of run. Discipline check.
function rogues(    i) {
  for (i in SYMTAB) if (i ~ /^[a-z]/) print "leak:", i > "/dev/stderr" }
# o(x): print one thing.
# array, 1 in x -> "[..]" list, numeric-sorted, no key prefix
# array, no 1 -> "{..}" dict, string-sorted, "k: " prefix
# number-shaped -> %d if whole, else %G
# else -> %s
# Custom brackets: call _oo directly, e.g. _oo(a,"(",")","@ind_num_asc",0).
function o(x) {
  if (isarray(x)) {
    if (1 in x) _oo(x, "[", "]", "@ind_num_asc", 0)
    else        _oo(x, "{", "}", "@ind_str_asc", 1)
  } else if (x ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) {
    if (x == int(x)) printf "%d", x
    else             printf "%G", x }
  else printf "%s", x }
# _oo: collect+sort keys, walk them, print one entry each. Recurses via o().
function _oo(a, lhs, rhs, how, withkey,    n, i, k, sep, sorted) {
  printf "%s", lhs
  n = asorti(a, sorted, how)
  sep = ""
  for (i = 1; i <= n; i++) {
    k = sorted[i]
    printf "%s", sep
    if (withkey) printf "%s: ", k
    o(a[k])
    sep = ", " }
  printf "%s", rhs }
One entry point. o() dispatches on shape: array vs scalar, list-shaped vs dict-shaped, integer-valued float vs other number, else string. Recursion is automatic — _oo calls o() on each element, so nested mixes (list-of-dicts, dict-of-lists, anything) just work.
o(5.0) # -> 5
o(1e-7) # -> 1E-07
o("hi") # -> hi
o(a) # a[1..n] # -> [10, 20, 3.14]
o(b) # b["alpha"]=1, ... # -> {alpha: 1, beta: 2.5}
o(c) # nested mix # -> {name: tim, xs: [1, 2]}
_oo(a, "(", ")", "@ind_num_asc", 0) # -> (10, 20, 3.14) tuple style
The "@ind_num_asc" string names a built-in asorti() comparator (numeric ascending on the keys). The @ prefix is required — without it gawk treats the string as a user-defined comparison function name and aborts.
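The difference between the two built-in orderings that o() picks from is easiest to see on keys where numeric and string order disagree. A minimal sketch, with keys() as a helper name introduced here:

```awk
# keys(a, how): return a's indices, space-separated, in the order `how` asks for.
function keys(a, how,    n, i, sorted, out) {
  n = asorti(a, sorted, how)          # sorted[1..n] holds the indices of a
  for (i = 1; i <= n; i++) out = out (i > 1 ? " " : "") sorted[i]
  return out
}

BEGIN {
  A[1] = "x"; A[2] = "y"; A[10] = "z"
  print keys(A, "@ind_num_asc")       # 1 2 10
  print keys(A, "@ind_str_asc")       # 1 10 2
}
```

This is why list-shaped arrays go through "@ind_num_asc": with string order, element 10 would print before element 2.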
numsym-mini.awk (~20 LOC)
Two types, Num and Sym, sharing three polymorphic ops: add, mid, var. Dispatch is one line each: build the function name from .it.is, then indirect-call it via @fn. Num uses Welford's online mean+variance; Sym keeps a counts table.
# numsym-mini.awk -- Num: mean+stdev (Welford). Sym: mode+entropy.
# --- polymorphic dispatch: pick fn by type tag, indirect-call it ----
function add(k, x, fn) { fn = .k.is "_add"; return @fn(k, x) }
function mid(it, fn) { fn = .it.is "_mid"; return @fn(it) }
function var(it, fn) { fn = .it.is "_var"; return @fn(it) }
# --- NUM: running mean + stdev via Welford --------------------------
function num_add(it, x,    d) {
  .it.n++
  d = x - .it.mu                  # delta from old mean
  .it.mu += d / .it.n             # update mean
  .it.m2 += d * (x - .it.mu) }    # delta * (x - new mean)
function num_mid(it) { return .it.mu }
function num_var(it) { return .it.n < 2 ? 0 : sqrt(.it.m2 / (.it.n - 1)) }
# --- SYM: counts table -> mode + entropy ----------------------------
function sym_init(it) { arr(.it.has); return it } # called by new("sym")
function sym_add(it, x) { .it.n++; .it.has[x]++ }
function sym_mid(it,    k, b, bv) {    # most-common key
  bv = -1
  for (k in .it.has) if (.it.has[k] > bv) { bv = .it.has[k]; b = k }
  return b }
function sym_var(it,    k, p, e) {     # Shannon entropy
  for (k in .it.has) { p = .it.has[k] / .it.n; e -= p * log(p) }
  return e }
Three sugar tricks on display: .it.x → HEAP[it]["x"] (struct sugar), @fn(it) indirect call (poor-man's vtable), and new()/arr() from layer 1 for safe construction. The full numsym.awk adds Bayes likelihoods, weighting, and normalisation; the slice above is enough for this example.
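As a sanity check on the Welford update used by num_add, the sketch below runs the same three-line recurrence over a plain array and compares it with the textbook two-pass sample standard deviation. It is written in plain gawk with no HEAP; welford() and twopass() are names introduced here, and the eight input values are arbitrary.

```awk
# welford(xs, n): one pass, same update as num_add (n, mu, m2).
function welford(xs, n,    i, d, mu, m2) {
  for (i = 1; i <= n; i++) {
    d   = xs[i] - mu              # delta from old mean
    mu += d / i                   # update mean
    m2 += d * (xs[i] - mu)        # delta * (x - new mean)
  }
  return (n < 2) ? 0 : sqrt(m2 / (n - 1))
}

# twopass(xs, n): mean first, then squared deviations.
function twopass(xs, n,    i, mu, ss) {
  for (i = 1; i <= n; i++) mu += xs[i] / n
  for (i = 1; i <= n; i++) ss += (xs[i] - mu) ^ 2
  return (n < 2) ? 0 : sqrt(ss / (n - 1))
}

BEGIN {
  N = split("2 4 4 4 5 5 7 9", XS, " ")
  printf "%.4f %.4f\n", welford(XS, N), twopass(XS, N)   # 2.1381 2.1381
}
```

The one-pass form needs only three numbers per column, which is what lets stats.awk summarise a CSV without storing any rows.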
stats.awk
Reads any CSV. UPPERCASE column names → Num, lowercase → Sym. One pass, prints summary at end. All it does is call into the two layers above.
BEGIN { FS = " *, *" } # CSV: comma + optional spaces
NR == 1 { header(); next } # first row = column names
{ ingest() } # every other row = data
# header(): one Num or Sym per column, decided by first letter's case.
function header(    i) {
  for (i = 1; i <= NF; i++) {
    NAME[i] = $i
    COL[i] = new($i ~ /^[A-Z]/ ? "num" : "sym") } }   # new() from dot.awk
# ingest(): polymorphic add() picks num_add or sym_add per column.
function ingest(    i) {
  for (i = 1; i <= NF; i++) add(COL[i], $i, 1) }
END { report(); rogues() } # rogues() from dotlib.awk
function report(    i, c) {
  printf "%-22s %6s %12s %12s\n", "column", "n", "mid", "spread"
  for (i = 1; i <= length(NAME); i++) {
    c = COL[i]
    if (.c.is == "num") printf "%-22s %6d %12.3f %12.3f\n", NAME[i], .c.n, mid(c), var(c)
    else                printf "%-22s %6d %12s %12.3f\n", NAME[i], .c.n, mid(c), var(c) } }
Heart-disease CSV. Mix of numeric (UPPER) and symbolic (lower) columns:
AGE,sex,cp,TRESTBPS,CHOL,fbs,restecg,THALACH,exang,OLDPEAK,slope,CA,thal,num!
63,male,typ_angina,145,233,t,left_vent_hyper,150,no,2.3,down,0,fixed_defect,<50
67,male,asympt,160,286,f,left_vent_hyper,108,yes,1.5,flat,3,normal,>50_1
The dotcols binary bundles dot's runtime + helpers + Num/Sym/Data, so you only pass the program file:
dotcols stats.awk data/classify/heart.c.csv
Or use the bundled demo (no source file needed — reads demos/stats/sample.csv by default, or whatever DATA you pass):
dotcols --demo stats data/classify/heart.c.csv
Or pipe data via stdin (use - to override the default sample):
cat data/classify/heart.c.csv | dotcols --demo stats -
To inspect what the preprocessor produced for any file:
dotcols -c stats.awk # print rewritten source to stdout
column                      n          mid       spread
AGE                       303       54.366        9.082
sex                       303         male        0.624
cp                        303       asympt        1.206
TRESTBPS                  303      131.624       17.538
CHOL                      303      246.264       51.831
fbs                       303            f        0.420
restecg                   303       normal        0.754
THALACH                   303      149.647       22.905
exang                     303           no        0.632
OLDPEAK                   303        1.040        1.161
slope                     303           up        0.897
CA                        298        0.674        0.938
thal                      301       normal        0.864
num!                      303          <50        0.689
mid = mean (num) or mode (sym). spread = stdev (num) or entropy (sym).
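To make the spread column concrete for symbolic columns, here is a small sketch of the same entropy sum that sym_var computes, over a plain counts table. awk's log() is the natural logarithm, so the value comes out in nats; dividing by log(2) converts it to bits. entropy() is a name introduced here, and the counts are made up rather than taken from heart.c.csv.

```awk
# entropy(counts): -sum(p * log(p)) over a table of symbol counts.
function entropy(counts,    k, n, p, e) {
  for (k in counts) n += counts[k]                        # total observations
  for (k in counts) { p = counts[k] / n; e -= p * log(p) }
  return e + 0
}

BEGIN {
  C["a"] = 3; C["b"] = 1                                  # made-up counts
  E = entropy(C)
  printf "%.3f nats, %.3f bits\n", E, E / log(2)          # 0.562 nats, 0.811 bits
}
```

A column with a single repeated symbol scores 0; the more evenly the symbols are spread, the higher the number.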
header() reads first row, calls new("num") or new("sym") per column based on first-letter case. Each call returns a fresh ID and seeds HEAP[id].
ingest() calls add(COL[i], $i, 1). Same call dispatches to num_add or sym_add via .it.is.
num_add updates Welford's running mean+variance. sym_add increments a counts table.
report() calls mid() and var(), both polymorphic. num_mid returns mean; sym_mid returns mode. num_var returns stdev; sym_var returns entropy.
rogues() at end reports any lowercase global leaked. Should print nothing.
dotcols --demos # list bundled demos
dotcols --show # dump bundled runtime libs (post-prep)
dotcols --get-data # fetch 30 curated CSVs into ./data/
To use these types for actual machine learning — decision trees, naive Bayes — see dotlearn.