Example: stats.awk

Per-column running stats on any CSV. Mean/stdev for numeric, mode/entropy for symbolic. Same add() call, polymorphic dispatch.

Quick glossary

gawk
GNU awk. Adds indirect calls, multi-dim arrays, FUNCTAB, SYMTAB.
FUNCTAB
gawk built-in: array of all defined function names. Tested with (name in FUNCTAB).
SYMTAB
gawk built-in: array of all global variable names. Used by rogues() to detect leaks.
@fn(...)
indirect function call. fn = "num_add"; @fn(it,x) calls num_add(it,x).
<(cmd)
process substitution (bash). cmd's output appears as a temporary file path.
arg slot
awk has no local keyword. Locals declared as extra args after the real ones, separated by extra spaces.

Built bottom-up, smallest first. The toolchain (3-line shell + 6-line preprocessor) and the runtime (3 functions, 9 LOC) are the entire foundation. After that come three layers of program code: a generic helper library, a tiny type library, and the main program. gawk concatenates everything via separate -f flags (see Run); no @include needed.

0. The toolchain — dot + prep.awk (~9 LOC total)

The shell wrapper is three lines. It locates itself, then runs prep.awk on the source file you hand it; the rewritten code goes to stdout, ready for gawk -f <(dot foo.awk).

#!/usr/bin/env bash
DOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
gawk -f "$DOT_DIR/prep.awk" "$1"

The preprocessor itself is one rule with two regex sweeps per line. First sweep: any .field directly after a value-character (letter, digit, _, ], )) becomes ["field"] — a struct-field access. Second sweep: any leftover .name (bare, with no value-char before it) becomes HEAP[name] — an object reference.

# prep.awk -- two-pass regex rewrite. ~6 LOC.
# .f after value-char  ->  ["f"]      (field access)
# .x bare              ->  HEAP[x]    (object reference)
{ s = $0
  while (match(s, /([A-Za-z0-9_\]\)])\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s,1,RSTART-1) m[1] "[\"" m[2] "\"]" substr(s,RSTART+RLENGTH)
  while (match(s, /\.([A-Za-z_][A-Za-z_0-9]*)/, m))
    s = substr(s,1,RSTART-1) "HEAP[" m[1] "]" substr(s,RSTART+RLENGTH)
  print s }

That is the whole compile pipeline. The match()+substr() loop, rather than a single gensub(...,"g"), dodges a gawk 5.4.0 bug where the second match's captures come back empty.

What that does, in concrete terms — each input line gets rewritten line-for-line, no insertions, no deletions:

# source line                  ->  after prep.awk
.it.n++                        ->  HEAP[it]["n"]++
.it.mu += d / .it.n            ->  HEAP[it]["mu"] += d / HEAP[it]["n"]
add(.col[i], $i, 1)            ->  add(HEAP[col][i], $i, 1)
.d.rows[r][i]                  ->  HEAP[d]["rows"][r][i]
NAME[i] = $i                   ->  NAME[i] = $i        (no dots, unchanged)

Two cases per line. .x after a value-char (t, ], ), etc.) is a field access → ["x"]. .x bare is an object reference → HEAP[x]. Plain awk code with no dots passes through untouched, so most existing awk programs survive the preprocessor unchanged.

If you need a literal dot — for instance, joining a basename and an extension — split it across strings so the dot has a quote on each side. Quotes are not value-chars, so the regex skips the dot:

###############################################################
#   DO  THIS                                                  #
###############################################################
# source line                  ->  after prep.awk
"fred" "." "csv"               ->  "fred" "." "csv"        (safe: dot stands alone)
name "." ext                   ->  name "." ext            (safe)

###############################################################
#   DON'T  DO  THIS                                           #
###############################################################
"fred.csv"                     ->  "fred[\"csv\"]"         (BROKEN: dot is inside a string,
                                                             d is a value-char, .csv matches)

Rule of thumb: a dot literal is safe only when nothing on its left is a letter, digit, _, ], or ). The cleanest fix is the explicit-dot pattern above — "." as its own string, concatenated.

1. The runtime — dot.awk (3 functions, 9 LOC)

Everything sits on one global, HEAP. Each object is an integer id; HEAP[id] is its struct. Three functions — a constructor, an array-init helper, a slot zapper. That is the entire runtime.

# dot.awk -- the entire object runtime. Three functions.

# new(t): allocate a fresh object id, tag it, run optional t_init().
function new(t,    it, fn) {
  it = ++NID;  .it.is = t                  # .it.is  ->  HEAP[it]["is"]
  fn = t "_init"
  return (fn in FUNCTAB) ? @fn(it) : it }  # call type_init if it exists

# arr(x): force x to be an array, so "for (k in x)" is safe even when empty.
function arr(x) { x[""] = 0; delete x[""] }

# zap(i): clear one HEAP slot. Use when caller is done with object i.
function zap(i) { delete HEAP[i] }

The minimum to make dot work: the 3-line shell + the 6-line prep.awk + the 4-line new() = 13 LOC. Add arr and zap for the typical demos and the runtime is still under 10 lines.

2. Generic helpers — dotlib.awk

Two jobs, three functions. A leak check (rogues), and one polymorphic printer o() that dispatches to a workhorse _oo for arrays. Scalars and arrays go through the same call.

# dotlib.awk -- generic helpers (no HEAP knowledge).

# rogues(): warn on lowercase globals at end of run. Discipline check.
function rogues(    i) {
  for (i in SYMTAB) if (i ~ /^[a-z]/) print "leak:", i > "/dev/stderr" }

# o(x): print one thing.
#   array, 1 in x   -> "[..]" list,  numeric-sorted, no key prefix
#   array, no 1     -> "{..}" dict,  string-sorted,  "k: " prefix
#   number-shaped   -> %d if whole, else %G
#   else            -> %s
# Custom brackets: call _oo directly, e.g. _oo(a,"(",")","@ind_num_asc",0).
function o(x) {
  if (isarray(x)) {
    if (1 in x) _oo(x, "[", "]", "@ind_num_asc", 0)
    else        _oo(x, "{", "}", "@ind_str_asc", 1)
  } else if (x ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) {
    if (x == int(x)) printf "%d", x
    else             printf "%G", x }
  else printf "%s", x }

# _oo: collect+sort keys, walk them, print one entry each. Recurses via o().
function _oo(a, lhs, rhs, how, withkey,    n, i, k, sep, sorted) {
  printf "%s", lhs
  n = asorti(a, sorted, how)
  sep = ""
  for (i = 1; i <= n; i++) {
    k = sorted[i]
    printf "%s", sep
    if (withkey) printf "%s: ", k
    o(a[k])
    sep = ", " }
  printf "%s", rhs }

One entry point. o() dispatches on shape: array vs scalar, list-shaped vs dict-shaped, integer-valued float vs other number, else string. Recursion is automatic — _oo calls o() on each element, so nested mixes (list-of-dicts, dict-of-lists, anything) just work.

o(5.0)                              # -> 5
o(1e-7)                             # -> 1E-07
o("hi")                             # -> hi
o(a)        # a[1..n]               # -> [10, 20, 3.14]
o(b)        # b["alpha"]=1, ...     # -> {alpha: 1, beta: 2.5}
o(c)        # nested mix            # -> {name: tim, xs: [1, 2]}
_oo(a, "(", ")", "@ind_num_asc", 0) # -> (10, 20, 3.14)   tuple style

The "@ind_num_asc" string names a built-in asorti() comparator (numeric ascending on the keys). The @ prefix is required — without it gawk treats the string as a user-defined comparison function name and aborts.

3. The type library — numsym-mini.awk (~20 LOC)

Two types, Num and Sym, sharing three polymorphic ops: add, mid, var. Dispatch is one line each — build the function name from .it.is, then indirect-call it via @fn. Num uses Welford's online mean+variance; Sym keeps a counts table.

# numsym-mini.awk -- Num: mean+stdev (Welford). Sym: mode+entropy.

# --- polymorphic dispatch: pick fn by type tag, indirect-call it ----
function add(k, x,    fn) { fn = .k.is  "_add"; return @fn(k, x) }
function mid(it,      fn) { fn = .it.is "_mid"; return @fn(it) }
function var(it,      fn) { fn = .it.is "_var"; return @fn(it) }

# --- NUM: running mean + stdev via Welford --------------------------
function num_add(it, x,    d) {
  .it.n++
   d      = x - .it.mu                           # delta from old mean
  .it.mu += d / .it.n                            # update mean
  .it.m2 += d * (x - .it.mu) }                   # delta * (x - new mean)
function num_mid(it) { return .it.mu }
function num_var(it) { return .it.n < 2 ? 0 : sqrt(.it.m2 / (.it.n - 1)) }

# --- SYM: counts table -> mode + entropy ----------------------------
function sym_init(it)   { arr(.it.has); return it }   # called by new("sym")
function sym_add(it, x) { .it.n++; .it.has[x]++ }
function sym_mid(it,   k, b, bv) {                    # most-common key
  bv = -1
  for (k in .it.has) if (.it.has[k] > bv) { bv = .it.has[k]; b = k }
  return b }
function sym_var(it,   k, p, e) {                     # Shannon entropy
  for (k in .it.has) { p = .it.has[k] / .it.n; e -= p * log(p) }
  return e }

Three sugar tricks on display: .it.x -> HEAP[it]["x"] (struct sugar), @fn(it) indirect call (poor-man's vtable), and new()/arr() from layer 1 for safe construction. The full numsym.awk adds Bayes likelihoods, weighting, and normalisation; the slice above is enough for this example.

4. The main program — stats.awk

Reads any CSV. UPPERCASE column names → Num, lowercase → Sym. One pass, prints summary at end. All it does is call into the two layers above.

BEGIN { FS = " *, *" }                       # CSV: comma + optional spaces

NR == 1 { header(); next }                   # first row = column names
        { ingest() }                         # every other row = data

# header(): one Num or Sym per column, decided by first letter's case.
function header(    i) {
  for (i = 1; i <= NF; i++) {
    NAME[i] = $i
    COL[i]  = new($i ~ /^[A-Z]/ ? "num" : "sym") } }   # new() from dot.awk

# ingest(): polymorphic add() picks num_add or sym_add per column.
function ingest(    i) {
  for (i = 1; i <= NF; i++) add(COL[i], $i, 1) }

END { report(); rogues() }                   # rogues() from dotlib.awk

function report(    i, c) {
  printf "%-22s %6s %12s %12s\n", "column", "n", "mid", "spread"
  for (i = 1; i <= length(NAME); i++) {
    c = COL[i]
    if (.c.is == "num") printf "%-22s %6d %12.3f %12.3f\n", NAME[i], .c.n, mid(c), var(c)
    else                printf "%-22s %6d %12s %12.3f\n", NAME[i], .c.n, mid(c), var(c) } }

Input

Heart-disease CSV. Mix of numeric (UPPER) and symbolic (lower) columns:

AGE,sex,cp,TRESTBPS,CHOL,fbs,restecg,THALACH,exang,OLDPEAK,slope,CA,thal,num!
63,male,typ_angina,145,233,t,left_vent_hyper,150,no,2.3,down,0,fixed_defect,<50
67,male,asympt,160,286,f,left_vent_hyper,108,yes,1.5,flat,3,normal,>50_1

Run

The dotcols binary bundles dot's runtime + helpers + Num/Sym/Data, so you only pass the program file:

dotcols stats.awk data/classify/heart.c.csv

Or use the bundled demo (no source file needed — reads demos/stats/sample.csv by default, or whatever DATA you pass):

dotcols --demo stats data/classify/heart.c.csv

Or pipe data via stdin (use - to override the default sample):

cat data/classify/heart.c.csv | dotcols --demo stats -

To inspect what the preprocessor produced for any file:

dotcols -c stats.awk    # print rewritten source to stdout

Output

column                      n          mid       spread
AGE                       303       54.366        9.082
sex                       303         male        0.624
cp                        303       asympt        1.206
TRESTBPS                  303      131.624       17.538
CHOL                      303      246.264       51.831
fbs                       303            f        0.420
restecg                   303       normal        0.754
THALACH                   303      149.647       22.905
exang                     303           no        0.632
OLDPEAK                   303        1.040        1.161
slope                     303           up        0.897
CA                        298        0.674        0.938
thal                      301       normal        0.864
num!                      303          <50        0.689

mid = mean (num) or mode (sym). spread = stdev (num) or entropy (sym).

What just happened

  1. header() reads first row, calls new("num") or new("sym") per column based on first-letter case. Each call returns a fresh ID and seeds HEAP[id].
  2. Every subsequent row: ingest() calls add(COL[i], $i, 1). Same call dispatches to num_add or sym_add via .it.is.
  3. num_add updates Welford's running mean+variance. sym_add increments a counts table.
  4. report() calls mid() and var() — both polymorphic. num_mid returns mean; sym_mid returns mode. num_var returns stdev; sym_var returns entropy.
  5. rogues() at end reports any lowercase global leaked. Should print nothing.

Variants

dotcols --demos              # list bundled demos
dotcols --show               # dump bundled runtime libs (post-prep)
dotcols --get-data           # fetch 30 curated CSVs into ./data/

Next

To use these types for actual machine learning — decision trees, naive Bayes — see dotlearn.