Manual

The CLI, preprocessor, runtime, types, helpers, and conventions.

Example CSV (used in code samples below)
AGE, sex, cp,         CHOL, TARGET-, KLASS!
63,  male, typ_angina, 233,  3.4,     low
67,  male, asympt,     286,  2.8,     high
41,  fem,  asympt,     250,  2.1,     low
55,  fem,  typ_angina, 215,  2.5,     high
Header sigils tell data.awk what each column is for: UPPER first letter = numeric (AGE, CHOL); lowercase = symbolic (sex, cp); suffix + / - = y-goal (max/min); suffix ! = classification target. See dotlearn manual for the regression/classification roles — dotcols itself doesn't use the y-flag, just stores it.

dot — the CLI

One self-contained bash script (~280 lines). Bundles the preprocessor, runtime, helpers, and Num/Sym types as embedded heredocs. No directory, no Makefile.

dotcols FILE.awk [DATA...]    # run rewritten FILE.awk on DATA (or stdin)
dotcols a.awk b.awk DATA      # multi-file run; .awk args go through prep
cat DATA | dotcols FILE.awk   # stdin works too
dotcols -c FILE.awk           # print rewritten source only
dotcols --demo NAME [DATA]    # run demos/NAME/*.awk on DATA (or sample.*)
dotcols --demos               # list available demos under ./demos/
dotcols --show                # dump bundled lib/*.awk (post-prep)
dotcols --get-data            # fetch 30 curated CSVs into ./data/
dotcols --help                # full help

Arg dispatch: any *.awk argument is preprocessed and added as a -f source; everything else becomes a data file. The bundled runtime is always loaded first. User .awk files are written to ${TMPDIR}/dot.XXXX/<basename> before gawk -f picks them up — that way error messages preserve the original filename and line number.

Built from the modular sources by build.sh. Edit a source file, run ./build.sh, get a rebuilt dot.

Preprocessor

Two rewrites applied to source before gawk sees it. A "value-char" is anything in [A-Za-z0-9_])]. So it.x rewrites; .x at line start does not.

Source
.it.cols
.d.rows[r][i]
.d
new(.d.ykind)
After prep
it["cols"]
d["rows"][r][i]
HEAP[d]
new(HEAP[d]["ykind"])

Run the preprocessor manually with dotcols -c file.awk (one file) or dotcols --show (bundled libs).

new

Allocate an object, dispatch its initializer. new(t) increments NID, sets .it.is = t, and calls t_init if defined. Returns the new ID.

function new(t,    it, fn) {
  it = ++NID;  .it.is = t
  fn = t "_init"
  return (fn in FUNCTAB) ? @fn(it) : it }

Storage lives in HEAP[id]["field"].

arr

Force a value to be an array. gawk creates arrays-of-arrays lazily. arr(x) is the standard idiom — assign to a dummy key, delete it. Now safe to for k in x when later empty.

function arr(x) { x[""] = 0; delete x[""] }

zap

Drop one HEAP slot. Use when the caller is done with an object and wants the memory back. No GC, no refcount — zap is the only memory hook.

function zap(i) { delete HEAP[i] }

Pattern: build a temporary aggregate, extract its scalar, drop it.

function spread(d, rows,    y, v) {
  y = ycol(d, rows); v = var(y); zap(y); return v }

Polymorphic dispatch

Each object carries its type string in .it.is (set by new()). A polymorphic call concatenates that string with the operation name and uses gawk's indirect-call (@fn) syntax.

function add(k, x, train, w,  fn) { fn = .k.is "_add"
                                    return @fn(k, x, train, w) }
function like(k, x, p, m,     fn) { fn = .k.is "_like"
                                    return @fn(k, x, p, m) }
function var(it,    fn)  { fn = .it.is "_var"; return @fn(it) }
function mid(it,    fn)  { fn = .it.is "_mid"; return @fn(it) }

Dispatchers live in numsym.awk: add, like, mid, var. To add a new one, follow the pattern — one line builds the name, one line calls it.

num

Numeric column. Welford running mean + variance with optional weight. w=1 adds, w=-1 subtracts (used by the active learner to evict rows).

function num_add(it, x, train, w,    d, d2) {
  w = (w == "" ? 1 : w + 0)
  if (x == "?") return x
  x += 0
  if (!train) return x
  if (w < 0 && .it.n <= 2) { .it.n=0; .it.mu=0; .it.m2=0; return x }
  .it.n  += w
  d  = x - .it.mu;  .it.mu += w * d / .it.n
  d2 = x - .it.mu;  .it.m2 += w * d * d2
  return x }

function num_var(it) {
  return .it.n < 2 ? 0 : sqrt(.it.m2 / (.it.n - 1)) }

function num_mid(it) { return .it.mu }

sym

Symbolic column. Counts table, mode, entropy. Optional weight for sub.

function sym_init(it)   { arr(.it.has); return it }

function sym_add(it, x, train, w) {
  w = (w == "" ? 1 : w + 0)
  if (x == "?" || !train) return x
  .it.n      += w
  .it.has[x] += w
  if (.it.has[x] <= 0) delete .it.has[x]
  return x }

function sym_var(it,    e, k, p) {
  for (k in .it.has) { p = .it.has[k]/.it.n; e -= p*log(p) }
  return e }

function sym_mid(it,    k, b, bv) {
  bv = -1
  for (k in .it.has) if (.it.has[k] > bv) { bv = .it.has[k]; b = k }
  return b }

o, _oo — recursive pretty-printers

One entry point. o(x) dispatches on shape: array vs scalar, list vs dict, integer-valued float vs other number, else string. Recursion is automatic — the workhorse _oo calls o() on each element, so nested mixes (list-of-dicts, dict-of-lists, anything) just work.

function o(x) {
  if (isarray(x)) {
    if (1 in x) _oo(x, "[", "]", "@ind_num_asc", 0)
    else        _oo(x, "{", "}", "@ind_str_asc", 1)
  } else if (x ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) {
    if (x == int(x)) printf "%d", x
    else             printf "%G", x }
  else printf "%s", x }

function _oo(a, lhs, rhs, how, withkey,    n, i, k, sep, sorted) {
  printf "%s", lhs
  n = asorti(a, sorted, how)
  sep = ""
  for (i = 1; i <= n; i++) {
    k = sorted[i]
    printf "%s", sep
    if (withkey) printf "%s: ", k
    o(a[k])
    sep = ", " }
  printf "%s", rhs }

Defaults: list keys numeric-sorted, dict keys string-sorted, integer-valued floats print as %d (so 5.00005), other numbers as %G (so 1e-71E-07). For custom brackets, call _oo directly:

_oo(a, "(", ")", "@ind_num_asc", 0)   # tuple style
_oo(a, "<", ">", "@ind_str_asc", 1)   # angle-bracket dict

The "@ind_num_asc" string names a built-in asorti() comparator. The @ prefix is required — without it gawk treats the string as a user-defined comparison function name and aborts with a fatal error.

Conventions

  • UPPER = global. HEAP, NID, FUNCTAB.
  • lowercase = local. Declare in the gawk pseudo-arg slot (extra spaces in arg list).
  • Every object sets .it.is to its type string.
  • Run rogues() at END to flag any lowercase name leaked into SYMTAB.
  • BEGIN/END blocks have no local scope — move logic to functions if you need locals there.

rogues

End-of-run leak detector. Lives in dotlib.awk. Walks SYMTAB (gawk's table of all global names) and prints any starting with a lowercase letter — the convention says lowercase is local, so any leaked lowercase global is a bug.

function rogues(    i) {
  for (i in SYMTAB)
    if (i ~ /^[a-z]/) print "leak:", i > "/dev/stderr" }

Call once at END. Silent if clean; one line per leak otherwise.

Files

What the user installs is exactly one file: dot. Everything else lives in the repo for editing — build.sh embeds each .awk source as a heredoc inside the bundled binary. At runtime, dot writes those heredocs to a tempdir (${TMPDIR}/dot.XXXX/) on demand, then gawk -f's them. So once you have dot, no other file is required.

Shipped (the install)

FileLinesPurpose
dot~280self-contained bash binary; bundles every .awk below as heredocs

In the repo (for editing)

FileLinesPurposeHow it ships
build.sh~180assembles dot from the sourcesnot bundled (build-time only)
prep.awk10preprocessor (one rule, two regex sweeps)heredoc inside dot
dot.awk14runtime: new, arr, zapheredoc inside dot
dotlib.awk34helpers: rogues, o, _ooheredoc inside dot
numsym.awk66num/sym types + dispatchers add, like, mid, varheredoc inside dot
hello.awk6smallest example: running meanheredoc inside dot (used by --hello)
stats.awk25per-column stats main programheredoc inside dot (used by --stats)

So: the install is one file (dot). The repo holds the seven editable sources plus build.sh. Tempfiles are runtime artifacts and get cleaned up on exit.

Limitations

Read these before deploying to anything you care about.

  • Preprocessor is regex, not a parser. A literal ".x" inside a string or regex constant gets rewritten too. So does printf "%.3f" if a value-char precedes the dot. Workaround: split the dot across strings so it has a quote on each side — printf "%" "." "3f", or build basenames as name "." ext. (Quotes are not value-chars, so the regex skips it.) For object code, rename the field or write raw HEAP[x] directly.
  • Comments are not stripped. A .x inside a # ... comment also rewrites. Harmless but ugly in dotcols --show.
  • .it.is is a reserved field name. Used internally for type dispatch. Don't shadow it.
  • zap(i) is the only memory hook. No GC, no refcount — you decide when an object is done with and call zap(i) to clear its HEAP slot.
  • Monotonic IDs, not recycled. NID only ever increments; zap deletes a slot's contents but never reuses its id. Not a big deal — an id is just an int, so accumulation costs effectively nothing. The slot contents stay zapped, no leak.
  • No type checking. add(num_obj, "abc", 1) coerces silently per awk rules.
  • Indirect call cost is real but small. Every polymorphic add() does a string concat + FUNCTAB lookup per row. gawk is interpreted, so the per-call overhead is already in microseconds — the indirect step adds a fraction of that. Hot inner loops can still call num_add directly if you want the last bit.
  • gawk 5.4.0 bug: gensub's second match returns empty captures. The preprocessor uses a match/substr loop instead.
  • Bash required to run dot. The binary uses heredocs, arrays, and process substitution <(...). macOS /bin/sh is bash 3.x — run via bash dot ... or under zsh's invocation, not /bin/sh dot.