Manual
The CLI, preprocessor, runtime, helpers, and conventions. (For Num/Sym/Data column types, see dotcols.)
dot — the CLI
One self-contained bash script (~280 lines). Bundles the preprocessor, runtime, and helpers as embedded heredocs (the Num/Sym types live in dotcols). No directory, no Makefile.

    dot FILE.awk [DATA...]     # run rewritten FILE.awk on DATA (or stdin)
    dot a.awk b.awk DATA       # multi-file run; .awk args go through prep
    cat DATA | dot FILE.awk    # stdin works too
    dot -c FILE.awk            # print rewritten source only
    dot --demo NAME [DATA]     # run demos/NAME/*.awk on DATA (or sample.*; pass `-` for stdin)
    dot --demos                # list available demos under ./demos/
    dot --show                 # dump bundled lib/*.awk (post-prep)
    dot --help                 # full help

Arg dispatch: any *.awk argument is preprocessed and added as a -f source; everything else becomes a data file. The bundled runtime is always loaded first. User .awk files are written to ${TMPDIR}/dot.XXXX/<basename> before gawk -f picks them up — that way error messages preserve the original filename and line number.
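The dispatch rule can be pictured with a small shell sketch. This is a hypothetical illustration, not the code inside `dot`: the function name `classify` and the variables `srcs` and `data` are invented here, and the preprocessing step is only indicated by a comment. It also assumes file names without spaces.

```shell
# Hypothetical sketch of the dispatch rule; not the actual code inside dot.
classify() {
  srcs="" data=""
  for arg in "$@"; do
    case "$arg" in
      *.awk) srcs="$srcs -f $arg" ;;  # awk source: would be preprocessed, then passed via -f
      *)     data="$data $arg" ;;     # anything else is treated as a data file
    esac
  done
}

classify a.awk b.awk rows.csv
echo "gawk$srcs$data"   # prints: gawk -f a.awk -f b.awk rows.csv
```

In the real script the bundled runtime would come first on that command line, ahead of any user source.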
Built from the modular sources by build.sh. Edit a source file, run ./build.sh, get a rebuilt dot.
Preprocessor
Two rewrites are applied to the source before gawk sees it. A "value-char" is any letter, digit, or underscore, plus the closing `)` and `]`. So `it.x` rewrites; `.x` at line start does not.
Run the preprocessor manually with dot -c file.awk (one file) or dot --show (bundled libs).
new
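Allocate an object and dispatch its initializer. `new(t)` increments `NID`, sets `.it.is = t`, and calls the function named `t "_init"` (for type `num`, that is `num_init`) if it is defined. Returns the new ID, or whatever the initializer returns when one exists, so initializers should return `it`.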

    function new(t,    it, fn) {
      it = ++NID; .it.is = t
      fn = t "_init"
      return (fn in FUNCTAB) ? @fn(it) : it }

Storage lives in HEAP[id]["field"].
arr
Force a value to be an array. gawk creates arrays-of-arrays lazily, so a name that has never been subscripted is not yet an array. `arr(x)` is the standard idiom: assign to a dummy key, then delete it. After that, `for (k in x)` is safe even if `x` stays empty.

    function arr(x) { x[""] = 0; delete x[""] }

zap
Drop one HEAP slot. Use when the caller is done with an object and wants the memory back. No GC, no refcount — zap is the only memory hook.

    function zap(i) { delete HEAP[i] }

Pattern: build a temporary aggregate, extract its scalar, drop it.

    function spread(d, rows,    y, v) {
      y = ycol(d, rows); v = var(y); zap(y); return v }

Polymorphic dispatch
Each object carries its type string in .it.is (set by new()). A polymorphic call concatenates that string with the operation name and uses gawk's indirect-call (@fn) syntax.

    function add(k, x, train, w,    fn) {
      fn = .k.is "_add"
      return @fn(k, x, train, w) }
    function like(k, x, p, m,    fn) {
      fn = .k.is "_like"
      return @fn(k, x, p, m) }
    function var(it,    fn) { fn = .it.is "_var"; return @fn(it) }
    function mid(it,    fn) { fn = .it.is "_mid"; return @fn(it) }

The dispatch pattern (one line builds the name, one line calls it) is the only thing dot itself defines. Concrete types like num and sym with their num_add/sym_add/num_mid/sym_mid functions live in the dotcols layer.
o, _oo — recursive pretty-printers
One entry point. o(x) dispatches on shape: array vs scalar, list vs dict, integer-valued float vs other number, else string. Recursion is automatic — the workhorse _oo calls o() on each element, so nested mixes (list-of-dicts, dict-of-lists, anything) just work.

    function o(x) {
      if (isarray(x)) {
        if (1 in x) _oo(x, "[", "]", "@ind_num_asc", 0)
        else        _oo(x, "{", "}", "@ind_str_asc", 1)
      } else if (x ~ /^-?[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) {
        if (x == int(x)) printf "%d", x
        else             printf "%G", x }
      else printf "%s", x }

    function _oo(a, lhs, rhs, how, withkey,    n, i, k, sep, sorted) {
      printf "%s", lhs
      n = asorti(a, sorted, how)
      sep = ""
      for (i = 1; i <= n; i++) {
        k = sorted[i]
        printf "%s", sep
        if (withkey) printf "%s: ", k
        o(a[k])
        sep = ", " }
      printf "%s", rhs }

Defaults: list keys numeric-sorted, dict keys string-sorted, integer-valued floats print as %d (so 5.0000 → 5), other numbers as %G (so 1e-7 → 1E-07). For custom brackets, call _oo directly:

    _oo(a, "(", ")", "@ind_num_asc", 0)   # tuple style
    _oo(a, "<", ">", "@ind_str_asc", 1)   # angle-bracket dict

The "@ind_num_asc" string names a built-in asorti() comparator. The @ prefix is required — without it gawk treats the string as a user-defined comparison function name and aborts with a fatal error.
Conventions
- UPPER = global: `HEAP`, `NID`, `FUNCTAB`.
- lowercase = local. Declare in the gawk pseudo-arg slot (extra spaces in arg list).
- Every object sets `.it.is` to its type string.
- Run `rogues()` at `END` to flag any lowercase name leaked into `SYMTAB`.
- `BEGIN`/`END` blocks have no local scope; move logic to functions if you need locals there.
rogues
End-of-run leak detector. Lives in dotlib.awk. Walks SYMTAB (gawk's table of all global names) and prints any starting with a lowercase letter — the convention says lowercase is local, so any leaked lowercase global is a bug.

    function rogues(    i) {
      for (i in SYMTAB)
        if (i ~ /^[a-z]/) print "leak:", i > "/dev/stderr" }

Call once at END. Silent if clean; one line per leak otherwise.
Files
What the user installs is exactly one file: dot. Everything else lives in the repo for editing — build.sh embeds each .awk source as a heredoc inside the bundled binary. At runtime, dot writes those heredocs to a tempdir (${TMPDIR}/dot.XXXX/) on demand, then gawk -f's them. So once you have dot, no other file is required.
Shipped (the install)
| File | Lines | Purpose |
|---|---|---|
| `dot` | ~280 | self-contained bash binary; bundles every .awk below as heredocs |
In the repo (for editing)
| File | Lines | Purpose | How it ships |
|---|---|---|---|
| `build.sh` | ~180 | assembles `dot` from `lib/` | not bundled (build-time only) |
| `lib/prep.awk` | 10 | preprocessor (one rule, two regex sweeps) | heredoc inside `dot` |
| `lib/dot.awk` | 14 | runtime: `new`, `arr`, `zap` | heredoc inside `dot` |
| `lib/dotlib.awk` | 34 | helpers: `rogues`, `o`, `_oo` | heredoc inside `dot` |
| `demos/hello/hello.awk` | 6 | smallest example: running mean | shipped in repo, runs via `dot --demo hello` |
| `demos/hello/sample.txt` | 5 | auto-loaded sample for hello demo | shipped in repo |
So: the install is one file (dot). The repo holds lib/ sources, build.sh, and any number of demos/NAME/ directories that dot --demo NAME can run. Higher layers (dotcols, dotlearn) ship their own binaries that bundle dot's lib/ plus their own.
Limitations
Read these before deploying to anything you care about.
- Preprocessor is regex, not a parser. A literal `".x"` inside a string or regex constant gets rewritten too. So does a `printf` format such as `"%.3f"` when a value-char precedes the dot (for example `"%5.3f"`). Workaround: split the dot across strings so it has a quote on each side, as in `printf "%" "." "3f"`, or build basenames as `name "." ext`. (Quotes are not value-chars, so the regex skips that dot.) For object code, rename the field or write raw `HEAP[x]` directly.
- Comments are not stripped. A `.x` inside a `# ...` comment also rewrites. Harmless but ugly in `dot --show` output.
- `.it.is` is a reserved field name. Used internally for type dispatch. Don't shadow it.
- `zap(i)` is the only memory hook. No GC, no refcount: you decide when an object is finished and call `zap(i)` to clear its `HEAP` slot.
- Monotonic IDs, not recycled. `NID` only ever increments; `zap` deletes a slot's contents but never reuses its id. Not a big deal: an id is just an int, so accumulation costs effectively nothing. The slot contents stay zapped, no leak.
- No type checking. `add(num_obj, "abc", 1)` coerces silently per awk rules.
- Indirect call cost is real but small. Every polymorphic `add()` does a string concat + `FUNCTAB` lookup per row. gawk is interpreted, so the per-call overhead is already in microseconds, and the indirect step adds a fraction of that. Hot inner loops can still call `num_add` directly if you want the last bit.
- gawk 5.4.0 bug: `gensub`'s second match returns empty captures. The preprocessor uses a `match`/`substr` loop instead.
- Bash required to run `dot`. The binary uses heredocs, arrays, and process substitution `<(...)`. macOS `/bin/sh` is bash 3.x: run via `bash dot ...` or invoke it from zsh, not `/bin/sh dot`.