<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Performance on TurboVision</title>
    <link>https://turbovision.in6-addr.net/tags/performance/</link>
    <description>Recent content in Performance on TurboVision</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 21 Apr 2026 14:06:12 +0000</lastBuildDate>
    <atom:link href="https://turbovision.in6-addr.net/tags/performance/index.xml" rel="self" type="application/rss&#43;xml" />
    
    
    
    <item>
      <title>Benchmarking with a Stopwatch</title>
      <link>https://turbovision.in6-addr.net/retro/benchmarking-with-a-stopwatch/</link>
      <pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate>
      <lastBuildDate>Sun, 22 Feb 2026 22:13:51 +0100</lastBuildDate>
      <guid>https://turbovision.in6-addr.net/retro/benchmarking-with-a-stopwatch/</guid>
      <description>&lt;p&gt;When people imagine benchmarking, they picture automated harnesses, high-resolution timers, and dashboards with percentile charts. Useful tools, absolutely. But many core lessons of performance engineering can be learned with much humbler methods, including one old trick from retro workflows: benchmarking with a stopwatch and disciplined procedure.&lt;/p&gt;
&lt;p&gt;On vintage systems, instrumentation was often limited, intrusive, or unavailable. So users built practical measurement habits with what they had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;fixed test scenarios&lt;/li&gt;
&lt;li&gt;fixed machine state&lt;/li&gt;
&lt;li&gt;repeated runs&lt;/li&gt;
&lt;li&gt;manual timing&lt;/li&gt;
&lt;li&gt;written logs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It sounds primitive until you realize it enforces the exact thing modern teams often skip: experimental discipline.&lt;/p&gt;
&lt;p&gt;The first rule was baseline control. Before measuring anything, define the environment:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;cold boot or warm boot?&lt;/li&gt;
&lt;li&gt;which TSRs loaded?&lt;/li&gt;
&lt;li&gt;cache settings?&lt;/li&gt;
&lt;li&gt;storage medium and fragmentation status?&lt;/li&gt;
&lt;li&gt;background noise sources?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without this, numbers are stories, not data.&lt;/p&gt;
&lt;p&gt;Retro benchmark notes were often simple tables in paper notebooks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;date/time&lt;/li&gt;
&lt;li&gt;test ID&lt;/li&gt;
&lt;li&gt;config profile&lt;/li&gt;
&lt;li&gt;run duration&lt;/li&gt;
&lt;li&gt;anomalies observed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Crude format, high value. The notebook gave context that raw timing never carries alone.&lt;/p&gt;
&lt;p&gt;A useful retro-style method still works today:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Define one narrow task.&lt;/li&gt;
&lt;li&gt;Freeze variables you can control.&lt;/li&gt;
&lt;li&gt;Predict expected change before tuning.&lt;/li&gt;
&lt;li&gt;Run at least five times.&lt;/li&gt;
&lt;li&gt;Record median, min, max, and odd behavior.&lt;/li&gt;
&lt;li&gt;Change one variable only.&lt;/li&gt;
&lt;li&gt;Repeat.&lt;/li&gt;
&lt;/ol&gt;
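&lt;p&gt;The seven steps above can be sketched as a tiny timing harness. This is a minimal illustration, not a benchmarking library: &lt;code&gt;task&lt;/code&gt; is a hypothetical stand-in for your one narrow, frozen scenario.&lt;/p&gt;

```python
# Minimal sketch of the stopwatch procedure: run a fixed task several
# times, then record median, min, and max. `task` is a placeholder for
# whatever narrow scenario you have defined and frozen.
import statistics
import time

def run_trials(task, runs=5):
    """Time a fixed task `runs` times and summarize the spread."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
        "runs": runs,
    }

# Example: time a small fixed workload at least five times.
summary = run_trials(lambda: sum(range(100_000)))
```

&lt;p&gt;Change one variable, rerun, and compare medians; a wide min–max gap is your cue to investigate outliers rather than discard them.&lt;/p&gt;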
&lt;p&gt;This method is slow compared to one-click benchmarks. It is also far less vulnerable to self-deception.&lt;/p&gt;
&lt;p&gt;On old DOS systems, examples were concrete:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;compile a known source tree&lt;/li&gt;
&lt;li&gt;load/save a fixed data file&lt;/li&gt;
&lt;li&gt;render a known scene&lt;/li&gt;
&lt;li&gt;execute a scripted file operation loop&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key was repeatability, not synthetic hero numbers.&lt;/p&gt;
&lt;p&gt;Stopwatch timing also trained observational awareness. While timing a run, people noticed things automated tools might not flag immediately:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;intermittent disk spin-up delays&lt;/li&gt;
&lt;li&gt;occasional UI stalls&lt;/li&gt;
&lt;li&gt;audible seeks indicating poor locality&lt;/li&gt;
&lt;li&gt;thermal behavior after repeated runs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These qualitative observations often explained quantitative outliers.&lt;/p&gt;
&lt;p&gt;Outliers are where learning happens. Many teams throw them away too quickly. In retro workflows, outliers were investigated because they were expensive and visible. Was the disk retrying? Did memory managers conflict? Did a TSR wake unexpectedly? Outlier analysis taught root-cause thinking.&lt;/p&gt;
&lt;p&gt;Modern equivalent: if your p99 spikes, do not call it &amp;ldquo;noise&amp;rdquo; by default.&lt;/p&gt;
&lt;p&gt;Another underrated benefit of manual benchmarking is forced hypothesis writing. If timing is laborious, you naturally ask, &amp;ldquo;What exactly am I trying to prove?&amp;rdquo; That question removes random optimization churn.&lt;/p&gt;
&lt;p&gt;A strong benchmark note has:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;hypothesis&lt;/li&gt;
&lt;li&gt;method&lt;/li&gt;
&lt;li&gt;expected outcome&lt;/li&gt;
&lt;li&gt;observed outcome&lt;/li&gt;
&lt;li&gt;interpretation&lt;/li&gt;
&lt;/ul&gt;
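&lt;p&gt;The note structure above can be captured as a simple record. The field names mirror the list and are illustrative, not a standard format; the completeness check encodes the point about expectation coming before interpretation.&lt;/p&gt;

```python
# Sketch of a benchmark note as a record. Field names are illustrative;
# the completeness check enforces that an interpretation is never
# written without a stated expectation (guarding against bias).
from dataclasses import dataclass

@dataclass
class BenchmarkNote:
    hypothesis: str
    method: str
    expected_outcome: str
    observed_outcome: str
    interpretation: str

    def is_complete(self):
        fields = (self.hypothesis, self.method, self.expected_outcome,
                  self.observed_outcome, self.interpretation)
        return all(f.strip() for f in fields)

note = BenchmarkNote(
    hypothesis="Larger disk cache cuts load time",
    method="Load fixed data file, 5 runs, compare medians",
    expected_outcome="Roughly 20% faster load",
    observed_outcome="12% faster, one outlier run",
    interpretation="Partial win; investigate the outlier",
)
```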
&lt;p&gt;If interpretation comes without explicit expectation, confirmation bias sneaks in.&lt;/p&gt;
&lt;p&gt;Retro systems also made tradeoffs obvious. You might optimize disk cache and gain load speed but lose conventional memory needed by a tool. You might tune for compile throughput and reduce game compatibility in the same boot profile. Measuring one axis while ignoring others produced bad local wins.&lt;/p&gt;
&lt;p&gt;That tradeoff awareness is still essential:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lower latency at cost of CPU headroom&lt;/li&gt;
&lt;li&gt;higher throughput at cost of tail behavior&lt;/li&gt;
&lt;li&gt;better cache hit rate at cost of stale data risk&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All optimization is policy.&lt;/p&gt;
&lt;p&gt;The stopwatch method encouraged another good habit: &amp;ldquo;benchmark the user task, not the subsystem vanity metric.&amp;rdquo; Faster block IO means little if perceived workflow time is unchanged. In retro terms: if startup is faster but menu interaction is still laggy, users still feel it is slow.&lt;/p&gt;
&lt;p&gt;Many optimization projects fail because they optimize what is easy to measure, not what users experience.&lt;/p&gt;
&lt;p&gt;The historical constraints are gone, but the pattern remains useful for quick field analysis:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;no profiler on a locked-down machine&lt;/li&gt;
&lt;li&gt;no tracing in a production-like lab&lt;/li&gt;
&lt;li&gt;no permission for invasive instrumentation&lt;/li&gt;

&lt;/ul&gt;
&lt;p&gt;In those cases, controlled manual timing plus careful notes can still produce actionable decisions.&lt;/p&gt;
&lt;p&gt;There is a social benefit too. Manual benchmark logs are readable by non-specialists. Product, support, and ops can review the same sheet and understand what changed. Shared understanding improves prioritization.&lt;/p&gt;
&lt;p&gt;This does not replace modern telemetry. It complements it. Think of stopwatch benchmarking as a low-tech integrity check:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Does automated telemetry align with observed behavior?&lt;/li&gt;
&lt;li&gt;Do optimization claims survive controlled reruns?&lt;/li&gt;
&lt;li&gt;Do gains persist after reboot and load variance?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If yes, confidence increases.&lt;/p&gt;
&lt;p&gt;If no, investigate before celebrating.&lt;/p&gt;
&lt;p&gt;A practical retro-inspired template for teams:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;keep one canonical benchmark scenario per critical user flow&lt;/li&gt;
&lt;li&gt;run it before and after risky performance changes&lt;/li&gt;
&lt;li&gt;require expected-vs-actual notes&lt;/li&gt;
&lt;li&gt;archive results alongside release notes&lt;/li&gt;
&lt;/ul&gt;
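&lt;p&gt;The expected-vs-actual discipline in this template can be sketched in a few lines. The numbers here are invented for illustration.&lt;/p&gt;

```python
# Sketch of the expected-vs-actual note: predict a change before the
# risky performance work, rerun the canonical scenario afterward, and
# compare. All numbers are invented sample medians.
def relative_change(before_median, after_median):
    """Fractional change in median run time (negative means faster)."""
    return (after_median - before_median) / before_median

expected = -0.20                       # prediction: ~20% faster
actual = relative_change(12.4, 11.1)   # medians from before/after runs

# A large gap between expected and actual is a signal to investigate
# before celebrating; archive both numbers with the release notes.
surprise = abs(actual - expected)
```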
&lt;p&gt;This creates performance memory. Without memory, teams repeat old mistakes with new tooling.&lt;/p&gt;
&lt;p&gt;Performance culture improves when measurement is treated as craft, not ceremony. Retro workflows learned that under hardware limits. We can keep the lesson without the limits.&lt;/p&gt;
&lt;p&gt;The stopwatch is symbolic, not sacred. Use any timer you like. What matters is disciplined comparison, clear expectations, and honest interpretation. Those traits produce reliable performance improvements on 486-era systems and cloud-native stacks alike.&lt;/p&gt;
&lt;p&gt;In the end, benchmarking quality is less about timer precision than about thinking precision. A clean method beats a noisy toolchain every time.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Latency Budgeting on Old Machines</title>
      <link>https://turbovision.in6-addr.net/retro/latency-budgeting-on-old-machines/</link>
      <pubDate>Sun, 22 Feb 2026 00:00:00 +0000</pubDate>
      <lastBuildDate>Mon, 09 Mar 2026 09:46:27 +0100</lastBuildDate>
      <guid>https://turbovision.in6-addr.net/retro/latency-budgeting-on-old-machines/</guid>
      <description>&lt;p&gt;One gift of old machines is that they make latency visible. You do not need an observability platform to notice when an operation takes too long; your hands tell you immediately. Keyboard echo lags. Menu redraw stutters. Disk access interrupts flow. On constrained hardware, latency is not hidden behind animation. It is a first-class design variable.&lt;/p&gt;
&lt;p&gt;Most retro users developed latency budgets without naming them that way. They did not begin with dashboards. They began with tolerance thresholds: if opening a directory takes longer than a second, it feels broken; if screen updates exceed a certain rhythm, confidence drops; if save operations block too long, people fear data loss. This was experiential ergonomics, built from repeated friction.&lt;/p&gt;
&lt;p&gt;A practical budget often split work into classes. Input responsiveness had the strictest target. Visual feedback came second. Heavy background operations came third, but only if they could communicate progress honestly. Even simple tools benefited from this hierarchy. A file manager that reacts instantly to keys but defers expensive sorting feels usable. One that blocks on every key feels hostile.&lt;/p&gt;
&lt;p&gt;Because CPUs and memory were limited, achieving these budgets required architectural choices, not just micro-optimizations. You cached directory metadata. You precomputed static UI regions. You used incremental redraw instead of repainting everything. You chose algorithms with predictable worst-case behavior over theoretically elegant options with pathological spikes. The goal was not maximum benchmark score; it was consistent interaction quality.&lt;/p&gt;
&lt;p&gt;Disk I/O dominated many workloads, so scheduling mattered. Batching writes reduced seek churn. Sequential reads were preferred whenever possible. Temporary file design became a latency decision: poor temp strategy could double user-visible wait time. Even naming conventions influenced performance because directory traversal cost was real and structure affected lookup behavior on older filesystems.&lt;/p&gt;
&lt;p&gt;Developers also learned a subtle lesson: users tolerate total time better than jitter. A stable two-second operation can feel acceptable if progress is clear and consistent. An operation that usually takes half a second but occasionally spikes to five feels unreliable and stressful. Old systems made jitter painful, so engineers learned to trade mean performance for tighter variance when user trust depended on predictability.&lt;/p&gt;
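&lt;p&gt;The jitter point is easy to show numerically. These sample timings are invented: two operations with workable averages, but very different spread.&lt;/p&gt;

```python
# Sketch of why jitter matters more than the mean. Sample timings (in
# seconds) are invented: one stable operation, one usually-fast
# operation with an occasional five-second spike.
import statistics

steady = [2.0, 2.1, 1.9, 2.0, 2.0]   # stable two-second operation
spiky = [0.5, 0.5, 0.6, 0.5, 5.0]    # fast on average, occasional spike

def summarize(samples):
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples),
        "worst": max(samples),
    }

# The spiky profile wins on mean but loses badly on variance and worst
# case, which is what erodes user trust.
```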
&lt;p&gt;Measurement techniques were primitive but effective. Stopwatch timings, loop counters, and controlled repeat runs produced enough signal to guide decisions. You did not need nanosecond precision to find meaningful wins; you needed discipline. Define a scenario, run it repeatedly, change one variable, and compare. This method is still superior to intuition-driven tuning in modern environments.&lt;/p&gt;
&lt;p&gt;Another recurring tactic was level-of-detail adaptation. Tools degraded gracefully under load: fewer visual effects, smaller previews, delayed nonessential processing, simplified sorting criteria. These were not considered failures. They were responsible design responses to finite resources. Today we call this adaptive quality or progressive enhancement, but the principle is identical.&lt;/p&gt;
&lt;p&gt;Importantly, latency budgeting changed communication between developers and users. Release notes often highlighted perceived speed improvements for specific workflows: startup, save, search, print, compile. This focus signaled respect for user time. It also forced teams to anchor claims in concrete tasks instead of vague “performance improved” statements.&lt;/p&gt;
&lt;p&gt;Retro constraints also exposed the cost of abstraction layers. Every wrapper, conversion, and helper had measurable impact. Good abstractions survived because they paid for themselves in correctness and maintenance. Bad abstractions were stripped quickly when latency budgets broke. This pressure produced leaner designs and a healthier skepticism toward accidental complexity.&lt;/p&gt;
&lt;p&gt;If we port these lessons to current systems, the takeaway is simple: define latency budgets at the interaction level, not just service metrics. Ask what a user can perceive and what breaks trust. Build architecture to protect those thresholds. Measure variance, not only averages. Prefer predictable degradation over catastrophic stalls. These are old practices, but they map perfectly to modern UX reliability.&lt;/p&gt;
&lt;p&gt;The nostalgia framing misses the point. Old machines did not make developers virtuous by magic. They made trade-offs impossible to ignore. Latency was local, immediate, and accountable. When tools are transparent enough that cause and effect stay visible, teams build sharper instincts. That is the real value worth carrying forward.&lt;/p&gt;
&lt;p&gt;One practical exercise is to choose a single workflow you use daily and write a hard budget for each step: open, search, edit, save, verify. Then instrument and defend those thresholds over time. On old machines this discipline was survival. On modern machines it is still an advantage, because user trust is ultimately built from perceived responsiveness, not theoretical peak throughput.&lt;/p&gt;
&lt;h2 id=&#34;budget-log-example&#34;&gt;Budget log example&lt;/h2&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt;1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;9
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Workflow: open project -&amp;gt; search symbol -&amp;gt; edit -&amp;gt; save
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Budget:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  open &amp;lt;= 800ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  search &amp;lt;= 400ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  save &amp;lt;= 300ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Observed run #14:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  open 760ms | search 910ms | save 280ms
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;Action:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  inspect search index freshness and directory fan-out&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;Latency budgeting only works when budgets are written and checked, not assumed.&lt;/p&gt;
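&lt;p&gt;Checking a run against a written budget is mechanical once the thresholds exist. This sketch uses the thresholds and observed times from the example log; the structure is illustrative, not a standard format.&lt;/p&gt;

```python
# Minimal sketch of checking observed timings against a written budget,
# using the thresholds from the example log. All values in milliseconds.
BUDGET_MS = {"open": 800, "search": 400, "save": 300}

def over_budget(observed_ms):
    """Return the steps whose observed time exceeds the written budget."""
    return {step: ms for step, ms in observed_ms.items()
            if ms > BUDGET_MS.get(step, float("inf"))}

# Observed run #14 from the log: search blows its budget.
violations = over_budget({"open": 760, "search": 910, "save": 280})
```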
&lt;p&gt;Related reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/dos/tp/turbo-pascal-history-through-tooling/&#34;&gt;Turbo Pascal History Through Tooling Decisions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/benchmarking-with-a-stopwatch/&#34;&gt;Benchmarking with a Stopwatch&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/musings/clarity-is-an-operational-advantage/&#34;&gt;Clarity Is an Operational Advantage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
