<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Monitoring on TurboVision</title>
    <link>https://turbovision.in6-addr.net/tags/monitoring/</link>
    <description>Recent content in Monitoring on TurboVision</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 21 Apr 2026 14:06:12 +0000</lastBuildDate>
    <atom:link href="https://turbovision.in6-addr.net/tags/monitoring/index.xml" rel="self" type="application/rss&#43;xml" />
    
    
    
    <item>
      <title>From Mailboxes to Everything Internet, Part 4: Perimeter, Proxies, and the Operations Upgrade</title>
      <link>https://turbovision.in6-addr.net/linux/migrations/from-mailboxes-to-everything-internet-part-4-perimeter-proxies-and-the-operations-upgrade/</link>
      <pubDate>Fri, 21 May 2010 00:00:00 +0000</pubDate>
      <lastBuildDate>Fri, 21 May 2010 00:00:00 +0000</lastBuildDate>
      <guid>https://turbovision.in6-addr.net/linux/migrations/from-mailboxes-to-everything-internet-part-4-perimeter-proxies-and-the-operations-upgrade/</guid>
      <description>&lt;p&gt;The final phase of the migration story starts when internet access stops being &amp;ldquo;useful&amp;rdquo; and becomes &amp;ldquo;required for normal business.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That is the moment architecture changes character. You are no longer adding online capabilities to an offline-first world. You are operating an internet-dependent environment where outages hurt immediately, security posture matters daily, and latency becomes political.&lt;/p&gt;
&lt;p&gt;If Part 1 taught us gateways, Part 2 taught policy discipline, and Part 3 taught identity realism, Part 4 teaches operational maturity: perimeter control, proxy strategy, and observability that is good enough to act on.&lt;/p&gt;
&lt;h2 id=&#34;the-perimeter-timeline-everyone-lived&#34;&gt;The perimeter timeline everyone lived&lt;/h2&gt;
&lt;p&gt;In the late 90s and early 2000s, many of us moved through the same progression:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;permissive edge with ad-hoc rules&lt;/li&gt;
&lt;li&gt;basic packet filtering&lt;/li&gt;
&lt;li&gt;NAT as default containment and address strategy&lt;/li&gt;
&lt;li&gt;explicit service publishing with stricter inbound policy&lt;/li&gt;
&lt;li&gt;recurring audits and documented rule ownership&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Tool names changed over time. The operating truth stayed constant:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;If nobody can explain why a firewall rule exists, that rule is debt.&lt;/strong&gt;&lt;/p&gt;
&lt;h2 id=&#34;rule-sets-as-executable-policy&#34;&gt;Rule sets as executable policy&lt;/h2&gt;
&lt;p&gt;The biggest jump in reliability came when we stopped treating firewall config as wizard output and started treating it like policy code with comments, ownership, and change history.&lt;/p&gt;
&lt;p&gt;A conceptual baseline:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;default INPUT  = DROP
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;default FORWARD = DROP
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;default OUTPUT = ACCEPT
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;allow established,related
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;allow loopback
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;allow admin-ssh from mgmt-net
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;allow smtp to mail-gateway
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;allow web to reverse-proxy
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;log+drop everything else&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;This is not about minimalism for style points. It is about creating a rulebase an operator can reason about quickly during incidents.&lt;/p&gt;
&lt;h2 id=&#34;nat-convenience-and-trap-in-one-box&#34;&gt;NAT: convenience and trap in one box&lt;/h2&gt;
&lt;p&gt;NAT solved practical problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;private address reuse&lt;/li&gt;
&lt;li&gt;easy outbound internet for many hosts&lt;/li&gt;
&lt;li&gt;accidental reduction of direct inbound exposure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It also created recurring confusion:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;works outbound, fails inbound&amp;rdquo;&lt;/li&gt;
&lt;li&gt;protocol edge cases under state tracking&lt;/li&gt;
&lt;li&gt;poor assumptions that NAT equals security policy&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We learned to separate concerns explicitly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NAT handles address translation&lt;/li&gt;
&lt;li&gt;firewall handles policy&lt;/li&gt;
&lt;li&gt;service publishing handles intentional exposure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combining them mentally is how outages hide.&lt;/p&gt;
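&lt;p&gt;A minimal sketch of that separation in iptables-flavored pseudocode (interface, network, and host names are illustrative, not from a real config):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;# translation only: many private hosts share one public address
nat:    masquerade 10.0.0.0/8 via wan0

# the policy decision stays in the firewall, not in NAT
filter: allow forward from 10.0.0.0/8 to wan0 (new connections)

# intentional exposure: publish exactly one service per rule
nat:    dnat wan0 port 25 to mail-gateway
filter: allow forward to mail-gateway port 25
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Written this way, an operator can tell at a glance which line translates addresses, which line grants permission, and which line exposes a service on purpose.&lt;/p&gt;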
&lt;h2 id=&#34;proxy-and-cache-operations-bandwidth-as-architecture&#34;&gt;Proxy and cache operations: bandwidth as architecture&lt;/h2&gt;
&lt;p&gt;Web access volume and software update traffic make proxy/cache design a real budget topic, especially on constrained links.&lt;/p&gt;
&lt;p&gt;A disciplined proxy setup gave us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;reduced repeated downloads&lt;/li&gt;
&lt;li&gt;controllable egress behavior&lt;/li&gt;
&lt;li&gt;clearer audit path for outbound traffic&lt;/li&gt;
&lt;li&gt;policy enforcement point for categories and exceptions&lt;/li&gt;
&lt;/ul&gt;
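&lt;p&gt;A minimal sketch of such a setup in Squid-style configuration (the network range, cache size, and paths are placeholders, not recommendations):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;# who may use the proxy at all
acl lan src 10.0.0.0/8
http_access allow lan
http_access deny all

# cache sizing: repeated downloads are the cheapest wins
cache_dir ufs /var/spool/squid 4096 16 256
maximum_object_size 256 MB

# log every request so egress has an audit path
access_log /var/log/squid/access.log
&lt;/code&gt;&lt;/pre&gt;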
&lt;p&gt;It also gave us politics:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;who gets exceptions&lt;/li&gt;
&lt;li&gt;what to log and for how long&lt;/li&gt;
&lt;li&gt;how to communicate policy without creating a revolt&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The winning pattern was transparent policy with named ownership and periodic review, not silent filtering.&lt;/p&gt;
&lt;h2 id=&#34;monitoring-matured-from-nice-graph-to-first-responder&#34;&gt;Monitoring matured from &amp;ldquo;nice graph&amp;rdquo; to &amp;ldquo;first responder&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;Early graphing projects were often visual hobbies. Around 2008-2010, monitoring became core operations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;service availability checks&lt;/li&gt;
&lt;li&gt;latency and packet-loss visibility&lt;/li&gt;
&lt;li&gt;queue and disk saturation alerts&lt;/li&gt;
&lt;li&gt;trend analysis for capacity planning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A minimal useful stack in that era looked like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;polling/graphing for interfaces and host metrics&lt;/li&gt;
&lt;li&gt;active checks for critical services&lt;/li&gt;
&lt;li&gt;alert routing by severity and schedule&lt;/li&gt;
&lt;li&gt;daily review of top recurring warnings&lt;/li&gt;
&lt;/ul&gt;
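&lt;p&gt;In Nagios-style notation typical of that era, the active-check part of the stack might look like this (host and contact names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;define service {
    host_name           mail-gateway
    service_description SMTP
    check_command       check_smtp
    check_interval      5
    notification_period 24x7
    contact_groups      oncall
}
&lt;/code&gt;&lt;/pre&gt;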
&lt;p&gt;Most teams do not fail from missing tools; they fail from alert noise that nobody owns.&lt;/p&gt;
&lt;h2 id=&#34;alert-hygiene-less-noise-more-truth&#34;&gt;Alert hygiene: less noise, more truth&lt;/h2&gt;
&lt;p&gt;We adopted three rules that changed everything:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;every alert must map to a concrete action&lt;/li&gt;
&lt;li&gt;every noisy alert must be tuned or removed&lt;/li&gt;
&lt;li&gt;every major incident must produce one monitoring improvement&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Without these rules, monitoring becomes background anxiety.
With them, monitoring becomes a decision system.&lt;/p&gt;
&lt;h2 id=&#34;web-went-from-optional-to-default-workload&#34;&gt;Web went from optional to default workload&lt;/h2&gt;
&lt;p&gt;In the &amp;ldquo;everything internet&amp;rdquo; phase, internal services increasingly depended on external web APIs, update endpoints, and browser-based tooling. Outbound failures became as disruptive as inbound failures.&lt;/p&gt;
&lt;p&gt;That pushed us to monitor the whole path:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;local DNS health&lt;/li&gt;
&lt;li&gt;upstream DNS responsiveness&lt;/li&gt;
&lt;li&gt;default route and failover behavior&lt;/li&gt;
&lt;li&gt;proxy health&lt;/li&gt;
&lt;li&gt;selected external endpoint reachability&lt;/li&gt;
&lt;/ul&gt;
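&lt;p&gt;That path decomposes into a staged check sequence; a sketch of the shell steps (names, ports, and addresses are placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;ping -c3 gateway                            # 1. local edge reachable?
dig @local-dns example.com                  # 2. local resolver answering?
dig @upstream-dns example.com               # 3. upstream resolver answering?
curl -x proxy:3128 -sI http://example.com   # 4. proxy path working?
curl -sI http://example.com                 # 5. direct path working?
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The stage where the sequence first fails names the bottleneck far faster than any amount of &amp;ldquo;is the internet up&amp;rdquo; debate.&lt;/p&gt;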
&lt;p&gt;When users say &amp;ldquo;the internet is slow,&amp;rdquo; they could mean any of a dozen potential bottlenecks.&lt;/p&gt;
&lt;h2 id=&#34;incident-story-the-half-outage-that-taught-path-thinking&#34;&gt;Incident story: the half-outage that taught path thinking&lt;/h2&gt;
&lt;p&gt;One of our most educational incidents looked like this:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;internal DNS resolved fine&lt;/li&gt;
&lt;li&gt;external name resolution intermittently failed&lt;/li&gt;
&lt;li&gt;some websites loaded, others timed out&lt;/li&gt;
&lt;li&gt;mail queues started deferring to specific domains&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Initial blame went to firewall changes. Real cause was upstream DNS flapping plus a local resolver timeout setting that turned transient upstream latency into user-visible failure bursts.&lt;/p&gt;
&lt;p&gt;Fixes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;tune resolver timeout/retry behavior&lt;/li&gt;
&lt;li&gt;add secondary upstream resolvers with health checks&lt;/li&gt;
&lt;li&gt;monitor DNS query latency as first-class metric&lt;/li&gt;
&lt;li&gt;add runbook step: test path by stage, not by &amp;ldquo;internet yes/no&amp;rdquo;&lt;/li&gt;
&lt;/ol&gt;
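&lt;p&gt;On Linux resolvers of that era, fixes 1 and 2 largely came down to a few lines in &lt;code&gt;/etc/resolv.conf&lt;/code&gt; (the addresses here are documentation placeholders):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;nameserver 192.0.2.53       # primary upstream
nameserver 198.51.100.53    # independent secondary
options timeout:2 attempts:2 rotate
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The defaults (a 5-second timeout and no rotation) are exactly what turned transient upstream latency into user-visible failure bursts.&lt;/p&gt;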
&lt;p&gt;The lesson: binary status checks are comforting and often wrong.&lt;/p&gt;
&lt;h2 id=&#34;operational-runbooks-became-mandatory&#34;&gt;Operational runbooks became mandatory&lt;/h2&gt;
&lt;p&gt;As dependency increased, we formalized runbooks for common internet-era failures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;high packet loss on WAN edge&lt;/li&gt;
&lt;li&gt;DNS partial outage&lt;/li&gt;
&lt;li&gt;proxy saturation&lt;/li&gt;
&lt;li&gt;firewall deploy regression&lt;/li&gt;
&lt;li&gt;certificate expiry risk (yes, this became real quickly)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A useful runbook page had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;symptom signatures&lt;/li&gt;
&lt;li&gt;first 5 commands/checks&lt;/li&gt;
&lt;li&gt;containment action&lt;/li&gt;
&lt;li&gt;escalation threshold&lt;/li&gt;
&lt;li&gt;known false signals&lt;/li&gt;
&lt;/ul&gt;
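&lt;p&gt;A filled-in example for one entry, kept deliberately short (commands and thresholds are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;Runbook: DNS partial outage
Symptoms:    some names resolve, some time out; mail defers to a few domains
First checks:
  1. dig @local-dns known-good.example
  2. dig @upstream-dns known-good.example
  3. scan resolver logs for SERVFAIL spikes
  4. ping the upstream resolver (loss? latency?)
  5. compare failing vs. working domains (same registrar? same TLD?)
Containment: switch to secondary upstream resolvers
Escalate if: failures persist 15 minutes after the switch
False signal: a single cached NXDOMAIN is not an outage
&lt;/code&gt;&lt;/pre&gt;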
&lt;p&gt;Good runbooks are written by people who have been paged, not by people who enjoy templates.&lt;/p&gt;
&lt;h2 id=&#34;capacity-planning-by-trend-not-by-optimism&#34;&gt;Capacity planning by trend, not by optimism&lt;/h2&gt;
&lt;p&gt;The 2005-2010 period punished optimistic capacity assumptions. We moved to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;weekly trend snapshots&lt;/li&gt;
&lt;li&gt;monthly peak reports&lt;/li&gt;
&lt;li&gt;explicit growth assumptions tied to user counts/services&lt;/li&gt;
&lt;li&gt;trigger thresholds for upgrade planning&lt;/li&gt;
&lt;/ul&gt;
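&lt;p&gt;The trigger thresholds themselves were deliberately simple. One workable set (the numbers are examples, not universal rules):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-text&#34;&gt;trigger: WAN 95th-percentile utilization over 60% for 4 weeks
action:  start link upgrade planning

trigger: disk usage trend projected to cross 80% within 90 days
action:  order capacity now, not at 95%

trigger: backup window exceeds 75% of its slot
action:  split jobs or add spindles before it overruns
&lt;/code&gt;&lt;/pre&gt;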
&lt;p&gt;Bandwidth, disk, queue depth, and backup windows all needed trend visibility.&lt;/p&gt;
&lt;p&gt;The cheapest way to buy reliability is to stop being surprised.&lt;/p&gt;
&lt;h2 id=&#34;security-posture-in-the-broadband-normal&#34;&gt;Security posture in the broadband normal&lt;/h2&gt;
&lt;p&gt;Always-on connectivity changed attack surface and incident frequency. Sensible baseline hardening became routine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;minimize exposed services&lt;/li&gt;
&lt;li&gt;patch regularly with rollback plan&lt;/li&gt;
&lt;li&gt;enforce admin access boundaries&lt;/li&gt;
&lt;li&gt;log denied traffic with retention policy&lt;/li&gt;
&lt;li&gt;periodically validate external exposure with independent scans&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;No single control solved this. Layered boring controls did.&lt;/p&gt;
&lt;h2 id=&#34;documentation-as-operational-memory&#34;&gt;Documentation as operational memory&lt;/h2&gt;
&lt;p&gt;The largest hidden risk in these years was tacit knowledge. One expert could still keep a network alive, but one expert could not scale resilience.&lt;/p&gt;
&lt;p&gt;We wrote concise docs for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;edge topology&lt;/li&gt;
&lt;li&gt;rule ownership&lt;/li&gt;
&lt;li&gt;proxy exceptions&lt;/li&gt;
&lt;li&gt;monitoring map&lt;/li&gt;
&lt;li&gt;escalation contacts&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then we tested docs by having another operator run routine tasks from them. If they failed, doc quality was failing, not operator quality.&lt;/p&gt;
&lt;h2 id=&#34;the-mindset-shift-that-completed-migration&#34;&gt;The mindset shift that completed migration&lt;/h2&gt;
&lt;p&gt;By 2010, the real completion signal was not &amp;ldquo;all services on Linux.&amp;rdquo;&lt;br&gt;
The completion signal was:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;we can explain the system&lt;/li&gt;
&lt;li&gt;we can detect drift early&lt;/li&gt;
&lt;li&gt;we can recover predictably&lt;/li&gt;
&lt;li&gt;we can hand operations across people&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That is the shift from clever setup to resilient operations.&lt;/p&gt;
&lt;h2 id=&#34;final-lessons-from-the-full-series&#34;&gt;Final lessons from the full series&lt;/h2&gt;
&lt;p&gt;Across all four parts, the durable lessons are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;bridge systems first, replace systems second&lt;/li&gt;
&lt;li&gt;treat policy as explicit artifacts&lt;/li&gt;
&lt;li&gt;migrate identities and habits with as much care as services&lt;/li&gt;
&lt;li&gt;design monitoring and runbooks for tired humans&lt;/li&gt;
&lt;li&gt;prefer incremental certainty over dramatic cutovers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of this sounds fashionable. All of it works.&lt;/p&gt;
&lt;h2 id=&#34;what-comes-next&#34;&gt;What comes next&lt;/h2&gt;
&lt;p&gt;Outside this series, two adjacent topics deserve their own deep dives:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;storage reliability on budget hardware (where most silent disasters begin)&lt;/li&gt;
&lt;li&gt;early virtualization in small Linux shops (where consolidation and experimentation finally met)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Both changed how we thought about failure domains and recovery.&lt;/p&gt;
&lt;h2 id=&#34;one-quarterly-drill-that-paid-off-every-time&#34;&gt;One quarterly drill that paid off every time&lt;/h2&gt;
&lt;p&gt;By the end of this migration era, we added a quarterly &amp;ldquo;internet dependency drill.&amp;rdquo; It was intentionally small and practical: simulate one realistic edge failure and walk the runbook with the current on-call rotation.&lt;/p&gt;
&lt;p&gt;Typical drill themes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;upstream DNS degraded but not fully down&lt;/li&gt;
&lt;li&gt;accidental firewall regression after policy deploy&lt;/li&gt;
&lt;li&gt;proxy saturation during patch rollout day&lt;/li&gt;
&lt;li&gt;WAN packet loss spike during business hours&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The rule was simple: no blame, no theater, and one concrete improvement item must come out of each drill.&lt;/p&gt;
&lt;p&gt;This practice changed behavior in a measurable way. Operators started recognizing symptoms earlier, escalation happened with better context, and runbooks stayed alive instead of rotting into documentation archives.&lt;/p&gt;
&lt;p&gt;Most importantly, drills exposed stale assumptions before real incidents did. In internet-dependent systems, stale assumptions are often the first domino.&lt;/p&gt;
&lt;p&gt;One side effect we did not expect: these drills improved cross-team language. Network admins, service admins, and helpdesk staff started describing incidents with the same terms and sequence. That alone reduced triage delay, because every handoff no longer restarted the investigation from zero.&lt;/p&gt;
&lt;p&gt;Shared language is not a soft benefit; in outages, it is response-time infrastructure.
It prevents expensive confusion.&lt;/p&gt;
&lt;p&gt;Related reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/linux/migrations/from-mailboxes-to-everything-internet-part-1-the-gateway-years/&#34;&gt;From Mailboxes to Everything Internet, Part 1: The Gateway Years&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/linux/migrations/from-mailboxes-to-everything-internet-part-2-mail-migration-under-real-traffic/&#34;&gt;From Mailboxes to Everything Internet, Part 2: Mail Migration Under Real Traffic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/linux/migrations/from-mailboxes-to-everything-internet-part-3-identity-file-services-and-mixed-networks/&#34;&gt;From Mailboxes to Everything Internet, Part 3: Identity, File Services, and Mixed Networks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/latency-budgeting-on-old-machines/&#34;&gt;Latency Budgeting on Old Machines&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
