<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Postfix on TurboVision</title>
    <link>https://turbovision.in6-addr.net/tags/postfix/</link>
    <description>Recent content in Postfix on TurboVision</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Tue, 21 Apr 2026 14:06:12 +0000</lastBuildDate>
    <atom:link href="https://turbovision.in6-addr.net/tags/postfix/index.xml" rel="self" type="application/rss&#43;xml" />
    
    
    
    <item>
      <title>From Mailboxes to Everything Internet, Part 2: Mail Migration Under Real Traffic</title>
      <link>https://turbovision.in6-addr.net/linux/migrations/from-mailboxes-to-everything-internet-part-2-mail-migration-under-real-traffic/</link>
      <pubDate>Tue, 27 Feb 2007 00:00:00 +0000</pubDate>
      <lastBuildDate>Tue, 27 Feb 2007 00:00:00 +0000</lastBuildDate>
      <guid>https://turbovision.in6-addr.net/linux/migrations/from-mailboxes-to-everything-internet-part-2-mail-migration-under-real-traffic/</guid>
      <description>&lt;p&gt;If Part 1 was about building a bridge, Part 2 is about learning to drive trucks across it in bad weather.&lt;/p&gt;
&lt;p&gt;Once mail leaves &amp;ldquo;small local utility&amp;rdquo; territory and becomes a central service, the conversation changes. You stop asking &amp;ldquo;can it send and receive?&amp;rdquo; and start asking:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;can it survive hostile traffic?&lt;/li&gt;
&lt;li&gt;can it be operated by more than one person?&lt;/li&gt;
&lt;li&gt;can policy changes be rolled out without accidental outages?&lt;/li&gt;
&lt;li&gt;can users trust it on weekdays when everyone is overloaded?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In our case, that transition happened between 2001 and 2007. By then, Linux mail infrastructure was no longer experimental in geek circles. It was production, with all the consequences.&lt;/p&gt;
&lt;h2 id=&#34;why-we-moved-away-from-wizard-level-config-only&#34;&gt;Why we moved away from &amp;ldquo;wizard-level config only&amp;rdquo;&lt;/h2&gt;
&lt;p&gt;Many older setups depended on one person who understood every macro, alias map, and legacy hack in a mail config. That worked until that person got sick, changed jobs, or simply slept through a pager alert.&lt;/p&gt;
&lt;p&gt;Our first explicit migration goal in this phase was organizational, not technical:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A competent operator should be able to reason about mail behavior from plain files and runbooks.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;That goal pushed us toward simpler policy expression and clearer service boundaries. Whether your final stack was sendmail, postfix, qmail, or exim mattered less than whether your team could operate it calmly.&lt;/p&gt;
&lt;h2 id=&#34;the-stack-boundary-model-that-reduced-incidents&#34;&gt;The stack boundary model that reduced incidents&lt;/h2&gt;
&lt;p&gt;We separated the pipeline into explicit layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;SMTP ingress/egress policy&lt;/li&gt;
&lt;li&gt;queue and routing&lt;/li&gt;
&lt;li&gt;content filtering (spam/virus)&lt;/li&gt;
&lt;li&gt;mailbox delivery and retrieval (POP/IMAP)&lt;/li&gt;
&lt;li&gt;user/admin observability&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key idea: one layer should fail in ways visible to the next, not silently mutate behavior.&lt;/p&gt;
&lt;p&gt;When all logic is crammed into one giant config, failure states become ambiguous. Ambiguity is expensive in incidents.&lt;/p&gt;
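&lt;p&gt;As a sketch of what those boundaries can look like in one concrete implementation, here is a Postfix-flavored &lt;code&gt;master.cf&lt;/code&gt; fragment. The filter transport name and loopback ports are illustrative, not our exact values:&lt;/p&gt;

```text
# Layer 1: the public smtpd applies ingress policy, then hands mail
# to the content filter instead of delivering directly.
smtp            inet  n  -  n  -  -  smtpd
    -o content_filter=scan:[127.0.0.1]:10024

# Layer 3 hands mail back on a loopback listener that applies NO policy,
# so a filter failure surfaces as a visible queue, not a silent policy change.
127.0.0.1:10025 inet  n  -  n  -  -  smtpd
    -o content_filter=
    -o smtpd_recipient_restrictions=permit_mynetworks,reject
```

&lt;p&gt;The &lt;code&gt;scan&lt;/code&gt; transport itself would be defined elsewhere in &lt;code&gt;master.cf&lt;/code&gt;; the point is that each boundary is a separate listener with its own explicit policy.&lt;/p&gt;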
&lt;h2 id=&#34;real-world-migration-pattern-parallel-path-then-cutover&#34;&gt;Real-world migration pattern: parallel path, then cutover&lt;/h2&gt;
&lt;p&gt;Our cutovers got safer once we standardized this pattern:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;deploy new MTA host in parallel&lt;/li&gt;
&lt;li&gt;mirror relevant policy maps and aliases&lt;/li&gt;
&lt;li&gt;run shadow traffic tests (submission + delivery + bounce paths)&lt;/li&gt;
&lt;li&gt;cut one low-risk domain first&lt;/li&gt;
&lt;li&gt;watch queue/error behavior for a week&lt;/li&gt;
&lt;li&gt;migrate high-volume domains next&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This sounds slow. It is fast compared to cleaning up one bad all-at-once switch.&lt;/p&gt;
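&lt;p&gt;Steps 4 and 6 can be expressed directly in a Postfix-style transport map on the old gateway; the domains and hostname below are placeholders:&lt;/p&gt;

```text
# /etc/postfix/transport on the OLD gateway during staged cutover
# step 4: one low-risk domain routes to the new MTA first
low-risk.example.org      smtp:[mx-new.internal.example.org]
# step 6: high-volume domains follow only after a week of clean behavior
# big-volume.example.org  smtp:[mx-new.internal.example.org]
```

&lt;p&gt;If the map is hashed, remember to run &lt;code&gt;postmap /etc/postfix/transport&lt;/code&gt; after each edit so the change actually takes effect.&lt;/p&gt;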
&lt;h2 id=&#34;the-anti-spam-era-changed-architecture&#34;&gt;The anti-spam era changed architecture&lt;/h2&gt;
&lt;p&gt;By 2005-2007, spam pressure made &amp;ldquo;mail server&amp;rdquo; and &amp;ldquo;mail security&amp;rdquo; inseparable. A useful configuration had to combine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;connection-level checks (HELO sanity, rate controls)&lt;/li&gt;
&lt;li&gt;policy checks (relay restrictions, recipient validation)&lt;/li&gt;
&lt;li&gt;reputation checks (RBLs)&lt;/li&gt;
&lt;li&gt;content scoring (SpamAssassin-like layer)&lt;/li&gt;
&lt;li&gt;malware scanning&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A typical policy layout in that era looked conceptually like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;div class=&#34;chroma&#34;&gt;
&lt;table class=&#34;lntable&#34;&gt;&lt;tr&gt;&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code&gt;&lt;span class=&#34;lnt&#34;&gt; 1
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 2
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 3
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 4
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 5
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 6
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 7
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 8
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt; 9
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;10
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;11
&lt;/span&gt;&lt;span class=&#34;lnt&#34;&gt;12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;
&lt;td class=&#34;lntd&#34;&gt;
&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;ingress:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  reject_non_fqdn_sender
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  reject_non_fqdn_recipient
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  reject_unknown_sender_domain
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  reject_unauth_destination
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  check_rbl zen.example-rbl.net
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  pass_to_content_filter
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;content_filter:
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  spam_score_threshold = 6.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  quarantine_threshold = 12.0
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  antivirus = enabled&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
&lt;/div&gt;
&lt;/div&gt;&lt;p&gt;The exact knobs differed by implementation. The architecture of staged decision points did not.&lt;/p&gt;
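&lt;p&gt;In Postfix terms (one implementation among several we touched), the staged layout above maps roughly onto &lt;code&gt;main.cf&lt;/code&gt; like this; the RBL zone is a placeholder:&lt;/p&gt;

```text
smtpd_recipient_restrictions =
    permit_mynetworks,
    reject_non_fqdn_sender,
    reject_non_fqdn_recipient,
    reject_unknown_sender_domain,
    reject_unauth_destination,
    reject_rbl_client zen.example-rbl.net,
    permit
# everything that survives the ingress checks goes to the scoring layer
content_filter = scan:[127.0.0.1]:10024
```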
&lt;h2 id=&#34;false-positives-the-quiet-business-outage&#34;&gt;False positives: the quiet business outage&lt;/h2&gt;
&lt;p&gt;Most teams fear spam floods. We learned to fear false positives just as much. Aggressive filtering can silently break legitimate workflows, especially for smaller orgs where one supplier&amp;rsquo;s odd mail setup is still mission-critical.&lt;/p&gt;
&lt;p&gt;We moved to a tiered posture:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;reject only on high-confidence transport policy violations&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;tag/quarantine for uncertain content cases&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;teach users to report false positives with full headers&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This reduced support friction and preserved trust.&lt;/p&gt;
&lt;p&gt;A service users trust imperfectly is a service they route around with private inboxes, and then governance fails quietly.&lt;/p&gt;
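&lt;p&gt;The tag/quarantine tier is cheap to express. Assuming a SpamAssassin-style filter that stamps an &lt;code&gt;X-Spam-Level&lt;/code&gt; star rating, a Postfix &lt;code&gt;header_checks&lt;/code&gt; rule can park high-scoring mail in the hold queue for operator review instead of rejecting it outright:&lt;/p&gt;

```text
# main.cf
header_checks = regexp:/etc/postfix/header_checks

# /etc/postfix/header_checks
# twelve or more stars (score 12.0+) goes to the hold queue, not the bitbucket
/^X-Spam-Level: \*{12,}/    HOLD
```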
&lt;h2 id=&#34;queue-operations-numbers-that-actually-mattered&#34;&gt;Queue operations: numbers that actually mattered&lt;/h2&gt;
&lt;p&gt;People love total queue size graphs. Useful, but incomplete. We tracked a more operational set:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;queue age percentile (P50/P95)&lt;/li&gt;
&lt;li&gt;deferred reasons by top code/domain&lt;/li&gt;
&lt;li&gt;bounce class distribution&lt;/li&gt;
&lt;li&gt;local disk growth vs queue growth&lt;/li&gt;
&lt;li&gt;retry success after first deferral&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Why queue age percentile? Because a small queue with very old entries is often more dangerous than a large queue of fresh retries.&lt;/p&gt;
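&lt;p&gt;The percentile math itself is trivial; the only site-specific part is extracting per-message ages from your queue listing (for example, from the timestamps in &lt;code&gt;mailq&lt;/code&gt; output). A minimal sketch, with that extraction left to a wrapper:&lt;/p&gt;

```shell
#!/bin/sh
# Nearest-rank P50/P95 over per-message queue ages in seconds, one age per
# line on stdin. Deriving the ages from your queue listing is left to a
# site-specific wrapper; the math is plain POSIX sort + awk.
queue_age_percentiles() {
    sort -n | awk '
        { a[NR] = $1 }
        END {
            if (NR == 0) { print "no entries"; exit 1 }
            # nearest-rank: index ceil(p * NR / 100), computed without floats
            p50 = a[int((50 * NR + 99) / 100)]
            p95 = a[int((95 * NR + 99) / 100)]
            printf "P50=%ss P95=%ss n=%s\n", p50, p95, NR
        }'
}
# example: printf "%s\n" 30 900 45 86400 | queue_age_percentiles
```

&lt;p&gt;A large P95 with a small queue is exactly the &amp;ldquo;small queue with very old entries&amp;rdquo; case that a total-size graph hides.&lt;/p&gt;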
&lt;h2 id=&#34;submission-and-auth-became-first-class&#34;&gt;Submission and auth became first-class&lt;/h2&gt;
&lt;p&gt;As users moved from fixed office networks to mixed environments, authenticated submission stopped being optional. We separated trusted relay from authenticated submission explicitly and documented it in end-user instructions.&lt;/p&gt;
&lt;p&gt;A minimal policy split looked like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;relay without auth only from managed LAN ranges&lt;/li&gt;
&lt;li&gt;require auth for all remote submission&lt;/li&gt;
&lt;li&gt;enforce TLS where practical&lt;/li&gt;
&lt;li&gt;disable legacy insecure paths gradually with communication windows&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;People remember technical changes. They forget user communication. In migrations, communication is part of uptime.&lt;/p&gt;
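&lt;p&gt;The policy split itself fits in a few Postfix lines (address ranges illustrative); the hard part, as noted, is the communication around it:&lt;/p&gt;

```text
# main.cf: relay without auth only from managed LAN ranges
mynetworks = 127.0.0.0/8 192.0.2.0/24

# master.cf: a dedicated submission service that always requires auth,
# and TLS before any credentials cross the wire
submission inet n - n - - smtpd
    -o smtpd_tls_security_level=encrypt
    -o smtpd_sasl_auth_enable=yes
    -o smtpd_recipient_restrictions=permit_sasl_authenticated,reject
```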
&lt;h2 id=&#34;logging-from-forensic-artifact-to-daily-dashboard&#34;&gt;Logging: from forensic artifact to daily dashboard&lt;/h2&gt;
&lt;p&gt;Early on, logs were mostly used after incidents. By mid-migration, we treated them as daily control instruments. We built tiny scripts that summarized:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;top rejected senders&lt;/li&gt;
&lt;li&gt;top deferred recipient domains&lt;/li&gt;
&lt;li&gt;top local auth failures&lt;/li&gt;
&lt;li&gt;per-hour inbound/outbound volume&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Even crude summaries built operator intuition fast. If Tuesday looks unlike every previous Tuesday, investigate before users notice.&lt;/p&gt;
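&lt;p&gt;&amp;ldquo;Tiny scripts&amp;rdquo; really means tiny. A sketch of the deferred-domains summary, with the caveat that the log field layout here is deliberately simplified (&lt;code&gt;to=user@domain,&lt;/code&gt;); real syslog lines need a matching tweak to the &lt;code&gt;to=&lt;/code&gt; pattern before the counts mean anything:&lt;/p&gt;

```shell
#!/bin/sh
# Count deferred deliveries per recipient domain from mail-log lines on stdin.
# NOTE: assumes a simplified "to=user@domain," field; adjust the regex for
# your actual log format.
top_deferred_domains() {
    awk '
        /status=deferred/ {
            if (match($0, /to=[^ ,]+/)) {
                addr = substr($0, RSTART + 3, RLENGTH - 3)   # strip "to="
                n = split(addr, part, "@")
                count[part[n]]++                             # key on the domain
            }
        }
        END { for (d in count) printf "%d %s\n", count[d], d }
    ' | sort -rn
}
```

&lt;p&gt;The same skeleton, with a different match pattern, covered top rejected senders and auth-failure sources.&lt;/p&gt;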
&lt;h2 id=&#34;dns-and-reputation-maintenance-discipline&#34;&gt;DNS and reputation maintenance discipline&lt;/h2&gt;
&lt;p&gt;Mail reliability in 2007 is tightly coupled to DNS hygiene and sending reputation. We added recurring checks for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;forward/reverse consistency&lt;/li&gt;
&lt;li&gt;MX consistency after planned changes&lt;/li&gt;
&lt;li&gt;SPF correctness&lt;/li&gt;
&lt;li&gt;stale secondary records&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A single stale record can cause &amp;ldquo;works for most people&amp;rdquo; failures that consume days.&lt;/p&gt;
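&lt;p&gt;The forward/reverse check is a good candidate for a recurring script. In this sketch the resolution step is decoupled from the comparison so the logic can be exercised offline; in production you would feed it &lt;code&gt;dig&lt;/code&gt; answers:&lt;/p&gt;

```shell
#!/bin/sh
# Forward/reverse (FCrDNS) consistency check. Pass the hostname and the PTR
# answer for its A record; in production something like:
#   fcrdns_check "$host" "$(dig +short -x "$(dig +short A "$host")")"
fcrdns_check() {
    host=$1
    ptr=$2
    ptr=${ptr%.}    # dig leaves a trailing dot on PTR answers
    if [ "$ptr" = "$host" ]; then
        echo "OK $host"
    else
        echo "MISMATCH $host ptr=$ptr"
        return 1
    fi
}
```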
&lt;h2 id=&#34;incident-story-the-day-policy-order-bit-us&#34;&gt;Incident story: the day policy order bit us&lt;/h2&gt;
&lt;p&gt;One outage class recurred until we fixed our process: policy ordering mistakes.&lt;/p&gt;
&lt;p&gt;A config reload in which one rule has moved above another can flip behavior from permissive to catastrophic. We had one deploy where recipient validation executed before a required local map had been loaded in the new process context. The external effect: temporary 5xx rejects for valid local recipients.&lt;/p&gt;
&lt;p&gt;The post-incident fix was procedural:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;stage config in syntax check mode&lt;/li&gt;
&lt;li&gt;run policy simulation against known-good/known-bad test cases&lt;/li&gt;
&lt;li&gt;reload in maintenance window&lt;/li&gt;
&lt;li&gt;verify with live probes&lt;/li&gt;
&lt;li&gt;keep rollback snippet ready&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The technical fix was small. The process fix prevented repeats.&lt;/p&gt;
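&lt;p&gt;Steps 1, 2, and 5 of that procedure are one-liners on a Postfix box (map name and test addresses illustrative):&lt;/p&gt;

```text
# 1. syntax-check the staged config before any reload
postfix check

# 2. simulate policy lookups against known-good / known-bad cases
postmap -q valid-user@example.org   hash:/etc/postfix/local_recipients  # expect a hit
postmap -q retired-user@example.org hash:/etc/postfix/local_recipients  # expect no output

# 3-4. reload in the window, then probe a live SMTP session end to end
postfix reload

# 5. keep the previous config ready to restore in one move
cp -a /etc/postfix /etc/postfix.rollback    # taken BEFORE the change
```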
&lt;h2 id=&#34;the-human-layer-runbooks-and-ownership&#34;&gt;The human layer: runbooks and ownership&lt;/h2&gt;
&lt;p&gt;Mail operations improved when we wrote short, explicit runbooks and attached clear ownership:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&amp;ldquo;high queue depth but low queue age&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;low queue depth but high queue age&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;sudden outbound spike&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;auth failure burst&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&amp;ldquo;upstream DNS inconsistency&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each runbook had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;first checks&lt;/li&gt;
&lt;li&gt;known bad patterns&lt;/li&gt;
&lt;li&gt;escalation condition&lt;/li&gt;
&lt;li&gt;rollback or containment action&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The format matters less than consistency. Under stress, consistency wins.&lt;/p&gt;
&lt;h2 id=&#34;migration-economics-why-smaller-steps-are-cheaper&#34;&gt;Migration economics: why smaller steps are cheaper&lt;/h2&gt;
&lt;p&gt;A common argument was &amp;ldquo;let&amp;rsquo;s wait and migrate everything when we also redo identity and web hosting.&amp;rdquo; We tried that once and regretted it. Bundling too many moving parts creates coupled risk and unclear root causes.&lt;/p&gt;
&lt;p&gt;Mail migration became tractable when we treated it as its own program with clear acceptance gates:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;transport reliability&lt;/li&gt;
&lt;li&gt;policy correctness&lt;/li&gt;
&lt;li&gt;abuse resilience&lt;/li&gt;
&lt;li&gt;operator clarity&lt;/li&gt;
&lt;li&gt;user communication quality&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Only after those stabilized did we stack adjacent migrations.&lt;/p&gt;
&lt;h2 id=&#34;what-changes-in-2007-operations&#34;&gt;What changed by 2007&lt;/h2&gt;
&lt;p&gt;Compared with 2001, a 2007 Linux mail setup in our environment looked less romantic and much more professional:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;explicit relay boundaries&lt;/li&gt;
&lt;li&gt;documented policy layers&lt;/li&gt;
&lt;li&gt;operational dashboards from logs&lt;/li&gt;
&lt;li&gt;recurring DNS/reputation checks&lt;/li&gt;
&lt;li&gt;reproducible deployment and rollback&lt;/li&gt;
&lt;li&gt;practical abuse handling without user-hostile defaults&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We did not eliminate incidents. We made incidents legible.&lt;/p&gt;
&lt;p&gt;That is the difference between hobby administration and service operations.&lt;/p&gt;
&lt;h2 id=&#34;practical-checklist-if-you-are-migrating-this-year&#34;&gt;Practical checklist: if you are migrating this year&lt;/h2&gt;
&lt;p&gt;If you are planning a migration this year, this is the condensed list I would tape above the rack:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;define policy boundaries before touching software packages&lt;/li&gt;
&lt;li&gt;build and test in parallel, then cut over domain-by-domain&lt;/li&gt;
&lt;li&gt;implement anti-spam as layered decisions, not one giant hammer&lt;/li&gt;
&lt;li&gt;measure queue age, not just queue size&lt;/li&gt;
&lt;li&gt;separate LAN relay from authenticated submission&lt;/li&gt;
&lt;li&gt;automate log summaries your operators will actually read&lt;/li&gt;
&lt;li&gt;simulate policy before reload&lt;/li&gt;
&lt;li&gt;treat user comms as part of the rollout, not an afterthought&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;If you do only four of these, do 1, 3, 4, and 7.&lt;/p&gt;
&lt;h2 id=&#34;weekly-review-ritual-that-kept-us-honest&#34;&gt;Weekly review ritual that kept us honest&lt;/h2&gt;
&lt;p&gt;One habit improved this migration more than any single package choice: a short weekly mail operations review with evidence, not opinions.&lt;/p&gt;
&lt;p&gt;The agenda stayed fixed:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;queue age trend over last seven days&lt;/li&gt;
&lt;li&gt;top five defer reasons and whether each is improving&lt;/li&gt;
&lt;li&gt;false-positive reports with root-cause category&lt;/li&gt;
&lt;li&gt;auth failure clusters by source network&lt;/li&gt;
&lt;li&gt;one policy/rule cleanup item&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We kept the meeting to thirty minutes and required one concrete action at the end. If there was no action, we were probably admiring graphs instead of improving service.&lt;/p&gt;
&lt;p&gt;This ritual sounds simple because it is simple. The impact came from repetition. It turned scattered incidents into a feedback loop and gradually removed &amp;ldquo;mystery behavior&amp;rdquo; from the system.&lt;/p&gt;
&lt;p&gt;Related reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/retro/linux/migrations/from-mailboxes-to-everything-internet-part-1-the-gateway-years/&#34;&gt;From Mailboxes to Everything Internet, Part 1: The Gateway Years&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://turbovision.in6-addr.net/hacking/tools/terminal-kits-for-incident-triage/&#34;&gt;Terminal Kits for Incident Triage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
  </channel>
</rss>
