The Utility Library I Wish I'd Built on Day 30
The first log line I wrote at this company was a fmt.Println.
It was a single Go service, a few hundred lines of code, and I just needed to verify the thing was running. No log levels, no structured fields, no output destinations. Just a string hitting stdout that I could tail from my terminal.
That was three years ago. Today we run 20+ microservices behind a shared logging utility that costs us a fraction of what it used to and gives us more visibility than we ever had before. Getting here took three rewrites, two cost crises, and one architecture decision I got right—too late to avoid the pain.
This is the story of that journey. Not the code. The decisions.
Phase 1: The One-Service World
Startups begin simple. One service, one developer (or three), one problem to solve. Logging is an afterthought because everything else is on fire. You reach for the standard library, or whatever package is popular that week, and you ship.
log.Printf("user %s logged in", user.ID)
It works. You see your output. You move on.
Then someone adds a second service. Then a third. Each one picks its own logging approach—different packages, different formats, different log levels if they bother with levels at all. Nobody writes it down because nobody thinks there's anything to write down.
This isn't negligence. In a two-person startup, standardizing logging is the wrong thing to optimize for. You should optimize for shipping. But you should also recognize that this phase has an expiration date, and it's sooner than you think.
The first sign: someone asks "hey, what log format does the auth service use?" and you realize you don't know.
Phase 2: The Sprawl
By the time there are eight services, the logging situation is genuinely broken.
Half the services write plain text. Three write JSON but with slightly different field names—one uses "level", another uses "severity", the third uses "log_level". Two services don't log structured data at all and just dump stack traces when things go wrong. One team configured their log level to DEBUG six months ago and never set it back.
The observable consequences are surprisingly specific:
- SRE can't build a dashboard. Every query has to be rewritten per service because field names don't match.
- Incident response takes longer. When a request fails, you can trace it through some services but not others, because some log trace IDs, some log session IDs, and some log neither.
- The storage bill keeps climbing. Nobody knows which service is the noisy one. You're paying by the gigabyte and half of it is debug logs from a background worker nobody touches anymore.
At this point you have two options. Option one: write a policy document, hold a meeting, ask every team to standardize voluntarily, and hope. Option two: build a shared library that makes the right thing the easy thing—or better, the only thing.
We tried option one. It didn't work. Nobody has time to retroactively standardize logging when there are features to ship.
So I built the shared library.
Phase 3: The Shared Library
The library's surface area was small: a handful of functions (Info, Error, Debug, Warn) backed by uber-go/zap, a fast structured logger. The contract was one import and two lines of setup:
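import "github.com/company/logging"

func main() {
	// Configure the shared logger once at startup.
	logging.Init(logging.ProductionConfig())
	// Flush any buffered log entries before the process exits.
	defer logging.Sync()
}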
One import. No decisions for the caller. You get structured JSON logs, consistent field names, and trace ID propagation through request headers. The library enforced standards because the library was the standard. There was no alternative function to call.
The immediate impact was obvious. Dashboards that previously required per-service query rewriting now worked across all services. Incident response time dropped because every service logged trace IDs the same way. The debug-level spam from the background worker disappeared because the library defaulted to INFO in production and teams had to explicitly opt into DEBUG.
But the biggest impact was one I didn't anticipate: when every service uses the same logging library, you stop thinking about logging. It becomes invisible infrastructure—which is exactly what infrastructure should be.
For about four months, this worked. Then the cost crisis hit.
Phase 4: The Bill
We stored our logs in a managed observability platform. Ingestion was priced by the gigabyte. Retention was priced by the month.
One afternoon I was reviewing infrastructure costs and noticed something ugly: logging was the second-largest item after compute. We were paying more to store log data than we were paying for the databases that generated the data.
The culprit was request-response bodies.
The first version of the shared library captured request and response bodies on every HTTP request. Body capture is the quickest way to make a service debuggable—you can see exactly what went in and exactly what came out. For one service with moderate traffic, it's fine. For twenty services with production throughput, it's a cost disaster.
A 200 OK response from a /search endpoint returning 100KB of JSON results costs the same to log as a 500 error with a two-line stack trace. Multiply that by millions of requests per day, and you're paying a premium to store data you will never look at.
I made the obvious fix: remove body capture entirely. No more request bodies. No more response bodies. Just metadata: status code, latency, path, method, trace ID.
The cost plummeted. The dashboard flattened.
Phase 5: The Blind Spot
The first production incident after the change was a customer reporting "the search results are wrong, but it returns 200."
Normally I'd pull up the logs, find the request, and see the response body. Now there was nothing. The log told me the request happened, that it took 147ms, that it returned 200. But it couldn't tell me what it returned.
I knew the bug existed. I could reproduce it. But I couldn't prove it happened to the customer because my logs had no evidence.
This is the central tension of logging at scale: cheap logs are useless during incidents. Verbose logs are useful during incidents but expensive during everything else. You cannot optimize for both simultaneously with a binary on/off switch.
The answer isn't "log everything" or "log nothing." The answer is a policy layer.
Phase 6: The Policy Layer
A policy layer is a configuration system that separates what to log from what to do with the logs. Two questions, decoupled:
- Should I log this request at all?
- If yes, should I include the body?
The breakthrough is making the answer conditional on status code.
Hard rule: responses with status >= 400 are always logged, always with body. This is non-negotiable. Errors are rare and high-value. You never skip them.
Everything else—200s, 300s, health checks—is configurable. In steady state, the policy is set to errors-only. Your logging bill stays flat. When you're debugging, you flip a switch and suddenly every request gets logged with full body for the next five minutes. You find the bug. You flip it back.
policy := &logging.LogPolicy{
	SuccessMode: "errors-only", // 2xx/3xx responses: not logged
	HealthMode:  "skip-2xx",    // /health returning 200: skipped entirely
}
logging.SetPolicy(policy)
SuccessMode takes one of three values, each serving a different operational state:
| Mode | 2xx/3xx behavior | When to use |
|---|---|---|
| `errors-only` | Not logged | Default steady state. Cheapest. |
| `all-with-body` | Logged, with body | Debugging a specific issue. Expensive but temporary. |
| `all-no-body` | Logged, no body | Monitoring traffic patterns without storing bodies or any PII they contain. |
The design principle: logging is a latent debugging tool. Keep it cheap until the moment you need it to be loud.
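The two questions above (log at all? include the body?) reduce to one small function. This is a sketch under assumptions rather than the library's real code: `Decide` is a hypothetical name, the `/health` path is taken from the comment in the policy example, and unrecognized modes are treated as `errors-only` so a typo cannot make logging louder.

```go
package main

// LogPolicy mirrors the two fields shown in the policy example above.
type LogPolicy struct {
	SuccessMode string // "errors-only", "all-with-body", or "all-no-body"
	HealthMode  string // "skip-2xx" drops successful health checks
}

// Decide reports whether a request should be logged and, if so,
// whether its body should be included.
func Decide(p *LogPolicy, path string, status int) (logIt, withBody bool) {
	// Hard rule: errors are always logged with bodies, whatever the policy says.
	if status >= 400 {
		return true, true
	}
	// Successful health checks are pure noise when the policy says to skip them.
	if path == "/health" && p.HealthMode == "skip-2xx" && status/100 == 2 {
		return false, false
	}
	switch p.SuccessMode {
	case "all-with-body":
		return true, true
	case "all-no-body":
		return true, false
	default: // "errors-only" and anything unrecognized: stay cheap
		return false, false
	}
}
```

Note the ordering: the error check comes first, so no policy value can ever suppress a 4xx or 5xx. That is what makes the rule "hard".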
This alone was a good improvement. One deploy and we had a configurable policy. But changing the policy still required a deploy. And at 20+ services, a deploy to switch logging modes takes time nobody wants to spend at 2am during an incident.
Phase 7: The Runtime Toggle
The final piece: remote configuration.
Instead of baking the policy into the binary, store it in a remote parameter store that the library polls every 30 seconds. When you need verbose logging, you change the parameter value. Within 30 seconds, every service that imports the library picks up the change and starts logging bodies. No deploy. No restart. No YAML change. One parameter update.
# Steady state:
mode=errors-only,health=skip-2xx
# Five minutes of debugging:
mode=all-with-body,health=skip-2xx
# Back to normal:
mode=errors-only,health=skip-2xx
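Parsing that parameter string is where a typo can hurt, so it is worth being strict. Below is a minimal sketch, assuming the `key=value,key=value` format shown above; `ParsePolicy` is a hypothetical name, and only the `health` value that appears in this post (`skip-2xx`) is known, so other values are passed through unvalidated.

```go
package main

import (
	"fmt"
	"strings"
)

// LogPolicy mirrors the two fields shown in the policy example earlier.
type LogPolicy struct {
	SuccessMode string
	HealthMode  string
}

// ParsePolicy parses a string such as "mode=errors-only,health=skip-2xx".
// Keys that are absent keep the cheap defaults. Malformed pairs, unknown
// keys, and unknown modes are errors, so a typo is rejected rather than
// silently changing how much gets logged.
func ParsePolicy(s string) (LogPolicy, error) {
	p := LogPolicy{SuccessMode: "errors-only", HealthMode: "skip-2xx"}
	for _, pair := range strings.Split(s, ",") {
		key, val, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return LogPolicy{}, fmt.Errorf("malformed pair %q", pair)
		}
		switch key {
		case "mode":
			switch val {
			case "errors-only", "all-with-body", "all-no-body":
				p.SuccessMode = val
			default:
				return LogPolicy{}, fmt.Errorf("unknown mode %q", val)
			}
		case "health":
			p.HealthMode = val // not validated in this sketch
		default:
			return LogPolicy{}, fmt.Errorf("unknown key %q", key)
		}
	}
	return p, nil
}
```

On a parse error the caller should keep whatever policy it already has, which fits the last-known-good rule described next.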
The implementation details matter:
- The hot path must be lock-free. Every request reads the current policy. If that read involves a mutex or a network call, you've added latency to every request in the system. Use atomic pointer swaps—a single atomic load per request, no contention.
- The polling goroutine must fail quietly. If the parameter store is unreachable, keep the last-known-good policy. Don't crash. Don't spam. Log a rate-limited warning and try again.
- The default must be safe. If there's no parameter at all—first deploy, new region, whatever—default to errors-only. Never default to verbose.
This pattern applies beyond logging. Any configuration that you might want to change during an incident without a deploy—feature flags, rate limits, circuit breaker thresholds—can follow the same model: atomic pointer, polling goroutine, safe default.
What I'd Do Differently
If I were starting a new company tomorrow and building the first service, here's what I'd do:
Day 1: Use the language's standard library. Don't build a shared logging library when you have one service and zero users. It's premature optimization.
The moment you hit service #3: Build the shared library. One import, structured JSON, trace IDs, consistent field names. Freeze the API surface—no more new functions, no more new patterns. The library is the standard.
From that point forward: The library ships with a policy layer from day one. Errors-only by default. Bodies on demand. Health endpoints filtered. Remote config wired up even if you're not using it yet—the toggle costs nothing when idle and saves you a migration later.
The policy layer is the thing I wish I'd built first, not third. Every migration we did—removing bodies, adding bodies back, adding the toggle—would have been a parameter change instead of a code change. The library's API never needed to change. Only the behavior needed to change, and behavior should live in configuration.
The Upfront Playbook
If you take nothing else from this, take these three rules:
- The logging library should enforce standards, not offer choices. API surface matters. If there are two ways to log a trace ID, teams will pick two different ways. Give them one way.
- Configuration should live outside the binary. Environment variables for startup config (log level, output format). Remote parameters for runtime config (what to log, how much to log). A deploy is not a configuration mechanism.
- The default should be the cheapest mode that still catches failures. Errors-only with bodies. Health endpoints silent. Verbose logging is an opt-in, not an opt-out.
Final Thought
The mistake I made wasn't technical. The code was fine. The mistake was treating logging as a feature to be built rather than a policy to be designed.
Features have implementations. Policies have defaults, overrides, and lifecycle. When you build the implementation first and bolt on configurability later, you pay for it in migrations. When you build the policy layer first—even as a thin veneer over a simple implementation—you pay for it once.
Logging is the least interesting part of your system until it's the only thing standing between you and a production incident. At that moment, you don't care how elegant the code is. You care whether you can see what happened. The library that lets you toggle that visibility without a deploy is the one you'll wish you'd built on day 30.