Notes on Performance
Most performance advice is folklore dressed up as engineering.
"Protobuf is always faster than JSON." "ORMs are slow." "You need Redis for speed." These statements are directionally true and practically useless. They don't tell you how much faster. They don't tell you under what conditions. They don't tell you when the optimization costs more than the problem.
I spent years tuning JVM applications—GC pauses, heap sizing, JIT warmup, class loading overhead—and developed a certain relationship with performance. It's adversarial. The platform is powerful but opaque. You measure, tweak, measure again, and hope the next GC cycle doesn't ruin your p99.
Go felt different from the first week. Compiles to a static binary. No VM. No warmup. No class loader. A goroutine is a few kilobytes, not a megabyte thread. The runtime does less, which means it hides less. If something is slow, you can usually trace it to your code or your dependencies.
But "less opaque" isn't the same as "fast by default." The first time I wrote a nontrivial Go service and put it under load, I learned that quickly.
This series is about what happened next: the assumptions I destroyed, the things I measured, and what I now believe about writing Go that doesn't just work—but works under real conditions.
Why I Started Measuring
I was building a trading bot for Polymarket prediction markets. The idea was straightforward: when Chainlink's oracle says BTC is up 0.15% from the start price with 90 seconds left, the YES shares should trade near $0.99. But markets don't price in oracle data instantly. Sometimes there's a lag of several seconds—seconds where the trade is still profitable.
The bot needed to be fast because the window was small, but not microseconds fast. What it actually needed was predictable: feed five data sources into a single decision loop every 500ms, evaluate 13 signals through 6 rejection gates, and if everything aligns, place a fill-or-kill order. The hard part wasn't low-level optimization. It was making sure the system degraded gracefully when a feed went stale, tracked its own positions through resolution, and never took a trade it couldn't explain afterward.
I wrote it as a single Go binary. No Kubernetes. No message queue. Just goroutines, a mutex-guarded state store, and a 500ms tick. And it worked, not because Go is magic, but because Go's concurrency model and predictable runtime (no JIT warmup, no class loading) map cleanly to that kind of problem.
But along the way, I kept hitting questions I couldn't answer from documentation:
- How much does Protobuf actually save over JSON for a typical trading payload?
- What happens to latency when I switch from NATS pub/sub to JetStream with persistence?
- Is GORM's query builder costing me more than I think?
- Which JSON encoder actually holds up under sustained load?
So I built test harnesses. Not quick benchmarks I'd throw away—structured, reproducible comparisons I could return to. What started as curiosity became a recurring practice. When I have a performance question now, I measure it. Thoroughly. In isolation. With the raw data published alongside the summary.
What This Series Covers
Here's what I've measured so far, and what's coming:
Part 1: Messaging Systems — NATS vs Kafka vs Redpanda
50,000 messages across 4 systems and 2 serialization formats. NATS Core at 73 µs median publish latency. JetStream's persistence overhead: not the 2–3× I assumed, but ~12%. Redpanda's advantage over Apache Kafka at tail latencies—20% faster at median, 2× faster at p99. And the pattern I didn't expect: Protobuf's advantage over JSON grows with broker overhead.
Part 2: Database Access in Go — sqlc+pgx vs GORM
11 queries on an e-commerce schema with 4.1 million rows. The difference between json_agg in a single query and three separate Preload calls. Why memory allocations per operation matter more than the raw query time. And the architecture differences that make pgxpool a meaningfully different engine from database/sql.
Part 3: JSON Encoding in Go — stdlib vs easyjson vs sonic vs goccy
Microbenchmarks on a 3.5KB nested struct. easyjson hitting 962 ns per marshal—4.45× faster than stdlib. Why sonic wins on amd64 but falls back to reflection on arm64 (its JIT uses AVX2 instructions that don't exist on Apple Silicon). And when the best choice is the one that requires no code generation at all.
And after that
Connection pooling under contention. Goroutine scheduling patterns. Memory allocator behavior when your heap grows in unexpected ways. Whatever breaks next.
How I Run These
Every benchmark follows the same rules, and the repositories are public:
Systems are tested one at a time. When I compare NATS to Kafka, only one Docker container is running. No resource contention between the things being compared.
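Data is deterministic. The payload generator uses a fixed random seed. The e-commerce seeder uses gofakeit with seed=42. Run the harness twice and it generates the same payloads and the same rows, so any difference between runs comes from the system under test, not from the data.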
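As a minimal sketch of seeded generation, using only the standard library rather than gofakeit: a generator that owns its own seeded source produces the same sequence for the same seed. The Order type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Order is a hypothetical payload shape used only for this example.
type Order struct {
	ID     int
	Symbol string
	Price  float64
	Qty    int
}

// GenerateOrders builds n orders from a private, seeded source, so the
// output depends only on (seed, n) and not on global random state.
func GenerateOrders(seed int64, n int) []Order {
	rng := rand.New(rand.NewSource(seed))
	symbols := []string{"BTC", "ETH", "SOL"}
	out := make([]Order, n)
	for i := range out {
		out[i] = Order{
			ID:     i,
			Symbol: symbols[rng.Intn(len(symbols))],
			Price:  100 + rng.Float64()*900,
			Qty:    1 + rng.Intn(100),
		}
	}
	return out
}

func main() {
	a := GenerateOrders(42, 3)
	b := GenerateOrders(42, 3)
	// Same seed, same data: each pair compares equal.
	fmt.Println(a[0] == b[0], a[1] == b[1], a[2] == b[2])
}
```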
Metrics have clear definitions. Producer latency is time.Since(start), with start captured immediately before the publish call. End-to-end latency is the consumer's clock minus the producer's embedded timestamp. If a measurement isn't defined precisely, I don't report it.
Raw results are published. Every benchmark writes JSON with the full sample array. You can compute your own percentiles if you disagree with mine.
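For readers who do want to recompute percentiles from a sample array, here is one hedged way to do it. The Result type and its JSON field names are hypothetical, not the harness's actual output schema, and nearest-rank is only one of several percentile definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"sort"
)

// Result is a hypothetical output record: a label plus every raw sample.
type Result struct {
	Name      string  `json:"name"`
	SamplesNs []int64 `json:"samples_ns"`
}

// Percentile returns the nearest-rank percentile p (0 < p <= 100) of samples.
// It sorts a copy, so the caller's slice keeps its original order.
func Percentile(samples []int64, p float64) int64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int64(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	r := Result{Name: "example", SamplesNs: []int64{50, 10, 40, 20, 30}}
	out, err := json.Marshal(r)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("p50:", Percentile(r.SamplesNs, 50), "p99:", Percentile(r.SamplesNs, 99))
}
```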
I warm up before measuring. The database benchmark runs 5% of the total count before starting the clock. The JSON benchmark uses Go's testing.Benchmark() for auto-tuned iterations per cell.
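Both approaches can be sketched in a few lines. MeasureWithWarmup is a hypothetical helper that mirrors the 5% rule described above; testing.Benchmark is the standard library function, which picks the iteration count itself. The Payload type is a small stand-in, not the 3.5KB struct from the JSON post.

```go
package main

import (
	"encoding/json"
	"fmt"
	"testing"
	"time"
)

// Payload is a small stand-in struct for this example.
type Payload struct {
	ID    int      `json:"id"`
	Items []string `json:"items"`
}

// MeasureWithWarmup runs op for 5% of total without timing it, then
// times total further runs and returns one duration per run.
func MeasureWithWarmup(total int, op func()) []time.Duration {
	for i := 0; i < total*5/100; i++ {
		op() // warmup: results discarded
	}
	samples := make([]time.Duration, 0, total)
	for i := 0; i < total; i++ {
		start := time.Now()
		op()
		samples = append(samples, time.Since(start))
	}
	return samples
}

func main() {
	p := Payload{ID: 1, Items: []string{"a", "b"}}

	samples := MeasureWithWarmup(1000, func() { _, _ = json.Marshal(p) })
	fmt.Println("timed samples:", len(samples))

	// testing.Benchmark grows b.N until the run is long enough to trust.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := json.Marshal(p); err != nil {
				b.Fatal(err)
			}
		}
	})
	fmt.Println(res.N, "iterations,", res.NsPerOp(), "ns/op,", res.AllocsPerOp(), "allocs/op")
}
```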
What I'm Not Saying
I'm not declaring that one tool is "better" than another. NATS is great. Kafka is great. Redpanda is great. sqlc and GORM solve different problems. The question is: under what conditions, with what tradeoffs, and for what specific workload?
I'm also not claiming these numbers will match your system exactly. They won't. Your network topology, your data shape, and your access patterns are different from mine. The value isn't the exact number. It's the shape of the tradeoff. To know that each GORM Preload issues its own extra query, not as a rumor but as something you can see in the data. To know that Redpanda's latency advantage over Kafka widens the further out in the tail you go. To know why something is faster, not just that it is.
The Thing I Keep Coming Back To
After all these benchmarks, all the hours staring at flame graphs and p99s and allocation profiles, the single most useful skill hasn't been knowing which tool is fastest. It's been developing an instinct for where speed matters and where it doesn't.
The Polymarket bot doesn't need to be fast in the microsecond sense. It needs to be correct. The 500ms tick is an eternity in CPU time, with plenty of room for six gates, signal computation, and order construction. The risk isn't being too slow; it's taking a bad trade. The system is fast enough because I defined "fast enough" before I started optimizing.
But the logging utility I wrote for 20+ microservices? That one needed to be fast on the hot path—every request reads the logging policy, so that read must be a single atomic pointer load with no mutex and no allocation. Get that wrong and you've added measurable latency to every request in the system. It's a tiny cost per call, multiplied by millions.
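A minimal sketch of that read path, assuming Go 1.19 or later for atomic.Pointer. The Policy fields and the PolicyHolder name are hypothetical, not the utility's real API. Writers publish a fresh immutable value; readers do one atomic load and never lock.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Policy is a hypothetical logging policy. Readers must treat it as read-only.
type Policy struct {
	MinLevel   int
	SampleRate float64
}

// PolicyHolder publishes the current policy through an atomic pointer.
type PolicyHolder struct {
	p atomic.Pointer[Policy]
}

func NewPolicyHolder(initial Policy) *PolicyHolder {
	h := &PolicyHolder{}
	h.Store(initial)
	return h
}

// Load is the hot path: a single atomic pointer load, no mutex, no allocation.
func (h *PolicyHolder) Load() *Policy { return h.p.Load() }

// Store is the cold path: it allocates a copy and swaps the pointer, so
// readers see either the old policy or the new one, never a partial update.
func (h *PolicyHolder) Store(next Policy) { h.p.Store(&next) }

// Enabled reports whether a message at the given level should be logged.
func (h *PolicyHolder) Enabled(level int) bool { return level >= h.Load().MinLevel }

func main() {
	h := NewPolicyHolder(Policy{MinLevel: 2, SampleRate: 1})
	fmt.Println(h.Enabled(1), h.Enabled(2))
	h.Store(Policy{MinLevel: 0, SampleRate: 1}) // e.g. a config reload
	fmt.Println(h.Enabled(1))
}
```

The allocation moves to Store, which runs only when the policy changes, so the per-request cost stays at one pointer load.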
The difference between the two isn't in the code. It's in understanding what each system costs and what it can afford. That understanding is what I'm trying to document here.
The posts go up over the next few weeks. Start with Part 1.