We Got 10× Faster - But Only After We Stopped Guessing

Performance matters as much as execution correctness. The engine builds a dependency graph and pulls data through it at runtime, handling parallelization, fallback chains, and array iteration automatically, so it is essential that this code runs efficiently.

We knew it could be faster. We had hunches. We were wrong.

This is the story of how a profiling-driven optimization effort (with an LLM helping with analysis and iteration) went from –11% (yes, slower) to ~10× — and the lessons we learned along the way.

Our first move was classic premature optimization. We looked at the code, spotted a linear scan over wire arrays, and knew it was the bottleneck. We built a WeakMap-cached DocumentIndex that pre-indexed wires by key, replacing O(n) scans with O(1) map lookups.

Result: 4–11% slower across every benchmark.

Why? The wire arrays had ~5–20 elements. The linear scan with zero-allocation field comparison was faster than constructing string keys for map lookups. And WeakMap.get() costs ~50–100ns per call (hash + GC barrier) — for 1,000 shadow trees in an array iteration, that’s 50–100µs of pure overhead that didn’t exist before.
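The trade-off can be sketched in a few lines. This is illustrative only: the `Wire` shape and function names are assumptions, not the engine's real code.

```typescript
// Illustrative sketch of the scan-vs-index trade-off; the Wire shape and
// helper names are assumptions, not the engine's real code.
interface Wire { trunk: string; field: string; target: unknown }

// Zero-allocation linear scan: for N ≈ 5–20 this is a handful of string
// comparisons per call, with no hashing, no key construction, no GC pressure.
function findWireScan(wires: Wire[], trunk: string, field: string): Wire | undefined {
  for (const w of wires) {
    if (w.trunk === trunk && w.field === field) return w;
  }
  return undefined;
}

// Indexed lookup: Map.get is O(1), but every call first allocates a
// composite string key. That allocation is the hidden constant factor
// that made our "faster" version slower.
function findWireIndexed(index: Map<string, Wire>, trunk: string, field: string): Wire | undefined {
  return index.get(`${trunk}:${field}`);
}
```

Both return the same result; for small N, the scan wins on constant factors.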

We tried again: wire index by trunk key. A Map<string, Wire[]> to avoid repeated filtering.

Result: 10–23% slower. Same root cause — string allocation in a tight loop loses to a short linear scan.

Two failed optimizations before we had a single win. That’s when something clicked.

We stopped theorizing and built proper infrastructure:

  • 14 micro-benchmarks covering parsing, passthrough, tool chains, flat arrays (10/100/1,000 elements), nested arrays, and per-element tool calls — using tinybench (see the benchmarks)
  • CI tracking via Bencher so regressions are caught automatically
  • V8 tick profiling, CPU profiling, deoptimization analysis, and flamegraph tooling — all documented in our profiling guide

The profiling guide is intentionally “copy/paste friendly”. It’s structured so the next session (human or LLM) can pick up where we left off, run the same commands, and compare against the same baseline.

We also added a dedicated “Tips for LLM Agents” section with explicit guardrails (baseline first, read the perf history, watch for deopts), plus a running performance log that includes the failures.

We generated V8 tick profiles and fed the raw output into the LLM. It pointed us to where the microtask queue was flooding: materializeShadows at 3.4% of total ticks, pullSingle at 3.9%. Those functions became our targets (both live in the runtime core: ExecutionTree.ts).

Once we had measurements, every optimization was a hypothesis-test cycle: profile → propose a change → predict impact → verify with the benchmark suite. Here’s what actually held up in numbers:

Batch element materialization (#9): Instead of processing array elements one at a time with separate Promise.all calls, we pre-grouped wires by output path and resolved all element fields in a single flat loop. +44–130% on array benchmarks.
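The shape of that change looks roughly like this. The names (`Wire`, `outputPath`, `resolve`) are illustrative stand-ins, not the engine's real API, and the "after" version assumes resolvers are synchronous:

```typescript
// Hedged sketch of the batching idea; Wire, outputPath, and resolve are
// illustrative names, not the engine's real API.
interface Wire { outputPath: string; resolve: (el: unknown) => unknown }

// Before: one Promise.all per element, so 1,000 elements meant 1,000
// separate microtask batches inside an outer Promise.all.
async function materializePerElement(elements: unknown[], wires: Wire[]) {
  return Promise.all(elements.map(el =>
    Promise.all(wires.map(w => w.resolve(el)))));
}

// After: group wires by output path once, then fill every element's
// fields in a single flat, synchronous loop.
function materializeBatched(elements: unknown[], wires: Wire[]): unknown[][] {
  const byPath = new Map<string, Wire[]>();
  for (const w of wires) {
    const group = byPath.get(w.outputPath);
    if (group) group.push(w);
    else byPath.set(w.outputPath, [w]);
  }
  const out: unknown[][] = [];
  for (const el of elements) {
    const row: unknown[] = [];
    for (const group of byPath.values()) {
      for (const w of group) row.push(w.resolve(el));
    }
    out.push(row);
  }
  return out;
}
```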

Sync fast-path (#10): The engine used async/await everywhere, but most values were already resolved synchronously. Before this fix, traversing a 1,000-item array with 3 fields per item forced V8 to suspend the execution context and queue a microtask over 6,000 times — the engine was spending most of its CPU cycles just managing the microtask queue instead of accessing the data. We introduced a MaybePromise<T> type and rewrote the hot path to avoid constructing Promises when values are ready. +42–114% on arrays, +8–17% on everything else.
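A minimal version of that pattern, reduced to its core (the real hot path does more, but the idea is this small):

```typescript
// Minimal sketch of the MaybePromise<T> fast-path described above.
type MaybePromise<T> = T | Promise<T>;

// Cheap thenable check; avoids instanceof so it works across realms.
function isPromise<T>(v: MaybePromise<T>): v is Promise<T> {
  return v != null && typeof (v as Promise<T>).then === "function";
}

// Return synchronously when the value is already resolved; only touch
// the microtask queue when the value genuinely is a Promise.
function mapMaybe<T, U>(v: MaybePromise<T>, f: (t: T) => U): MaybePromise<U> {
  return isPromise(v) ? v.then(f) : f(v);
}
```

Callers that would have written `await resolve(x)` instead call `mapMaybe(resolve(x), use)`, and the common already-resolved case never suspends the execution context.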

Pre-computed keys (#11): For a 1,000-element array with 3 fields, we were constructing 3,000 template-literal keys per execution. Caching them on the AST nodes using Symbol keys eliminated that entirely. Why Symbols? If you add a new string property to an object at runtime, V8 is forced to create a new “Hidden Class” (Shape) and deoptimizes the object. By using Symbol keys, we attached our cache to the AST nodes without triggering a deopt penalty across the rest of the parsing phase. +60–129% on arrays.
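The caching trick itself is small. This is a hedged sketch: the AST node shape is invented for illustration, and only the Symbol-keyed cache pattern is the point.

```typescript
// Sketch of Symbol-keyed key caching; the AstNode shape is illustrative,
// not the engine's real AST.
const KEY_CACHE = Symbol("keyCache");

interface AstNode {
  type: string;
  field: string;
  [KEY_CACHE]?: string;
}

function elementKey(node: AstNode): string {
  // First access computes and stores the key; every later access skips
  // the template-literal allocation entirely.
  return node[KEY_CACHE] ?? (node[KEY_CACHE] = `${node.type}:${node.field}`);
}
```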

De-async schedule (#12): Same pattern — schedule() was wrapped in async () => { ... } even for synchronous operations. Removing the async wrapper eliminated 2 microtask hops per tool call. +11–18% on tool-calling benchmarks.
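Reduced to its essence, the before/after looks like this (the `schedule` names are illustrative, not the real function):

```typescript
// Illustrative before/after for the async wrapper; names are assumptions.

// Before: wrapping in async always allocates a Promise and queues
// microtasks, even when task() completes synchronously.
function scheduleAsync(task: () => unknown): Promise<unknown> {
  return (async () => task())();
}

// After: a synchronous task's result stays synchronous; only genuinely
// async work ever touches the microtask queue.
function scheduleSync(task: () => unknown): unknown {
  return task();
}
```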

Each optimization built on the previous one. The sync fast-path (#10) only unlocked its full potential because batch materialization (#9) had already restructured the inner loop. The key caching (#11) mattered because the sync path was now executing fast enough that string allocation became the new bottleneck.

Our stress benchmark — flat array with 1,000 elements:

Stage                                ops/sec    vs. start
Before any optimization                  258     baseline
After batch materialization (#9)         594        +130%
After sync fast-path (#10)             1,336        +418%
After key caching (#11)                3,064      +1,088%
Current (after #12)                    2,980      +1,055%

(The slight dip in #12 on the array benchmark is standard variance; #12 specifically optimized the tool-calling benchmarks, where it delivered +11–18%, not arrays.)

Over 10× throughput improvement on the array hot path. Non-array benchmarks also moved up (often in the ~10–20% range), but arrays were the clear winner because they hit the shadow-tree/microtask hot path.

1. Measure first, optimize second. Our first two optimizations made things slower because we optimized for theoretical complexity instead of actual bottlenecks. A WeakMap lookup isn’t free. A Map with string keys isn’t free. For small N, the constant factors dominate.

2. The async tax is real. JavaScript’s async/await creates microtask queue entries even when values are already available. On a hot path executing hundreds of thousands of times per second, each unnecessary await costs measurable throughput. The MaybePromise<T> pattern — check with isPromise(), return synchronously when possible — was our single biggest win.

3. Compound improvements beat silver bullets. No single optimization delivered 10×. But six targeted changes, each building on the previous, compounded to 10×+. The key was that each change revealed the next bottleneck. You can’t see that string allocation matters until you’ve eliminated the async overhead masking it.

4. LLMs are good at “reading the tools.” The LLM didn’t magically make anything fast — it helped run the profilers, summarize tick distributions, and keep hypotheses honest. When predictions were wrong (two reverted optimizations), the failure analysis became some of our best documentation.

5. Document failures, not just wins. Our performance log has 15 entries — 2 failed, 10 succeeded, 3 planned. The failed entries are longer and more detailed than the successful ones. They explain why something that should work in theory doesn’t work in practice. Next time someone proposes a WeakMap cache on the hot path, they’ll find the evidence right there.

6. The V8 anti-patterns we found. Through our profiling, we identified patterns that silently kill JavaScript performance at scale:

  • The Object Spread Tax: Using { ...parentObj, newProp: value } inside a hot loop creates a full object copy and triggers heavy GC pressure. Mutating safely or using Object.create() is significantly faster.
  • The Dynamic Map Key Trap: Constructing template strings (`${type}:${field}`) just to use them as Map keys is slower than doing an O(N) linear array scan for small arrays (N ≤ 20).
  • Nested Promise.all: Calling Promise.all inside another Promise.all creates massive microtask overhead. Flattening promises into a single 1D array before awaiting them shaved hundreds of microseconds off our execution.
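The nested Promise.all fix is easy to show in isolation. This is a generic sketch of the flattening pattern, not our engine's actual code:

```typescript
// Generic sketch of flattening nested Promise.all.

// Nested form: one Promise.all per group plus an outer one; each layer
// adds its own promise allocations and microtask hops.
async function resolveNested<T>(groups: Promise<T>[][]): Promise<T[][]> {
  return Promise.all(groups.map(g => Promise.all(g)));
}

// Flattened form: await a single 1D array once, then re-slice the
// results by the original group lengths.
async function resolveFlat<T>(groups: Promise<T>[][]): Promise<T[][]> {
  const flat = await Promise.all(groups.flat());
  const out: T[][] = [];
  let i = 0;
  for (const g of groups) out.push(flat.slice(i, (i += g.length)));
  return out;
}
```

Both return the same nested result; the flattened version just pays for one await instead of one per group.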

These aren’t exotic edge cases. They’re patterns that show up in almost every Node.js codebase — you just don’t notice them until you profile.

For anyone looking to do this in their own codebase:

  1. Build benchmarks first. Cover your hot paths with micro-benchmarks. We use tinybench with CI tracking — the tooling is simple.
  2. Profile before theorizing. V8’s built-in tick profiler (--prof) is free and gives you the actual breakdown. Flamegraphs make it visual.
  3. Make your profiling docs AI-readable. Structure them with exact commands, expected outputs, and baseline numbers. When you start a new AI session, it can pick up immediately.
  4. Track everything. Every attempt, every failure, every before/after table. The compound story is more valuable than any single result.
  5. Commit one optimization at a time. Each with its own benchmark run. When something breaks, you know exactly which change caused it.
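A minimal version of step 2 with Node's built-in tick profiler looks like this; `/tmp/bench.js` is a throwaway stand-in for your real benchmark:

```shell
# Minimal V8 tick-profiling loop (Node's built-in --prof / --prof-process).
echo 'let s = 0; for (let i = 0; i < 1e7; i++) s += i; console.log(s);' > /tmp/bench.js
node --prof /tmp/bench.js                           # writes isolate-0x*-v8.log in the cwd
node --prof-process isolate-0x*-v8.log > ticks.txt  # human-readable tick summary
head -n 5 ticks.txt                                 # where CPU time actually went
```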

Performance work doesn’t have to be a dark art. With the right infrastructure, it becomes a repeatable, teachable process.