perf: reduce GC pressure in rendering pipeline (#687)
tazjin
created
The `renderIterator` function previously created and abandoned an
extremely large number of small strings and other objects during
rendering. This led to performance issues after some time, as the GC
periodically had to collect all of these objects.
This was exacerbated by streaming model output, which leads to
extremely frequent updates.
This commit refactors `renderIterator` to avoid constructing temporary
strings. Instead, the function now performs two passes:
1. A first pass in which the "fragments" to render are collected, without
yet copying the `rendered` string or appending/prepending to it.
2. A second pass, which uses a `strings.Builder` to efficiently
construct the final output string.
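
For illustration only, here is a minimal sketch of the two-pass idea. It is not the actual `renderIterator` code: `renderNaive`, `renderTwoPass`, and the plain `[]string` input are hypothetical stand-ins, and the real function deals with more than newline-joined strings.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// renderNaive is a hypothetical stand-in for the old approach: every
// iteration prepends to the accumulated string, which allocates a new
// string and abandons the previous one.
func renderNaive(items []string) string {
	rendered := ""
	for i, item := range items {
		if i == 0 {
			rendered = item
			continue
		}
		rendered = item + "\n" + rendered
	}
	return rendered
}

// renderTwoPass is a hypothetical stand-in for the new approach.
func renderTwoPass(items []string) string {
	// Pass 1: collect the fragments and an upper bound on the output
	// size, without building any intermediate strings.
	fragments := make([]string, 0, len(items))
	total := 0
	for _, item := range items {
		fragments = append(fragments, item)
		total += len(item) + 1 // +1 for the separator
	}

	// Pass 2: write the fragments into a pre-sized strings.Builder.
	// Iterating in reverse reproduces the "prepend" order of renderNaive.
	var b strings.Builder
	b.Grow(total)
	for i := len(fragments) - 1; i >= 0; i-- {
		b.WriteString(fragments[i])
		if i > 0 {
			b.WriteByte('\n')
		}
	}
	return b.String()
}

func main() {
	items := make([]string, 100)
	for i := range items {
		items[i] = fmt.Sprintf("line %d", i)
	}

	fmt.Println("same output:", renderNaive(items) == renderTwoPass(items))

	// testing.AllocsPerRun reports the average number of heap
	// allocations per call; exact numbers depend on the Go version.
	naive := testing.AllocsPerRun(100, func() { _ = renderNaive(items) })
	twoPass := testing.AllocsPerRun(100, func() { _ = renderTwoPass(items) })
	fmt.Printf("allocs per run: naive=%.0f two-pass=%.0f\n", naive, twoPass)
}
```

In this sketch the naive version allocates once per item, while the two-pass version allocates only for the fragment slice and the builder's buffer, regardless of how many items are rendered.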
This has *significantly* improved crush's performance for me. Whereas
before `perf` would show it spending up to 70% (!) of its time in
GC-related Go runtime functions, it now spends a trivial amount there.
pprof's heap profiling previously showed `renderIterator` as a massive
hotspot, whereas it now no longer shows up in alloc `top` at all.
The updated function is slightly harder to read. I did spend some time
trying different options for making it more readable, and also asking
various LLMs about it (using crush!), but ultimately didn't find
anything better than the two-pass solution.