Frame Allocators
This section explains how coroutine frames are allocated and how to customize allocation for performance.
Prerequisites
- Completed Concurrent Composition
- Understanding of coroutine frame allocation from C++20 Coroutines Tutorial
The Timing Constraint
Coroutine frame allocation has a unique constraint: memory must be allocated before the coroutine body begins executing. The standard C++ mechanism—promise type’s operator new—is called before the promise is constructed.
This creates a challenge: how can a coroutine use a custom allocator when the allocator might be passed as a parameter, which is stored in the frame?
Thread-Local Propagation
Capy solves this with thread-local propagation:
- Before evaluating the task argument, run_async sets a thread-local allocator
- The task’s operator new reads this thread-local allocator
- The task stores the allocator in its promise for child propagation
This is why run_async uses two-call syntax:
run_async(executor)(my_task());
//        ↑            ↑
//   1. Sets TLS   2. Task allocated
//                    using TLS allocator
The Window
The "window" is the interval between setting the thread-local allocator and the coroutine’s first suspension point. During this window:
- The task is allocated using the TLS allocator
- The task captures the TLS allocator in its promise
- Child tasks inherit the allocator
After the window closes (at the first suspension), the TLS allocator may be restored to a previous value. The task retains its captured allocator regardless.
TLS Preservation
Between a coroutine’s await_resume (which sets TLS to the correct allocator) and the next child coroutine invocation (whose operator new reads TLS), arbitrary user code runs. If that code resumes a coroutine from a different chain on the same thread — by calling .resume() directly, pumping a completion queue, or running nested dispatch — the other coroutine’s await_resume overwrites TLS with its own allocator. The original coroutine’s next child would then allocate from the wrong resource.
To prevent this, any code that calls .resume() on a coroutine handle must use safe_resume from <boost/capy/ex/frame_allocator.hpp>:
// In your event loop or dispatch path:
capy::safe_resume(h); // saves and restores TLS around h.resume()
safe_resume saves the current thread-local allocator, calls h.resume(), then restores the saved value. This makes TLS behave like a stack: nested resumes cannot spoil the outer value. All of Capy’s built-in executors (thread_pool, strands, blocking_context) use safe_resume internally. Custom executor event loops must do the same — see Custom Executor for an example.
The FrameAllocator Concept
Custom allocators must satisfy the FrameAllocator concept, which is compatible with C++ allocator requirements:
template<typename A>
concept FrameAllocator = requires {
    typename A::value_type;
} && requires(A& a, std::size_t n) {
    { a.allocate(n) } -> std::same_as<typename A::value_type*>;
    { a.deallocate(std::declval<typename A::value_type*>(), n) };
};
In practice, any standard allocator works.
Using Custom Allocators
With run_async
Pass an allocator to run_async:
std::pmr::monotonic_buffer_resource resource;
std::pmr::polymorphic_allocator<std::byte> alloc(&resource);
run_async(executor, alloc)(my_task());
Or pass a memory_resource* directly:
std::pmr::monotonic_buffer_resource resource;
run_async(executor, &resource)(my_task());
Recycling Allocator
Capy provides recycling_memory_resource, a memory resource optimized for coroutine frames:
- Maintains freelists by size class
- Reuses recently freed blocks (cache-friendly)
- Falls back to the upstream allocator for new sizes
This allocator is used by default for thread_pool and other execution contexts.
HALO Optimization
Heap Allocation eLision Optimization (HALO) allows the compiler to allocate coroutine frames on the stack instead of the heap when:
- The coroutine’s lifetime is provably contained in the caller’s
- The frame size is known at compile time
- Optimization is enabled
Capy’s task type carries an annotation marking awaited tasks as candidates for this elision:
template<typename T = void>
struct [[nodiscard]] BOOST_CAPY_CORO_AWAIT_ELIDABLE
task
{
    // ...
};
Best Practices
Use Default Allocators
For most applications, the default recycling allocator provides good performance without configuration.
Consider Memory Resources for Batched Work
When launching many short-lived tasks together, a monotonic buffer resource can be efficient:
void process_batch(std::vector<item> const& items)
{
    std::array<std::byte, 64 * 1024> buffer;
    std::pmr::monotonic_buffer_resource resource(
        buffer.data(), buffer.size());

    for (auto const& item : items)
    {
        run_async(executor, &resource)(process(item));
    }
    // All frames deallocated when resource goes out of scope
}
Scope Variables to Reduce Frame Size
Compilers use declaration scope (braces) to decide which variables cross suspend points and must live in the coroutine frame. Variables declared in an outer scope remain in the frame even after their last use, as long as a co_await follows within the same scope.
Wrapping buffer usage in explicit braces can dramatically reduce frame size:
// BAD: buf lives in frame across all subsequent co_awaits
task<> process(stream& s)
{
    char buf[4096];
    auto [ec, n] = co_await s.read_some(buf);
    co_await do_work(buf, n);
    co_await s.write_some(reply);   // buf wastes 4K in frame
}
// GOOD: braces end buf's lifetime before next suspend
task<> process(stream& s)
{
    std::size_t n;
    {
        char buf[4096];
        auto [ec, n_] = co_await s.read_some(buf);
        n = n_;
        co_await do_work(buf, n);
    }
    co_await s.write_some(reply);   // 4K saved
}
This technique also enables the compiler to overlap variables in the frame. When two variables have completely non-overlapping lifetimes (in separate scoped blocks), the compiler can reuse the same frame memory for both — even on Clang:
// BAD: both arrays in frame simultaneously (8K)
task<> pipeline(stream& in, stream& out)
{
    char read_buf[4096];
    auto [ec1, n] = co_await in.read_some(read_buf);
    char write_buf[4096];
    prepare(write_buf, read_buf, n);
    co_await out.write_some(write_buf);
}
// GOOD: non-overlapping scopes allow frame reuse (4K)
task<> pipeline(stream& in, stream& out)
{
    std::size_t n;
    {
        char read_buf[4096];
        auto [ec, n_] = co_await in.read_some(read_buf);
        n = n_;
    }
    {
        char write_buf[4096];
        prepare(write_buf, n);
        co_await out.write_some(write_buf);
    }
}
In the second version, read_buf and write_buf never coexist, so the compiler can place them at the same frame offset — halving the frame’s buffer footprint. This optimization applies to any variables with non-overlapping lifetimes, not just arrays.
GCC vs Clang Frame Sizes
Note: This section draws on "C++20 Coroutines from compiler and library authors' perspective" by Chuanqi Xu.
GCC and Clang use fundamentally different strategies for coroutine frame layout:
- Clang performs frame layout after middle-end optimizations. Dead variables, unused temporaries, and constant-folded intermediates are eliminated before the frame is sized.
- GCC performs frame layout in the frontend, before optimizations. Every local variable whose scope spans a suspend point ends up in the frame, even if optimizations would later prove it dead.
The practical consequence is that GCC coroutine frames are often 5-10x larger than Clang’s for the same source code. In one benchmark, the same coroutine produced a 24-byte frame on Clang and a 16,032-byte frame on GCC.
For production coroutine workloads, Clang currently produces substantially better code. If you must use GCC, pay extra attention to variable scoping (above) and consider supplying a custom memory_resource with larger block sizes, since frames above 2048 bytes bypass the default recycling allocator’s pooling.
Reference
| Header | Description |
|---|---|
| <boost/capy/ex/frame_allocator.hpp> | Frame allocator concept and utilities |
|  | Mixin base for promise types that use the TLS frame allocator |
|  | Default recycling allocator implementation |
You have now learned how coroutine frame allocation works and how to customize it. Continue to Lambda Coroutine Captures to learn about a critical pitfall with lambda coroutines.