Frame Allocators
This section explains how coroutine frames are allocated and how to customize allocation for performance.
Prerequisites
- Completed Concurrent Composition
- Understanding of coroutine frame allocation from C++20 Coroutines Tutorial
The Timing Constraint
Coroutine frame allocation has a unique constraint: memory must be allocated before the coroutine body begins executing. The standard C++ mechanism—promise type’s operator new—is called before the promise is constructed.
This creates a challenge: how can a coroutine use a custom allocator when the allocator might be passed as a parameter, which is stored in the frame?
Thread-Local Propagation
Capy solves this with thread-local propagation:
- Before evaluating the task argument, run_async sets a thread-local allocator
- The task’s operator new reads this thread-local allocator
- The task stores the allocator in its promise for child propagation
This is why run_async uses two-call syntax:
run_async(executor)(my_task());
//        ↑            ↑
//   1. Sets TLS   2. Task allocated
//                    using TLS allocator
The Window
The "window" is the interval between setting the thread-local allocator and the coroutine’s first suspension point. During this window:
- The task is allocated using the TLS allocator
- The task captures the TLS allocator in its promise
- Child tasks inherit the allocator
After the window closes (at the first suspension), the TLS allocator may be restored to a previous value. The task retains its captured allocator regardless.
TLS Preservation
Between a coroutine’s await_resume (which sets TLS to the correct allocator) and the next child coroutine invocation (whose operator new reads TLS), arbitrary user code runs. If that code resumes a coroutine from a different chain on the same thread — by calling .resume() directly, pumping a completion queue, or running nested dispatch — the other coroutine’s await_resume overwrites TLS with its own allocator. The original coroutine’s next child would then allocate from the wrong resource.
To prevent this, any code that calls .resume() on a coroutine handle must use safe_resume from <boost/capy/ex/frame_allocator.hpp>:
// In your event loop or dispatch path:
capy::safe_resume(h); // saves and restores TLS around h.resume()
safe_resume saves the current thread-local allocator, calls h.resume(), then restores the saved value. This makes TLS behave like a stack: nested resumes cannot spoil the outer value. All of Capy’s built-in executors (thread_pool, strands, blocking_context) use safe_resume internally. Custom executor event loops must do the same — see Custom Executor for an example.
The FrameAllocator Concept
Custom allocators must satisfy the FrameAllocator concept, which is compatible with C++ allocator requirements:
template<typename A>
concept FrameAllocator = requires {
    typename A::value_type;
} && requires(A& a, std::size_t n) {
    { a.allocate(n) } -> std::same_as<typename A::value_type*>;
    { a.deallocate(std::declval<typename A::value_type*>(), n) };
};
In practice, any standard allocator works.
Using Custom Allocators
With run_async
Pass an allocator to run_async:
std::pmr::monotonic_buffer_resource resource;
std::pmr::polymorphic_allocator<std::byte> alloc(&resource);
run_async(executor, alloc)(my_task());
Or pass a memory_resource* directly:
std::pmr::monotonic_buffer_resource resource;
run_async(executor, &resource)(my_task());
Recycling Allocator
Capy provides recycling_memory_resource, a memory resource optimized for coroutine frames:
- Maintains freelists by size class
- Reuses recently freed blocks (cache-friendly)
- Falls back to the upstream allocator for new sizes
This allocator is used by default for thread_pool and other execution contexts.
HALO Optimization
Heap Allocation eLision Optimization (HALO) allows the compiler to allocate coroutine frames on the stack instead of the heap when:
- The coroutine’s lifetime is provably contained in the caller’s
- The frame size is known at compile time
- Optimization is enabled
Capy’s task type carries an annotation marking awaited tasks as candidates for this elision:
template<typename T = void>
struct [[nodiscard]] BOOST_CAPY_CORO_AWAIT_ELIDABLE
task
{
    // ...
};
Best Practices
Use Default Allocators
For most applications, the default recycling allocator provides good performance without configuration.
Consider Memory Resources for Batched Work
When launching many short-lived tasks together, a monotonic buffer resource can be efficient:
void process_batch(std::vector<item> const& items)
{
    std::array<std::byte, 64 * 1024> buffer;
    std::pmr::monotonic_buffer_resource resource(
        buffer.data(), buffer.size());

    for (auto const& item : items)
    {
        run_async(executor, &resource)(process(item));
    }
    // All frames deallocated when resource goes out of scope
}
Scope Variables to Reduce Frame Size
Compilers use declaration scope (braces) to decide which variables cross suspend points and must live in the coroutine frame. Variables declared in an outer scope remain in the frame even after their last use, as long as a co_await follows within the same scope.
Wrapping buffer usage in explicit braces can dramatically reduce frame size:
// BAD: buf lives in frame across all subsequent co_awaits
task<> process(stream& s)
{
    char buf[4096];
    auto [ec, n] = co_await s.read_some(buf);
    co_await do_work(buf, n);
    co_await s.write_some(reply);   // buf wastes 4K in frame
}
// GOOD: braces end buf's lifetime before next suspend
task<> process(stream& s)
{
    std::size_t n;
    {
        char buf[4096];
        auto [ec, n_] = co_await s.read_some(buf);
        n = n_;
        co_await do_work(buf, n);
    }
    co_await s.write_some(reply);   // 4K saved
}
This technique also enables the compiler to overlap variables in the frame. When two variables have completely non-overlapping lifetimes (in separate scoped blocks), the compiler can reuse the same frame memory for both — even on Clang:
// BAD: both arrays in frame simultaneously (8K)
task<> pipeline(stream& in, stream& out)
{
    char read_buf[4096];
    auto [ec1, n] = co_await in.read_some(read_buf);
    char write_buf[4096];
    prepare(write_buf, read_buf, n);
    co_await out.write_some(write_buf);
}
// GOOD: non-overlapping scopes allow frame reuse (4K)
task<> pipeline(stream& in, stream& out)
{
    std::size_t n;
    {
        char read_buf[4096];
        auto [ec, n_] = co_await in.read_some(read_buf);
        n = n_;
    }
    {
        char write_buf[4096];
        prepare(write_buf, n);
        co_await out.write_some(write_buf);
    }
}
In the second version, read_buf and write_buf never coexist, so the compiler can place them at the same frame offset — halving the frame’s buffer footprint. This optimization applies to any variables with non-overlapping lifetimes, not just arrays.
GCC vs Clang Frame Sizes
Note: This section draws on "C++20 Coroutines from compiler and library authors' perspective" by Chuanqi Xu.
GCC and Clang use fundamentally different strategies for coroutine frame layout:
- Clang performs frame layout after middle-end optimizations. Dead variables, unused temporaries, and constant-folded intermediates are eliminated before the frame is sized.
- GCC performs frame layout in the frontend, before optimizations. Every local variable whose scope spans a suspend point ends up in the frame, even if optimizations would later prove it dead.
The practical consequence is that GCC coroutine frames are often 5-10x larger than Clang’s for the same source code. In one benchmark, the same coroutine produced a 24-byte frame on Clang and a 16,032-byte frame on GCC.
For production coroutine workloads, Clang currently produces substantially better code. If you must use GCC, pay extra attention to variable scoping (above) and consider supplying a custom memory_resource with larger block sizes, since frames above 2048 bytes bypass the default recycling allocator’s pooling.
Reference
| Header | Description |
|---|---|
| <boost/capy/ex/frame_allocator.hpp> | Frame allocator concept and utilities |
|  | Mixin base for promise types that use the TLS frame allocator |
|  | Default recycling allocator implementation |
You have now learned how coroutine frame allocation works and how to customize it. Continue to Lambda Coroutine Captures to learn about a critical pitfall with lambda coroutines.