In C, when I pass a pointer to a function, the compiler always seems to assume that the data pointed to by that pointer might be continuously modified in another thread, even though in actual API usage patterns this is usually not the case.

Problem Description

Consider the following typical API usage pattern:

int create_xxx(int *p_xxx);
int xxx_do_something(int xxx);

int entry() {
    int xxx;

    create_xxx(&xxx);
    
    xxx_do_something(xxx);
    xxx_do_something(xxx);
    xxx_do_something(xxx);
    
    return 0;
}

When compiled with gcc -S -Ofast, the generated assembly code shows that the compiler reloads the value of xxx from the stack each time xxx_do_something is called. (Clang and MSVC are essentially equivalent; Godbolt.)

entry:
    subq    $24, %rsp
    leaq    12(%rsp), %rdi
    call    xxx_create
    movl    12(%rsp), %edi  # reload
    call    xxx_do_something
    movl    12(%rsp), %edi  # reload
    call    xxx_do_something
    movl    12(%rsp), %edi  # reload
    call    xxx_do_something
    xorl    %eax, %eax
    addq    $24, %rsp
    ret

Desired Behavior

In actual API design, the create_xxx function typically initializes the data but does not continuously modify it in a background thread. I want the compiler to recognize this and keep the value of xxx in a register instead of repeatedly loading it from memory:

entry:
    push    %rbx           # save a call-preserved register
    subq    $16, %rsp      # space for xxx while keeping RSP aligned by 16
    leaq    12(%rsp), %rdi
    call    xxx_create
    movl    12(%rsp), %edi  # load into EDI as a function arg
    movl    %edi, %ebx      # and save a copy in EBX
    call    xxx_do_something
    movl    %ebx, %edi        # copy a register instead of reloading
    call    xxx_do_something
    movl    %ebx, %edi        # ditto
    call    xxx_do_something
    xorl    %eax, %eax      # return 0
    addq    $16, %rsp
    pop     %rbx            # restore our caller's RBX
    ret

Problems with Manual Copying

Although manually saving the value works:

int _xxx;

xxx_create(&_xxx);

int xxx = _xxx;

// do other work

xxx_do_something(xxx);
xxx_do_something(xxx);
xxx_do_something(xxx);

It has several disadvantages:

  • Compilation ordering constraints: In assembly, int xxx = _xxx must occur before // do other work. The compiler cannot reorder this through out-of-order execution because it assumes that other threads might change the value of _xxx during // do other work, even though in the code int xxx = _xxx is placed before // do other work.

  • Unnecessary stack usage: The compiler might allocate stack space to save xxx (if // do other work is lengthy), whereas if the compiler knew that _xxx wouldn't be modified again, it would only need to dereference once before xxx_do_something(_xxx).

  • Performance overhead for complex types: If xxx is an array or structure, reassignment also consumes performance.

Question

Is there a standard way to tell the C compiler that:

  • The data pointed to by the pointer passed to the function won't be continuously modified by that function in a background thread?

  • Or, the function might modify the data, but after the function returns, the data won't be continuously changed by other threads?

I'm using GCC and Clang, and would prefer cross-compiler solutions or compiler-specific extensions.

10 Replies

Your suggested change assumes that nothing in the call chain of xxx_do_something modifies the EBX register.

@dbush: The post says they want the compiler to keep the value in a register, not in the register used to pass it to the routine. The argument register would have to be reloaded for each call, but it can be from another register instead of from memory. The post explicitly notes %ebx would be reloaded.

Unnecessary stack usage: The compiler might allocate stack space to save xxx (if // do other work is lengthy), whereas if the compiler knew that _xxx wouldn't be modified again, it would only need to dereference once before xxx_do_something(_xxx).

You can avoid this by ending the lifetime of the object passed to create_xxx:

int xxx;

{
    int xxxtemp;
    create_xxx(&xxxtemp);
    xxx = xxxtemp;
}

// Do other work.

xxx_do_something(xxx);
xxx_do_something(xxx);
xxx_do_something(xxx);

That also addresses your concern about allowing the compiler to reorder xxx = xxxtemp; with respect to // Do other work., since the compiler may conclude that // Do other work. does not modify xxxtemp because xxxtemp does not exist while // Do other work. is executing (its lifetime ended).

The optimizer should also deal with your performance concern: since nothing later in the program can use xxxtemp, the compiler can implement xxx = xxxtemp; simply by using the memory of xxxtemp for xxx.

Is there a standard way to tell the C compiler that:

  • The data pointed to by the pointer passed to the function won't be continuously modified by that function in a background thread?

Details of what happens during execution of create_xxx() are not relevant to the question, because your concern is about what the compiler is willing to assume about the value of that variable after create_xxx() returns.

  • Or, the function might modify the data, but after the function returns, the data won't be continuously changed by other threads?

It's not so much an issue of threading. Once you publish a pointer to the local variable by passing it to create_xxx(), the compiler cannot, without knowledge of the implementation of create_xxx(), know that it will not store that pointer somewhere, such that the variable can be accessed later by some other function. Even by another function called in the same thread.

That same-thread case is more likely to be what the compiler is guarding against, because if the variable were modified by a different thread then that would lead to a data race and ensuing undefined behavior. A modern optimizing compiler is likely simply to generate code whose correctness depends on data races not happening.
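To make the same-thread case concrete, here is a minimal sketch of the kind of implementation the compiler has to allow for when it cannot see the callees. The function bodies, the saved variable, and entry_sum are all hypothetical; the question only declares create_xxx and xxx_do_something.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: where the "escaped" pointer is kept between calls. */
static int *saved = NULL;

/* Hypothetical body: initializes the value AND remembers the address. */
int create_xxx(int *p_xxx) {
    assert(p_xxx != NULL);
    *p_xxx = 1;
    saved = p_xxx;          /* the address of the caller's local escapes here */
    return 0;
}

/* Hypothetical body: modifies the caller's variable through the saved pointer. */
int xxx_do_something(int xxx) {
    if (saved != NULL)
        *saved += 1;        /* same thread, no data race, fully defined */
    return xxx;             /* report which value was passed in */
}

/* Same shape as entry() in the question, but sums the values actually passed. */
int entry_sum(void) {
    int xxx;
    int sum = 0;

    create_xxx(&xxx);
    sum += xxx_do_something(xxx);   /* passes 1; xxx becomes 2 */
    sum += xxx_do_something(xxx);   /* passes 2; xxx becomes 3 */
    sum += xxx_do_something(xxx);   /* passes 3; xxx becomes 4 */
    saved = NULL;                   /* do not leave a dangling pointer behind */
    return sum;
}
```

If the compiler cached xxx in a register after the first load, each call would receive 1 and the sum would be 3 rather than 6, which is why it reloads when it knows nothing about the callees.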

Standard C does not provide any way to both allow create_xxx() to modify the value of a local variable and assert that the value of that variable will not subsequently be modified by code outside the function.

HOWEVER, supposing that create_xxx() indeed does not store a copy of its pointer argument, you might be able to get behavior you like better from your particular compiler in one or more of these ways:

  • put create_xxx() in the same translation unit from which it is called and declare it static, and / or

  • compile with link-time optimization enabled (GCC: -flto), and / or

  • compile all the program sources in the same compilation command and as if they were a single unit (GCC: -fwhole-program).

The idea with all of these alternatives is to allow the compiler to see for itself that there is no way for the variable in question to be modified by code outside of entry() and create_xxx(). It remains up to the compiler, however, whether it will in fact notice or care.
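A rough sketch of the first alternative, assuming create_xxx() can live in the same translation unit. The bodies below are invented for illustration, and xxx_do_something is defined here only so that the sketch links; in the real case it would stay an opaque external function.

```c
#include <assert.h>
#include <stddef.h>

/* 'static' plus a visible definition: the compiler can see that the pointer
   is only written through and never stored anywhere. Body is hypothetical. */
static int create_xxx(int *p_xxx) {
    assert(p_xxx != NULL);
    *p_xxx = 42;
    return 0;
}

/* Stand-in for the external function, defined only to make this runnable. */
int xxx_do_something(int xxx) {
    return xxx + 1;
}

/* Same call pattern as the question; returns the sum of the three results. */
int entry_total(void) {
    int xxx;

    create_xxx(&xxx);
    return xxx_do_something(xxx)
         + xxx_do_something(xxx)
         + xxx_do_something(xxx);
}
```

Whether a given compiler actually keeps xxx in a register here is still its own decision, so it is worth checking the generated assembly rather than assuming.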

Other than that, your idea to copy the value to a different local variable, whose address has not been published, is probably your best bet. Indeed, it's probably a better bet than any of the other alternatives described above.

Regarding the potential disadvantages you describe:

  • Compilation ordering constraints: In assembly, int xxx = _xxx must occur before // do other work. The compiler cannot reorder this through out-of-order execution because it assumes that other threads might change the value of _xxx during // do other work, even though in the code int xxx = _xxx is placed before // do other work.

As already discussed, interference from other threads probably is not the issue the compiler is guarding against. I wouldn't necessarily expect the assignment to be reordered across a function call, but I wouldn't be surprised to see reordering with respect to other kinds of // other work.

But also, concern about whether such an optimization is possible / performed is probably unwarranted. Your code is unlikely to be so hot and so performance sensitive that whether a reordering is performed here could make a noticeable performance difference.

  • Unnecessary stack usage: The compiler might allocate stack space to save xxx (if // do other work is lengthy), whereas if the compiler knew that _xxx wouldn't be modified again, it would only need to dereference once before xxx_do_something(_xxx).

With the possible exception of cases where create_xxx() is inlined, there needs to be space allocated for _xxx regardless, because that's a prerequisite for it to have an address.

Eric described a way to avoid such an allocation being retained longer than you would like by restricting the scope of the intermediate variable so that its lifetime ends promptly.

But also, unless you're dealing with very large objects or very limited stack, this issue is probably not important in practice.

  • Performance overhead for complex types: If xxx is an array or structure, reassignment also consumes performance.

Yes. In this case, you would likely be better off just letting the compiler assume that the object may be modified. But you aren't going to have a large object held in a register anyway, so reloading (parts of) it from memory at need will be happening regardless.

ssd

If the program flow is exactly as it appears here (i.e. the xxx_do_something function will be called 3 times in a row without any other lines in between), could a simple solution be to define a new function with an argument of int* __restrict__ xxx and move these three lines there?

Here, restrict informs the compiler (and gives it a guarantee) that the xxx variable won't change, even through different threads.

void do_another_thing(int* __restrict__ xxx) {
    xxx_do_something(*xxx);
    xxx_do_something(*xxx);
    xxx_do_something(*xxx);
}

@Eric Postpischil

In fact, the compiler will not know.

int func1(int *);

int func2(int);

int entry() {

    int arr[10];

    {
        int _arr[10];

        func1(_arr);

        __builtin_memcpy(arr, _arr, sizeof(arr));
    }

    for (int i = 0; i < sizeof(arr) / sizeof(int); i++) {
        func2(arr[i]);
    }

    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        pushq   %rbp  #
        pushq   %rbx  #
        subq    $104, %rsp      #,
# /app/example.c:12:         func1(_arr);
        leaq    48(%rsp), %rdi  #, tmp103
        movq    %rsp, %rbx      #, ivtmp.11
        leaq    40(%rsp), %rbp  #, _19
        call    func1   #
# /app/example.c:14:         __builtin_memcpy(arr, _arr, sizeof(arr));
        movdqa  48(%rsp), %xmm0     # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&_arr]
        movq    80(%rsp), %rax  # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&_arr]
        movaps  %xmm0, (%rsp)       # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&arr]
        movdqa  64(%rsp), %xmm0     # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&_arr]
        movq    %rax, 32(%rsp)  # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&arr]
        movaps  %xmm0, 16(%rsp)     # MEM <unsigned char[40]> [(char * {ref-all})&_arr], MEM <unsigned char[40]> [(char * {ref-all})&arr]
        .p2align 4
        .p2align 3
.L2:
# /app/example.c:18:         func2(arr[i]);
        movl    (%rbx), %edi    # MEM[(int *)_17], MEM[(int *)_17]
# /app/example.c:17:     for (int i = 0; i < sizeof(arr) / sizeof(int); i++) {
        addq    $4, %rbx        #, ivtmp.11
# /app/example.c:18:         func2(arr[i]);
        call    func2   #
# /app/example.c:17:     for (int i = 0; i < sizeof(arr) / sizeof(int); i++) {
        cmpq    %rbp, %rbx      # _19, ivtmp.11
        jne     .L2       #,
# /app/example.c:22: }
        addq    $104, %rsp      #,
        xorl    %eax, %eax      #
        popq    %rbx    #
        popq    %rbp    #
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

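The compiler still allocates space for arr and copies the data into it.

The ideal code would skip the copy and simply load from the original array each time func2(arr[i]); is called.

The compiler does have the ability to do this, as the next example using __attribute__((malloc)) shows: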

int func1(int *);

int func2(int);

__attribute__((malloc)) int *func3();

int entry() {

    int *arr = func3();

    for (int i = 0; i < 10; i++) {
        func2(arr[i]);
    }

    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        pushq   %rbp  #
        pushq   %rbx  #
        subq    $8, %rsp        #,
# /app/example.c:9:     int *arr = func3();
        call    func3   #
        movq    %rax, %rbx      # ivtmp.10, ivtmp.10
        leaq    40(%rax), %rbp  #, _20
        .p2align 4
        .p2align 3
.L2:
# /app/example.c:12:         func2(arr[i]);
        movl    (%rbx), %edi    # MEM[(int *)_18], MEM[(int *)_18]
# /app/example.c:11:     for (int i = 0; i < 10; i++) {
        addq    $4, %rbx        #, ivtmp.10
# /app/example.c:12:         func2(arr[i]);
        call    func2   #
# /app/example.c:11:     for (int i = 0; i < 10; i++) {
        cmpq    %rbp, %rbx      # _20, ivtmp.10
        jne     .L2       #,
# /app/example.c:16: }
        addq    $8, %rsp        #,
        xorl    %eax, %eax      #
        popq    %rbx    #
        popq    %rbp    #
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

@ssd

In fact, all C compilers tend to somewhat ignore restrict. When do_another_thing and entry are compiled in the same source file, the compiler will perform its own analysis and then assume that xxx might be continuously modified.

int xxx_create(int *p_xxx);
int xxx_do_something(int xxx);

void do_another_thing(int* restrict xxx) {
    xxx_do_something(*xxx);
    xxx_do_something(*xxx);
    xxx_do_something(*xxx);
}

int entry() {
    int xxx;

    xxx_create(&xxx);
    
    do_another_thing(&xxx);
    
    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  do_another_thing
        .type   do_another_thing, @function
do_another_thing:
        pushq   %rbx  #
# /app/example.c:4: void do_another_thing(int* restrict xxx) {
        movq    %rdi, %rbx      # xxx, xxx
# /app/example.c:5:     xxx_do_something(*xxx);
        movl    (%rdi), %edi    # *xxx_5(D), *xxx_5(D)
        call    xxx_do_something        #
# /app/example.c:6:     xxx_do_something(*xxx);
        movl    (%rbx), %edi    # *xxx_5(D), *xxx_5(D)
        call    xxx_do_something        #
# /app/example.c:7:     xxx_do_something(*xxx);
        movl    (%rbx), %edi    # *xxx_5(D), *xxx_5(D)
# /app/example.c:8: }
        popq    %rbx    #
# /app/example.c:7:     xxx_do_something(*xxx);
        jmp     xxx_do_something  #
        .size   do_another_thing, .-do_another_thing
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        subq    $24, %rsp       #,
# /app/example.c:13:     xxx_create(&xxx);
        leaq    12(%rsp), %rdi  #, tmp102
        call    xxx_create      #
# /app/example.c:5:     xxx_do_something(*xxx);
        movl    12(%rsp), %edi  # xxx,
        call    xxx_do_something        #
# /app/example.c:6:     xxx_do_something(*xxx);
        movl    12(%rsp), %edi  # xxx,
        call    xxx_do_something        #
# /app/example.c:7:     xxx_do_something(*xxx);
        movl    12(%rsp), %edi  # xxx,
        call    xxx_do_something        #
# /app/example.c:18: }
        xorl    %eax, %eax      #
        addq    $24, %rsp       #,
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

@John Bollinger

It is also very meaningful in the case of structures or arrays. On Windows there is a programming model called COM, and the well-known graphics API Direct3D also uses it. COM objects are generally called like this:

typedef struct {
    void (*func1)();
    void (*func2)();
    void (*func3)();
    void (*func4)();
    void (*func5)();
} i_object_vtable;

typedef struct {
    i_object_vtable *vtable;
} i_object;

int object_create(i_object **);

int entry() {
    i_object *p_object;

    object_create(&p_object);
    
    p_object->vtable->func1();
    p_object->vtable->func2();
    p_object->vtable->func3();
    p_object->vtable->func4();
    p_object->vtable->func5();

    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        subq    $24, %rsp       #,
# /app/example.c:18:     object_create(&p_object);
        leaq    8(%rsp), %rdi   #, tmp114
        call    object_create   #
# /app/example.c:20:     p_object->vtable->func1();
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:20:     p_object->vtable->func1();
        movq    (%rax), %rax    # p_object.0_1->vtable, p_object.0_1->vtable
# /app/example.c:20:     p_object->vtable->func1();
        call    *(%rax) # _2->func1
# /app/example.c:21:     p_object->vtable->func2();
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:21:     p_object->vtable->func2();
        movq    (%rax), %rax    # p_object.1_4->vtable, p_object.1_4->vtable
# /app/example.c:21:     p_object->vtable->func2();
        call    *8(%rax)        # _5->func2
# /app/example.c:22:     p_object->vtable->func3();
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:22:     p_object->vtable->func3();
        movq    (%rax), %rax    # p_object.2_7->vtable, p_object.2_7->vtable
# /app/example.c:22:     p_object->vtable->func3();
        call    *16(%rax)       # _8->func3
# /app/example.c:23:     p_object->vtable->func4();
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:23:     p_object->vtable->func4();
        movq    (%rax), %rax    # p_object.3_10->vtable, p_object.3_10->vtable
# /app/example.c:23:     p_object->vtable->func4();
        call    *24(%rax)       # _11->func4
# /app/example.c:24:     p_object->vtable->func5();
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:24:     p_object->vtable->func5();
        movq    (%rax), %rax    # p_object.4_13->vtable, p_object.4_13->vtable
# /app/example.c:24:     p_object->vtable->func5();
        call    *32(%rax)       # _14->func5
# /app/example.c:27: }
        xorl    %eax, %eax      #
        addq    $24, %rsp       #,
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

Manually copying these function pointers makes the generated code worse:

typedef struct {
    void (*func1)();
    void (*func2)();
    void (*func3)();
    void (*func4)();
    void (*func5)();
} i_object_vtable;

typedef struct {
    i_object_vtable *vtable;
} i_object;

int object_create(i_object **);

int entry() {
    i_object *p_object;

    object_create(&p_object);

    i_object_vtable vtable;
    __builtin_memcpy(&vtable, p_object->vtable, sizeof(vtable));

    vtable.func1();
    vtable.func2();
    vtable.func3();
    vtable.func4();
    vtable.func5();

    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        subq    $72, %rsp       #,
# /app/example.c:18:     object_create(&p_object);
        leaq    8(%rsp), %rdi   #, tmp106
        call    object_create   #
# /app/example.c:21:     __builtin_memcpy(&vtable, p_object->vtable, sizeof(vtable));
        movq    8(%rsp), %rax   # p_object, p_object
# /app/example.c:21:     __builtin_memcpy(&vtable, p_object->vtable, sizeof(vtable));
        movq    (%rax), %rax    # p_object.0_1->vtable, p_object.0_1->vtable
        movdqu  (%rax), %xmm0       # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)_2]
        movq    %xmm0, %rdx     # MEM <char[1:40]> [(void *)_2], tmp119
        movaps  %xmm0, 16(%rsp)     # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)&vtable]
        movdqu  16(%rax), %xmm0     # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)_2]
        movq    32(%rax), %rax  # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)_2]
        movaps  %xmm0, 32(%rsp)     # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)&vtable]
        movq    %rax, 48(%rsp)  # MEM <char[1:40]> [(void *)_2], MEM <char[1:40]> [(void *)&vtable]
# /app/example.c:23:     vtable.func1();
        call    *%rdx   # tmp119
# /app/example.c:24:     vtable.func2();
        call    *24(%rsp)       # vtable.func2
# /app/example.c:25:     vtable.func3();
        call    *32(%rsp)       # vtable.func3
# /app/example.c:26:     vtable.func4();
        call    *40(%rsp)       # vtable.func4
# /app/example.c:27:     vtable.func5();
        call    *48(%rsp)       # vtable.func5
# /app/example.c:30: }
        xorl    %eax, %eax      #
        addq    $72, %rsp       #,
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

I also can't do this by hand for every object, because it would be tedious. Once the compiler knows that the i_object_vtable is not being modified, it can optimize on its own:

typedef struct {
    void (*func1)();
    void (*func2)();
    void (*func3)();
    void (*func4)();
    void (*func5)();
} i_object_vtable;

typedef struct {
    i_object_vtable *vtable;
} i_object;

__attribute__((malloc)) i_object *object_create();

int entry() {
    i_object *p_object;

    p_object = object_create();

    p_object->vtable->func1();
    p_object->vtable->func2();
    p_object->vtable->func3();
    p_object->vtable->func4();
    p_object->vtable->func5();

    // Saved the pointer into the register
    p_object->vtable->func1();
    p_object->vtable->func1();
    p_object->vtable->func1();

    return 0;
}

GCC 15.2.0 output with -Ofast:

        .file   "example.c"
# GNU C23 (Compiler-Explorer-Build-gcc--binutils-2.44) version 15.2.0 (x86_64-linux-gnu)
#       compiled by GNU C version 11.4.0, GMP version 6.2.1, MPFR version 4.1.0, MPC version 1.2.1, isl version isl-0.24-GMP

# GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
# options passed: -mtune=generic -march=x86-64 -g -g0 -Ofast -fno-asynchronous-unwind-tables
        .text
        .p2align 4
        .globl  entry
        .type   entry, @function
entry:
        pushq   %rbx  #
# /app/example.c:18:     p_object = object_create();
        call    object_create   #
# /app/example.c:20:     p_object->vtable->func1();
        movq    (%rax), %rbx    # p_object_12->vtable, _1
# /app/example.c:20:     p_object->vtable->func1();
        call    *(%rbx) # _1->func1
# /app/example.c:21:     p_object->vtable->func2();
        call    *8(%rbx)        # _1->func2
# /app/example.c:22:     p_object->vtable->func3();
        call    *16(%rbx)       # _1->func3
# /app/example.c:23:     p_object->vtable->func4();
        call    *24(%rbx)       # _1->func4
# /app/example.c:24:     p_object->vtable->func5();
        call    *32(%rbx)       # _1->func5
# /app/example.c:27:     p_object->vtable->func1();
        call    *(%rbx) # _1->func1
# /app/example.c:28:     p_object->vtable->func1();
        call    *(%rbx) # _1->func1
# /app/example.c:29:     p_object->vtable->func1();
        call    *(%rbx) # _1->func1
# /app/example.c:32: }
        xorl    %eax, %eax      #
        popq    %rbx    #
        ret     
        .size   entry, .-entry
        .ident  "GCC: (Compiler-Explorer-Build-gcc--binutils-2.44) 15.2.0"
        .section        .note.GNU-stack,"",@progbits

Unfortunately, most APIs return error codes instead of pointers, which makes it impossible to use __attribute__((malloc)).

In fact, all C compilers tend to somewhat ignore restrict.

A restrict qualification applies within the scope where the pointer is declared (block, struct, function, or file) and is not carried over into another function. When the code calls an external function that the compiler knows nothing about, restrict does nothing.
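For contrast, here is a small sketch of the situation where restrict typically does help: two pointers that the compiler can both see inside one function. The names add_twice and demo_restrict are made up for this illustration.

```c
#include <assert.h>

/* Both parameters are restrict-qualified, so the stores through dst are
   promised not to change *src. The compiler is therefore allowed to load
   *src once and reuse it for the second addition. */
void add_twice(int *restrict dst, const int *restrict src) {
    assert(dst != src);   /* the restrict promise would be broken otherwise */
    *dst += *src;
    *dst += *src;
}

/* Tiny driver: 1 + 5 + 5. */
int demo_restrict(void) {
    int total = 1;
    int step = 5;

    add_twice(&total, &step);
    return total;
}
```

In do_another_thing above there is no second visible pointer, only an opaque call to xxx_do_something, and as the listing shows GCC reloads *xxx after each call anyway.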

TL;DR: int xxx = create_byval(); instead of output args if you can

(Update, or @Eric Postpischil's trick of limiting the lifetime of the variable whose address escapes also solves the problem, although then you still need stack space at that point and a reload after the first time, instead of only ever registers for types small enough to fit.)
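If the existing error-code API cannot be changed, one possible way to combine both ideas is a small by-value wrapper, sketched below. The wrapper name create_xxx_byval, the -1 error sentinel, and all function bodies are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the existing error-code API; the body is hypothetical. */
int create_xxx(int *p_xxx) {
    assert(p_xxx != NULL);
    *p_xxx = 10;
    return 0;
}

/* Hypothetical wrapper: tmp is the only object whose address escapes,
   and its lifetime ends when the wrapper returns. */
static inline int create_xxx_byval(void) {
    int tmp;
    int status = create_xxx(&tmp);

    return status == 0 ? tmp : -1;   /* -1 as error sentinel is an assumption */
}

/* Stand-in for the opaque external function. */
int xxx_do_something(int xxx) {
    return xxx;
}

int entry_total(void) {
    int xxx = create_xxx_byval();    /* the address of xxx is never taken */

    return xxx_do_something(xxx)
         + xxx_do_something(xxx)
         + xxx_do_something(xxx);
}
```

Because the address of xxx in entry_total never escapes, the compiler is free to keep it in a call-preserved register; the temporary still needs stack space for the duration of the wrapper.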


With callees that do significant work, they will typically save/restore RBX themselves, in which case mov %ebx, %edi is just copying a recent load result (from pop %rbx at the end of the callee), minimal benefit vs. doing our own reload, assuming out-of-order exec can see far enough to do it early-ish either way.

But that lack of significant benefit is because your example is very simple.

If you were doing ++xxx; or something similar in entry() between function calls, the compiler could just use lea 1(%rbx), %edi to set up the arg for the second call, and lea 2(%rbx), %edi for the final one. With the variable in memory because a pointer to it escaped the function, the compiler would instead have to do a read-modify-write on the copy in memory as well as reload it.

As we can see on Godbolt, adding ++xxx; between the calls does create a bigger difference between your version of entry vs. one which does int xxx = create_xxx_byval(); without letting the address of xxx escape the function. Escape analysis is what lets compilers figure out if they can optimize variables into registers.

I'm not aware of any C feature to promise that a function doesn't save a pointer somewhere reachable by other functions. If so, that would let a by-reference output arg work the way you hoped.

Use return values instead of output args whenever possible. If you have more than one return value, using a struct is pretty inconvenient in C, but still avoids this problem. (And is still efficient for return values up to two pointer-widths in size in many calling conventions, especially when it's two equal-sized halves. x86-64 SysV packs struct retvals into RDX:RAX using the same layout rules as for memory, so {int a; int64_t b;} will put a in EAX and b in RDX, but if b was an 8-byte char array (or anything with alignof(T) < 8) it will get packed right after a, with its low half in the top of RAX and high half in the bottom of RDX.)
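A minimal sketch of the struct-return approach, assuming a status code plus one value. The type xxx_result, the function create_xxx_byval, and the function bodies are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: a status code plus the created value. */
struct xxx_result {
    int     status;   /* 0 on success in this sketch */
    int64_t value;
};

/* Hypothetical by-value constructor: the caller never hands out a pointer. */
struct xxx_result create_xxx_byval(void) {
    struct xxx_result r = { .status = 0, .value = 7 };
    return r;
}

/* Stand-in for an opaque external function. */
int64_t xxx_do_something(int64_t xxx) {
    return xxx * 2;
}

int64_t entry_total(void) {
    struct xxx_result r = create_xxx_byval();

    assert(r.status == 0);
    /* The address of r never escapes, so r.value may stay in a register
       across both calls. */
    return xxx_do_something(r.value) + xxx_do_something(r.value);
}
```

With this layout (int followed by int64_t) the struct is 16 bytes on x86-64 SysV, which matches the register-pair return case described above.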

Or use return values for "important" variables and have a less important variable (like a success/fail status) as a by-reference output arg, although that sucks for readability since then you can't write if (!foo(&xxx)) ... handle error, you'd need xxx = foo(&err_status); if (err_status) .... And it means a store/reload for the error status, which increases latency before the branch prediction can be checked for that branch, increasing mispredict cost. OTOH it can mean lower latency for the xxx value itself if it's small and returns in a register.
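The second shape could look roughly like this. The name create_xxx_val and the function bodies are hypothetical, and 0 meaning success is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical API shape: the important value is the return value,
   the status is the by-reference output arg. */
int create_xxx_val(int *err_status) {
    assert(err_status != NULL);
    *err_status = 0;      /* 0 means success in this sketch */
    return 42;
}

/* Stand-in for an opaque external function. */
int xxx_do_something(int xxx) {
    return xxx + 1;
}

int entry_total(void) {
    int err_status;
    int xxx = create_xxx_val(&err_status);  /* only err_status has its address taken */

    if (err_status != 0)
        return -1;                          /* handle error */
    return xxx_do_something(xxx) + xxx_do_something(xxx);
}
```

Here only err_status has to live in memory; xxx itself never has its address taken, so escape analysis does not force reloads of it.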

I'm optimistic that all of this still applies with larger types that are returned in asm by hidden pointer even when C returns by value. Callees shouldn't be allowed to hang onto a pointer to the return value in that case because it's not visible in the C semantics, only the asm implementation.
