ATOMIC_LOADSTORE(9) Kernel Developer's Manual ATOMIC_LOADSTORE(9)

NAME

atomic_load_relaxed, atomic_load_acquire, atomic_load_consume, atomic_store_relaxed, atomic_store_release - atomic and ordered memory operations

SYNOPSIS

#include <sys/atomic.h>

T
atomic_load_relaxed(const volatile T *p);

T
atomic_load_acquire(const volatile T *p);

T
atomic_load_consume(const volatile T *p);

void
atomic_store_relaxed(volatile T *p, T v);

void
atomic_store_release(volatile T *p, T v);

DESCRIPTION

These type-generic macros implement memory operations that are atomic and that have memory ordering constraints. Aside from atomicity and ordering, the load operations are equivalent to *p and the store operations are equivalent to *p = v. The pointer p must be aligned, even on architectures like x86 which generally lack strict alignment requirements; see SIZE AND ALIGNMENT for details.

Atomic means that the memory operations cannot be fused or torn: the implementation may neither combine several of these memory operations into one, nor reuse a previously loaded value instead of issuing a fresh load (fusing), nor split one of these memory operations into several smaller memory operations (tearing).
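
For example, given a plain int flag named done (a hypothetical variable for illustration), the compiler may fuse the repeated plain loads in a spin loop into a single load, spinning forever on a stale value; the atomic load must be issued anew on every iteration:

	int done;

	/* Plain loads may be fused: the compiler may hoist the load
	 * of done out of the loop and never observe another CPU's
	 * store. */
	while (!done)
		continue;

	/* Atomic loads may not be fused: a fresh load is issued on
	 * every iteration. */
	while (!atomic_load_relaxed(&done))
		continue;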

Atomic operations on any single object occur in a total order shared by all interrupts, threads, and CPUs, which is consistent with the program order in every interrupt, thread, and CPU. A program running without interruption, and without other threads or CPUs, will always observe its own loads and stores in program order, but a program in an interrupt handler, in another thread, or on another CPU may issue loads that return values as if the first program's stores had occurred out of program order, and vice versa. Two different threads might each observe a third thread's memory operations in different orders.
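
For example, in the following sketch (x and f are hypothetical shared variables, initially zero), the second CPU may observe the two relaxed stores out of the first CPU's program order, and so may print 0 rather than 42:

	int x, f;

	/* CPU 1 */
	atomic_store_relaxed(&x, 42);
	atomic_store_relaxed(&f, 1);

	/* CPU 2 */
	if (atomic_load_relaxed(&f) == 1)
		printf("%d\n", atomic_load_relaxed(&x));	/* may print 0 */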

The memory ordering constraints make limited guarantees of ordering relative to memory operations on other objects, as witnessed by interrupts, other threads, or other CPUs, and have the following meanings:

relaxed
No ordering relative to memory operations on any other objects is guaranteed. Relaxed ordering is the default for ordinary non-atomic memory operations like *p and *p = v.

Atomic operations with relaxed ordering are cheap: they are not read/modify/write atomic operations, and they do not involve any kind of inter-CPU ordering barriers.

acquire
This memory operation happens before all subsequent memory operations in program order. However, prior memory operations in program order may be reordered to happen after this one. For example, assuming no aliasing between the pointers, the implementation is allowed to treat

	int x = *p;
	if (atomic_load_acquire(q)) {
		int y = *r;
		*s = x + y;
		return 1;
	}

as if it were

	if (atomic_load_acquire(q)) {
		int x = *p;
		int y = *r;
		*s = x + y;
		return 1;
	}

but not as if it were

	int x = *p;
	int y = *r;
	*s = x + y;
	if (atomic_load_acquire(q)) {
		return 1;
	}

consume
This memory operation happens before all memory operations on objects at addresses that are computed from the value returned by this one. Otherwise, no ordering relative to memory operations on other objects is implied.

For example, the implementation is allowed to treat

	struct foo *foo0, *foo1;

	struct foo *f0 = atomic_load_consume(&foo0);
	struct foo *f1 = atomic_load_consume(&foo1);
	int x = f0->x;
	int y = f1->y;

as if it were

	struct foo *foo0, *foo1;

	struct foo *f1 = atomic_load_consume(&foo1);
	struct foo *f0 = atomic_load_consume(&foo0);
	int y = f1->y;
	int x = f0->x;

but loading f0->x is guaranteed to happen after loading foo0 even if the CPU had a cached value for the address that f0->x happened to be at, and likewise for f1->y and foo1.

atomic_load_consume() functions like atomic_load_acquire() as long as the memory operations that must happen after it are limited to addresses that depend on the value returned by it, but it is almost always as cheap as atomic_load_relaxed(). See ACQUIRE OR CONSUME? below for more details.

release
All prior memory operations in program order happen before this one. However, subsequent memory operations in program order may be reordered to happen before this one too. For example, assuming no aliasing between the pointers, the implementation is allowed to treat

	int x = *p;
	*q = x;
	atomic_store_release(r, 0);
	int y = *s;
	return x + y;

as if it were

	int y = *s;
	int x = *p;
	*q = x;
	atomic_store_release(r, 0);
	return x + y;

but not as if it were

	atomic_store_release(r, 0);
	int x = *p;
	int y = *s;
	*q = x;
	return x + y;

In general, each atomic_store_release() must be paired with either atomic_load_acquire() or atomic_load_consume() in order to have any effect: it is only when a release operation synchronizes with an acquire or consume operation that any ordering is guaranteed between memory operations before the release operation and memory operations after the acquire/consume operation.

For example, to set up an entry in a table and then mark the entry ready, you should:

  1. Perform memory operations to initialize the data.
    	tab[i].x = ...;
    	tab[i].y = ...;
  2. Issue atomic_store_release() to mark it ready.
    	atomic_store_release(&tab[i].ready, 1);
  3. Possibly in another thread, issue atomic_load_acquire() to ascertain whether it is ready.
    	if (atomic_load_acquire(&tab[i].ready) == 0)
    		return EWOULDBLOCK;
  4. Perform memory operations to use the data.
    	do_stuff(tab[i].x, tab[i].y);

Similarly, if you want to create an object, initialize it, and then publish it to be used by another thread, then you should:

  1. Perform memory operations to initialize the object.
    	struct mumble *m = kmem_alloc(sizeof(*m), KM_SLEEP);
    	m->x = x;
    	m->y = y;
    	m->z = m->x + m->y;
  2. Issue atomic_store_release() to publish it.
    	atomic_store_release(&the_mumble, m);
  3. Possibly in another thread, issue atomic_load_consume() to get it.
    	struct mumble *m = atomic_load_consume(&the_mumble);
  4. Perform memory operations to use the object's members.
    	m->y &= m->x;
    	do_things(m->x, m->y, m->z);

In both examples, assuming that the value written by atomic_store_release() in step 2 is read by atomic_load_acquire() or atomic_load_consume() in step 3, this guarantees that all of the memory operations in step 1 complete before any of the memory operations in step 4, even if they happen on different CPUs.

Without both the release operation in step 2 and the acquire or consume operation in step 3, no ordering is guaranteed between the memory operations in steps 1 and 4. In fact, without both release and acquire/consume, even the assignment m->z = m->x + m->y in step 1 might read values of m->x and m->y that were written in step 4.

ACQUIRE OR CONSUME?

You must use atomic_load_acquire() when subsequent memory operations in program order that must happen after the load are performed on objects at addresses that might not depend arithmetically on the resulting value. This applies particularly when the choice of whether to do the subsequent memory operation at all depends on a control-flow decision:

	struct gadget {
		int ready, x;
	} the_gadget;

	/* Producer */
	the_gadget.x = 42;
	atomic_store_release(&the_gadget.ready, 1);

	/* Consumer */
	if (atomic_load_acquire(&the_gadget.ready) == 0)
		return EWOULDBLOCK;
	int x = the_gadget.x;

Here the load of the_gadget.x depends on a control-flow decision based on the value loaded from the_gadget.ready, and loading the_gadget.x must happen after loading the_gadget.ready. Using atomic_load_acquire() guarantees that the compiler and CPU do not conspire to load the_gadget.x before we have ascertained that it is ready.

You may use atomic_load_consume() if all subsequent memory operations in program order that must happen after the load are performed on objects at addresses computed arithmetically from the resulting value, such as loading a pointer to a structure object and then dereferencing it:

	struct gizmo {
		int x, y, z;
	};
	struct gizmo null_gizmo;
	struct gizmo *the_gizmo = &null_gizmo;

	/* Producer */
	struct gizmo *g = kmem_alloc(sizeof(*g), KM_SLEEP);
	g->x = 12;
	g->y = 34;
	g->z = 56;
	atomic_store_release(&the_gizmo, g);

	/* Consumer */
	struct gizmo *g = atomic_load_consume(&the_gizmo);
	int y = g->y;

Here the address of g->y depends on the value of the pointer loaded from the_gizmo. Using atomic_load_consume() guarantees that we do not witness a stale cache for that address.

In some cases it may be unclear whether acquire or consume is needed. For example:

	int x[2];
	bool b;

	/* Producer */
	x[0] = 42;
	atomic_store_release(&b, 0);

	/* Consumer 1 */
	int y = atomic_load_???(&b) ? x[0] : x[1];

	/* Consumer 2 */
	int y = x[atomic_load_???(&b) ? 0 : 1];

	/* Consumer 3 */
	int y = x[atomic_load_???(&b) ^ 1];

Although the three consumers seem to be equivalent, by the letter of C11 consumers 1 and 2 require atomic_load_acquire(), because the value determines the address of a subsequent load only via control-flow decisions in the ?: operator, whereas consumer 3 can use atomic_load_consume(). However, if you're not sure, you should err on the side of atomic_load_acquire() until C11 implementations have ironed out the kinks in the semantics.

On all CPUs other than DEC Alpha, atomic_load_consume() is cheap: it is identical to atomic_load_relaxed(). In contrast, atomic_load_acquire() usually implies an expensive memory barrier.

SIZE AND ALIGNMENT

The pointer p must be aligned: if the object it points to is 2^n bytes long, then the low-order n bits of p must be zero.
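
For example, a 32-bit object is 2^2 bytes long, so the low-order 2 bits of its address must be zero; a sketch of a kernel assertion to that effect:

	uint32_t *p;

	/* sizeof(*p) is 2^2 = 4, so the low-order 2 bits of the
	 * address must be zero. */
	KASSERT(((uintptr_t)p & (sizeof(*p) - 1)) == 0);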

All NetBSD ports support atomic loads and stores on units of data up to 32 bits. Some ports additionally support atomic loads and stores on larger quantities, such as 64-bit quantities, if __HAVE_ATOMIC64_LOADSTORE is defined. The macros may not be used on quantities of data larger than the port supports atomically; attempts to use them for such quantities result in a compile-time assertion failure.

For example, as long as you use atomic_store_relaxed() to write a 32-bit quantity, you can safely use atomic_load_relaxed() to optimistically read it outside a lock, but for a 64-bit quantity the code must be conditional on __HAVE_ATOMIC64_LOADSTORE; otherwise it will lead to compile-time errors on platforms without 64-bit atomic loads and stores:

	struct foo {
		kmutex_t	f_lock;
		uint32_t	f_refcnt;
		uint64_t	f_ticket;
	};

	if (atomic_load_relaxed(&foo->f_refcnt) == 0)
		return 123;
#ifdef __HAVE_ATOMIC64_LOADSTORE
	if (atomic_load_relaxed(&foo->f_ticket) == ticket)
		return 123;
#endif
	mutex_enter(&foo->f_lock);
	if (foo->f_refcnt == 0 || foo->f_ticket == ticket)
		ret = 123;
	...
#ifdef __HAVE_ATOMIC64_LOADSTORE
	atomic_store_relaxed(&foo->f_ticket, foo->f_ticket + 1);
#else
	foo->f_ticket++;
#endif
	...
	mutex_exit(&foo->f_lock);

C11 COMPATIBILITY

These macros are meant to follow C11 semantics, in terms of atomic_load_explicit() and atomic_store_explicit() with the appropriate memory order specifiers, and are meant to make future adoption of the C11 atomic API easier. Eventually it may be mandatory to use the C11 _Atomic type qualifier or equivalent for the operands.
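
For illustration, assuming p points to a suitably _Atomic-qualified object, the intended correspondence with the standard <stdatomic.h> interface is sketched by:

	#include <stdatomic.h>

	v = atomic_load_explicit(p, memory_order_relaxed);  /* atomic_load_relaxed(p) */
	v = atomic_load_explicit(p, memory_order_acquire);  /* atomic_load_acquire(p) */
	v = atomic_load_explicit(p, memory_order_consume);  /* atomic_load_consume(p) */
	atomic_store_explicit(p, v, memory_order_relaxed);  /* atomic_store_relaxed(p, v) */
	atomic_store_explicit(p, v, memory_order_release);  /* atomic_store_release(p, v) */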

LINUX ANALOGUES

The Linux kernel provides two macros READ_ONCE(x) and WRITE_ONCE(x, v) which are similar to atomic_load_consume(&x) and atomic_store_relaxed(&x, v), respectively. However, while Linux's READ_ONCE() and WRITE_ONCE() prevent fusing, they may in some cases be torn, and therefore fail to guarantee atomicity, because they do not require the object to be aligned or to be no larger than the units of memory access that the machine can perform atomically.

MEMORY BARRIERS AND ATOMIC READ/MODIFY/WRITE OPERATIONS

The atomic read/modify/write operations in atomic_ops(3) have relaxed ordering by default, but they can be combined with the memory barriers in membar_ops(3) to get the same effect as an acquire operation or a release operation, for the purposes of pairing with atomic_store_release() and atomic_load_acquire() or atomic_load_consume(): if atomic_r/m/w() is an atomic read/modify/write operation in atomic_ops(3), then

	membar_exit();
	atomic_r/m/w(obj, ...);

functions like a release operation on obj, and

	atomic_r/m/w(obj, ...);
	membar_enter();

functions like an acquire operation on obj.

WARNING: The combination of atomic_load_relaxed() and membar_enter(3) does not make an acquire operation; only read/modify/write atomics may be combined with membar_enter(3) this way.

On architectures where __HAVE_ATOMIC_AS_MEMBAR is defined, all the atomic_ops(3) imply release and acquire operations, so the membar_enter(3) and membar_exit(3) barriers are redundant.
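
For example, a sketch of the canonical reference-count release, in which obj and obj_destroy() are hypothetical, pairs membar_exit(3) before the decrement with membar_enter(3) after it, so that the thread dropping the last reference observes all prior writes to the object:

	/* Ensure our prior writes to *obj happen before the decrement. */
	membar_exit();
	if (atomic_dec_uint_nv(&obj->refcnt) != 0)
		return;
	/* Last reference dropped: ensure the teardown happens after
	 * every other thread's use of the object. */
	membar_enter();
	obj_destroy(obj);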

EXAMPLES

Maintaining lossy counters. These may lose some counts, because the read/modify/write cycle as a whole is not atomic. But this guarantees that the count will increase by at most one each time. In contrast, without atomic operations, a write to a 32-bit counter might in principle be torn into multiple smaller stores, which could appear to happen out of order from another CPU's perspective, leading to nonsensical counter readouts. (For frequently recorded events, consider using per-CPU counters instead in practice.)

	unsigned count;

	void
	record_event(void)
	{
		atomic_store_relaxed(&count,
		    1 + atomic_load_relaxed(&count));
	}

	unsigned
	read_event_count(void)
	{

		return atomic_load_relaxed(&count);
	}

Initialization barrier.

	int ready;
	struct data d;

	void
	setup_and_notify(void)
	{

		setup_data(&d.things);
		atomic_store_release(&ready, 1);
	}

	void
	try_if_ready(void)
	{

		if (atomic_load_acquire(&ready))
			do_stuff(d.things);
	}

Publishing a pointer to the current snapshot of data. (Caller must arrange that only one call to take_snapshot happens at any given time; generally this should be done in coordination with pserialize(9) or similar to enable resource reclamation.)

	struct data *current_d;

	void
	take_snapshot(void)
	{
		struct data *d = kmem_alloc(sizeof(*d), KM_SLEEP);

		d->things = ...;

		atomic_store_release(&current_d, d);
	}

	struct data *
	get_snapshot(void)
	{

		return atomic_load_consume(&current_d);
	}

CODE REFERENCES

sys/sys/atomic.h

SEE ALSO

atomic_ops(3), membar_ops(3), pserialize(9)

HISTORY

These atomic operations first appeared in NetBSD 9.0.

CAVEATS

C11 formally specifies that all subexpressions, except the left operands of the &&, ||, and ?: operators and of the comma operator, and the operand of the kill_dependency() macro, carry dependencies for which memory_order_consume guarantees ordering, but most or all implementations to date simply treat memory_order_consume as memory_order_acquire and do not take advantage of data dependencies to elide costly memory barriers or load-acquire CPU instructions.
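
For illustration, a C11 sketch (not NetBSD API; ptr and x are hypothetical):

	#include <stdatomic.h>
	#include <stddef.h>

	extern _Atomic(int *) ptr;
	extern int x[2];

	int
	consume_example(void)
	{
		int *p = atomic_load_explicit(&ptr, memory_order_consume);
		int a = *p;                   /* p carries a dependency: ordered */
		int b = *kill_dependency(p);  /* dependency severed by the macro */
		int c = x[p != NULL ? 0 : 1]; /* ?: condition carries no dependency */
		return a + b + c;
	}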

Instead, we implement atomic_load_consume() as atomic_load_relaxed() followed by membar_datadep_consumer(3), which is equivalent to membar_consumer(3) on DEC Alpha and __insn_barrier(3) elsewhere.

BUGS

Some idiot decided to call it tearing, depriving us of the opportunity to say that atomic operations prevent fusion and fission.

November 25, 2019 NetBSD-9.2