File-locking on linux is NOT atomic and FIFO, contrary to what the docs may tell you.
To demonstrate, consider this toy server and client.
Server
Client
This is written in Python, but is a mirror of how the program could be written in C, and uses the same syscalls.
In this case, the client initializes a memfd
and passes it to server process via UDP and unix domain sockets (see: sending-fds-through-uds). The server then takes a lock on the memfd
, does a write on it, then releases the lock and immediately tries to claim it. Since the client claimed the lock before the server re-claimed the lock, it should claim the lock immediately after. The client should then read the data inside the memfd
, and then “flush” the output by setting the cursor, which is shared between the two processes (since both the descriptors share the same open file description), to either “0” for the read and “$bytes_written” for the writer. The code explains the logic better than I can.
In practice, this leads to messy race conditions. The order in which they run is completely non-deterministic. I measure this with whoops_count
. With 100 iterations, I repeatedly got whoops_count
s of around 400 for both processes, making it totally useless for real-world workloads. That’s a 4x “miss” per syscall!
This also doesn’t work with flock
-based locks, which are completely separate from fcntl
locks and do not interact with each-other (theoretically). flock
locks are even more useless, since it isn’t possible to upgrade locks atomically (going from a shared lock to an exclusive lock without dropping the lock in the middle). lockf
uses fcntl
under the hood, providing a much nicer interface.
The linux source code claims that while the atomic FIFO behavior isn’t required by POSIX, they support it anyways because it makes sense, . A git blame
reveals this comment was written over 19 years ago, and that’s inaccurate in that the commit where that comment comes from is the linux kernel’s first commit on git. So that comment could feasibly be 20+ years old.
My guess is that it really was FIFO, once upon a time, and maybe the change to the completely fair scheduler or something underlying caused the behavior to change. Relying on file locking, a syscall, for synchronization is kinda stupid anyways when it should be being done in the userspace for performance reasons.
So, in short, you gotta use something like eventfd
as mechanism for process-based synchronization (or futexes). I learned this the hard way.