Unix domain sockets are the third type of socket (after TCP and UDP), used for IPC. They also have a pretty unique power: they can duplicate file descriptors across processes.
A primer on files in Linux
A "file descriptor" is an abstraction over an object whose data you can manipulate. It's kinda like a malloc'd block of memory, except instead of being tied to the virtual address space of your program, it's tied to the external context of your operating system. Note that I'm using "kinda" very liberally here, for there are an uncountably infinite number of "ifs" and "buts" attached to this.
Indeed, you can treat your file descriptor like a block of memory by mmap-ing it with little care, if you so feel.
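A minimal sketch of that idea, assuming Linux and a writable temp directory for tmpfile(). The helper name map_fd_demo is made up for illustration:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write into a file through a memory mapping, then
 * read the same bytes back through the descriptor. Returns 0 on success. */
int map_fd_demo(char *out, size_t outlen)
{
    static const char msg[] = "hello from mmap";
    if (outlen < sizeof msg)
        return -1;

    FILE *tmp = tmpfile();              /* scratch file, removed on close */
    if (tmp == NULL)
        return -1;
    int fd = fileno(tmp);

    int rc = -1;
    if (ftruncate(fd, 4096) == 0) {     /* the file needs a size before mapping */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            memcpy(p, msg, sizeof msg); /* treat the file like a block of memory */
            /* pread() on the descriptor sees what the mapping wrote. */
            if (pread(fd, out, sizeof msg, 0) == (ssize_t)sizeof msg)
                rc = 0;
            munmap(p, 4096);
        }
    }
    fclose(tmp);                        /* also closes fd */
    return rc;
}
```

MAP_SHARED is what ties the mapping to the file. With MAP_PRIVATE the memcpy would only touch a private copy.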
Fun fact: At least on Linux, malloc may get its memory from mmap internally (as an anonymous mapping rather than a named file; well, it depends, see this nice article for a writeup of how convoluted the whole thing is). So it's kinda files all the way down.
Of course, it's a bit more complicated than that. At least on Linux, a file descriptor is an interface to an open file description, which is a kernel abstraction. One descriptor can point to one description, which can point to one file only. Yet one file could theoretically have multiple descriptions pointing to it. If two separate processes open the same file simultaneously, they will have two independent file descriptions, each with its own descriptor. Hell, you could even have multiple descriptions in one program pointing to the same file (each with its own independent descriptor).
Descriptors that point to the same description share its file offset (if one descriptor changes the description's offset with lseek, any other descriptor pointing to that same description will immediately see it) as well as its file status flags. Separate descriptions are independent: you could open a file in O_APPEND mode, and that same file could then be opened in a different context as O_NONBLOCK.
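To make the descriptor/description split concrete, here is a small sketch that compares a dup()ed descriptor (same description) against a second open() of the same path (new description). The helper name offset_demo and the /tmp scratch path are assumptions for illustration:

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: reports the offsets seen through a dup()ed descriptor
 * (same description) and a separately open()ed one (new description) after
 * seeking the original to byte 3. Returns 0 on success. */
int offset_demo(off_t *via_dup, off_t *via_open)
{
    char path[] = "/tmp/offset-demo-XXXXXX";   /* assumed writable scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int rc = -1;
    int same = dup(fd);                 /* second descriptor, same description */
    int other = open(path, O_RDONLY);   /* same file, brand-new description */
    if (same >= 0 && other >= 0 &&
        write(fd, "abcdef", 6) == 6 &&
        lseek(fd, 3, SEEK_SET) == 3) {
        *via_dup = lseek(same, 0, SEEK_CUR);    /* follows fd */
        *via_open = lseek(other, 0, SEEK_CUR);  /* has its own offset */
        rc = 0;
    }

    if (same >= 0) close(same);
    if (other >= 0) close(other);
    close(fd);
    unlink(path);
    return rc;
}
```

The dup()ed descriptor reports offset 3 because it shares the description with fd, while the separately opened one still sits at 0.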
POSIX specifies that read and write calls on regular files are atomic with respect to each other, meaning that you can never observe the intermediary state between before the operation and after the operation. This does not necessarily mean that each call will transfer everything you asked for, as the various read/write syscalls return a value indicating how many bytes they were actually able to transfer.
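In practice that return value means you loop. A minimal sketch of the usual pattern, with write_all being a made-up helper name rather than a libc function:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes are out.
 * Returns 0 on success, -1 on error (errno is set by write). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before any data moved */
            return -1;
        }
        p += n;                         /* short write: advance and retry */
        len -= (size_t)n;
    }
    return 0;
}
```

Note that once you loop like this, the whole transfer is no longer a single atomic operation, only each individual write() is.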
It gets even more complicated when you consider the page cache: when you perform an operation on a file, it first goes to the page cache, and only then to the underlying file. If your file was opened with O_DIRECT | O_SYNC, then, uh…
¯\_(ツ)_/¯
Point is that reasoning about files is inherently an unsafe process. You can't (normally) make any real guarantee that a file you're manipulating will continue to exist. Another process could come right in and destroy it. Its lifetime is external to yours, and interfacing with an object whose lifetime you don't own, and which exists in a constant global, mutable context, is a scary, scary prospect.
What about MacOS or Windows?
I wish I knew the internals for how this is done on MacOS. I don't have a clue, nor do I know how to find out. Please leave a comment if you do know :)
Don't tell me how it works on Windows. I don't care, nor do I want to know. I'm a Unix guy at heart; once I get DLSS Frame Generation on Linux I'll never willingly touch a Windows machine again.
But why?
We may not be able to control the lifetime of the underlying file, but we can control the lifetime of a descriptor that points to it. This is a stronger guarantee than whatever we were dealing with before. If you create a file owned by your program (via the memfd_create call), you now own the entire lifetime of the "file". This also kinda applies to both POSIX and SysV shared memory segments, though their lifetime is a little murkier in that you have to manually destroy them when done. In practice though, if you're careful, you can achieve similar-ish results.
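A minimal sketch of creating such a file, assuming Linux 3.17+ and glibc 2.27+ (older glibc has no memfd_create wrapper). The helper name make_memfd is made up for illustration:

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create an anonymous in-memory file holding `len`
 * bytes from `data`, rewound to offset 0. Returns the descriptor or -1. */
int make_memfd(const char *name, const void *data, size_t len)
{
    int fd = memfd_create(name, MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len ||
        lseek(fd, 0, SEEK_SET) < 0) {
        close(fd);
        return -1;
    }
    /* The file lives until the last reference to it (descriptor or
     * mapping) goes away; no other process can unlink it by path. */
    return fd;
}
```

The name only shows up for debugging purposes (for example under /proc/self/fd), so it does not need to be unique.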
Unix domain sockets are the killer feature here. Through the sendmsg and recvmsg syscalls (which the read and write syscalls are just a nicer abstraction over), we can duplicate a file descriptor into another process with the SCM_RIGHTS control message.
Here's our producer/sender:
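A minimal sketch of what such a sender can look like, assuming `sock` is an already-connected AF_UNIX socket. The function name send_fd is hypothetical, and the layout follows the pattern from the cmsg(3) man page:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: send descriptor `fd` over the connected Unix domain socket `sock`.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;                       /* one byte of ordinary data to carry the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    /* Control buffer sized for exactly one int, aligned for cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;
    memset(&ctrl, 0, sizeof ctrl);

    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof ctrl.buf,
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;        /* "the payload is file descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```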
It's disgusting. I know. It uses a heavily macro-based API to manipulate the various parts of the message we send.
Moving on to the consumer/receiver:
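A matching sketch for the receiving side, again assuming `sock` is a connected AF_UNIX socket and that the peer attaches exactly one descriptor. The function name recv_fd is hypothetical:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: receive one descriptor from the Unix domain socket `sock`.
 * Returns the new descriptor, or -1 on error or if none was attached. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    /* Room for one int of control data, aligned for cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctrl;

    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = ctrl.buf,
        .msg_controllen = sizeof ctrl.buf,
    };

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    /* Only accept a single SCM_RIGHTS message carrying exactly one int. */
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL ||
        cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    int fd;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;   /* a fresh descriptor number in this process */
}
```

The kernel installs the descriptor in the receiving process as part of recvmsg, so the returned integer is usable right away and must eventually be closed like any other descriptor.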
Note that the integer associated with the file description can be different in each process (and honestly, unless you're extraordinarily lucky, it will be), but it will point to the same description. You can verify this by manipulating the file offset with lseek and observing it in the other process, or by using the kcmp syscall. The kernel maintains a mapping of file descriptor → file description.
You now need to find a method of synchronizing access through the two descriptors. Note that fcntl locks will NOT work as you expect for synchronization (see fcntl-atomic-locks), so you need a different mechanism. Futex-based primitives like semaphores and mutexes work, but you either need a separate mechanism to share them with the other process, or you need to mmap the data behind the descriptor, place the futex inside that mapping, and have the consumer map it too so both sides own it.
I prefer using eventfd though, since you can send it along with the descriptor over the Unix domain socket. It does involve read/write syscalls since it doesn't live in user space, but life is all about tradeoffs. You can also do epoll nonsense with it if you're into that like me.
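A minimal sketch of eventfd as a "data is ready" signal, assuming the descriptor was created with eventfd(0, 0) (so no EFD_SEMAPHORE and no EFD_NONBLOCK). The helper names notify and wait_for_notify are made up for illustration:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side: add 1 to the eventfd counter, waking a blocked reader.
 * Returns 0 on success, -1 on error. */
int notify(int efd)
{
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Consumer side: block until the counter is non-zero, then reset it to 0.
 * Returns the number of notifications since the last wait, or -1 on error. */
int64_t wait_for_notify(int efd)
{
    uint64_t count;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count)
        return -1;
    return (int64_t)count;
}
```

Since an eventfd is just another descriptor, it can travel over the socket in the same SCM_RIGHTS message as the data descriptor, and the consumer can also hand it to epoll instead of blocking in read.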
See the man pages for more on Unix domain sockets and control messages in general.