87flowers ~ $

understanding unix socket cmsgs: why your examples are wrong

13 Oct 2024 programming linux c zig

why look at cmsgs?
representation
1. summary
wave-particle duality
giving away your rights
1. truncation
2. leaking through padding
conclusion
references

This article is mainly intended to document my own learning so I can reference the details in the future when I forget them. I am not a POSIX expert.

If you want the tl;dr spoiler to the title: some example code that I first found on the web when I looked at how to handle socket control messages is incorrect, because it doesn't account for the possibility of denial of service via file descriptor table pollution. Not blindly using example code you don't understand is generally a good rule of thumb. You can skip to the section on leaking through padding to just read about this if you wish.

why look at cmsgs?

I have very recently learnt Zig. It is a fun little language that fits in your head. It feels like a better C. I was looking at reimplementing the Wayland protocol in Zig. This led me to look into how the Wayland protocol is implemented and, more relevantly, how it uses local Unix sockets for communication.

Some of the messages that the Wayland protocol specifies require passing file descriptors between client and server. Wayland achieves this by sending and receiving control messages over the socket, specifically the SCM_RIGHTS ancillary message.

You cannot serialize file descriptors as integers then send them over a socket because this is nonsensical: file descriptors are process-unique identifiers; they are handles that reference your process's file descriptor table. In order to transfer an entry from your file descriptor table into another process's, you require the intervention of the kernel. Thus, the kernel has to interpret cmsgs and do the appropriate transfer and translation of the fds.

As I had never dealt with cmsgs before, there were a few questions that immediately came to mind and caused me to stall here until I was satisfied:

Sockets represent a stream of data. How are control messages associated with the data stream? How do you know that a specific control message is associated with those particular bytes of data? (Understanding this was a very interesting diversion into the kernel. Unfortunately the Wayland protocol relies on none of this knowledge and explicitly states so in its spec.)
What happens when multiple control messages are sent in a short period of time? How should you handle this? Do you need to handle this?
As the Zig standard library doesn't include control message helpers, how are control messages actually represented in memory? How should I generate and parse them?
And lastly, are there any pitfalls I need to be aware of from a security perspective?

representation

Let us answer the third question first — How are cmsgs represented?

This is pretty easy to answer by looking at the Linux source code.

struct user_msghdr {
        void            __user *msg_name;       /* ptr to socket address structure */
        int             msg_namelen;            /* size of socket address structure */
        struct iovec    __user *msg_iov;        /* scatter/gather array */
        __kernel_size_t msg_iovlen;             /* # elements in msg_iov */
        void            __user *msg_control;    /* ancillary data */
        __kernel_size_t msg_controllen;         /* ancillary data buffer length */
        unsigned int    msg_flags;              /* flags on received message */
};

User programs send and receive control messages via sendmsg and recvmsg. These functions take a pointer to a struct msghdr. Control messages are also known as ancillary data. This struct allows one to specify a buffer containing the control messages using msg_control and msg_controllen.

The associated control messages are concatenated together to fill this buffer.

struct cmsghdr {
        __kernel_size_t cmsg_len;       /* data byte count, including hdr */
        int             cmsg_level;     /* originating protocol */
        int             cmsg_type;      /* protocol-specific type */
};

Each control message has a standard header which specifies the size of the message and the type of message. The payload of the control message appears immediately after the header.

#define CMSG_ALIGN(len) ( ((len)+sizeof(long)-1) & ~(sizeof(long)-1) )

#define CMSG_DATA(cmsg) \
        ((void *)(cmsg) + sizeof(struct cmsghdr))
#define CMSG_USER_DATA(cmsg) \
        ((void __user *)(cmsg) + sizeof(struct cmsghdr))
#define CMSG_SPACE(len) (sizeof(struct cmsghdr) + CMSG_ALIGN(len))
#define CMSG_LEN(len) (sizeof(struct cmsghdr) + (len))

Each control message is padded to the alignment of a long (see CMSG_SPACE).
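
To see what that padding means for SCM_RIGHTS, here is a minimal sketch using the userspace versions of these macros from <sys/socket.h>. The helper name fds_that_fit is made up for illustration; it only computes how many ints fit after the header in a given amount of control space.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: how many ints (fds) fit in the payload area of a
 * single cmsg that occupies `space` bytes of control buffer? */
static size_t fds_that_fit(size_t space) {
    if (space < CMSG_LEN(0))
        return 0;                       /* no room even for the header */
    return (space - CMSG_LEN(0)) / sizeof(int);
}
```

On a typical 64-bit Linux system, CMSG_LEN(sizeof(int)) is 20 and CMSG_SPACE(sizeof(int)) is 24, so fds_that_fit(CMSG_SPACE(sizeof(int))) comes out as 2: the padding leaves room for a second fd.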

struct cmsghdr *cmsg;
for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (!CMSG_OK(msg, cmsg)) goto error;
        /* ... process cmsg ... */
}

The macros CMSG_FIRSTHDR, CMSG_NXTHDR and CMSG_OK are provided to iterate over cmsgs. CMSG_NXTHDR walks over the cmsgs by advancing by each message's cmsg_len, rounded up with CMSG_ALIGN.

summary

When sending a control message, we fill a buffer with cmsgs, each occupying CMSG_SPACE(len) bytes, point msghdr::msg_control at it, and pass the msghdr to sendmsg.

When receiving a control message, we give recvmsg a buffer via msghdr::msg_control that is at least large enough to hold the cmsgs we expect.
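
The sending half of this summary can be sketched in userspace C as follows. This is a minimal sketch, not a definitive implementation: the function name send_fd and the choice to send a single placeholder byte alongside the fd are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: send one byte of data plus a single fd as an SCM_RIGHTS cmsg.
 * Returns 0 on success, -1 on failure. */
static int send_fd(int sock, int fd) {
    char byte = 'x';    /* placeholder data byte travelling with the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    /* The union forces the buffer to be aligned for struct cmsghdr. */
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;
    memset(&control, 0, sizeof(control));   /* no uninitialised padding */

    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);    /* CMSG_SPACE: padded size */

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));      /* header + payload, unpadded */
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```

Note the two different macros: the buffer is sized with CMSG_SPACE, while cmsg_len is set with CMSG_LEN so that it describes exactly one fd.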

wave-particle duality

The fact we were dealing with a stream stumped me the most. How do we associate these effectively out-of-band control messages with specific bytes in the data stream? I wanted to be able to write straightforward code that would be able to handle data at the same time as its associated control messages. Were there any guarantees that help with this?

It turns out that Linux's behavior exposes the underlying packet-based reality somewhat.

The control message is attached to the data bytes that were sent in the same sendmsg call. However, such packets may be split by the kernel, for example if one is doing multiple reads that are smaller than the buffer originally sent. When this happens, the control message will be associated with the first split packet of the range.

Interestingly, if a range of bytes contains a control message, Linux will stop a read early. This occurs whenever your remote sends a control message, even if you are not expecting to receive one. This means you should never have to read more than one batch of control messages at once.

Unfortunately, Wayland does not make use of the above guarantee at all. Wayland specifies that control messages can turn up at any point during the data stream. To quote the Wayland protocol documentation: (emphasis mine)

The protocol does not specify the exact position of the ancillary data in the stream, except that the order of file descriptors is the same as the order of messages and fd arguments within messages on the wire.

In particular, it means that any byte of the stream, even the message header, may carry the ancillary data with file descriptors.

Clients and compositors should queue incoming data until they have whole messages to process, as file descriptors may arrive earlier or later than the corresponding data bytes.

If you look at how the wire protocol is implemented in libwayland, there is indeed no guarantee that the file descriptors are sent with the associated data bytes. This means (for Wayland at least) we are forced to buffer data and fds until whole messages are available to parse. Yay.
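
One way to do that buffering is to keep received fds in a FIFO that is separate from the byte buffer, and have the message parser pop an fd whenever it decodes an fd argument. The sketch below is only an illustration of that idea: the names fd_queue, fd_queue_push and fd_queue_pop and the capacity of 64 are all made up, and this is not how libwayland structures its buffers.

```c
#include <assert.h>
#include <stddef.h>

#define FD_QUEUE_CAP 64   /* arbitrary capacity chosen for this sketch */

/* Hypothetical FIFO of fds received so far but not yet claimed by a message. */
struct fd_queue {
    int fds[FD_QUEUE_CAP];
    size_t head;    /* index of the oldest queued fd */
    size_t count;   /* number of queued fds */
};

/* Append an fd in arrival order. Returns 0, or -1 if the queue is full. */
static int fd_queue_push(struct fd_queue *q, int fd) {
    if (q->count == FD_QUEUE_CAP)
        return -1;
    q->fds[(q->head + q->count) % FD_QUEUE_CAP] = fd;
    q->count++;
    return 0;
}

/* Pop the oldest fd, or return -1 if no fd has arrived yet. */
static int fd_queue_pop(struct fd_queue *q) {
    if (q->count == 0)
        return -1;
    int fd = q->fds[q->head];
    q->head = (q->head + 1) % FD_QUEUE_CAP;
    q->count--;
    return fd;
}
```

Because the protocol only guarantees that fds arrive in the same order as the fd arguments on the wire, a FIFO is enough. If fd_queue_pop returns -1 while parsing, the fd simply has not arrived yet and the message has to wait.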

giving away your rights

One final question is whether there are any pitfalls to watch out for when passing file descriptors over Unix sockets. Here we'll focus on one specific Unix socket control message: SCM_RIGHTS.

There are several pitfalls to be aware of when handling SCM_RIGHTS messages:

truncation

If you do not provide a sufficiently large buffer to receive control messages, the list of file descriptors will be truncated. This means that you may receive fewer file descriptors than you expect; the missing file descriptors are lost to you. Linux closes any such truncated file descriptors for us, so these won't pollute your fd table. Linux also specifies an upper limit on the number of file descriptors you can send at once (SCM_MAX_FD), so if you want to guarantee you receive all file descriptors, you can provide a buffer large enough for that many.

For Wayland specifically, if you wish to ensure compatibility with libwayland, this means you need to be able to receive at least 28 file descriptors, since that is the maximum number it will send at once. This is perhaps an application of Postel's law. I would also recommend sending far fewer than this at once to ensure compatibility with other implementations.
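
A receive call sized for that limit might look like the sketch below. The kernel reports truncated ancillary data by setting MSG_CTRUNC in msg_flags, so the sketch surfaces that flag to the caller. The names recv_with_control, fd_control and MAX_FDS are assumptions for illustration, and the caller still has to parse the control buffer afterwards.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define MAX_FDS 28   /* the libwayland figure quoted above */

/* Control buffer big enough for one SCM_RIGHTS cmsg carrying MAX_FDS fds.
 * The union forces alignment suitable for struct cmsghdr. */
union fd_control {
    char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
    struct cmsghdr align;
};

/* Receive up to `len` data bytes plus any ancillary data.
 * On success returns the byte count, stores the number of control bytes
 * written in *control_len, and sets *truncated if cmsgs were cut short. */
static ssize_t recv_with_control(int sock, void *data, size_t len,
                                 union fd_control *control,
                                 size_t *control_len, int *truncated) {
    struct iovec iov = { .iov_base = data, .iov_len = len };
    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control->buf;
    msg.msg_controllen = sizeof(control->buf);

    ssize_t n = recvmsg(sock, &msg, 0);
    if (n < 0)
        return -1;

    *control_len = msg.msg_controllen;               /* updated by the kernel */
    *truncated = (msg.msg_flags & MSG_CTRUNC) != 0;  /* ancillary data was dropped */
    return n;
}
```

If *truncated comes back set, some fds never arrived, and for a protocol like Wayland the connection is probably best treated as broken.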

leaking through padding

A random snippet of example code that I found on the internet follows (one of the first hits on Google):

static int do_recvmsg(int sock) {
    struct msghdr msg;
    struct cmsghdr *cmsghdr;
    struct iovec iov[1];
    FILE *fp;
    ssize_t nbytes;
    int i, *p;
    char buf[CMSG_SPACE(sizeof(int))], c;

    iov[0].iov_base = &c;
    iov[0].iov_len = sizeof(c);
    memset(buf, 0x0d, sizeof(buf));
    cmsghdr = (struct cmsghdr *)buf;
    cmsghdr->cmsg_len = CMSG_LEN(sizeof(int));
    cmsghdr->cmsg_level = SOL_SOCKET;
    cmsghdr->cmsg_type = SCM_RIGHTS;
    msg.msg_name = NULL;
    msg.msg_namelen = 0;
    msg.msg_iov = iov;
    msg.msg_iovlen = sizeof(iov) / sizeof(iov[0]);
    msg.msg_control = cmsghdr;
    msg.msg_controllen = CMSG_LEN(sizeof(int));
    msg.msg_flags = 0;

    nbytes = recvmsg(sock, &msg, 0);
    if (nbytes == -1)
        return (1);

    p = (int *)CMSG_DATA(buf);
    printf("recvmsg: %d\n", *p);
    fp = fdopen(*p, "w");
    fprintf(fp, "OK\n");
    fclose(fp);

    return (0);
}

This code allocates CMSG_LEN(sizeof(int)) space for a control message, expecting to receive a single int. It then assumes that when it receives a message, it has received exactly one file descriptor. There are multiple issues with this code.

The first and most obvious error is that this code fails to check for the presence of control messages at all. It:

unnecessarily puts a cmsghdr into the buffer (which will be overwritten by the recvmsg call)
does not verify that we have received control messages (it doesn't look at msg_controllen)
does not parse any cmsghdr that may be present in the buffer

The second error is more subtle: it only expects to receive one file descriptor. This is a mistake. Recall that CMSG_SPACE aligns the length to that of a long. On a 64-bit system, this means two file descriptors fit into that buffer. If you expect only one, you will leak the other file descriptor. A misbehaving remote can exhaust your file descriptor table in this manner.

This error is unfortunately common. You can see examples of this on Stack Overflow, on GitHub, and in blog posts. (The last example is subtle: the example code does check for exactly one file descriptor, but would end up leaking two file descriptors if the remote sends two, since it closes neither.)

What you should do in this instance is to read cmsghdr::cmsg_len and calculate how many file descriptors you have received. You should also iterate over the buffer to deal with all present cmsgs.
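
A minimal sketch of that advice follows. It walks every cmsg in a msghdr that recvmsg has already filled in, derives the fd count from cmsg_len, keeps as many fds as the caller asked for, and closes the rest. The name take_fds and the policy of silently closing surplus fds are assumptions for illustration; a stricter receiver might treat extras as a protocol error.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Walk all cmsgs in `msg`. Up to `max` received fds are copied to `out`;
 * any surplus fds are closed rather than leaked.
 * Returns the number of fds stored in `out`. */
static size_t take_fds(struct msghdr *msg, int *out, size_t max) {
    size_t taken = 0;
    for (struct cmsghdr *cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level != SOL_SOCKET || cmsg->cmsg_type != SCM_RIGHTS)
            continue;               /* not an fd-passing message */
        if (cmsg->cmsg_len < CMSG_LEN(0))
            break;                  /* malformed header; stop walking */

        /* cmsg_len covers header + payload, so this is the real fd count. */
        size_t nfds = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);
        for (size_t i = 0; i < nfds; i++) {
            int fd;
            memcpy(&fd, CMSG_DATA(cmsg) + i * sizeof(int), sizeof(int));
            if (taken < max)
                out[taken++] = fd;
            else
                close(fd);          /* unexpected extra fd: don't let it pile up */
        }
    }
    return taken;
}
```

With this shape, a remote that sneaks a second fd into the padding gains nothing: the extra fd is closed as soon as the control buffer is parsed.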

conclusion

This was a fun detour into an area of Unix I was previously unfamiliar with. I hope you learnt something.