Our engineers closely follow the work of Linux kernel contributors. By reading commits while they are still under review, as well as the commits later accepted into the kernel, we see bugs being introduced before anyone else does. This also means that our customers get fixes for these issues before anyone else.

Let's look at a recent example, in which an attempt to fix a security vulnerability introduced another bug that still exists in Linux at the time of writing.

In a changeset proposed on November 21, 2019, a Marvell developer attempted to fix CVE-2019-14895, a remotely triggerable kernel heap buffer overflow in a Linux wireless driver (mwifiex). If your computer has this particular wireless chipset, simply associating with a malicious Wi-Fi access point could lead to system compromise.

The relevant function looked like the following before the proposed fix:


static int mwifiex_process_country_ie(struct mwifiex_private *priv,
				      struct cfg80211_bss *bss)
{
	const u8 *country_ie;
	u8 country_ie_len;
	struct mwifiex_802_11d_domain_reg *domain_info =
					&priv->adapter->domain_reg;

	rcu_read_lock();
	country_ie = ieee80211_bss_get_ie(bss, WLAN_EID_COUNTRY);
	if (!country_ie) {
		rcu_read_unlock();
		return 0;
	}

	country_ie_len = country_ie[1];
	if (country_ie_len < IEEE80211_COUNTRY_IE_MIN_LEN) {
		rcu_read_unlock();
		return 0;
	}

	if (!strncmp(priv->adapter->country_code, &country_ie[2], 2)) {
		rcu_read_unlock();
		mwifiex_dbg(priv->adapter, INFO,
			    "11D: skip setting domain info in FW\n");
		return 0;
	}
	memcpy(priv->adapter->country_code, &country_ie[2], 2);
(...)

Two things are important here. First, mwifiex_process_country_ie takes a lock early on by calling rcu_read_lock(), which enters an RCU read-side critical section. Second, every one of the function's five return paths (only three appear in the excerpt above) releases that lock by calling rcu_read_unlock() before returning.

The proposed fix for the vulnerability adds a new return path for the case where a malicious access point sends an oversized Country element that would otherwise overflow the destination buffer. With the patch applied, the code looks like the following:


	rcu_read_lock();
	country_ie = ieee80211_bss_get_ie(bss, WLAN_EID_COUNTRY);
	if (!country_ie) {
		rcu_read_unlock();
		return 0;
	}

	country_ie_len = country_ie[1];
	if (country_ie_len < IEEE80211_COUNTRY_IE_MIN_LEN) {
		rcu_read_unlock();
		return 0;
	}

	if (!strncmp(priv->adapter->country_code, &country_ie[2], 2)) {
		rcu_read_unlock();
		mwifiex_dbg(priv->adapter, INFO,
			    "11D: skip setting domain info in FW\n");
		return 0;
	}

	if (country_ie_len >
	    (IEEE80211_COUNTRY_STRING_LEN + MWIFIEX_MAX_TRIPLET_802_11D)) {
		mwifiex_dbg(priv->adapter, ERROR,
			    "11D: country_ie_len overflow!, deauth AP\n");
		return -EINVAL;
	}

	memcpy(priv->adapter->country_code, &country_ie[2], 2);
(...)

As can be seen above, the new return path returns -EINVAL without calling rcu_read_unlock(), so the lock is never released.

What can happen when a lock isn't released? To demonstrate the consequences, we built a custom kernel with a similar bug deliberately added to the setuid syscall in kernel/sys.c:


long __sys_setuid(uid_t uid)
{
        struct user_namespace *ns = current_user_ns();
        const struct cred *old;
        struct cred *new;
        int retval;
        kuid_t kuid;

        if (uid == 42) {
                rcu_read_lock();
                return -EINVAL;
        }
(...)

After compiling a kernel with this artificial bug, we wrote a tiny program, bugtest.c, that calls setuid(42). Running it logs the following kernel warning to the serial console:


[   63.619647] ================================================
[   63.623054] WARNING: lock held when returning to user space!
[   63.626525] 5.4.0 #2 Not tainted
[   63.628444] ------------------------------------------------
[   63.631805] bugtest/1684 is leaving the kernel with locks still held!
[   63.635642] 1 lock held by bugtest/1684:
[   63.638087]  #0: ffffffff89498860 (rcu_read_lock){....}, at: __sys_setuid+0x108/0x180

Our machine is now in an inconsistent state, assuming it is still running at all. (If the kernel has lockdep enabled and is configured with panic_on_warn, it is not.)

Getting back to the original Linux issue, let's look at the timeline of this new bug. The bug was introduced in a proposed Linux commit on November 21, 2019. A week later, the commit was reviewed and approved. It was subsequently merged into the mainline kernel and shipped in the Linux 5.4 release on November 25, 2019. On January 14, 2020, the buggy commit was backported to all stable kernels, specifically 4.19.96, 4.14.165, 4.9.210, and 4.4.210.

At several points along the way, the bug went undetected by reviewers and maintainers, despite being plainly obvious when the code is viewed in full rather than as a diff. The bug remains in the Linux kernel today.