Fixing Name Resolution in Ubuntu 18.04

We have a server (really a powerful workstation) running Ubuntu Linux that sits in the corner of a professor’s office which we use for teaching the Modelling Materials course. It had been on Ubuntu 16.04 since we installed it last year, but I upgraded it to 18.04 about a month ago ahead of the start of term.

Everything worked fine, except for one pretty annoying thing: it quickly became clear that primary group names weren’t resolving. Anything that needed to check which groups you were in would print an error message in the terminal, although it would still work. For example, we use the Lmod module system to load some of the software we use. Lmod lets you restrict modules to certain groups, so it checks group membership whenever you load a module. As a result, module commands would output errors (but still work), since the system couldn’t find the name for your primary group ID. Likewise, a simple groups myusername would give the error myusername: cannot find name for group ID 1111.

A little while later, I noticed the system would also forget who I was about 5 minutes after logging in. Right after logging in I could run whoami and it would respond with myusername. Waiting a while and trying again would give the error whoami: cannot find name for user ID 111111. But if I ran id myusername I’d get the correct response, and whoami would work again for about another 5 minutes. Since file ownership is actually recorded by user ID and group ID rather than by username and group name (which you may have noticed if you’ve ever formatted an external drive as e.g. ext4 and tried to use it on machines where you have different user IDs), almost everything still worked fine. But you needed to remind the system who you were if you wanted to use sudo, for example.
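Both symptoms can be checked on demand by asking NSS directly for the names behind your numeric IDs. Here is a minimal sketch of that, assuming getent is installed (it ships with glibc on Ubuntu). The name check_id is a helper made up for this example, not something from the original setup.

```shell
#!/bin/sh
# check_id is a hypothetical helper, not part of any package.
# It asks NSS (via getent) for the entry behind a numeric ID and
# prints the name, or reports that nothing came back.
check_id() {
    db=$1        # NSS database: "passwd" or "group"
    id_num=$2    # numeric uid or gid to resolve
    if entry=$(getent "$db" "$id_num"); then
        # The name is the first colon-separated field of the entry.
        printf '%s %s -> %s\n' "$db" "$id_num" "${entry%%:*}"
    else
        printf '%s %s -> no name found\n' "$db" "$id_num"
        return 1
    fi
}

# Resolve the current user's uid and primary gid.
check_id passwd "$(id -u)" || true
check_id group "$(id -g)" || true
```

Running this right after login and again a few minutes later should show whether either name has gone missing in the meantime.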

All this together points to something going wrong with the caching of user and group names.

The installation is one of the standard college Linux systems. These are set up so that people can log in with the same username and password that they use for other college services like email. Poking around in /etc/ I could see that it uses LDAP to fetch user credentials from a college server, which are then handled by SSSD. We can test various parts of this to confirm where we think the problem lies. Since I was able to log in all right with an LDAP account, I figured that the LDAP lookup was working fine. I had initially thought there could be some issue with the group LDAP database, so I tested this directly with ldapsearch.

First confirm I can search for my username in the LDAP database with:

ldapsearch -H ldap:// -b ou=orgunit,dc=ic,dc=ac,dc=uk -x uid=myusername

This listed several entries for me, and had fields with my uidNumber and gidNumber, which matched what’s returned by id myusername.
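Picking the two numbers out of the ldapsearch output by eye gets tedious, so the comparison can be scripted. This is only a sketch: ldif_attr is a made-up helper name, and the sample entry below is a stand-in built from the placeholder values used in this post, not real directory data.

```shell
#!/bin/sh
# ldif_attr is a hypothetical helper: print the value(s) of one attribute
# from LDIF text on stdin. Attribute names are matched case-insensitively.
# It only handles plain "name: value" lines, which is enough for numeric IDs.
ldif_attr() {
    awk -v attr="$1" -F': ' 'tolower($1) == tolower(attr) { print $2 }'
}

# Stand-in for the output of the ldapsearch command above (made-up sample).
sample='uid: myusername
uidNumber: 111111
gidNumber: 1111'

printf '%s\n' "$sample" | ldif_attr uidNumber
printf '%s\n' "$sample" | ldif_attr gidNumber
```

In practice you would pipe the real ldapsearch output into ldif_attr and compare the results with id -u myusername and id -g myusername.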

Now I can search for the group I should belong to with:

ldapsearch -H ldap:// -b ou=Group,ou=orgunit,dc=ic,dc=ac,dc=uk -x gidNumber=1111

This worked and gave the group name. So the LDAP database seems fine and we’re able to look things up in it and get the results we expect.

The other thing we can do is look at the file /etc/nsswitch.conf, which gives the NSS configuration. This essentially tells the system which sources of information to check, and in what order, when resolving names. Our first few entries were:

passwd:         files sss systemd
group:          files sss systemd
shadow:         files sss
gshadow:        files sss
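To pull out the lookup order for just one database from that file, something like the following works. It is a rough sketch: nss_sources is a helper name invented here, and it assumes the usual one-line-per-database layout shown above.

```shell
#!/bin/sh
# nss_sources is a hypothetical helper: print the sources configured for
# one NSS database, in lookup order, one per line.
#   $1  database name, e.g. "group"
#   $2  config file (defaults to /etc/nsswitch.conf)
nss_sources() {
    db=$1
    conf=${2:-/etc/nsswitch.conf}
    awk -v db="$db" '
        {
            sub(/#.*/, "")            # drop comments
            gsub(/\[[^]]*\]/, "")     # drop actions such as [NOTFOUND=return]
        }
        $1 == (db ":") {
            for (i = 2; i <= NF; i++) print $i
        }
    ' "$conf"
}

# Show the order used for group lookups on this machine, if the file exists.
if [ -r /etc/nsswitch.conf ]; then
    nss_sources group
fi
```

With the entries above, the group database would list files, then sss, then systemd.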

The passwd and group entries list where user and group information will be found. files indicates it should look in system files like /etc/passwd and /etc/group. sss says it should check with the SSSD daemon. I changed the group entry to say:

group:          files ldap systemd

and now the groups command works every time, but is a bit slow. So at this point it was really looking like the issue was with sssd. I hadn’t changed anything in the configuration in the upgrade process, and everything had been working as expected previously. Maybe the cache got corrupted somehow during the upgrade. So first I tried clearing the sss cache with

sss_cache -E

This had no effect. I also deleted the various cache files and let them be recreated with

systemctl stop sssd
rm -rf /var/lib/sss/db/*
systemctl start sssd

This also had no effect.

Checking man sssd.conf indicated I could get some more verbose logging by setting debug_level in the various sections of the sssd.conf file. By default this is set to 0, but it can be increased up to 9 for very verbose output. Setting this to 4 and doing a failed whoami gave the following in /var/log/sssd/sssd_nss.log:

(Fri Oct 12 10:14:05 2018) [sssd[nss]] [cache_req_locate_dom_cache_done] (0x0040): cache_req_search_recv returned [1432158300]: ID is outside the allowed range
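With debug_level turned up the log fills quickly, so it helps to boil each line down to its level, function name and message. The filter below is a sketch that assumes every line follows the same layout as the one above. The name sss_log_brief is made up for this example.

```shell
#!/bin/sh
# sss_log_brief is a hypothetical helper: reduce SSSD log lines of the form
#   (timestamp) [process] [function] (0xLEVEL): message
# to "0xLEVEL function: message". Lines in any other shape are dropped.
# Reads the files given as arguments, or stdin if there are none.
sss_log_brief() {
    sed -n 's/^([^)]*) \[[^ ]*\] \[\([^]]*\)\] (\(0x[0-9a-fA-F]*\)): \(.*\)$/\2 \1: \3/p' "$@"
}

# Example: shorten the lines that mention the ID range problem.
log=/var/log/sssd/sssd_nss.log
if [ -r "$log" ]; then
    sss_log_brief "$log" | grep 'outside the allowed range'
fi
```

Piping the shortened lines through sort | uniq -c instead gives a rough count of which messages turn up most often.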

Our sssd.conf listed two domain sections: a [domain/local] section with a defined range of ids, and a [domain/default] section which listed the ldap details. Then in the [sssd] section it had the entry domains = local, default. So what should have happened was it would check the local domain first, see that my uid was outside the range, and then check the default domain. But what seemed to be happening was that it never checked the second domain. I guess this must be a bug in the version of sssd that came with the upgraded Ubuntu, which is why we never had this issue before.

We don’t actually use the local domain for anything, so a quick workaround was to change the line domains = local, default to domains = default so that it would just skip the local domain lookup. Then on restarting sssd, everything worked as expected, with both user names and group names resolving and not being randomly forgotten.
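For anyone who would rather preview the change than edit the file by hand, the one-line edit can be written as a filter. This is a sketch: drop_local_domain is a made-up name, the pattern only matches the exact local, default ordering described above, and the sample input is a cut-down stand-in for the real sssd.conf with the section contents left out.

```shell
#!/bin/sh
# drop_local_domain is a hypothetical helper: print the given sssd.conf
# (or stdin) with "domains = local, default" rewritten to "domains = default".
# Every other line is passed through unchanged. Nothing is edited in place.
drop_local_domain() {
    sed 's/^\([[:space:]]*domains[[:space:]]*=[[:space:]]*\)local,[[:space:]]*default[[:space:]]*$/\1default/' "$@"
}

# Example on a stand-in config (section bodies elided).
drop_local_domain <<'EOF'
[sssd]
domains = local, default

[domain/local]

[domain/default]
EOF
```

Comparing the output against the real file with diff shows the single changed line before anything is written back and sssd is restarted.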