We have a server (really a powerful workstation) running Ubuntu Linux that sits in the corner of a professor’s office which we use for teaching the Modelling Materials course. It had been on Ubuntu 16.04 since we installed it last year, but I upgraded it to 18.04 about a month ago ahead of the start of term.
Everything worked fine, except for one pretty annoying thing: it quickly
became clear that primary group names weren’t resolving. This meant that
anytime someone did something that needed to check what groups you were in
you’d get an error message in the terminal, but everything still worked. For
example, we use the Lmod module system to load
in some of the software we use. This allows you to restrict modules to certain
groups, so it checks group membership whenever you load a module. This meant
module commands would output errors (but still work) since it couldn’t find
the name for your primary group id. Also a simple groups myusername would
give the error myusername: cannot find name for group ID 1111.
A little while later, I noticed the system would also forget who I am
about 5 minutes after logging in. So right after logging in I could do
whoami and it would respond with myusername. Waiting a while and trying
again would give the error whoami: cannot find name for user ID 111111. But
if I tried id myusername I'd get the correct response, and whoami would
work again for about another 5 minutes. Since file ownership is actually
via userid and groupid rather than username and groupname (which you may have
noticed if you’ve ever formatted an external drive as e.g. ext4 and tried to
use it on machines where you have different userids) almost everything still
works fine. But you need to remind the system who you are if you want to
use sudo, for example.
All this together points to something going wrong with the caching of user and group names.
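A quick way to check for this symptom without waiting for an error to show up is to ask NSS directly whether a numeric id resolves to a name. The sketch below wraps getent for that; the function name check_id is just an illustrative helper, not something that ships with the system.

```shell
# Report whether a numeric id resolves to a name through NSS.
# check_id is a hypothetical helper name used for illustration.
# Usage: check_id passwd 1000   or   check_id group 1000
check_id() {
    db=$1
    id_number=$2
    if entry=$(getent "$db" "$id_number"); then
        # The first colon-separated field of a passwd/group entry is the name.
        printf '%s %s -> %s\n' "$db" "$id_number" "${entry%%:*}"
    else
        printf '%s %s -> no name found\n' "$db" "$id_number" >&2
        return 1
    fi
}
```

Running check_id passwd "$(id -u)" and check_id group "$(id -g)" a few minutes apart should show whether the user or group lookup is the one that stops resolving.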
The installation is one of the standard college Linux systems. These are set
up so that people can use the same username and password to log in that they
use for other college services like email. Poking around in /etc/ I could
see that it uses LDAP to fetch user credentials from a college server, which
are then handled by SSSD. We can test various parts of this to confirm where
we think the problem lies. Since I'm able to log in all right with an LDAP
account, I figured that the LDAP lookup was working fine. I had initially
thought there could be some issue with the group LDAP database, so I tested
this directly with ldapsearch.
First confirm I can search for my username in the LDAP database with:
ldapsearch -H ldap://ldapserveraddress.ic.ac.uk -b ou=orgunit,dc=ic,dc=ac,dc=uk -x uid=myusername
This listed several entries for me, including fields with my uidnumber and
gidnumber, which match what's returned by id myusername.
Now I can search for the group I should belong to with:
ldapsearch -H ldap://ldapserveraddress.ic.ac.uk -b ou=Group,ou=orgunit,dc=ic,dc=ac,dc=uk -x gidNumber=1111
This worked and gave the group name. So the LDAP database seems fine and we're able to look things up in it and get the results we expect.
The other thing we can do is look at the file /etc/nsswitch.conf, which
gives the NSS configuration. This essentially tells the system which sources
of information to check, and in what order, when resolving names. Our first
few entries were
passwd: files sss systemd
group: files sss systemd
shadow: files sss
gshadow: files sss
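To see the lookup order for a single database at a glance, something like the following sketch can help. It reads an nsswitch.conf-style file and prints the sources for one database in order. The function name nss_sources is hypothetical, and the parsing is deliberately simple: it drops comments and bracketed action items such as [NOTFOUND=return] and prints whatever is left.

```shell
# Print the lookup sources for one NSS database, in order, one per line.
# nss_sources is a hypothetical helper name used for illustration.
# Usage: nss_sources group [/path/to/nsswitch.conf]
nss_sources() {
    db=$1
    conf=${2:-/etc/nsswitch.conf}
    awk -v db="$db:" '
        {
            sub(/#.*/, "")           # drop trailing comments
            gsub(/\[[^]]*\]/, "")    # drop action items like [NOTFOUND=return]
        }
        $1 == db {
            for (i = 2; i <= NF; i++) print $i
        }
    ' "$conf"
}
```

With the entries shown above, nss_sources group would print files, sss and systemd on separate lines.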
The passwd and group entries list where user and group information will
be found. files indicates it should look in system files like /etc/passwd
and /etc/group. sss says it should check with the sss daemon. I changed
the group entry to say:
group: files ldap systemd
and now the groups command works every time, but is a bit slow. So at this
point it was really looking like the issue was with sssd. I hadn't changed
anything in the configuration in the upgrade process, and everything had been
working as expected previously. Maybe the cache got corrupted somehow during
the upgrade. So first I tried clearing the sss cache with
sss_cache -E
This had no effect. I also deleted the various cache files and let them be recreated with
systemctl stop sssd
rm -rf /var/lib/sss/db/*
systemctl start sssd
This also had no effect.
Checking man sssd.conf indicated I could get some more verbose logging by
setting debug_level in the various sections of the sssd.conf file. By
default this is set to 0, but can be increased up to 9 for very verbose output.
Setting this to 4 and doing a failed whoami gave the following in
/var/log/sssd/sssd_nss.log:
(Fri Oct 12 10:14:05 2018) [sssd[nss]] [cache_req_locate_dom_cache_done] (0x0040): cache_req_search_recv returned [1432158300]: ID is outside the allowed range
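At higher debug levels these logs get long, so it can help to filter for the failure messages only. The sketch below assumes each line carries its level as a hexadecimal bitmask in parentheses, as in the (0x0040) above, and that 0x0010, 0x0020 and 0x0040 are the fatal, critical and serious failure levels, which is my reading of the sssd.conf man page. The function name sssd_failures is hypothetical.

```shell
# Print only the failure-level lines from an SSSD log file.
# sssd_failures is a hypothetical helper name used for illustration.
# Assumption: 0x0010, 0x0020 and 0x0040 mark fatal, critical and
# serious failures, and each line contains "(0x....):" as shown above.
# Usage: sssd_failures /var/log/sssd/sssd_nss.log
sssd_failures() {
    grep -E '\((0x0010|0x0020|0x0040)\):' "$1"
}
```

Since the logs under /var/log/sssd/ are normally only readable by root, this would likely need to be run with sudo or from a root shell.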
Our sssd.conf listed two domain sections: a [domain/local] section with a
defined range of ids, and a [domain/default] section which listed the ldap
details. Then in the [sssd] section it had the entry domains = local, default.
So what should have happened was it would check the local domain
first, see that my uid was outside the range, and then check the default
domain. But what seemed to be happening was that it never checked the second
domain. I guess this must be a bug in the version of sssd that came with the
upgraded Ubuntu, which is why we never had this issue before.
We don’t actually use the local domain for anything, so a quick workaround
was to change the line domains = local, default to domains = default so
that it would just skip the local domain lookup. Then on restarting sssd,
everything worked as expected, with both user names and group names
resolving and not being randomly forgotten.
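For anyone who wants to script the same edit, here is a minimal sketch. It prints a modified copy of the config to stdout rather than editing it in place, so the result can be checked before replacing the real file. It assumes the line reads domains = local, default (give or take spacing), and the function name drop_local_domain is hypothetical.

```shell
# Print a copy of an sssd.conf-style file with "local" removed from the
# start of the domains list. The input file itself is not modified.
# drop_local_domain is a hypothetical helper name used for illustration.
# Assumption: the line looks like "domains = local, default".
# Usage: drop_local_domain /etc/sssd/sssd.conf > /tmp/sssd.conf.new
drop_local_domain() {
    sed -E 's/^([[:space:]]*domains[[:space:]]*=[[:space:]]*)local[[:space:]]*,[[:space:]]*/\1/' "$1"
}
```

After comparing the output against the original with diff and copying it into place, sssd needs restarting (systemctl restart sssd) for the change to take effect.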