Stefan Kooman <stefan(a)bit.nl> writes:
On 3/23/21 11:00 AM, Nico Schottelius wrote:
Stefan Kooman <stefan(a)bit.nl> writes:
OSDs from the wrong class (hdd). Does anyone have a hint on how to fix
this?
Do you have: osd_class_update_on_start enabled?
So this one is a bit funky. It seems to be off, but the behaviour would
indicate it isn't. Checking the typical configurations:
[10:38:53] black2.place6:~# ceph config-key get config/global/osd_class_update_on_start; echo ""
obtained 'config/global/osd_class_update_on_start'
false
[10:39:59] black2.place6:~# ceph-conf -D | grep osd_class_update_on_start
osd_class_update_on_start = true
[10:47:24] black2.place6:~# grep osd_class_update_on_start /etc/ceph/ceph.conf
[10:52:59] black2.place6:~# ceph config dump | grep osd_class_update_on_start
global advanced osd_class_update_on_start false
[10:53:38] black2.place6:~#
So it looks like it's already disabled.
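To compare the sources side by side, here is a minimal Python sketch. It parses the two output formats pasted above: `key = value` lines as printed by `ceph-conf -D`, and whitespace-separated rows as printed by `ceph config dump`. The function names are hypothetical, and the sketch assumes the value in a dump row is the token right after the option name. It works only on the text copied from the session and does not talk to the cluster.

```python
def parse_ceph_conf_dump(text, option):
    """Return the value of `option` from 'key = value' lines, or None."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == option:
            return value.strip()
    return None


def parse_config_dump(text, option):
    """Return {who: value} for `option` from `ceph config dump` style rows.

    Assumption: the value is the token that follows the option name.
    """
    found = {}
    for line in text.splitlines():
        tokens = line.split()
        if option in tokens:
            idx = tokens.index(option)
            if idx + 1 < len(tokens):
                found[tokens[0]] = tokens[idx + 1]
    return found


OPTION = "osd_class_update_on_start"

# Output copied from the session above.
ceph_conf_output = "osd_class_update_on_start = true"
config_dump_output = "global advanced osd_class_update_on_start false"

local_value = parse_ceph_conf_dump(ceph_conf_output, OPTION)
cluster_values = parse_config_dump(config_dump_output, OPTION)

print("ceph-conf -D     :", local_value)
print("ceph config dump :", cluster_values)
if any(value != local_value for value in cluster_values.values()):
    print("sources disagree; the running daemon is the one to ask")
```

With the pasted output the two sources disagree (true locally, false in the cluster configuration), which is the mismatch described above.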
What does a "ceph daemon osd.$id config get osd_class_update_on_start"
give on that host for an OSD that is running there?
That returns
[12:52:24] server6.place6:~# ceph daemon osd.4 config get osd_class_update_on_start
{
"osd_class_update_on_start": "false"
}
for all involved OSDs.
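Checking this for many OSDs by eye gets tedious, so here is a minimal Python sketch that evaluates the JSON printed by `ceph daemon osd.$id config get`. It assumes the value comes back as the string "false" rather than a JSON boolean, as in the output above. The helper name and the `sample_outputs` dictionary are hypothetical; only the osd.4 output is taken from this thread, and output from further OSDs would have to be added in the same way.

```python
import json

OPTION = "osd_class_update_on_start"


def option_is_disabled(raw_json, option):
    """Return True if the admin-socket JSON reports `option` as false."""
    data = json.loads(raw_json)
    # The daemon printed the value as a string above; str() also covers
    # the case where it is a real JSON boolean.
    return str(data.get(option, "")).lower() == "false"


# Output of `ceph daemon osd.4 config get osd_class_update_on_start`,
# copied from above and keyed by OSD id.
sample_outputs = {
    4: '{\n"osd_class_update_on_start": "false"\n}',
}

results = {
    osd_id: option_is_disabled(raw, OPTION)
    for osd_id, raw in sample_outputs.items()
}

for osd_id, disabled in sorted(results.items()):
    state = "disabled" if disabled else "NOT disabled"
    print(f"osd.{osd_id}: {OPTION} is {state}")
```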
It depends on the logging settings of the OSD daemons, but in our case
it was logged to the daemon log, I believe (or syslog, dunno anymore).
It's so strange, because none of the configurations say to use an
"hdd" class. We also don't use it anywhere else (i.e. none of the
classes we use is hdd), so I suspect some built-in default is trying to
set up the class.
I am not sure where ceph-conf reads the value true from, but I assume
it's a builtin.
I was also searching for osd_class_update_on_start on the Internet and
it seems there is no reference to it in the ceph documentation. Do you
have any pointers to it?
Not anymore with new Ceph documentation.
Out of curiosity, do you have any clue why it's not in there anymore?
But the parameter is self-explanatory: the OSD will try to put itself
into the proper class at startup.
Source code: src/common/options.cc
Option("osd_class_update_on_start", Option::TYPE_BOOL, Option::LEVEL_ADVANCED)
    .set_default(true)
    .set_description("set OSD device class on startup"),
The description I am somewhat missing is "set based on which criteria?"
In any case, it seems that the running OSD has the correct class
assigned. However I can see that that OSD has connections open to
unrelated osds:
tcp6 0 0 2a0a:e5c0:2:1:21b:21ff:febc:bf30:6805 2a0a:e5c0:2:1:21b:21ff:febb:68dc:57280 ESTABLISHED 17034/ceph-osd
So something is "not good" or "not correct" with this osd. This
particular one is in a special class that serves 1 pool with only 3 osds
in it. However, this osd has around 200 connections established to, as
far as I can see, most (all?) other osds in the cluster.
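To put a number on "most (all?)", here is a minimal Python sketch that counts established connections per peer address. It assumes netstat-style lines with the column order seen above (proto, recv-q, send-q, local, foreign, state, pid/program). The function names are hypothetical. A peer address identifies a host, not a single OSD, so several OSDs on one host collapse into one entry.

```python
from collections import Counter


def split_host_port(endpoint):
    """Split 'addr:port' at the last colon (works for the IPv6 form above)."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def established_peers(netstat_text, process_name="ceph-osd", pid=None):
    """Count ESTABLISHED connections per foreign address for one program.

    If `pid` is given, only connections owned by that process are counted.
    """
    peers = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local foreign state pid/program
        if len(fields) < 7 or fields[5] != "ESTABLISHED":
            continue
        owner_pid, _, program = fields[6].partition("/")
        if program != process_name:
            continue
        if pid is not None and owner_pid != str(pid):
            continue
        peer_host, _ = split_host_port(fields[4])
        peers[peer_host] += 1
    return peers


# The single line quoted above; real input would be the full listing.
sample = (
    "tcp6 0 0 2a0a:e5c0:2:1:21b:21ff:febc:bf30:6805 "
    "2a0a:e5c0:2:1:21b:21ff:febb:68dc:57280 ESTABLISHED 17034/ceph-osd"
)

peers = established_peers(sample, pid=17034)
print(len(peers), "distinct peer address(es)")
for host, count in peers.most_common():
    print(host, count)
```

Comparing the resulting set of peer addresses with the hosts that carry the 3 osds of the special class would show how many connections go to unrelated hosts.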
To my understanding, it seems wrong that ceph osds form a complete mesh,
especially if they will never exchange data with the osds they are
connected to.
Can somebody confirm that osds should only connect to osds they share
data with?
And if my assumption is correct: is there any way to tell this osd to
behave correctly and only establish connections to osds of the same
class? (i.e. correctly assigning the class)
Best regards,
Nico
--
Sustainable and modern Infrastructures by ungleich.ch