Op 4 feb. 2023 om 00:03 heeft Thomas Cannon
<thomas.cannon(a)pronto.ai> het volgende geschreven:
Hello Ceph community.
The company that recently hired me has a 3-node Ceph cluster that has been running and
stable. I am the new lone administrator here; I do not know Ceph, and this is my first
experience with it.
The issue is that the cluster is/was running out of space, which is why I built a 4th node and
attempted to add it to the cluster. Along the way, things began to break. The
manager daemon failed over from boreal-01 to boreal-02, and I tried to get it to
fail back to boreal-01 but was unable. While working on it yesterday, I
realized that the nodes in the cluster are all running different versions of the software.
I suspect that is a large part of why things aren’t working as expected.
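(For context: my understanding from the docs is that a failback is forced by asking the active mgr to step down so a standby takes over, something like the following; the daemon name here is the one from my `ceph -s` output further down.)

```shell
# Ask the current active mgr to restart so that a standby
# (hopefully boreal-01's) takes over as active.
# Daemon name taken from "ceph -s" -> services -> mgr.
ceph mgr fail boreal-02.lqxcvk
```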
Boreal-01 - the host - 17.2.5:
root@boreal-01:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-01:/home/kadmin#
Boreal-01 - the admin Docker instance running on the host - 17.2.1:
root@boreal-01:/home/kadmin# cephadm shell
Inferring fsid 951fa730-0228-11ed-b1ef-f925f77b75d3
Inferring config /var/lib/ceph/951fa730-0228-11ed-b1ef-f925f77b75d3/mon.boreal-01/config
Using ceph image with id 'e5af760fa1c1' and tag 'v17' created on
2022-06-23 19:49:45 +0000 UTC
quay.io/ceph/ceph@sha256:d3f3e1b59a304a280a3a81641ca730982da141dad41e942631e4c5d88711a66b
root@boreal-01:/# ceph -v
ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
root@boreal-01:/#
Boreal-02 - 15.2.16:
root@boreal-02:/home/kadmin# ceph -v
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
root@boreal-02:/home/kadmin#
Boreal-03 - 15.2.18:
root@boreal-03:/home/kadmin# ceph -v
ceph version 15.2.18 (f2877ae32a72fc25acadef57597f44988b805c38) octopus (stable)
root@boreal-03:/home/kadmin#
And the host I added - Boreal-04 - 17.2.5:
root@boreal-04:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-04:/home/kadmin#
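(If I understand correctly, the version mismatch can also be confirmed cluster-wide in one shot, since the mons report the version of every running daemon:)

```shell
# Summarize the release each running daemon reports,
# grouped by daemon type (mon/mgr/osd/mds).
ceph versions
```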
The cluster isn’t rebalancing data, and drives are filling up unevenly, despite the auto
balancer being on. I can run a df and see that it isn’t working. However, the balancer says it is:
root@boreal-01:/# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.011905",
    "last_optimize_started": "Fri Feb 3 18:39:02 2023",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
root@boreal-01:/#
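(The uneven fill is what I see per OSD; for anyone who wants to see what I mean, I’m going by output along these lines:)

```shell
# Per-OSD utilization laid out by CRUSH tree; the %USE column
# shows some OSDs far fuller than others despite the balancer.
ceph osd df tree
```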
root@boreal-01:/# ceph -s
  cluster:
    id:     951fa730-0228-11ed-b1ef-f925f77b75d3
    health: HEALTH_WARN
            There are daemons running an older version of ceph
            6 nearfull osd(s)
            3 pgs not deep-scrubbed in time
            3 pgs not scrubbed in time
            4 pool(s) nearfull
            1 daemons have recently crashed

  services:
    mon: 4 daemons, quorum boreal-01,boreal-02,boreal-03,boreal-04 (age 22h)
    mgr: boreal-02.lqxcvk(active, since 19h), standbys: boreal-03.vxhpad, boreal-01.ejaggu
    mds: 2/2 daemons up, 2 standby
    osd: 89 osds: 89 up (since 5d), 89 in (since 45h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 549 pgs
    objects: 227.23M objects, 193 TiB
    usage:   581 TiB used, 356 TiB / 937 TiB avail
    pgs:     533 active+clean
             16 active+clean+scrubbing+deep

  io:
    client: 55 MiB/s rd, 330 KiB/s wr, 21 op/s rd, 45 op/s wr

root@boreal-01:/#
Part of me suspects that I exacerbated the problems by monkeying with boreal-04
for several days, trying to get the drives inside the machine turned into OSDs so that
they would be used. One thing I did was attempt to upgrade the software on that machine, and I
could have triggered a cluster-wide upgrade that failed everywhere except nodes 1 and 4. With 2 and 3
not even running the same major release, if I did make that mistake, I can see why, instead
of an upgrade, things got worse.
According to the documentation, I should be able to upgrade the entire cluster by running
a single command on the admin node, but when I go to run orchestrator commands I get errors that even
Google can’t solve:
root@boreal-01:/# ceph orch host ls
Error ENOENT: Module not found
root@boreal-01:/#
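(My possibly wrong understanding is that `ceph orch` is served by a mgr module, and with the active mgr currently on an Octopus node, the cephadm/orchestrator module may be disabled or failing to load. Something like this should at least show whether it’s enabled:)

```shell
# List mgr modules and whether cephadm/orchestrator are enabled.
ceph mgr module ls

# If disabled, the cephadm docs suggest:
ceph mgr module enable cephadm
ceph orch set backend cephadm
```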
Consequently, I have very little faith that running the commands to upgrade everything
to the same version will work. I think upgrading each host could fix
things, but I do not feel confident doing so and risking our data.
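(For reference, the single documented command I mean is, as I understand it, something like the following; it requires `ceph orch` to be working first, which is exactly what fails for me.)

```shell
# Orchestrator-driven rolling upgrade of the whole cluster
# to a specific release.
ceph orch upgrade start --ceph-version 17.2.5
```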
Hopefully that gives a better idea of the problems I am facing. I am hoping to arrange some
professional services hours with someone who is a true expert with this software,
to get us to a stable and sane deployment that can
be managed without it being a terrifying guessing game.
I’ve seen 42on.com recommended before (no affiliation).
If that is you, or if you know someone who can help — please contact me!
Thank you!
Thomas
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io