Op 4 feb. 2023 om 00:03 heeft Thomas Cannon
<thomas.cannon(a)pronto.ai> het volgende geschreven:
Hello Ceph community.
The company that recently hired me has a 3-node Ceph cluster that has been running and
stable. I am the new lone administrator here; I do not know Ceph, and this is my first
experience with it.
The issue is that the cluster is/was running out of space, which is why I built a 4th node and
attempted to add it to the cluster. Along the way, things began to break. The
manager daemon failed over from boreal-01 to boreal-02, and I tried to get it to
fail back to boreal-01 but was unable. While working on it yesterday, I
realized that the nodes in the cluster are all running different versions of the software.
I suspect that is a large part of why things aren’t working as expected.
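(For context: my understanding from the docs is that a failback is forced by asking the active mgr to step down so a standby takes over, something like the following; the daemon name here is the one from my `ceph -s` output further down.)

```shell
# Ask the current active mgr to restart so that a standby
# (hopefully boreal-01's) takes over as active.
# Daemon name taken from "ceph -s" -> services -> mgr.
ceph mgr fail boreal-02.lqxcvk
```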
Boreal-01 - the host - 17.2.5:
root@boreal-01:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-01:/home/kadmin#
Boreal-01 - the admin Docker instance running on the host - 17.2.1:
root@boreal-01:/home/kadmin# cephadm shell
Inferring fsid 951fa730-0228-11ed-b1ef-f925f77b75d3
Inferring config /var/lib/ceph/951fa730-0228-11ed-b1ef-f925f77b75d3/mon.boreal-01/config
Using ceph image with id 'e5af760fa1c1' and tag 'v17' created on
2022-06-23 19:49:45 +0000 UTC
quay.io/ceph/ceph@sha256:d3f3e1b59a304a280a3a81641ca730982da141dad41e942631e4c5d88711a66b
root@boreal-01:/# ceph -v
ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable)
root@boreal-01:/#
Boreal-02 - 15.2.16:
root@boreal-02:/home/kadmin# ceph -v
ceph version 15.2.16 (d46a73d6d0a67a79558054a3a5a72cb561724974) octopus (stable)
root@boreal-02:/home/kadmin#
Boreal-03 - 15.2.18:
root@boreal-03:/home/kadmin# ceph -v
ceph version 15.2.18 (f2877ae32a72fc25acadef57597f44988b805c38) octopus (stable)
root@boreal-03:/home/kadmin#
And the host I added - Boreal-04 - 17.2.5:
root@boreal-04:/home/kadmin# ceph -v
ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
root@boreal-04:/home/kadmin#
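(If I understand correctly, the version mismatch can also be confirmed cluster-wide in one shot, since the mons report the version of every running daemon:)

```shell
# Summarize the release each running daemon reports,
# grouped by daemon type (mon/mgr/osd/mds).
ceph versions
```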
The cluster isn’t rebalancing data, and drives are filling up unevenly, despite the auto
balancer being on. I can run a df and see that it isn’t working. However, the balancer says it is:
root@boreal-01:/# ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.011905",
    "last_optimize_started": "Fri Feb 3 18:39:02 2023",
    "mode": "upmap",
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num is decreasing, or distribution is already perfect",
    "plans": []
}
root@boreal-01:/#
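(The uneven fill is what I see per OSD; for anyone who wants to see what I mean, I’m going by output along these lines:)

```shell
# Per-OSD utilization laid out by CRUSH tree; the %USE column
# shows some OSDs far fuller than others despite the balancer.
ceph osd df tree
```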
root@boreal-01:/# ceph -s
  cluster:
    id:     951fa730-0228-11ed-b1ef-f925f77b75d3
    health: HEALTH_WARN
            There are daemons running an older version of ceph
            6 nearfull osd(s)
            3 pgs not deep-scrubbed in time
            3 pgs not scrubbed in time
            4 pool(s) nearfull
            1 daemons have recently crashed

  services:
    mon: 4 daemons, quorum boreal-01,boreal-02,boreal-03,boreal-04 (age 22h)
    mgr: boreal-02.lqxcvk(active, since 19h), standbys: boreal-03.vxhpad, boreal-01.ejaggu
    mds: 2/2 daemons up, 2 standby
    osd: 89 osds: 89 up (since 5d), 89 in (since 45h)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 549 pgs
    objects: 227.23M objects, 193 TiB
    usage:   581 TiB used, 356 TiB / 937 TiB avail
    pgs:     533 active+clean
             16 active+clean+scrubbing+deep

  io:
    client: 55 MiB/s rd, 330 KiB/s wr, 21 op/s rd, 45 op/s wr

root@boreal-01:/#
Part of me suspects that I exacerbated the problems by monkeying with boreal-04
for several days, trying to get the drives inside the machine turned into OSDs so that
they would be used. One thing I did was attempt to upgrade the software on that machine, and I
could have triggered a cluster-wide upgrade that failed everywhere except nodes 1 and 4. With 2 and 3
not even running the same major release, if I did make that mistake, I can see why, instead
of an upgrade, things got worse.
According to the documentation, I should be able to upgrade the entire cluster by running
a single command on the admin node, but when I go to run orchestrator commands I get errors that even
Google can’t solve:
root@boreal-01:/# ceph orch host ls
Error ENOENT: Module not found
root@boreal-01:/#
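(My possibly wrong understanding is that `ceph orch` is served by a mgr module, and with the active mgr currently on an Octopus node, the cephadm/orchestrator module may be disabled or failing to load. Something like this should at least show whether it’s enabled:)

```shell
# List mgr modules and whether cephadm/orchestrator are enabled.
ceph mgr module ls

# If disabled, the cephadm docs suggest:
ceph mgr module enable cephadm
ceph orch set backend cephadm
```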
Consequently, I have very little faith that running the commands to upgrade everything
to the same version will work. I think upgrading each host could fix
things, but I do not feel confident doing so and risking our data.
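(For reference, the single documented command I mean is, as I understand it, something like the following; it requires `ceph orch` to be working first, which is exactly what fails for me.)

```shell
# Orchestrator-driven rolling upgrade of the whole cluster
# to a specific release.
ceph orch upgrade start --ceph-version 17.2.5
```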
Hopefully that gives a better idea of the problems I am facing. I am hoping to arrange some
professional services hours with someone who is a true expert with this software,
to get us to a stable and sane deployment that can
be managed without it being a terrifying guessing game.
I’ve seen 42on.com recommended before (no affiliation).
If that is you, or if you know someone who can help — please contact me!
Thank you!
Thomas
_______________________________________________
ceph-users mailing list -- ceph-users(a)ceph.io
To unsubscribe send an email to ceph-users-leave(a)ceph.io