Hello,
we'd like to upgrade our cluster from the latest Ceph 15 to Ceph 18.
It's running with cephadm.
What's the right way to do it?
Latest Ceph 15 to the latest 16, then the latest 17, then the latest 18?
Does that work?
Or is it possible to skip a release and jump from the latest Ceph 16 straight to the
latest Ceph 18, i.e. latest Ceph 15 -> latest Ceph 16 -> latest Ceph 18?
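If the hop-by-hop route is the right one, we assume each hop with cephadm would look
roughly like this (just a sketch; the point releases are examples, not necessarily the
ones we'd pin):

    ceph -s                                          # confirm the cluster is healthy first
    ceph orch upgrade start --ceph-version 16.2.15   # staged upgrade to the next major release
    ceph orch upgrade status                         # watch until it completes, then repeat for 17.x and 18.x

Is that the recommended procedure?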
Best,
Malte
Hello all! I hope somebody can help us.
The initial situation: a Ceph cluster v15.2 (installed and managed by Proxmox) with 3 nodes based on physical servers rented from a cloud provider. Volumes are provided by Ceph via both CephFS and RBD. We run 2 MDS daemons with max_mds=1, so one daemon is active and the other is on standby.
On Thursday some of the applications stopped working. After investigating, it was clear that we had a problem with Ceph, more precisely with CephFS - both MDS daemons had suddenly crashed. We tried to restart them and found that they crash again immediately after starting. The crash information:
2024-04-17T17:47:42.841+0000 7f959ced9700 1 mds.0.29134 recovery_done -- successful recovery!
2024-04-17T17:47:42.853+0000 7f959ced9700 1 mds.0.29134 active_start
2024-04-17T17:47:42.881+0000 7f959ced9700 1 mds.0.29134 cluster recovered.
2024-04-17T17:47:43.825+0000 7f959aed5700 -1 ./src/mds/OpenFileTable.cc: In function 'void OpenFileTable::commit(MDSContext*, uint64_t, int)' thread 7f959aed5700 time 2024-04-17T17:47:43.831243+0000
./src/mds/OpenFileTable.cc: 549: FAILED ceph_assert(count > 0)
Over the next hours we read tons of articles, studied the documentation, and checked the cluster status with various diagnostic commands - but didn't find anything wrong. In the evening we decided to upgrade our Ceph cluster, so we upgraded it to v16 and finally to v17.2.7. Unfortunately, that didn't solve the problem; the MDS daemons continue to crash with the same error. The only difference we found is the "1 MDSs report damaged metadata" warning in the output of ceph -s - see below.
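As far as we understand, the damaged entries themselves can be listed with something like this (assuming rank 0 of our only filesystem 'cephfs'):

    ceph tell mds.cephfs:0 damage ls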
I suspected it might be a known bug, but couldn't find a matching one on https://tracker.ceph.com - there are several bugs involving OpenFileTable.cc, but none related to ceph_assert(count > 0).
We also looked at the source code of OpenFileTable.cc; here is a fragment of it, from the function OpenFileTable::_journal_finish:
int omap_idx = anchor.omap_idx;                 // which omap object this anchor is tracked in
unsigned& count = omap_num_items.at(omap_idx);  // item counter for that omap object
ceph_assert(count > 0);                         // this is the assert that fires for us
So we guess that the object map is empty for some object in Ceph, which is unexpected behaviour. But again, we found nothing obviously wrong in our cluster...
Next, we turned to the https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/ article: we reset the journal (even though it had been OK the whole time) and wiped the sessions with the cephfs-table-tool all reset session command. No result...
I have now decided to continue following that article and ran the cephfs-data-scan scan_extents command. We started it on Friday and it is still running (2 of 3 workers have finished, so I'm waiting for the last one; maybe I need more workers for the next command, cephfs-data-scan scan_inodes, that I plan to run - see the sketch below). But I doubt it will solve the issue because, again, we believe the problem is not with the data objects in Ceph but with the metadata only...
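If more workers do help, we assume the parallel invocation for the next phase would look roughly like this (one command per worker; '<data pool>' is a placeholder for our data pool name):

    cephfs-data-scan scan_inodes --worker_n 0 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 1 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 2 --worker_m 4 <data pool>
    cephfs-data-scan scan_inodes --worker_n 3 --worker_m 4 <data pool>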
Is this a new bug? Or something else? What else should we try in order to get our MDS daemons running again? Any idea is welcome!
The important outputs:
ceph -s
cluster:
id: 4cd1c477-c8d0-4855-a1f1-cb71d89427ed
health: HEALTH_ERR
1 MDSs report damaged metadata
insufficient standby MDS daemons available
83 daemons have recently crashed
3 mgr modules have recently crashed
services:
mon: 3 daemons, quorum asrv-dev-stor-2,asrv-dev-stor-3,asrv-dev-stor-1 (age 22h)
mgr: asrv-dev-stor-2(active, since 22h), standbys: asrv-dev-stor-1
mds: 1/1 daemons up
osd: 18 osds: 18 up (since 22h), 18 in (since 29h)
data:
volumes: 1/1 healthy
pools: 5 pools, 289 pgs
objects: 29.72M objects, 5.6 TiB
usage: 21 TiB used, 47 TiB / 68 TiB avail
pgs: 287 active+clean
2 active+clean+scrubbing+deep
io:
client: 2.5 KiB/s rd, 172 KiB/s wr, 261 op/s rd, 195 op/s wr
ceph fs dump
e29480
enable_multiple, ever_enabled_multiple: 0,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'cephfs' (1)
fs_name cephfs
epoch 29480
flags 12 joinable allow_snaps allow_multimds_snaps
created 2022-11-25T15:56:08.507407+0000
modified 2024-04-18T16:52:29.970504+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 14728
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {0=156636152}
failed
damaged
stopped
data_pools [5]
metadata_pool 6
inline_data disabled
balancer
standby_count_wanted 1
[mds.asrv-dev-stor-1{0:156636152} state up:active seq 6 laggy since 2024-04-18T16:52:29.970479+0000 addr [v2:172.22.2.91:6800/2487054023,v1:172.22.2.91:6801/2487054023] compat {c=[1],r=[1],i=[7ff]}]
cephfs-journal-tool --rank=cephfs:0 journal inspect
Overall journal integrity: OK
ceph pg dump summary
version 41137
stamp 2024-04-18T21:17:59.133536+0000
last_osdmap_epoch 0
last_pg_scan 0
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
sum 29717605 0 0 0 0 6112544251872 13374192956 28493480 1806575 1806575
OSD_STAT USED AVAIL USED_RAW TOTAL
sum 21 TiB 47 TiB 21 TiB 68 TiB
ceph pg dump pools
POOLID OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG
8 31771 0 0 0 0 131337887503 2482 140 401246 401246
7 839707 0 0 0 0 3519034650971 736 61 399328 399328
6 1319576 0 0 0 0 421044421 13374189738 28493279 206749 206749
5 27526539 0 0 0 0 2461702171417 0 0 792165 792165
2 12 0 0 0 0 48497560 0 0 6991 6991
---
Best regards,
Alexey Gerasimov
System Manager
www.opencascade.com
www.capgemini.com
We operate a tiny ceph cluster (v16.2.7) across three machines, each
running two OSDs and one of each mds, mgr, and mon. The cluster serves
one main erasure-coded (2+1) storage pool and a few other
management-related pools. The cluster has been running smoothly for
several months.
A few weeks ago we noticed a health warning reporting backfillfull/nearfull OSDs and pools. Here is the output of `ceph -s` at that point (extracted from logs):
--------------------------------------------------------------------------------
cluster:
health: HEALTH_WARN
1 backfillfull osd(s)
2 nearfull osd(s)
Reduced data availability: 163 pgs inactive, 1 pg peering
Low space hindering backfill (add storage if this doesn't
resolve itself): 2 pgs backfill_toofull
Degraded data redundancy: 1486709/10911157 objects degraded
(13.626%), 68 pgs degraded, 68 pgs undersized
162 pgs not scrubbed in time
6 pool(s) backfillfull
services:
mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 5m)
mgr: mgr-102(active, since 54m), standbys: mgr-101, mgr-100
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 6 osds: 6 up (since 4m), 6 in (since 2w); 7 remapped pgs
data:
volumes: 1/1 healthy
pools: 6 pools, 338 pgs
objects: 3.64M objects, 14 TiB
usage: 13 TiB used, 1.7 TiB / 15 TiB avail
pgs: 47.929% pgs unknown
0.296% pgs not active
1486709/10911157 objects degraded (13.626%)
52771/10911157 objects misplaced (0.484%)
162 unknown
106 active+clean
67 active+undersized+degraded
1 active+undersized+degraded+remapped+backfill_toofull
1 remapped+peering
1 active+remapped+backfill_toofull
--------------------------------------------------------------------------------
In hindsight I now see the large number of PGs in state unknown, and the fact that a significant fraction of objects was degraded despite all OSDs being up, but we didn't notice this back then.
Because the cluster continued to behave fine from the perspective of the mounted filesystem, we didn't really register the potential problem and did not intervene. From then on, things have mostly gone downhill.
Now, `ceph -s` reports the following:
--------------------------------------------------------------------------------
cluster:
health: HEALTH_WARN
noout flag(s) set
Reduced data availability: 117 pgs inactive
Degraded data redundancy: 2095625/12121767 objects degraded
(17.288%), 114 pgs degraded, 114 pgs undersized
117 pgs not scrubbed in time
services:
mon: 3 daemons, quorum mon.101,mon.102,mon.100 (age 15h)
mgr: mgr-102(active, since 7d), standbys: mgr-100, mgr-101
mds: 1/1 daemons up, 1 standby, 1 hot standby
osd: 6 osds: 6 up (since 55m), 6 in (since 3w)
flags noout
data:
volumes: 1/1 healthy
pools: 6 pools, 338 pgs
objects: 4.04M objects, 15 TiB
usage: 12 TiB used, 2.8 TiB / 15 TiB avail
pgs: 34.615% pgs unknown
2095625/12121767 objects degraded (17.288%)
117 unknown
114 active+undersized+degraded
107 active+clean
--------------------------------------------------------------------------------
Note in particular the still very large number of PGs in state unknown, which hasn't changed in days. The same goes for the degraded PGs. Also, the cluster should have around 37 TiB of storage available, but it now reports only 15 TiB.
We did a bit of digging around but couldn't really get to the bottom of the unknown PGs and how we can recover from that. One other data point is that the command `ceph osd df tree` gets stuck on two of the three machines, and on the one where it does return something, it looks like this:
--------------------------------------------------------------------------------
ID   CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP    META    AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1          47.67506         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          root default
-13         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.100
 -5         18.26408         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-100
  3   hdd   10.91409   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    91      up  osd.3
  5   hdd    7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    48      up  osd.5
 -9         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          datacenter dc.101
 -7         14.69998         -      0 B      0 B      0 B     0 B     0 B      0 B      0     0     -          host osd-101
  0   hdd    7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    83      up  osd.0
  1   hdd    7.34999   1.00000      0 B      0 B      0 B     0 B     0 B      0 B      0     0    86      up  osd.1
-11         14.71100         -   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00  1.00     -          datacenter dc.102
-17          7.35550         -  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04     -          host osdroid-102-1
  4   hdd    7.35550   1.00000  7.4 TiB  6.3 TiB  6.2 TiB  16 MiB  11 GiB  1.1 TiB  85.16  1.04   114      up  osd.4
-15          7.35550         -  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96     -          host osdroid-102-2
  2   hdd    7.35550   1.00000  7.4 TiB  5.8 TiB  5.7 TiB  61 MiB  10 GiB  1.6 TiB  78.83  0.96   107      up  osd.2
                         TOTAL   15 TiB   12 TiB   12 TiB  77 MiB  21 GiB  2.6 TiB  82.00
MIN/MAX VAR: 0/1.04  STDDEV: 66.97
--------------------------------------------------------------------------------
The odd part here is that for some reason only osd.2 and osd.4 seem to
contribute size to the cluster. Interestingly, accessing content from
the storage pool works mostly without issues, which shouldn't work if 4
out of 6 OSDs weren't properly up.
Even more odd is that while `ceph health detail` reports a lot of PGs in state unknown, undersized, and degraded, inspecting the respective PGs with `ceph pg <pgid> query` reports active+clean for *all* of them...
I'm not sure which of the two pieces of information I am supposed to trust...
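For example, this is the kind of check we ran (the PG id below is hypothetical; we took the ids straight from `ceph health detail`):

    ceph pg 2.1f query | grep '"state"'    # prints "active+clean" even for PGs listed as unknown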
Any ideas what we can do to get our cluster back into a sane state? I'm
happy to provide more logs or command output, please let me know.
Thanks!
Hi All,
*Something* is chewing up a lot of space on our `/var` partition, to the point where we're getting warnings about the Ceph monitor running out of space (i.e. > 70% full).
I've been looking, but I can't find anything significant (i.e. log files aren't too big, etc.). BUT there seem to be a hell of a lot (15) of sub-directories (with GUIDs for names) under the `/var/lib/containers/storage/overlay/` folder, all ending with `merged` - i.e. `/var/lib/containers/storage/overlay/{{GUID}}/merged`.
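In case it helps, this is roughly how I've been checking usage so far (assuming GNU du and podman are available on the host):

    du -xh --max-depth=1 /var | sort -h   # which directory tree is eating the space
    podman system df                      # space used by container images/containers/volumes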
Is this normal, or is something going wrong somewhere, or am I looking
in the wrong place?
Also, if this is the issue, can I delete these folders?
Sorry for asking such a noob Q, but the Cephadm/Podman stuff is
extremely new to me :-)
Thanks in advance
Cheers
Dulux-Oz
Hi,
Trying to delete images in a Ceph pool is causing errors in one of
the clusters. I rebooted all the monitor nodes sequentially to see if the
error went away, but it still persists. What is the best way to fix this?
The Ceph cluster is in an OK state, with no rebalancing or scrubbing happening (I did set the noscrub and nodeep-scrub flags), and there is almost no load on the cluster, very little I/O.
root@ceph-mon01 ~# rbd rm 000dca3d-4f2b-4033-b8f5-95458e0c3444_disk_delete -p compute
Removing image: 31% complete...2024-04-18 20:42:52.525135 7f6de0c79700 -1
NetHandler create_socket couldn't create socket (24) Too many open files
Removing image: 32% complete...2024-04-18 20:42:52.539882 7f6de9c7b700 -1
NetHandler create_socket couldn't create socket (24) Too many open files
2024-04-18 20:42:52.541508 7f6de947a700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.546613 7f6de0c79700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.558133 7f6de9c7b700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.573819 7f6de947a700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.589733 7f6de0c79700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
Removing image: 33% complete...2024-04-18 20:42:52.643489 7f6de9c7b700 -1
NetHandler create_socket couldn't create socket (24) Too many open files
2024-04-18 20:42:52.727262 7f6de0c79700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.737135 7f6de9c7b700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.743292 7f6de947a700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.746167 7f6de0c79700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.757404 7f6de9c7b700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
Removing image: 34% complete...2024-04-18 20:42:52.773182 7f6de947a700 -1
NetHandler create_socket couldn't create socket (24) Too many open files
2024-04-18 20:42:52.773222 7f6de947a700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.789847 7f6de0c79700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
2024-04-18 20:42:52.844201 7f6de9c7b700 -1 NetHandler create_socket
couldn't create socket (24) Too many open files
^C
root@ceph-mon01 ~#
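In case it's relevant: error (24) is EMFILE, so it looks like the rbd client process is hitting its open-file limit. This is what I was planning to check next (a sketch; <pid> would be the PID of the running rbd command):

    ulimit -n                              # per-shell file-descriptor limit
    grep 'open files' /proc/<pid>/limits   # effective limit of the running process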
Thanks,
Pardh
Hi,
We recently upgraded one of our clusters from Quincy 17.2.6 to Reef 18.2.1; since then we have had 3 instances of our RGWs stopping processing requests. We have 3 hosts that each run a single instance of RGW, and all 3 seem to stop processing requests at the same time, causing our storage to become unavailable. A restart or redeploy of the RGW service brings them back OK. The cluster was originally deployed using ceph-ansible, but has since been adopted by cephadm, which is how the upgrade was performed.
We have enabled debug logging, as there was nothing out of the ordinary in the normal logs, and are currently sifting through the output from the last crash.
We are just wondering whether it is possible to run Quincy RGWs instead of Reef ones, as we didn't have this issue prior to the upgrade?
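If mixing versions like that is supported at all, we assume pinning a single RGW daemon back to a Quincy image with cephadm would look something like this (the daemon name and image tag below are just examples, not verified):

    ceph orch daemon redeploy rgw.default.host1.abcdef quay.io/ceph/ceph:v17.2.7

though we don't know whether running Quincy RGWs against a Reef cluster is a supported combination.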
We have 3 clusters in a multisite setup; we are holding off on upgrading the other 2 clusters due to this issue.
Thanks
Iain
Iain Stott
OpenStack Engineer
Iain.Stott(a)thg.com
www.thg.com
Hello,
I am using Ceph RGW for S3. Is it possible to create (sub)users that
cannot create/delete buckets and are limited to specific buckets?
In the end, I want to create 3 separate users and a bucket for each of them. Each user should only have access to its own bucket and should not be able to create new buckets or delete existing ones.
One approach could be to limit max_buckets to 1 so the user cannot create new buckets, but the user would still have access to other buckets and would still be able to delete buckets.
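For what it's worth, the direction I was considering combines max_buckets with a per-bucket policy - a sketch with made-up user/bucket names (user1/bucket1), assuming the awscli is pointed at the RGW endpoint:

    # create the user and keep it from creating additional buckets
    radosgw-admin user create --uid=user1 --display-name="User 1" --max-buckets=1

    # policy.json: allow only user1 to use bucket1
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": ["arn:aws:iam:::user/user1"]},
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        "Resource": ["arn:aws:s3:::bucket1", "arn:aws:s3:::bucket1/*"]
      }]
    }

    # apply it (endpoint URL is an example)
    aws --endpoint-url http://rgw.example.com:8080 s3api put-bucket-policy --bucket bucket1 --policy file://policy.json

But I'm not sure this actually stops a user from deleting its own bucket, hence the question.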
Any advice here? Thanks!
Sinan
Hi all,
Do the Mons store any crushmap history, and if so how does one get at it
please?
I ask because we've recently encountered an issue in a medium-scale (~5 PB raw), EC-based, RGW-focused cluster where "something" happened - we still don't know what - that suddenly caused 94% of objects (5.4 billion of them) to be reported as misplaced. We've tracked down the first log message of that pgmap state change:
Mar 29 10:30:31 mon1 bash[5804]: debug 2024-03-29T10:30:31.152+0000 7f3b6e378700 0 log_channel(cluster) log [DBG] : pgmap v44327: 2273 pgs: 225 active+clean, 2038 active+remapped+backfill_wait, 10 active+remapped+backfilling; 1.6 PiB data, 2.1 PiB used, 2.2 PiB / 4.3 PiB avail; 5426274136/5752755429 objects misplaced (94.325%); 248 MiB/s, 109 objects/s recovering
This appears to have been preceded (aside from a single HTTP HEAD request coming into RGW) by a 5-minute gap in the logs, where either journald couldn't keep up with the debug messages or the mons were stuck. The last log entry before that gap seems to be a compaction event kicking off:
mon1 bash[25927]:      Int      0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0.00   0.00    0   0.000     0     0
Mar 29 10:24:14 mon1 bash[25927]: ** Compaction Stats [L] **
Mar 29 10:24:14 mon1 bash[25927]: Priority  Files  Size  Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  CompMergeCPU(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
Mar 29 10:24:14 mon1 bash[25927]: -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Mar 29 10:24:14 mon1 bash[25927]:      Low      0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  116.0   11.4   0.02   0.01    7   0.003   490   462
Mar 29 10:24:14 mon1 bash[25927]:     High      0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0    1.9   1.23   1.20   28   0.044     0     0
Mar 29 10:24:14 mon1 bash[25927]:     User      0/0   0.00 KB   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0    0.0   16.4   0.00   0.00    1   0.001     0     0
We're left wondering what on earth happened to cause such a huge redistribution of data in the cluster when we've made no corresponding changes, so we want to see if there are any breadcrumbs we can find.
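The only avenue we've come up with ourselves is pulling historical OSD maps from the mons (assuming the relevant epochs haven't been trimmed yet) and decompiling the CRUSH map out of each, roughly:

    ceph report | grep -E '"osdmap_(first|last)_committed"'   # epoch range the mons still hold
    ceph osd getmap 12345 -o osdmap.12345                     # 12345 is an example epoch
    osdmaptool osdmap.12345 --export-crush crush.12345
    crushtool -d crush.12345 -o crush.12345.txt

but if there's a better way to see what (if anything) changed around that time, we're all ears.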
Appreciate any pointers!
--
Cheers,
~Blairo