Hello,
In some documentation I was reading last night about laying out OSDs, it
was suggested that if more than one OSD uses the same NVMe drive, the
failure-domain should probably be set to node. However, for a small
cluster the inclination is to use EC-pools and failure-domain = OSD.
I was wondering if there is a middle ground - could we define
failure-domain = NVMe? I think the map would need to be defined
manually in the same way that failure-domain = rack requires information
about which nodes are in each rack.
Example: My latest OSD nodes have 8 HDDs and 3 U.2 NVMe drives. I'd set up
the WAL/DB for the HDD OSDs on the NVMe devices (with wasted space on the
3rd NVMe).
Across all my OSD nodes I will have 8 HDDs and either 2 or 3 NVMe
devices per node - 15 total NVMe devices. My preferred EC-pool profile
is 8+2. It seems that this profile could be safely dispersed across 15
failure domains, resulting in protection against NVMe failure.
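Something like the following is roughly what I have in mind - just a
sketch with made-up bucket/profile names, and I'd obviously test it on a
throwaway pool first:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# edit crushmap.txt: add a new bucket type (e.g. "nvme") between osd and
# host, declare one nvme bucket per physical NVMe device, and move the
# OSDs whose WAL/DB sit on that device into it under the host bucket
crushtool -c crushmap.txt -o crushmap-new.bin
ceph osd setcrushmap -i crushmap-new.bin
ceph osd erasure-code-profile set ec-8-2-nvme k=8 m=2 crush-failure-domain=nvme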
Please let me know if this is worth pursuing.
Thanks.
-Dave
--
Dave Hall
Binghamton University
kdhall(a)binghamton.edu
607-760-2328 (Cell)
607-777-4641 (Office)
Hi all,
I have a fairly pressing issue. I had a monitor fall out of quorum because
it ran out of disk space during rebalancing from switching to upmap. I
noticed that all my monitors' store.db directories had started taking up
nearly all the disk space, so I set noout, nobackfill and norecover and
shut down all the monitor daemons.
Each store.db was at:
mon.a 89GB (the one that first dropped out)
mon.b 400GB
mon.c 400GB
I tried setting mon_compact_on_start. This brought mon.a down to 1GB. Cool.
However, when I tried it on the other monitors the db size increased by
roughly 1 GB every 10 seconds, so I shut them down again.
Any idea what is going on? Or how can I shrink the db back down?
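For reference, what I mean by "setting mon_compact_on_start" is roughly
the following (the ceph.conf route, since the mons are down; the tell
variant only works while a mon is up):

# in ceph.conf on the monitor host, then start that mon
[mon]
    mon compact on start = true

# alternatively, with a mon up and running:
ceph tell mon.a compact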
Hi all,
we are observing a dramatic performance drop on our ceph file system and are wondering if this could be related to ceph fs snapshots. We are taking rotating snapshots in 2 directories and have 11 snapshots in each (ls below) as of today. We observe the performance drop with an rsync process that writes to another folder on the ceph fs, one *without* snapshots. The performance reduction is a factor of 3 or even higher.
Could this possibly be caused by the snapshots being present? Has anyone else seen something like this?
The reason we suspect snapshots is that not much else has changed on the cluster except that we started taking rolling snapshots on the 23rd of February. In addition, the kernel symbols ceph_update_snap_trace, rebuild_snap_realms and build_snap_context show up very high in a perf report. The performance reduction has been present for at least 3 days.
The ceph version is mimic 13.2.10. The kernel version of the rsync server is 3.10.0-1127.10.1.el7.x86_64.
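In case anyone wants to check for the same symbols on their own client,
something like this should reproduce the perf report (the 30 second
sampling window is just an example):

perf record -a -g -- sleep 30
perf report --sort symbol | head -n 30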
$ ls home/.snap
2021-02-23_183554+0100_weekly 2021-03-06_000611+0100_daily 2021-03-09_000611+0100_daily
2021-03-01_000911+0100_weekly 2021-03-07_000611+0100_daily 2021-03-10_000611+0100_daily
2021-03-04_000611+0100_daily 2021-03-08_000611+0100_daily 2021-03-11_000611+0100_daily
2021-03-05_000611+0100_daily 2021-03-08_000911+0100_weekly
$ ls groups/.snap
2021-02-23_183554+0100_weekly 2021-03-06_000611+0100_daily 2021-03-09_000611+0100_daily
2021-03-01_000912+0100_weekly 2021-03-07_000611+0100_daily 2021-03-10_000612+0100_daily
2021-03-04_000611+0100_daily 2021-03-08_000611+0100_daily 2021-03-11_000612+0100_daily
2021-03-05_000611+0100_daily 2021-03-08_000911+0100_weekly
Many thanks for any pointers and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Before I started the upgrade the cluster was healthy, but one
OSD (osd.355) was down; I can't remember if it was in or out.
Upgrade was started with
ceph orch upgrade start --image
goharbor.example.com/library/ceph/ceph:v15.2.9
The upgrade started but when Ceph tried to upgrade osd.355 it paused
with the following messages:
2021-03-11T09:15:35.638104+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Target is goharbor.example.com/library/ceph/ceph:v15.2.9 with id
dfc48307963697ff48acd9dd6fda4a7a24017b9d8124f86c2a542b0802fe77ba
2021-03-11T09:15:35.639882+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking mgr daemons...
2021-03-11T09:15:35.644170+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All mgr daemons are up to date.
2021-03-11T09:15:35.644376+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking mon daemons...
2021-03-11T09:15:35.647669+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All mon daemons are up to date.
2021-03-11T09:15:35.647866+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking crash daemons...
2021-03-11T09:15:35.652035+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Setting container_image for all crash...
2021-03-11T09:15:35.653683+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
All crash daemons are up to date.
2021-03-11T09:15:35.653896+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Checking osd daemons...
2021-03-11T09:15:36.273345+0000 mgr.pech-mon-2.cjeiyc [INF] It is
presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273504+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
It is presumed safe to stop ['osd.355']
2021-03-11T09:15:36.273887+0000 mgr.pech-mon-2.cjeiyc [INF] Upgrade:
Redeploying osd.355
2021-03-11T09:15:36.276673+0000 mgr.pech-mon-2.cjeiyc [ERR] Upgrade:
Paused due to UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.355 on host
pech-hd-009 failed.
One of the first things the upgrade did was to upgrade the mons, so they
were restarted, and now osd.355 no longer exists:
$ ceph osd info osd.355
Error EINVAL: osd.355 does not exist
But if I run a resume
ceph orch upgrade resume
it still tries to upgrade osd.355, same message as above.
I tried to stop and start the upgrade again with
ceph orch upgrade stop
ceph orch upgrade start --image
goharbor.example.com/library/ceph/ceph:v15.2.9
it still tries to upgrade osd.355, with the same message as above.
Looking at the source code, it looks like it gets the daemons to upgrade
from the mgr cache, so I restarted both mgrs, but it still tries to
upgrade osd.355.
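For reference, the closest thing I have found so far is force-removing
the stale daemon record, but I have not dared to run it yet and I am not
sure it is the right fix:

ceph orch ps | grep osd.355
ceph orch daemon rm osd.355 --force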
Does anyone know how I can get the upgrade to continue?
--
Kai Stian Olstad
Hi! So, after I selected the tags to add 2 NVMe SSDs I declared a
replicated n=2 pool... and for the last 30 minutes the progress shown in
the notification has been 0%, and iotop shows around 100 K/s for 2 (???)
ceph-mon processes, and that is all...
and in my service list the OSD services look somehow empty:
https://prntscr.com/10iwwbh
what did i miss?
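Happy to paste the output of the usual status checks if that helps, e.g.:

ceph -s
ceph osd tree
ceph orch ps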
Thanks a lot!
Adrian
Hi Guys,
I have been using Ceph RBD with OpenStack for some time, and I ran into a
problem while destroying a VM: OpenStack tried to delete the RBD image but
failed. I tested deleting an image with the rbd command, and it takes a
lot of time (image size 512 GB or more).
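For reference, a minimal way to reproduce the timing I am seeing would be
something like this (pool and image names are just examples):

rbd create --size 512G volumes/test-image
time rbd rm volumes/test-image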
Has anyone else met the same problem?
Thanks,
Norman
Hi! After an initially bumpy bootstrapping (IMHO the defaults should be
whatever is already defined in the user's .ssh, with custom values set
via CLI arguments), I'm now stuck adding any services/hosts/OSDs because
apparently I lack an orchestrator... and the documentation shows a big
"Page does not exist",
see
https://docs.ceph.com/en/latest/docs/octopus/mgr/orchestrator
So, what is it and what options do I have?
Setting it up seems to be as easy as:
ceph orch set backend
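If I read the non-404 parts of the docs correctly, for a cephadm-based
setup that would presumably be:

ceph mgr module enable cephadm
ceph orch set backend cephadm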
I just started with Ceph and I just want to run a Ceph service (I
cannot call it a cluster) on my desktop (with 2 dedicated OSDs) to also
get familiar with its usage.
Thanks a lot!
Adrian
> >
> > 2. If a down host comes up again and its OSDs are started, is data
> still being copied, or does ceph see that checksums(?)
>
> PG or RADOS object epoch I think. So if data hasn’t changed, the
> recovery completes without having anything to do.
>
> > are the same and just sets a pointer(?)
> back to the old location?
Yes, I mean the osdmap still having the old OSD on the node that was down.
Hmmm, I currently have PGs in 'active+remapped+backfill_wait' in a pool rbd.backup, in which I know nothing has changed, and the OSDs listed[1] are already up. Especially when you are getting to the 'end' of recovery, where osd_max_backfills=X has less effect, recovery takes longer. It would be nice to have the "no work needed" PGs be processed quickly.
[1]
[18,25,41]p18 [18,41,17]
[5,0,25]p5 [5,0,4
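For reference, a quick way to list such PGs together with their up/acting
sets should be something like:
ceph pg ls-by-pool rbd.backup backfill_wait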
Hello everyone,
We have an unpurgeable image living in the trash of one of our clusters:
# rbd --pool volumes trash ls
5afa5e5a07b8bc volume-02d959fe-a693-4acb-95e2-ca04b965389b
If we try to purge the whole trash, it says the image is being restored,
but we have never tried to restore it:
# rbd --pool volumes trash purge
Removing images: 0% complete...failed.
2021-03-10 13:58:42.849 7f78b3fc9c80 -1 librbd::api::Trash: remove:
error: image is pending restoration.
When trying to delete manually, it says there are some watchers, but
this is actually not the case:
# rbd --pool volumes trash remove 5afa5e5a07b8bc
rbd: error: image still has watchers2021-03-10 14:00:21.262 7f93ee8f8c80
-1 librbd::api::Trash: remove: error: image is pending restoration.
This means the image is still open or the client using it crashed. Try
again after closing/unmapping it or waiting 30s for the crashed client
to timeout.
Removing image:
0% complete...failed.
# rados listwatchers -p volumes rbd_header.5afa5e5a07b8bc
#
We have tried to stat the first 10 rbd_data objects and they had all
been deleted.
We know we can manually delete the omap key from rbd_trash, but we thought
it would be better to understand how an image might get into this state.
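For reference, the manual route we have in mind would be along these
lines, assuming the trash entry is keyed as id_<image id> (which we have
not verified on this cluster):
# rados -p volumes listomapkeys rbd_trash
# rados -p volumes rmomapkey rbd_trash id_5afa5e5a07b8bc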
Has anyone seen this before?
Many thanks!
Cheers,
Enrico
--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage Group - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland