Hi all,
I replaced a disk in our Octopus cluster and it is rebuilding. I noticed that there has been no scrubbing since the replacement. An OSD that has a PG in backfill_wait state seems to block deep scrubbing of all other PGs on that OSD as well - at least that is how it looks.
Some numbers: the pool in question has 8192 PGs with EC 8+3 and roughly 850 OSDs. A total of 144 PGs needed backfilling (they were remapped after replacing the disk). After about 2 days we are down to 115 backfill_wait + 3 backfilling, so it will take a bit more than a week to complete.
There is plenty of time and IOPS available to deep-scrub PGs on the side, but since the backfill started there has been zero scrubbing/deep scrubbing and "PGs not deep scrubbed in time" messages are piling up.
Is there a way to allow (deep) scrub in this situation?
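The only knob I have found so far, but not tried yet, is osd_scrub_during_recovery, which defaults to false if I read the docs correctly; something along these lines (untested):

# untested idea: allow scrubs to be scheduled while recovery/backfill is active
ceph config set osd osd_scrub_during_recovery true
# switch it back off once the backfill has finished
ceph config set osd osd_scrub_during_recovery false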
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
After a power outage on my test Ceph cluster, two OSDs fail to restart.
The log file shows:
8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.
This is not critical, as it is a test cluster and the data is actually
rebalancing onto other OSDs, but I would like to know how to return to
HEALTH_OK status.
Smartctl shows the HDDs are OK.
So is there a way to recover the OSDs from this state? The version is
15.2.17 (I just moved from 15.2.13 to 15.2.17 yesterday and will try to move
to the latest versions as soon as this problem is solved).
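What I plan to try next (not sure this is the right approach) is to clear the failed systemd unit, look at the container logs and let the orchestrator restart the daemon:

# clear the failed state of the OSD unit
systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
# check the container logs of the failing OSD
cephadm logs --name osd.2
# ask the orchestrator to restart the daemon
ceph orch daemon restart osd.2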
Thanks
Patrick
I have a use case where I want to use only a small portion of the disk for
the OSD, and the documentation states that I can use
data_allocate_fraction [1].
But cephadm cannot use this and throws this error:
/usr/bin/podman: stderr ceph-volume lvm batch: error: unrecognized arguments: --data-allocate-fraction 0.1
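To check whether the ceph-volume shipped inside the cephadm container knows the flag at all, I think something like the following should work (I have not dug deeper yet, so take it as a guess):

# look for the flag in the ceph-volume help text inside the cephadm container
cephadm shell -- ceph-volume lvm batch --help | grep -i allocate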
So, what I actually want to achieve is to split up a single SSD into:
3-5x block.db for spinning disks (5x 320GB or 3x 500GB, depending on whether I have
8TB HDDs or 16TB HDDs)
1x SSD OSD (100G) for the RGW index / meta pools
1x SSD OSD (100G) for the RGW gc pool, because of this bug [2]
My service definition looks like this:
service_type: osd
service_id: hdd-8tb
placement:
  host_pattern: '*'
crush_device_class: hdd
spec:
  data_devices:
    rotational: 1
    size: ':9T'
  db_devices:
    rotational: 0
    limit: 5
    size: '1T:2T'
  encrypted: true
  block_db_size: 320000000000
---
service_type: osd
service_id: hdd-16tb
placement:
  host_pattern: '*'
crush_device_class: hdd
spec:
  data_devices:
    rotational: 1
    size: '14T:'
  db_devices:
    rotational: 0
    limit: 1
    size: '1T:2T'
  encrypted: true
  block_db_size: 500000000000
---
service_type: osd
service_id: gc
placement:
  host_pattern: '*'
crush_device_class: gc
spec:
  data_devices:
    rotational: 0
    size: '1T:2T'
  encrypted: true
  data_allocate_fraction: 0.05
---
service_type: osd
service_id: ssd
placement:
  host_pattern: '*'
crush_device_class: ssd
spec:
  data_devices:
    rotational: 0
    size: '1T:2T'
  encrypted: true
  data_allocate_fraction: 0.05
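In case it matters, this is roughly how I preview what cephadm would create from the spec before applying it (osd-spec.yaml is just the file holding the specs above):

# preview the OSDs cephadm would create from the spec, without applying anything
ceph orch apply -i osd-spec.yaml --dry-run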
[1]
https://docs.ceph.com/en/pacific/cephadm/services/osd/#ceph.deployment.driv…
[2] https://tracker.ceph.com/issues/53585
--
The "UTF-8 problems" self-help group will meet this time, as an exception, in the large hall.
Hello,
In our deployment we are using a mix of s3 and s3website RGWs. I’ve noticed strange behaviour when sending range requests to the s3website RGWs that I’m not able to replicate on the s3 ones.
I’ve created a simple wrk Lua script to test sending range requests for tiny ranges so the issue is easily seen.
When sending these requests against an s3 RGW, I can see that the amount of data read from Ceph is roughly equivalent to what the RGW sends to the client. This changes dramatically when I run the same test against an s3website RGW: the read from Ceph is huge (3Gb/s compared to ~22Mb/s on the s3 RGW). It seems to me like the RGW is reading the whole object and then sending just the requested range, which is different from what s3 does.
I do not understand why s3website would need to read that much from Ceph, and I believe this is a bug - I was looking through the tracker and wasn’t able to find anything related to s3website and range requests.
Has anyone else noticed this issue?
You can replicate it by running this wrk command: wrk -t56 -c500 -d5m http://${rgwipaddress}:8080/${bucket}/videos/ -s wrk-range-small.lua
The wrk script:
-- Initialize the pseudo random number generator
math.randomseed(os.time())
math.random(); math.random(); math.random()

-- cycle through the test objects 1.mp4 .. 7.mp4
i = 1

function request()
   if i == 8 then
      i = 1
   end
   -- pick a tiny random byte range
   local nrangefrom = math.random(0, 100)
   local nrangeto = nrangefrom + math.random(100)
   local path = wrk.path
   url = path .. i .. ".mp4"
   wrk.headers["Range"] = "bytes=" .. nrangefrom .. "-" .. nrangeto
   i = i + 1
   return wrk.format(nil, url)
end
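For a single manual request outside wrk, a plain curl range request like the one below can be used while watching the read throughput on the Ceph side; host, bucket and object name are just the placeholders from the command above:

# request a tiny byte range of one test object and print how many bytes were downloaded
curl -s -o /dev/null -w '%{size_download}\n' \
  -H 'Range: bytes=0-100' \
  "http://${rgwipaddress}:8080/${bucket}/videos/1.mp4"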
Kind regards,
Ondrej
I am still on Nautilus, and some clients are still on CentOS 7 and mount the CephFS. These mounts stall at some point. Currently I am mounting with something like this in the fstab:
id=cephfsclientid,client_mountpoint=/cephfs/test /mnt/test fuse.ceph noauto,_netdev,noatime,x-systemd.device-timeout=30,x-systemd.mount-timeout=30,x-systemd.automount,x-systemd.idle-timeout=30 0 0
When the mount stalls I fix it with a umount -l, but of course it would be nicer if it did not behave like this. Can this be fixed on el7 and Nautilus, for example with different mount options?
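For the record, my current workaround looks roughly like this (the mount point is the one from the fstab line above; whether re-arming the automount unit is really needed I am not sure):

# lazily detach the stalled mount
umount -l /mnt/test
# restart the systemd automount unit so the next access remounts it
systemctl restart mnt-test.automount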
I was checking the tracker again and found an already-fixed issue that seems to be connected with this one:
https://tracker.ceph.com/issues/44508
Here is the PR that fixes it: https://github.com/ceph/ceph/pull/33807
What I still don’t understand is why this only happens when using the s3website API.
Is there someone who could shed some light on this?
Regards,
Ondrej
Hi all,
we seem to have hit a bug in the CephFS kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg, and some jobs on that server seem to get stuck in fs access; a log extract is below. I found these 2 tracker items, https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How do we resolve this: ignore, umount+mount, or reboot?
Here is an extract from the dmesg log; the error has survived a couple of MDS restarts already:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
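Before deciding between umount+mount and a reboot, the first thing I will check (not sure it tells me much) is which daemon currently holds MDS rank 1 and which address it advertises, to compare with the nonce the client is complaining about:

# show the MDS ranks and the daemons currently holding them
ceph fs status
# compact MDS map summary
ceph mds stat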
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hi,
I'm working on a small PoC for a Ceph setup on 4 old PowerEdge C6100s. I
had to install Octopus since the latest versions were unable to detect the
HDDs (too old hardware?). No matter, this is only for training and for
understanding the Ceph environment.
My installation was bootstrapped from
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noar…
I'm now at the point of automating snapshots (I can create snapshots
by hand without any problem). The documentation at
https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noar…
says to use the snap_schedule module, but this module does not exist.
# ceph mgr module ls | jq -r '.enabled_modules []'
cephadm
dashboard
iostat
prometheus
restful
Have I missed something? Are there additional install steps needed
for this module?
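For reference, these are the commands I expected to be able to run based on that documentation (the first one is exactly what fails for me, since the module is not listed):

# enable the manager module
ceph mgr module enable snap_schedule
# then, for example, schedule hourly snapshots of the file system root
ceph fs snap-schedule add / 1h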
Thanks for your help.
Patrick