Hi Ceph Users,
My goal is to limit the number of files a Ceph client can have open on the backend Ceph filesystem at any one time, in order to control the metadata transaction load.
In this experiment, I have a Ceph client running Quincy on a physical server. The fstab entry below shows the options with which the Ceph filesystem is mounted. In particular, I used the caps_max=1900 option with the intention of limiting the total number of files this client can have open at once at the mount point /ourdisk/hpc_scratch to 1900.
# cat /etc/fstab
...
10.251.0.30:6789,10.251.0.31:6789,10.251.0.32:6789,10.251.0.33:6789,10.251.0.40:6789:/volumes/hpc_scratch/scratch/525205e7-4f71-4383-89ce-53e2ec68d017 /ourdisk/hpc_scratch ceph fsc,caps_max=1900,name=oscer,secretfile=/etc/ceph/client.oscer.secret 0 0
To test whether this worked, I used fio to create 4000 files at once in a directory on Ceph at /ourdisk/hpc_scratch/soumya/fio_tests/client_c003. During the run, I looked at the caps file (copied below), which shows that the number of used caps is 4007. With my limited knowledge, it seems to me that the client was able to open 4000 files at once.
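The fio run was roughly along the following lines; I'm quoting the options from memory, so they are only indicative:

fio --name=manyfiles --directory=/ourdisk/hpc_scratch/soumya/fio_tests/client_c003 \
    --nrfiles=4000 --openfiles=4000 --rw=write --bs=4k --filesize=4k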
# head /sys/kernel/debug/ceph/d5d5b0aa-1867-11eb-9f4a-bc97e1724ff1.client483151133/caps
total 5032
avail 1025
used 4007
reserved 0
min 1024
ino mds issued implemented
--------------------------------------------------
0x200136e01b6 0 pAsLsXsFs pAsLsXsFs
0x1 0 pAsLsXsFs pAsLsXsFs
Is there any way I can control how many files a Ceph client can open simultaneously (preferably from the client side; if not, then from the Ceph side, but per individual client)? And if so, how can I check the number of files a client has open at a given time?
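(The closest thing I can see for checking this would be listing the MDS client sessions, which reports per-client capability counts rather than open files, e.g. something like the following, assuming admin access and jq just for readability:

ceph tell mds.0 session ls | jq '.[] | {client: .id, num_caps: .num_caps}'

but I'm not sure whether that is the right thing to look at.)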
Thank you for your time,
Soumya
PS: I understand that the number of capabilities is not the number of open files, but that's the closest mount option I could find for this experiment.
We experienced a Ceph failure in which the cluster became unresponsive, with no IOPS or throughput, due to a problematic OSD process on one node. This caused slow operations and zero IOPS on all other OSDs in the cluster. The incident timeline is as follows:
Alert triggered for OSD problem.
6 out of 12 OSDs on the node were down.
Soft restart attempted, but a smartmontools process got stuck while the server was shutting down.
Hard restart attempted; service resumed as normal.
Our Ceph cluster has 19 nodes, 218 OSDs, and is using version 15.2.17 octopus (stable).
Questions:
1. What is Ceph's detection mechanism? Why couldn't Ceph detect the faulty node and automatically abandon its resources?
2. Did we miss any patches or bug fixes?
3. Suggestions for improvements to quickly detect and avoid similar issues in the future?
To add to this: the issue seemed to be related to a ceph-volume process that was running check operations on all devices. The OSD systemd service was timing out because of that, and the OSD daemon was going into an error state.
We noticed that version 17.2.5 had a change related to ceph-volume, in particular https://tracker.ceph.com/issues/57627.
We decided to skip 16.2.11 and jump to 17.2.5. This second attempt went well, so the issue is now solved.
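For the record, the second attempt was started the same way as the first one, just with the new target release, i.e. something like:
“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 17.2.5
“””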
Note: the upgrade 16.2.7 -> 16.2.11 went smoothly on a TDS cluster with identical OS/software but much smaller (3 nodes with a couple of disks each), so the issue really does seem to be related to the number of devices and nodes.
Regards,
Giuseppe
On 30.03.23, 16:56, "Lo Re Giuseppe" <giuseppe.lore@cscs.ch> wrote:
Dear all,
On one of our clusters I started the upgrade process from 16.2.7 to 16.2.11.
The mon, mgr, and crash daemons were upgraded easily and quickly, but at the first attempt to upgrade an OSD container the upgrade process stopped because the OSD process was not able to start after the upgrade.
Does anyone have any hint on how to unblock the upgrade?
Some details below:
Regards,
Giuseppe
I started the upgrade process with the cephadm command:
“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 16.2.11
Initiating upgrade to quay.io/ceph/ceph:v16.2.11
“””
After a short time:
“””
[root@naret-monitor01 ~]# ceph orch upgrade status
{
"target_image": quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add<mailto:quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add>,
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [
"crash",
"mon",
"mgr"
],
"progress": "64/2039 daemons upgraded",
"message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 failed.",
"is_paused": true
}
“””
The ceph health command reports:
“””
[root@naret-monitor01 ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs undersized; Upgrading daemon osd.4 on host naret-osd01 failed.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.22 on naret-osd01 is in error state
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=naret-osd01) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs undersized
pg 28.88 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1373,1337,1508,852,2147483647,483]
pg 28.528 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1063,793,2147483647,931,338,1777]
pg 28.594 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1208,891,1651,364,2147483647,53]
pg 28.8b4 is stuck undersized for 6m, current state active+undersized+degraded, last acting [521,1273,1238,138,1539,2147483647]
pg 28.a90 is stuck undersized for 6m, current state active+undersized+degraded, last acting [237,1665,1836,2147483647,192,1410]
pg 28.ad6 is stuck undersized for 6m, current state active+undersized+degraded, last acting [870,466,350,885,1601,2147483647]
pg 28.b34 is stuck undersized for 6m, current state active+undersized+degraded, last acting [920,1596,2147483647,115,201,941]
pg 28.c14 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1389,424,2147483647,268,1646,632]
pg 28.dba is stuck undersized for 6m, current state active+undersized+degraded, last acting [1099,561,2147483647,1806,1874,1145]
pg 28.ee2 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1621,1904,1044,2147483647,1545,722]
pg 29.163 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1883,2147483647,1509,1697,1187,235]
pg 29.1c1 is stuck undersized for 6m, current state active+undersized+degraded, last acting [122,1226,962,1254,1215,2147483647]
pg 29.254 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1782,1839,1545,412,196,2147483647]
pg 29.2a1 is stuck undersized for 6m, current state active+undersized+degraded, last acting [370,2147483647,575,1423,1755,446]
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 failed.
Upgrade daemon: osd.4: cephadm exited with an error code: 1, stderr:Redeploy daemon osd.4 ...
Non-zero exit code 1 from systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4
systemctl: stderr Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
systemctl: stderr See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
Traceback (most recent call last):
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9248, in <module>
main()
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9236, in main
r = ctx.func(ctx)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1990, in _default_image
return func(ctx)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 5041, in command_deploy
ports=daemon_ports)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 2952, in deploy_daemon
c, osd_fsid=osd_fsid, ports=ports)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 3197, in deploy_daemon_units
call_throws(ctx, ['systemctl', 'start', unit_name])
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1657, in call_throws
raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4: Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
“””
On the OSD server we have:
“””
[root@naret-osd01 ~]# uname -a
Linux naret-osd01 4.18.0-425.10.1.el8_7.x86_64 #1 SMP Wed Dec 14 16:00:01 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@naret-osd01 ~]# podman -v
podman version 4.2.0
[root@naret-osd01 ~]# ceph -v
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
[root@naret-osd01 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.7 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL=https://www.redhat.com/
DOCUMENTATION_URL=https://access.redhat.com/documentation/red_hat_enterprise_linux/8/
BUG_REPORT_URL=https://bugzilla.redhat.com/
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
“””
Systemctl says:
“””
systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
…
● ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service - Ceph osd.4 for 63334166-d991-11eb-99de-40a6b72108d0
Loaded: loaded (/etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Mon 2023-03-27 15:34:29 CEST; 6min ago
Process: 730621 ExecStopPost=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
Process: 730209 ExecStopPost=/bin/bash /var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/osd.4/unit.poststop (code=exited, status=0/SUCCESS)
Process: 710355 ExecStartPre=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
Main PID: 23025 (code=exited, status=0/SUCCESS)
Tasks: 62 (limit: 1647878)
Memory: 961.8M
CGroup: /system.slice/system-ceph\x2d63334166\x2dd991\x2d11eb\x2d99de\x2d40a6b72108d0.slice/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
├─libpod-payload-b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e
│ └─754976 /usr/bin/ceph-osd -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug
└─runtime
└─754965 /usr/bin/conmon --api-version 1 -c b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -u b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata -p /run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/pidfile -n ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4 --exit-dir /run/libpod/exits --full-attach -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/oci-log --conmon-pidfile /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e6ae1700 1 osd.4 pg_epoch: 821628 pg[28.dbas2( v 821618'4657799 (819107'4647770,821618'4657799] local-lis/les=749842/749843 n=239290 ec=130297/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=7.751406670s) [1099,561,4,1806,1874,1145]p1099(0) r=2 lpr=821628 pi=[749842,821628)/1 crt=821618'4657799 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.039081573s@ mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e8ae5700 1 osd.4 pg_epoch: 821628 pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 les/c/f=821624/749852/0 sis=821628 pruub=8.023463249s) [1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ mbc={}] start_peering_interval up [1883,4,1509,1697,1187,235] -> [1883,4,1509,1697,1187,235], acting [1883,2147483647,1509,1697,1187,235] -> [1883,4,1509,1697,1187,235], acting_primary 1883(0) -> 1883, up_primary 1883(0) -> 1883, role -1 -> 1, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 les/c/f=821624/749850/0 sis=821628 pruub=7.845988274s) [370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ mbc={}] start_peering_interval up [370,4,575,1423,1755,446] -> [370,4,575,1423,1755,446], acting [370,2147483647,575,1423,1755,446] -> [370,4,575,1423,1755,446], acting_primary 370(0) -> 370, up_primary 370(0) -> 370, role -1 -> 1, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e8ae5700 1 osd.4 pg_epoch: 821628 pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 les/c/f=821624/749852/0 sis=821628 pruub=8.023443222s) [1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 les/c/f=821624/749850/0 sis=821628 pruub=7.845966339s) [370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=8.158309937s) [521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ mbc={} ps=[4~6]] start_peering_interval up [521,1273,1238,138,1539,4] -> [521,1273,1238,138,1539,4], acting [521,1273,1238,138,1539,2147483647] -> [521,1273,1238,138,1539,4], acting_primary 521(0) -> 521, up_primary 521(0) -> 521, role -1 -> 5, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=8.158291817s) [521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Start request repeated too quickly.
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Failed with result 'timeout'.
Mar 27 15:39:36 naret-osd01 systemd[1]: Failed to start Ceph osd.4 for 63334166-d991-11eb-99de-40a6b72108d0.
“””
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I only get to know the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
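The only workaround I can think of so far would be the rgw ops log, which does contain the bucket name. Untested sketch (the socket path is just an example):

rgw_enable_ops_log = true
rgw_ops_log_rados = false
rgw_ops_log_socket_path = /var/run/ceph/rgw-ops.sock

and then reading the socket with something like `nc -U /var/run/ceph/rgw-ops.sock`.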
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb@kervyn.de> wrote:
> Hi,
> I am looking to move our logs from
> /var/log/ceph/ceph-client...log to our log aggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw ops log (rgw_enable_ops_log) to a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
The self-help group "UTF-8 problems" will meet, as an exception, in the big hall this time.
Hi everyone,
I discovered a documentation inconsistency in Ceph Nautilus and would like to know whether this is still the case in the latest ceph release before reporting a bug. Unfortunately, I only have access to a Nautilus cluster right now.
The quincy docs state [1]:
> Create the OSD. If no UUID is given, it will be set automatically when the OSD starts up. The following command will output the OSD number, which you will need for subsequent steps:
>
>ceph osd create [{uuid} [{id}]]
But the man pages [2] state that `ceph osd create` is deprecated in favour of `ceph osd new {<uuid>} {<id>} -i {<params.json>}`, with both uuid and id still being marked as optional parameters.
But when actually running `ceph osd new` without a specified UUID, I get
```
Invalid command: missing required parameter uuid(<uuid>)
osd new <uuid> {<osdname (id|osd.id)>} : Create a new OSD. If supplied, the `id` to be replaced needs to exist and have been previously destroyed. Reads secrets from JSON file via `-i <file>` (see man page).
Error EINVAL: invalid command
```
under Nautilus. Is this still the case under Quincy? Can someone reproduce this for me?
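For completeness, the full invocation I would expect based on the manual-deployment docs is roughly the following (untested on Quincy; the keyring path is just the usual default):

```
UUID=$(uuidgen)
OSD_SECRET=$(ceph-authtool --gen-print-key)
# a uuid apparently has to be supplied explicitly, despite being documented as optional
OSD_ID=$(echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | \
    ceph osd new $UUID -i - \
    -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring)
```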
Best regards
Oliver Schmidt
[1] https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/
[2] https://docs.ceph.com/en/quincy/man/8/ceph/#osd
--
Oliver Schmidt · os@flyingcircus.io · Systems Engineer
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
Hi folks.
Just looking for some up-to-date advice, please, from the collective on how best to set up Ceph on 5 Proxmox hosts, each configured with the following:
AMD Ryzen 7 5800X CPU
64GB RAM
2x SSD (as ZFS boot disk for Proxmox)
1x 500GB NVMe for DB/WAL
1x 1TB NVMe as an OSD
1x 16TB SATA HDD as an OSD
2x 10Gb NIC (one for the public network and one for the cluster network)
1x 1Gb NIC for the management interface
The Ceph solution will be used primarily for storage of another Proxmox cluster's virtual machines and their data. We'd like a fast pool using the NVMes for critical VMs and a slower HDD-based pool for VMs that don't require such fast disk access and perhaps need more storage capacity.
To expand in the future, we will probably add more hosts in the same sort of configuration and/or replace the NVMe/HDD OSDs with more capacious ones.
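For the fast/slow split, the rough idea I have so far is device-class based CRUSH rules, something like this (untested; pool names and PG counts are just placeholders):

ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool create vm-fast 128 128 replicated replicated_nvme
ceph osd pool create vm-slow 128 128 replicated replicated_hdd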
Ideas for configuration welcome please.
Many thanks
Tino
Coastsense Ltd
I hope this email finds you well. I wanted to share a recent experience I
had with our Ceph cluster and get your feedback on a solution I came up
with.
Recently, we had some orphan objects stuck in our cluster that were not
visible to any client such as s3cmd, boto3, or mc. This caused some confusion
for our users, as the sum of all objects in their buckets was much less
than what we showed in the panel. We made some adjustments for them, but
the issue persisted.
As we have billions of objects in our cluster, using normal tools to find
orphans was impossible. So, I came up with a tricky way to handle the
situation. I created a bash script that identifies and removes the orphan
objects using radosgw-admin and rados commands. Here is the script:
https://gist.github.com/RaminNietzsche/b9baa06b69fc5f56d907f3c953769182
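In essence, the idea boils down to something like the following sketch (heavily simplified compared to the actual script; the pool and bucket names are placeholders):

rados -p default.rgw.buckets.data ls | sort > all_rados_objects.txt
radosgw-admin bucket radoslist --bucket=<bucket> | sort > referenced_objects.txt   # repeated per bucket
comm -23 all_rados_objects.txt referenced_objects.txt > orphan_candidates.txt
# candidates should be reviewed carefully before any "rados rm"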
I am hoping to get some feedback from the community on this solution. Have
any of you faced similar challenges with orphan objects in your Ceph
clusters? Do you have any suggestions or improvements for my script?
Thank you for your time and help.