Hi Ceph Users,
My goal is to limit the number of files a Ceph client can have open on the backend Ceph filesystem at any one time, in order to control the metadata transaction load.
In this experiment, I have a Ceph client running Quincy on a physical server. The fstab entry below shows the options with which the Ceph filesystem is mounted. In particular, I used the caps_max=1900 option with the intention of limiting the total number of files this client can have open at once at the mount point /ourdisk/hpc_scratch to 1900.
# cat /etc/fstab
...
10.251.0.30:6789,10.251.0.31:6789,10.251.0.32:6789,10.251.0.33:6789,10.251.0.40:6789:/volumes/hpc_scratch/scratch/525205e7-4f71-4383-89ce-53e2ec68d017 /ourdisk/hpc_scratch ceph fsc,caps_max=1900,name=oscer,secretfile=/etc/ceph/client.oscer.secret 0 0
To test whether this worked, I used fio to create 4000 files at once in a directory on Ceph at /ourdisk/hpc_scratch/soumya/fio_tests/client_c003. During the run, I looked at the caps file (copied below), which shows that the number of used caps is 4007. With my limited knowledge, it seems to me that the client was able to open 4000 files at once.
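The fio run was roughly along the following lines; I'm quoting the options from memory, so they are only indicative:

fio --name=manyfiles --directory=/ourdisk/hpc_scratch/soumya/fio_tests/client_c003 \
    --nrfiles=4000 --openfiles=4000 --rw=write --bs=4k --filesize=4k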
# head /sys/kernel/debug/ceph/d5d5b0aa-1867-11eb-9f4a-bc97e1724ff1.client483151133/caps
total 5032
avail 1025
used 4007
reserved 0
min 1024
ino mds issued implemented
--------------------------------------------------
0x200136e01b6 0 pAsLsXsFs pAsLsXsFs
0x1 0 pAsLsXsFs pAsLsXsFs
Is there any way I can control how many files a Ceph client can open simultaneously (preferably from the client side; if not, then from the Ceph side, but per individual client)? And if so, how can I check the number of files a client has open at a given time?
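(The closest thing I can see for checking this would be listing the MDS client sessions, which reports per-client capability counts rather than open files, e.g. something like the following, assuming admin access and jq just for readability:

ceph tell mds.0 session ls | jq '.[] | {client: .id, num_caps: .num_caps}'

but I'm not sure whether that is the right thing to look at.)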
Thank you for your time,
Soumya
PS: I understand that the number of capabilities is not the number of open files, but that's the closest mount option I could find for this experiment.
We experienced a Ceph failure in which the cluster became unresponsive, with no IOPS or throughput, due to a problematic OSD process on one node. This caused slow operations and zero IOPS on all other OSDs in the cluster. The incident timeline is as follows:
Alert triggered for OSD problem.
6 out of 12 OSDs on the node were down.
Soft restart attempted, but a smartmontools process got stuck while the server was shutting down.
Hard restart attempted; service resumed as normal.
Our Ceph cluster has 19 nodes, 218 OSDs, and is using version 15.2.17 octopus (stable).
Questions:
1. What is Ceph's detection mechanism? Why couldn't Ceph detect the faulty node and automatically abandon its resources?
2. Did we miss any patches or bug fixes?
3. Suggestions for improvements to quickly detect and avoid similar issues in the future?
To add to this: the issue seemed to be related to a ceph-volume process that was running check operations on all devices. The OSD systemd service was timing out because of that, and the OSD daemon was going into an error state.
We noticed that version 17.2.5 had a change related to ceph-volume, in particular https://tracker.ceph.com/issues/57627.
We decided to skip 16.2.11 and jump to 17.2.5. This second attempt went well, so the issue is now solved.
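For the record, the second attempt was started the same way as the first one, just with the new target release, i.e. something like:
“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 17.2.5
“””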
Note: the upgrade 16.2.7 -> 16.2.11 went smoothly on a TDS cluster with identical OS/software but much smaller (3 nodes with a couple of disks each), so the issue really does seem to be related to the number of devices and nodes.
Regards,
Giuseppe
On 30.03.23, 16:56, "Lo Re Giuseppe" <giuseppe.lore@cscs.ch> wrote:
Dear all,
On one of our clusters I started the upgrade process from 16.2.7 to 16.2.11.
The mon, mgr, and crash daemons were upgraded easily and quickly, but at the first attempt to upgrade an OSD container the upgrade process stopped because the OSD process was not able to start after the upgrade.
Does anyone have any hint on how to unblock the upgrade?
Some details below:
Regards,
Giuseppe
I started the upgrade process with the cephadm command:
“””
[root@naret-monitor01 ~]# ceph orch upgrade start --ceph-version 16.2.11
Initiating upgrade to quay.io/ceph/ceph:v16.2.11
“””
After a short time:
“””
[root@naret-monitor01 ~]# ceph orch upgrade status
{
"target_image": quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add<mailto:quay.io/ceph/ceph@sha256:1b9803c8984bef8b82f05e233e8fe8ed8f0bba8e5cc2c57f6efaccbeea682add>,
"in_progress": true,
"which": "Upgrading all daemon types on all hosts",
"services_complete": [
"crash",
"mon",
"mgr"
],
"progress": "64/2039 daemons upgraded",
"message": "Error: UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 failed.",
"is_paused": true
}
“””
The ceph health command reports:
“””
[root@naret-monitor01 ~]# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s); 1 osds down; Degraded data redundancy: 2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs undersized; Upgrading daemon osd.4 on host naret-osd01 failed.
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
daemon osd.22 on naret-osd01 is in error state
[WRN] OSD_DOWN: 1 osds down
osd.4 (root=default,host=naret-osd01) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 2654362/6721382840 objects degraded (0.039%), 14 pgs degraded, 14 pgs undersized
pg 28.88 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1373,1337,1508,852,2147483647,483]
pg 28.528 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1063,793,2147483647,931,338,1777]
pg 28.594 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1208,891,1651,364,2147483647,53]
pg 28.8b4 is stuck undersized for 6m, current state active+undersized+degraded, last acting [521,1273,1238,138,1539,2147483647]
pg 28.a90 is stuck undersized for 6m, current state active+undersized+degraded, last acting [237,1665,1836,2147483647,192,1410]
pg 28.ad6 is stuck undersized for 6m, current state active+undersized+degraded, last acting [870,466,350,885,1601,2147483647]
pg 28.b34 is stuck undersized for 6m, current state active+undersized+degraded, last acting [920,1596,2147483647,115,201,941]
pg 28.c14 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1389,424,2147483647,268,1646,632]
pg 28.dba is stuck undersized for 6m, current state active+undersized+degraded, last acting [1099,561,2147483647,1806,1874,1145]
pg 28.ee2 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1621,1904,1044,2147483647,1545,722]
pg 29.163 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1883,2147483647,1509,1697,1187,235]
pg 29.1c1 is stuck undersized for 6m, current state active+undersized+degraded, last acting [122,1226,962,1254,1215,2147483647]
pg 29.254 is stuck undersized for 6m, current state active+undersized+degraded, last acting [1782,1839,1545,412,196,2147483647]
pg 29.2a1 is stuck undersized for 6m, current state active+undersized+degraded, last acting [370,2147483647,575,1423,1755,446]
[WRN] UPGRADE_REDEPLOY_DAEMON: Upgrading daemon osd.4 on host naret-osd01 failed.
Upgrade daemon: osd.4: cephadm exited with an error code: 1, stderr:Redeploy daemon osd.4 ...
Non-zero exit code 1 from systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4
systemctl: stderr Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
systemctl: stderr See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
Traceback (most recent call last):
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9248, in <module>
main()
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 9236, in main
r = ctx.func(ctx)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1990, in _default_image
return func(ctx)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 5041, in command_deploy
ports=daemon_ports)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 2952, in deploy_daemon
c, osd_fsid=osd_fsid, ports=ports)
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 3197, in deploy_daemon_units
call_throws(ctx, ['systemctl', 'start', unit_name])
File "/var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/cephadm.8d0364fef6c92fc3580b0d022e32241348e6f11a7694d2b957cdafcb9d059ff2", line 1657, in call_throws
raise RuntimeError(f'Failed command: {" ".join(command)}: {s}')
RuntimeError: Failed command: systemctl start ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4: Job for ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service failed because a timeout was exceeded.
See "systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service" and "journalctl -xe" for details.
“””
On the OSD server we have:
“””
[root@naret-osd01 ~]# uname -a
Linux naret-osd01 4.18.0-425.10.1.el8_7.x86_64 #1 SMP Wed Dec 14 16:00:01 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@naret-osd01 ~]# podman -v
podman version 4.2.0
[root@naret-osd01 ~]# ceph -v
ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable)
[root@naret-osd01 ~]# cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.7 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL=https://www.redhat.com/
DOCUMENTATION_URL=https://access.redhat.com/documentation/red_hat_enterprise_linux/8/
BUG_REPORT_URL=https://bugzilla.redhat.com/
REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
“””
Systemctl says:
“””
systemctl status ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
…
● ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service - Ceph osd.4 for 63334166-d991-11eb-99de-40a6b72108d0
Loaded: loaded (/etc/systemd/system/ceph-63334166-d991-11eb-99de-40a6b72108d0@.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Mon 2023-03-27 15:34:29 CEST; 6min ago
Process: 730621 ExecStopPost=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
Process: 730209 ExecStopPost=/bin/bash /var/lib/ceph/63334166-d991-11eb-99de-40a6b72108d0/osd.4/unit.poststop (code=exited, status=0/SUCCESS)
Process: 710355 ExecStartPre=/bin/rm -f /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-cid (code=exited, status=0/SUCCESS)
Main PID: 23025 (code=exited, status=0/SUCCESS)
Tasks: 62 (limit: 1647878)
Memory: 961.8M
CGroup: /system.slice/system-ceph\x2d63334166\x2dd991\x2d11eb\x2d99de\x2d40a6b72108d0.slice/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service
├─libpod-payload-b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e
│ └─754976 /usr/bin/ceph-osd -n osd.4 -f --setuser ceph --setgroup ceph --default-log-to-file=false --default-log-to-stderr=true --default-log-stderr-prefix=debug
└─runtime
└─754965 /usr/bin/conmon --api-version 1 -c b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -u b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e -r /usr/bin/runc -b /var/lib/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata -p /run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/pidfile -n ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4 --exit-dir /run/libpod/exits --full-attach -l journald --log-level warning --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/run/containers/storage/overlay-containers/b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e/userdata/oci-log --conmon-pidfile /run/ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service-pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /run/containers/storage --exit-command-arg --log-level --exit-command-arg warning --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /run/libpod --exit-command-arg --network-config-dir --exit-command-arg --exit-command-arg --network-backend --exit-command-arg cni --exit-command-arg --volumepath --exit-command-arg /var/lib/containers/storage/volumes --exit-command-arg --runtime --exit-command-arg runc --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg file --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg b4f0ebebdfec38942b614756b6329b04d2939db29a0a9823e314b848680bc58e
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e6ae1700 1 osd.4 pg_epoch: 821628 pg[28.dbas2( v 821618'4657799 (819107'4647770,821618'4657799] local-lis/les=749842/749843 n=239290 ec=130297/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=7.751406670s) [1099,561,4,1806,1874,1145]p1099(0) r=2 lpr=821628 pi=[749842,821628)/1 crt=821618'4657799 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.039081573s@ mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e8ae5700 1 osd.4 pg_epoch: 821628 pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 les/c/f=821624/749852/0 sis=821628 pruub=8.023463249s) [1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ mbc={}] start_peering_interval up [1883,4,1509,1697,1187,235] -> [1883,4,1509,1697,1187,235], acting [1883,2147483647,1509,1697,1187,235] -> [1883,4,1509,1697,1187,235], acting_primary 1883(0) -> 1883, up_primary 1883(0) -> 1883, role -1 -> 1, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.886+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 les/c/f=821624/749850/0 sis=821628 pruub=7.845988274s) [370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ mbc={}] start_peering_interval up [370,4,575,1423,1755,446] -> [370,4,575,1423,1755,446], acting [370,2147483647,575,1423,1755,446] -> [370,4,575,1423,1755,446], acting_primary 370(0) -> 370, up_primary 370(0) -> 370, role -1 -> 1, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e8ae5700 1 osd.4 pg_epoch: 821628 pg[29.163s1( v 821572'139334 (776804'129273,821572'139334] local-lis/les=749851/749852 n=65683 ec=130801/130801 lis/c=821623/749851 les/c/f=821624/749852/0 sis=821628 pruub=8.023443222s) [1883,4,1509,1697,1187,235]p1883(0) r=1 lpr=821628 pi=[749851,821628)/1 crt=821572'139334 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.311203003s@ mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[29.2a1s1( v 821500'140649 (776804'130601,821500'140649] local-lis/les=749849/749850 n=65848 ec=130801/130801 lis/c=821623/749849 les/c/f=821624/749850/0 sis=821628 pruub=7.845966339s) [370,4,575,1423,1755,446]p370(0) r=1 lpr=821628 pi=[749849,821628)/1 crt=821500'140649 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.133728981s@ mbc={}] state<Start>: transitioning to Stray
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=8.158309937s) [521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ mbc={} ps=[4~6]] start_peering_interval up [521,1273,1238,138,1539,4] -> [521,1273,1238,138,1539,4], acting [521,1273,1238,138,1539,2147483647] -> [521,1273,1238,138,1539,4], acting_primary 521(0) -> 521, up_primary 521(0) -> 521, role -1 -> 5, features acting 4540138297136906239 upacting 4540138297136906239
Mar 27 15:36:56 naret-osd01 ceph-63334166-d991-11eb-99de-40a6b72108d0-osd-4[754965]: debug 2023-03-27T13:36:56.887+0000 7f52e72e2700 1 osd.4 pg_epoch: 821628 pg[28.8b4s5( v 821618'2906095 (817032'2896088,821618'2906095] local-lis/les=749842/749843 n=239377 ec=130295/130290 lis/c=821623/749842 les/c/f=821624/749843/0 sis=821628 pruub=8.158291817s) [521,1273,1238,138,1539,4]p521(0) r=5 lpr=821628 pi=[749842,821628)/1 crt=821618'2906095 lcod 0'0 mlcod 0'0 unknown NOTIFY pruub 12.446221352s@ mbc={} ps=[4~6]] state<Start>: transitioning to Stray
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Start request repeated too quickly.
Mar 27 15:39:36 naret-osd01 systemd[1]: ceph-63334166-d991-11eb-99de-40a6b72108d0@osd.4.service: Failed with result 'timeout'.
Mar 27 15:39:36 naret-osd01 systemd[1]: Failed to start Ceph osd.4 for 63334166-d991-11eb-99de-40a6b72108d0.
“””
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I only get to know the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
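The only workaround I can think of so far would be the rgw ops log, which does contain the bucket name. Untested sketch (the socket path is just an example):

rgw_enable_ops_log = true
rgw_ops_log_rados = false
rgw_ops_log_socket_path = /var/run/ceph/rgw-ops.sock

and then reading the socket with something like `nc -U /var/run/ceph/rgw-ops.sock`.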
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb@kervyn.de> wrote:
> Hi,
> I am looking to move our logs from
> /var/log/ceph/ceph-client...log to our log aggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw ops log (rgw_enable_ops_log) to a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
The self-help group "UTF-8 problems" will meet, as an exception, in the big hall this time.
Hi everyone,
I discovered a documentation inconsistency in Ceph Nautilus and would like to know whether this is still the case in the latest ceph release before reporting a bug. Unfortunately, I only have access to a Nautilus cluster right now.
The quincy docs state [1]:
> Create the OSD. If no UUID is given, it will be set automatically when the OSD starts up. The following command will output the OSD number, which you will need for subsequent steps:
>
>ceph osd create [{uuid} [{id}]]
But the man pages [2] state that `ceph osd create` is deprecated in favour of `ceph osd new {<uuid>} {<id>} -i {<params.json>}`, with both uuid and id still being marked as optional parameters.
But when actually running `ceph osd new` without a specified UUID, I get
```
Invalid command: missing required parameter uuid(<uuid>)
osd new <uuid> {<osdname (id|osd.id)>} : Create a new OSD. If supplied, the `id` to be replaced needs to exist and have been previously destroyed. Reads secrets from JSON file via `-i <file>` (see man page).
Error EINVAL: invalid command
```
under Nautilus. Is this still the case under Quincy? Can someone reproduce this for me?
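For completeness, the full invocation I would expect based on the manual-deployment docs is roughly the following (untested on Quincy; the keyring path is just the usual default):

```
UUID=$(uuidgen)
OSD_SECRET=$(ceph-authtool --gen-print-key)
# a uuid apparently has to be supplied explicitly, despite being documented as optional
OSD_ID=$(echo "{\"cephx_secret\": \"$OSD_SECRET\"}" | \
    ceph osd new $UUID -i - \
    -n client.bootstrap-osd -k /var/lib/ceph/bootstrap-osd/ceph.keyring)
```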
Best regards
Oliver Schmidt
[1] https://docs.ceph.com/en/quincy/rados/operations/add-or-rm-osds/
[2] https://docs.ceph.com/en/quincy/man/8/ceph/#osd
--
Oliver Schmidt · os@flyingcircus.io · Systems Engineer
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Leipziger Str. 70/71 · 06108 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick
Hi folks.
Just looking for some up-to-date advice, please, from the collective on how best to set up Ceph on 5 Proxmox hosts, each configured with the following:
AMD Ryzen 7 5800X CPU
64GB RAM
2x SSD (as ZFS boot disk for Proxmox)
1x 500GB NVMe for DB/WAL
1x 1TB NVMe as an OSD
1x 16TB SATA HDD as an OSD
2x 10Gb NIC (one for the public network and one for the cluster network)
1x 1Gb NIC for the management interface
The Ceph solution will be used primarily for storage of another Proxmox cluster's virtual machines and their data. We'd like a fast pool using the NVMes for critical VMs and a slower HDD-based pool for VMs that don't require such fast disk access and perhaps need more storage capacity.
To expand in the future, we will probably add more hosts in the same sort of configuration and/or replace the NVMe/HDD OSDs with more capacious ones.
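For the fast/slow split, the rough idea I have so far is device-class based CRUSH rules, something like this (untested; pool names and PG counts are just placeholders):

ceph osd crush rule create-replicated replicated_nvme default host nvme
ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool create vm-fast 128 128 replicated replicated_nvme
ceph osd pool create vm-slow 128 128 replicated replicated_hdd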
Ideas for configuration welcome please.
Many thanks
Tino
Coastsense Ltd
I hope this email finds you well. I wanted to share a recent experience I
had with our Ceph cluster and get your feedback on a solution I came up
with.
Recently, we had some orphan objects stuck in our cluster that were not
visible to any client such as s3cmd, boto3, or mc. This caused some confusion
for our users, as the sum of all objects in their buckets was much less
than what we showed in the panel. We made some adjustments for them, but
the issue persisted.
As we have billions of objects in our cluster, using normal tools to find
orphans was impossible. So, I came up with a tricky way to handle the
situation. I created a bash script that identifies and removes the orphan
objects using radosgw-admin and rados commands. Here is the script:
https://gist.github.com/RaminNietzsche/b9baa06b69fc5f56d907f3c953769182
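In essence, the idea boils down to something like the following sketch (heavily simplified compared to the actual script; the pool and bucket names are placeholders):

rados -p default.rgw.buckets.data ls | sort > all_rados_objects.txt
radosgw-admin bucket radoslist --bucket=<bucket> | sort > referenced_objects.txt   # repeated per bucket
comm -23 all_rados_objects.txt referenced_objects.txt > orphan_candidates.txt
# candidates should be reviewed carefully before any "rados rm"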
I am hoping to get some feedback from the community on this solution. Have
any of you faced similar challenges with orphan objects in your Ceph
clusters? Do you have any suggestions or improvements for my script?
Thank you for your time and help.