Hi,
I am in the middle of a migration from ceph-ansible to cephadm (version
Quincy); so far so good ;-). I have some questions:
- I still have the ceph-crash container, what should I do with it?
- The new rgw and mds daemons have a random string in their names (like
rgw.opsrgw.controllera.*pkajqw*). Is this correct?
- How should I proceed with the monitoring stack (grafana, prometheus,
alertmanager and node-exporter)? Should I stop and then delete the old ones,
then deploy the new ones with ceph orch?
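For the monitoring stack, something along these lines is what I had in mind
(a sketch only, using the default placements; I haven't run this yet):

    ceph orch apply prometheus
    ceph orch apply alertmanager
    ceph orch apply grafana
    ceph orch apply node-exporter '*'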
Regards.
I was reading on the Ceph site that iSCSI has not been under active development since November 2022. Why is that?
https://docs.ceph.com/en/latest/rbd/iscsi-overview/
-- Michael
Found this mention in the CLT Minutes posted this morning[1], of a discussion on ceph-dev[2] about dropping ubuntu focal builds for the squid release, and beginning builds of quincy for jammy to facilitate quincy->squid upgrades.
> there was a consensus to drop support for ubuntu focal and centos
> stream 8 with the squid release, and i'd love to remove those distros
> from the shaman build matrix for squid and main branches asap
>
> however, i see that quincy never supported ubuntu jammy, so our quincy
> upgrade tests still have to run against focal. that means we'd still
> have to build focal packages for squid
>
> would it be possible to start building jammy packages for quincy to
> allow those upgrade tests to run jammy instead?
Just wanting to voice my support for this, as it both matches the historical ceph:ubuntu release cadence going back roughly a decade and gives Ubuntu users a reasonable upgrade window to get to jammy.
+----------+-----+-----+-----+-----+-----+-----+
| ceph     | u14 | u16 | u18 | u20 | u22 | u24 |
+----------+-----+-----+-----+-----+-----+-----+
| jewel    |  +  |  +  |  -  |  -  |  -  |  -  |
| luminous |  +  |  +  |  -  |  -  |  -  |  -  |
| mimic    |  -  |  +  |  +  |  -  |  -  |  -  |
| nautilus |  -  |  +  |  +  |  -  |  -  |  -  |
| octopus  |  -  |  -  |  +  |  +  |  -  |  -  |
| pacific  |  -  |  -  |  +  |  +  |  -  |  -  |
| quincy   |  -  |  -  |  -  |  +  |  M  |  -  |
| reef     |  -  |  -  |  -  |  +  |  +  |  -  |
| squid    |  -  |  -  |  -  |  -  |  E  |  E  |
| T        |  -  |  -  |  -  |  -  |  E  |  E  |
+----------+-----+-----+-----+-----+-----+-----+
Hopefully this table comes through the mailing list formatting well enough.
Going back to jewel/10 and Ubuntu 14.04/trusty, there have consistently been four Ceph releases per Ubuntu LTS release, with a distro dropped/added every two Ceph releases.
This gives an ample window for users to upgrade Ubuntu and Ceph at a reasonable pace.
However, with quincy not being built for jammy (M = missing), the trend broke, forcing anyone looking to get to jammy to go all the way to reef, which they may not be ready to do just yet.
Running the table out to the T release and ubuntu 24.04/noble, following this trend, it would be expected (E=expected) that squid would be built for jammy (and eventually noble), and the same would be true for the T release.
Many words to say that as a user this would be beneficial to me, and likely others.
Reed
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SI3KZTU6GLG…
[2] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/ONAWOAE7MPMT7CP6KH…
Hi folks,
Today we discussed:
- [casey] on dropping ubuntu focal support for squid
- Discussion thread:
https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/ONAWOAE7MPMT7CP6KH…
- Quincy doesn't build jammy packages, so quincy->squid upgrade tests
have to run on focal
- proposing to add jammy packages for quincy to enable that upgrade path
(from 17.2.8+)
- https://github.com/ceph/ceph-build/pull/2206
- Need to indicate that Quincy clusters must upgrade to jammy before
upgrading to Squid.
- T release name: https://pad.ceph.com/p/t
- Tentacle wins!
- Patrick to do release kick-off
- Cephalocon news?
- Planning is in progress; no news, as the knowledgeable parties were not
present for this meeting.
- Volunteers for compiling the Contributor Credits?
- https://tracker.ceph.com/projects/ceph/wiki/Ceph_contributors_list_maintena…
- Laura will give it a try.
- Plan for tagged vs. named Github milestones?
- Continue using priority order for qa testing: exhaust testing on
tagged milestone, then go to "release" catch-all milestone
- v18.2.2 hotfix release next
- Reef HEAD is still cooking with to-be-addressed upgrade issues.
- v19.1.0 (first Squid RC)
- two rgw features still waiting to go into squid
- cephfs quiesce feature to be backported
- Nightlies crontab to be updated by Patrick.
- v19.1.0 milestone: https://github.com/ceph/ceph/milestone/21
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Greetings -
Is there any way to tune what Ceph will complain about in terms of "full disks"?
One of my Ceph servers has an NFS mount which is, for all intents and purposes, read-only and is sitting at 100% full. Ceph keeps warning me about this unless I unmount the NFS mount point.
Is there any way to tell it to ignore that mount point?
I'm using Reef 18.2.1, running on Ubuntu, which was set up with cephadm.
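If the warning actually comes from the Prometheus/node-exporter filesystem
alert rather than from Ceph itself, I was wondering whether excluding the
mount point from node-exporter's filesystem collector would be enough. A
rough, untested sketch (the /mnt/nfs path and the spec file name are just
placeholders):

    # node-exporter.yaml -- exclude the read-only NFS mount from filesystem metrics
    service_type: node-exporter
    service_name: node-exporter
    placement:
      host_pattern: '*'
    extra_entrypoint_args:
      - "--collector.filesystem.mount-points-exclude=^/mnt/nfs($|/)"

    ceph orch apply -i node-exporter.yaml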
Hi,
Short summary
PG 404.bc is an EC 4+2 where s0 and s2 report a hash mismatch for 698
objects.
ceph pg repair doesn't fix it, because if you run deep-scrub on the PG
after repair has finished, it still reports scrub errors.
Why can't ceph pg repair fix this? With 4 healthy shards out of 6, it
should be able to reconstruct the corrupted shards.
Is there a way to fix this? For example, deleting shards s0 and s2 so Ceph
is forced to recreate them?
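I assume the per-shard errors can also be listed directly from the PG,
with something like this (sketch):

    # list the objects and shards flagged inconsistent in the PG
    rados list-inconsistent-obj 404.bc --format=json-pretty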
Long detailed summary
A short backstory.
* This is the aftermath of the problems with mclock, see the post "17.2.7:
Backfilling deadlock / stall / stuck / standstill" [1].
- 4 OSDs had a few bad sectors; all 4 were set out, and the cluster stopped.
- The solution was to switch from mclock to wpq and restart all OSDs.
- When all backfilling was finished, all 4 OSDs were replaced.
- osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.
PG / pool 404 is EC 4+2 default.rgw.buckets.data
9 days after osd.223 and osd.269 were replaced, a deep-scrub was run and
reported errors.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg
inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+inconsistent, acting
[223,297,269,276,136,197]
I then ran repair
ceph pg repair 404.bc
And ceph status showed this
ceph status
-----------
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
But osd.223 and osd.269 are new disks, and the disks have no SMART errors or
any I/O errors in the OS logs.
So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc
And got this result.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs;
Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair,
acting [223,297,269,276,136,197]
698 + 698 = 1396, so the same number of errors.
I ran repair again on 404.bc and ceph status showed
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 1396 reads repaired
osd.269 had 1396 reads repaired
So even when repair finishes, it doesn't fix the problem, since the errors
reappear after a deep-scrub.
The logs for osd.223 and osd.269 contain "got incorrect hash on read" and
"candidate had an ec hash mismatch" for 698 unique objects.
I only show the logs for 1 of the 698 objects; the log entries are the same
for the other 697 objects.
osd.223 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch:
231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263
les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=0
lpr=226263 crt=231235'1636919 lcod 231235'1636918 mlcod 231235'1636918
active+clean+scrubbing+deep+inconsistent+repair [ 404.bcs0: REQ_SCRUB ]
MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0xc5d1dd1b != expected 0x7c2f86d7
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster)
log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster)
log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster) log
[ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster) log
[ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
osd.269 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-001 ceph-osd[3656897]: osd.269 pg_epoch:
231235 pg[404.bcs2( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263
les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=2
lpr=226263 luod=0'0 crt=231235'1636919 mlcod 231235'1636919 active
mbc={}] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0x7c0871dc != expected 0xcf6f4c58
The logs for the other OSDs in the PG (osd.297, osd.276, osd.136 and
osd.197) don't show any errors.
If I try to get the object, it fails
$ s3cmd s3://benchfiles/2021-11-08T19:43:50,145489260+00:00
download: 's3://benchfiles/2021-11-08T19:43:50,145489260+00:00' ->
'./2021-11-08T19:43:50,145489260+00:00' [1 of 1]
ERROR: Download of './2021-11-08T19:43:50,145489260+00:00' failed
(Reason: 500 (UnknownError))
ERROR: S3 error: 500 (UnknownError)
And the RGW log shows this
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b744d660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: WARNING: set_req_state_err
err_no=5 resorting to 500
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b6e41660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== req done
req=0x7f94b744d660 op status=-5 http_status=500 latency=0.020000568s
======
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: beast: 0x7f94b744d660:
110.2.0.46 - test1 [21/Feb/2024:08:27:06.021 +0000] "GET
/benchfiles/2021-11-08T19%3A43%3A50%2C145489260%2B00%3A00 HTTP/1.1" 500
226 - - - latency=0.020000568s
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5A…
--
Kai Stian Olstad
Is there a how-to document or cheat sheet on how to enable OSD encryption using dm-crypt?
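For context, what I have pieced together so far looks roughly like this, but
I have not verified it (the service_id and device selection are just
placeholders):

    # osd-encrypted.yaml -- cephadm OSD spec with dm-crypt enabled
    service_type: osd
    service_id: encrypted_osds
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        all: true
      encrypted: true

    ceph orch apply -i osd-encrypted.yaml

or, for a single manually prepared OSD, something like:

    ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdX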
-- Michael
Hi,
Running Ceph Quincy 17.2.7 on Ubuntu Focal LTS, the ceph-mgr service reports
the following error:
client.0 error registering admin socket command: (17) File exists
I don't use any extra mgr configuration beyond this:
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/log_level debug
mgr advanced mgr/balancer/log_to_cluster true
mgr advanced mgr/balancer/mode upmap
mgr advanced mgr/balancer/upmap_max_deviation 1
mgr advanced mgr/balancer/upmap_max_optimizations 20
mgr advanced mgr/prometheus/cache true
Do you have any idea what the cause is and how to fix it?
Thank you
Hello.
With SSD drives that lack tantalum capacitors (i.e. no power-loss
protection), Ceph faces trim latency on every write.
I wonder if the behavior is the same if we locate the WAL+DB on NVMe drives
with tantalum capacitors.
Do I need to use NVMe + SAS SSD to avoid this latency issue?
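For context, the layout I have in mind would be expressed with an OSD spec
roughly like this (a sketch; the device paths are placeholders):

    service_type: osd
    service_id: ssd_data_nvme_db
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        paths:
          - /dev/sdb        # placeholder: SAS SSD for data
      db_devices:
        paths:
          - /dev/nvme0n1    # placeholder: NVMe with power-loss protection for WAL+DB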
Best regards.
Hi,
I have some questions about ceph using cephadm.
I used to deploy Ceph using ceph-ansible; now I have to move to cephadm, and
I am on my learning journey.
- How can I tell my cluster that it's part of an HCI deployment? With
ceph-ansible it was easy using is_hci: yes.
- The Ceph documentation does not indicate which versions of grafana,
prometheus, etc. should be used with a given Ceph version.
- I am trying to deploy Quincy; I did a bootstrap to see which
containers were downloaded and their versions.
- I am asking because I need to use a local registry to deploy those
images (see the sketch at the end of this list).
- After the bootstrap, the web interface was accessible:
- How can I access the wizard page again? If I don't use it the first
time, I cannot find another way to get back to it.
- I had a problem with telemetry: I had not configured telemetry, and
when I clicked the button, the web GUI became inaccessible.
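For the local registry question above, a sketch of what I was considering
(the registry host, image names and tags are placeholders, not the
authoritative versions for Quincy):

    # point cephadm at mirrored monitoring images
    ceph config set mgr mgr/cephadm/container_image_prometheus registry.local:5000/prometheus/prometheus:v2.x
    ceph config set mgr mgr/cephadm/container_image_alertmanager registry.local:5000/prometheus/alertmanager:v0.x
    ceph config set mgr mgr/cephadm/container_image_grafana registry.local:5000/ceph/ceph-grafana:8.x
    ceph config set mgr mgr/cephadm/container_image_node_exporter registry.local:5000/prometheus/node-exporter:v1.x

    # and the main Ceph image at bootstrap time
    cephadm --image registry.local:5000/ceph/ceph:v17.2 bootstrap --mon-ip <ip>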
Regards.