Hi,
I am in the middle of a migration from ceph-ansible to cephadm (version
Quincy); so far so good ;-). I have some questions:
- I still have the ceph-crash container, what should I do with it?
- The new rgw and mds daemons have a random string in their names (like
rgw.opsrgw.controllera.*pkajqw*). Is this correct?
- How should I proceed with the monitoring stack (grafana, prometheus,
alertmanager and node-exporter)? Should I stop and then delete the old ones,
then deploy the new ones with ceph orch?
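For the monitoring stack, something along these lines is what I had in mind
(a sketch only, using the default placements; I haven't run this yet):

    ceph orch apply prometheus
    ceph orch apply alertmanager
    ceph orch apply grafana
    ceph orch apply node-exporter '*'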
Regards.
I was reading on the Ceph site that iSCSI has not been under active development since November 2022. Why is that?
https://docs.ceph.com/en/latest/rbd/iscsi-overview/
-- Michael
Found this mention in the CLT Minutes posted this morning[1], of a discussion on ceph-dev[2] about dropping ubuntu focal builds for the squid release, and beginning builds of quincy for jammy to facilitate quincy->squid upgrades.
> there was a consensus to drop support for ubuntu focal and centos
> stream 8 with the squid release, and i'd love to remove those distros
> from the shaman build matrix for squid and main branches asap
>
> however, i see that quincy never supported ubuntu jammy, so our quincy
> upgrade tests still have to run against focal. that means we'd still
> have to build focal packages for squid
>
> would it be possible to start building jammy packages for quincy to
> allow those upgrade tests to run jammy instead?
Just wanting to voice my support for this, as it both matches the historical ceph:ubuntu release cadence going back roughly a decade and gives Ubuntu users a reasonable upgrade window to get to jammy.
+----------+-----+-----+-----+-----+-----+-----+
| ceph     | u14 | u16 | u18 | u20 | u22 | u24 |
+----------+-----+-----+-----+-----+-----+-----+
| jewel    |  +  |  +  |  -  |  -  |  -  |  -  |
| luminous |  +  |  +  |  -  |  -  |  -  |  -  |
| mimic    |  -  |  +  |  +  |  -  |  -  |  -  |
| nautilus |  -  |  +  |  +  |  -  |  -  |  -  |
| octopus  |  -  |  -  |  +  |  +  |  -  |  -  |
| pacific  |  -  |  -  |  +  |  +  |  -  |  -  |
| quincy   |  -  |  -  |  -  |  +  |  M  |  -  |
| reef     |  -  |  -  |  -  |  +  |  +  |  -  |
| squid    |  -  |  -  |  -  |  -  |  E  |  E  |
| T        |  -  |  -  |  -  |  -  |  E  |  E  |
+----------+-----+-----+-----+-----+-----+-----+
Hopefully this table comes through the mailing list formatting well enough.
Going back to jewel/10 and Ubuntu 14.04/trusty, there have consistently been four Ceph releases per Ubuntu LTS release, with a distro dropped/added every two Ceph releases.
This gives an ample window for users to upgrade Ubuntu and Ceph at a reasonable pace.
However, with quincy not being built for jammy (M = missing), the trend broke, forcing anyone looking to get to jammy to go all the way to reef, which they may not be ready to do just yet.
Running the table out to the T release and ubuntu 24.04/noble, following this trend, it would be expected (E=expected) that squid would be built for jammy (and eventually noble), and the same would be true for the T release.
Many words to say that as a user this would be beneficial to me, and likely others.
Reed
[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/SI3KZTU6GLG…
[2] https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/ONAWOAE7MPMT7CP6KH…
Hi folks,
Today we discussed:
- [casey] on dropping ubuntu focal support for squid
- Discussion thread:
https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/ONAWOAE7MPMT7CP6KH…
- Quincy doesn't build jammy packages, so quincy->squid upgrade tests
have to run on focal
- proposing to add jammy packages for quincy to enable that upgrade path
(from 17.2.8+)
- https://github.com/ceph/ceph-build/pull/2206
- Need to indicate that Quincy clusters must upgrade to jammy before
upgrading to Squid.
- T release name: https://pad.ceph.com/p/t
- Tentacle wins!
- Patrick to do release kick-off
- Cephalocon news?
- Planning is in progress; no news, as the knowledgeable parties were not
present for this meeting.
- Volunteers for compiling the Contributor Credits?
- https://tracker.ceph.com/projects/ceph/wiki/Ceph_contributors_list_maintena…
- Laura will give it a try.
- Plan for tagged vs. named Github milestones?
- Continue using priority order for qa testing: exhaust testing on
tagged milestone, then go to "release" catch-all milestone
- v18.2.2 hotfix release next
- Reef HEAD is still cooking with to-be-addressed upgrade issues.
- v19.1.0 (first Squid RC)
- two rgw features still waiting to go into squid
- cephfs quiesce feature to be backported
- Nightlies crontab to be updated by Patrick.
- v19.1.0 milestone: https://github.com/ceph/ceph/milestone/21
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Greetings -
Is there any way to tune what Ceph will complain about in terms of "full disks"?
One of my Ceph servers has an NFS mount which is, for all intents and purposes, read-only and is sitting at 100% full. Ceph keeps warning me about this unless I unmount the NFS mount point.
Is there any way to tell it to ignore that mount point?
I'm using Reef 18.2.1, running on Ubuntu, which was set up with cephadm.
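If the warning actually comes from the Prometheus/node-exporter filesystem
alert rather than from Ceph itself, I was wondering whether excluding the
mount point from node-exporter's filesystem collector would be enough. A
rough, untested sketch (the /mnt/nfs path and the spec file name are just
placeholders):

    # node-exporter.yaml -- exclude the read-only NFS mount from filesystem metrics
    service_type: node-exporter
    service_name: node-exporter
    placement:
      host_pattern: '*'
    extra_entrypoint_args:
      - "--collector.filesystem.mount-points-exclude=^/mnt/nfs($|/)"

    ceph orch apply -i node-exporter.yaml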
Hi,
Short summary
PG 404.bc is an EC 4+2 where s0 and s2 report a hash mismatch for 698
objects.
ceph pg repair doesn't fix it, because if you run deep-scrub on the PG
after repair has finished, it still reports scrub errors.
Why can't ceph pg repair fix this? With 4 healthy shards out of 6, it
should be able to reconstruct the corrupted shards.
Is there a way to fix this? For example, deleting shards s0 and s2 so Ceph
is forced to recreate them?
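I assume the per-shard errors can also be listed directly from the PG,
with something like this (sketch):

    # list the objects and shards flagged inconsistent in the PG
    rados list-inconsistent-obj 404.bc --format=json-pretty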
Long detailed summary
A short backstory.
* This is the aftermath of the problems with mclock, see the post "17.2.7:
Backfilling deadlock / stall / stuck / standstill" [1].
- 4 OSDs had a few bad sectors; all 4 were set out, and the cluster stopped.
- The solution was to switch from mclock to wpq and restart all OSDs.
- When all backfilling was finished, all 4 OSDs were replaced.
- osd.223 and osd.269 were 2 of the 4 OSDs that were replaced.
PG / pool 404 is EC 4+2 default.rgw.buckets.data
9 days after osd.223 and osd.269 were replaced, a deep-scrub was run and
reported errors.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Possible data damage: 1 pg
inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+inconsistent, acting
[223,297,269,276,136,197]
I then ran repair
ceph pg repair 404.bc
And ceph status showed this
ceph status
-----------
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
But osd.223 and osd.269 are new disks, and the disks have no SMART errors or
any I/O errors in the OS logs.
So I tried to run deep-scrub again on the PG.
ceph pg deep-scrub 404.bc
And got this result.
ceph status
-----------
HEALTH_ERR 1396 scrub errors; Too many repaired reads on 2 OSDs;
Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1396 scrub errors
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 698 reads repaired
osd.269 had 698 reads repaired
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 404.bc is active+clean+scrubbing+deep+inconsistent+repair,
acting [223,297,269,276,136,197]
698 + 698 = 1396, so the same number of errors.
I ran repair again on 404.bc and ceph status showed
HEALTH_WARN Too many repaired reads on 2 OSDs
[WRN] OSD_TOO_MANY_REPAIRS: Too many repaired reads on 2 OSDs
osd.223 had 1396 reads repaired
osd.269 had 1396 reads repaired
So even when repair finishes, it doesn't fix the problem, since the errors
reappear after a deep-scrub.
The logs for osd.223 and osd.269 contain "got incorrect hash on read" and
"candidate had an ec hash mismatch" for 698 unique objects.
I only show the logs for 1 of the 698 objects; the log entries are the same
for the other 697 objects.
osd.223 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-003 ceph-osd[3665432]: osd.223 pg_epoch:
231235 pg[404.bcs0( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263
les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=0
lpr=226263 crt=231235'1636919 lcod 231235'1636918 mlcod 231235'1636918
active+clean+scrubbing+deep+inconsistent+repair [ 404.bcs0: REQ_SCRUB ]
MUST_REPAIR MUST_DEEP_SCRUB MUST_SCRUB planned REQ_SCRUB] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0xc5d1dd1b != expected 0x7c2f86d7
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster)
log [ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003 ceph-osd[3665432]: log_channel(cluster)
log [ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster) log
[ERR] : 404.bc shard 223(0) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
Feb 20 10:31:01 ceph-hd-003
ceph-b321e76e-da3a-11eb-b75c-4f948441dcd0-osd-223[3665427]:
2024-02-20T10:31:01.117+0000 7f128a88d700 -1 log_channel(cluster) log
[ERR] : 404.bc shard 269(2) soid
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
: candidate had an ec hash mismatch
osd.269 log (only showing 1 of the 698 objects, named
2021-11-08T19%3a43%3a50,145489260+00%3a00)
-----------
Feb 20 10:31:00 ceph-hd-001 ceph-osd[3656897]: osd.269 pg_epoch:
231235 pg[404.bcs2( v 231235'1636919 (231078'1632435,231235'1636919]
local-lis/les=226263/226264 n=296580 ec=36041/27862 lis/c=226263/226263
les/c/f=226264/230954/0 sis=226263) [223,297,269,276,136,197]p223(0) r=2
lpr=226263 luod=0'0 crt=231235'1636919 mlcod 231235'1636919 active
mbc={}] _scan_list
404:3d001f95:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.3__multipart_2021-11-08T19%3a43%3a50,145489260+00%3a00.2~OoetD5vkh8fyh-2eeR7GF5rZK7d5EVa.1:head
got incorrect hash on read 0x7c0871dc != expected 0xcf6f4c58
The logs for the other OSDs in the PG (osd.297, osd.276, osd.136 and
osd.197) don't show any errors.
If I try to get the object, it fails
$ s3cmd s3://benchfiles/2021-11-08T19:43:50,145489260+00:00
download: 's3://benchfiles/2021-11-08T19:43:50,145489260+00:00' ->
'./2021-11-08T19:43:50,145489260+00:00' [1 of 1]
ERROR: Download of './2021-11-08T19:43:50,145489260+00:00' failed
(Reason: 500 (UnknownError))
ERROR: S3 error: 500 (UnknownError)
And the RGW log shows this
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b744d660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: WARNING: set_req_state_err
err_no=5 resorting to 500
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== starting new
request req=0x7f94b6e41660 =====
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: ====== req done
req=0x7f94b744d660 op status=-5 http_status=500 latency=0.020000568s
======
Feb 21 08:27:06 ceph-mon-1 radosgw[1747]: beast: 0x7f94b744d660:
110.2.0.46 - test1 [21/Feb/2024:08:27:06.021 +0000] "GET
/benchfiles/2021-11-08T19%3A43%3A50%2C145489260%2B00%3A00 HTTP/1.1" 500
226 - - - latency=0.020000568s
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5A…
--
Kai Stian Olstad
Is there a how-to document or cheat sheet on how to enable OSD encryption using dm-crypt?
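For context, what I have pieced together so far looks roughly like this, but
I have not verified it (the service_id and device selection are just
placeholders):

    # osd-encrypted.yaml -- cephadm OSD spec with dm-crypt enabled
    service_type: osd
    service_id: encrypted_osds
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        all: true
      encrypted: true

    ceph orch apply -i osd-encrypted.yaml

or, for a single manually prepared OSD, something like:

    ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdX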
-- Michael
Hi,
Running Ceph Quincy 17.2.7 on Ubuntu Focal LTS, the ceph-mgr service reports
the following error:
client.0 error registering admin socket command: (17) File exists
I don't use any extra mgr configuration beyond this:
mgr advanced mgr/balancer/active true
mgr advanced mgr/balancer/log_level debug
mgr advanced mgr/balancer/log_to_cluster true
mgr advanced mgr/balancer/mode upmap
mgr advanced mgr/balancer/upmap_max_deviation 1
mgr advanced mgr/balancer/upmap_max_optimizations 20
mgr advanced mgr/prometheus/cache true
Do you have any idea what the cause is and how to fix it?
Thank you
Hello.
With SSD drives that lack tantalum capacitors (i.e. no power-loss
protection), Ceph faces trim latency on every write.
I wonder if the behavior is the same if we locate the WAL+DB on NVMe drives
with tantalum capacitors.
Do I need to use NVMe + SAS SSD to avoid this latency issue?
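For context, the layout I have in mind would be expressed with an OSD spec
roughly like this (a sketch; the device paths are placeholders):

    service_type: osd
    service_id: ssd_data_nvme_db
    placement:
      host_pattern: '*'
    spec:
      data_devices:
        paths:
          - /dev/sdb        # placeholder: SAS SSD for data
      db_devices:
        paths:
          - /dev/nvme0n1    # placeholder: NVMe with power-loss protection for WAL+DB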
Best regards.
Hi,
I have some questions about ceph using cephadm.
I used to deploy Ceph using ceph-ansible; now I have to move to cephadm, and
I am on my learning journey.
- How can I tell my cluster that it's part of an HCI deployment? With
ceph-ansible it was easy using is_hci: yes.
- The Ceph documentation does not indicate which versions of grafana,
prometheus, etc. should be used with a given Ceph version.
- I am trying to deploy Quincy; I did a bootstrap to see which
containers were downloaded and their versions.
- I am asking because I need to use a local registry to deploy those
images (see the sketch at the end of this list).
- After the bootstrap, the web interface was accessible:
- How can I access the wizard page again? If I don't use it the first
time, I cannot find another way to get back to it.
- I had a problem with telemetry: I had not configured telemetry, and
when I clicked the button, the web GUI became inaccessible.
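For the local registry question above, a sketch of what I was considering
(the registry host, image names and tags are placeholders, not the
authoritative versions for Quincy):

    # point cephadm at mirrored monitoring images
    ceph config set mgr mgr/cephadm/container_image_prometheus registry.local:5000/prometheus/prometheus:v2.x
    ceph config set mgr mgr/cephadm/container_image_alertmanager registry.local:5000/prometheus/alertmanager:v0.x
    ceph config set mgr mgr/cephadm/container_image_grafana registry.local:5000/ceph/ceph-grafana:8.x
    ceph config set mgr mgr/cephadm/container_image_node_exporter registry.local:5000/prometheus/node-exporter:v1.x

    # and the main Ceph image at bootstrap time
    cephadm --image registry.local:5000/ceph/ceph:v17.2 bootstrap --mon-ip <ip>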
Regards.