Hi,
If you remember, I hit bug https://tracker.ceph.com/issues/58489, so I
was very relieved when 17.2.6 was released and started the update
immediately.
But now I'm stuck again with my broken MDS: the MDS daemons won't get
into up:active without the update, but the update waits for them to
reach up:active. It seems like a deadlock / chicken-and-egg problem to me.
Since I'm still relatively new to Ceph, could you help me?
What I see when watching the update status:
{
    "target_image": "quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",
    "in_progress": true,
    "which": "Upgrading all daemon types on all hosts",
    "services_complete": [
        "crash",
        "mgr",
        "mon",
        "osd"
    ],
    "progress": "18/40 daemons upgraded",
    "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host ceph01 at addr (192.168.23.61)",
    "is_paused": false
}
(The offline host was one that broke during the upgrade. I fixed it in
the meantime and the update went on.)
And in the log:
2023-04-10T19:23:48.750129+0000 mgr.ceph04.qaexpv [INF] Upgrade: Waiting for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
2023-04-10T19:23:58.758141+0000 mgr.ceph04.qaexpv [WRN] Upgrade: No mds is up; continuing upgrade procedure to poke things in the right direction
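From reading the docs, here is what I think I could try: pause the
upgrade, fail the stuck MDS so a standby takes over, then resume. A
sketch (mds01 is my filesystem and mds01.ceph04.hcmvae the stuck daemon,
taken from the log above; please correct me if this is wrong):

# Stop the orchestrator from waiting on the MDS
ceph orch upgrade pause
# Check filesystem and MDS states
ceph fs status
# Fail the stuck daemon so a standby can take over its rank
ceph mds fail mds01.ceph04.hcmvae
# Resume the upgrade once an MDS is up:active again
ceph orch upgrade resume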
Please give me a hint as to what I can do.
Cheers,
Thomas
--
http://www.widhalm.or.at
GnuPG : 6265BAE6 , A84CB603
Threema: H7AV7D33
Telegram, Signal: widhalmt(a)widhalm.or.at
Details of this release are summarized here:
https://tracker.ceph.com/issues/59426#note-3
Release Notes - TBD
Seeking approvals/reviews for:
smoke - Josh approved?
orch - Adam King approved?
(there are infrastructure issues in the runs, but we want to release this ASAP)
Thx
YuriW
Hello,
I set up two-way snapshot-based RBD mirroring between two Ceph clusters.
After enabling mirroring for an image that already had regular snapshots
(created independently of RBD mirror) on the source cluster, the image
and all of its snapshots were synced to the destination cluster.
Is there a way to avoid having all the snapshots synced? We only need
the latest version of the image on the destination cluster, and the
snapshots add around 200% disk space overhead on average.
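For context, this is roughly how I enabled mirroring for the image (a
sketch; the pool and image names are placeholders, not our real ones):

# Per-image mirroring mode on the pool
rbd mirror pool enable mypool image
# Snapshot-based mirroring for the image
rbd mirror image enable mypool/myimage snapshot
# Take a mirror snapshot once a day
rbd mirror snapshot schedule add --pool mypool --image myimage 1d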
Best regards,
Andreas
Hi folks,
Today we discussed:
- Just short of 1 exabyte of Ceph storage reported to Telemetry.
Telemetry's data is public and viewable at:
https://telemetry-public.ceph.com/d/ZFYuv1qWz/telemetry?orgId=1
If your cluster is not reporting to Telemetry, please consider it! :)
(See the sketch at the end of this list for how to opt in.)
- A request from the Ceph Foundation Board to begin tracking component
(e.g. CephFS) roadmaps in docs (or somewhere else appropriate).
Concurrently, leads may also begin sending out status updates on a
~quarterly basis. To be discussed further.
- Cephalocon schedule is available: https://ceph2023.sched.com/
- A regression was reported for the exporter in 17.2.6:
https://github.com/ceph/ceph/pull/50718#issuecomment-1503376925
A follow-up hotfix/announcement is planned.
- Next week's meeting is canceled due to Cephalocon/travel.
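For anyone who has not opted in to Telemetry yet, a sketch of how to do
so (review what would be reported before accepting the license):

# Inspect the report that would be sent
ceph telemetry show
# Opt in (requires accepting the sharing-1-0 data license)
ceph telemetry on --license sharing-1-0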
Meeting minutes available here as always:
https://pad.ceph.com/p/clt-weekly-minutes
--
Patrick Donnelly, Ph.D.
He / Him / His
Red Hat Partner Engineer
IBM, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
Hello cephers!
As I was asking in another thread ([RGW] Rebuilding a non master zone),
I'm trying to find the best way to rebuild a zone in a multisite config.
The goal is to get rid of the remaining large OMAP objects.
The simplest way, since I can rely on the primary zone alone, is to
(commands sketched after the list):
- remove the zone from the zonegroup
- delete the pools
- recreate them by restarting the radosgw instances
- add the zone back to the zonegroup
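In terms of commands, I suppose that would look something like this (a
sketch, assuming the zonegroup is 'default' and the zone is 'myzone';
mon_allow_pool_delete must be true for the pool removals):

# Remove the zone from the zonegroup and commit the new period
radosgw-admin zonegroup remove --rgw-zonegroup=default --rgw-zone=myzone
radosgw-admin period update --commit
# Delete the zone's pools (destructive!)
ceph osd pool rm myzone.rgw.buckets.index myzone.rgw.buckets.index --yes-i-really-really-mean-it
ceph osd pool rm myzone.rgw.buckets.data myzone.rgw.buckets.data --yes-i-really-really-mean-it
# ... restart the radosgw instances so the pools are recreated ...
# Add the zone back to the zonegroup and commit again
radosgw-admin zonegroup add --rgw-zonegroup=default --rgw-zone=myzone
radosgw-admin period update --commit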
Is there a simpler way?
One question I already asked: can I delete only the index and data
pools (ZONE.rgw.buckets.data and ZONE.rgw.buckets.index)? The multisite
config (zones and zonegroup) would still be there, and perhaps the zone
would just resync (or I could force a sync init)?
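If that is possible, I imagine the forced resync would be something
like this (a sketch; 'primary' stands in for the master zone's name):

# On the zone being rebuilt, restart full metadata and data sync
radosgw-admin metadata sync init
radosgw-admin data sync init --source-zone=primary
# ... then restart the radosgw instances of that zone ...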
Another thing, about ceph-ansible, which I still use: I configure my
zones with ceph-ansible's multisite support, but one thing bothers me:
it defines the zone endpoints as the individual radosgw instances, even
with the load balancer configured.
So, do you think I'd better not configure multisite with ceph-ansible
and do it manually instead?
Is there a way that would make it easier to migrate to cephadm later?
Hi,
Our cluster is running Pacific 16.2.10. We have a problem using the
dashboard to display information about RGWs configured in the cluster.
When clicking on "Object Gateway", we get an error 500. Looking in the
mgr logs, I found that the problem is that the RGW is accessed by its IP
address rather than its name. As the RGW has SSL enabled, the
certificate cannot be matched against the IP address.
I dug into the configuration but was not able to identify where an
IP address rather than a name was used (in particular, I checked the
zonegroup parameters, and names are used to define the endpoints). Did
I do something wrong in the configuration, or is this a known issue
when using SSL-enabled RGWs?
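In case it helps frame the question, I suppose one could disable
certificate verification for the dashboard's RGW requests as a
workaround (a sketch; it weakens security, so it would only be a
stopgap rather than a fix):

# Tell the dashboard not to verify the RGW TLS certificate
ceph dashboard set-rgw-api-ssl-verify False
# Fail over the active mgr so the setting is picked up
ceph mgr fail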
Best regards,
Michel
Hi,
Our Ceph cluster is in an error state with the message:
# ceph status
cluster:
id: 58140ed2-4ed4-11ed-b4db-5c6f69756a60
health: HEALTH_ERR
Module 'cephadm' has failed: invalid literal for int() with base 10: '352.broken'
This happened after trying to re-add an OSD which had failed. Adopting it back into the cluster failed because a directory in /var/lib/ceph/{cephid}/osd.352 was causing problems. To re-add the OSD, I renamed the directory to osd.352.broken (rather than deleting it), re-ran the command, and everything worked perfectly. Then, five minutes later, the Ceph orchestrator went into HEALTH_ERR.
I've removed that directory, but cephadm isn't cleaning up after itself. Does anyone know if there's a way to clear the cached entry for this directory that it tried, and failed, to inventory?
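What I was planning to try next, for reference (a sketch; I'm not sure
it is the right approach):

# Fail over to a standby mgr so the cephadm module restarts
# and hopefully drops its cached, now-invalid inventory
ceph mgr fail
# Ask the orchestrator to rescan daemons and devices
ceph orch ps --refresh
ceph orch device ls --refresh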
Thanks,
Duncan
--
Dr Duncan Tooke | Research Cluster Administrator
Centre for Computational Biology, Weatherall Institute of Molecular Medicine,
University of Oxford, OX3 9DS
www.imm.ox.ac.uk
The radosgw-admin bucket stats output shows 209266 objects in this bucket, but that count includes failed multipart uploads, which makes the size parameter wrong as well. When I use boto3 to count the objects, the bucket only has 209049.
The only solution I have found is to use a lifecycle rule to clean up these failed multiparts, but in production it is up to each client to decide whether to use a lifecycle rule or not.
So, is there any way to exclude the failed multiparts from the bucket statistics?
Does Ceph allow cleaning up failed multiparts automatically, cluster-wide?
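For reference, the per-bucket cleanup I mentioned would be a lifecycle
rule like this (a sketch using the aws CLI; the endpoint and bucket
name are placeholders), which still has to be set by, or on behalf of,
every client:

# Abort incomplete multipart uploads one day after initiation
aws --endpoint-url https://rgw.example.com s3api put-bucket-lifecycle-configuration \
  --bucket mybucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "abort-incomplete-mpu",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1}
    }]
  }'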
Thanks!
"usage": {
"rgw.main": {
"size": 593286801276,
"size_actual": 593716080640,
"size_utilized": 593286801276,
"size_kb": 579381642,
"size_kb_actual": 579800860,
"size_kb_utilized": 579381642,
"num_objects": 209266
}