Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:
[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4
After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast.
If I see a message like the above again, what is the clean way of getting the client back into a healthy state (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?
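For reference, the kind of inspection I would expect to start with looks
roughly like this (a sketch only, assuming the standard "ceph tell mds"
admin commands; the rank and the exact session fields may differ by version):

# which client id does the warning point at?
ceph health detail

# list the sessions on the affected MDS rank and look for the suspicious
# session, e.g. one holding on to an unusually large number of requests
ceph tell mds.0 session ls

# last resort: evict that session by its client id
# (this will blacklist/blocklist the client unless that is disabled)
ceph tell mds.0 client evict id=<client-id>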
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I can only tell the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
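What I have found so far is the rgw ops log, which does contain the bucket
name per request. This is my understanding of how to expose it (a sketch;
the option names are taken from the docs, the socket path is just an example,
and the rgw daemons probably need a restart to pick it up):

# enable the ops log and point it at a unix domain socket
ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_socket_path /var/run/ceph/rgw-ops.sock

# then drain the JSON records (bucket name included) from the socket,
# e.g. to feed them to the log aggregator
nc -U /var/run/ceph/rgw-ops.sock >> /var/log/ceph/rgw-ops.log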
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb(a)kervyn.de> wrote:
> Hi,
> I am looking to move our logs from
> /var/log/ceph/ceph-client...log to our log aggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log output into a file? Maybe I could
> work with this.
>
> Cheers and happy new year
> Boris
>
--
The "UTF-8 problems" self-help group will, as an exception, meet this time
in the big hall.
Hey everyone,
On 20/10/2022 10:12, Christian Rohmann wrote:
> 1) May I bring up again my remarks about the timing:
>
> On 19/10/2022 11:46, Christian Rohmann wrote:
>
>> I believe the upload of a new release to the repo prior to the
>> announcement happens quite regularly - it might just be due to the
>> technical process of releasing.
>> But I agree it would be nice to have a more "bit flip" approach to
>> new releases in the repo and not have the packages appear as updates
>> prior to the announcement and final release and update notes.
> By my observations sometimes there are packages available on the
> download servers via the "last stable" folders such as
> https://download.ceph.com/debian-quincy/ quite some time before the
> announcement of a release is out.
> I know it's hard to time this right with mirrors requiring some time
> to sync files, but it would be nice not to see the packages, or have
> people install them, before the release notes and potential pointers
> to changes are out.
Today's 16.2.11 release shows the exact issue I described above:
1) 16.2.11 packages are already available via e.g.
https://download.ceph.com/debian-pacific
2) release notes not yet merged:
(https://github.com/ceph/ceph/pull/49839), thus
https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ shows a 404 :-)
3) No announcement like
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3…
to the ML yet.
Regards
Christian
Hi everyone
I'm new to Ceph; a four-day French training session with Octopus on VMs
convinced me to build my first cluster.
At the moment I have 4 old identical nodes for testing, each with 3 HDDs and
2 network interfaces, running AlmaLinux 8 (el8). I tried to replay the
training session but it failed, breaking the web interface because
podman 4.2 is not compatible with Octopus.
So I tried to deploy Pacific with the cephadm tool on my first node (mostha1),
to also allow testing an upgrade later.
dnf -y install
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noar…
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password xxxxx \
--initial-dashboard-user admceph \
--allow-fqdn-hostname --cluster-network 10.1.0.0/16
This was successful.
But running "*c**eph orch device ls*" do not show any HDD even if I have
/dev/sda (used by the OS), /dev/sdb and /dev/sdc
The web interface shows a row capacity which is an aggregate of the
sizes of the 3 HDDs for the node.
I've also tried to reset /dev/sdb but cephadm does not see it:
[ceph: root@mostha1 /]# ceph orch device zap
mostha1.legi.grenoble-inp.fr /dev/sdb --force
Error EINVAL: Device path '/dev/sdb' not found on host
'mostha1.legi.grenoble-inp.fr'
On my first attempt with Octopus, I was able to list the available HDDs
with this command line. Before moving to Pacific, the OS on this node
was reinstalled from scratch.
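For reference, these are the checks I have collected so far and plan to try
next (a sketch based on my reading of the docs, please correct me if they
are wrong):

# what does ceph-volume itself see on this host?
cephadm ceph-volume inventory

# force the orchestrator to refresh its cached device list
ceph orch device ls --refresh

# wipe leftover partition tables / filesystem signatures so the disk can
# be reported as available (this destroys all data on /dev/sdb!)
wipefs --all /dev/sdb
sgdisk --zap-all /dev/sdb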
Any advice for a Ceph beginner?
Thanks
Patrick
Hi,
while writing a response to [1] I tried to convert an existing
directory within a single CephFS into a subvolume. According to [2]
that should be possible; I'm just wondering how to confirm that it
actually worked. Setting the xattr works fine, but the directory
just doesn't show up in the subvolume ls output. This is what I tried
(in Reef and Pacific):
# one "regular" subvolume already exists
$ ceph fs subvolume ls cephfs
[
{
"name": "subvol1"
}
]
# mounted / and created new subdir
$ mkdir /mnt/volumes/subvol2
$ setfattr -n ceph.dir.subvolume -v 1 /mnt/volumes/subvol2
# still only one subvolume
$ ceph fs subvolume ls cephfs
[
{
"name": "subvol1"
}
]
I also tried it directly underneath /mnt:
$ mkdir /mnt/subvol2
$ setfattr -n ceph.dir.subvolume -v 1 /mnt/subvol2
But still no subvol2 shows up. What am I missing here?
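The only direct check I could come up with is reading the vxattr back
(a sketch; I'm assuming the attribute can be read back on this client):

# confirm the vxattr is actually set on the directory
$ getfattr -n ceph.dir.subvolume /mnt/volumes/subvol2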
Thanks
Eugen
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/G4ZWGGUPPFQ…
[2] https://www.spinics.net/lists/ceph-users/msg72341.html
Hi,
On Debian 12, ceph-dashboard is throwing a warning:
"Module 'dashboard' has failed dependency: PyO3 modules may only be
initialized once per interpreter process"
This seems to be related to the PyO3 0.17 change:
https://github.com/PyO3/pyo3/blob/7bdc504252a2f972ba3490c44249b202a4ce6180/…
"
Each #[pymodule] can now only be initialized once per process
To make PyO3 modules sound in the presence of Python sub-interpreters,
for now it has been necessary to explicitly disable the ability to
initialize a #[pymodule] more than once in the same process. Attempting
to do this will now raise an ImportError.
"
Hi Matthew,
At least for Nautilus (14.2.22) I have discovered through trial and
error that you need to specify a beginning or end date. Something like
this:
radosgw-admin sync error trim --end-date="2023-08-20 23:00:00"
--rgw-zone={your_zone_name}
I specify the zone as there's an error list for each zone.
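If you have more than one zone, a small loop trims them all with the same
cutoff (a sketch; assumes jq is installed):

for zone in $(radosgw-admin zone list | jq -r '.zones[]'); do
    radosgw-admin sync error trim --end-date="2023-08-20 23:00:00" \
        --rgw-zone="$zone"
done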
Hopefully that helps.
Rich
------------------------------
Date: Sat, 19 Aug 2023 12:48:55 -0400
From: Matthew Darwin <bugs(a)mdarwin.ca>
Subject: [ceph-users] radosgw-admin sync error trim seems to do nothing
To: Ceph Users <ceph-users(a)ceph.io>
Hello all,
"radosgw-admin sync error list" returns errors from 2022. I want to
clear those out.
I tried "radosgw-admin sync error trim" but it seems to do nothing.
The man page seems to offer no suggestions
https://protect-au.mimecast.com/s/26o0CzvkGRhLoOXfXjZR3?domain=docs.ceph.com
Any ideas what I need to do to remove old errors? (or at least I want
to see more recent errors)
ceph version 17.2.6 (quincy)
Thanks.
Fellow cephalopods,
I'm trying to get quick, seamless NFS failover happening on my four-node
Ceph cluster.
I followed the instructions here:
https://docs.ceph.com/en/latest/cephadm/services/nfs/#high-availability-nfs
but testing shows that failover doesn't happen. When I placed node 2
("san2") in maintenance mode, the NFS service shut down:
Aug 24 14:19:03 san2 ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq[1962479]: 24/08/2023 04:19:03 : epoch 64b8af5a : san2 : ganesha.nfsd-8[Admin] do_shutdown :MAIN :EVENT :Removing all exports.
Aug 24 14:19:13 san2 bash[3235994]: time="2023-08-24T14:19:13+10:00" level=warning msg="StopSignal SIGTERM failed to stop container ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq in 10 seconds, resorting to SIGKILL"
Aug 24 14:19:13 san2 bash[3235994]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66-nfs-xcpnfs-1-0-san2-datsvq
Aug 24 14:19:13 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Main process exited, code=exited, status=137/n/a
Aug 24 14:19:14 san2 systemd[1]: ceph-e2f1b934-ed43-11ec-80fa-04421a1a1d66@nfs.xcpnfs.1.0.san2.datsvq.service: Failed with result 'exit-code'.
Aug 24 14:19:14 san2 systemd[1]: Stopped Ceph nfs.xcpnfs.1.0.san2.datsvq for e2f1b934-ed43-11ec-80fa-04421a1a1d66.
And that's it. The ingress IP didn't move.
More oddly, the cluster seems to have placed the ingress IP on node 1
(san1) but seems to be using the NFS service on node 2.
Do I need to more tightly connect the NFS service to the keepalive and
haproxy services, or do I need to expand the ingress services to refer
to multiple NFS services?
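For reference, my reading of that doc page is that the two specs should look
roughly like the following (the counts, ports and the virtual IP below are
placeholders rather than my real values, so treat it as a sketch):

cat <<EOF | ceph orch apply -i -
# NFS cluster with a second daemon so another node can take over
service_type: nfs
service_id: xcpnfs
placement:
  count: 2
spec:
  port: 12049
---
# ingress (haproxy + keepalived) in front of the NFS daemons
service_type: ingress
service_id: nfs.xcpnfs
placement:
  count: 2
spec:
  backend_service: nfs.xcpnfs
  frontend_port: 2049
  monitor_port: 9049
  virtual_ip: 192.0.2.10/24
EOF

My assumption is that if only a single NFS daemon exists behind the ingress,
keepalived can move the virtual IP but haproxy has nothing healthy left to
forward to, which might be what I'm seeing.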
Thank you.
--
Regards,
Thorne Lawler - Senior System Administrator
DDNS | ABN 76 088 607 265
First registrar certified ISO 27001-2013 Data Security Standard ITGOV40172
P +61 499 449 170
Hi all,
we seem to have hit a bug in the ceph fs kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg and some jobs on that server seem to get stuck in fs access; log extract below. I found these 2 tracker items https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How to resolve: ignore, umount+mount, or reboot?
Here is an extract from the dmesg log; the error has survived a couple of MDS restarts already:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
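For completeness, this is what I can still inspect on the client without
unmounting (a sketch; it needs debugfs mounted and the exact files vary a
bit by kernel version):

# requests the kernel client considers in flight / stuck towards the MDSs
cat /sys/kernel/debug/ceph/*/mdsc

# pending OSD requests, in case data I/O is stuck as well
cat /sys/kernel/debug/ceph/*/osdc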
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14