Hi all,
we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients failing to advance oldest client/flush tid". I looked at the client and there was nothing going on, so I rebooted it. After the client was back, the message was still there. To clean this up I failed the MDS. Unfortunately, the MDS that took over remained stuck in rejoin without doing anything. All that happened in the log was:
[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700 1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700 1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700 1 mds.ceph-10 Updating MDS map to version 896654 from mon.4
After some time I decided to give another fail a try and, this time, the replacement daemon went to active state really fast.
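For reference, this is roughly how I checked the client sessions and failed the MDS (a sketch from memory; the MDS name and rank are examples from our setup):

ceph health detail                 # should show which client is failing to advance its oldest tid
ceph tell mds.ceph-10 session ls   # inspect the client sessions held by that MDS
ceph mds fail 2                    # fail rank 2 so that a standby takes over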
If I have a message like the above, what is the proper way of getting the client back into a clean state (version: 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Bringing up that topic again:
is it possible to log the bucket name in the rgw client logs?
Currently I am only able to see the bucket name when someone accesses the bucket
via https://TLD/bucket/object instead of https://bucket.TLD/object.
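What I have in mind is roughly the following (a sketch; option names taken from the docs and not yet verified on our version, and the socket path is just an example):

# the target may need to be the concrete daemon name, e.g. client.rgw.<id>
ceph config set client.rgw rgw_enable_ops_log true
ceph config set client.rgw rgw_ops_log_socket_path /var/run/ceph/rgw-ops-log.sock
# something has to listen on the socket and write the records (which include
# the bucket name) to a file, e.g.:
nc -klU /var/run/ceph/rgw-ops-log.sock >> /var/log/ceph/rgw-ops.log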
On Tue, 3 Jan 2023 at 10:25, Boris Behrens <bb(a)kervyn.de> wrote:
> Hi,
> I am looking to move our logs from
> /var/log/ceph/ceph-client...log to our log aggregator.
>
> Is there a way to have the bucket name in the log file?
>
> Or can I write the rgw_enable_ops_log into a file? Maybe I could work with
> this.
>
> Cheers and happy new year
> Boris
>
--
The "UTF-8 problems" self-help group will exceptionally meet in the large hall this time.
Hey everyone,
On 20/10/2022 10:12, Christian Rohmann wrote:
> 1) May I bring up again my remarks about the timing:
>
> On 19/10/2022 11:46, Christian Rohmann wrote:
>
>> I believe the upload of a new release to the repo prior to the
>> announcement happens quite regularly - it might just be due to the
>> technical process of releasing.
>> But I agree it would be nice to have a more "bit flip" approach to
>> new releases in the repo and not have the packages appear as updates
>> prior to the announcement and final release and update notes.
> By my observation, sometimes there are packages available on the
> download servers via the "last stable" folders such as
> https://download.ceph.com/debian-quincy/ quite some time before the
> announcement of a release is out.
> I know it's hard to time this right with mirrors requiring some time
> to sync files, but it would be nice not to see the packages, or have
> people install them, before the release notes and potential pointers
> to changes are out.
Today's 16.2.11 release shows the exact issue I described above ...
1) 16.2.11 packages are already available via e.g.
https://download.ceph.com/debian-pacific
2) The release notes are not yet merged
(https://github.com/ceph/ceph/pull/49839), so
https://ceph.io/en/news/blog/2022/v16-2-11-pacific-released/ shows a 404 :-)
3) No announcement like
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/QOCU563UD3…
has been sent to the ML yet.
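A possible stop-gap on the consumer side is to hold the Ceph packages until the notes are published; a rough sketch for Debian/Ubuntu hosts (package list not exhaustive):

apt-mark hold ceph ceph-common ceph-osd ceph-mon ceph-mds radosgw
# ... and release the hold once the announcement and release notes are out
apt-mark unhold ceph ceph-common ceph-osd ceph-mon ceph-mds radosgw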
Regards
Christian
Hi everyone
I'm new to Ceph; a four-day French training session with Octopus on VMs
convinced me to build my first cluster.
At this time I have 4 identical old nodes for testing, each with 3 HDDs
and 2 network interfaces, running Alma Linux 8 (el8). I tried to replay
the training session but it failed, breaking the web interface because
podman 4.2 is not compatible with Octopus.
So I tried to deploy Pacific with the cephadm tool on my first node
(mostha1), which will also let me test an upgrade later.
dnf -y install
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noar…
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password xxxxx \
--initial-dashboard-user admceph \
--allow-fqdn-hostname --cluster-network 10.1.0.0/16
This was successful.
But running "*c**eph orch device ls*" do not show any HDD even if I have
/dev/sda (used by the OS), /dev/sdb and /dev/sdc
The web interface shows a row capacity which is an aggregate of the
sizes of the 3 HDDs for the node.
I've also tried to reset /dev/sdb but cephadm does not see it:
[ceph: root@mostha1 /]# ceph orch device zap
mostha1.legi.grenoble-inp.fr /dev/sdb --force
Error EINVAL: Device path '/dev/sdb' not found on host
'mostha1.legi.grenoble-inp.fr'
On my first attempt with Octopus, I was able to list the available HDDs
with this command line. Before moving to Pacific, the OS on this node
was reinstalled from scratch.
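In case it matters, this is what I plan to check next (a sketch; the device name is from my node and the wipe commands destroy all data on that disk):

# does the host-level inventory see the disks at all?
cephadm ceph-volume inventory
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
# clear leftover partition/LVM signatures from the previous install so the
# orchestrator can consider the disk again (destroys everything on /dev/sdb!)
wipefs -a /dev/sdb
sgdisk --zap-all /dev/sdb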
Any advice for a Ceph beginner?
Thanks
Patrick
Hi,
On Debian 12, ceph-dashboard is throwing a warning
"Module 'dashboard' has failed dependency: PyO3 modules may only be
initialized once per interpreter process"
This seems to be related to the PyO3 0.17 change:
https://github.com/PyO3/pyo3/blob/7bdc504252a2f972ba3490c44249b202a4ce6180/…
"
Each #[pymodule] can now only be initialized once per process
To make PyO3 modules sound in the presence of Python sub-interpreters,
for now it has been necessary to explicitly disable the ability to
initialize a #[pymodule] more than once in the same process. Attempting
to do this will now raise an ImportError.
"
Hi all,
we seem to have hit a bug in the ceph fs kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg and some jobs on that server seem to get stuck in fs access; log extract below. I found these 2 tracker items https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How to resolve it: ignore, umount+mount, or reboot?
Here is an extract from the dmesg log; the error has survived a couple of MDS restarts already:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
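In case it helps with the diagnosis, these are the checks I can run on request (a sketch; the debugfs path differs per mount and debugfs must be mounted):

# on the client: pending MDS requests and session state of the kernel mount
cat /sys/kernel/debug/ceph/*/mdsc
cat /sys/kernel/debug/ceph/*/mds_sessions
# on a cluster node: the address the MDS map currently advertises for that rank
ceph fs dump | grep 192.168.32.87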
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
I am running two (now) Quincy clusters doing RGW multi-site replication
with only one actually being written to by clients.
The other site is intended simply as a remote copy.
On the primary cluster I am observing an ever-growing (objects and
bytes) "sitea.rgw.log" pool; not so on the remote "siteb.rgw.log", which
is only 300 MiB and around 15k objects with no growth.
Metrics show that the growth of the pool on the primary has been linear
for at least 6 months, so no sudden spikes or anything. Also the sync
status appears to be totally happy.
There are also no warnings regarding large OMAP objects or anything similar.
I was under the impression that RGW will trim its three logs (md, bi,
data) automatically and only keep data that has not yet been replicated
by the other zonegroup members?
The config option rgw_sync_log_trim_interval ("ceph config get mgr
rgw_sync_log_trim_interval") is set to 1200, so 20 minutes.
So I am wondering whether there might be some inconsistency, and how I
can best analyze what the cause of the accumulating log data is.
There are older questions on the ML, such as [1], but there was not
really a solution or root cause identified.
I know there is manual trimming, but I would rather analyze the current
situation and figure out what the cause of the lack of auto-trimming is.
* Do I need to go through all buckets and count log entries and look at
their timestamps? Which queries would make sense here (see the commands
sketched below)?
* Is there usually any logging of the log trimming activity that I
should expect? Or anything that might indicate why trimming does not happen?
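To be concrete, the kind of queries I had in mind (a sketch; the pool name is from our primary zone, placeholders in angle brackets):

radosgw-admin sync status
radosgw-admin mdlog status
radosgw-admin datalog status
radosgw-admin bilog list --bucket=<bucket> --max-entries=10
# which objects in the log pool carry how many omap keys
rados -p sitea.rgw.log ls | head
rados -p sitea.rgw.log listomapkeys <object> | wc -l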
Regards
Christian
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV…
Hello,
This message does not concern Ceph itself but a hardware vulnerability which can lead to permanent loss of data on a Ceph cluster equipped with the same hardware in separate fault domains.
The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a vulnerability which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-commu…
The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data on that server.
To date, DELL has not provided any firmware fixing this vulnerability, the latest firmware version being "A3B3" released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
If your have servers running these drives, check their uptime. If they are close to the 70,000 hour limit, replace them immediately.
The smartctl tool does not report the power-on hours for these SSDs, but if you have HDDs in the same server, you can query their SMART status and get their power-on hours, which should be about the same as those of the SSDs.
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device ID on the MegaRAID controller).
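For example, on a 13G server with a PERC/MegaRAID controller, the power-on hours of a companion HDD can be read roughly like this (XX and /dev/sdc depend on your layout):

# SATA HDD: look at the Power_On_Hours attribute
smartctl -a -d megaraid,XX /dev/sdc | grep -i power_on_hours
# SAS HDD: the value is reported as accumulated power on time
smartctl -a -d megaraid,XX /dev/sdc | grep -i 'power on time'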
We have informed DELL about this but have no information yet on the arrival of a fix.
We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive a full shutdown and restart of the machine (power off then power on via iDRAC), but they may also die during a single reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021 but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
Regards,
Frederic
Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
Hello everyone!
Recently we had a very nasty incident with one of our Ceph storage clusters.
During a basic backfill recovery operation due to a faulty disk, the CephFS metadata pool started growing exponentially until it used all available space and the whole cluster DIED. Usage graph screenshot in attachment.
Everything happened very fast: even after the OSDs were marked full, they tripped the failsafe and ate all the remaining free blocks while still trying to allocate space, and then died completely, without any possibility of even starting them again.
The only solution was to copy the whole BlueStore to a bigger SSD and resize the underlying BlueStore device. Only about 1/3 of the OSDs were able to start after the move, but that was enough since we have very redundant settings for the CephFS metadata. Basically, the metadata were moved from 12x 240 GB SSDs to 12x 500 GB SSDs to have enough space to start again.
Brief info about the cluster:
- CephFS data are stored on ~500x 8TB SAS HDDs using 10+2 erasure coding across 18 hosts.
- CephFS metadata are stored on ~12x 500GB SAS/SATA SSD using 5x replication on 6 hosts.
- Version was one of the latest 16.x.x Pacific at the time of the incident.
- 3x MON+MGR, 2 active MDS and 2 hot-standby MDS are on separate virtual servers.
- typical file size to be stored is from hundreds of MBs to tens of GBs.
- this cluster is not the biggest, does not have the most HDDs, and has no special config; I simply see nothing special about this cluster.
During investigation I found out the following:
- Metadata usage grows any time recovery is running, on any of the clusters we maintain (~15 clusters of different usages and sizes), but never this much; this was an extreme situation.
- After recovery finished, the size went back to normal again.
- I think there is a slight correlation between recovery width (the number of objects recovery has to touch in order to recover everything) and recovery length (time), but I have no proof.
- Nothing much else.
I would like to find out why this happened, because I think it can happen again sometime and someone might lose data if they have less luck.
Any ideas are appreciated, or even info on whether anyone has seen similar behavior, or if I am the only one struggling with an issue like this :)
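For anyone who wants to compare numbers, a minimal way to track the metadata pool while recovery is running (a sketch; I assume the pool is named cephfs_metadata):

# snapshot of the current pool usage
ceph df detail | grep cephfs_metadata
# sample it once a minute into a file so the growth can be correlated with recovery later
while sleep 60; do date; ceph df detail | grep cephfs_metadata; done >> metadata-usage.log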
Kind regards,
Jakub Petrzilka