Hi all,
We seem to have hit a bug in the CephFS kernel client, and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg, and some jobs on that server seem to be stuck in fs access; log extract below. I found these two tracker items, https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, neither of which seems to have a fix.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How should we resolve this: ignore it, umount+mount (sketched below), or reboot?
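For the umount+mount option, what we would attempt is roughly the following (paths and names are placeholders; I'm also not sure a forced umount will succeed while jobs are stuck):
# umount -f /mnt/cephfs   (or umount -l if processes keep the mount busy)
# mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=myclient,secretfile=/etc/ceph/myclient.secret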
Here is an extract from the dmesg log; the error has already survived a couple of MDS restarts:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
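As an alternative to a reboot, we were also considering evicting the session from the MDS side and letting the client reconnect, along these lines (not sure this is safe here, or that it clears the stale connection state):
# ceph tell mds.1 client ls   (find the session id of this host)
# ceph tell mds.1 client evict id=<session-id>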
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
I am running two (now) Quincy clusters doing RGW multi-site replication
with only one actually being written to by clients.
The other site is intended simply as a remote copy.
On the primary cluster I am observing an ever-growing "sitea.rgw.log" pool
(in both objects and bytes). Not so on the remote: "siteb.rgw.log" is only
300 MiB and around 15k objects, with no growth.
Metrics show that the growth of the pool on the primary has been linear for at
least 6 months, so there are no sudden spikes or anything. Also, the sync
status appears to be totally happy.
There are also no warnings in regards to large OMAPs or anything similar.
I was under the impression that RGW trims its three logs (md, bi,
data) automatically and only keeps entries that have not yet been replicated
by the other zonegroup members?
The config option rgw_sync_log_trim_interval ("ceph config get mgr
rgw_sync_log_trim_interval") is set to 1200, i.e. 20 minutes.
So I am wondering whether there might be some inconsistency, and how I can
best analyze what is causing the accumulation of log data.
There are older questions on the ML, such as [1], but no real solution or
root cause was identified there.
I know there is manual trimming, but I'd rather analyze the current
situation and figure out why auto-trimming is not happening.
* Do I need to go through all buckets, count log entries, and look at
their timestamps? Which queries make sense here (see the sketch below)?
* Is there any logging of the trimming activity that I should usually
expect, or that might indicate why trimming does not happen?
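For example, something along these lines (guessing here, which is exactly my question):
# rados -p sitea.rgw.log ls | cut -d. -f1 | sort | uniq -c | sort -rn | head
(to see which kind of log objects dominate the pool)
# radosgw-admin datalog status
(per-shard data log markers and their state)
# radosgw-admin sync error list
(anything stuck in the sync error log)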
Regards
Christian
[1]
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/WZCFOAMLWV…
Hello,
This message does not concern Ceph itself, but a hardware defect that can lead to permanent loss of data on a Ceph cluster equipped with the same hardware across separate fault domains.
The DELL / Toshiba PX02SMF020, PX02SMF040, PX02SMF080 and PX02SMB160 SSD drives of the 13G generation of DELL servers are subject to a firmware bug which renders them unusable after 70,000 hours of operation, i.e. approximately 7 years and 11 months of activity.
This topic has been discussed here: https://www.dell.com/community/PowerVault/TOSHIBA-PX02SMF080-has-lost-commu…
The risk is all the greater since these disks may die at the same time in the same server, leading to the loss of all data on that server.
To date, DELL has not provided any firmware fixing this bug; the latest firmware version, "A3B3", was released on Sept. 12, 2016: https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=hhd9k
If you have servers running these drives, check their uptime. If they are close to the 70,000-hour limit, replace them immediately.
The smartctl tool does not report the power-on hours for these SSDs, but if you have HDDs in the server, you can query their SMART status and get their power-on hours, which should be about the same as the SSDs'.
The smartctl command is: smartctl -a -d megaraid,XX /dev/sdc (where XX is the drive's device ID on the MegaRAID controller).
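To sweep all slots behind the controller in one go, something like this should work (adjust the block device and the slot range to your setup):
for i in $(seq 0 15); do smartctl -a -d megaraid,$i /dev/sda | grep -iE 'power.on'; done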
We have informed DELL about this but have no information yet on the arrival of a fix.
We have lost 6 disks, in 3 different servers, in the last few weeks. Our observation shows that the drives don't survive a full shutdown and restart of the machine (power off then power on in iDRAC), but they may also die during a simple reboot (init 6) or even while the machine is running.
Fujitsu released a corrective firmware in June 2021, but this firmware is most certainly not applicable to DELL drives: https://www.fujitsu.com/us/imagesgig5/PY-CIB070-00.pdf
Regards,
Frederic
Sous-direction Infrastructure and Services
Direction du Numérique
Université de Lorraine
Hello dear Ceph users and developers,
we're dealing with a strange problem. We have a 12-node Alma Linux 9 cluster,
initially installed with Ceph 15.2.16 and then upgraded to 17.2.5. It's running
a bunch of KVM virtual machines accessing volumes using RBD.
Everything is working well, but there is a strange and, for us, quite serious
issue: the speed of write operations (both sequential and random) keeps degrading
drastically to almost unusable numbers (in ~1 week it drops from ~70k 4k writes/s
from one VM to ~7k writes/s).
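For reference, we measure from inside one VM with a plain fio run along these
lines (a sketch; the exact parameters may differ from our runs):
fio --name=bench --filename=/dev/vdb --direct=1 --ioengine=libaio --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting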
When I restart all OSD daemons, the numbers immediately return to normal.
Volumes are stored on a replicated pool with 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMe drives.
I updated the cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896,
as restarting OSDs is quite painful when half of them crash.
I don't see anything suspicious: node load is quite low, there are no errors in
the logs, and network latency and throughput are OK too.
Is anyone seeing a similar issue? I'd like to ask for hints on what I should check further.
We're running lots of 14.2.x and 15.2.x clusters, none of which shows a similar
issue, so I suspect this is something related to Quincy.
thanks a lot in advance
with best regards
nikola ciprich
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis(a)linuxbox.cz
-------------------------------------
Hey ceph-users,
I set up multisite sync between two freshly installed Octopus clusters.
In the first cluster I created a bucket with some data, just to test the
replication of actual data later.
I then followed the instructions on
https://docs.ceph.com/en/octopus/radosgw/multisite/#migrating-a-single-site…
to add a second zone.
Things went well: both zones are now happily reaching each other and the
API endpoints are talking. The metadata is also in sync already; both sides
are happy, and I can see that bucket listings and users are "in sync":
> # radosgw-admin sync status
> realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
> zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
> zone 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
> metadata sync no sync (zone is master)
> data sync source: c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
> init
> full sync: 128/128 shards
> full sync: 0 buckets to sync
> incremental sync: 0/128 shards
> data is behind on 128 shards
> behind shards: [0...127]
>
and on the other side ...
> # radosgw-admin sync status
> realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
> zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
> zone c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
> metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
> data sync source: 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
> init
> full sync: 128/128 shards
> full sync: 0 buckets to sync
> incremental sync: 0/128 shards
> data is behind on 128 shards
> behind shards: [0...127]
>
Newly created buckets (read: their metadata) are also synced.
What is apparently not working is the sync of actual data.
Upon startup, the radosgw on the second site shows:
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: start
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: realm
> epoch=2 period id=f4553d7c-5cc5-4759-9253-9a22b051e736
> 2021-06-25T16:15:11.525+0000 7fe71dff3700 0
> RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote
> data log shards
>
Also, when issuing
# radosgw-admin data sync init --source-zone obst-rgn
it throws
> 2021-06-25T16:20:29.167+0000 7f87c2aec080 0
> RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data
> log shards
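For completeness, the checks I was planning to run next (not sure they are the right ones):
# radosgw-admin period get
(to confirm both zones agree on the current period)
# radosgw-admin datalog status
(on the source zone, to see the per-shard data log state)
# radosgw-admin sync error list
(for anything recorded by the sync error log)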
Does anybody have any hints on where to look for what could be broken here?
Thanks a bunch,
Regards
Christian
On Wed, 28 Jun 2023 at 22:44, Ilya Dryomov <idryomov(a)redhat.com> wrote:
>> ** TL;DR
>>
>> In testing, the write latency performance of a PWL-cache backed RBD
>> disk was 2 orders of magnitude worse than the disk holding the PWL
>> cache.
>>
>> ** Summary
>>
>> I was hoping that PWL cache might be a good solution to the problem of
>> write latency requirements of etcd when running a kubernetes control
>> plane on ceph. Etcd is extremely write latency sensitive and becomes
>> unstable if write latency is too high. The etcd workload can be
>> characterised by very small (~4k) writes with a queue depth of 1.
>> Throughput, even on a busy system, is normally very low. As etcd is
>> distributed and can safely handle the loss of un-flushed data from a
>> single node, a local ssd PWL cache for etcd looked like an ideal
>> solution.
>
>
> Right, this is exactly the use case that the PWL cache is supposed to address.
Good to know!
>> My expectation was that adding a PWL cache on a local SSD to an
>> RBD-backed VM would improve write latency to something approaching the
>> write latency performance of the local SSD. However, in my testing
>> adding a PWL cache to an rbd-backed VM increased write latency by
>> approximately 4x over not using a PWL cache. This was over 100x more
>> than the write latency performance of the underlying SSD.
>>
>> My expectation was based on the documentation here:
>> https://docs.ceph.com/en/quincy/rbd/rbd-persistent-write-log-cache/
>>
>> “The cache provides two different persistence modes. In
>> persistent-on-write mode, the writes are completed only when they are
>> persisted to the cache device and will be readable after a crash. In
>> persistent-on-flush mode, the writes are completed as soon as it no
>> longer needs the caller’s data buffer to complete the writes, but does
>> not guarantee that writes will be readable after a crash. The data is
>> persisted to the cache device when a flush request is received.”
>>
>> ** Method
>>
>> 2 systems, 1 running single-node Ceph Quincy (17.2.6), the other
>> running libvirt and mounting a VM’s disk with librbd (also 17.2.6)
>> from the first node.
>>
>> All performance testing is from the libvirt system. I tested write
>> latency performance:
>>
>> * Inside the VM without a PWL cache
>> * Of the PWL device directly from the host (direct to filesystem, no VM)
>> * Inside the VM with a PWL cache
>>
>> I am testing with fio. Specifically I am running a containerised test,
>> executed with:
>> podman run --volume .:/var/lib/etcd:Z quay.io/openshift-scale/etcd-perf
>>
>> This container runs:
>> fio --rw=write --ioengine=sync --fdatasync=1
>> --directory=/var/lib/etcd --size=100m --bs=8000 --name=etcd_perf
>> --output-format=json --runtime=60 --time_based=1
>>
>> And extracts sync.lat_ns.percentile["99.000000"]
>
>
> Matthew, do you have the rest of the fio output captured? It would be interesting to see whether it's just the 99th percentile that is bad or whether the PWL cache is worse in general.
Sure.
With PWL cache: https://paste.openstack.org/show/820504/
Without PWL cache: https://paste.openstack.org/show/b35e71zAwtYR2hjmSRtR/
With PWL cache, 'rbd_cache'=false:
https://paste.openstack.org/show/byp8ZITPzb3r9bb06cPf/
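For that last run I disabled the volatile cache at the image level with something like:
# rbd config image set libvirt-pool/pwl-test rbd_cache false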
>> ** Results
>>
>> All results were stable across multiple runs within a small margin of error.
>>
>> * rbd no cache: 1417216 ns
>> * pwl cache device: 44288 ns
>> * rbd with pwl cache: 5210112 ns
>>
>> Note that by adding a PWL cache we increase write latency by
>> approximately 4x, which is more than 100x that of the underlying device.
>>
>> ** Hardware
>>
>> 2 x Dell R640s, each with Xeon Silver 4216 CPU @ 2.10GHz and 192G RAM
>> Storage under test: 2 x SAMSUNG MZ7KH480HAHQ0D3 SSDs attached to PERC
>> H730P Mini (Embedded)
>>
>> OS installed on rotational disks
>>
>> N.B. Linux incorrectly detects these disks as rotational, which I
>> assume relates to weird behaviour by the PERC controller. I remembered
>> to manually correct this on the ‘client’ machine for the PWL cache,
>> but at OSD configuration time ceph would have detected them as
>> rotational. They are not rotational.
>>
>> ** Ceph Configuration
>>
>> CentOS Stream 9
>>
>> # ceph version
>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy
>> (stable)
>>
>> Single node installation with cephadm. 2 OSDs, one on each SSD.
>> 1 pool with size 2
>>
>> ** Client Configuration
>>
>> Fedora 38
>> Librbd1-17.2.6-3.fc38.x86_64
>>
>> PWL cache is XFS filesystem with 4k block size, matching the
>> underlying device. The filesystem uses the whole block device. There
>> is no other load on the system.
>>
>> ** RBD Configuration
>>
>> # rbd config image list libvirt-pool/pwl-test | grep cache
>> rbd_cache                            true                          config
>
>
> I wonder if rbd_cache should have been set to false here to disable the default volatile cache. Other than that, I don't see anything obviously wrong with the configuration at first sight.
I added some full output for this above.
>
> --
> Ilya
>
>> rbd_cache_block_writes_upfront      false                         config
>> rbd_cache_max_dirty                  25165824                      config
>> rbd_cache_max_dirty_age              1.000000                      config
>> rbd_cache_max_dirty_object           0                             config
>> rbd_cache_policy                     writeback                     pool
>> rbd_cache_size                       33554432                      config
>> rbd_cache_target_dirty               16777216                      config
>> rbd_cache_writethrough_until_flush   true                          pool
>> rbd_parent_cache_enabled             false                         config
>> rbd_persistent_cache_mode            ssd                           pool
>> rbd_persistent_cache_path            /var/lib/libvirt/images/pwl   pool
>> rbd_persistent_cache_size            1073741824                    config
>> rbd_plugins                          pwl_cache                     pool
>>
>> # rbd status libvirt-pool/pwl-test
>> Watchers:
>> watcher=10.1.240.27:0/1406459716 client.14475
>> cookie=140282423200720
>> Persistent cache state:
>> host: dell-r640-050
>> path:
>> /var/lib/libvirt/images/pwl/rbd-pwl.libvirt-pool.37e947fd216b.pool
>> size: 1 GiB
>> mode: ssd
>> stats_timestamp: Mon Jun 26 11:29:21 2023
>> present: true empty: false clean: true
>> allocated: 180 MiB
>> cached: 135 MiB
>> dirty: 0 B
>> free: 844 MiB
>> hits_full: 1 / 0%
>> hits_partial: 3 / 0%
>> misses: 21952
>> hit_bytes: 6 KiB / 0%
>> miss_bytes: 349 MiB
--
Matthew Booth
hi Ernesto and lists,
> [1] https://github.com/ceph/ceph/pull/47501
are we planning to backport this to quincy so we can support centos 9
there? enabling that upgrade path on centos 9 was one of the
conditions for dropping centos 8 support in reef, which i'm still keen
to do
if not, can we find another resolution to
https://tracker.ceph.com/issues/58832? as i understand it, all of
those python packages exist in centos 8. do we know why they were
dropped for centos 9? have we looked into making those available in
epel? (cc Ken and Kaleb)
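e.g. something like this could confirm what's actually available (package names here are just guesses for the deps in question, assuming an epel9 repo is configured):
$ dnf repoquery --repo=epel9 python3-cherrypy python3-routes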
On Fri, Sep 2, 2022 at 12:01 PM Ernesto Puerta <epuertat(a)redhat.com> wrote:
>
> Hi Kevin,
>
>>
>> Isn't this one of the reasons containers were pushed, so that the packaging isn't as big a deal?
>
>
> Yes, but the Ceph community has a strong commitment to provide distro packages for those users who are not interested in moving to containers.
>
>> Is it the continued push to support lots of distros without using containers that is the problem?
>
>
> If not a problem, it definitely makes it more challenging. Compiled components often sort this out by statically linking deps whose packages are not widely available in distros. The approach we're proposing here would be the closest equivalent to static linking for interpreted code (bundling).
>
> Thanks for sharing your questions!
>
> Kind regards,
> Ernesto
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io