Hi all,
We seem to have hit a bug in the CephFS kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg and some jobs on that server seem to get stuck on file system access; a log extract is below. I found these two tracker items, https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless, or does it indicate invalid/corrupted client cache entries?
- How should we resolve it: ignore, umount+mount, or reboot? (A possible umount+mount sequence is sketched below the log.)
Here is an extract from the dmesg log; the error has already survived a couple of MDS restarts:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
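In case umount+mount is the recommended way out, I assume the sequence would be roughly the following (mount point, monitor addresses and client name are placeholders, not our real values):
# try a clean unmount first; fall back to a lazy unmount if jobs are stuck on the mount
umount /mnt/cephfs || umount -l /mnt/cephfs
# remount with the usual options
mount -t ceph mon1,mon2,mon3:/ /mnt/cephfs -o name=fsclient,secretfile=/etc/ceph/fsclient.secret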
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello dear CEPH users and developers,
we're dealing with strange problems.. we're having 12 node alma linux 9 cluster,
initially installed CEPH 15.2.16, then upgraded to 17.2.5. It's running bunch
of KVM virtual machines accessing volumes using RBD.
everything is working well, but there is a strange and, for us, quite serious issue
- the speed of write operations (both sequential and random) is constantly degrading,
drastically, to almost unusable numbers (in ~1 week it drops from ~70k 4k writes/s
from 1 VM to ~7k writes/s).
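for reference, this is roughly how we measure the 4k write numbers from inside a VM (fio against a dedicated scratch disk backed by an RBD volume; the parameters here are an example, not our exact job):
# random 4k writes with direct I/O; destroys data on /dev/vdb, so only use a scratch disk
fio --name=4k-randwrite --filename=/dev/vdb --ioengine=libaio --direct=1 \
    --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --runtime=60 --time_based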
When I restart all OSD daemons, the numbers immediately return to normal..
volumes are stored on a replicated pool with 4 replicas, on top of 7 * 12 = 84
INTEL SSDPE2KX080T8 NVMe drives.
I updated the cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896,
as restarting OSDs is quite painful when half of them crash..
I don't see anything suspicious - node load is quite low, there are no errors in
the logs, and network latency and throughput are OK too.
Is anyone having a similar issue?
I'd like to ask for hints on what I should check further..
we're running lots of 14.2.x and 15.2.x clusters, none showing a similar
issue, so I'm suspecting this is something related to Quincy.
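if there are specific counters worth watching, please let me know; I assume something along these lines would be the starting point (osd.0 is just an example):
# per-OSD commit/apply latencies as seen by the cluster
ceph osd perf
# recent slow ops and internal perf counters of a suspect OSD (run on its host)
ceph daemon osd.0 dump_historic_ops
ceph daemon osd.0 perf dump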
thanks a lot in advance
with best regards
nikola ciprich
--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava
tel.: +420 591 166 214
fax: +420 596 621 273
mobil: +420 777 093 799
www.linuxbox.cz
mobil servis: +420 737 238 656
email servis: servis(a)linuxbox.cz
-------------------------------------
Hey ceph-users,
I set up a multisite sync between two freshly set up Octopus clusters.
In the first cluster I created a bucket with some data, just to test the
replication of actual data later.
I then followed the instructions on
https://docs.ceph.com/en/octopus/radosgw/multisite/#migrating-a-single-site…
to add a second zone.
Things went well: both zones are now happily reaching each other and
the API endpoints are talking.
The metadata is also already in sync - both sides are happy, and I can
see that bucket listings and users are "in sync":
> # radosgw-admin sync status
>           realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
>       zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
>            zone 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
>   metadata sync no sync (zone is master)
>       data sync source: c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
>                         init
>                         full sync: 128/128 shards
>                         full sync: 0 buckets to sync
>                         incremental sync: 0/128 shards
>                         data is behind on 128 shards
>                         behind shards: [0...127]
>
and on the other side ...
> # radosgw-admin sync status
>           realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
>       zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
>            zone c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
>   metadata sync syncing
>                 full sync: 0/64 shards
>                 incremental sync: 64/64 shards
>                 metadata is caught up with master
>       data sync source: 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
>                         init
>                         full sync: 128/128 shards
>                         full sync: 0 buckets to sync
>                         incremental sync: 0/128 shards
>                         data is behind on 128 shards
>                         behind shards: [0...127]
>
Also, the newly created buckets (read: their metadata) are synced.
What is apparently not working is the sync of actual data.
Upon startup the radosgw on the second site shows:
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: start
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: realm
> epoch=2 period id=f4553d7c-5cc5-4759-9253-9a22b051e736
> 2021-06-25T16:15:11.525+0000 7fe71dff3700 0
> RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote
> data log shards
>
Also, when issuing
# radosgw-admin data sync init --source-zone obst-rgn
it throws:
> 2021-06-25T16:20:29.167+0000 7f87c2aec080 0
> RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data
> log shards
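I assume the next things to look at are the sync error list and the period/endpoint configuration on both sides, e.g. (zone name as in our setup):
# recorded sync errors, if any
radosgw-admin sync error list
# make sure both sides agree on the current period and the zone endpoints
radosgw-admin period get
radosgw-admin zonegroup get
# more detailed per-source data sync status
radosgw-admin data sync status --source-zone obst-rgn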
Does anybody have any hints on where to look for what could be broken here?
Thanks a bunch,
Regards
Christian
hi Ernesto and lists,
> [1] https://github.com/ceph/ceph/pull/47501
are we planning to backport this to quincy so we can support centos 9
there? enabling that upgrade path on centos 9 was one of the
conditions for dropping centos 8 support in reef, which i'm still keen
to do
if not, can we find another resolution to
https://tracker.ceph.com/issues/58832? as i understand it, all of
those python packages exist in centos 8. do we know why they were
dropped for centos 9? have we looked into making those available in
epel? (cc Ken and Kaleb)
On Fri, Sep 2, 2022 at 12:01 PM Ernesto Puerta <epuertat(a)redhat.com> wrote:
>
> Hi Kevin,
>
>>
>> Isn't this one of the reasons containers were pushed, so that the packaging isn't as big a deal?
>
>
> Yes, but the Ceph community has a strong commitment to provide distro packages for those users who are not interested in moving to containers.
>
>> Is it the continued push to support lots of distros without using containers that is the problem?
>
>
> If not a problem, it definitely makes it more challenging. Compiled components often sort this out by statically linking deps whose packages are not widely available in distros. The approach we're proposing here would be the closest equivalent to static linking for interpreted code (bundling).
>
> Thanks for sharing your questions!
>
> Kind regards,
> Ernesto
> _______________________________________________
> Dev mailing list -- dev(a)ceph.io
> To unsubscribe send an email to dev-leave(a)ceph.io
Hi, when starting an upgrade from 15.2.17 I got this error:
Module 'cephadm' has failed: Expecting value: line 1 column 1 (char 0)
The cluster was in HEALTH_OK before starting.
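So far the only thing I can think of is failing over the mgr and re-enabling the module, roughly like this (mgr name is a placeholder) - is that a safe first step?
# check what the module reports
ceph health detail
# restart the cephadm module on a fresh mgr
ceph mgr fail <active-mgr>
ceph mgr module disable cephadm
ceph mgr module enable cephadm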
Hi,
As discussed in another thread (Crushmap rule for multi-datacenter
erasure coding), I'm trying to create an EC pool spanning 3 datacenters
(the datacenters are present in the crushmap), with the objective of being
resilient to 1 DC down, at least keeping read-only access to the pool
and, if possible, read-write access, and of having a storage efficiency
better than 3-replica (let's say a storage overhead <= 2).
In the discussion, somebody mentioned the LRC plugin as a possible alternative
to jerasure for implementing this without tweaking the crushmap rule to
do the 2-step OSD allocation. I looked at the documentation
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but
I have some questions, in case someone has experience/expertise with this LRC
plugin.
I tried to create a rule using 5 OSDs per datacenter (15 in total),
with 3 per datacenter (9 in total) being data chunks and the others being coding chunks.
For this, based on my understanding of the examples, I used k=9, m=3, l=4.
Is that right? Is this configuration equivalent, in terms of redundancy,
to a jerasure configuration with k=9, m=6?
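For reference, the profile behind this was created more or less like this (profile and pool names are ours; the exact options may not be 100% what I typed):
ceph osd erasure-code-profile set test_lrc_2 plugin=lrc \
    k=9 m=3 l=4 \
    crush-locality=datacenter crush-failure-domain=host crush-device-class=hdd
ceph osd pool create test_lrc_2 32 32 erasure test_lrc_2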
The resulting rule, which looks correct to me, is:
--------
{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -4,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 5,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
------------
Unfortunately, it doesn't work as expected: a pool created with this
rule ends up with its PGs active+undersized, which is unexpected for
me. Looking at `ceph health detail` output, I see for each PG
something like:
pg 52.14 is stuck undersized for 27m, current state active+undersized,
last acting
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
For each PG, there are three '2147483647' entries and I guess this is the
reason for the problem. What are these entries about? Clearly they are not
OSD entries... It looks like a negative number, -1, which in terms of
crushmap IDs is the crushmap root (named "default" in our configuration).
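If it helps, I can also test the rule offline with crushtool to see where the mapping fails, something like this (rule id 6 as in the dump above):
ceph osd getcrushmap -o crushmap.bin
crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-mappings --show-bad-mappings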
Is there any trivial mistake I may have made?
Thanks in advance for any help, or for sharing any successful configuration!
Best regards,
Michel
Hi everyone
I'm new to Ceph; a four-day French training session with Octopus on
VMs convinced me to build my first cluster.
At this time I have 4 old identical nodes for testing, with 3 HDDs each
and 2 network interfaces, running Alma Linux 8 (el8). I tried to replay the
training session but it failed, breaking the web interface because of
some problems with podman 4.2 not being compatible with Octopus.
So I tried to deploy Pacific with the cephadm tool on my first node (mostha1)
(to also enable testing an upgrade later):
dnf -y install
https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noar…
monip=$(getent ahostsv4 mostha1 |head -n 1| awk '{ print $1 }')
cephadm bootstrap --mon-ip $monip --initial-dashboard-password xxxxx \
--initial-dashboard-user admceph \
--allow-fqdn-hostname --cluster-network 10.1.0.0/16
This was successful.
But running "*c**eph orch device ls*" do not show any HDD even if I have
/dev/sda (used by the OS), /dev/sdb and /dev/sdc
The web interface shows a raw capacity which is an aggregate of the
sizes of the 3 HDDs for the node.
I've also tried to reset /dev/sdb, but cephadm does not see it:
[ceph: root@mostha1 /]# ceph orch device zap
mostha1.legi.grenoble-inp.fr /dev/sdb --force
Error EINVAL: Device path '/dev/sdb' not found on host
'mostha1.legi.grenoble-inp.fr'
On my first attempt with Octopus, I was able to list the available HDDs
with this command line. Before moving to Pacific, the OS on this node
was reinstalled from scratch.
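If cleaning the disks by hand is the way to go, I assume it would be something like this, run directly on the node (it wipes /dev/sdb completely):
# remove any leftover partition table / filesystem signatures from the previous install
sgdisk --zap-all /dev/sdb
wipefs -a /dev/sdb
# then ask the orchestrator to rescan
ceph orch device ls --refresh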
Any advice for a Ceph beginner?
Thanks
Patrick
Hi,
Perhaps this is a known issue and I was simply too dumb to find it, but
we are having problems with our CephFS metadata pool filling up overnight.
Our cluster has a small SSD pool of around 15TB which hosts our CephFS
metadata pool. Usually, that's more than enough. The normal size of the
pool ranges between 200 and 800GiB (which is quite a lot of fluctuation
already). Yesterday, the pool suddenly filled up entirely and the
only way to fix it was to add more capacity. I increased the pool
size to 18TB by adding more SSDs and could resolve the problem. After a
couple of hours of reshuffling, the pool size finally went back to 230GiB.
But then we had another fill-up tonight to 7.6TiB. Luckily, I had
adjusted the weights so that not all disks could fill up entirely like
last time, so it ended there.
I wasn't really able to identify the problem yesterday, but under the
more controllable scenario today, I could check the MDS logs at
debug_mds=10 and to me it seems like the problem is caused by snapshot
trimming. The logs contain a lot of snapshot-related messages for paths
that haven't been touched in a long time. The messages all look
something like this:
May 31 09:16:48 XXX ceph-mds[2947525]: 2023-05-31T09:16:48.292+0200
7f7ce1bd9700 10 mds.1.cache.ino(0x1000b3c3670) add_client_cap first cap,
joining realm snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1
b1b cps 2 snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 ...
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.396+0200
7f0e6a6ca700 10 mds.0.cache | |______ 3 rep [dir
0x100000218fe.101111101* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 replicated=0 dirty=0 waiter=0 authpin=0
tempexporting=0 0x5607759d9600]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.452+0200
7f0e6a6ca700 10 mds.0.cache | | |____ 4 rep [dir
0x100000ff904.100111101010* /storage/REDACTED/| ptrwaiter=0 request=0
child=0 frozen=0 subtree=1 importing=0 replicated=0 waiter=0 authpin=0
tempexporting=0 0x56034ed25200]
May 31 09:25:03 XXX ceph-mds[3268481]: 2023-05-31T09:25:03.716+0200
7f0e6becd700 10 mds.0.server set_trace_dist snaprealm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' 2023-01-01T00:00:04.657252+0100),1941=snap(1941
0x10000000000 'monthly_20230201'
2023-02-01T00:00:01.854059+0100),19a6=snap(19a6 0x10000000000
'monthly_20230301' 2023-03-01T00:00:01.215197+0100),1a24=snap(1a24
0x10000000000 'monthly_20230401' ...) len=384
May 31 09:25:36 deltaweb055 ceph-mds[3268481]:
2023-05-31T09:25:36.076+0200 7f0e6becd700 10
mds.0.cache.ino(0x10004d74911) remove_client_cap last cap, leaving realm
snaprealm(0x10000000000 seq 1b1c lc 1b1b cr 1b1b cps 2
snaps={185f=snap(185f 0x10000000000 'monthly_20221201'
2022-12-01T00:00:01.530830+0100),18de=snap(18de 0x10000000000
'monthly_20230101' ...)
The daily_*, monthly_*, etc. names are the names of our regular snapshots.
I posted a larger log file snippet using ceph-post-file with the ID:
da0eb93d-f340-4457-8a3f-434e8ef37d36
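In case it is relevant: I assume the raw pool usage and the stray counters on the MDS are the things to watch while this happens, e.g. (daemon and pool names are placeholders for ours):
# current usage of the metadata pool
ceph df detail | grep cephfs_metadata
# number of stray entries the active MDS is holding / purging
ceph daemon mds.<active-mds> perf dump | grep -i stray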
Is it possible that the MDSes are trimming old snapshots without taking
care not to fill up the entire metadata pool?
Cheers
Janek
Hi all,
I wanted to call attention to some RGW issues that we've observed on a
Pacific cluster over the past several weeks. The problems relate to versioned
buckets and index entries that can be left behind after transactions complete
abnormally. The scenario is multi-faceted and we're still investigating some of
the details, but I wanted to provide a big-picture summary of what we've found
so far. It looks like most of these issues should be reproducible on versions
before and after Pacific as well. I'll enumerate the individual issues below:
1. PUT requests during reshard of versioned bucket fail with 404 and leave
behind dark data
Tracker: https://tracker.ceph.com/issues/61359
2. When bucket index ops are cancelled, they can leave behind zombie index entries
A fix for this one was merged a few months ago and did make the v16.2.13 release, but
in our case we had billions of extra index entries by the time that we had
upgraded to the patched version.
Tracker: https://tracker.ceph.com/issues/58673
3. Issuing a delete for a key that already has a delete marker as the current
version leaves behind index entries and OLH objects
Note that the tracker's original description describes the problem a bit
differently, but I've clarified the nature of the issue in a comment.
Tracker: https://tracker.ceph.com/issues/59663
The extra index entries and OLH objects that are left behind due to these sorts
of issues are obviously annoying because they unnecessarily consume space, but
we've found that they can also cause severe performance degradation for bucket
listings, lifecycle processing, and other ops, indirectly, due to higher OSD latencies.
The reason for the performance impact is that bucket listing calls must
repeatedly perform additional OSD ops until they find the requisite number
of entries to return. The OSD cls method for bucket listing also does its own
internal iteration for the same purpose. Since these entries are invalid, they
are skipped. In the case that we observed, where some of our bucket indexes were
filled with a sea of contiguous leftover entries, the process of continually
iterating over and skipping invalid entries caused enormous read amplification.
I believe that the following tracker is describing symptoms that are related to
the same issue: https://tracker.ceph.com/issues/59164.
Note that this can also cause LC processing to repeatedly fail in cases where
there are enough contiguous invalid entries, since the OSD cls code eventually
gives up and returns an error that isn't handled.
The severity of these issues likely varies greatly based upon client behavior.
If anyone has experienced similar problems, we'd love to hear about the nature
of how they've manifested for you so that we can be more confident that we've
plugged all of the holes.
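If you want to check whether a particular bucket is affected, a rough first pass is to compare the object count reported by the bucket stats with a raw listing of the bucket index, and/or to run a bucket check (the bucket name is a placeholder):
# object counts as tracked in the bucket stats
radosgw-admin bucket stats --bucket=mybucket
# raw bucket index entries (can be huge for large buckets)
radosgw-admin bi list --bucket=mybucket
# index consistency check; only add --fix after reviewing the output
radosgw-admin bucket check --bucket=mybucket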
Thanks,
Cory Snyder
11:11 Systems