Hi all,
we seem to have hit a bug in the ceph fs kernel client and I just want to confirm what action to take. We get the error "wrong peer at address" in dmesg, and some jobs on that server seem to get stuck in file-system access; a log extract is below. I found these two tracker items, https://tracker.ceph.com/issues/23883 and https://tracker.ceph.com/issues/41519, which don't seem to have fixes.
My questions:
- Is this harmless or does it indicate invalid/corrupted client cache entries?
- How should we resolve this: ignore it, umount+mount, or reboot?
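By umount+mount I mean roughly the following on the affected node (the mount point is a placeholder for our actual path):
  umount /mnt/cephfs    # possibly umount -f / umount -l if stuck jobs keep it busy
  mount /mnt/cephfs     # re-mount from fstab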
Here is an extract from the dmesg log; the error has already survived a couple of MDS restarts:
[Mon Mar 6 12:56:46 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:05:18 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:05:18 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:13:50 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-1572619386
[Mon Mar 6 13:13:50 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:41 2023] libceph: mds1 192.168.32.87:6801 socket closed (con state OPEN)
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:45 2023] ceph: mds1 reconnect start
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:16:48 2023] ceph: mds1 reconnect success
[Mon Mar 6 13:18:13 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:17 2023] libceph: mds7 192.168.32.88:6801 socket closed (con state OPEN)
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:23 2023] ceph: mds1 recovery completed
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect start
[Mon Mar 6 13:18:28 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:29 2023] ceph: mds7 reconnect success
[Mon Mar 6 13:18:35 2023] ceph: update_snap_trace error -22
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:18:35 2023] ceph: mds7 recovery completed
[Mon Mar 6 13:22:22 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Mon Mar 6 13:22:22 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Mon Mar 6 13:30:54 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[...]
[Thu Mar 9 09:37:24 2023] slurm.epilog.cl (31457): drop_caches: 3
[Thu Mar 9 09:38:26 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:38:26 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:46:58 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:46:58 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 09:55:30 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 09:55:30 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
[Thu Mar 9 10:04:02 2023] libceph: wrong peer, want 192.168.32.87:6801/-223958753, got 192.168.32.87:6801/-453143347
[Thu Mar 9 10:04:02 2023] libceph: mds1 192.168.32.87:6801 wrong peer at address
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey ceph-users,
I set up multisite sync between two freshly set up Octopus clusters.
In the first cluster I created a bucket with some data just to test the
replication of actual data later.
I then followed the instructions on
https://docs.ceph.com/en/octopus/radosgw/multisite/#migrating-a-single-site…
to add a second zone.
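For reference, the secondary-zone setup from that page boils down to the following commands run on the second cluster (the endpoint URLs and the system user's keys below are placeholders for ours):
  radosgw-admin realm pull --url=http://rgw-site1:8080 --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin period pull --url=http://rgw-site1:8080 --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin zone create --rgw-zonegroup=obst-fra --rgw-zone=obst-az1 \
      --endpoints=http://rgw-site2:8080 --access-key=SYSTEM_KEY --secret=SYSTEM_SECRET
  radosgw-admin period update --commit
  systemctl restart ceph-radosgw@rgw.$(hostname -s)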
Things went well and both zones are now happily reaching each other and
the API endpoints are talking.
Also the metadata is in sync already - both sides are happy and I can
see bucket listings and users are "in sync":
> # radosgw-admin sync status
> realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
> zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
> zone 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
> metadata sync no sync (zone is master)
> data sync source: c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
> init
> full sync: 128/128 shards
> full sync: 0 buckets to sync
> incremental sync: 0/128 shards
> data is behind on 128 shards
> behind shards: [0...127]
>
and on the other side ...
> # radosgw-admin sync status
> realm 13d1b8cb-dc76-4aed-8578-2ce5d3d010e8 (obst)
> zonegroup 17a06c15-2665-484e-8c61-cbbb806e11d2 (obst-fra)
> zone c07447eb-f93a-4d8f-bf7a-e52fade399f3 (obst-az1)
> metadata sync syncing
> full sync: 0/64 shards
> incremental sync: 64/64 shards
> metadata is caught up with master
> data sync source: 6d2c1275-527e-432f-a57a-9614930deb61 (obst-rgn)
> init
> full sync: 128/128 shards
> full sync: 0 buckets to sync
> incremental sync: 0/128 shards
> data is behind on 128 shards
> behind shards: [0...127]
>
Also, newly created buckets (read: their metadata) are synced.
What is apparently not working is the sync of actual data.
Upon startup the radosgw on the second site shows:
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: start
> 2021-06-25T16:15:06.445+0000 7fe71eff5700 1 RGW-SYNC:meta: realm
> epoch=2 period id=f4553d7c-5cc5-4759-9253-9a22b051e736
> 2021-06-25T16:15:11.525+0000 7fe71dff3700 0
> RGW-SYNC:data:sync:init_data_sync_status: ERROR: failed to read remote
> data log shards
>
Also, when issuing
# radosgw-admin data sync init --source-zone obst-rgn
it throws
> 2021-06-25T16:20:29.167+0000 7f87c2aec080 0
> RGW-SYNC:data:init_data_sync_status: ERROR: failed to read remote data
> log shards
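In case it is useful, I can provide the output of the usual sync diagnostics from either side, e.g.:
  radosgw-admin sync error list
  radosgw-admin period get
  radosgw-admin zonegroup get
  radosgw-admin zone get --rgw-zone=obst-az1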
Does anybody have any hints on where to look for what could be broken here?
Thanks a bunch,
Regards
Christian
hi Ernesto and lists,
> [1] https://github.com/ceph/ceph/pull/47501
are we planning to backport this to quincy so we can support centos 9
there? enabling that upgrade path on centos 9 was one of the
conditions for dropping centos 8 support in reef, which i'm still keen
to do
if not, can we find another resolution to
https://tracker.ceph.com/issues/58832? as i understand it, all of
those python packages exist in centos 8. do we know why they were
dropped for centos 9? have we looked into making those available in
epel? (cc Ken and Kaleb)
On Fri, Sep 2, 2022 at 12:01 PM Ernesto Puerta <epuertat(a)redhat.com> wrote:
>
> Hi Kevin,
>
>>
>> Isn't this one of the reasons containers were pushed, so that the packaging isn't as big a deal?
>
>
> Yes, but the Ceph community has a strong commitment to provide distro packages for those users who are not interested in moving to containers.
>
>> Is it the continued push to support lots of distros without using containers that is the problem?
>
>
> If not a problem, it definitely makes it more challenging. Compiled components often sort this out by statically linking deps whose packages are not widely available in distros. The approach we're proposing here would be the closest equivalent to static linking for interpreted code (bundling).
>
> Thanks for sharing your questions!
>
> Kind regards,
> Ernesto
Hi, starting an upgrade from 15.2.17 I got this error:
Module 'cephadm' has failed: Expecting value: line 1 column 1 (char 0)
The cluster was in HEALTH_OK before starting.
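In case it helps narrow this down, the first things I can try (just guesses on my part) are bouncing the module and the active mgr:
  ceph mgr module disable cephadm
  ceph mgr module enable cephadm
  ceph mgr fail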
Hi,
As discussed in another thread (Crushmap rule for multi-datacenter
erasure coding), I'm trying to create an EC pool spanning 3 datacenters
(the datacenters are present in the crushmap), with the objective of
being resilient to 1 DC down, at least keeping read-only access to the
pool and, if possible, read-write access, and of having a storage
efficiency better than 3-replica (let's say a storage overhead <= 2).
In the discussion, somebody mentioned the LRC plugin as a possible
alternative to jerasure for implementing this without hand-tweaking the
crush rule to do the 2-step OSD allocation. I looked at the documentation
(https://docs.ceph.com/en/latest/rados/operations/erasure-code-lrc/) but
I have some questions if someone has experience/expertise with this LRC
plugin.
I tried to create a rule using 5 OSDs per datacenter (15 in total),
with 3 per datacenter (9 in total) being data chunks and the others
being coding chunks. For this, based on my understanding of the
examples, I used k=9, m=3, l=4.
Is it right? Is this configuration equivalent, in terms of redundancy,
to a jerasure configuration with k=9, m=6?
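For reference, the profile behind this was created with something like the following (the profile and pool names are just my test names):
  ceph osd erasure-code-profile set test_lrc_2 \
      plugin=lrc k=9 m=3 l=4 \
      crush-locality=datacenter crush-failure-domain=host
  # (plus crush-device-class=hdd to pin the pool to HDDs, which would
  #  explain the default~hdd item in the rule below)
  ceph osd pool create test_lrc_pool 128 128 erasure test_lrc_2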
The resulting rule, which looks correct to me, is:
--------
{
    "rule_id": 6,
    "rule_name": "test_lrc_2",
    "ruleset": 6,
    "type": 3,
    "min_size": 3,
    "max_size": 15,
    "steps": [
        {
            "op": "set_chooseleaf_tries",
            "num": 5
        },
        {
            "op": "set_choose_tries",
            "num": 100
        },
        {
            "op": "take",
            "item": -4,
            "item_name": "default~hdd"
        },
        {
            "op": "choose_indep",
            "num": 3,
            "type": "datacenter"
        },
        {
            "op": "chooseleaf_indep",
            "num": 5,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
------------
Unfortunately, it doesn't work as expected: a pool created with this
rule ends up with its PGs active+undersized, which is unexpected to me.
Looking at 'ceph health detail' output, I see for each PG something
like:
pg 52.14 is stuck undersized for 27m, current state active+undersized,
last acting
[90,113,2147483647,103,64,147,164,177,2147483647,133,58,28,8,32,2147483647]
For each PG, there are 3 '2147483647' entries, and I guess that is the
reason for the problem. What are these entries about? Clearly they are
not OSD entries... It looks like a negative number, -1, which in terms
of crushmap IDs is the crushmap root (named "default" in our
configuration).
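One thing I can do to help debugging is a dry run of the rule with crushtool on an extracted crushmap, e.g.:
  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-mappings
  crushtool -i crushmap.bin --test --rule 6 --num-rep 15 --show-bad-mappings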
Is there any trivial mistake I might have made?
Thanks in advance for any help, or for sharing any successful configuration.
Best regards,
Michel
I'm in the process of exploring if it is worthwhile to add RadosGW to
our existing ceph cluster. We've had a few internal requests for
exposing the S3 API for some of our business units, right now we just
use the ceph cluster for VM disk image storage via RBD.
Everything looks pretty straightforward until we hit multitenancy. The
page on multi-tenancy doesn't dive into permission delegation:
https://docs.ceph.com/en/quincy/radosgw/multitenancy/
The end goal I want is to be able to create a single user per tenant
(Business Unit) which will act as their 'administrator', where they can
then do basically whatever they want under their tenant sandbox (though
I don't think we need more advanced cases like creations of roles or
policies, just create/delete their own users, buckets, objects). I was
hopeful this would just work, and I asked on the ceph IRC channel on
OFTC and was told once I grant a user caps="users=*", they would then be
allowed to create users *outside* of their own tenant using the Rados
Admin API and that I should explore IAM roles.
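For context, the kind of per-tenant administrator I have in mind would be created roughly like this (tenant and user names are made up):
  radosgw-admin user create --tenant bu1 --uid admin --display-name "BU1 Admin"
  radosgw-admin caps add --uid='bu1$admin' --caps="users=*;buckets=*"
The problem is that the users=* cap in the second command is apparently not scoped to the bu1 tenant.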
I think it would make sense to add a feature, such as a flag that can be
set on a user, to ensure they stay in their "sandbox". I'd assume this
is probably a common use-case.
Anyhow, if it's possible to do this today using IAM roles/policies,
then great; unfortunately, this is my first time looking at this stuff
and some things are not immediately obvious.
I saw this online about AWS itself and creating a permissions boundary,
but that's for allowing creation of roles within a boundary:
https://www.qloudx.com/delegate-aws-iam-user-and-role-creation-without-givi…
I'm not sure what "Action" is associated with the Rados Admin API
create-user call for applying a boundary such that the user can only
create users under the same tenant name.
https://docs.ceph.com/en/quincy/radosgw/adminops/#create-user
Any guidance on this would be extremely helpful.
Thanks!
-Brad
Bonjour,
Reading Karan's blog post about benchmarking the insertion of billions objects to Ceph via S3 / RGW[0] from last year, it reads:
> we decided to lower bluestore_min_alloc_size_hdd to 18KB and re-test. As represented in chart-5, the object creation rate found to be notably reduced after lowering the bluestore_min_alloc_size_hdd parameter from 64KB (default) to 18KB. As such, for objects larger than the bluestore_min_alloc_size_hdd , the default values seems to be optimal, smaller objects further require more investigation if you intended to reduce bluestore_min_alloc_size_hdd parameter.
There is also a mail thread from 2018 on this topic, with the same conclusion, although using RADOS directly and not RGW[3]. I read the RGW data layout page in the documentation[1] and concluded that by default every object inserted via S3 / RGW will indeed use at least 64KB. A pull request from last year[2] seems to confirm this and also suggests that modifying bluestore_min_alloc_size_hdd has adverse side effects.
That being said, I'm curious to know if people developed strategies to cope with this overhead. Someone mentioned packing objects together client side to make them larger. But maybe there are simpler ways to do the same?
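To illustrate the packing idea (just a sketch; the tool, bucket and file names are arbitrary): batch many small files into one archive client side, upload the archive as a single S3 object, and keep an index of what went where:
  tar cf batch-000123.tar small-objects/
  s3cmd put batch-000123.tar s3://mybucket/batches/batch-000123.tar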
Cheers
[0] https://www.redhat.com/en/blog/scaling-ceph-billion-objects-and-beyond
[1] https://docs.ceph.com/en/latest/radosgw/layout/
[2] https://github.com/ceph/ceph/pull/32809
[3] https://www.spinics.net/lists/ceph-users/msg45755.html
--
Loïc Dachary, Artisan Logiciel Libre
Hi all,
on an NFS re-export of a ceph-fs (kernel client) I observe a very strange error. I'm un-tarring a larger package (1.2G) and after some time I get these errors:
ln: failed to create hard link 'file name': Read-only file system
The strange thing is that this seems to be only temporary. When I used "ln src dst" for manual testing, the command failed as above. However, after that I tried "ln -v src dst" and this command created the hard link with exactly the same path arguments. During the period when the error occurs, I can't see any FS in read-only mode, neither on the NFS client nor on the NFS server. The funny thing is that file creation and writing still work; it's only the hard-link creation that fails.
For details, the set-up is:
file-server: mount ceph-fs at /shares/path, export /shares/path as nfs4 to other server
other server: mount /shares/path as NFS
More precisely, on the file-server:
fstab: MON-IPs:/shares/folder /shares/nfs/folder ceph defaults,noshare,name=NAME,secretfile=sec.file,mds_namespace=FS-NAME,_netdev 0 0
exports: /shares/nfs/folder -no_root_squash,rw,async,mountpoint,no_subtree_check DEST-IP
On the host at DEST-IP:
fstab: FILE-SERVER-IP:/shares/nfs/folder /mnt/folder nfs defaults,_netdev 0 0
Both, the file server and the client server are virtual machines. The file server is on Centos 8 stream (4.18.0-338.el8.x86_64) and the client machine is on AlmaLinux 8 (4.18.0-425.13.1.el8_7.x86_64).
When I change the NFS export from "async" to "sync", everything works. However, that's a rather bad workaround and not a solution. Although this looks like an NFS issue, I'm afraid it is a problem with hard links and ceph-fs. It looks like a race between scheduling and executing operations on the ceph-fs kernel mount.
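For clarity, the working (but slow) configuration differs only in the export flag, i.e.:
exports: /shares/nfs/folder -no_root_squash,rw,sync,mountpoint,no_subtree_check DEST-IP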
Has anyone seen something like that?
Thanks and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
On Thu, Dec 15, 2022 at 9:32 AM Stolte, Felix <f.stolte(a)fz-juelich.de> wrote:
>
> Hi Patrick,
>
> we used your script to repair the damaged objects on the weekend and it went smoothly. Thanks for your support.
>
> We adjusted your script to scan for damaged files on a daily basis; runtime is about 6h. Until Thursday last week, we had exactly the same 17 files. On Thursday at 13:05 a snapshot was created, and our active MDS crashed once at that time:
>
> 2022-12-08T13:05:48.919+0100 7f440afec700 -1 /build/ceph-16.2.10/src/mds/ScatterLock.h: In function 'void ScatterLock::set_xlock_snap_sync(MDSContext*)' thread 7f440afec700 time 2022-12-08T13:05:48.921223+0100
> /build/ceph-16.2.10/src/mds/ScatterLock.h: 59: FAILED ceph_assert(state == LOCK_XLOCK || state == LOCK_XLOCKDONE)
>
> 12 minutes later the unlink_local error crashes appeared again, this time with a new file. During debugging we noticed an MTU mismatch between the MDS (1500) and a client (9000) with a cephfs kernel mount. The client is also creating the snapshots via mkdir in the .snap directory.
>
> We disabled snapshot creation for now, but really need this feature. I uploaded the mds logs of the first crash along with the information above to https://tracker.ceph.com/issues/38452
>
> I would greatly appreciate it if you could answer the following question:
>
> Is the bug related to our MTU mismatch? We also fixed the MTU issue over the weekend, going back to 1500 on all nodes in the ceph public network.
I doubt it.
> If you need a debug level 20 log of the ScatterLock for further analysis, I could schedule snapshots at the end of our workdays and increase the debug level 5 minutes around snapshot creation.
This would be very helpful!
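Something like the following around the snapshot window should be sufficient (adjust the window and targets as needed):
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1
  # ... create the snapshot, wait a few minutes ...
  ceph config rm mds debug_mds
  ceph config rm mds debug_ms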
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D