Hi,
I'm trying to configure RDMA in a cluster of 6 nodes (3 MONs and 3 OSD nodes, with 10 OSDs on each OSD node).
OS: CentOS Stream release 8.
I've followed the steps below, but I got an error.
[root@mon1 ~]# cephadm shell
Inferring fsid 9414e1bc-9061-11ed-90fc-00163e4f92ad
Using recent ceph image quay.io/ceph/ceph@sha256:3cd25ee2e1589bf534c24493ab12e27caf634725b4449d50408fd5ad4796bbfa
[ceph: root@mon1 /]# ceph config set global ms_type async+rdma
2023-01-21T11:11:49.182+0000 7fab5922e700 -1 Infiniband verify_prereq!!! WARNING !!! For RDMA to work properly user memlock (ulimit -l) must be big enough to allow large amount of registered memory. We recommend setting this parameter to infinity
/usr/include/c++/8/bits/stl_vector.h:932: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](std::vector<_Tp, _Alloc>::size_type) [with _Tp = Worker*; _Alloc = std::allocator<Worker*>; std::vector<_Tp, _Alloc>::reference = Worker*&; std::vector<_Tp, _Alloc>::size_type = long unsigned int]: Assertion '__builtin_expect(__n < this->size(), true)' failed.
Aborted (core dumped)
[ceph: root@mon1 /]#
The error includes a suggestion about ulimit, but with a containerized deployment, how can I properly configure async+rdma?
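In case it points anyone in the right direction: if your cephadm version supports the extra_container_args service-spec field, the memlock limit can be raised for the daemon containers via the container runtime's --ulimit flag. A minimal, untested sketch (the spec file name and placement are just illustrative, and the same would have to be applied to the other daemon types):
```
# Hypothetical spec: extra_container_args is passed straight through to
# podman/docker, so --ulimit memlock=-1:-1 lifts the locked-memory limit
# inside the container (as the RDMA warning asks for).
cat > mon-rdma-spec.yaml <<'EOF'
service_type: mon
placement:
  count: 3
extra_container_args:
  - "--ulimit"
  - "memlock=-1:-1"
EOF
ceph orch apply -i mon-rdma-spec.yaml   # redeploys the mons with the new limit
```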
Best regards,
Hi,
I have been studying the Ceph and RADOS documentation but I could not find
any metrics that measure the number of read/write operations for each file. I
understand that CephFS is the front end and that each file is stored as
objects on the OSDs, and I have found that Ceph provides a Cache Tiering
feature which also requires monitoring read/write operations for each
object. Could someone please give me guidance on how this is achieved?
Thanks.
Best regards,
Le Thanh Son
Hello,
We would like to run a RAID1 between local storage and an RBD device. This would allow us to sustain network or Ceph failures and also give better read performance, as we would set it up with write-mostly on the RBD device in mdadm.
Basically we would like to implement https://discord.com/blog/how-discord-supercharges-network-disks-for-extreme….
RAID1 is working well, but if there are timeouts the RBD volume won't fail and mdadm will not catch the broken device. The writes then hang, waiting for the network/RBD to come back. If we force-unmap the RBD device, it fails as expected and writes can continue on the other RAID1 device.
We tried setting `osd_request_timeout` to a small value (3 or 2 seconds), but it only gives us timeouts in the kernel logs:
```
libceph: tid 25792 on osd39 timeout
rbd: rbd0: write at objno 602 0~512 result -110
rbd: rbd0: write result -110
print_req_error: 15 callbacks suppressed
blk_update_request: timeout error, dev rbd0, sector 4931584 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
libceph: tid 25794 on osd39 timeout
rbd: rbd0: write at objno 602 512~512 result -110
rbd: rbd0: write result -110
blk_update_request: timeout error, dev rbd0, sector 4931585 op 0x1:(WRITE) flags 0x800 phys_seg 1 prio class 0
```
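For reference, a minimal sketch of the kind of setup described above; the device, pool and image names are placeholders, and osd_request_timeout is shown as a per-mapping option (if the kernel in use accepts it) rather than a cluster-wide setting:
```
# Map the RBD image with a short OSD request timeout (seconds), assuming the
# running kernel accepts osd_request_timeout as a map option.
rbd map vm-pool/data-vol -o osd_request_timeout=3

# Mirror a local disk with the RBD device; --write-mostly applies to the
# devices listed after it, so reads prefer the local leg.
mdadm --create /dev/md0 --level=1 --raid-devices=2 \
      /dev/nvme0n1p1 --write-mostly /dev/rbd0
```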
Is there something that we missed, or is it currently impossible with kRBD to "fail fast" on a timeout and unmap/remove the associated RBD devices? Or is there another client that can do what we want (rbd-nbd, or something with librbd)?
We found this Rook issue, which is not really helpful but gives some insight: https://github.com/rook/rook/issues/376.
Thanks!
--
Mathias Chapelain
Storage Engineer
Proton AG
Hi,
How can I set values for rgw_keystone_url and other related fields that
cannot be changed via the GUI under cluster configuration?
Ceph Quincy is deployed using cephadm.
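In case it helps, a minimal sketch of what I believe should work from the CLI (the URL and service name are placeholders): the rgw_* options can be written into the mon config database with `ceph config set` and picked up once the RGW daemons are restarted.
```
# Placeholder Keystone endpoint; setting it globally means every RGW daemon
# reads it from the mon config database.
ceph config set global rgw_keystone_url https://keystone.example.com:5000
ceph config set global rgw_keystone_api_version 3
# (Alternatively, target a single daemon by its full name, e.g.
#  client.rgw.<realm>.<zone>.<host>.<id>.)

# Restart the RGW service deployed by cephadm so it picks up the new values:
#   ceph orch restart rgw.<service-name>
```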
--
Cheers,
Shashi
I'm running Quincy and my journal fills with messages that I consider
"debug" level such as:
* ceph-mgr[1615]: [volumes INFO mgr_util] scanning for idle connections..
* ceph-mon[1617]: pgmap v1176995: 145 pgs: 145 active+clean; ...
* ceph-mgr[1615]: [dashboard INFO request] ...
* ceph-mgr[1615]: [pg_autoscaler INFO root] effective_target_ratio 0.0
0.0 0 1649254858752
I've never changed levels after installation, and e.g. `ceph config show
osd.0 | fgrep debug` does not show anything (and `ceph config show
mgr.0` results in "Error ENOENT: no config state for daemon mgr.0").
I have no ideas left, any hints?
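Not an answer, but a hedged starting point: `ceph config show` only lists values that differ from the compiled-in defaults, so the debug levels may simply not appear there. The sketch below shows how to inspect the effective values and lower the generic mgr/mon verbosity, though I am not certain it covers the pgmap and dashboard lines specifically.
```
# Show everything, including compiled-in defaults (plain `ceph config show`
# hides options still at their default value).
ceph config show-with-defaults osd.0 | grep debug

# Inspect and lower the generic mgr/mon debug levels.
ceph config get mgr debug_mgr
ceph config set mgr debug_mgr 1/5
ceph config set mon debug_mon 1/5
```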
- Michael
Hi all,
we are observing a problem on a libvirt virtualisation cluster that might come from the Ceph RBD clients. Something went wrong during execution of a live-migration operation, and as a result we have two instances of the same VM running on two different hosts, the source and the destination host. What we observe now is that the exclusive lock on the RBD disk image moves between these two clients periodically (every few minutes the owner flips).
We are pretty sure that no virsh commands that could have that effect are executed during this time. The client connections are not lost and the OSD blacklist is empty. I don't understand why a Ceph RBD client would surrender an exclusive lock in such a split-brain situation; it's exactly when it needs to hold on to it. As a result, the affected virtual drives are corrupted.
The questions we have in this context are:
Under what conditions does a ceph rbd client surrender an exclusive lock?
Could this be a bug in the client or a ceph config error?
Is this a known problem with libceph and libvirtd?
Anyone else making the same observation and having some guidance?
The VM hosts are on Alma 8 and we use the advanced virtualisation repo, which provides very recent versions of qemu and libvirtd. We have seen this floating exclusive lock before on Mimic; now we are on Octopus and I can't really blame it on the old Ceph version any more. We use OpenNebula as a KVM front end.
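For watching the lock move between the two clients, a minimal sketch (pool and image names are placeholders):
```
# Show the current exclusive-lock holder and the clients watching the image;
# run periodically, the lock owner should be seen alternating between the
# source and destination host.
rbd lock ls libvirt-pool/one-123-disk-0
rbd status libvirt-pool/one-123-disk-0
```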
Thanks for any pointers!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hey all! I’ve run into an MDS crash on a cluster recently upgraded from Ceph 16.2.7 to 16.2.10. I’m hitting an assert nearly identical to this one gathered by the telemetry module:
https://tracker.ceph.com/issues/54747
I have a new build compiling to test whether https://github.com/ceph/ceph/pull/43184/ makes a difference or not, when setting mds_inject_skip_replaying_inotable.
Relevant logs are below, but I’m wondering if anyone has hit anything like this? Thanks in advance!
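For what it's worth, a sketch of how I expect to exercise that option once the patched build is deployed; this is only meaningful on a build that actually contains the PR, and a stock mon may need --force to accept an option it does not recognise:
```
# mds_inject_skip_replaying_inotable only exists with the PR applied; --force
# lets the mon store an option it does not recognise itself.
ceph config set mds mds_inject_skip_replaying_inotable true --force
# Restart the MDS daemons so journal replay runs with the flag, e.g.:
#   ceph orch restart mds.<fs-name>
```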
=== BEGIN LOG SNIPPET ===
-2> 2023-01-18T20:16:29.789+0000 7f6190243700 -1 log_channel(cluster) log [ERR] : journal replay alloc 0x10000000010 not in free [0x10000000011~0x3dc,0x100000003fb~0x1e8,0x100000005e5~0x2,0x100000009d4~0x2,0x1000005cc6d~0x4,0x10001c6b44e~0x4,0x10001cb91f4~0x1f4,0x10001cb93f4~0x3dd,0x10007582c15~0x279,0x10007582e90~0x1f4,0x10007583094~0xfff8a7cf6c]
-1> 2023-01-18T20:16:29.789+0000 7f6190243700 -1 /builds/66321/e7c73776/ceph/-build//WORKDIR/ceph-16.2.10/src/mds/journal.cc: In function 'void EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)' thread 7f6190243700 time 2023-01-18T20:16:29.794189+0000
/WORKDIR/ceph-16.2.10/src/mds/journal.cc: 1577: FAILED ceph_assert(inotablev == mds->inotable->get_version())
ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14c) [0x7f619d548645]
2: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) [0x560bfd1c6935]
4: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
5: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
6: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
7: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
8: clone()
0> 2023-01-18T20:16:29.793+0000 7f6190243700 -1 *** Caught signal (Aborted) **
in thread 7f6190243700 thread_name:md_log_replay
ceph version 16.2.10 (e7c73776b3136f6d18a35febeb38f5fdd41be364) pacific (stable)
1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x12730) [0x7f619cf34730]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x19d) [0x7f619d548696]
5: /usr/lib/ceph/libceph-common.so.2(+0x27182f) [0x7f619d54882f]
6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5815) [0x560bfd1c6935]
7: (EUpdate::replay(MDSRank*)+0x3c) [0x560bfd1c7ecc]
8: (MDLog::_replay_thread()+0xca9) [0x560bfd153de9]
9: (MDLog::ReplayThread::entry()+0xd) [0x560bfce78fdd]
10: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f619cf29fa3]
11: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
=== END LOG SNIPPET ===
Hi,
We have a full-SSD production cluster running Pacific 16.2.10, deployed
with cephadm, that is experiencing OSD flapping issues.
Essentially, random OSDs get kicked out of the cluster and then
automatically brought back in a few times a day. As an example, let's
take the case of osd.184:
- It flapped 9 times between January 15th and 17th, with the following log message each time: 2023-01-15T16:33:19.903+0000 prepare_failure osd.184 from osd.49 is reporting failure:1
- On January 17th, it complains that there are slow ops and spams its logs with the following line: heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f346aa64700' had timed out after 15.000000954s
The storage node itself has over 30 GB of RAM still available in cache,
and the drives themselves only seldom peak at 100% usage, and that never
lasts more than a few seconds. CPU usage is also constantly around 5%.
Considering there are no other error messages in any of the regular logs,
including the systemd logs, why would this OSD not reply to heartbeats?
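One hedged suggestion for narrowing it down: dump the in-flight and recent slow operations from the affected OSD while it is being reported, e.g.:
```
# Operations currently blocked / in flight on the flapping OSD.
ceph tell osd.184 dump_ops_in_flight
# The most recent operations the OSD flagged as slow.
ceph tell osd.184 dump_historic_slow_ops
# (If `tell` doesn't expose these on this release, the same commands are
#  available via `ceph daemon osd.184 ...` inside the daemon's container.)
```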
--
Jean-Philippe Méthot
Senior Openstack system administrator
Administrateur système Openstack sénior
PlanetHoster inc.
Ceph Pacific 16.2.9
We have a storage server with multiple 1.7 TB SSDs dedicated to BlueStore DB usage. The OSD spec was originally slightly misconfigured: it set the "limit" parameter on the db_devices to 5 (there are 8 SSDs available) and did not specify a block_db_size. Ceph laid out the original 40 OSDs and put 8 DBs on each of the 5 SSDs (because of the limit param). Ceph seems to have auto-sized the BlueStore DB partitions to about 45 GB, which is far less than the recommended 1-4% (we are using 10 TB drives). How does ceph-volume determine the size of the BlueStore DB/WAL partitions when it is not specified in the spec?
We updated the spec, specified a block_db_size of 300G, and removed the "limit" value. Now we can see in cephadm.log that the ceph-volume command being issued uses the correct list of SSD devices (all 8) as options to the lvm batch (--db-devices ...), but it keeps failing to create the new OSD because we are asking for 300G and it thinks there is only 44G available, even though the last 3 SSDs in the list are empty (1.7 TB). So it appears that the orchestrator is somehow ignoring the last 3 SSDs. I have verified that these SSDs are wiped clean, have no partitions or LVM, and no label (sgdisk -Z, wipefs -a). They appear as available in the inventory and are not locked or otherwise in use.
Also, the "db_slots" spec parameter is ignored in Pacific due to a bug, so there is no way to tell the orchestrator to use "block_db_slots". Adding it to the spec like "block_db_size" fails since it is not recognized.
Any help figuring out why these SSDs are being ignored would be much appreciated.
Our spec for this host looks like this:
---
spec:
  data_devices:
    rotational: 1
    size: '3TB:'
  db_devices:
    rotational: 0
    size: ':2T'
    vendor: 'SEAGATE'
  block_db_size: 300G
---
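One hedged way to check how the orchestrator is interpreting the spec, without touching the disks, is a dry run plus the wide device listing, e.g.:
```
# Preview the OSD/DB layout the orchestrator would create from the spec.
ceph orch apply -i osd_spec.yml --dry-run
# Show how the SSDs are reported, including any reject reasons.
ceph orch device ls --wide
```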