Hi all,
Due to several reasons (political, heating problems, lack of space,
and so on) we have to plan for our ceph cluster to be hosted externally.
The planned version to set up is Reef.
Reading up on the documentation, we found that it is possible to run in
secure mode.
Our ceph.conf file will list both v2 and v1 addresses for the mons:
mon host = [v2:4.3.2.1:3300/0,v1:4.3.2.1:6789/0]
[v2:4.3.2.2:3300/0,v1:4.3.2.2:6789/0]
[v2:4.3.2.3:3300/0,v1:4.3.2.3:6789/0]
Then we change the following configuration options to secure only:
ms_cluster_mode = secure
ms_service_mode = secure
ms_client_mode = secure
ms_mon_cluster_mode = secure
ms_mon_service_mode = secure
ms_mon_client_mode = secure
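(If the cluster uses the centralized config database rather than a plain
ceph.conf, the same settings can be applied at runtime with ceph config
set, e.g.:

ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure

and likewise for the remaining four options.)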
Then I remounted cephfs on the clients of our test cluster,
but the fs would still mount on port 6789.
I thought that the secure config change above would "force"
the mount onto port 3300 and v2.
Mounting with the option ms_mode=secure did the trick.
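For the record, the working mount looked something like this (client
name and mountpoint are placeholders):

mount -t ceph 4.3.2.1:3300:/ /mnt/cephfs -o name=testclient,ms_mode=secure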
Is that the way cephfs works, that you have to explicitly
specify secure mode? I thought that cephfs clients would
use secure mode with these settings, but maybe I am wrong?
Of course we also plan to restrict the firewalls on the servers so that
only the specific subnet will be able to connect and mount cephfs.
From my understanding of the documentation, this would be the
way to set this up with ceph exposed to the internet.
Is there something that we are missing or something that would
make the setup more secure?
Many thanks in advance
Marcus
Hi,
We are testing rbd-mirroring. There seems to be a permission error with
the rbd-mirror user. Querying the mirror pool status with this user:
rbd --id rbd-mirror mirror pool status rbd
gives:
failed to query services: (13) Permission denied
and results in the following output:
health: UNKNOWN
daemon health: UNKNOWN
image health: OK
images: 3 total
2 replaying
1 stopped
So basically the health and daemon health cannot be obtained due to
permission errors, but the image status can.
When the command is run with admin permissions the health and daemon
health are returned without issue.
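Assuming the user was created with the standard profiles from the
mirroring docs, i.e. along the lines of:

ceph auth get-or-create client.rbd-mirror mon 'profile rbd-mirror' osd 'profile rbd'

its current caps can be double-checked with ceph auth get client.rbd-mirror.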
I tested this on Reef 18.2.2.
Is this expected behavior? If not, I will create a tracker ticket for it.
Gr. Stefan
Env:
- OS: Ubuntu 20.04
- Ceph Version: Octopus 15.0.0.1
- OSD disk: 2.9 TB NVMe
- Block storage (replication 3)
Symptom:
- Peering when an OSD node comes up is very slow. Peering speed varies from PG
to PG, and some PGs may even take 10 seconds. But there is no log output
during those 10 seconds.
- I checked the effect on client VMs: slow MySQL queries actually
occur at the same time.
Below are Ceph OSD logs for both the best and the worst case.
Best Peering Case (0.5 Seconds)
2024-04-11T15:32:44.693+0900 7f108b522700 1 osd.7 pg_epoch: 27368 pg[6.8]
state<Start>: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8]
state<Started/Primary/Peering>: Peering, affected_by_map, going to Reset
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27371 pg[6.8]
start_peering_interval up [7,6,11] -> [6,11], acting [7,6,11] -> [6,11],
acting_primary 7 -> 6, up_primary 7 -> 6, role 0 -> -1, features acting
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8]
state<Start>: transitioning to Primary
2024-04-11T15:32:45.165+0900 7f108f52a700 1 osd.7 pg_epoch: 27377 pg[6.8]
start_peering_interval up [6,11] -> [7,6,11], acting [6,11] -> [7,6,11],
acting_primary 6 -> 7, up_primary 6 -> 7, role -1 -> 0, features acting
Worst Peering Case (11.6 Seconds)
2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:45.169+0900 7f108b522700 1 osd.7 pg_epoch: 27377 pg[30.20]
start_peering_interval up [0,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:46.173+0900 7f108b522700 1 osd.7 pg_epoch: 27378 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,7,1] -> [0,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role 1 -> -1, features acting
2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20]
state<Start>: transitioning to Stray
2024-04-11T15:32:57.794+0900 7f108b522700 1 osd.7 pg_epoch: 27390 pg[30.20]
start_peering_interval up [0,7,1] -> [0,7,1], acting [0,1] -> [0,7,1],
acting_primary 0 -> 0, up_primary 0 -> 0, role -1 -> 1, features acting
*I wish to know*
- Why some PGs take 10 seconds until peering finishes.
- Why the Ceph log is quiet during peering.
- Whether this behavior is intended in Ceph.
*And please give some advice:*
- Is there any way to improve peering speed?
- Or is there a way to keep peering from affecting clients?
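In case it is useful, the state of an individual PG can be inspected
while it peers, e.g. (PG 30.20 and osd.7 taken from the worst case
above):

ceph pg 30.20 query                   # shows recovery_state and past intervals
ceph daemon osd.7 dump_historic_ops   # recent slow ops, run on the OSD host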
P.S.
- I observed the same symptoms in the following environments:
-> Octopus and Reef versions, deployed with both cephadm and ceph-ansible.
Hi,
I'm probably Doing It Wrong here, but: my hosts are in racks, and I
wanted ceph to use that information from the get-go, so I tried to
achieve this during bootstrap.
This has left me with a single sad pg:
[WRN] PG_AVAILABILITY: Reduced data availability: 1 pg inactive
pg 1.0 is stuck inactive for 33m, current state unknown, last acting []
ceph osd tree shows that CRUSH picked up my racks OK, e.g.:
-3 45.11993 rack B4
-2 45.11993 host moss-be1001
1 hdd 3.75999 osd.1 up 1.00000 1.00000
But root seems empty:
-1 0 root default
and if I decompile the crush map, indeed:
# buckets
root default {
id -1 # do not change unnecessarily
id -14 class hdd # do not change unnecessarily
# weight 0.00000
alg straw2
hash 0 # rjenkins1
}
which does indeed look empty, whereas I have rack entries that contain
the relevant hosts.
And the replication rule:
rule replicated_rule {
id 0
type replicated
step take default
step chooseleaf firstn 0 type rack
step emit
}
I passed this config to bootstrap with --config:
[global]
osd_crush_chooseleaf_type = 3
and an initial spec file with host entries like this:
service_type: host
hostname: moss-be1001
addr: 10.64.16.40
location:
rack: B4
labels:
- _admin
- NVMe
Once the cluster was up I used an osd spec file that looked like:
service_type: osd
service_id: rrd_single_NVMe
placement:
label: "NVMe"
spec:
data_devices:
rotational: 1
db_devices:
model: "NVMe"
I could presumably fix this up by editing the crushmap (to put the racks
into the default bucket), but what did I do wrong? Was this not a
reasonable thing to want to do with cephadm?
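If editing is the way to go, I assume the fix would be a one-liner per
rack rather than a full decompile/recompile, e.g.:

ceph osd crush move B4 root=default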
I'm running
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
Thanks,
Matthew
Hi,
I got into a weird and unexpected situation today. I added 6 hosts to
an existing Pacific cluster (16.2.13, 20 existing OSD hosts across 2
DCs). The hosts were added to the root=default subtree; their
designated location is one of the two datacenters underneath the
default root. Nothing unusual; I believe many people use different
subtrees to organize their clusters, as we do in our own (and we
haven't seen this issue there yet).
The main application is RGW, the main pool is erasure-coded (k=7,
m=11). The crush rule looks like this:
rule rule-ec-k7m11 {
id 1
type erasure
min_size 3
max_size 18
step set_chooseleaf_tries 5
step set_choose_tries 100
step take default class hdd
step choose indep 2 type datacenter
step chooseleaf indep 9 type host
step emit
}
After almost all peering had finished, the status showed 6 inactive +
peering PGs for a while. I had to fail the mgr because it didn't
report correct stats anymore; it then showed 16 unknown PGs. The
application noticed the (unexpected) disruption; after putting the
hosts into their designated crush bucket (datacenter) the situation
resolved. But I can't make any sense of it. I tried to reproduce it in
my lab environment (Quincy), but to no avail: in my tests it behaves
as expected, i.e. after new OSDs become active there are remapped PGs,
but nothing happens until I add them to their designated location.
I know I could have prevented this with either
osd_crush_initial_weight = 0 (then move the crush buckets, then
reweight), or by adding the crush buckets first, but usually I don't
need to bother with these things.
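For completeness, a sketch of the first workaround (host/datacenter
names and weights are placeholders):

ceph config set osd osd_crush_initial_weight 0   # new OSDs come up with weight 0
ceph osd crush move <host> datacenter=<dc>       # place the host bucket first
ceph osd crush reweight osd.<id> <weight>        # then set the real weight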
Does anyone have an explanation? I'd appreciate any comments.
Thanks!
Eugen
Ceph version 14.2.7
The ceph osd df tree command takes much longer than usual, but I can't
find out the reason. The monitor node still has plenty of available RAM
and CPU. I checked the monitor and mgr logs, but nothing seems useful.
I also checked an older cluster on version 13.2.10, where ceph osd df
tree still responded normally, so it does not seem to be version related.
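In case it helps to narrow things down, the timing can be measured and
in-flight mon operations checked like this (the daemon id is a
placeholder; run the second command on the mon host):

time ceph osd df tree
ceph daemon mon.<id> ops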
Anyone have any idea? Thanks
Dear Users,
I recently set up a new 3-node ceph cluster. The network is meshed
between all nodes (2 x 25G with DAC).
Storage is flash only (Kioxia 3.2 TB BiCS FLASH 3D TLC, KCMYXVUG3T20).
Ping tests between the nodes show the following latency:
# ping 10.1.3.13
PING 10.1.3.13 (10.1.3.13) 56(84) bytes of data.
64 bytes from 10.1.3.13: icmp_seq=1 ttl=64 time=0.145 ms
64 bytes from 10.1.3.13: icmp_seq=2 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=3 ttl=64 time=0.180 ms
64 bytes from 10.1.3.13: icmp_seq=4 ttl=64 time=0.115 ms
64 bytes from 10.1.3.13: icmp_seq=5 ttl=64 time=0.110 ms
64 bytes from 10.1.3.13: icmp_seq=6 ttl=64 time=0.120 ms
64 bytes from 10.1.3.13: icmp_seq=7 ttl=64 time=0.124 ms
64 bytes from 10.1.3.13: icmp_seq=8 ttl=64 time=0.140 ms
64 bytes from 10.1.3.13: icmp_seq=9 ttl=64 time=0.127 ms
64 bytes from 10.1.3.13: icmp_seq=10 ttl=64 time=0.143 ms
64 bytes from 10.1.3.13: icmp_seq=11 ttl=64 time=0.129 ms
--- 10.1.3.13 ping statistics ---
11 packets transmitted, 11 received, 0% packet loss, time 10242ms
rtt min/avg/max/mdev = 0.110/0.137/0.180/0.022 ms
On another cluster I get much better values, with 10G SFP+ and
fibre cables:
64 bytes from large-ipv6-ip: icmp_seq=42 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=43 ttl=64 time=0.078 ms
64 bytes from large-ipv6-ip: icmp_seq=44 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=45 ttl=64 time=0.075 ms
64 bytes from large-ipv6-ip: icmp_seq=46 ttl=64 time=0.071 ms
64 bytes from large-ipv6-ip: icmp_seq=47 ttl=64 time=0.081 ms
64 bytes from large-ipv6-ip: icmp_seq=48 ttl=64 time=0.074 ms
64 bytes from large-ipv6-ip: icmp_seq=49 ttl=64 time=0.085 ms
64 bytes from large-ipv6-ip: icmp_seq=50 ttl=64 time=0.077 ms
64 bytes from large-ipv6-ip: icmp_seq=51 ttl=64 time=0.080 ms
64 bytes from large-ipv6-ip: icmp_seq=52 ttl=64 time=0.084 ms
64 bytes from large-ipv6-ip: icmp_seq=53 ttl=64 time=0.084 ms
^C
--- large-ipv6-ip ping statistics ---
53 packets transmitted, 53 received, 0% packet loss, time 53260ms
rtt min/avg/max/mdev = 0.071/0.082/0.111/0.006 ms
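(For scale: the averages differ by 0.137 - 0.082 = 0.055 ms, i.e.
roughly 55 microseconds of extra round-trip time on the DAC cluster.)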
If I want the best performance, does the latency difference matter at
all? Should I swap the DACs for SFP+ transceivers with fibre cables to
improve overall ceph performance, or is this nitpicking?
Thanks a lot.
Stefan
Hi all,
I've almost got my ceph back to normal after a triple drive failure.
But it seems my lost+found folder is corrupted.
I've followed the process in
https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/#disaster-…
However, doing an online scrub (as there is still other damage) fails,
as it appears my lost+found inode is corrupted.
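For reference, the online scrub I mean is the one from the docs, of the
form (the filesystem name is a placeholder):

ceph tell mds.<fsname>:0 scrub start / recursive,repair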
Any advice is appreciated
The only thing I can think of is that at the beginning of one of the
steps it asks if I want to re-create something and, from what I could
read in other emails to the list, I thought answering no was the
correct answer, but I am now wondering if I did want to recreate those
entries.
(Specifically, it asks at the cephfs-data-scan init point.)
Thanks in advance,
ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
Arch Linux
3-node cluster
33 OSDs
Hi everyone,
I'm managing a Ceph Quincy 17.2.5 cluster, waiting to upgrade it to
version 17.2.7, composed and configured as follows:
- 16 identical nodes: 256 GB RAM, 32 CPU cores (64 threads), 12 x
rotational HDD (block) + 4 x SATA SSD (RocksDB/WAL)
- Erasure code 11+4 (Jerasure)
- 10 x S3 RGW on dedicated nodes (5 physical nodes)
- 3 x full-SSD dedicated nodes for replicated S3 pools
- 2 x 10 Gbit public network (LACP) + 2 x 10 Gbit cluster network (LACP)
- On all nodes: Ubuntu 20.04.4 LTS, kept up to date
- Ceph deployed in containers on Docker CE (docker-ce
5:20.10.17~3-0~ubuntu-focal)
All pools, except the EC data pool, are configured with replication 3
and stored on dedicated SSD devices on 3 dedicated nodes to guarantee
the necessary performance.
We have encountered a constant but random problem with the availability
of access to bucket data, along with many slow_ops on the rotational
OSDs (data pools), caused neither by saturation of the physical devices
nor by a lack of CPU/RAM on any node.
Slow_ops caused by some requests to some PGs are sometimes reported,
seemingly at random.
The cluster is currently in the recovery/rebalance phase for the
reconstruction of 3 HDDs that we had to recreate from scratch (all 3
HDDs are physically on the same node).
By analyzing the events, we observed the following in the status of
some OSDs impacted by slow_ops (captured 20/05/2024 10:19):
"description": "osd_op(client.186021790.0:57620 29.258s0
29:1a5928ea:::31497ca8-e7d6-4e53-b150-91f9ac02ac67.246100.6329_storage%2ffirstMemories%2f1010543%2f:head
[getxattrs,stat] snapc 0=[]
ondisk+read+known_if_redirected+supports_pool_eio e481205)",
"initiated_at": "2024-05-16T15:03:06.015956+0000",
"age": 963.83795990199997,
"duration": 963.83819621700002,
"type_data": {
"flag_point": "delayed",
"client_info": {
"client": "client.186021790",
"client_addr": "10.151.11.11:0/3913909849",
"tid": 57620
},
"events": [
{
"event": "initiated",
"time": "2024-05-16T15:03:06.015956+0000",
"duration": 0
},
{
"event": "throttled",
"time": "2024-05-16T15:03:06.015956+0000",
"duration": 0
},
{
"event": "header_read",
"time": "2024-05-16T15:03:06.015954+0000",
"duration": 4294967295.9999986
},
{
"event": "all_read",
"time": "2024-05-16T15:03:06.015961+0000",
"duration": 7.2300000000000002e-06
},
{
"event": "dispatched",
"time": "2024-05-16T15:03:06.015962+0000",
"duration": 1.063e-06
},
{
"event": "queued_for_pg",
"time": "2024-05-16T15:03:06.015966+0000",
"duration": 3.332e-06
},
{
"event": "reached_pg",
"time": "2024-05-16T15:03:06.015992+0000",
"duration": 2.6078e-05
},
{
"event": "waiting for readable",
"time": "2024-05-16T15:03:06.016002+0000",
"duration": 1.0348e-05
}
]
}
}
],
"num_ops": 6
}
###########
{
"event": "reached_pg",
"time": "2024-05-16T12:43:11.694220+0000",
"duration": 480.97258642200001
},
In essence, the operations remain suspended in this condition:
"event": "waiting for readable"
while trying to access some PGs.
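(For those who want to pull the same data: the excerpts are the kind of
output produced by ceph daemon osd.<id> dump_ops_in_flight, or
dump_historic_ops for already completed operations; the OSD id is a
placeholder.)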
We intend to upgrade as soon as the recovery/rebalance is completed.
Does anyone have any idea what checks I could run to analyze the
problem more thoroughly?
I can't tell whether the problem is the use of EC, or whether the data
written to some buckets is in a "non-standard" condition that causes
the access to wait for some reason.
Thank you all for your kindness.
Greetings
Andrea Martra
--
Andrea Martra
+39 393 9048451