Hello,
currently, we are experiencing problems with a cluster used for storing
RBD backups. Config:
* 8 nodes, each with 6 HDD OSDs and 1 SSD used for blockdb and WAL
* k=4 m=2 EC
* dual 25GbE NIC
* v14.2.8
ceph health detail shows the following messages:
HEALTH_WARN BlueFS spillover detected on 1 OSD(s); 45 pgs not
deep-scrubbed in time; snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on); 1 slow ops, oldest one blocked for
18629 sec, mon.cloud10-1517 has slow ops
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
osd.0 spilled over 68 MiB metadata from 'db' device (35 GiB used of
185 GiB) to slow device
PG_NOT_DEEP_SCRUBBED 45 pgs not deep-scrubbed in time
pg 18.3f5 not deep-scrubbed since 2020-09-03 21:58:28.316958
pg 18.3ed not deep-scrubbed since 2020-09-01 15:11:54.335935
[--- cut ---]
PG_SLOW_SNAP_TRIMMING snap trim queue for 2 pg(s) >= 32768
(mon_osd_snap_trim_queue_warn_on)
snap trim queue for pg 18.2c5 at 41630
snap trim queue for pg 18.d6 at 44079
longest queue on pg 18.d6 at 44079
try decreasing "osd snap trim sleep" and/or increasing "osd pg max
concurrent snap trims".
SLOW_OPS 1 slow ops, oldest one blocked for 18629 sec, mon.cloud10-1517
has slow ops
We've made some observations on that cluster:
* The BlueFS spillover goes away with "ceph tell osd.0 compact" but
comes back eventually (see the sketch after this list)
* The blockdb/WAL SSD is highly utilized, while the HDDs are not
* When one OSD fails, there is a cascade failure taking down many other
OSDs across all nodes. Most of the time, the cluster comes back when
setting the nodown flag and restarting all failed OSDs one by one
* Sometimes, especially during maintenance, "Long heartbeat ping times
on front/back interface seen, longest is 1390.076 msec" messages pop up
* The cluster performance deteriorates sharply when upgrading from
14.2.8 to 14.2.11 or later, so we've rolled back to 14.2.8
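Regarding the snap trim warning and the compaction workaround above, this
is the sketch we are considering (values are guesses and untested; as I
understand it, osd_snap_trim_sleep only matters if it was raised from its
default of 0):
ceph config set osd osd_pg_max_concurrent_snap_trims 4
ceph config set osd osd_snap_trim_sleep 0
for i in $(seq 0 47); do ceph tell osd.$i compact; done
The loop assumes our 48 OSDs are numbered consecutively.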
Of these problems, the OSD cascade failure is the most important, and has
been responsible for lengthy downtimes in the past few weeks.
Do you have any ideas on how to combat these problems?
Thank you,
Paul
--
Best regards
Paul Kramme
Your Profihost Team
We have had multiple clusters experiencing the following situation over the
past few months, on both 14.2.6 and 14.2.11. In a few instances it seemed
random; in a second situation we had a temporary networking disruption; in a
third situation we accidentally made some OSD changes which caused certain
OSDs to exceed the hard pgs-per-osd limit, leaving those PGs stuck
inactive. Regardless of the scenario, one monitor always falls out of
quorum, seemingly can't rejoin, and its logs contain the following:
2020-11-18 05:34:54.113 7f14286d9700 1 mon.a2plcephmon01@2(probing) e30
handle_auth_request failed to assign global_id
2020-11-18 05:34:54.295 7f14286d9700 -1 mon.a2plcephmon01@2(probing) e30
handle_auth_bad_method hmm, they didn't like 2 result (13) Permission denied
2020-11-18 05:34:55.397 7f14286d9700 1 mon.a2plcephmon01@2(probing) e30
handle_auth_request failed to assign global_id
Ultimately, we take the quick route of rebuilding the mon by wiping its DB
and re-doing a mkfs and rejoining the quorum using:
sudo -u ceph ceph-mon -i a2plcephmon01 --public-addr 10.1.1.1:3300
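(For completeness, the full sequence is roughly the standard rebuild from
the docs; paths are from memory and assume the default "ceph" cluster name,
so treat this as a sketch:
systemctl stop ceph-mon@a2plcephmon01
mv /var/lib/ceph/mon/ceph-a2plcephmon01 /var/lib/ceph/mon/ceph-a2plcephmon01.bak
ceph auth get mon. -o /tmp/keyring
ceph mon getmap -o /tmp/monmap
sudo -u ceph ceph-mon -i a2plcephmon01 --mkfs --monmap /tmp/monmap --keyring /tmp/keyring
followed by the start command above.)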
Luckily we have never dropped below 50% in quorum so far but we are very
interested in preventing this from happening going forward. Inspecting
"ceph mon dump" on the given clusters I see that all of the rebuilt mons
use only msgrv2 on 3300 but all of the mons that never required rebuilding
are using v1 and v2 addressing. So my questions are:
- Does this monitor failure sound familiar?
- Is the manner in which we are rebuilding mons problematic, given that it
leaves them using only msgr v2? Or is v2-only fine on an
all-Nautilus cluster?
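(For what it's worth, if the fix is just the address list, I am guessing
the rebuilt mons could be given both protocols with something like:
ceph mon set-addrs a2plcephmon01 '[v2:10.1.1.1:3300,v1:10.1.1.1:6789]'
but I have not tried that and would appreciate confirmation.)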
Thanks for any insight.
Respectfully,
*Wes Dillingham*
wes(a)wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
Hi All,
I've been using ceph block and object storage for years but am just
wandering into cephfs now (Nautilus, all servers on 14.2.9).
I created small data and metadata pools and a new filesystem, then used:
ceph fs authorize <fsname> client.<name> / rw
to create two new users to mount it, one using fuse (14.2.9) and one using
the kernel client (Ubuntu 20.04, kernel 5.4.0-53); both can mount it.
So far so good, but then it gets "weird": I can perform metadata
operations like "mkdir" and "touch" but cannot actually write any data:
testy-mctestface% touch /mnt/cephfs/boo
testy-mctestface% echo foo > /mnt/cephfs/boo
echo: write error: operation not permitted
The auth caps look good to me, but they also seem the most likely thing to be wrong:
root@ceph-mon0:/ # ceph auth get client.client0
exported keyring for client.client0
[client.client0]
key = <SEKRET>
caps mds = "allow rw"
caps mon = "allow r"
caps osd = "allow rw tag cephfs data=<fsname>"
is "data" hear supposed to be <fsname> or <data_pool_name>? Presumably
it's fsname since that what the "fa authorize" put there and it should
know...
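One thing I plan to check (assuming I understand the pool application
tagging correctly) is whether the data pool actually carries that tag:
ceph osd pool application get <data_pool_name> cephfs
which, if I read the docs right, should report "data": "<fsname>".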
can anyone see what I'm doing wrong here?
Thanks,
-Jon
Hi,
I have a cluster on 14.2.8.
I created the OSDs with a dedicated PCIe device for wal/db when deploying
the cluster, setting 72G for db and 3G for wal on each OSD.
And now my cluster has been stuck in a WARN state for a long time.
# ceph health detail
HEALTH_WARN BlueFS spillover detected on 1 OSD(s)
BLUEFS_SPILLOVER BlueFS spillover detected on 1 OSD(s)
osd.63 spilled over 33 MiB metadata from 'db' device (1.5 GiB used of
72 GiB) to slow device
I searched on Google and found https://tracker.ceph.com/issues/38745
I'm not sure if it's the same issue.
How can I deal with this?
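The only workaround I have seen mentioned so far is a manual compaction,
which I have not tried yet:
# ceph tell osd.63 compact
and then watching whether the spillover returns, e.g. via:
# ceph daemon osd.63 perf dump bluefs
(the second command has to be run on the host carrying osd.63, if I
remember correctly).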
THANKS
Hi,
I ran those commands as usual:
$ ceph orch host ls
Result is as expected, with host names and addresses.
$ ceph orch ls
Again, expected result as before.
Then I started to upgrade via this command:
$ ceph orch upgrade start --ceph-version 15.2.6
It failed with attached logs. Please see logs.txt for detailed output.
It says:
orchestrator._interface.OrchestratorError: DaemonDescription: Cannot
calculate service_id: daemon_id='cephfs.host-1.shsruc' hostname='host'
This cluster has 2 nodes. Then I just ran the same command again:
$ ceph orch ls
Result is NOT as expected. I got:
>> Error EINVAL: DaemonDescription: Cannot calculate service_id:
daemon_id='cephfs.host-1.shsruc' hostname='host'
(See attached logs.)
If I restart "host-1" then I can execute the "ceph orch ls" command. After
I start the upgrade, it errors again. Interestingly, I only have cephfs
mounted on node "host", not on "host-1".
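In case it helps with debugging, I will also capture the per-host daemon
list (assuming "ceph orch ps" accepts a hostname, as I believe it does):
$ ceph orch ps host-1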
How can I fix this issue?
Thanks,
Gencer.
Hi,
We are having slow OSDs... a hot topic to search on... I've tried to
dive as deep as I can, but I need to know which debug setting will help me
dive even deeper...
Okay, the situation:
- After expansion, lots of backfill operations are running, spread over the
OSDs.
- max_backfills is set to 1 (meaning only 1 read and 1 write operation
may take place at the same time for backfill; set as sketched after this list)
- Disks are monitored and manual compaction is performed as soon as load
goes over 80%
- Requests through RGW go okay for some objects, and less okay for others.
When it goes fine, requests are returned within a few ms; otherwise they
vary from a few ms to >10s.
- Ceph version 14.2.11
- Erasure coding configured, 6-3
- Backfill is running at a terribly slow pace. I assume it's all
related.
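(For reference, the backfill throttle was applied via the config database,
roughly like this sketch:
ceph config set osd osd_max_backfills 1
ceph config show osd.156 | grep osd_max_backfills
where the second command just verifies the running value.)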
Okay, what have I done to narrow it down:
1. Removed 1 RGW from the load balancer to increase logging and dive into it
2. Using s3cmd to request the object teaches me that the delay is between
the request from the RGW to the OSD and the response.
3. Investigating the OSD using iostat doesn't reveal an issue; load is very
low.
4. Next I increased logging on the OSD (debug_osd 20/20) and launched a
request (the debug setting was reverted immediately afterwards)...
5. First you see that it is starting with the block for getting all
available shards... so far so good
6. Next we see multiple blocks like:
2020-11-19 11:53:21.938 7fe90f79c700 15 osd.156 45341 *enqueue_op*
0x559f94d43ce0 prio 127 cost 0 latency 0.000032 epoch 45341
*MOSDECSubOpRead*(1.1e2s0 45341/45313 ECSubRead(tid=191,
to_read={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=0,28672,0},
subchunks={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=[0,1]},
attrs_to_read=)) v3
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 45341 *dequeue_op*
0x559f94d43ce0 prio 127 cost 0 latency 0.000135 *MOSDECSubOpRead*(1.1e2s0
45341/45313 ECSubRead(tid=191,
to_read={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=0,28672,0},
subchunks={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=[0,1]},
attrs_to_read=)) v3 pg pg[1.1e2s0( v 45295'4373744
(34740'4370700,45295'4373744] local-lis/les=45313/45317 n=1842926
ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0 45313/45313/45313)
[156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313 crt=45295'4373744
lcod 0'0 mlcod 0'0 active+clean]
2020-11-19 11:53:21.938 7fe90678a700 20 osd.156 45341 share_map osd.156 v2:
192.168.34.57:6892/1464289 45341
2020-11-19 11:53:21.938 7fe90678a700 20 osd.156 45341 should_share_map
osd.156 v2:192.168.34.57:6892/1464289 45341
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] _handle_message:
MOSDECSubOpRead(1.1e2s0 45341/45313 ECSubRead(tid=191,
to_read={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=0,28672,0},
subchunks={1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head=[0,1]},
attrs_to_read=)) v3
2020-11-19 11:53:21.938 7fe90678a700 20 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read read
request=28672 r=28672 len=28672
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] get_hash_info: Getting
attr on
1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] get_hash_info: not in
cache
1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] get_hash_info: found on
disk, size 28672
2020-11-19 11:53:21.938 7fe90678a700 20 osd.156 pg_epoch: 45341 pg[1.1e2s0(
v 45295'4373744 (34740'4370700,45295'4373744] local-lis/les=45313/45317
n=1842926 ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0
45313/45313/45313) [156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313
crt=45295'4373744 lcod 0'0 mlcod 0'0 active+clean] handle_sub_read:
Checking hash of
1:47c1b09f:::44a57332-bad8-47b4-85d6-6d34be960d5f.51962.196_CT%2f20201001%2f82078cfd-a111-4b58-8b79-88a860915c56.PNG:head
2020-11-19 11:53:21.938 7fe90678a700 20 osd.156 45341 share_map_peer
0x559f71ee3400 don't know epoch, doing nothing
2020-11-19 11:53:21.938 7fe90678a700 10 osd.156 45341 dequeue_op
0x559f94d43ce0 finish
.... This goes on...
The latency occurs between such blocks. E.g.:
Last line of a block of logging:
...
2020-11-19 11:53:*22.150* 7fe90678a700 10 osd.156 45341 dequeue_op
0x559f96a5f600 finish
2020-11-19 11:53:*22.150* 7fe90678a700 20 osd.156 op_wq(2) _process empty
q, waiting
...
Next block of logging:
2020-11-19 11:53:*27.089* 7fe925251700 15 osd.156 45341 enqueue_op
0x559f8f9618c0 prio 127 cost 0 latency 0.000019 epoch 45341
*MOSDECSubOpReadReply*(1.1e2s0 45341/45313 *ECSubReadReply*(tid=191,
attrs_read=0)) v2
2020-11-19 11:53:27.089 7fe925251700 20 osd.156 op_wq(2) _enqueue
OpQueueItem(1.1e2s0 PGOpItem(op=*MOSDECSubOpReadReply*(1.1e2s0 45341/45313
ECSubReadReply(tid=191, attrs_read=0)) v2) prio 127 cost 0 e45341)
2020-11-19 11:53:27.089 7fe90678a700 20 osd.156 op_wq(2) _process 1.1e2s0
to_process <> waiting <> waiting_peering {}
2020-11-19 11:53:27.089 7fe90678a700 20 osd.156 op_wq(2) _process
OpQueueItem(1.1e2s0 PGOpItem(op=*MOSDECSubOpReadReply*(1.1e2s0 45341/45313
ECSubReadReply(tid=191, attrs_read=0)) v2) prio 127 cost 0 e45341) queued
2020-11-19 11:53:27.089 7fe90678a700 20 osd.156 op_wq(2) _process 1.1e2s0
to_process <OpQueueItem(1.1e2s0 PGOpItem(op=MOSDECSubOpReadReply(1.1e2s0
45341/45313 ECSubReadReply(tid=191, attrs_read=0)) v2) prio 127 cost 0
e45341)> waiting <> waiting_peering {}
2020-11-19 11:53:27.089 7fe90678a700 20 osd.156 op_wq(2) _process
OpQueueItem(1.1e2s0 PGOpItem(op=*MOSDECSubOpReadReply*(1.1e2s0 45341/45313
*ECSubReadReply*(tid=191, attrs_read=0)) v2) prio 127 cost 0 e45341) pg
0x559f8b5e5000
2020-11-19 11:53:27.089 7fe90678a700 10 osd.156 45341 dequeue_op
0x559f8f9618c0 prio 127 cost 0 latency 0.000092
MOSDECSubOpReadReply(1.1e2s0 45341/45313 ECSubReadReply(tid=191,
attrs_read=0)) v2 pg pg[1.1e2s0( v 45295'4373744
(34740'4370700,45295'4373744] local-lis/les=45313/45317 n=1842926
ec=5869/714 lis/c 45313/45313 les/c/f 45317/45322/0 45313/45313/45313)
[156,64,146,256,98,67,46,31,100]p156(0) r=0 lpr=45313 crt=45295'4373744
lcod 0'0 mlcod 0'0 active+clean]
Now my questions are:
What triggers this enqueue_op, and why does it take 5 seconds to arrive?
Which debug setting can help me find that out?
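My current guess is that messenger-level logging might show where the reply
spends those 5 seconds, so unless someone has a better idea I will try
something like this (assuming debug_ms is the right knob here):
ceph tell osd.156 config set debug_ms 1
... reproduce the slow request ...
ceph tell osd.156 config set debug_ms 0/5
where the last command restores what I believe is the default level.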
Many thanks for the feedback,
Kristof
Hi,
Is it a problem if ec_overwrites is enabled on the data pool?
https://docs.ceph.com/en/latest/rados/operations/erasure-code/#erasure-codi…
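To be precise, I mean the setting described on that page, i.e. enabling it
with:
ceph osd pool set <data pool> allow_ec_overwrites true
(the current state can be checked with "ceph osd pool get <data pool>
allow_ec_overwrites", if I remember the command correctly).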
Thanks
Hi,
I have a use case where the user would like to have 50000 buckets.
Is this normal for Ceph, or is it just too much?
The reason they want this level of granularity is that they might need to
clean out the buckets for a specific subset without affecting the others.
Bucket format is this:
PR_PAGETPYE-<PAGE_TYPE_ID>_DEVICE-<DEVICE_ID>_WHITELABEL-<WHITELABEL_ID>_LANGUAGE-<LANGUAGE_ID>_SUBTYPE-<SUBTYPE_ID>/<OBJECT_ID>
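If we go this route, my understanding is that the default per-user bucket
limit (1000, I believe) would also need raising, roughly:
radosgw-admin user modify --uid=<user> --max-buckets=50000
where <user> is a placeholder for the actual uid.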
Thank you
Dear cephers,
I have a problem with MGR daemons, ceph version mimic-13.2.8. I'm trying to do maintenance on our MON/MGR servers and am through with 2 out of 3. I have MON and MGR collocated on a host, 3 hosts in total. So far, the procedure was to stop the daemons on the server and do the maintenance. Now I'm stuck at the last server, because MGR fail-over does not work: the remaining MGR instances go into a restart loop.
In an attempt to mitigate this, I stopped all but 1 MGR, on a node that is done with maintenance. Everything fine. However, as soon as I stop the last MON I need to do maintenance on, the last remaining MGR goes into a restart loop all by itself. As far as I can see, the MGR does not actually restart, it just gets thrown out of the cluster. Here is a ceph status before stopping mon.ceph-01:
[root@ceph-01 ~]# ceph status
cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
services:
mon: 3 daemons, quorum ceph-01,ceph-02,ceph-03
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
data:
pools: 11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage: 877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3208 active+clean
7 active+clean+scrubbing+deep
As soon as I stop mon.ceph-01, all hell breaks loose. Note that mgr.ceph-03 is collocated with mon.ceph-03 and we have quorum between mon.ceph-02 and mon.ceph-03. Here are ceph status snapshots after shutting down mon.ceph-01:
[root@ceph-01 ~]# ceph status
cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active)
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
data:
pools: 11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage: 877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
8 active+clean+scrubbing+deep
[root@ceph-01 ~]# ceph status
cluster:
id: xxx
health: HEALTH_WARN
no active mgr
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: no daemons active
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
data:
pools: 11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage: 877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
8 active+clean+scrubbing+deep
[root@ceph-01 ~]# ceph status
cluster:
id: xxx
health: HEALTH_WARN
1 pools nearfull
1/3 mons down, quorum ceph-02,ceph-03
services:
mon: 3 daemons, quorum ceph-02,ceph-03, out of quorum: ceph-01
mgr: ceph-03(active, starting)
mds: con-fs2-1/1/1 up {0=ceph-08=up:active}, 1 up:standby-replay
osd: 302 osds: 281 up, 281 in
data:
pools: 11 pools, 3215 pgs
objects: 334.1 M objects, 689 TiB
usage: 877 TiB used, 1.1 PiB / 1.9 PiB avail
pgs: 3207 active+clean
8 active+clean+scrubbing+deep
It keeps cycling through these 3 states and I can't find a reason why. The node ceph-01 is not special in any way.
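If logs would help, I can raise the debug levels on the surviving daemons
before reproducing, e.g. (my guess at the useful knobs; the second command
would have to be run locally on ceph-03):
ceph tell mon.ceph-03 injectargs '--debug_mon 10/10'
ceph daemon mgr.ceph-03 config set debug_mgr 10/10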
Any hint would be greatly appreciated.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14