Hi,
I've added 4 NVMe hosts with 2 OSDs per NVMe drive to my cluster, and it made all the SSD OSDs flap; I don't understand why.
Everything is under the same root, but with 2 different device classes, nvme and ssd.
The pools are on the SSDs; there is nothing on the NVMe devices at the moment.
The only way to bring the SSD OSDs back up is to shut down the NVMe hosts.
The new NVMe servers have 25 Gbit NICs; the old servers and the mons have 10 Gbit NICs, but in an aggregated (bonded) setup.
This is the CRUSH rule dump:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_ssd",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -21,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "replicated_nvme",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -10,
                "item_name": "default~nvme"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
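For reference, both rules were created with the usual device-class syntax, something like the sketch below (not my exact commands); the shadow-tree check just shows how CRUSH splits the two classes:

    # replicated rule limited to one device class (root "default", failure domain "host")
    ceph osd crush rule create-replicated replicated_ssd default host ssd
    ceph osd crush rule create-replicated replicated_nvme default host nvme
    # inspect the per-class shadow hierarchy CRUSH actually uses
    ceph osd crush tree --show-shadow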
This is the OSD tree:
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-19 561.15057 root default
-1 38.03099 host server-2001
0 ssd 2.00000 osd.0 up 1.00000 1.00000
10 ssd 6.98499 osd.10 up 1.00000 1.00000
11 ssd 6.98599 osd.11 up 1.00000 1.00000
12 ssd 2.29799 osd.12 up 1.00000 1.00000
13 ssd 2.29799 osd.13 up 1.00000 1.00000
14 ssd 3.49300 osd.14 up 1.00000 1.00000
41 ssd 6.98499 osd.41 up 1.00000 1.00000
42 ssd 6.98599 osd.42 up 1.00000 1.00000
-3 38.03099 host server-2002
1 ssd 2.00000 osd.1 up 1.00000 1.00000
24 ssd 6.98499 osd.24 up 1.00000 1.00000
25 ssd 6.98599 osd.25 up 1.00000 1.00000
27 ssd 2.29799 osd.27 up 1.00000 1.00000
28 ssd 2.29799 osd.28 up 1.00000 1.00000
29 ssd 3.49300 osd.29 up 1.00000 1.00000
43 ssd 6.98499 osd.43 up 1.00000 1.00000
44 ssd 6.98599 osd.44 up 1.00000 1.00000
-6 38.03000 host server-2003
2 ssd 2.00000 osd.2 up 1.00000 1.00000
26 ssd 6.98499 osd.26 up 1.00000 1.00000
38 ssd 2.29999 osd.38 up 1.00000 1.00000
39 ssd 2.29500 osd.39 up 1.00000 1.00000
40 ssd 3.49300 osd.40 up 1.00000 1.00000
45 ssd 6.98499 osd.45 up 1.00000 1.00000
46 ssd 6.98599 osd.46 up 1.00000 1.00000
47 ssd 6.98599 osd.47 up 1.00000 1.00000
-17 111.76465 host server-2004
5 nvme 6.98529 osd.5 down 0 1.00000
9 nvme 6.98529 osd.9 down 0 1.00000
18 nvme 6.98529 osd.18 down 0 1.00000
22 nvme 6.98529 osd.22 down 0 1.00000
32 nvme 6.98529 osd.32 down 0 1.00000
36 nvme 6.98529 osd.36 down 0 1.00000
50 nvme 6.98529 osd.50 down 0 1.00000
54 nvme 6.98529 osd.54 down 0 1.00000
58 nvme 6.98529 osd.58 down 0 1.00000
62 nvme 6.98529 osd.62 down 0 1.00000
66 nvme 6.98529 osd.66 down 0 1.00000
70 nvme 6.98529 osd.70 down 0 1.00000
74 nvme 6.98529 osd.74 down 0 1.00000
78 nvme 6.98529 osd.78 down 0 1.00000
82 nvme 6.98529 osd.82 down 0 1.00000
86 nvme 6.98529 osd.86 down 0 1.00000
-14 111.76465 host server-2005
4 nvme 6.98529 osd.4 down 0 1.00000
8 nvme 6.98529 osd.8 down 0 1.00000
17 nvme 6.98529 osd.17 down 0 1.00000
21 nvme 6.98529 osd.21 down 0 1.00000
31 nvme 6.98529 osd.31 down 0 1.00000
35 nvme 6.98529 osd.35 down 0 1.00000
49 nvme 6.98529 osd.49 down 0 1.00000
53 nvme 6.98529 osd.53 down 0 1.00000
57 nvme 6.98529 osd.57 down 0 1.00000
61 nvme 6.98529 osd.61 down 0 1.00000
65 nvme 6.98529 osd.65 down 0 1.00000
69 nvme 6.98529 osd.69 down 0 1.00000
73 nvme 6.98529 osd.73 down 0 1.00000
77 nvme 6.98529 osd.77 down 0 1.00000
81 nvme 6.98529 osd.81 down 0 1.00000
85 nvme 6.98529 osd.85 down 0 1.00000
-22 111.76465 host server-2006
6 nvme 6.98529 osd.6 down 0 1.00000
15 nvme 6.98529 osd.15 down 0 1.00000
19 nvme 6.98529 osd.19 down 0 1.00000
23 nvme 6.98529 osd.23 down 0 1.00000
33 nvme 6.98529 osd.33 down 0 1.00000
37 nvme 6.98529 osd.37 down 0 1.00000
51 nvme 6.98529 osd.51 down 0 1.00000
55 nvme 6.98529 osd.55 down 0 1.00000
59 nvme 6.98529 osd.59 down 0 1.00000
63 nvme 6.98529 osd.63 up 0 1.00000
67 nvme 6.98529 osd.67 down 0 1.00000
71 nvme 6.98529 osd.71 up 0 1.00000
75 nvme 6.98529 osd.75 down 0 1.00000
79 nvme 6.98529 osd.79 down 0 1.00000
83 nvme 6.98529 osd.83 down 0 1.00000
87 nvme 6.98529 osd.87 down 0 1.00000
-11 111.76465 host server-2007
3 nvme 6.98529 osd.3 down 0 1.00000
7 nvme 6.98529 osd.7 down 0 1.00000
16 nvme 6.98529 osd.16 down 0 1.00000
20 nvme 6.98529 osd.20 down 0 1.00000
30 nvme 6.98529 osd.30 down 0 1.00000
34 nvme 6.98529 osd.34 down 0 1.00000
48 nvme 6.98529 osd.48 down 0 1.00000
52 nvme 6.98529 osd.52 down 0 1.00000
56 nvme 6.98529 osd.56 down 0 1.00000
60 nvme 6.98529 osd.60 down 0 1.00000
64 nvme 6.98529 osd.64 down 0 1.00000
68 nvme 6.98529 osd.68 down 0 1.00000
72 nvme 6.98529 osd.72 down 0 1.00000
76 nvme 6.98529 osd.76 down 0 1.00000
80 nvme 6.98529 osd.80 down 0 1.00000
84 nvme 6.98529 osd.84 down 0 1.00000
Pool info:
pool 21 'dbs-realtime-staging-client' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 27611 lfor 0/27512/27510 flags hashpspool,selfmanaged_snaps max_bytes 9999757606912 stripe_width 0 application rbd
pool 24 'dbs-realtime-staging-w-financedb' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 27613 flags hashpspool,selfmanaged_snaps max_bytes 19999515213824 stripe_width 0 application rbd
pool 25 'dbs-realtime-staging-w-dstest' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 27813 lfor 0/0/23856 flags hashpspool,selfmanaged_snaps max_bytes 99857989632 stripe_width 0 application rbd
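To double-check that nothing is mapped onto the new NVMe OSDs, I can list PGs per OSD, something like this (a sketch; osd.5 is just one of the new NVMe OSDs from the tree above):

    # should return no PGs if the nvme class really carries no pools yet
    ceph pg ls-by-osd osd.5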
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo(a)agoda.com
---------------------------------------------------
Thanks Marc.
That means we can upgrade from Luminous to Nautilus and later migrate the
OSDs from ceph-disk to ceph-volume.
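For that later migration, the adoption path on Nautilus should roughly be the ceph-volume "simple" mode (a sketch, not yet tested on our cluster):

    # record the existing running ceph-disk OSDs into /etc/ceph/osd/*.json
    ceph-volume simple scan
    # create systemd units so the OSDs start without ceph-disk/udev
    ceph-volume simple activate --all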
On Thu, Jul 8, 2021 at 5:45 PM Marc <Marc(a)f1-outsourcing.eu> wrote:
> I did the same upgrade from Luminous to Nautilus, and still have OSDs
> created with ceph-disk. I am slowly migrating to LVM and encryption.
>
> However, I did have some issues with OSDs not starting: you have to check
> the run levels and make sure the symlinks are still correct. I also had
> a case where I had to change the ownership of /dev/sdX before an OSD would start.
>
> I still have this in my rc.local
>
> chown ceph.ceph /dev/sdb2
> chown ceph.ceph /dev/sdc2
> chown ceph.ceph /dev/sdd2
> chown ceph.ceph /dev/sde2
> chown ceph.ceph /dev/sdf2
> chown ceph.ceph /dev/sdg2
> chown ceph.ceph /dev/sdh2
> chown ceph.ceph /dev/sdi2
> chown ceph.ceph /dev/sdj2
> chown ceph.ceph /dev/sdk2
>
>
> > -----Original Message-----
> > From: M Ranga Swami Reddy <swamireddy(a)gmail.com>
> > Sent: Thursday, 8 July 2021 11:49
> > To: ceph-devel <ceph-devel(a)vger.kernel.org>; ceph-users <ceph-
> > users(a)ceph.com>
> > Subject: [ceph-users] Fwd: ceph upgrade from luminous to nautils
> >
> > ---------- Forwarded message ---------
> > From: M Ranga Swami Reddy <swamireddy(a)gmail.com>
> > Date: Thu, Jul 8, 2021 at 2:30 PM
> > Subject: ceph upgrade from luminous to nautils
> > To: ceph-devel <ceph-devel(a)vger.kernel.org>
> >
> >
> > Dear All,
> > I am using Ceph Luminous with 2000+ OSDs.
> > Planning to upgrade from Luminous to Nautilus.
> > Currently, all OSDs are deployed via ceph-disk.
> > Can I proceed with this upgrade?
> > Will the ceph-disk OSDs work with ceph-volume (since ceph-disk was
> > deprecated
> > in the Mimic release)?
> >
> > Please advise.
> >
> > Thanks
> > Swami
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
Hi,
After upgrading from 15.2.8 to 15.2.13 with cephadm on CentOS 8
(containerised installation done by cephadm), Grafana no longer shows
new data. Additionally, when accessing the dashboard URL on a host
that is not currently hosting the dashboard, I am redirected to a wrong
hostname (as shown in ceph mgr services).
I assume that this is caused by the same reason which leads to this
output of `ceph mgr services`:
{
    "dashboard": "https://ceph-<cluster-id>-mgr.iceph-11.tsmsqs:8443/",
    "prometheus": "http://ceph-<cluster-id>-mgr.iceph-11.tsmsqs:9283/"
}
The correct hostname is iceph-11 (without the tsmsqs part), FQDN is
iceph-11.servernet. The hosts use DNS, the names (iceph-11 and
iceph-11.servernet) are resolvable both from the hosts as well as from
within the Podman containers.
I have determined that podman by default sets the container name as a
hostname alias (visible with `hostname -a` within the container), which
somehow leads to Ceph mgr picking it up as the primary name?
My workaround is to modify
/var/lib/ceph/<cluster-id>/mgr.<hostname>.<random-6-char-string>/unit.run,
adding --no-hosts as an additional argument to the "podman run" command.
I could probably use a system-wide containers.conf as well.
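For example, something like this in /etc/containers/containers.conf should be equivalent (untested on my side):

    [containers]
    # tell podman not to manage /etc/hosts for the container (same effect as --no-hosts)
    no_hosts = true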
With this workaround and after restarting the Ceph mgr container (via
systemctl) and then restarting Prometheus and Grafana (with ceph orch
redeploy), I once again get data in Grafana and the correct redirect for
the dashboard. `ceph mgr services` also shows expected and correct values.
I am wondering if this kind of issue is known or whether there is
something wrong with my setup. I expected Ceph mgr to use the primary
hostname and not some seemingly random hostname alias. Maybe this issue
can also be discussed in a troubleshooting section of the monitoring
stack documentation.
Cheers
Sebastian
We're running a rook-ceph cluster that has gotten stuck in "1 MDSs behind
on trimming".
* 1 filesystem, three active MDS servers each with standby
* Quite a few files (20M objects), daily snapshots. This might be a
problem?
* Ceph pacific 16.2.4
* `ceph health detail` doesn't provide much help (see below)
* num_segments is very slowly increasing over time
* Restarting all of the MDSs returns to the same point.
* moderate CPU usage for each MDS server (~30% for the stuck one, ~80% of a
core for the others)
* logs for the stuck MDS look clean; it hits rejoin_joint_start and then the
standard 'updating MDS map to version XXX' messages
* `ceph daemon mds.x ops` shows no active ops on each of the MDS servers
* `mds_log_max_segments` is set to 128; setting it to a higher number makes
the warning go away, but the filesystem remains degraded, and setting it
back to 128 shows num_segments has not changed.
* I've tried playing around with other MDS settings based on various posts
on this list and elsewhere, to no avail
* `cephfs-journal-tool journal inspect` for each rank says journal
integrity is fine.
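For reference, roughly the commands involved in the checks above (a sketch; the daemon name and rank are examples from this cluster):

    # per-MDS journal counters (num_segments, segment expiry, etc.)
    ceph daemon mds.myfs-d perf dump mds_log
    # temporary bump of the trim threshold (reverted afterwards)
    ceph config set mds mds_log_max_segments 256
    # journal check per rank
    cephfs-journal-tool --rank=myfs:2 journal inspect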
Something similar happened last week and (probably by accident by
removing/adding nodes?) I got the MDSs to start recovering and the
filesystem went back to healthy.
I'm at a bit of a loss for what else to try.
Thanks!
Zack
`ceph health detail`
HEALTH_WARN mons are allowing insecure global_id reclaim; 1 filesystem is
degraded; 1 MDSs behind on trimming; mon x is low on available space
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure global_id reclaim
    mon.x has auth_allow_insecure_global_id_reclaim set to true
    mon.ad has auth_allow_insecure_global_id_reclaim set to true
    mon.af has auth_allow_insecure_global_id_reclaim set to true
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs myfs is degraded
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.myfs-d(mds.2): Behind on trimming (340/128) max_segments: 128, num_segments: 340
[WRN] MON_DISK_LOW: mon x is low on available space
    mon.x has 22% avail
`ceph config get mds`
WHO MASK LEVEL OPTION VALUE RO
global basic log_file *
global basic log_to_file false
mds basic mds_cache_memory_limit 17179869184
mds advanced mds_cache_trim_decay_rate 1.000000
mds advanced mds_cache_trim_threshold 1048576
mds advanced mds_log_max_segments 128
mds advanced mds_recall_max_caps 5000
mds advanced mds_recall_max_decay_rate 2.500000
global advanced mon_allow_pool_delete true
global advanced mon_allow_pool_size_one true
global advanced mon_cluster_log_file
global advanced mon_pg_warn_min_per_osd 0
global advanced osd_pool_default_pg_autoscale_mode on
global advanced osd_scrub_auto_repair true
global advanced rbd_default_features 3
Hello
I have some experience with RBD clusters (for use with KVM/libvirt) but
now I'm building my first cluster to use with RGW.
The RGW cluster will be around 70 TB raw; the current RBD cluster(s)
are of similar (or smaller) size. I'll be deploying Octopus.
Since most of the tuning is quite different (large number of PGs,
bluestore_compression_*, bluestore_min_alloc_size_*), I wonder whether it
makes sense to run both workloads in the same cluster or whether it would
be better to have dedicated clusters.
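To make that concrete, the kind of per-workload settings I have in mind (just examples, not recommendations):

    # RGW-oriented cluster: compression and a smaller allocation unit for small objects
    ceph config set osd bluestore_compression_mode aggressive
    # note: min_alloc_size only applies to OSDs created after the change
    ceph config set osd bluestore_min_alloc_size_hdd 4096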
On the other hand, bigger clusters are (AFAIK) more stable. What are other
people doing: a single cluster for all workloads, or a cluster per workload?
thanks!
PS: Asking for the future, what about CephFS? Should it share a cluster?
--
IRC: gfa
GPG: 0x27263FA42553615F904A7EBE2A40A2ECB8DAD8D5
OLD GPG: 0x44BB1BA79F6C6333
---------- Forwarded message ---------
From: M Ranga Swami Reddy <swamireddy(a)gmail.com>
Date: Thu, Jul 8, 2021 at 2:30 PM
Subject: ceph upgrade from luminous to nautils
To: ceph-devel <ceph-devel(a)vger.kernel.org>
Dear All,
I am using Ceph Luminous with 2000+ OSDs.
Planning to upgrade from Luminous to Nautilus.
Currently, all OSDs are deployed via ceph-disk.
Can I proceed with this upgrade?
Will the ceph-disk OSDs work with ceph-volume (since ceph-disk was deprecated
in the Mimic release)?
Please advise.
Thanks
Swami
Hi,
We've done our fair share of Ceph cluster upgrades since Hammer, and
have not seen many problems with them. I'm now at the point where I have
to upgrade a rather large cluster running Luminous, and I would like to
hear from other users about issues I can expect, so that I can anticipate
them beforehand.
As said, the cluster is running Luminous (12.2.13) and has the following
services active:
  services:
    mon: 3 daemons, quorum osdnode01,osdnode02,osdnode04
    mgr: osdnode01(active), standbys: osdnode02, osdnode03
    mds: pmrb-3/3/3 up {0=osdnode06=up:active,1=osdnode08=up:active,2=osdnode07=up:active}, 1 up:standby
    osd: 116 osds: 116 up, 116 in;
    rgw: 3 daemons active
Of the OSDs, 11 are SSDs and 105 are HDDs. The capacity of the cluster
is 1.01 PiB.
We have 2 active CRUSH rules on 18 pools. All pools have a size of 3; there is a total of 5760 PGs.
{
    "rule_id": 1,
    "rule_name": "hdd-data",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -10,
            "item_name": "default~hdd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
},
{
    "rule_id": 2,
    "rule_name": "ssd-data",
    "ruleset": 2,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -21,
            "item_name": "default~ssd"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
rbd -> crush_rule: hdd-data
.rgw.root -> crush_rule: hdd-data
default.rgw.control -> crush_rule: hdd-data
default.rgw.data.root -> crush_rule: ssd-data
default.rgw.gc -> crush_rule: ssd-data
default.rgw.log -> crush_rule: ssd-data
default.rgw.users.uid -> crush_rule: hdd-data
default.rgw.usage -> crush_rule: ssd-data
default.rgw.users.email -> crush_rule: hdd-data
default.rgw.users.keys -> crush_rule: hdd-data
default.rgw.meta -> crush_rule: hdd-data
default.rgw.buckets.index -> crush_rule: ssd-data
default.rgw.buckets.data -> crush_rule: hdd-data
default.rgw.users.swift -> crush_rule: hdd-data
default.rgw.buckets.non-ec -> crush_rule: ssd-data
DB0475 -> crush_rule: hdd-data
cephfs_pmrb_data -> crush_rule: hdd-data
cephfs_pmrb_metadata -> crush_rule: ssd-data
All but four clients are running Luminous; those four are running Jewel
(and need upgrading before proceeding with this upgrade).
So, normally, I would 'just' upgrade all Ceph packages on the
monitor nodes and restart the mons and then the mgrs.
After that, I would upgrade all Ceph packages on the OSD nodes and
restart all the OSDs. Then, after that, the MDSes and RGWs. Restarting
the OSDs will probably take a while.
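For completeness, the cluster-wide commands I expect around those restarts, per the Nautilus upgrade notes (a sketch, not the full procedure):

    ceph osd set noout                      # avoid rebalancing during the restarts
    # ...upgrade packages and restart mons, mgrs, then OSDs host by host...
    ceph mon enable-msgr2                   # once all mons run Nautilus
    ceph osd require-osd-release nautilus   # once all OSDs run Nautilus
    ceph osd unset noout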
If anyone has a hint on what I should expect to cause some extra load or
waiting time, that would be great.
Obviously, we have read
https://ceph.com/releases/v14-2-0-nautilus-released/ , but I'm looking
for real world experiences.
Thanks!
--
Mark Schouten | Tuxis B.V.
KvK: 74698818 | http://www.tuxis.nl/
T: +31 318 200208 | info(a)tuxis.nl
Hi,
Does anybody know about the list-type=2 request?
GET /bucket?list-type=2&max-keys=2
Yesterday we had our second big object store cluster outage due to this request: one user took the whole cluster down. Normal read ops in ceph iostat are below 30k, but when this user deployed their release it jumped to 350k, which made the RADOS gateways die behind haproxy.
Why is Ceph so sensitive to this, and what is this request actually? I couldn't even find anything about it on Google.
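As far as I can tell, list-type=2 is the S3 ListObjectsV2 operation; with aws-cli the same request would look roughly like this (endpoint and bucket name are placeholders):

    # equivalent of GET /bucket?list-type=2&max-keys=2
    aws s3api list-objects-v2 --bucket bucket --max-keys 2 \
        --endpoint-url http://rgw.example.com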
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo(a)agoda.com
---------------------------------------------------