Hi,
I keep getting scrub errors in my index and log pools, which I constantly have to repair.
HEALTH_ERR 2 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 2 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
pg 20.19 is active+clean+inconsistent, acting [39,41,37]
Why is this happening?
I have no clue at all: no log entry, nothing ☹
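In case it helps: before repairing, the inconsistency can usually be inspected to see which object and shard are affected. A sketch, using the PG id from the health output above:

```shell
# list the inconsistent objects and the per-shard errors for pg 20.19
rados list-inconsistent-obj 20.19 --format=json-pretty
# once the bad shard is understood, trigger the repair
ceph pg repair 20.19
```

If the list comes back empty, the error state may be stale and a fresh deep-scrub of the PG may be needed first.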
________________________________
This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.
We're happy to announce the 5th backport release in the Pacific series.
We recommend that users update to this release. For detailed release
notes with links and a changelog, please refer to the official blog entry at
https://ceph.io/en/news/blog/2021/v16-2-5-pacific-released
Notable Changes
---------------
* The `ceph-mgr-modules-core` debian package no longer recommends
`ceph-mgr-rook`, because the latter depends on `python3-numpy`, which
cannot be imported in multiple Python sub-interpreters if the
`python3-numpy` version is older than 1.19. Since `apt-get` installs
`Recommends` packages by default, `ceph-mgr-rook` was always installed
along with the `ceph-mgr` debian package as an indirect dependency. If
your workflow depends on this behavior, you might want to install
`ceph-mgr-rook` separately.
* mgr/nfs: the `nfs` module has been moved out of the volumes plugin.
Before using the `ceph nfs` commands, the `nfs` mgr module must be enabled.
* volumes/nfs: The `cephfs` cluster type has been removed from the `nfs
cluster create` subcommand. Clusters deployed by cephadm can support an
NFS export of both `rgw` and `cephfs` from a single NFS cluster instance.
* The `nfs cluster update` command has been removed. You can modify the
placement of an existing NFS service (and/or its associated ingress
service) using `orch ls --export` and `orch apply -i ...`.
* The `orch apply nfs` command no longer requires a pool or namespace
argument. We strongly encourage users to use the defaults so that the
`nfs cluster ls` and related commands will work properly.
* The `nfs cluster delete` and `nfs export delete` commands are
deprecated and will be removed in a future release. Please use `nfs
cluster rm` and `nfs export rm` instead.
* A long-standing bug that prevented 32-bit and 64-bit client/server
interoperability under msgr v2 has been fixed. In particular, mixing
armv7l (armhf) and x86_64 or aarch64 servers in the same cluster now works.
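To illustrate the `nfs cluster update` replacement described above, a sketch of the export-and-reapply workflow (the spec file name is arbitrary):

```shell
# the nfs mgr module must be enabled before using `ceph nfs` commands
ceph mgr module enable nfs
# dump the current nfs service spec (including placement) to a file
ceph orch ls --service-type nfs --export > nfs-spec.yaml
# edit the placement: section in nfs-spec.yaml, then re-apply it
ceph orch apply -i nfs-spec.yaml
```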
Getting Ceph
------------
* Git at git://github.com/ceph/ceph.git
* Tarball at https://download.ceph.com/tarballs/ceph-16.2.5.tar.gz
* For packages, see https://docs.ceph.com/docs/master/install/get-packages/
* Release git sha1: 0883bdea7337b95e4b611c768c0279868462204a
We have a nautilus cluster where any metadata write operation is very slow.
We're seeing very light load from clients, as reported by dumping ops in
flight; often it's zero.
We're also seeing about 100 MB/s of writes to the metadata pool, constantly,
for weeks on end, which seems excessive given that only 22 GB is utilized.
Should the writes to the metadata pool not quiet down when there's nothing
going on?
Is there any way I can get information about why the MDSes are thrashing so
badly?
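Not an answer, but a starting point for observing the journaling: the mds_log perf counters show how fast journal events and segments are being added versus trimmed. A sketch (the daemon name is a placeholder):

```shell
# dump the journal-related perf counters for one MDS via its admin socket
ceph daemon mds.<name> perf dump mds_log
# sample the add/trim counters a few seconds apart to estimate the churn rate
ceph daemon mds.<name> perf dump mds_log | grep -E '"(evadd|evtrm|segadd|segtrm)"'
```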
Hi,
I've added 4 NVMe hosts (2 OSDs per NVMe) to my cluster, and all the SSD OSDs started flapping; I don't understand why.
They are under the same root but in two different device classes, nvme and ssd.
The pools are on the ssd class; there is nothing on the nvme class at the moment.
The only way to bring the ssd OSDs back up is to shut down the nvme hosts.
The new NVMe servers have 25Gb NICs; the old servers and the mons have 10Gb NICs, but in aggregated (bonded) mode.
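One thing worth ruling out with mixed 25Gb/10Gb networking is an MTU mismatch between old and new hosts, a classic cause of OSD heartbeat failures and flapping. A quick sketch (the address is a placeholder; 8972 assumes a 9000-byte jumbo MTU):

```shell
# do-not-fragment ping at near-jumbo size from an old ssd host to a new nvme host
ping -M do -s 8972 -c 3 <nvme-host-address>
# and at the standard 1500-byte MTU for comparison
ping -M do -s 1472 -c 3 <nvme-host-address>
```

If the large ping fails while the small one succeeds, some link in between is not passing jumbo frames.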
This is the crush rule dump:
[
    {
        "rule_id": 0,
        "rule_name": "replicated_ssd",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -21,
                "item_name": "default~ssd"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    },
    {
        "rule_id": 1,
        "rule_name": "replicated_nvme",
        "ruleset": 1,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -10,
                "item_name": "default~nvme"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]
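The rules themselves can be sanity-checked offline with crushtool, to confirm each rule only maps onto its own device class. A sketch:

```shell
# export the compiled crush map from the cluster
ceph osd getcrushmap -o crushmap.bin
# show which OSDs rule 0 (replicated_ssd) would pick for a 3-replica pool
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head -n 20
```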
This is the osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-19 561.15057 root default
-1 38.03099 host server-2001
0 ssd 2.00000 osd.0 up 1.00000 1.00000
10 ssd 6.98499 osd.10 up 1.00000 1.00000
11 ssd 6.98599 osd.11 up 1.00000 1.00000
12 ssd 2.29799 osd.12 up 1.00000 1.00000
13 ssd 2.29799 osd.13 up 1.00000 1.00000
14 ssd 3.49300 osd.14 up 1.00000 1.00000
41 ssd 6.98499 osd.41 up 1.00000 1.00000
42 ssd 6.98599 osd.42 up 1.00000 1.00000
-3 38.03099 host server-2002
1 ssd 2.00000 osd.1 up 1.00000 1.00000
24 ssd 6.98499 osd.24 up 1.00000 1.00000
25 ssd 6.98599 osd.25 up 1.00000 1.00000
27 ssd 2.29799 osd.27 up 1.00000 1.00000
28 ssd 2.29799 osd.28 up 1.00000 1.00000
29 ssd 3.49300 osd.29 up 1.00000 1.00000
43 ssd 6.98499 osd.43 up 1.00000 1.00000
44 ssd 6.98599 osd.44 up 1.00000 1.00000
-6 38.03000 host server-2003
2 ssd 2.00000 osd.2 up 1.00000 1.00000
26 ssd 6.98499 osd.26 up 1.00000 1.00000
38 ssd 2.29999 osd.38 up 1.00000 1.00000
39 ssd 2.29500 osd.39 up 1.00000 1.00000
40 ssd 3.49300 osd.40 up 1.00000 1.00000
45 ssd 6.98499 osd.45 up 1.00000 1.00000
46 ssd 6.98599 osd.46 up 1.00000 1.00000
47 ssd 6.98599 osd.47 up 1.00000 1.00000
-17 111.76465 host server-2004
5 nvme 6.98529 osd.5 down 0 1.00000
9 nvme 6.98529 osd.9 down 0 1.00000
18 nvme 6.98529 osd.18 down 0 1.00000
22 nvme 6.98529 osd.22 down 0 1.00000
32 nvme 6.98529 osd.32 down 0 1.00000
36 nvme 6.98529 osd.36 down 0 1.00000
50 nvme 6.98529 osd.50 down 0 1.00000
54 nvme 6.98529 osd.54 down 0 1.00000
58 nvme 6.98529 osd.58 down 0 1.00000
62 nvme 6.98529 osd.62 down 0 1.00000
66 nvme 6.98529 osd.66 down 0 1.00000
70 nvme 6.98529 osd.70 down 0 1.00000
74 nvme 6.98529 osd.74 down 0 1.00000
78 nvme 6.98529 osd.78 down 0 1.00000
82 nvme 6.98529 osd.82 down 0 1.00000
86 nvme 6.98529 osd.86 down 0 1.00000
-14 111.76465 host server-2005
4 nvme 6.98529 osd.4 down 0 1.00000
8 nvme 6.98529 osd.8 down 0 1.00000
17 nvme 6.98529 osd.17 down 0 1.00000
21 nvme 6.98529 osd.21 down 0 1.00000
31 nvme 6.98529 osd.31 down 0 1.00000
35 nvme 6.98529 osd.35 down 0 1.00000
49 nvme 6.98529 osd.49 down 0 1.00000
53 nvme 6.98529 osd.53 down 0 1.00000
57 nvme 6.98529 osd.57 down 0 1.00000
61 nvme 6.98529 osd.61 down 0 1.00000
65 nvme 6.98529 osd.65 down 0 1.00000
69 nvme 6.98529 osd.69 down 0 1.00000
73 nvme 6.98529 osd.73 down 0 1.00000
77 nvme 6.98529 osd.77 down 0 1.00000
81 nvme 6.98529 osd.81 down 0 1.00000
85 nvme 6.98529 osd.85 down 0 1.00000
-22 111.76465 host server-2006
6 nvme 6.98529 osd.6 down 0 1.00000
15 nvme 6.98529 osd.15 down 0 1.00000
19 nvme 6.98529 osd.19 down 0 1.00000
23 nvme 6.98529 osd.23 down 0 1.00000
33 nvme 6.98529 osd.33 down 0 1.00000
37 nvme 6.98529 osd.37 down 0 1.00000
51 nvme 6.98529 osd.51 down 0 1.00000
55 nvme 6.98529 osd.55 down 0 1.00000
59 nvme 6.98529 osd.59 down 0 1.00000
63 nvme 6.98529 osd.63 up 0 1.00000
67 nvme 6.98529 osd.67 down 0 1.00000
71 nvme 6.98529 osd.71 up 0 1.00000
75 nvme 6.98529 osd.75 down 0 1.00000
79 nvme 6.98529 osd.79 down 0 1.00000
83 nvme 6.98529 osd.83 down 0 1.00000
87 nvme 6.98529 osd.87 down 0 1.00000
-11 111.76465 host server-2007
3 nvme 6.98529 osd.3 down 0 1.00000
7 nvme 6.98529 osd.7 down 0 1.00000
16 nvme 6.98529 osd.16 down 0 1.00000
20 nvme 6.98529 osd.20 down 0 1.00000
30 nvme 6.98529 osd.30 down 0 1.00000
34 nvme 6.98529 osd.34 down 0 1.00000
48 nvme 6.98529 osd.48 down 0 1.00000
52 nvme 6.98529 osd.52 down 0 1.00000
56 nvme 6.98529 osd.56 down 0 1.00000
60 nvme 6.98529 osd.60 down 0 1.00000
64 nvme 6.98529 osd.64 down 0 1.00000
68 nvme 6.98529 osd.68 down 0 1.00000
72 nvme 6.98529 osd.72 down 0 1.00000
76 nvme 6.98529 osd.76 down 0 1.00000
80 nvme 6.98529 osd.80 down 0 1.00000
84 nvme 6.98529 osd.84 down 0 1.00000
Pool info:
pool 21 'dbs-realtime-staging-client' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 27611 lfor 0/27512/27510 flags hashpspool,selfmanaged_snaps max_bytes 9999757606912 stripe_width 0 application rbd
pool 24 'dbs-realtime-staging-w-financedb' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 256 pgp_num 256 autoscale_mode on last_change 27613 flags hashpspool,selfmanaged_snaps max_bytes 19999515213824 stripe_width 0 application rbd
pool 25 'dbs-realtime-staging-w-dstest' replicated size 3 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 27813 lfor 0/0/23856 flags hashpspool,selfmanaged_snaps max_bytes 99857989632 stripe_width 0 application rbd
Istvan Szabo
Senior Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo(a)agoda.com<mailto:istvan.szabo@agoda.com>
---------------------------------------------------
Thanks Marc.
That means we can upgrade from Luminous to Nautilus, and later migrate the
OSDs from ceph-disk to ceph-volume.
On Thu, Jul 8, 2021 at 5:45 PM Marc <Marc(a)f1-outsourcing.eu> wrote:
> I did the same upgrade from Luminous to Nautilus, and still have osd's
> created with ceph-disk. I am slowly migrating to lvm and encryption.
>
> However I did have some issues with OSDs not starting: you have to check
> run levels and make sure the symlinks are still correct. I also had to
> change the ownership of /dev/sdX before some OSDs would start.
>
> I still have this in my rc.local
>
> chown ceph.ceph /dev/sdb2
> chown ceph.ceph /dev/sdc2
> chown ceph.ceph /dev/sdd2
> chown ceph.ceph /dev/sde2
> chown ceph.ceph /dev/sdf2
> chown ceph.ceph /dev/sdg2
> chown ceph.ceph /dev/sdh2
> chown ceph.ceph /dev/sdi2
> chown ceph.ceph /dev/sdj2
> chown ceph.ceph /dev/sdk2
>
>
> > -----Original Message-----
> > From: M Ranga Swami Reddy <swamireddy(a)gmail.com>
> > Sent: Thursday, 8 July 2021 11:49
> > To: ceph-devel <ceph-devel(a)vger.kernel.org>; ceph-users <ceph-
> > users(a)ceph.com>
> > Subject: [ceph-users] Fwd: ceph upgrade from luminous to nautils
> >
> > ---------- Forwarded message ---------
> > From: M Ranga Swami Reddy <swamireddy(a)gmail.com>
> > Date: Thu, Jul 8, 2021 at 2:30 PM
> > Subject: ceph upgrade from luminous to nautils
> > To: ceph-devel <ceph-devel(a)vger.kernel.org>
> >
> >
> > Dear All,
> > I am using Ceph Luminous with 2000+ OSDs.
> > We are planning to upgrade from Luminous to Nautilus.
> > Currently, all OSDs are deployed via ceph-disk.
> > Can I proceed with this upgrade?
> > Will the ceph-disk OSDs work with ceph-volume (ceph-disk was
> > deprecated in the Mimic release)?
> >
> > Please advise.
> >
> > Thanks
> > Swami
> > _______________________________________________
> > ceph-users mailing list -- ceph-users(a)ceph.io
> > To unsubscribe send an email to ceph-users-leave(a)ceph.io
>
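As a side note on the rc.local chowns quoted above: a udev rule achieves the same effect persistently across reboots and device re-enumeration. A sketch (the `sd[b-k]2` match assumes the journal partitions are always partition 2 on those disks):

```shell
# give ceph ownership of the journal partitions when the devices appear
cat > /etc/udev/rules.d/89-ceph-journal.rules <<'EOF'
KERNEL=="sd[b-k]2", SUBSYSTEM=="block", OWNER="ceph", GROUP="ceph", MODE="0660"
EOF
udevadm control --reload-rules
```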
Hi,
After upgrading from 15.2.8. to 15.2.13 with cephadm on CentOS 8
(containerised installation done by cephadm), Grafana no longer shows
new data. Additionally, when accessing the Dashboard-URL on a host
currently not hosting the dashboard, I am redirected to a wrong hostname
(as shown in ceph mgr services).
I assume this is caused by the same issue that leads to this output of
`ceph mgr services`:
{
"dashboard": "https://ceph-<cluster-id>-mgr.iceph-11.tsmsqs:8443/",
"prometheus": "http://ceph-<cluster-id>-mgr.iceph-11.tsmsqs:9283/"
}
The correct hostname is iceph-11 (without the tsmsqs part), FQDN is
iceph-11.servernet. The hosts use DNS, the names (iceph-11 and
iceph-11.servernet) are resolvable both from the hosts as well as from
within the Podman containers.
I have determined that Podman by default sets the container name as a
hostname alias (visible with `hostname -a` inside the container), which
apparently leads to the Ceph mgr picking it up as the primary name.
My workaround is to modify
/var/lib/ceph/<cluster-id>/mgr.<hostname>.<random-6-char-string>/unit.run,
adding --no-hosts as an additional argument to the "podman run" command.
I could probably use a system-wide containers.conf as well.
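A sketch of that system-wide containers.conf variant (assuming the Podman version in use honors the `no_hosts` option; check containers.conf(5) on the host, and merge into an existing `[containers]` section if there is one):

```shell
# tell podman never to inject container-name entries into /etc/hosts,
# the system-wide equivalent of passing --no-hosts to every podman run
cat >> /etc/containers/containers.conf <<'EOF'
[containers]
no_hosts = true
EOF
```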
With this workaround and after restarting the Ceph mgr container (via
systemctl) and then restarting Prometheus and Grafana (with ceph orch
redeploy), I once again get data in Grafana and the correct redirect for
the dashboard. `ceph mgr services` also shows expected and correct values.
I am wondering if this kind of issue is known or whether there is
something wrong with my setup. I expected Ceph mgr to use the primary
hostname and not some seemingly random hostname alias. Maybe this issue
can also be discussed in a troubleshooting section of the monitoring
stack documentation.
Cheers
Sebastian
We're running a rook-ceph cluster that has gotten stuck in "1 MDSs behind
on trimming".
* 1 filesystem, three active MDS servers, each with a standby
* Quite a few files (20M objects) and daily snapshots. This might be part
of the problem?
* Ceph pacific 16.2.4
* `ceph health detail` doesn't provide much help (see below)
* num_segments is very slowly increasing over time
* Restarting all of the MDSs returns to the same point.
* moderate CPU usage for each MDS server (~30% for the stuck one, ~80% of a
core for the others)
* logs for the stuck MDS look clean; it hits rejoin_joint_start, then the
standard "updating MDS map to version XXX" messages
* `ceph daemon mds.x ops` shows no active ops on each of the MDS servers
* `mds_log_max_segments` is set to 128; setting it higher makes the warning
go away, but the filesystem remains degraded, and setting it back to 128
shows num_segments has not changed.
* I've tried playing around with other MDS settings based on various posts
on this list and elsewhere, to no avail
* `cephfs-journal-tool journal inspect` for each rank says journal
integrity is fine.
Something similar happened last week, and (probably by accident, while
removing/adding nodes?) I got the MDSs to start recovering and the
filesystem went back to healthy.
I'm at a bit of a loss for what else to try.
Thanks!
Zack
`ceph health detail`
HEALTH_WARN mons are allowing insecure global_id reclaim; 1 filesystem is
degraded; 1 MDSs behind on trimming; mon x is low on available space
[WRN] AUTH_INSECURE_GLOBAL_ID_RECLAIM_ALLOWED: mons are allowing insecure
global_id reclaim
mon.x has auth_allow_insecure_global_id_reclaim set to true
mon.ad has auth_allow_insecure_global_id_reclaim set to true
mon.af has auth_allow_insecure_global_id_reclaim set to true
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs myfs is degraded
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.myfs-d(mds.2): Behind on trimming (340/128) max_segments: 128,
num_segments: 340
[WRN] MON_DISK_LOW: mon x is low on available space
mon.x has 22% avail
`ceph config get mds`
WHO MASK LEVEL OPTION VALUE RO
global basic log_file *
global basic log_to_file false
mds basic mds_cache_memory_limit 17179869184
mds advanced mds_cache_trim_decay_rate 1.000000
mds advanced mds_cache_trim_threshold 1048576
mds advanced mds_log_max_segments 128
mds advanced mds_recall_max_caps 5000
mds advanced mds_recall_max_decay_rate 2.500000
global advanced mon_allow_pool_delete true
global advanced mon_allow_pool_size_one true
global advanced mon_cluster_log_file
global advanced mon_pg_warn_min_per_osd 0
global advanced osd_pool_default_pg_autoscale_mode on
global advanced osd_scrub_auto_repair true
global advanced rbd_default_features 3
Hello
I have some experience with RBD clusters (for use with KVM/libvirt), but
now I'm building my first cluster to use with RGW.
The RGW cluster size will be around 70T raw; the current RBD cluster(s)
are of similar (or smaller) size. I'll be deploying Octopus.
Since most of the tuning is quite different (a large number of PGs,
bluestore_compression_*, bluestore_min_alloc_size_*), I wonder whether it
makes sense to run both workloads in the same cluster or whether it would
be better to have dedicated clusters.
On the other hand, bigger clusters (AFAIK) are more stable. What are other
people doing? A single cluster for all workloads, or a cluster per workload?
thanks!
PS: Asking for the future: what about CephFS? Should it share a cluster?
--
IRC: gfa
GPG: 0x27263FA42553615F904A7EBE2A40A2ECB8DAD8D5
OLD GPG: 0x44BB1BA79F6C6333