I recently set up a new Octopus cluster and was testing the autoscale
feature (I used ceph-ansible, so it's enabled by default). Anyhow, I have three
other clusters that are on Nautilus, so I wanted to see if it made sense to
enable it there on the main pool.
Here is a printout of the autoscale status:
POOL                        SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
default.rgw.buckets.non-ec  0                    2.0   55859G        0.0000                                 1.0   32                  on
default.rgw.meta            9298                 3.0   55859G        0.0000                                 1.0   32                  on
default.rgw.buckets.index   18058M               3.0   55859G        0.0009                                 1.0   32                  on
default.rgw.control         0                    3.0   55859G        0.0000                                 1.0   32                  on
default.rgw.buckets.data    9126G                2.0   55859G        0.3268                                 1.0   4096    1024        off
.rgw.root                   3155                 3.0   55859G        0.0000                                 1.0   32                  on
rbd                         155.5G               2.0   55859G        0.0056                                 1.0   32                  on
default.rgw.log             374.4k               3.0   55859G        0.0000                                 1.0   64                  on
For this entry:
default.rgw.buckets.data  9126G  2.0  55859G  0.3268  1.0  4096  1024  off
I have autoscaling disabled on that pool because it showed a warn message, but it's
recommending a 1024 PG setting. When I use the online Ceph PG calculator at ceph.io,
it says the 4096 setting is correct. So why is the autoscaler saying 1024?
There are 6 OSD servers with 10 OSDs each (all SSD), 60 TB total.
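For context, here is my back-of-the-envelope reading of what each tool might be computing (this is my assumption, not verified against the autoscaler code):
```
# total PG "budget": OSD count x target PGs per OSD / replica size
echo $(( 60 * 100 / 2 ))        # 3000
# autoscaler (as I understand it): budget x the pool's *current* usage RATIO,
# rounded to a power of two
echo "0.3268 * 3000" | bc       # ~980  -> rounds to 1024
# ceph.io calculator: budget x the %data you type in; with the main pool
# treated as holding essentially all of the data
echo "1.00 * 3000" | bc         # 3000  -> next power of two = 4096
```
If that reading is right, giving the pool a target_size_ratio ("ceph osd pool set default.rgw.buckets.data target_size_ratio <ratio>") should let the autoscaler plan for the pool's eventual share rather than its current footprint.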
Output of "ceph osd pool ls detail":
pool 1 '.rgw.root' replicated size 3 min_size 1 crush_rule 0 object_hash
rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 8800 lfor
0/0/344 flags hashpspool stripe_width 0 application rgw
pool 2 'default.rgw.control' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 8799
lfor 0/0/346 flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.meta' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 8798
lfor 0/0/350 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 8802
lfor 0/0/298 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.buckets.index' replicated size 3 min_size 1 crush_rule 0
object_hash rjenkins pg_num 638 pgp_num 608 pg_num_target 32 pgp_num_target
32 autoscale_mode on last_change 10320 lfor 0/10320/10318 owner
18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 7 'default.rgw.buckets.data' replicated size 2 min_size 1 crush_rule 0
object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 9467 lfor 0/0/552
owner 18446744073709551615 flags hashpspool stripe_width 0 application rgw
pool 8 'default.rgw.buckets.non-ec' replicated size 2 min_size 1 crush_rule
0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change
8797 lfor 0/0/348 owner 18446744073709551615 flags hashpspool stripe_width 0
application rgw
pool 9 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins
pg_num 32 pgp_num 32 autoscale_mode on last_change 8801 flags
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
Regards,
-Brent
Existing Clusters:
Test: Octopus 15.2.5 ( all virtual on nvme )
US Production(HDD): Nautilus 14.2.11 with 11 osd servers, 3 mons, 4
gateways, 2 iscsi gateways
UK Production(HDD): Nautilus 14.2.11 with 18 osd servers, 3 mons, 4
gateways, 2 iscsi gateways
US Production(SSD): Nautilus 14.2.11 with 6 osd servers, 3 mons, 4 gateways,
2 iscsi gateways
UK Production(SSD): Octopus 15.2.5 with 5 osd servers, 3 mons, 4 gateways
Yes, that's right. It would be nice if there were a mount option to adjust such parameters on a per-file-system basis. I should mention that I also observed a significant improvement in local-disk HDD throughput when adjusting these parameters for Ceph.
This is largely due to the "too much memory problem" on big servers. The kernel defaults are suitable for machines with 4-8G of RAM. Any enterprise server will exceed that, with the consequence of insanely large amounts of dirty buffers, leading to panicked buffer flushes that overload, in particular, network file systems (there is a nice article by SUSE: https://www.suse.com/support/kb/doc/?id=000017857). Adjusting these parameters to play nice with Ceph might actually improve overall performance as a side effect. I would give it a go.
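For reference, a minimal sketch of how such settings could be persisted (the file name and values are only examples and need tuning per workload):
```
# /etc/sysctl.d/90-dirty-buffers.conf -- example values only, tune per workload
# Cap the dirty page cache in absolute bytes instead of the percentage defaults,
# so a large-RAM server does not accumulate many gigabytes of dirty buffers.
# 256 MiB: start background writeback early
vm.dirty_background_bytes = 268435456
# 1 GiB: hard limit before writers are throttled
vm.dirty_bytes = 1073741824
```
Apply with "sysctl --system" (or reboot); setting the *_bytes variants takes precedence over the *_ratio defaults.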
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Sage Meng <lkkey80(a)gmail.com>
Sent: 12 November 2020 16:00:08
To: Frank Schilder
Cc: ceph-users(a)ceph.io
Subject: Re: [ceph-users] Is there a way to make Cephfs kernel client to write data to ceph osd smoothly with buffer io
vm.dirty_bytes and vm.dirty_background_bytes are both system-wide control parameters; adjusting them influences every job on the system. It would be better to have a Ceph-specific way to make the transfer smoother.
Frank Schilder <frans(a)dtu.dk> wrote on Wednesday, 11 November 2020 at 15:28:
These kernel parameters influence the flushing of data, and also performance:
vm.dirty_bytes
vm.dirty_background_bytes
A smaller vm.dirty_background_bytes will make the transfer smoother, and the Ceph cluster will like that. However, it reduces the chances of merge operations in the cache, and the Ceph cluster will not like that. The tuning is heavily workload dependent. Test with realistic workloads and a reasonably large spectrum of values. I got good results by tuning down vm.dirty_background_bytes just to the point where it started to reduce client performance when copying large files.
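A sketch of how one might probe this at runtime before persisting anything (the sizes are placeholders, not recommendations):
```
# current values (0 means the corresponding *_ratio setting is in effect instead)
sysctl vm.dirty_background_bytes vm.dirty_bytes vm.dirty_background_ratio vm.dirty_ratio
# try a smaller background threshold, then re-run the large-file copy test
sysctl -w vm.dirty_background_bytes=$((64*1024*1024))   # 64 MiB, placeholder
sysctl -w vm.dirty_bytes=$((512*1024*1024))             # 512 MiB, placeholder
```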
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Sage Meng <lkkey80(a)gmail.com>
Sent: 06 November 2020 13:45:53
To: ceph-users(a)ceph.io
Subject: [ceph-users] Is there a way to make Cephfs kernel client to write data to ceph osd smoothly with buffer io
Hi All,
The CephFS kernel client is influenced by the kernel page cache when we write
data to it; the outgoing burst of data is huge when the OS starts flushing the page cache.
So is there a way to make the CephFS kernel client write data to the Ceph OSDs
smoothly when buffered I/O is used?
Hi
We’ve recently encountered the following errors:
[WRN] OSD_SLOW_PING_TIME_BACK: Slow OSD heartbeats on back (longest 2752.832ms)
Slow OSD heartbeats on back from osd.2 [nvme-a] to osd.290 [nvme-c] 2752.832 msec
...
Truncated long network list. Use ceph daemon mgr.# dump_osd_network for more information
To get more information we wanted to run the dump_osd_network command, but it doesn’t seem to be a valid command:
ceph daemon /var/run/ceph/ceph-mgr.$(hostname).asok dump_osd_network 0
no valid command found; 10 closest matches:
0
1
2
abort
assert
config diff
config diff get <var>
config get <var>
config help [<var>]
config set <var> <val>...
admin_socket: invalid command
Other commands against the same socket, like "ceph daemon /var/run/ceph/ceph-mgr.$(hostname).asok dump_cache", work, so it seems we are hitting the right socket.
What am I doing wrong?
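For what it's worth, the health-check documentation also mentions a per-OSD form of this command, so as a fallback I may try something like the following (untested here):
```
# query the admin socket of one of the OSDs named in the warning
ceph daemon osd.2 dump_osd_network
# optionally with an explicit threshold in milliseconds (0 = show everything)
ceph daemon osd.2 dump_osd_network 0
```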
Cheers,
Denis
I'm building a new 4-node Proxmox/Ceph cluster, to hold disk images for our VMs. (Ceph version is 15.2.5).
Each node has 6 x NVMe SSDs (4TB), and 1 x Optane drive (960GB).
CPU is AMD Rome 7442, so there should be plenty of CPU capacity to spare.
My aim is to create 4 x OSDs per NVMe SSD (to make more effective use of the NVMe performance) and use the Optane drive to store the WAL/DB partition for each OSD (i.e. a total of 24 x 35 GB WAL/DB partitions).
However, I am struggling to get the right ceph-volume command to achieve this.
Thanks to a very kind Redditor, I was able to get close:
/dev/nvme0n1 is an Optane device (900GB).
/dev/nvme2n1 is an Intel NVMe SSD (4TB).
```
# ceph-volume lvm batch --osds-per-device 4 /dev/nvme2n1 --db-devices /dev/nvme0n1
Total OSDs: 4
Solid State VG:
Targets: block.db Total size: 893.00 GB
Total LVs: 16 Size per LV: 223.25 GB
Devices: /dev/nvme0n1
Type Path LV Size % of device
----------------------------------------------------------------------------------------------------
[data] /dev/nvme2n1 931.25 GB 25.0%
[block.db] vg: vg/lv 223.25 GB 25%
----------------------------------------------------------------------------------------------------
[data] /dev/nvme2n1 931.25 GB 25.0%
[block.db] vg: vg/lv 223.25 GB 25%
----------------------------------------------------------------------------------------------------
[data] /dev/nvme2n1 931.25 GB 25.0%
[block.db] vg: vg/lv 223.25 GB 25%
----------------------------------------------------------------------------------------------------
[data] /dev/nvme2n1 931.25 GB 25.0%
[block.db] vg: vg/lv 223.25 GB 25%
--> The above OSDs would be created if the operation continues
--> do you want to proceed? (yes/no)
```
This does split the NVMe disk into 4 OSDs and creates WAL/DB partitions on the Optane drive - however, it creates 4 x 223 GB partitions on the Optane (whereas I want 35 GB partitions).
Is there any way to specify the WAL/DB partition size in the above?
And can it be done, such that you can run successive ceph-volume commands, to add the OSDs and WAL/DB partitions for each NVMe disk?
(Or if there's an easier way to achieve the above layout, please let me know).
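In case it helps, the direction I plan to try next (not yet verified on this cluster) is passing an explicit DB size to the batch command and repeating the command once per data device:
```
# sketch only: ask for explicit 35G DB slices on the Optane, one batch run per NVMe data device
# (depending on the ceph-volume version, the size may need to be given in bytes)
ceph-volume lvm batch --osds-per-device 4 --block-db-size 35G \
    /dev/nvme2n1 --db-devices /dev/nvme0n1
# then the same for the next data SSD, e.g. (device name illustrative):
ceph-volume lvm batch --osds-per-device 4 --block-db-size 35G \
    /dev/nvme3n1 --db-devices /dev/nvme0n1
```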
That being said - I also just saw this ceph-users thread:
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/3Y6DEJCF7ZM…
That thread talks about "osd op num shards" and "osd op num threads per shard" - is there some way to set those to achieve performance similar to, say, 4 x OSDs per NVMe drive, but with only 1 x OSD per NVMe? Has anybody done any testing/benchmarking on this they can share?
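As a footnote, those appear to map to ordinary OSD config options (the exact names and values below are my guesses/placeholders, and as far as I know they are only read at OSD startup, so a restart would be needed):
```
# sketch: bump shard/thread counts for SSD-backed OSDs; values are placeholders to benchmark
ceph config set osd osd_op_num_shards_ssd 16
ceph config set osd osd_op_num_threads_per_shard_ssd 2
# restart the OSDs afterwards so the new values take effect
systemctl restart ceph-osd.target
```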
Hi All,
I'm exploring deploying Ceph at my organization for use as an object storage system (using the S3 RGW interface).
My users have a range of file sizes, and I'd like to direct small files to a pool that uses replication and large files to a pool that uses erasure coding.
Is that possible?
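To make the question more concrete, the closest mechanism I've found so far seems to be RGW storage classes, roughly along these lines (the pool and class names are made up, and as far as I can tell the client would still have to pick the storage class per object rather than RGW routing by size):
```
# sketch: an EC data pool exposed as an extra S3 storage class (illustrative names)
ceph osd pool create default.rgw.buckets.data.ec 64 64 erasure
ceph osd pool application enable default.rgw.buckets.data.ec rgw
radosgw-admin zonegroup placement add --rgw-zonegroup default \
    --placement-id default-placement --storage-class BIGFILES
radosgw-admin zone placement add --rgw-zone default \
    --placement-id default-placement --storage-class BIGFILES \
    --data-pool default.rgw.buckets.data.ec
```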
Thanks!
Bill
Hi
We have a Ceph cluster running on Nautilus, recently upgraded from Mimic.
While on Mimic we noticed an issue with osdmaps not being trimmed, which caused
part of our cluster to crash due to osdmap cache misses. We worked around it by
adding "osd_map_cache_size = 5000" to our ceph.conf.
Because we had mixed OSD versions from both Mimic and Nautilus at that time, we
decided to finish the upgrade, but it didn't solve our problem.
At the moment we have: "oldest_map": 67114, "newest_map": 72588, and the
difference is not shrinking even though the cluster is in active+clean
state. Restarting all mons didn't help. The bug seems similar to
https://tracker.ceph.com/issues/44184 but there's no solution there.
What else can I check or do?
I don't want to do dangerous things like mon_osd_force_trim_to or
something similar without finding the cause.
I noticed in MON debug log:
2020-11-10 17:11:14.612 7f9592d5b700 10 mon.monb01(a)0(leader).osd e72571
should_prune could only prune 4957 epochs (67114..72071), which is less
than the required minimum (10000)
2020-11-10 17:11:19.612 7f9592d5b700 10 mon.monb01(a)0(leader).osd e72571
should_prune could only prune 4957 epochs (67114..72071), which is less
than the required minimum (10000)
So I added config options to reduce those values:
mon dev mon_debug_block_osdmap_trim false
mon advanced mon_min_osdmap_epochs 100
mon advanced mon_osdmap_full_prune_min 500
mon advanced paxos_service_trim_min 10
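(These were set roughly like this via the centralized config; exact syntax from memory:)
```
ceph config set mon mon_debug_block_osdmap_trim false
ceph config set mon mon_min_osdmap_epochs 100
ceph config set mon mon_osdmap_full_prune_min 500
ceph config set mon paxos_service_trim_min 10
```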
But it didn't help:
2020-11-10 18:28:26.165 7f1b700ab700 20 mon.monb01(a)0(leader).osd e72588
load_osdmap_manifest osdmap manifest detected in store; reload.
2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01(a)0(leader).osd e72588
load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
2020-11-10 18:28:26.169 7f1b700ab700 10 mon.monb01(a)0(leader).osd e72588
should_prune not enough epochs to form an interval (last pinned: 72484,
last to pin: 72488, interval: 10)
Command "ceph report | jq '.osdmap_manifest' |jq '.pinned_maps[]'" shows
67114 on the top, but i'm unable to determine why.
Same with 'ceph report | jq .osdmap_first_committed':
root@monb01:/var/log/ceph# ceph report | jq .osdmap_first_committed
report 4073203295
67114
root@monb01:/var/log/ceph#
When I try to determine whether a certain PG or OSD is keeping it this low, I
don't find anything.
And in MON debug log i get:
2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01@0(leader) e6
refresh_from_paxos
2020-11-10 18:42:41.767 7f1b74721700 10
mon.monb01(a)0(leader).paxosservice(mdsmap 1..1) refresh
2020-11-10 18:42:41.767 7f1b74721700 10
mon.monb01(a)0(leader).paxosservice(osdmap 67114..72588) refresh
2020-11-10 18:42:41.767 7f1b74721700 20 mon.monb01(a)0(leader).osd e72588
load_osdmap_manifest osdmap manifest detected in store; reload.
2020-11-10 18:42:41.767 7f1b74721700 10 mon.monb01(a)0(leader).osd e72588
load_osdmap_manifest store osdmap manifest pinned (67114 .. 72484)
I also get:
root@monb01:/var/log/ceph# ceph report |grep "min_last_epoch_clean"
report 2716976759
"min_last_epoch_clean": 0,
root@monb01:/var/log/ceph#
Additional info:
root@monb01:/var/log/ceph# ceph versions
{
"mon": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 3
},
"osd": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 120,
"ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0)
nautilus (stable)": 164
},
"mds": {},
"overall": {
"ceph version 14.2.13 (1778d63e55dbff6cedb071ab7d367f8f52a8699f)
nautilus (stable)": 126,
"ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0)
nautilus (stable)": 164
}
}
root@monb01:/var/log/ceph# ceph mon feature ls
all features
supported: [kraken,luminous,mimic,osdmap-prune,nautilus]
persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
on current monmap (epoch 6)
persistent: [kraken,luminous,mimic,osdmap-prune,nautilus]
required: [kraken,luminous,mimic,osdmap-prune,nautilus]
root@monb01:/var/log/ceph# ceph osd dump | grep require
require_min_compat_client luminous
require_osd_release nautilus
root@monb01:/var/log/ceph# ceph report | jq
'.osdmap_manifest.pinned_maps | length'
report 1777129876
538
root@monb01:/var/log/ceph# ceph pg dump -f json | jq .osd_epochs
dumped all
null
--
Best regards
Marcin
Hi,
We have this "not permitted to load rgw_gc" error on some of our osds.
Anyone knows what this is and how to fix it?
Nautilus 14.2.11 and CentOS 7 / 8:
2020-11-11 09:48:15.914 7f665c1ea700 0 _get_class not permitted to load rgw_gc
2020-11-11 09:48:15.914 7f665c1ea700 -1 osd.874 163331 class rgw_gc
open got (1) Operation not permitted
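In case it's related, one thing I plan to double-check (a guess, not a confirmed fix) is whether the OSDs' RADOS class whitelist is pinned to an explicit list that predates the rgw_gc class:
```
# show what one of the affected OSDs currently allows (guessing this is the relevant knob)
ceph daemon osd.874 config get osd_class_load_list
# if it is an explicit list rather than '*', allowing all classes again would look like:
ceph config set osd osd_class_load_list '*'
```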
Cheers, Dan