Hello!
We have a small Proxmox farm with Ceph, consisting of three nodes.
Each node has 6 disks, each with a capacity of 4 TB.
Only one pool has been created on these disks, with size 2 and min_size 1.
In theory, this pool should have a capacity of 32.74 TiB.
But the ceph df command reports only 22.4 TiB (USED + MAX AVAIL = 16.7 + 5.7).
How can this difference be explained?
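For reference, the theoretical figure above is just back-of-the-envelope arithmetic:
3 nodes x 6 OSDs x 4 TB = 72 TB raw, which is about 65.5 TiB; divided by 2 replicas that gives roughly 32.74 TiB.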
*ceph version is:* 12.2.12-pve1
*ceph df command output:*
POOLS:
NAME          ID  QUOTA OBJECTS  QUOTA BYTES  USED     %USED  MAX AVAIL  OBJECTS  DIRTY  READ     WRITE   RAW USED
ala01vf01p01  7   N/A            N/A          16.7TiB  74.53  5.70TiB    4411119  4.41M  2.62GiB  887MiB  33.4TiB
*crush map:*
host n01vf01 {
id -3 # do not change unnecessarily
id -4 class hdd # do not change unnecessarily
id -18 class nvme # do not change unnecessarily
# weight 22.014
alg straw2
hash 0 # rjenkins1
item osd.0 weight 3.669
item osd.13 weight 3.669
item osd.14 weight 3.669
item osd.15 weight 3.669
item osd.16 weight 3.669
item osd.17 weight 3.669
}
host n02vf01 {
id -5 # do not change unnecessarily
id -6 class hdd # do not change unnecessarily
id -19 class nvme # do not change unnecessarily
# weight 22.014
alg straw2
hash 0 # rjenkins1
item osd.1 weight 3.669
item osd.8 weight 3.669
item osd.9 weight 3.669
item osd.10 weight 3.669
item osd.11 weight 3.669
item osd.12 weight 3.669
}
host n04vf01 {
id -34 # do not change unnecessarily
id -35 class hdd # do not change unnecessarily
id -36 class nvme # do not change unnecessarily
# weight 22.014
alg straw2
hash 0 # rjenkins1
item osd.7 weight 3.669
item osd.27 weight 3.669
item osd.24 weight 3.669
item osd.25 weight 3.669
item osd.26 weight 3.669
item osd.28 weight 3.669
}
root default {
id -1 # do not change unnecessarily
id -2 class hdd # do not change unnecessarily
id -21 class nvme # do not change unnecessarily
# weight 66.042
alg straw2
hash 0 # rjenkins1
item n01vf01 weight 22.014
item n02vf01 weight 22.014
item n04vf01 weight 22.014
}
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
I currently have two roots in my crush map, one for HDD devices and one for SSD devices, and have had it that way since Jewel.
I am currently on Nautilus, and have had my crush device classes for my OSDs set since Luminous.
> ID CLASS WEIGHT TYPE NAME
> -13 105.37599 root ssd
> -11 105.37599 rack ssd.rack2
> -14 17.61099 host ceph00
> 24 ssd 1.76099 osd.24
> -1 398.92554 root default
> -10 397.07343 rack default.rack2
> -70 44.45032 chassis ceph05
> -67 44.45032 host ceph05
> 74 hdd 1.85210 osd.74
I have crush rulesets that distribute based on the roots for each device class.
> [
> {
> "rule_id": 0,
> "rule_name": "replicated_ruleset",
> "ruleset": 0,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "chassis"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 1,
> "rule_name": "ssd_ruleset",
> "ruleset": 1,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -13,
> "item_name": "ssd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 0,
> "type": "host"
> },
> {
> "op": "emit"
> }
> ]
> },
> {
> "rule_id": 2,
> "rule_name": "hybrid_ruleset",
> "ruleset": 2,
> "type": 1,
> "min_size": 1,
> "max_size": 10,
> "steps": [
> {
> "op": "take",
> "item": -13,
> "item_name": "ssd"
> },
> {
> "op": "chooseleaf_firstn",
> "num": 1,
> "type": "host"
> },
> {
> "op": "emit"
> },
> {
> "op": "take",
> "item": -1,
> "item_name": "default"
> },
> {
> "op": "chooseleaf_firstn",
> "num": -1,
> "type": "chassis"
> },
> {
> "op": "emit"
> }
> ]
> }
If I wanted to migrate to rulesets based on device class with minimal disruption, what are my options?
In my mind the way this would work would be to
1. Set the norebalance flag
2. Rework my crush rulesets to use takes based on class rather than root.
3. Merge my ssd hosts from the ssd root to the default root
4. Let things rebalance?
I would prefer minimal data movement, as a large rebalance would be potentially disruptive and, I imagine, provide little gain for me, though possibly better data distribution?
Maybe there are better steps to take?
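For reference, what I have in mind for steps 2 and 3 would look roughly like the following. This is only a sketch I have not run yet; the rule names are just examples, and the failure domains (chassis for hdd, host for ssd) are meant to mirror what my current root-based rules do:

ceph osd crush rule create-replicated replicated_hdd default chassis hdd
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set <pool> crush_rule replicated_ssd
ceph osd crush move ceph00 root=default rack=default.rack2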
Appreciate any help.
Reed
Hi all,
We use OpenStack + Ceph (Hammer) in production. There are 22 OSDs on each host, and 11 OSDs share one SSD for the OSD journal.
Unfortunately, one of the SSDs failed, so the 11 OSDs on it went down. The OSD log shows the following:
-1> 2019-09-19 11:35:52.681142 7fcab5354700 1 -- xxxxxxxxxxxx:6831/16460 --> xxxxxxxxxxxx:0/14304 -- osd_ping(ping_reply e6152 stamp 2019-09-19 11:35:52.679939) v2 -- ?+0 0x20af8400 con 0x20a4b340
0> 2019-09-19 11:35:52.682578 7fcabed3c700 -1 os/FileJournal.cc: In function 'void FileJournal::write_finish_thread_entry()' thread 7fcabed3c700 time 2019-09-19 11:35:52.640294
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc8b55]
2: (FileJournal::write_finish_thread_entry()+0x695) [0xa795c5]
3: (FileJournal::WriteFinisher::entry()+0xd) [0x91cecd]
4: (()+0x7dc5) [0x7fcacb81cdc5]
5: (clone()+0x6d) [0x7fcaca2fd1cd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this
As shown above, the OSDs went down on 09-19, but the PGs are still degraded and remapped; they seem to be stuck.
ceph -s
cluster 26cc714c-ed78-4d62-9435-db3e87509c5f
health HEALTH_WARN
681 pgs degraded
681 pgs stuck degraded
797 pgs stuck unclean
681 pgs stuck undersized
681 pgs undersized
recovery 132155/11239182 objects degraded (1.176%)
recovery 22056/11239182 objects misplaced (0.196%)
monmap e1: 3 mons at {ctrl01=xxx.xxx.xxx.xxx:6789/0,ctrl02=xxx.xxx.xxx.xxx:6789/0,ctrl03=xxx.xxx.xxx.xxx:6789/0}
election epoch 122, quorum 0,1,2 ctrl01,ctrl02,ctrl03
osdmap e6590: 324 osds: 313 up, 313 in; 116 remapped pgs
pgmap v40849600: 21504 pgs, 6 pools, 14048 GB data, 3658 kobjects
41661 GB used, 279 TB / 319 TB avail
132155/11239182 objects degraded (1.176%)
22056/11239182 objects misplaced (0.196%)
20707 active+clean
681 active+undersized+degraded
116 active+remapped
client io 121 MB/s rd, 144 MB/s wr, 1029 op/s
I queried one of the PGs; its recovery_state is "started" [1].
I also found that some PGs have no third OSD mapped, as shown below.
[root@ctrl01 ~]# ceph pg map 4.75f
osdmap e6590 pg 4.75f (4.75f) -> up [34,106] acting [34,106]
The crush map is at [2].
How should I get the cluster back to a healthy state?
Can someone help me? Thanks very much.
[1] https://github.com/rongzhen-zhan/myfile/blob/master/pgquery
[2] https://github.com/rongzhen-zhan/myfile/blob/master/crushmap
Hi Cephalopods,
I'm in the process of migrating a radosgw erasure-coded pool on an old cluster
to a replicated pool on a new cluster. To prevent users from writing new objects to the old pool,
I want to set the radosgw users' privileges to read-only.
Could you please share how to limit a radosgw user's privileges to read
only?
I could not find any clear explanation or example in the Ceph
radosgw-admin docs. Is it done by changing the user's caps or op_mask? Or by
setting the civetweb option to only allow HTTP HEAD and GET methods?
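For what it's worth, the direction I was guessing at is something like the following, though I am not sure whether op_mask is the intended mechanism, and the uid here is only an example:

radosgw-admin user modify --uid=testuser --op-mask=read
radosgw-admin user info --uid=testuser    (to confirm op_mask then shows "read")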
Kind regards,
Charles Alva
Sent from Gmail Mobile
The standard advice is "1GB RAM per 1TB of OSD". Does this actually still hold with large OSDs on bluestore? Can it be reasonably reduced with tuning?
From the docs, it looks like bluestore should target the "osd_memory_target" value by default. This is a fixed value (4GB by default) which does not depend on OSD size. So shouldn't the advice really be "4GB per OSD", rather than "1GB per TB"? Would it also be reasonable to reduce osd_memory_target for further RAM savings?
For example, suppose we have 90 12TB OSD drives:
* "1GB per TB" rule: 1080GB RAM
* "4GB per OSD" rule: 360GB RAM
* "2GB per OSD" (osd_memory_target reduced to 2GB): 180GB RAM
Those are some massively different RAM values. Perhaps the old advice was for filestore? Or is there something to consider beyond the bluestore memory target? What about very dense nodes (for example, 60 12TB OSDs on a single node)?
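To be concrete, the reduction I have in mind for the last case is just lowering the target, e.g. in ceph.conf (2 GiB here is only an illustrative value, not a recommendation):

[osd]
osd_memory_target = 2147483648

or, on releases that have the central config store, something like "ceph config set osd osd_memory_target 2147483648".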
Hi all,
I've recently moved a 1TiB pool (3TiB raw use) from the hdd OSDs (7 of them) to newly added nvme OSDs (14). The hdd OSDs should be almost empty by now, as only small pools remain on them. The pools on the hdd OSDs store about 25GiB in total, which should use about 75GiB with a pool size of 3. WAL and DB are on separate devices.
However the outputs of ceph df and ceph osd df tell a different story:
# ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 19 TiB 18 TiB 775 GiB 782 GiB 3.98
# ceph osd df | egrep "(ID|hdd)"
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
8 hdd 2.72392 1.00000 2.8 TiB 111 GiB 10 GiB 111 KiB 1024 MiB 2.7 TiB 3.85 0.60 65 up
6 hdd 2.17914 1.00000 2.3 TiB 112 GiB 11 GiB 83 KiB 1024 MiB 2.2 TiB 4.82 0.75 58 up
3 hdd 2.72392 1.00000 2.8 TiB 114 GiB 13 GiB 71 KiB 1024 MiB 2.7 TiB 3.94 0.62 76 up
5 hdd 2.72392 1.00000 2.8 TiB 109 GiB 7.6 GiB 83 KiB 1024 MiB 2.7 TiB 3.76 0.59 63 up
4 hdd 2.72392 1.00000 2.8 TiB 112 GiB 11 GiB 55 KiB 1024 MiB 2.7 TiB 3.87 0.60 59 up
7 hdd 2.72392 1.00000 2.8 TiB 114 GiB 13 GiB 8 KiB 1024 MiB 2.7 TiB 3.93 0.61 66 up
2 hdd 2.72392 1.00000 2.8 TiB 111 GiB 9.9 GiB 78 KiB 1024 MiB 2.7 TiB 3.84 0.60 69 up
The sum of "DATA" is 75,5GiB which is what I am expecting to be used by the pools. How come the sum of "RAW USE" is 783GiB? More than 10x the size of the stored data. On my nvme osds the "RAW USE" to "DATA" overhead is <1%:
# ceph osd df | egrep "(ID|nvme)"
ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS
0 nvme 2.61989 1.00000 2.6 TiB 181 GiB 180 GiB 31 KiB 1.0 GiB 2.4 TiB 6.74 1.05 12 up
1 nvme 2.61989 1.00000 2.6 TiB 151 GiB 150 GiB 39 KiB 1024 MiB 2.5 TiB 5.62 0.88 10 up
13 nvme 2.61989 1.00000 2.6 TiB 239 GiB 238 GiB 55 KiB 1.0 GiB 2.4 TiB 8.89 1.39 16 up
-- truncated --
I am running ceph version 14.2.3 (0f776cf838a1ae3130b2b73dc26be9c95c6ccc39) nautilus (stable) which was upgraded recently from 13.2.1.
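In case it helps with diagnosis, I can also pull the raw BlueStore allocation counters from one of the hdd OSDs, e.g. (osd.8 taken from the list above):
# ceph daemon osd.8 perf dump | egrep '"bluestore_(allocated|stored)"'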
Any help is appreciated.
Best regards,
Georg
Ganesha itself has no dependencies on samba (and there aren't any on
my system, when I build). These must be being pulled in by something
else that Ganesha does use.
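If you want to see what is actually pulling them in on your CentOS 7 box, something like this should list the installed packages that require them (it only tests the removal, nothing is actually removed):

rpm -e --test samba-client-libs samba-common samba-common-libs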
Daniel
On Thu, Sep 26, 2019 at 11:21 AM Marc Roos <M.Roos(a)f1-outsourcing.eu> wrote:
>
>
> Is it really necessary to have these dependencies in nfs-ganesha 2.7?
>
> Dep-Install samba-client-libs-4.8.3-4.el7.x86_64 @CentOS7
> Dep-Install samba-common-4.8.3-4.el7.noarch @CentOS7
> Dep-Install samba-common-libs-4.8.3-4.el7.x86_64 @CentOS7
>
Hi,
We observed that up to 10 times the expected space is consumed when running an iozone write test with 200 concurrent files
against an erasure-coded (k=8, m=4) data pool mounted via ceph-fuse, but disk usage is normal if there is only one writing task.
Furthermore, everything is normal with a replicated data pool, no matter how many write operations
run at the same time.
Regards,
Dai
#iozone -s 100M -r 1M -i 0 -u 200 -l 200 -+n -w
#df -h /data01
ceph-fuse on /data01 type fuse.ceph-fuse (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other)
# du -sh /data01
801M /data01
#ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 357 TiB 356 TiB 60 GiB 132 GiB 0.04
ssd 1.3 TiB 1.3 TiB 2.6 GiB 5.6 GiB 0.42
TOTAL 358 TiB 358 TiB 62 GiB 137 GiB 0.04
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
meta_data01 1 166 MiB 64 509 MiB 0.04 423 GiB
data_data01 2 800 MiB 201 8.9 GiB 0 226 TiB
#ceph osd pool get data_data01 erasure_code_profile
erasure_code_profile: profile_data01
#ceph osd erasure-code-profile get profile_data01
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=8
m=4
plugin=jerasure
technique=reed_sol_van
w=8
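For reference, with k=8, m=4 I would expect USED to be about STORED x (k+m)/k = 800MiB x 1.5 ≈ 1.2GiB, but ceph df above shows 8.9GiB for the data pool, roughly 10x the stored data.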
Hello,
In my company, we currently have the following infrastructure:
- Ceph Luminous
- OpenStack Pike.
We have a cluster of 3 osd nodes with the following configuration:
- 1 x Xeon (R) D-2146NT CPU @ 2.30GHz
- 128GB RAM
- 128GB ROOT DISK
- 12 x 10TB SATA ST10000NM0146 (OSD)
- 1 x Intel Optane P4800X SSD DC 375GB (block.DB / block.wal)
- Ubuntu 16.04
- 2 X 10Gb network interface configured with lacp
The compute nodes have
- 4 x 10Gb network interfaces with lacp.
We also have 4 monitors with:
- 4 x 10Gb lacp network interfaces.
- The monitor nodes show approx. 90% CPU idle time, with 32GB / 256GB of RAM available
For each OSD disk we have created a 33GB partition for block.db and block.wal.
We have recently been facing a number of performance issues. Virtual machines
created in OpenStack are experiencing slow writes (approx. 50MB/s).
Monitoring of the OSD nodes shows an average of 20% CPU iowait time and 70% CPU
idle time.
Memory consumption is around 30%.
We have no latency issues (9ms average).
My question is whether what is happening may have to do with the amount of disk
space dedicated to DB/WAL. The Ceph documentation recommends
that the block.db size be no smaller than 4% of the block device.
In that case, for each disk in my environment, block.db should be no less
than 400GB per OSD.
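If it helps, I believe the actual DB usage per OSD (and whether it has spilled over onto the slow device) can be checked with something like the following; osd.0 is just an example:

ceph daemon osd.0 perf dump bluefs
(comparing db_used_bytes with db_total_bytes, and checking slow_used_bytes)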
Another question: if I configured my disks to keep block.db / block.wal on the
mechanical disks themselves, could that lead to a performance
degradation?
Best regards,
João Victor Rodrigues Soares