https://docs.ceph.com/en/reef/cephfs/createfs/ says:
> The data pool used to create the file system is the “default” data pool and the location for storing all inode backtrace information, which is used for hard link management and disaster recovery.
> For this reason, all CephFS inodes have at least one object in the default data pool. If erasure-coded pools are planned for file system data, it is best to configure the default as a replicated pool to improve small-object write and read performance when updating backtraces.
This poses the question:
Are normal replicated CephFS installations (metadata on SSDs, data on HDDs) set up with suboptimal performance because they don't do this?
If having inodes/backtraces on replicated instead of EC improves performance, shouldn't one expect that putting inodes/backtraces on SSD would improve it even more?
From the docs I also cannot really tell when inodes/backtraces become important.
Is that all the time, or only sometimes?
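For context, my understanding of the layout the docs recommend looks roughly like the sketch below (pool names, PG counts, the CRUSH rule and the directory path are placeholders I made up for illustration):
```
# replicated default data pool: holds the backtrace objects for every inode
ceph osd pool create cephfs_data_default 128 replicated
ceph osd pool create cephfs_metadata 64 replicated

# EC pool that will hold the bulk of the file data
ceph osd pool create cephfs_data_ec 128 erasure
ceph osd pool set cephfs_data_ec allow_ec_overwrites true

# create the filesystem with the replicated pool as the default data pool,
# then attach the EC pool as an additional data pool
ceph fs new cephfs cephfs_metadata cephfs_data_default
ceph fs add_data_pool cephfs cephfs_data_ec

# optionally pin the default data pool to SSDs via a device-class CRUSH rule
ceph osd crush rule create-replicated replicated_ssd default host ssd
ceph osd pool set cephfs_data_default crush_rule replicated_ssd

# direct file data for a directory tree at the EC pool via a file layout
setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/data
```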
Thanks!
Hi All,
So I've got a Ceph Reef Cluster (latest version) with a CephFS system set up with a number of directories on it.
On a Laptop (running Rocky Linux (latest version)) I've used fstab to mount a number of those directories - all good, everything works, happy happy joy joy! :-)
However, when the laptop goes into sleep or hibernate mode (i.e. when I close the lid) and then comes back out of sleep/hibernate (i.e. I open the lid), the CephFS mounts are "not present". The only way to get them back is to run `mount -a` as root or via sudo. This, as I'm sure you'll agree, is less than ideal - especially as this is a pilot project for non-admin users (i.e. they won't have access to the root account or sudo on their own (corporate) laptops).
So, my question to the combined wisdom of the Community is what's the best way to resolve this issue?
I've looked at autofs, and even tried (half-heartedly - it was late, and I wanted to go home :-) ) to get this running, but I'm not sure if this is the best way to resolve things.
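For reference, the other approach I've been considering is to let systemd automount the shares on first access instead of mounting them at boot - something like the fstab line below (monitor addresses, filesystem path, CephX user and secret file are all placeholders):
```
# /etc/fstab - CephFS kernel mount, mounted on demand by systemd
192.168.1.10,192.168.1.11,192.168.1.12:/projects  /mnt/projects  ceph  name=laptopuser,secretfile=/etc/ceph/laptopuser.secret,noauto,x-systemd.automount,x-systemd.idle-timeout=60,_netdev  0  0
```
The idea being that the mount drops away when idle and is transparently re-established on the next access, so nothing needs to be run by hand after a resume - but I'm not sure whether that or autofs is the more robust option.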
All help and advice on this is greatly appreciated - thanks in advance.
Cheers
Dulux-Oz
Hello all.
I have a cluster with ~80 TB of spinning disk. Its primary role is CephFS. Recently I had a multiple-drive failure (it was not simultaneous), and it's left me with 20 incomplete PGs.
I know this data is toast, but I need to be able to get what isn't toast out of the CephFS - well, out of that pool and into a new pool.
The issue is that the incomplete PGs block I/O, and that hinders browsing the filesystem.
I'm attempting to use the "new" ceph-objectstore-tool mark-complete operation, but I'm struggling to work out what to mark complete, given that the pool is EC and each PG is made up of multiple shards (I think that's the right word), each with its own status.
I did manage to mark one of these shard PGs complete on what appeared to be the primary OSD, but it had no effect on that shard when checking it with `ceph pg X query` - by that I mean the shard was marked incomplete both before and after using ceph-objectstore-tool.
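For reference, what I ran looked roughly like this (the OSD id and PG/shard id below are examples, not my real ones):
```
# stop the OSD that appears to be primary for the shard
systemctl stop ceph-osd@12

# inspect the shard, then mark it complete
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 5.1fs0 --op info
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 5.1fs0 --op mark-complete

systemctl start ceph-osd@12
```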
I'm running Ceph 18.2.2 with all-HDD OSDs. I can get any logs that will help; I'm just not 100% sure where to start, and I don't want to just dump the `ceph pg X query` output for 20 PGs on the mailing list.
Thanks so much
Mal
Hello,
We have a question about mClock scheduling reads on pacific (16.2.14 currently).
When we do massive reads - for example when draining machines that hold a lot of data on EC pools - we quite frequently observe slow ops on the source OSDs. Those slow ops affect the client services that talk directly to RADOS. If we kill the OSD causing the slow ops, the recovery stays at more or less the same speed, but there are no more slow ops.
And when we tweak mClock, limiting on the source OSDs has no observable effect; however, if we limit on the target OSDs, the global speed slows down and the slow ops disappear.
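(For reference, the kind of tweak we applied - the option names are the standard mClock settings, but the OSD id and values below are only illustrative:)
```
# switch the target OSD to the custom mClock profile so its limits can be overridden
ceph config set osd.42 osd_mclock_profile custom

# cap background recovery/backfill IOPS and favour client ops on that OSD
ceph config set osd.42 osd_mclock_scheduler_background_recovery_lim 128
ceph config set osd.42 osd_mclock_scheduler_client_wgt 4
```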
So our question is: does mClock take reads into account as well as writes, or are reads calculated to be less expensive than writes?
Thanks,
Luis Domingues
Proton AG
Hi,
We're wondering what we are missing in our netplan configuration on Ubuntu for Ceph to tolerate a link failure properly.
We are using this bond configuration on Ubuntu 20.04 with Octopus:
bond1:
  macaddress: x.x.x.x.x.50
  dhcp4: no
  dhcp6: no
  addresses:
    - 192.168.199.7/24
  interfaces:
    - ens2f0np0
    - ens2f1np1
  mtu: 9000
  parameters:
    mii-monitor-interval: 100
    mode: 802.3ad
    lacp-rate: fast
    transmit-hash-policy: layer3+4
ens2f1np1 failed and caused slow ops, all OSDs down ... = disaster.
Any idea what is wrong with this bond config?
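In case it helps with the diagnosis, the bond state can be inspected with the usual (non-Ceph) tools, e.g.:
```
# check which links negotiated LACP and which one is currently active
cat /proc/net/bonding/bond1

# per-interface link state and error counters
ip -s link show ens2f1np1
```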
Thank you
(adding the list back to the thread)
On Wednesday, March 27, 2024 12:54:34 PM EDT Daniel Brown wrote:
> John
>
>
> I got curious and was taking another quick look through the python script
> for cephadm.
>
That's always welcome. :-D
> This is probably too simple of a question to be asking — or maybe I should
> say, I’m not expecting that there’s a simple answer to what might seem like
> a simple question -
>
> Is there anything that notifies the cluster, or the other hosts in a
> cluster, when a host is going into maintenance mode that it is going into
> maintenance mode, or is cephadm just doing systemctl commands behind the
> scenes to stop and later restart the appropriate ceph containers locally on
> that host?
>
> Maybe a better way to say it would be - what is differentiating between
> maintenance mode and a host simply crashing or going offline?
I'll paraphrase Adam King, tech lead for cephadm here:
If one runs the command from the cephadm binary directly, it will only disable/stop the systemd target. The intention is for users to use the `ceph orch host maintenance` ... commands.
When you use the orch command (quoting Adam here):
```
when we put something into maintenance mode we
1) disable and stop the systemd target for the daemons on the host
2) set the noout flag for all the OSDs on that host
3) internally to cephadm mark the host as having a status of "maintenance"
which has some effects such as us not refreshing metadata on that host or
attempting to place/remove daemons from there
The main difference from that to a host going offline is the noout flag for the
OSDs, and that cephadm will not periodically try to check if the host is
alive, as it would do for an offline host.
I believe the noout flag stops it from trying to migrate all the data on that
OSDs to other OSDs as it shouldn't be necessary if they will be coming back
```
The `cephadm host-maintenance enter` command is meant to be a component of the `ceph orch host maintenance` workflow. It still has a bug - the way it always exits with an error is wrong - but you may not want to use it directly anyway.
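For completeness, the orchestrator-driven workflow is just (hostname is a placeholder):
```
# put the host into maintenance: stops its daemons and sets noout for its OSDs
ceph orch host maintenance enter host01

# ... do the maintenance work, then bring it back ...
ceph orch host maintenance exit host01
```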
Reference links:
https://docs.ceph.com/en/latest/cephadm/host-management/#maintenance-mode
https://docs.ceph.com/en/latest/dev/cephadm/host-maintenance/
> > On Mar 22, 2024, at 6:26 AM, Daniel Brown <daniel.h.brown(a)thermify.cloud>
> > wrote:
> >
> >
> > Looks like it got OK’ed. I’ll put in something today.
> >
> >
> > --
> > Dan Brown
> >
> >> On Mar 21, 2024, at 13:44, John Mulligan <phlogistonjohn(a)asynchrono.us>
> >> wrote:
> >> On Thursday, March 21, 2024 11:43:19 AM EDT Daniel Brown wrote:
> >>> Assuming I need admin approval to report this on tracker, how long does
> >>> it
> >>> take to get approved?? Signed up a couple days ago, but still seeing
> >>> “Your
> >>> account was created and is now pending administrator approval.”
> >>
> >> That's unfortunate. I pinged about your issue signing up on the ceph
> >> slack
> >> channel for infrastructure. Hopefully, that'll get somebody's attention.
> >> If
> >> you don't get access by tomorrow feel free to ping me again directly and
> >> then *I'll* file the issue for you instead of having you wait around
> >> more.
Hey,
I'm running ceph version 18.2.1 (reef) but this problem must have existed a
long time before reef.
The documentation says the autoscaler will target 100 PGs per OSD, but I'm only seeing ~10. My erasure coding is a stripe of 6 data + 3 parity chunks. Could that be the reason, i.e. are the PG numbers for that EC pool effectively multiplied by k+m in the autoscaler's calculations?
Is backfill_toofull calculated by checking the total size of the PG against every OSD it is destined for? In my case I have ~1 TiB PGs because the autoscaler is creating only 10 per host, and backfill_toofull is then triggered because one of my OSDs only has ~500 GiB free - although that doesn't quite add up either, because two ~1 TiB PGs that have OSD 1 in them are currently backfilling. My backfill full ratio is set to 97%.
Would it be correct for me to change the autoscaler to target ~700 PGs per OSD, and to set the bias for storagefs and all EC pools to k+m? Should that be the default, or the value the documentation recommends?
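(For reference, the knobs I'm thinking of touching - the values below are only illustrative:)
```
# raise the cluster-wide PG-per-OSD target used by the autoscaler
ceph config set global mon_target_pg_per_osd 200

# give the EC data pool a higher bias so the autoscaler chooses more PGs for it
ceph osd pool set storagefs pg_autoscale_bias 4
```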
How scary is changing PG_NUM while misplaced PGs are backfilling? It seems like there's a chance the backfill might succeed, so I think I can wait.
Any help is greatly appreciated, I've tried to include as much of the
relevant debugging output as I can think of.
Daniel
# ceph osd ls | wc -l
44
# ceph pg ls | wc -l
484
# ceph osd pool autoscale-status
POOL                   SIZE    TARGET SIZE  RATE   RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
.rgw.root              216.0k               3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.control    0                    3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.meta       0                    3.0    480.2T        0.0000                                 1.0   32                  on         False
default.rgw.log        1636k                3.0    480.2T        0.0000                                 1.0   32                  on         False
storagefs              233.5T               1.5    480.2T        0.7294                                 1.0   256                 on         False
storagefs-meta         850.2M               4.0    480.2T        0.0000                                 4.0   32                  on         False
storagefs_wide         355.3G               1.375  480.2T        0.0010                                 1.0   32                  on         False
.mgr                   457.3M               3.0    480.2T        0.0000                                 1.0   1                   on         False
mgr-backup-2022-08-19  370.6M               3.0    480.2T        0.0000                                 1.0   32                  on         False
# ceph osd pool ls detail | column -t
pool 15 '.rgw.root'              replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 16 'default.rgw.control'    replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 17 'default.rgw.meta'       replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 18 'default.rgw.log'        replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 36 'storagefs'              erasure profile 6.3  size 9   min_size 7  crush_rule 2  object_hash rjenkins  pg_num 256  pgp_num 256  autoscale_mode on
pool 37 'storagefs-meta'         replicated           size 4   min_size 1  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 45 'storagefs_wide'         erasure profile 8.3  size 11  min_size 9  crush_rule 8  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
pool 46 '.mgr'                   replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 1    pgp_num 1    autoscale_mode on
pool 48 'mgr-backup-2022-08-19'  replicated           size 3   min_size 2  crush_rule 0  object_hash rjenkins  pg_num 32   pgp_num 32   autoscale_mode on
# ceph osd erasure-code-profile get 6.3
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=6
m=3
plugin=jerasure
technique=reed_sol_van
w=8
# ceph pg ls | awk 'NR==1 || /backfill_toofull/' | awk '{print $1" "$2" "$4" "$6" "$11" "$15" "$16}' | column -t
PG     OBJECTS  MISPLACED  BYTES         STATE                             UP                              ACTING
36.f   222077   141392     953817797727  active+remapped+backfill_toofull  [1,27,41,8,36,17,14,40,32]p1    [33,32,29,23,16,17,28,1,14]p33
36.5c  221761   147015     950692130045  active+remapped+backfill_toofull  [26,27,40,29,1,37,39,11,42]p26  [12,24,4,2,31,25,17,33,8]p12
36.60  222710   0          957109050809  active+remapped+backfill_toofull  [41,34,22,3,1,35,9,39,29]p41    [2,34,22,3,27,32,28,24,1]p2
36.6b  222202   427168     953843892012  active+remapped+backfill_toofull  [20,15,7,21,37,1,38,17,32]p20   [7,2,32,26,5,35,24,17,23]p7
36.74  222681   777546     957679960067  active+remapped+backfill_toofull  [42,24,12,34,38,10,27,1,25]p42  [34,33,12,0,19,14,17,30,25]p34
36.7b  222974   1560818    957691042940  active+remapped+backfill_toofull  [2,35,27,1,20,18,19,12,8]p2     [31,23,21,24,35,18,19,33,25]p31
36.82  222362   1998670    954507657022  active+remapped+backfill_toofull  [37,22,1,38,11,23,27,32,33]p37  [27,33,0,32,5,25,20,13,15]p27
36.b5  221676   1330056    953443725830  active+remapped+backfill_toofull  [6,8,38,12,21,1,39,34,27]p6     [33,8,26,12,3,10,22,34,1]p33
36.b6  222669   1335327    956973704883  active+remapped+backfill_toofull  [11,13,41,4,12,34,29,6,1]p11    [2,29,34,4,12,9,15,6,28]p2
36.e0  221518   1772144    952581426388  active+remapped+backfill_toofull  [1,27,21,31,30,23,37,13,28]p1   [25,21,14,31,1,2,34,17,24]p25
# ceph pg ls | awk 'NR==1 || /backfilling/' | grep -e BYTES -e '\[1' -e ',1,' -e '1\]' | awk '{print $1" "$2" "$4" "$6" "$11" "$15" "$16}' | column -t
PG     OBJECTS  MISPLACED  BYTES         STATE                        UP                              ACTING
36.4a  221508   89144      951346455917  active+remapped+backfilling  [40,43,33,32,30,38,22,35,9]p40  [27,10,20,7,30,21,1,28,31]p27
36.79  222315   1111575    955797107713  active+remapped+backfilling  [1,36,31,33,25,23,14,3,13]p1    [27,6,31,23,25,5,14,29,13]p27
36.8d  222229   1284156    955234423342  active+remapped+backfilling  [35,34,27,37,38,36,43,3,16]p35  [35,34,15,26,1,11,27,18,16]p35
36.ba  222039   0          952547107971  active+remapped+backfilling  [0,40,33,23,41,4,27,22,28]p0    [0,35,33,27,1,3,30,22,28]p0
36.da  221607   277464     951599928383  active+remapped+backfilling  [21,31,8,9,11,25,36,23,28]p21   [0,10,1,22,33,11,35,15,28]p0
36.db  221685   58816      951420054091  active+remapped+backfilling  [3,28,12,13,1,38,40,35,43]p3    [27,20,17,21,1,23,28,24,31]p27
# ceph osd df | sort -nk 17 | tail -n 5
21  hdd  9.09598  1.00000  9.1 TiB  7.7 TiB  7.7 TiB  0 B     31 GiB  1.4 TiB   84.62  1.16  68  up
24  hdd  9.09598  1.00000  9.1 TiB  7.7 TiB  7.7 TiB  1 KiB   25 GiB  1.4 TiB   84.98  1.16  69  up
29  hdd  9.09569  1.00000  9.1 TiB  8.0 TiB  8.0 TiB  72 MiB  23 GiB  1.1 TiB   88.42  1.21  73  up
13  hdd  9.09569  1.00000  9.1 TiB  8.1 TiB  8.1 TiB  1 KiB   22 GiB  1023 GiB  89.02  1.22  76  up
 1  hdd  7.27698  1.00000  7.3 TiB  6.8 TiB  6.8 TiB  27 MiB  18 GiB  451 GiB   93.94  1.28  64  up
# cat /etc/ceph/ceph.conf | grep full
mon_osd_full_ratio = .98
mon_osd_nearfull_ratio = .96
mon_osd_backfillfull_ratio = .97
osd_backfill_full_ratio = .97
osd_failsafe_full_ratio = .99
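(For completeness, my understanding is that on current releases the ratios actually in effect live in the OSDMap rather than ceph.conf; they can be checked and adjusted at runtime like this:)
```
# show the full/backfillfull/nearfull ratios currently stored in the OSDMap
ceph osd dump | grep ratio

# adjust the backfillfull ratio at runtime (value is just my current setting)
ceph osd set-backfillfull-ratio 0.97
```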
Hi team,
I am new to Ceph and I am looking to monitor the per-user/per-bucket usage for Ceph, as per the following link:
https://docs.ceph.com/en/latest/radosgw/metrics/
But when I enabled it using the command:
'ceph config set client.rgw CONFIG_VARIABLE VALUE'
I could only see the following perf schema:
```
"rgw": {
"req": {
"type": 10,
"metric_type": "counter",
"value_type": "integer",
"description": "Requests",
"nick": "",
"priority": 5,
"units": "none"
},
"failed_req": {
"type": 10,
"metric_type": "counter",
"value_type": "integer",
"description": "Aborted requests",
"nick": "",
"priority": 5,
"units": "none"
xxxxxxxxxxxxxxxxxxxSNIPxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
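(For reference, this is roughly how I inspected the counters - the admin socket path is just an example, adjust it to your RGW instance:)
```
# dump the perf counter schema from the running RGW daemon
ceph daemon /var/run/ceph/ceph-client.rgw.myrgw.asok perf schema

# on 18.2.x the labeled counters, if present, should also show up via
ceph daemon /var/run/ceph/ceph-client.rgw.myrgw.asok counter dump
```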
But as per the link we should have also gotten the following metrics:
```
"rgw_op": [
{
"labels": {},
"counters": {
"put_obj_ops": 2,
"put_obj_bytes": 5327,
"put_obj_lat": {
"avgcount": 2,
"sum": 2.818064835,
"avgtime": 1.409032417
},
"get_obj_ops": 5,
"get_obj_bytes": 5325,
"get_obj_lat": {
"avgcount": 2,
"sum": 0.003000069,
"avgtime": 0.001500034
},
...
"list_buckets_ops": 1,
"list_buckets_lat": {
"avgcount": 1,
"sum": 0.002300000,
"avgtime": 0.002300000
}
}
},
]
```
But as per the following links:
https://github.com/ceph/ceph/blob/v19.0.0/src/rgw/rgw_perf_counters.cc
https://github.com/ceph/ceph/blob/v18.2.2/src/rgw/rgw_perf_counters.cc
I don't think this feature is currently supported.
Could anyone please help me with this?
Ceph version being used by us: 18.2.0 (Reef) / 18.2.2
Thanks and Regards,
Kushagra Gupta