Quick question, Ceph gurus.
For a 1.1PB raw cephfs system currently storing 191TB of data and 390 million objects (mostly small Python and ML training files, etc.), how many MDS servers should I be running?
System is Nautilus 14.2.8.
I ask because up to now I have run one MDS with one standby-replay, and occasionally it blows up with large memory consumption, 60GB+, even though I have mds_cache_memory_limit = 32G (that was 16G until recently). It of course tries to restart on another MDS node, fails again, and after several attempts usually comes back up. Today I increased to two active MDSs, but the question is: what is the optimal number for a pretty active system? The single MDS seemed to regularly run around 1400 req/s, and I often get up to six clients failing to respond to cache pressure.
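For context, the switch to two actives was just the usual max_mds bump. If it helps anyone, pinning busy top-level directories to a rank to spread load looks roughly like this (the path is a placeholder, not our real tree):

ceph fs set cephfs max_mds 2
setfattr -n ceph.dir.pin -v 1 /mnt/cephfs/ml_training    # pin this subtree to rank 1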
The current setup is:
ceph fs status
cephfs - 71 clients
======
+------+----------------+--------+---------------+-------+-------+
| Rank |     State      |  MDS   |    Activity   |  dns  |  inos |
+------+----------------+--------+---------------+-------+-------+
|  0   |     active     |   a    | Reqs:  447 /s | 12.0M | 11.9M |
|  1   |     active     |   b    | Reqs:  154 /s | 1749k | 1686k |
| 1-s  | standby-replay |   c    | Evts:  136 /s | 1440k | 1423k |
| 0-s  | standby-replay |   d    | Evts:  402 /s | 16.8k |  298  |
+------+----------------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
|       Pool      |   type   |  used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata |  160G |  169G |
|   cephfs_data   |   data   |  574T |  140T |
+-----------------+----------+-------+-------+
+-------------+
| Standby MDS |
+-------------+
| w |
| x |
| y |
| z |
+-------------+
MDS version: ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)
Regards.
Robert Ruge
Systems & Network Manager
Faculty of Science, Engineering & Built Environment
This is the second time this has happened in a couple of weeks. The MDS locks
up and the standby can't take over, so the Monitors blacklist them. I try
to un-blacklist them, but they still say this in the logs:
mds.0.1184394 waiting for osdmap 234947 (which blacklists prior instance)
Looking at a pg dump, it looks like the epoch is past that.
$ ceph pg map 3.756
osdmap e234953 pg 3.756 (3.756) -> up [113,180,115] acting [113,180,115]
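For reference, the un-blacklisting attempt was roughly the following (the address is a placeholder for the old MDS instance):

ceph osd blacklist ls                       # list current blacklist entries
ceph osd blacklist rm 10.0.0.1:6800/12345   # remove the prior MDS instance's entry
ceph osd dump | grep blacklist              # confirm the entry left the osdmap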
Last time, it seemed to just recover after about an hour all by itself.
Any way to speed this up?
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
The Nautilus manual recommends a >= 4.14 kernel for multiple active
MDSes. What are the potential issues of running the 4.4 kernel with
multiple MDSes? We are in the process of upgrading the clients, but at
times we overrun the capacity of a single MDS server.
MULTIPLE ACTIVE METADATA SERVERS
<https://docs.ceph.com/docs/nautilus/cephfs/kernel-features/#multiple-active…>
The feature has been supported since the Luminous release. It is
recommended to use Linux kernel clients >= 4.14 when there are multiple
active MDS.
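For what it's worth, one way to see what the connected kernel clients actually report is the features summary from the monitors, which groups clients by release and feature bits:

ceph features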
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hi
I have a few questions about bucket versioning.
In the output of the command "radosgw-admin bucket stats --bucket=XXX" there
is info about versions:
"ver": "0#521391,1#516042,2#518098,3#517681,4#518423",
"master_ver": "0#0,1#0,2#0,3#0,4#0",
Also "*metadata get"* returns info about versions:
radosgw-admin metadata get bucket:XXX
{
    "key": "bucket:XXX",
    "ver": {
        "tag": "_KrvQc6gBg1Zcrr8s8M5jXmk",
        "ver": 335
    },
But I'm pretty sure that bucket versioning should be disabled, because "aws
s3api get-bucket-versioning" returns nothing.
How should I understand the current situation?
The problem is that from the client side I can see that the bucket is very
small, less than 10GB, while checking the bucket stats from the radosgw-admin
side shows the bucket is taking nearly 1TB.
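So far I plan to check for leftover incomplete multipart uploads and for a
stale bucket index, roughly like this:

aws s3api list-multipart-uploads --bucket XXX   # incomplete uploads still consume space
radosgw-admin bucket check --bucket=XXX         # report index inconsistencies
radosgw-admin bucket check --bucket=XXX --check-objects --fix   # only after reviewing the report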
Kind regards / Pozdrawiam,
Katarzyna Myrek
Hi Manuel,
My replica is 2, hence about 10TB of unaccounted usage.
Andrei
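P.S. Before resorting to the copy, I will also check whether RGW garbage
collection has simply fallen behind, since pending gc entries can hold a lot
of space:

radosgw-admin gc list --include-all | head   # check for a large backlog
radosgw-admin gc process                     # force processing of pending entries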
----- Original Message -----
> From: "EDH - Manuel Rios" <mriosfer(a)easydatahost.com>
> To: "Andrei Mikhailovsky" <andrei(a)arhont.com>
> Sent: Tuesday, 28 April, 2020 23:57:20
> Subject: RE: rados buckets copy
> Is your replica x3? 9 x 3 = 27... plus some overhead, rounded...
>
> Ceph df shows usage including replicas; bucket stats shows just the bucket usage, no replicas.
>
> -----Mensaje original-----
> De: Andrei Mikhailovsky <andrei(a)arhont.com>
> Enviado el: miércoles, 29 de abril de 2020 0:55
> Para: ceph-users <ceph-users(a)ceph.io>
> Asunto: [ceph-users] rados buckets copy
>
> Hello,
>
> I have a problem with radosgw service where the actual disk usage (ceph df shows
> 28TB usage) is way more than reported by the radosgw-admin bucket stats (9TB
> usage). I have tried to get to the end of the problem, but no one seems to be
> able to help. As a last resort I will attempt to copy the buckets, rename them
> and remove the old buckets.
>
> What is the best way of doing this (probably on a high level) so that the copy
> process doesn't carry on the wasted space to the new buckets?
>
> Cheers
>
> Andrei
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to
> ceph-users-leave(a)ceph.io
Dear all,
Two days ago I added a few disks to a ceph cluster and ran into a problem I have never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I added OSDs under 13.2.8.
I had a few hosts that I needed to add 1 or 2 OSDs to and I started with one that needed 1. Procedure was as usual:
ceph osd set norebalance
deploy additional OSD
The OSD came up and PGs started peering, so far so good. To my surprise, however, I started seeing health-warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like it got better and I waited it out until the messages were gone. This took a really long time, at least 5-10 minutes.
I went on to the next host and deployed 2 new OSDs this time. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose. I got health errors with the dreaded "backfill_toofull", undersized PGs and a large amount of degraded objects. I don't know what is causing what, but I ended up with data loss by just adding 2 disks.
We have dedicated network hardware and each of the OSD hosts has 20GBit front and 40GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server. The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load, everything below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help avoid the heartbeat time-out (full sequence sketched after question 3):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
2) The "lost" shards of the degraded objects were obviously still on the cluster somewhere. Is there any way to force the cluster to rescan OSDs for the shards that went orphan during the incident?
3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk the data integrity of a production cluster and therefore probably cannot run the original procedure again.
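Regarding question 1, the full sequence I have in mind is below; the set/unset
commands are standard, the ordering is my own assumption:

ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
# deploy the new OSDs and wait for peering to settle
ceph osd unset nodown
ceph osd unset noout
ceph osd unset norebalance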
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I saw there was a clone_range function in librados earlier, but it was removed in version 12, I believe. I need exactly that function to avoid unnecessary network traffic.
I need to combine many small objects into one, so clone_range would be really useful for me. I can read from an object and write to another, but this causes unnecessary network traffic.
How can I do this in new versions of librados?
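In case it clarifies the question, the client-side fallback I can do today is
roughly the following with the rados CLI (pool and object names are
placeholders); it works, but pulls every object over the network, which is
exactly what I want to avoid:

for obj in small_obj_1 small_obj_2 small_obj_3; do
    rados -p mypool get "$obj" /tmp/part           # read the small object to the client
    rados -p mypool append combined_obj /tmp/part  # append it to the combined object
done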
Hello,
Is there a way to get read/write I/O statistics for each rbd device for
each mapping?
For example, when an application uses one of the volumes, I would like to
find out what performance (avg read/write bandwidth, IOPS, etc.) that
application observed on a given volume. Is that possible?
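The two angles I have found so far, though I have not verified either fits
this case, are the Nautilus per-image perf counters and plain OS statistics on
a kernel-mapped device:

rbd perf image iostat <pool>   # per-image IOPS/throughput (needs the rbd_support mgr module)
iostat -x /dev/rbd0            # OS-level stats for a mapped device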
Thanks,
Shridhar
Hi,
I have added a new fast_data pool to cephfs and fixed the auth caps, e.g.
client.f9wn
key: ........
caps: [mds] allow rw
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs_data, allow rw pool=fast_data
but the client with kernel-mounted cephfs reports an error when trying
to read or write, e.g.
f9nd003 ~ # echo a > /ceph/grid/cache/a
-bash: echo: write error: Operation not permitted
mount is:
cat /proc/mounts |grep ceph
monitors...:/ /ceph ceph rw,relatime,name=f9wn,secret=<hidden>,acl 0 0
It seems that only umount + mount solves this issue; the kernel is vanilla
4.19.60.
Is there any way to force new auth capabilities to propagate without
remounting the fs?
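One thing I considered but have not tried is evicting the client's MDS session
so that it re-authenticates (mds name and session id below are placeholders);
as far as I understand, eviction blacklists the client by default, which may
be worse than a remount:

ceph daemon mds.a session ls              # on the MDS host: find the session id for client f9wn
ceph daemon mds.a session evict id=12345  # drop that session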
Thanks,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic(a)ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-425-7074
-------------------------------------------------------------
Hi all,
I am trying to set up an active-active NFS Ganesha cluster (with two Ganeshas (v3.0) running in Docker containers). I managed to get two Ganesha daemons running using the rados_cluster backend for active-active deployment. I have the grace db within the cephfs metadata pool, in its own namespace, which keeps track of the node status.
Now, I can mount the exposed filesystem over NFS (v4.1, v4.2) with both daemons. So far so good.
Testing high availability resulted in unexpected behavior, and I am not sure whether it is intentional or a configuration problem.
Problem:
If both are running, no E or N flags are set within the grace db, as I expect. Once one host goes down (or is taken down), ALL clients can neither read nor write on the mounted filesystem, even the clients which are not connected to the dead Ganesha. In the db, I see that the dead Ganesha has state NE and the active one has E. This state is what I expect from the Ganesha documentation. Nevertheless, I would assume that the clients connected to the active daemon are not blocked. This state is not cleaned up by itself (e.g. after the grace period).
I can unlock this situation by 'lifting' the dead node with a direct db call (using the ganesha-rados-grace tool), but within an active-active deployment this is not suitable.
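For completeness, the manual unlock looks roughly like this, using the pool
and namespace from my config below, with the dead node's id as the argument:

ganesha-rados-grace --pool cephfsmetadata --ns grace dump        # show the current grace db state
ganesha-rados-grace --pool cephfsmetadata --ns grace lift NODEID # lift the dead node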
The ganesha config looks like:
------------
NFS_CORE_PARAM
{
    Enable_NLM = false;
    Protocols = 4;
}
NFSv4
{
    RecoveryBackend = rados_cluster;
    Minor_Versions = 1,2;
}
RADOS_KV
{
    pool = "cephfsmetadata";
    nodeid = "a";
    namespace = "grace";
    UserId = "ganesha";
    Ceph_Conf = "/etc/ceph/ceph.conf";
}
MDCACHE {
    Dir_Chunk = 0;
    NParts = 1;
    Cache_Size = 1;
}
EXPORT
{
    Export_ID = 101;
    Protocols = 4;
    Transports = TCP;
    Path = PATH;
    Pseudo = PSEUDO_PATH;
    Access_Type = RW;
    Attr_Expiration_Time = 0;
    Squash = no_root_squash;
    FSAL {
        Name = CEPH;
        User_Id = "ganesha";
        Secret_Access_Key = CEPHXKEY;
    }
}
LOG {
    Default_Log_Level = "FULL_DEBUG";
}
------------
Does anyone have similar problems? Or, if this behavior is intentional, can you explain to me why this is the case?
Thank you in advance for your time and thoughts.
Kind regards,
Michael