The Nautilus manual recommends a >= 4.14 kernel for multiple active
MDSes. What are the potential issues with running the 4.4 kernel with
multiple MDSes? We are in the process of upgrading the clients, but at
times we overrun the capacity of a single MDS server.
MULTIPLE ACTIVE METADATA SERVERS
<https://docs.ceph.com/docs/nautilus/cephfs/kernel-features/#multiple-active…>
The feature has been supported since the Luminous release. It is
recommended to use Linux kernel clients >= 4.14 when there are multiple
active MDS.
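For reference, this is roughly how we check which client releases are
connected and how the second active MDS gets enabled (the fs name is a
placeholder):
# list connected clients grouped by release/feature bits
ceph features
# allow a second active MDS rank
ceph fs set cephfs max_mds 2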
Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Hi
I have a few questions about bucket versioning.
In the output of the command "radosgw-admin bucket stats --bucket=XXX" there
is info about versions:
"ver": "0#521391,1#516042,2#518098,3#517681,4#518423",
"master_ver": "0#0,1#0,2#0,3#0,4#0",
Also "*metadata get"* returns info about versions:
radosgw-admin metadata get bucket:XXX
{
    "key": "bucket:XXX",
    "ver": {
        "tag": "_KrvQc6gBg1Zcrr8s8M5jXmk",
        "ver": 335
    },
But I'm pretty sure that bucket versioning should be disabled, because "aws
s3api get-bucket-versioning" returns nothing.
How should I understand the current situation?
The problem is that from the client side the bucket looks very small (less
than 10GB), while the bucket stats on the radosgw-admin side show it taking
nearly 1TB.
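To rule out stale index entries as the cause, I am considering a bucket
index check (syntax as I understand it from the radosgw-admin docs; please
correct me if this is wrong):
# report inconsistencies first
radosgw-admin bucket check --bucket=XXX
# then recalculate stats and repair the index
radosgw-admin bucket check --bucket=XXX --check-objects --fix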
Kind regards / Pozdrawiam,
Katarzyna Myrek
Hi Manuel,
My replica is 2, hence about 10TB of unaccounted usage.
Andrei
----- Original Message -----
> From: "EDH - Manuel Rios" <mriosfer(a)easydatahost.com>
> To: "Andrei Mikhailovsky" <andrei(a)arhont.com>
> Sent: Tuesday, 28 April, 2020 23:57:20
> Subject: RE: rados buckets copy
> Is your replica x3? 9x3 = 27... plus some overhead, rounded...
>
> Ceph df shows usage including replicas; bucket stats shows just the bucket usage, no replicas.
>
> -----Original Message-----
> From: Andrei Mikhailovsky <andrei(a)arhont.com>
> Sent: Wednesday, 29 April 2020 0:55
> To: ceph-users <ceph-users(a)ceph.io>
> Subject: [ceph-users] rados buckets copy
>
> Hello,
>
> I have a problem with radosgw service where the actual disk usage (ceph df shows
> 28TB usage) is way more than reported by the radosgw-admin bucket stats (9TB
> usage). I have tried to get to the bottom of the problem, but no one seems to be
> able to help. As a last resort I will attempt to copy the buckets, rename them
> and remove the old buckets.
>
> What is the best way of doing this (probably on a high level) so that the copy
> process doesn't carry the wasted space over to the new buckets?
>
> Cheers
>
> Andrei
> _______________________________________________
> ceph-users mailing list -- ceph-users(a)ceph.io
> To unsubscribe send an email to ceph-users-leave(a)ceph.io
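PS. Before resorting to the copy, it may be worth checking whether a
garbage-collection backlog or orphaned objects account for the gap; a
sketch of what I have in mind (the data pool name and job id are
placeholders, and orphans find can be expensive on a large cluster):
# is there a GC backlog holding space?
radosgw-admin gc list --include-all | head
radosgw-admin gc process
# scan for leaked objects no longer referenced by any bucket index
radosgw-admin orphans find --pool=default.rgw.buckets.data --job-id=orphans1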
Dear all,
Two days ago I added a few disks to a ceph cluster and ran into a problem I had never seen before when doing that. The entire cluster was deployed with mimic 13.2.2 and recently upgraded to 13.2.8. This is the first time I added OSDs under 13.2.8.
I had a few hosts that I needed to add 1 or 2 OSDs to, and I started with one that needed 1. The procedure was as usual:
ceph osd set norebalance
deploy additional OSD
The OSD came up and PGs started peering; so far so good. To my surprise, however, I started seeing health warnings about slow ping times:
Long heartbeat ping times on back interface seen, longest is 1171.910 msec
Long heartbeat ping times on front interface seen, longest is 1180.764 msec
After peering it looked like things got better, and I waited until the messages were gone. This took a really long time, at least 5-10 minutes.
I went on to the next host and this time deployed 2 new OSDs. Same as above, but with much worse consequences. Apparently, the ping times exceeded a timeout for a very short moment and an OSD was marked out for ca. 2 seconds. Now all hell broke loose: I got health errors with the dreaded "backfill_toofull", undersized PGs and a large number of degraded objects. I don't know what is causing what, but I ended up with data loss by just adding 2 disks.
We have dedicated network hardware and each of the OSD hosts has 20GBit front and 40GBit back network capacity (LACP trunking). There are currently no more than 16 disks per server. The disks were added to an SSD pool. There was no traffic nor any other exceptional load on the system. I have ganglia resource monitoring on all nodes and cannot see a single curve going up. Network, CPU utilisation, load, everything below measurement accuracy. The hosts and network are quite overpowered and dimensioned to host many more OSDs (in future expansions).
I have three questions, ordered by how urgently I need an answer:
1) I need to add more disks next week and need a workaround. Will something like this help avoid the heartbeat time-out (the full sequence is sketched after question 3):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
2) The "lost" shards of the degraded objects were obviously still on the cluster somewhere. Is there any way to force the cluster to rescan OSDs for the shards that went orphan during the incident?
3) This smells a bit like a bug that requires attention. I was probably just lucky that I only lost 1 shard per PG. Has something similar been reported before? Is this fixed in 13.2.10? Is it something new? Any settings that need to be looked at? If logs need to be collected, I can do so during my next attempt. However, I cannot risk the data integrity of a production cluster and will, therefore, probably not run the original procedure again.
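To make question 1 concrete, the full sequence I have in mind is the
following (flags to be unset again once peering has settled; whether
nodown is safe here is part of my question):
ceph osd set noout
ceph osd set nodown
ceph osd set norebalance
# ... deploy the new OSD(s) and wait for peering to settle ...
ceph osd unset nodown
ceph osd unset norebalance
ceph osd unset noout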
Many thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
Hello,
I saw there was a clone_range function in librados earlier, but it was removed in version 12, I believe. I need exactly that function to avoid unnecessary network traffic.
I need to combine many small objects into one, so clone_range would be really useful for me. I can read from an object and write to another, but this will cause unnecessary network traffic.
How can I do this in new versions of librados?
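To illustrate, the client-side round trip I want to avoid looks like this
with the rados CLI (pool and object names are placeholders):
rados -p mypool get obj-part1 /tmp/p1
rados -p mypool get obj-part2 /tmp/p2
cat /tmp/p1 /tmp/p2 > /tmp/combined
rados -p mypool put obj-combined /tmp/combined
Every byte travels OSD -> client -> OSD, which is exactly the traffic that
clone_range used to avoid.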
Hello,
Is there a way to get read/write I/O statistics for each rbd device, per
mapping?
For example, when an application uses one of the volumes, I would like to
find out what performance (avg read/write bandwidth, IOPS, etc) that
application observed on a given volume. Is that possible?
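For example, since a mapped image is a normal block device on the client, I
can already watch kernel-level stats there (assuming the image is mapped as
/dev/rbd0):
iostat -x 1 /dev/rbd0
I also came across "rbd perf image iostat <pool>" in the Nautilus docs, but
I am not sure whether it breaks the numbers down per mapping rather than per
image.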
Thanks,
Shridhar
Hi,
I have added a new fast_data pool to cephfs and fixed the auth caps, e.g.
client.f9wn
key: ........
caps: [mds] allow rw
caps: [mon] allow r
caps: [osd] allow rw pool=cephfs_data, allow rw pool=fast_data
but the client with a kernel-mounted cephfs reports an error when trying
to read or write, e.g.
f9nd003 ~ # echo a > /ceph/grid/cache/a
-bash: echo: write error: Operation not permitted
mount is:
cat /proc/mounts |grep ceph
monitors...:/ /ceph ceph rw,relatime,name=f9wn,secret=<hidden>,acl 0 0
It seems that only umount + mount solves this issue; the kernel is vanilla
4.19.60.
Is there any way to force the propagation of new auth capabilities without
remounting the fs?
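For completeness, the caps above were changed with something like:
ceph auth caps client.f9wn mds 'allow rw' mon 'allow r' osd 'allow rw pool=cephfs_data, allow rw pool=fast_data'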
Thanks,
Andrej
--
_____________________________________________________________
prof. dr. Andrej Filipcic, E-mail: Andrej.Filipcic(a)ijs.si
Department of Experimental High Energy Physics - F9
Jozef Stefan Institute, Jamova 39, P.o.Box 3000
SI-1001 Ljubljana, Slovenia
Tel.: +386-1-477-3674 Fax: +386-1-425-7074
-------------------------------------------------------------
Hi all,
I am trying to set up an active-active NFS Ganesha cluster (with two Ganeshas (v3.0) running in Docker containers). I managed to get two Ganesha daemons running using the rados_cluster backend for active-active deployment. I have the grace db within the cephfs metadata pool, in its own namespace, which keeps track of the node status.
Now, I can mount the exposed filesystem over NFS (v4.1, v4.2) with both daemons. So far so good.
Testing high availability resulted in unexpected behavior, and I am not sure whether it is intentional or a configuration problem.
Problem:
If both are running, no E or N flags are set within the grace db, as I expect. Once one host goes down (or is taken down), ALL clients can neither read nor write to the mounted filesystem, even the clients that are not connected to the dead Ganesha. In the db, I see that the dead Ganesha has state NE and the active one has E. This state is what I expect from the Ganesha documentation. Nevertheless, I would assume that the clients connected to the active daemon are not blocked. This state is not cleaned up by itself (e.g. after the grace period).
I can unlock this situation by 'lifting' the dead node with a direct db call (using the ganesha-rados-grace tool), but within an active-active deployment this is not suitable.
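For reference, the manual unlock looks roughly like this (pool and
namespace as in the RADOS_KV section below; "b" stands for the dead node's
id):
# show the current epoch and per-node E/N flags
ganesha-rados-grace --pool cephfsmetadata --ns grace dump
# clear the dead node's enforcing flag
ganesha-rados-grace --pool cephfsmetadata --ns grace lift b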
The ganesha config looks like:
------------
NFS_CORE_PARAM
{
Enable_NLM = false;
Protocols = 4;
}
NFSv4
{
RecoveryBackend = rados_cluster;
Minor_Versions = 1,2;
}
RADOS_KV
{
pool = "cephfsmetadata";
nodeid = "a" ;
namespace = "grace";
UserId = "ganesha";
Ceph_Conf = "/etc/ceph/ceph.conf";
}
MDCACHE {
Dir_Chunk = 0;
NParts = 1;
Cache_Size = 1;
}
EXPORT
{
Export_ID=101;
Protocols = 4;
Transports = TCP;
Path = PATH;
Pseudo = PSEUDO_PATH;
Access_Type = RW;
Attr_Expiration_Time = 0;
Squash = no_root_squash;
FSAL {
Name = CEPH;
User_Id = "ganesha";
Secret_Access_Key = CEPHXKEY;
}
}
LOG {
Default_Log_Level = "FULL_DEBUG";
}
------------
Does anyone have similar problems? Or, if this behavior is intentional, can you explain why this is the case?
Thank you in advance for your time and thoughts.
Kind regards,
Michael
Hi Sebastian,
Thanks a lot for your reply. It was really helpful, and it is now clear that
'make check' doesn't start a ceph cluster. After your email I figured it
out. This brings me to another question :-)
In my earlier email I should have defined what exactly I mean by 'workload'
in my case. Given my current task/scenario, 'workload' means only the
workload of the client machine: if there is a Ceph cluster, I am only
concerned with the workload of a single ceph client node, not the workload
of the other nodes (OSDs, MONs, MDS, etc.). The question arises: what
exactly on the ceph client? On the client side, I would like to profile the
workload of CRUSH, because I am quite sure there are many computations in
CRUSH that are compute-intensive for the CPU and could be offloaded. Maybe
these compute-intensive computations can be parallelized further. This is
why I was profiling the binaries of the unit tests (in particular the CRUSH
unit tests) with the profiling tool Valgrind (--tool=callgrind) to see the
function calls. Maybe this is not the right way? Please do comment on it :-).
Considering my task, would you still recommend using the Teuthology tests
at this point? Please do comment on this as well :-). Integration tests
(the Teuthology framework) require multi-machine clusters to run, and to my
understanding that would be too complex for a single-client workload, or
let's say if I am only interested in the CRUSH workload.
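For concreteness, what I am doing right now looks roughly like this (the
unit-test binary name is taken from my build tree and may differ in yours):
cd build
valgrind --tool=callgrind ./bin/unittest_crush_wrapper
callgrind_annotate callgrind.out.<pid>
I have also tried exercising mappings without a cluster via crushtool, e.g.:
# crushmap fetched with 'ceph osd getcrushmap -o crushmap' or built with 'crushtool --build'
crushtool -i crushmap --test --num-rep 3 --show-mappings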
Thanks in advance :-)
On Thu, May 7, 2020 at 12:20 AM Bobby <italienisch1987(a)gmail.com> wrote:
> On Wed, May 6, 2020 at 5:37 PM Sebastian Wagner <sebastian.wagner(a)suse.com>
> wrote:
>
>> Hi Bobby,
>>
>> `make check` aka unit tests don't start a ceph cluster. Instead they test
>> individual functions. There is nothing similar to a "workload" involved
>> here.
>>
>> Maybe, you're interested in the vstart_runner, which makes it possible to
>> run Teuthology tests in a vstart cluster.
>>
>> Best,
>>
>> Sebastian
>> _______________________________________________
>> Dev mailing list -- dev(a)ceph.io
>> To unsubscribe send an email to dev-leave(a)ceph.io
>>
>
Hi Frank,
Reviving this old thread to ask whether the performance on these raw NL-SAS
drives is adequate. I was wondering whether this is a deep archive with
almost no retrieval, and how many drives are used. In my experience with
large parallel writes, WAL/DB on SSD with bluestore, or journal drives on
SSD with filestore, have always been needed to sustain a reasonably
consistent transfer rate.
I would very much appreciate any reference info about your design.
Best regards,
Alex
On Mon, Jul 8, 2019 at 4:30 AM Frank Schilder <frans(a)dtu.dk> wrote:
>> Hi David,
>>
>> I'm running a cluster with bluestore on raw devices (no lvm) and all
>> journals collocated on the same disk with the data. Disks are spinning
>> NL-SAS. Our goal was to build storage at lowest cost, therefore all data on
>> HDD only. I got a few SSDs that I'm using for FS and RBD meta data. All
>> large pools are EC on spinning disk.
>>
>> I spent at least one month running detailed benchmarks (rbd bench),
>> varying the EC profile, object size, write size, etc. Results varied a
>> lot. My advice would be to run benchmarks on your own hardware. If there
>> were a single perfect choice, there wouldn't be so many options. For
>> example, my tests will not be valid when using separate fast disks for WAL
>> and DB.
>>
>> There are some results though that might be valid in general:
>>
>> 1) EC pools have high throughput but low IOP/s compared with replicated
>> pools
>>
>> I see single-thread write speeds of up to 1.2GB (gigabyte) per second,
>> which is probably the network limit and not the disk limit. IOP/s get
>> better with more disks, but are way lower than what replicated pools can
>> provide. On a cephfs with an EC data pool, small-file IO will be comparatively
>> slow and eat a lot of resources.
>>
>> 2) I observe massive network traffic amplification on small IO sizes,
>> which is due to the way EC overwrites are handled. This is one bottleneck
>> for IOP/s. We have 10G infrastructure and use 2x10G client and 4x10G OSD
>> network. The OSD network bandwidth should be at least 2x the client
>> network, better 4x or more.
>>
>> 3) k should only have small prime factors, power of 2 if possible
>>
>> I tested k=5,6,8,10,12. Best results in decreasing order: k=8, k=6. All
>> other choices were poor. The value of m seems irrelevant to performance.
>> Larger k will require more failure domains (more hardware).
>>
>> 4) object size matters
>>
>> The best throughput (1M write size) I see with object sizes of 4MB or
>> 8MB, with IOP/s getting somewhat better with smaller object sizes but
>> throughput dropping fast. I use the default of 4MB in production. Works
>> well for us.
>>
>> 5) jerasure is quite good and seems most flexible
>>
>> jerasure is quite CPU efficient and can handle smaller chunk sizes than
>> other plugins, which is preferable for IOP/s. However, CPU usage can
>> become a problem and a plugin optimized for specific values of k and m
>> might help here. Under usual circumstances I see very low load on all OSD
>> hosts, even under rebalancing. However, I remember that once I needed to
>> rebuild something on all OSDs (I don't remember what it was, sorry). In
>> this situation, CPU load went up to 30-50% (meaning up to half the cores
>> were at 100%), which is really high considering that each server has only
>> 16 disks at the moment and is sized to handle up to 100. CPU power could
>> become a bottleneck for us in the future.
>>
>> These are some general observations and do not replace benchmarks for
>> specific use cases. I was hunting for a specific performance pattern, which
>> might not be what you want to optimize for. I would recommend running
>> extensive benchmarks if you have to live with a configuration for a long
>> time - EC profiles cannot be changed.
>>
>> We settled on 8+2 and 6+2 pools with jerasure and object size 4M. We also
>> use bluestore compression. All meta data pools are on SSD, only very little
>> SSD space is required. This choice works well for the majority of our use
>> cases. We can still build small expensive pools to accommodate special
>> performance requests.
>>
>> Best regards,
>>
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: ceph-users <ceph-users-bounces(a)lists.ceph.com> on behalf of David <xiaomajia.st(a)gmail.com>
>> Sent: 07 July 2019 20:01:18
>> To: ceph-users(a)lists.ceph.com
>> Subject: [ceph-users] What's the best practice for Erasure Coding
>>
>> Hi Ceph-Users,
>>
>> I'm working with a Ceph cluster (about 50TB, 28 OSDs, all Bluestore on
>> lvm). Recently, I have been trying to use an erasure-coded pool.
>> My question is: what's the best practice for using EC pools?
>> More specifically, which plugin (jerasure, isa, lrc, shec or clay)
>> should I adopt, and how should I choose the combination of (k,m) (e.g.
>> (k=3,m=2) or (k=6,m=3))?
>>
>> Can anyone share some experience?
>>
>> Thanks for any help.
>>
>> Regards,
>> David
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users(a)lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
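For reference, an EC profile and pool along the lines Frank settled on can
be created roughly like this (profile/pool names and PG count are
placeholders; allow_ec_overwrites is needed when RBD or CephFS data lives
on the EC pool):
ceph osd erasure-code-profile set ec-8-2 k=8 m=2 plugin=jerasure crush-failure-domain=host
ceph osd pool create ec-data 128 128 erasure ec-8-2
ceph osd pool set ec-data allow_ec_overwrites true
# bluestore compression, as mentioned above, is a per-pool setting:
ceph osd pool set ec-data compression_mode aggressive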