January 2021 - ceph-users

Re: Dashboard : Block image listing and infos

by Gilles Mocellin

Hi ! I respond to the list, as it may help others. I also reorder the response. > On Mon, Jan 18, 2021 at 2:41 PM Gilles Mocellin < > > gilles.mocellin(a)nuagelibre.org> wrote: > > Hello Cephers, > > > > On a new cluster, I only have 2 RBD block images, and the Dashboard > > doesn't manage to list them correctly. > > > > I have this message : > > Warning > > Displaying previously cached data for pool veeam-repos. > > > > Sometime it disappears, but as soon as I reload or return to the listing > > page, it's there. > > > > What I've seen, is a high CPU load due to ceph-mgr on the active > > manager. > > And also stack-traces like this : [...] > > dashboard.exceptions.ViewCacheNoDataException: ViewCache: unable to > > retrieve data > > > > I also see that, when I try to edit an image : > > > > 2021-01-18T11:13:26.383+0100 7f00199ca700 0 [dashboard ERROR > > frontend.error] > > (https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/#/block/rbd/edit/veeam-> > repos%252Fveeam-repo2-vol1 > > <https://fidcl-mrs4-sto-sds.fidcl.cloud:8443/#/block/rbd/edit/veeam-repos% > > 252Fveeam-repo2-vol1>): Cannot read property 'features_name' of > > undefined > > > > TypeError: Cannot read property 'features_name' of undefined [...] > > > > But that's perhaps just becaus I open an Edit window on the image and it > > does not have the datas. > > The Edit window is empty, and I can't edit things, especially, I wan't > > to resize the image. > > [...] > > -- > > Gilles Le jeudi 21 janvier 2021, 21:56:58 CET Ernesto Puerta a écrit : > Hey Gilles, > > If I'm not wrong, that exception (ViewCacheNoDataException) happens when > the dashboard is unable to gather all required data from Ceph within a > defined timeout (5 secs I think, since the UI refreshes the data every ~5 > seconds). > > It'd be great if you could provide the steps to reproduce it and some > insights into your environment (number of RBD pools, number of RBD images, > snapshots, etc.). > > Kind Regards, > > Ernesto OK, As it is now, it always hapens, on the image listing, I have the Warning and the list is not always up to date, if I create an image, I must wait very long to see it. Also, I can not edit the 2 big images I have. Perhaps the size is important, they are 2 images of 40 TB. If I create a 1 GB test image, I can edit and resize it. But impossible withe the big image, the windows opens but all the fields are empty. Also, if it can matter, the images use a data pool (EC 3+2). I have 2 pools, a replicated one for metadatas veeam-repos (replic x3), and a data pool veeam-repos.data (EC 3+2). My cluster has 6 nodes with AMD 16 cores CPU, 128 GB RAM, 10 8 TB HDD. So 60 OSD. Soon doubling everything to 12 nodes. Usage, as the pool and image names can tell, is to mount RBD image as a XFS filesystem for a Veeam Backup Repository (krbd, because nbd-rbd tailed regularly, especially during fstrim).

3 years, 2 months

2
2
0 0

Speed of S3 Ceph gateways

by michal.strnad＠cesnet.cz

Hi all, We are testing our S3 Ceph endpoints and we are not satisfied with its speed. Our results are something between around 120 - 150 MB/s depending on small/bigger files. This is good for 1Gbps connection, but not for 10GE or more. We've tried the most recent versions of the AWS CLI, s3cmd, s4cmd, s3fs ... programs. Of course we are using multipart upload/download which is precondition for parallel upload/download. Also we tried multi-thread (25 or more threads) transfer in s4cmd but still we don't get proper results. For proof of concept that high speed can be achieved we have written small script in bash which uses multi-part & parallel transfer and can saturate at least 10GE without problem. I would like to ask you, if you know proper program and its parameters, so we can saturate n x 10GE if needed? We are using the latest nautilus. S3 gateways have much more computer power and bandwidth to internet then it is used right now. Thank you Regards Michal Strnad

3 years, 2 months

4
3
0 0

Can see objects with "rados ls" but cannot delete them with "rados rm"

by James, GleSYS

Hi, We have in issue in our cluster (octopus 15.2.7) where we’re unable to remove orphaned objects from a pool, despite the fact these objects can be listed with “rados ls”. Here is an example of an orphaned object which we can list (not sure why multiple objects are returned with the same name…related to the issue perhaps?) rados ls -p default.rgw.buckets.data | grep -i 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 And the error message when we try to stat / rm the object: rados stat -p default.rgw.buckets.data 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 error stat-ing default.rgw.buckets.data/5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6: (2) No such file or directory rados -p default.rgw.buckets.data rm 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6 error removing default.rgw.buckets.data>5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83__shadow_anon_backup_xxxx_xx_xx_090109_7812500.bak.vLHmbxS4DAnRMDVjBYG-5X6iSmepDD6: (2) No such file or directory The bucket with id "5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83” was deleted from radosgw a few months ago, but we still have approximately 450,000 objects with this bucket id that are orphaned: cat orphan-list-202101191211.out | grep -i 5a5c812a-3d31-xxxx-xxxx-xxxxxxxxxxxx.4811659.83 | wc -l 448683 I can also see from our metrics that prior to deletion there was about 10TB of compressed data stored in this bucket, and this has not been reclaimed in the pool usage after the bucket was deleted. Anyone have any suggestions on how we can remove these objects and reclaim the space? We’re not using snapshots or cache tiers in our environment. Thanks, James.

3 years, 2 months

3
3
0 0

MDS rejects clients causing hanging mountpoint on linux kernel client

by Florian Pritz

Hi, We are running a ceph cluster on Ubuntu 18.04 machines with ceph 14.2.4. Our cephfs clients are using the kernel module and we have noticed that some of them are sometimes (at least once) hanging after an MDS restart. The only way to resolve this is to unmount and remount the mountpoint, or reboot the machine if unmounting is not possible. After some investigation, the problem seems to be that the MDS denies reconnect attempts from some clients during restart even though the reconnect interval is not yet reached. In particular, I see the following log entries. Note that there are supposedly 9 sessions. 9 clients reconnect (one client has two mountpoints) and then two more clients reconnect after the MDS already logged "reconnect_done". These two clients were hanging after the event. The kernel log of one of them is shown below too. Running `ceph tell mds.0 client ls` after the clients have been rebooted/remounted also shows 11 clients instead of 9. Do you have any ideas what is wrong here and how it could be fixed? I'm guessing that the issue is that the MDS apparently has an incorrect session count and stops the reconnect process to soon. Is this indeed a bug and if so, do you know what is broken? Regardless, I also think that the kernel should be able to deal with a denied reconnect and that it should try again later. Yet, even after 10 minutes, the kernel does not attempt to reconnect. Is this a known issue or maybe fixed in newer kernels? If not, is there a chance to get this fixed? Thanks, Florian MDS log: > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.server reconnect_clients -- 9 sessions > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24197043 v1:10.1.4.203:0/990008521 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.30487144 v1:10.1.4.146:0/483747473 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21019865 v1:10.1.7.22:0/3752632657 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.21020717 v1:10.1.7.115:0/2841046616 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24171153 v1:10.1.7.243:0/1127767158 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.23978093 v1:10.1.4.71:0/824226283 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.24209569 v1:10.1.4.157:0/1271865906 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190930 v1:10.1.4.240:0/3195698606 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 0 log_channel(cluster) log [DBG] : reconnect by client.20190912 v1:10.1.4.146:0/852604154 after 0 > 2019-09-26 16:08:27.479 7f9fdde99700 1 mds.0.59 reconnect_done > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.24167394 v1:10.1.67.49:0/1483641729 after 0.00400002 (allowed interval 45) > 2019-09-26 16:08:27.483 7f9fe1087700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.49:0/1483641729 conn(0x55af50053f80 0x55af50140800 :6801 s=OPENED pgs=21 cs=1 l=0).fault server, going to standby > 2019-09-26 16:08:27.483 7f9fdde99700 1 mds.0.server no longer in reconnect state, ignoring reconnect, sending close > 2019-09-26 16:08:27.483 7f9fdde99700 0 log_channel(cluster) log [INF] : denied reconnect attempt (mds is up:reconnect) from client.30586072 v1:10.1.67.140:0/3664284158 after 0.00400002 (allowed interval 45) > 2019-09-26 16:08:27.483 7f9fe1888700 0 --1- [v2:10.1.4.203:6800/806949107,v1:10.1.4.203:6801/806949107] >> v1:10.1.67.140:0/3664284158 conn(0x55af50055600 0x55af50143000 :6801 s=OPENED pgs=8 cs=1 l=0).fault server, going to standby Hanging client (10.1.67.49) kernel log: > 2019-09-26T16:08:27.481676+02:00 hostnamefoo kernel: [708596.227148] ceph: mds0 reconnect start > 2019-09-26T16:08:27.488943+02:00 hostnamefoo kernel: [708596.233145] ceph: mds0 reconnect denied > 2019-09-26T16:16:17.541041+02:00 hostnamefoo kernel: [709066.287601] libceph: mds0 10.1.4.203:6801 socket closed (con state NEGOTIATING) > 2019-09-26T16:16:18.068934+02:00 hostnamefoo kernel: [709066.813064] ceph: mds0 rejected session > 2019-09-26T16:16:18.068955+02:00 hostnamefoo kernel: [709066.814843] ceph: get_quota_realm: ino (10000000008.fffffffffffffffe) null i_snap_realm

3 years, 2 months

3
6
0 0

Using RBD to pack billions of small files

by Loïc Dachary

Bonjour, In the context Software Heritage (a noble mission to preserve all source code)[0], artifacts have an average size of ~3KB and there are billions of them. They never change and are never deleted. To save space it would make sense to write them, one after the other, in an every growing RBD volume (more than 100TB). An index, located somewhere else, would record the offset and size of the artifacts in the volume. I wonder if someone already implemented this idea with success? And if not... does anyone see a reason why it would be a bad idea? Cheers [0] https://docs.softwareheritage.org/ -- Loïc Dachary, Artisan Logiciel Libre

3 years, 2 months

8
16
0 0

radosgw bucket index issue

by Fox, Kevin M

We have a fairly old cluster that has over time been upgraded to nautilus. We were digging through some things and found 3 bucket indexes without a corresponding bucket. They should have been deleted but somehow were left behind. When we try and delete the bucket index, it will not allow it as the bucket is not found. The bucket index list command works fine though without the bucket. Is there a way to delete the indexes? Maybe somehow relink the bucket so it can be deleted again? Thanks, Kevin

3 years, 2 months

1
1
0 0

osd recommended scheduler

by Andrei Mikhailovsky

Hello everyone, Could some one please let me know what is the recommended modern kernel disk scheduler that should be used for SSD and HDD osds? The information in the manuals is pretty dated and refer to the schedulers which have been deprecated from the recent kernels. Thanks Andrei

3 years, 2 months

4
5
0 0

radosgw process crashes multiple times an hour

by Andrei Mikhailovsky

Hello, I am experiencing very frequent crashes of the radosgw service. It happens multiple times every hour. As an example, over the last 12 hours we've had 35 crashes. Has anyone experienced similar behaviour of the radosgw octopus release service? More info below: Radosgw service is running on two Ubuntu servers. I have tried upgrading OS on one of the servers to Ubuntu 20.04 with latest updates. The second server is still running Ubuntu 18.04. Both services crash occasionally, but the service which is running on Ubuntu 20.04 crashes far more often it seems. The ceph cluster itself is pretty old and was initially setup around 2013. The cluster was updated pretty regularly with every major release. Currently, I've got Octopus 15.2.8 running on all osd, mon, mgr and radosgw servers. Crash Backtrace: ceph crash info 2021-01-28T11:36:48.912771Z_08f80efd-c0ad-4551-88ce-905ca9cd3aa8 |less { "backtrace": [ "(()+0x46210) [0x7f815a49a210]", "(gsignal()+0xcb) [0x7f815a49a18b]", "(abort()+0x12b) [0x7f815a479859]", "(()+0x9e951) [0x7f8150ee9951]", "(()+0xaa47c) [0x7f8150ef547c]", "(()+0xaa4e7) [0x7f8150ef54e7]", "(()+0xaa799) [0x7f8150ef5799]", "(()+0x344ba) [0x7f815a1404ba]", "(()+0x71e04) [0x7f815a17de04]", "(librados::v14_2_0::IoCtx::nobjects_begin(librados::v14_2_0::ObjectCursor const&, ceph::buffer::v15_2_0::list const&)+0x5d) [0x7f815a18c7bd]", "(RGWSI_RADOS::Pool::List::init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWAccessListFilter*)+0x115) [0x7f815b0d9935]", "(RGWSI_SysObj_Core::pool_list_objects_init(rgw_pool const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, RGWSI_SysObj::Pool::ListCtx*)+0x255) [0x7f815abd7035]", "(RGWSI_MetaBackend_SObj::list_init(RGWSI_MetaBackend::Context*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x206) [0x7f815b0ccfe6]", "(RGWMetadataHandler_GenericMetaBE::list_keys_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void**)+0x41) [0x7f815ad23201]", "(RGWMetadataManager::list_keys_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, void**)+0x71) [0x7f815ad254d1]", "(AsyncMetadataList::_send_request()+0x9b) [0x7f815b13c70b]", "(RGWAsyncRadosProcessor::handle_request(RGWAsyncRadosRequest*)+0x25) [0x7f815ae60f25]", "(RGWAsyncRadosProcessor::RGWWQ::_process(RGWAsyncRadosRequest*, ThreadPool::TPHandle&)+0x11) [0x7f815ae69401]", "(ThreadPool::worker(ThreadPool::WorkThread*)+0x5bb) [0x7f81517b072b]", "(ThreadPool::WorkThread::entry()+0x15) [0x7f81517b17f5]", "(()+0x9609) [0x7f815130d609]", "(clone()+0x43) [0x7f815a576293]" ], "ceph_version": "15.2.8", "crash_id": "2021-01-28T11:36:48.912771Z_08f80efd-c0ad-4551-88ce-905ca9cd3aa8", "entity_name": "client.radosgw1.gateway", "os_id": "ubuntu", "os_name": "Ubuntu", "os_version": "20.04.1 LTS (Focal Fossa)", "os_version_id": "20.04", "process_name": "radosgw", "stack_sig": "347474f09a756104ac2bb99d80e0c1fba3e9dc6f26e4ef68fe55946c103b274a", "timestamp": "2021-01-28T11:36:48.912771Z", "utsname_hostname": "arh-ibstorage1-ib", "utsname_machine": "x86_64", "utsname_release": "5.4.0-64-generic", "utsname_sysname": "Linux", "utsname_version": "#72-Ubuntu SMP Fri Jan 15 10:27:54 UTC 2021" } radosgw.log file (file names were redacted): -25> 2021-01-28T11:36:48.794+0000 7f8043fff700 1 civetweb: 0x7f814c0cf010: 176.35.173.88 - - [28/Jan/2021:11:36:48 +0000] "PUT /<file_name>-u115134.JPG HTTP/1.1" 400 460 - - -24> 2021-01-28T11:36:48.814+0000 7f80437fe700 1 ====== starting new request req=0x7f80437f5780 ===== -23> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s initializing for trans_id = tx000000000000000001431-006012a1d0-31197b5c-default -22> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s getting op 1 -21> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s s3:put_obj verifying requester -20> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s s3:put_obj normalizing buckets and tenants -19> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s s3:put_obj init permissions -18> 2021-01-28T11:36:48.814+0000 7f80437fe700 0 req 5169 0s NOTICE: invalid dest placement: default-placement/REDUCED_REDUNDANCY -17> 2021-01-28T11:36:48.814+0000 7f80437fe700 1 op->ERRORHANDLER: err_no=-22 new_err_no=-22 -16> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s s3:put_obj op status=0 -15> 2021-01-28T11:36:48.814+0000 7f80437fe700 2 req 5169 0s s3:put_obj http status=400 -14> 2021-01-28T11:36:48.814+0000 7f80437fe700 1 ====== req done req=0x7f80437f5780 op status=0 http_status=400 latency=0s ====== -13> 2021-01-28T11:36:48.822+0000 7f80437fe700 1 civetweb: 0x7f814c0cf9e8: 176.35.173.88 - - [28/Jan/2021:11:36:48 +0000] "PUT /<file_name>-d20201223-u115132.JPG HTTP/1.1" 400 460 - - -12> 2021-01-28T11:36:48.878+0000 7f8043fff700 1 ====== starting new request req=0x7f8043ff6780 ===== -11> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s initializing for trans_id = tx000000000000000001432-006012a1d0-31197b5c-default -10> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s getting op 1 -9> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj verifying requester -8> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj normalizing buckets and tenants -12> 2021-01-28T11:36:48.878+0000 7f8043fff700 1 ====== starting new request req=0x7f8043ff6780 ===== -11> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s initializing for trans_id = tx000000000000000001432-006012a1d0-31197b5c-default -10> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s getting op 1 -9> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj verifying requester -12> 2021-01-28T11:36:48.878+0000 7f8043fff700 1 ====== starting new request req=0x7f8043ff6780 ===== -11> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s initializing for trans_id = tx000000000000000001432-006012a1d0-31197b5c-default -10> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s getting op 1 -9> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj verifying requester -8> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj normalizing buckets and tenants -7> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj init permissions -6> 2021-01-28T11:36:48.878+0000 7f8043fff700 0 req 5170 0s NOTICE: invalid dest placement: default-placement/REDUCED_REDUNDANCY -5> 2021-01-28T11:36:48.878+0000 7f8043fff700 1 op->ERRORHANDLER: err_no=-22 new_err_no=-22 -4> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj op status=0 -3> 2021-01-28T11:36:48.878+0000 7f8043fff700 2 req 5170 0s s3:put_obj http status=400 -2> 2021-01-28T11:36:48.878+0000 7f8043fff700 1 ====== req done req=0x7f8043ff6780 op status=0 http_status=400 latency=0s ====== -1> 2021-01-28T11:36:48.886+0000 7f8043fff700 1 civetweb: 0x7f814c0cf010: 176.35.173.88 - - [28/Jan/2021:11:36:48 +0000] "PUT /<file_name>-223-u115136.JPG HTTP/1.1" 400 460 - - 0> 2021-01-28T11:36:48.910+0000 7f8128ff9700 -1 *** Caught signal (Aborted) ** 2021-01-28T11:36:49.810+0000 7f76032db9c0 0 deferred set uid:gid to 64045:64045 (ceph:ceph) 2021-01-28T11:36:49.810+0000 7f76032db9c0 0 ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable), process radosgw, pid 30417 2021-01-28T11:36:49.810+0000 7f76032db9c0 0 framework: civetweb 2021-01-28T11:36:49.810+0000 7f76032db9c0 0 framework conf key: port, val: 443s Could someone help me troubleshoot and fix the issue? Thanks Andrei

3 years, 2 months

3
4
0 0

Unable to enable RBD-Mirror Snapshot on image when VM is using RBD

by Adam Boyhan

This is a odd one. I don't hit it all the time so I don't think its expected behavior. Sometimes I have no issues enabling rbd-mirror snapshot mode on a rbd when its in use by a KVM VM. Other times I hit the following error, the only way I can get around it is to power down the KVM VM. root@Ccscephtest1:~# rbd mirror image enable CephTestPool1/vm-101-disk-0 snapshot 2021-01-29T09:29:07.875-0500 7f1e99ffb700 -1 librbd::mirror::snapshot::CreatePrimaryRequest: 0x7f1e7c012440 handle_create_snapshot: failed to create mirror snapshot: (22) Invalid argument 2021-01-29T09:29:07.875-0500 7f1e99ffb700 -1 librbd::mirror::EnableRequest: 0x5597667fd200 handle_create_primary_snapshot: failed to create initial primary snapshot: (22) Invalid argument 2021-01-29T09:29:07.875-0500 7f1ea559f3c0 -1 librbd::api::Mirror: image_enable: cannot enable mirroring: (22) Invalid argument

3 years, 2 months

2
4
0 0

Balancing with upmap

by Francois Legrand

Hi all, I have a cluster with 116 disks (24 new disks of 16TB added in december and the rest of 8TB) running nautilus 14.2.16. I moved (8 month ago) from crush_compat to upmap balancing. But the cluster seems not well balanced, with a number of pgs on the 8TB disks varying from 26 to 52 ! And an occupation from 35 to 69%. The recent 16 TB disks are more homogeneous with 48 to 61 pgs and space between 30 and 43%. Last week, I realized that some osd were maybe not using upmap because I did a ceph osd crush weight-set ls and got (compat) as result. Thus I ran a ceph osd crush weight-set rm-compat which triggered some rebalancing. Now there is no more recovery for 2 days, but the cluster is still unbalanced. As far as I understand, upmap is supposed to reach an equal number of pgs on all the disks (I guess weighted by their capacity). Thus I would expect more or less 30 pgs on the 8TB disks and 60 on the 16TB and around 50% usage on all. Which is not the case (by far). The problem is that it impact the free available space in the pools (264Ti while there is more than 578Ti free in the cluster) because free space seems to be based on space available before the first osd will be full ! Is it normal ? Did I missed something ? What could I do ? F.

3 years, 2 months

3
11
0 0

2024

2023

2022

2021

2020

2019

ceph-users January 2021