December 2020 - ceph-users

Larger number of OSDs, cheroot, cherrypy, limits + containers == broken

by David Orman

Hi, We have a ceph 15.2.7 deployment using cephadm under podman w/ systemd. We've run into what we believe is: https://github.com/ceph/ceph-container/issues/1748 https://tracker.ceph.com/issues/47875 In our case, eventually the mgr container stops emitting output/logging. We are polling with external prometheus clusters, which is likely what triggers the issue, as it appears some amount of time after the container is spawned. Unfortunately, setting limits in the systemd service file for the mgr service on the host OS doesn't work, nor does modifying the unit.run file which is used to start the container under podman to include the --ulimit settings as suggested. Looking inside the container: lib/systemd/system/ceph-mgr@.service:LimitNOFILE=1048576 This prevents us from deploying medium to large ceph clusters, so I would argue it's a high priority bug that should not be closed, unless there is a workaround that works until EPEL 8 contains the fixed version of cheroot and the ceph containers include it. My understanding is this was fixed in cheroot 8.4.0: https://github.com/cherrypy/cheroot/issues/249 https://github.com/cherrypy/cheroot/pull/301 Thank you in advance for any suggestions, David

3 years, 3 months

4
10
0 0

radosgw sync using more capacity on secondary than on master zone

by Scheurer François

Dear Ceph contributors While our (new) rgw secondary zone is doing the initial data sync from our master zone, we noticed that the reported capacity usage was getting higher than on primary zone: Master Zone: ceph version 14.2.5 zone parameters: "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 0, "read_only": "false", "tier_type": "", "sync_from_all": "false", "sync_from": [], "redirect_zone": "" bucket stats: => "size_actual GiB": 269868.9823989868, "size_utilized GiB": 17180102071.868008, <= (cf. other issue below) "num_objects": 100218788, "Compression Rate": 63660.899148714125 pool stats: => "stored": 403834132234240, "objects": 191530416, "kb_used": 692891724288, "bytes_used": 709521125670912, "percent_used": 0.7570806741714478, "max_avail": 136595529269248 erasure-code-profile: crush-device-class=nvme crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=8 m=2 plugin=jerasure technique=reed_sol_van w=8 osd parameters: bluefs_alloc_size 1048576 default bluefs_shared_alloc_size 65536 default bluestore_extent_map_inline_shard_prealloc_size 256 default bluestore_max_alloc_size 0 default bluestore_min_alloc_size 0 default bluestore_min_alloc_size_hdd 65536 default bluestore_min_alloc_size_ssd 16384 default rgw compression: no Secondary Zone: ceph version 14.2.12 zone parameters: "log_meta": "false", "log_data": "true", "bucket_index_max_shards": 11, "read_only": "true", "tier_type": "", "sync_from_all": "true", "sync_from": [], "redirect_zone": "" bucket stats: => "size_actual GiB": 65282.37313461304, "size_utilized GiB": 60779.72828538809, "num_objects": 23074921, "Compression Rate": 0.9310281683550253 pool stats: => "stored": 407816305115136, "objects": 118637638, "kb_used": 497822635396, "bytes_used": 509770378645504, "percent_used": 0.7275146245956421, "max_avail": 152744706965504 erasure-code-profile: crush-device-class= crush-failure-domain=host crush-root=default jerasure-per-chunk-alignment=false k=3 m=2 plugin=jerasure technique=reed_sol_van w=8 osd parameters: EC k=8 m=2 bluefs_alloc_size 1048576 default bluefs_shared_alloc_size 65536 default bluestore_extent_map_inline_shard_prealloc_size 256 default bluestore_max_alloc_size 0 default bluestore_min_alloc_size 0 default bluestore_min_alloc_size_hdd 65536 default bluestore_min_alloc_size_ssd 4096 default rgw compression: yes As you see, the secondary zone is using 408 TB vs 404 TB on master zone. Summing size_actual for each buckets gives only 65 TB vs 270 TB on master zone. Any idea about what could cause such a difference? Is it a known issue? The are some known issue with Space Overhead with EC Pools for alloc_size > 4 KiB, cf. : https://www.mail-archive.com/ceph-users@ceph.io/msg06191.html https://lists.ceph.io/hyperkitty/list/dev@ceph.io/thread/OHPO43J54TPBEUISYC… https://www.spinics.net/lists/ceph-users/msg59587.html But our secondary zone is on nvme osd's with bluestore_min_alloc_size_ssd=4096, so that should be fine. I will also investigate further with "radosgw-admin bucket radoslist" and rgw-orphan-list. Thank you in advance for any help. And Happy New Year very soon ;-) Best Regards Francois PS: The value on master zone "size_utilized GiB": 17180102071.868008 is wrong. This is due to a bucket with wrong stats: { "bucket": "XXX", "tenant": "", "zonegroup": "d29ea82c-4f77-40af-952b-2ab0705ad268", "placement_rule": "default-placement", "explicit_placement": { "data_pool": "", "data_extra_pool": "", "index_pool": "" }, "id": "cb1594b3-a782-49d0-a19f-68cd48870a63.3119460.1", "marker": "cb1594b3-a782-49d0-a19f-68cd48870a63.63841.1481", "index_type": "Normal", "owner": "e13d054f6e9c4eea881f687923d7d380", "ver": "0#373075,1#286613,2#290913,3#341069,4#360862,5#341416,6#279526,7#352172,8#255944,9#314797,10#317650,11#305698,12#289557,13#345344,14#345273,15#294708,16#241001,17#298577,18#274866,19#293952,20#307635,21#334606,22#265355,23#302567,24#277505,25#307278,26#266963,27#297452,28#332274,29#319133,30#361027,31#314294,32#282887,33#324849,34#278560,35#307506,36#287269,37#344789,38#345389,39#323814,40#386483,41#280319,42#358072,43#336651,44#339176,45#248079,46#356784,47#381496,48#295152,49#251661,50#318661,51#330530,52#263564,53#332005,54#332937,55#320163,56#300485,57#296138,58#343271,59#359351,60#295711,61#275751,62#332264,63#351532", "master_ver": "0#0,1#0,2#0,3#0,4#0,5#0,6#0,7#0,8#0,9#0,10#0,11#0,12#0,13#0,14#0,15#0,16#0,17#0,18#0,19#0,20#0,21#0,22#0,23#0,24#0,25#0,26#0,27#0,28#0,29#0,30#0,31#0,32#0,33#0,34#0,35#0,36#0,37#0,38#0,39#0,40#0,41#0,42#0,43#0,44#0,45#0,46#0,47#0,48#0,49#0,50#0,51#0,52#0,53#0,54#0,55#0,56#0,57#0,58#0,59#0,60#0,61#0,62#0,63#0", "mtime": "2020-06-03 12:13:01.207610Z", "max_marker": "0#,1#,2#,3#00000341068.137793799.5,4#00000360861.156871075.5,5#00000341415.97799619.5,6#00000279525.155275260.5,7#,8#,9#00000314796.95564320.5,10#,11#,12#00000289556.137784091.5,13#,14#,15#,16#00000241000.126242121.5,17#,18#00000274865.124884405.5,19#,20#00000307634.137793798.5,21#00000334605.93836734.5,22#,23#00000302566.125226103.5,24#,25#,26#,27#00000297451.125229375.5,28#,29#00000319132.155275278.5,30#00000361026.98341455.5,31#,32#,33#00000324848.126242117.5,34#00000278559.124884417.5,35#,36#,37#00000344788.125945123.5,38#00000345388.97796715.5,39#,40#00000386482.98341457.5,41#,42#00000358071.124884415.5,43#,44#00000339175.135084366.5,45#00000248078.155263912.5,46#00000356783.98341461.5,47#00000381495.94538350.5,48#00000295151.138701826.5,49#00000251660.137793803.5,50#00000318660.93848186.5,51#,52#,53#00000332004.126242119.5,54#00000332936.138701824.5,55#,56#00000300484.156871073.5,57#,58#00000343270.98341459.5,59#00000359350.94257302.5,60#,61#,62#,63#00000351531.135084368.5", "usage": { "rgw.none": { "size": 0, "size_actual": 0, "size_utilized": 0, "size_kb": 0, "size_kb_actual": 0, "size_kb_utilized": 0, "num_objects": 18446744073709551604 }, "rgw.main": { "size": 22615692324233, "size_actual": 22617004142592, "size_utilized": 18446736471689615156, "size_kb": 22085637036, "size_kb_actual": 22086918108, "size_kb_utilized": 18014391085634390, "num_objects": 521927 }, "rgw.multimeta": { "size": 0, "size_actual": 0, "size_utilized": 0, "size_kb": 0, "size_kb_actual": 0, "size_kb_utilized": 0, "num_objects": 977 } }, "bucket_quota": { "enabled": false, "check_on_raw": false, "max_size": -1, "max_size_kb": 0, "max_objects": -1 } } This is another issue, probably not related to our sync issue. -- EveryWare AG François Scheurer Senior Systems Engineer Zurlindenstrasse 52a CH-8003 Zürich tel: +41 44 466 60 00 fax: +41 44 466 60 10 mail: francois.scheurer(a)everyware.ch web: http://www.everyware.ch

3 years, 3 months

2
2
0 0

logging to stdout/stderr causes huge container log file

by Tony Liu

Hi, With ceph 15.2.5 octopus, mon, mgd and rgw dump loggings on debug level to stdout/stderr. It causes huge container log file (/var/lib/docker/containers/<ID>/<ID>-json.log). Is there any way to stop dumping logs or change the logging level? BTW, I tried "ceph config set <service> log_to_stderr false". It doesn't help. Thanks! Tony

3 years, 3 months

2
2
0 0

Bluestore migration: per-osd device copy

by Chris Dunlop

Hi, The docs have scant detail on doing a migration to bluestore using a per-osd device copy: https://docs.ceph.com/en/latest/rados/operations/bluestore-migration/#per-o… This mentions "using the copy function of ceph-objectstore-tool", but ceph-objectstore-tool doesn't have a copy function (all the way from v9 to current). Has anyone actually tried doing this? Is there any further detail available on what is involved, e.g. a broad outline of the steps? Of course, detailed instructions would be even better, even if accompanied by "here be dragons!" warnings. Cheers, Chris

3 years, 3 months

3
6
0 0

What is the specific meaning "total_time" in RGW ops log

by opengers

Hello everyone，I enabled rgw ops log by setting "rgw_enable_ops_log = true"，There is a "total_time" field in rgw ops log But I want to figure out whether "total_time" includes the period of time when rgw returns a response to the client?

3 years, 3 months

2
2
0 0

Sequence replacing a failed OSD disk?

by Rainer Krienke

Hello, hope you had a nice Xmas and I wish all of you a good and happy new year in advance... Yesterday my ceph nautilus 14.2.15 cluster had a disk with unreadable sectors, after several tries the OSD was marked down and rebalancing started and has also finished successfully. ceph osd stat shows the osd now as "autoout,exists". Usually the steps to replace a failed disk are: 1. Destroy the failed OSD: ceph osd destroy {id} 2. run ceph-volume lvm create --bluestore --osd-id {id} --data /dev/sdX ... with a new disk in place to recreate a OSD with the same id without the need to change the crushmap or auth info etc. Now I still wait for a new disk and I am a unsure if I should run the destroy-command already now to keep ceph from trying to reactivate the broken osd? Then I would wait until the disk has arrived in a day or so and then use ceph volume to create a new osd? Or should I leave the state as it is now until the disk has arrived and then run both steps (destroy, volume ceph-lvm-create) one right after the other? Do the two slightly different ways make any difference if for example a power failure would result in a reboot of the node with the failed OSD before I could replace the broken disk? Any comments on this? Thanks Rainer -- Rainer Krienke, Uni Koblenz, Rechenzentrum, A22, Universitaetsstrasse 1 56070 Koblenz, Web: http://www.uni-koblenz.de/~krienke, Tel: +49261287 1312 PGP: http://www.uni-koblenz.de/~krienke/mypgp.html, Fax: +49261287 1001312

3 years, 3 months

2
1
0 0

Re: kvm vm cephfs mount hangs on osd node (something like umount -l available?) (help wanted going to production)

by Marc Roos

Hi Eugen, Indeed some really useful tips explaining what goes wrong, yet this thread[1] is about cephfs directly mounted on the osd node. I was having this also quite some time without any problems until suddenly I ran into the same issue as they had. I think I did not have any issues with the kernel client cephfs mount with luminous until I enabled the cephfs snapshots. Then I had to switch to the fuse client. In my case I am running a vm on the the osd node, which I thought would be different. I have been able to reproduce this stale mount just 2 times now. I have been testing with 10x more clients and it still works. Anyway I decided to move everything to rbd. I have been running vm's with rbd images without problems colocated on osd nodes for quite some time. I really would like to use the hosts because they have each 16c/32t and an average load of just 2-3. Unfortunately I did not document precisely how I recovered from the stale mount. I would like to see if I can reduce the amount of steps to take. Things started happening for me after I did the mds failover. Then I got blocked clients that I could unblock and fix the mount with mount -l Thanks for the pointers, I have linked them in my docs ;) -----Original Message----- To: ceph-users(a)ceph.io Subject: [ceph-users] Re: kvm vm cephfs mount hangs on osd node (something like umount -l available?) (help wanted going to production) Hi, there have been several threads about hanging cephfs mounts, one quite long thread [1] describes a couple of debugging options but also mentions to avoid mounting cephfs on OSD nodes in a production environment. Do you see blacklisted clients with 'ceph osd blacklist ls'? If the answer is yes try to unblock that client [2]. The same option ('umount -l') is available on a cephfs client, you can try that, too. Other options described in [1] are to execute an MDS failover, but sometimes a reboot of that VM is the only solution left. Regards, Eugen [1] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028719.html [2] https://docs.ceph.com/en/latest/cephfs/eviction/#advanced-un-blocklisting-a… Zitat von Marc Roos <M.Roos(a)f1-outsourcing.eu>: > Is there not some genius out there that can shed a ligth on this? ;) > Currently I am not able to reproduce this. Thus it would be nice to > have some procedure at hand that resolves stale cephfs mounts nicely. > > > -----Original Message----- > To: ceph-users > Subject: [ceph-users] kvm vm cephfs mount hangs on osd node (something > like umount -l available?) (help wanted going to production) > > > > I have a vm on a osd node (which can reach host and other nodes via > the macvtap interface (used by the host and guest)). I just did a > simple > bonnie++ test and everything seems to be fine. Yesterday however the > dovecot procces apparently caused problems (only using cephfs for an > archive namespace, inbox is on rbd ssd, fs meta also on ssd) > > How can I recover from such lock-up. If I have a similar situation > with an nfs-ganesha mount, I have the option to do a umount -l, and > clients recover quickly without any issues. > > Having to reset the vm, is not really an option. What is best way to > resolve this? > > > > Ceph cluster: 14.2.11 (the vm has 14.2.16) > > I have in my ceph.conf nothing special, these 2x in the mds section: > > mds bal fragment size max = 120000 > # maybe for nfs-ganesha problems? > # http://docs.ceph.com/docs/master/cephfs/eviction/ > #mds_session_blacklist_on_timeout = false > #mds_session_blacklist_on_evict = false mds_cache_memory_limit = > 17179860387 > > > All running: > CentOS Linux release 7.9.2009 (Core) > Linux mail04 3.10.0-1160.6.1.el7.x86_64 #1 SMP Tue Nov 17 13:59:11 UTC > 2020 x86_64 x86_64 x86_64 GNU/Linux > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an > email to ceph-users-leave(a)ceph.io _______________________________________________ ceph-users mailing list -- ceph-users(a)ceph.io To unsubscribe send an email to ceph-users-leave(a)ceph.io

3 years, 3 months

2
1
0 0

Data migration between clusters

by Szabo, Istvan (Agoda)

What is the easiest and best way to migrate bucket from an old cluster to a new one? Luminous to octopus not sure does it matter from the data perspective. ________________________________ This message is confidential and is for the sole use of the intended recipient(s). It may also be privileged or otherwise protected by copyright or other legal rules. If you have received it by mistake please let us know by reply email and delete it from your system. It is prohibited to copy this message or disclose its content to anyone. Any confidentiality or privilege is not waived or lost by any mistaken delivery or unauthorized disclosure of the message. All messages sent to and from Agoda may be monitored to ensure compliance with company policies, to protect the company's interests and to remove potential malware. Electronic messages may be intercepted, amended, lost or deleted, or contain viruses.

3 years, 3 months

2
3
0 0

OSD reboot loop after running out of memory

by Stefan Wild

Hi, We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the server ran out of memory for unknown reasons (normally the machine uses about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop. Logs will just mention system seeing the daemon fail and then restarting it. Since the out of memory incident, we’ve have 3 OSDs fail this way at separate times. We resorted to wiping the affected OSD and re-adding it to the cluster, but it seems as soon as all PGs have moved to the OSD, the next one fails. This is also keeping us from re-deploying RGW, which was affected by the same out of memory incident, since cephadm runs a check and won’t deploy the service unless the cluster is in HEALTH_OK status. Any help would be greatly appreciated. Thanks, Stefan

3 years, 3 months

5
17
0 0

High read IO on RocksDB/WAL since upgrade to Octopus

by Glen Baars

Hello Ceph Users, Since upgrading from Nautilus to Octopus ( cluster started in luminous ) I have been trying to debug why the RocksDB/WAL is maxing out the SSD drives. ( QD > 32, 12000 read IOPS, 200 write IOPS ). The omap upgrade on migration was disabled initially but I reenabled it and restarted all OSD's. This completed without issue. I have increased the memory target from 4 to 6GB per OSD but it doesn't look like it is using it all anyway ( based on top ). I have offline compacted all OSDs. This seems to help for about 4-6 hours ( backfilling is occuring - maybe this triggers it? ). RGW garbage collection is upto date. Pg_log on some PG's are high due to them not being in a clean state ( 8% PGs > 3000 ) remainder of PG's I have reduced to 500 logs - no change. I've been working on this issue for days not without much luck. Nothing in the logs indicates a major issue. The client impact is a major reduction in speed. { "mon": { "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 5 }, "mgr": { "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 1 }, "osd": { "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 18, "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 280 }, "mds": {}, "rgw": { "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 2 }, "tcmu-runner": { "ceph version 14.2.13-450-g65ea1b614d (65ea1b614db8b6d10f334a8ff67c4de97f73bcbf) nautilus (stable)": 2 }, "overall": { "ceph version 14.2.13-450-g65ea1b614d (65ea1b614db8b6d10f334a8ff67c4de97f73bcbf) nautilus (stable)": 2, "ceph version 15.2.5 (2c93eff00150f0cc5f106a559557a58d3d7b6f1f) octopus (stable)": 18, "ceph version 15.2.8 (bdf3eebcd22d7d0b3dd4d5501bee5bac354d5b55) octopus (stable)": 288 } } Any assistance in debugging would be greatly helpful. Glen This e-mail is intended solely for the benefit of the addressee(s) and any other named recipient. It is confidential and may contain legally privileged or confidential information. If you are not the recipient, any use, distribution, disclosure or copying of this e-mail is prohibited. The confidentiality and legal privilege attached to this communication is not waived or lost by reason of the mistaken transmission or delivery to you. If you have received this e-mail in error, please notify us immediately.

3 years, 3 months

2
1
0 0

2024

2023

2022

2021

2020

2019

ceph-users December 2020