January 2024 - ceph-users

1 clients failing to respond to cache pressure (quincy:17.2.6)

by Özkan Göksu

Hello. I have 5 node ceph cluster and I'm constantly having "clients failing to respond to cache pressure" warning. I have 84 cephfs kernel clients (servers) and my users are accessing their personal subvolumes located on one pool. My users are software developers and the data is home and user data. (Git, python projects, sample data and generated new data) --------------------------------------------------------------------------------- --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED ssd 146 TiB 101 TiB 45 TiB 45 TiB 30.71 TOTAL 146 TiB 101 TiB 45 TiB 45 TiB 30.71 --- POOLS --- POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL .mgr 1 1 356 MiB 90 1.0 GiB 0 30 TiB cephfs.ud-data.meta 9 256 69 GiB 3.09M 137 GiB 0.15 45 TiB cephfs.ud-data.data 10 2048 26 TiB 100.83M 44 TiB 32.97 45 TiB --------------------------------------------------------------------------------- root@ud-01:~# ceph fs status ud-data - 84 clients ======= RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS 0 active ud-data.ud-04.seggyv Reqs: 142 /s 2844k 2798k 303k 720k POOL TYPE USED AVAIL cephfs.ud-data.meta metadata 137G 44.9T cephfs.ud-data.data data 44.2T 44.9T STANDBY MDS ud-data.ud-02.xcoojt ud-data.ud-05.rnhcfe ud-data.ud-03.lhwkml ud-data.ud-01.uatjle MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable) ----------------------------------------------------------------------------------- My MDS settings are below: mds_cache_memory_limit | 8589934592 mds_cache_trim_threshold | 524288 mds_recall_global_max_decay_threshold | 131072 mds_recall_max_caps | 30000 mds_recall_max_decay_rate | 1.500000 mds_recall_max_decay_threshold | 131072 mds_recall_warning_threshold | 262144 I have 2 questions: 1- What should I do to prevent cache pressue warning ? 2- What can I do to increase speed ? - Thanks

3 months, 3 weeks

4
23
0 0

crushmap rules :: host selection

by Adrian Sevcenco

Hi! I'm new with ceph and i struggle to make a mapping between my current storage knowledge and ceph... So, i will state my understanding of the context and the question so please correct me with anything that i got wrong :) So, files (or pieces of files) are put in PGs that are given sections of OSDs. The crushmap gives a physical OSDs map to be chosen for placement or access Pools are a logical name for a storage space but how can i specify what osds or host are part of a pool? For replication, how can i specify: if a replica is missing (for a given time) start rebuilding on some available OSD? Is there a notion of "spare" so if an osd is missing on action, the rebuild to start on another host and when the old OSD is back (the hdd is replaced, or the machine was repaired) to be automatically cleaned up and used? I'm thinking about a 3 node cluster with the replica=2 failure domain = host, in such a way if one node is down, the data from there to be replicated on the remaining nodes (with some drives kept as spares..) I am almost certain that from the point of view of ceph, what i'm thinking is wrong so i would love to receive some advice :) Thanks a lot! Adrian

3 months, 3 weeks

3
9
0 0

Re: Ceph OSD reported Slow operations

by V A Prabha

Hi Eugen Please find the details below root@meghdootctr1:/var/log/ceph# ceph -s cluster: id: c59da971-57d1-43bd-b2b7-865d392412a5 health: HEALTH_WARN nodeep-scrub flag(s) set 544 pgs not deep-scrubbed in time services: mon: 3 daemons, quorum meghdootctr1,meghdootctr2,meghdootctr3 (age 5d) mgr: meghdootctr1(active, since 5d), standbys: meghdootctr2, meghdootctr3 mds: 3 up:standby osd: 36 osds: 36 up (since 34h), 36 in (since 34h) flags nodeep-scrub data: pools: 2 pools, 544 pgs objects: 10.14M objects, 39 TiB usage: 116 TiB used, 63 TiB / 179 TiB avail pgs: 544 active+clean io: client: 24 MiB/s rd, 16 MiB/s wr, 2.02k op/s rd, 907 op/s wr Ceph Versions: root@meghdootctr1:/var/log/ceph# ceph --version ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable) Ceph df -h https://pastebin.com/1ffucyJg Ceph OSD performance dump https://pastebin.com/1R6YQksE Ceph tell osd.XX bench (Out of 36 osds only 8 OSDs give High IOPS value of 250 +. Out of that 4 OSDs are from HP 3PAR and 4 OSDS from DELL EMC. We are using only 4 OSDs from HP3 par and it is working fine without any latency and iops issues from the beginning but the remaining 32 OSDs are from DELL EMC in which 4 OSDs are much better than the remaining 28 OSDs) https://pastebin.com/CixaQmBi Please help me to identify if the issue is with the DELL EMC Storage, Ceph configuration parameter tuning or the Overload in the cloud setup On November 1, 2023 at 9:48 PM Eugen Block <eblock(a)nde.ag> wrote: > Hi, > > for starters please add more cluster details like 'ceph status', 'ceph > versions', 'ceph osd df tree'. Increasing the to 10G was the right > thing to do, you don't get far with 1G with real cluster load. How are > the OSDs configured (HDD only, SSD only or HDD with rocksdb on SSD)? > How is the disk utilization? > > Regards, > Eugen > > Zitat von prabhav(a)cdac.in: > > > In a production setup of 36 OSDs( SAS disks) totalling 180 TB > > allocated to a single Ceph Cluster with 3 monitors and 3 managers. > > There were 830 volumes and VMs created in Openstack with Ceph as a > > backend. On Sep 21, users reported slowness in accessing the VMs. > > Analysing the logs lead us to problem with SAS , Network congestion > > and Ceph configuration( as all default values were used). We updated > > the Network from 1Gbps to 10Gbps for public and cluster networking. > > There was no change. > > The ceph benchmark performance showed that 28 OSDs out of 36 OSDs > > reported very low IOPS of 30 to 50 while the remaining showed 300+ > > IOPS. > > We gradually started reducing the load on the ceph cluster and now > > the volumes count is 650. Now the slow operations has gradually > > reduced but I am aware that this is not the solution. > > Ceph configuration is updated with increasing the > > osd_journal_size to 10 GB, > > osd_max_backfills = 1 > > osd_recovery_max_active = 1 > > osd_recovery_op_priority = 1 > > bluestore_cache_trim_max_skip_pinned=10000 > > > > After one month, now we faced another issue with Mgr daemon stopped > > in all 3 quorums and 16 OSDs went down. From the > > ceph-mon,ceph-mgr.log could not get the reason. Please guide me as > > its a production setup > > _______________________________________________ > > ceph-users mailing list -- ceph-users(a)ceph.io > > To unsubscribe send an email to ceph-users-leave(a)ceph.io > > > _______________________________________________ > ceph-users mailing list -- ceph-users(a)ceph.io > To unsubscribe send an email to ceph-users-leave(a)ceph.io Thanks & Regards, Ms V A Prabha / श्रीमती प्रभा वी ए Joint Director / संयुक्त निदेशक Centre for Development of Advanced Computing(C-DAC) / प्रगत संगणन विकास केन्द्र(सी-डैक) Tidel Park”, 8th Floor, “D” Block, (North &South) / “टाइडल पार्क”,8वीं मंजिल, “डी” ब्लॉक, (उत्तर और दक्षिण) No.4, Rajiv Gandhi Salai / नं.4, राजीव गांधी सलाई Taramani / तारामणि Chennai / चेन्नई – 600113 Ph.No.:044-22542226/27 Fax No.: 044-22542294 ------------------------------------------------------------------------------------------------------------ [ C-DAC is on Social-Media too. Kindly follow us at: Facebook: https://www.facebook.com/CDACINDIA & Twitter: @cdacindia ] This e-mail is for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies and the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email is strictly prohibited and appropriate legal action will be taken. ------------------------------------------------------------------------------------------------------------

3 months, 3 weeks

4
15
0 0

17.2.7: Backfilling deadlock / stall / stuck / standstill

by Kai Stian Olstad

Hi, This is a cluster running 17.2.7 upgraded from 16.2.6 on the 15 January 2024. On Monday 22 January we had 4 HDD all on different server with I/O-error because of some damage sectors, the OSD is hybrid so the DB is on SSD, 5 HDD share 1 SSD. I set the OSD out, ceph osd out 223 269 290 318 and all hell broke loose. I took only minutes before the users complained about Ceph not working. Ceph status reportet slow OPS on the OSDs that was set to out, and “ceph tell osd.<id> dump_ops_in_flight” against the out OSDs it just hang, after 30 minutes I stopped the dump command. Long story short I ended up running “ceph osd set nobackfill” to slow ops was gone and then unset it when the slow ops message disappeared. I needed to run that all the time so the cluster didn’t come to a holt so this oneliner loop was used “while true; do ceph -s | grep -qE "oldest one blocked for [0-9]{2,}" && (date; ceph osd set nobackfill; sleep 15; ceph osd unset nobackfill); sleep 10; done” But now 4 days later the backfilling has stopped progressing completely and the number of misplaced object is increasing. Some PG has 0 misplaced object but sill have backfilling state, and been in this state for over 24 hours now. I have a hunch that it’s because of PG 404.6e7 is in state “active+recovering+degraded+remapped” it’s been in this state for over 48 hours. It’s has possible 2 missing object, but since they are not unfound I can’t delete them with “ceph pg 404.6e7 mark_unfound_lost delete” Could someone please help to solve this? Down below is some output of ceph commands, I’ll also attache them. ceph status (only removed information about no running scrub and deep_scrub) --- cluster: id: b321e76e-da3a-11eb-b75c-4f948441dcd0 health: HEALTH_WARN Degraded data redundancy: 2/6294904971 objects degraded (0.000%), 1 pg degraded services: mon: 3 daemons, quorum ceph-mon-1,ceph-mon-2,ceph-mon-3 (age 11d) mgr: ceph-mon-1.ptrsea(active, since 11d), standbys: ceph-mon-2.mfdanx mds: 1/1 daemons up, 1 standby osd: 355 osds: 355 up (since 22h), 351 in (since 4d); 18 remapped pgs rgw: 7 daemons active (7 hosts, 1 zones) data: volumes: 1/1 healthy pools: 14 pools, 3945 pgs objects: 1.14G objects, 1.1 PiB usage: 1.8 PiB used, 1.2 PiB / 3.0 PiB avail pgs: 2/6294904971 objects degraded (0.000%) 2980455/6294904971 objects misplaced (0.047%) 3901 active+clean 22 active+clean+scrubbing+deep 17 active+remapped+backfilling 4 active+clean+scrubbing 1 active+recovering+degraded+remapped io: client: 167 MiB/s rd, 13 MiB/s wr, 6.02k op/s rd, 2.35k op/s wr ceph health detail (only removed information about no running scrub and deep_scrub) --- HEALTH_WARN Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded [WRN] PG_DEGRADED: Degraded data redundancy: 2/6294902067 objects degraded (0.000%), 1 pg degraded pg 404.6e7 is active+recovering+degraded+remapped, acting [223,274,243,290,286,283] ceph pg 202.6e7 list_unfound --- { "num_missing": 2, "num_unfound": 0, "objects": [], "state": "Active", "available_might_have_unfound": true, "might_have_unfound": [], "more": false } ceph pg 404.6e7 query | jq .recovery_state --- [ { "name": "Started/Primary/Active", "enter_time": "2024-01-26T09:08:41.918637+0000", "might_have_unfound": [ { "osd": "243(2)", "status": "already probed" }, { "osd": "274(1)", "status": "already probed" }, { "osd": "275(0)", "status": "already probed" }, { "osd": "283(5)", "status": "already probed" }, { "osd": "286(4)", "status": "already probed" }, { "osd": "290(3)", "status": "already probed" }, { "osd": "335(3)", "status": "already probed" } ], "recovery_progress": { "backfill_targets": [ "275(0)", "335(3)" ], "waiting_on_backfill": [], "last_backfill_started": "404:e76011a9:::1f244892-a2e7-406b-aa62-1b13511333a2.625411.18_56463c71-286c-4399-8d5d-0c278b7c97fd:head", "backfill_info": { "begin": "MIN", "end": "MIN", "objects": [] }, "peer_backfill_info": [], "backfills_in_flight": [], "recovering": [], "pg_backend": { "recovery_ops": [], "read_ops": [] } } }, { "name": "Started", "enter_time": "2024-01-26T09:08:40.909151+0000" } ] ceph pg ls recovering backfilling --- PG OBJECTS DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS STATE SINCE VERSION REPORTED UP ACTING 404.bc 287986 0 0 0 512046716673 0 0 10091 0 active+recovering+remapped 2h 217988'1385478 217988:10897565 [193,297,279,276,136,197]p193 [223,297,269,276,136,197]p223 404.c4 288236 0 288236 0 511669837559 0 0 10063 0 active+remapped+backfilling 24h 217988'1378228 217988:11719855 [156,186,178,345,339,177]p156 [223,186,178,345,339,177]p223 404.12a 287544 0 0 0 512246100354 0 0 10009 0 active+remapped+backfilling 24h 217988'1392371 217988:13739524 [248,178,250,145,304,272]p248 [223,178,250,145,304,272]p223 404.1c1 287739 0 286969 0 511800674008 0 0 10047 0 active+remapped+backfilling 2d 217988'1402889 217988:10975174 [332,246,183,169,280,255]p332 [318,246,183,169,280,255]p318 404.258 287737 0 277111 0 510099501390 0 0 10077 0 active+remapped+backfilling 24h 217988'1451778 217988:12780104 [308,199,134,342,188,221]p308 [318,199,134,342,188,221]p318 404.269 287990 0 0 0 512343190608 0 0 10043 0 active+remapped+backfilling 24h 217988'1358446 217988:14020217 [275,205,283,247,211,292]p275 [223,205,283,247,211,292]p223 404.34e 287624 0 277899 0 510447074297 0 0 10002 0 active+remapped+backfilling 24h 217988'1392933 217988:12636557 [322,141,338,168,251,218]p322 [318,141,338,168,251,218]p318 404.39c 287844 0 286692 0 512947685682 0 0 10017 0 active+remapped+backfilling 2d 217988'1414697 217988:11004944 [288,188,131,299,295,181]p288 [318,188,131,299,295,181]p318 404.511 287589 0 0 0 512014863711 0 0 10057 0 active+remapped+backfilling 24h 217988'1368741 217988:11544729 [166,151,327,333,186,150]p166 [223,151,327,333,186,150]p223 404.5f1 288126 0 286621 0 510850256945 0 0 10071 0 active+remapped+backfilling 24h 217988'1365831 217988:10348125 [214,332,289,184,255,160]p214 [223,332,289,184,255,160]p223 404.62a 288035 0 0 0 511318662269 0 0 10014 0 active+remapped+backfilling 3h 217988'1358010 217988:12528704 [322,260,259,319,149,152]p322 [318,260,259,319,149,152]p318 404.63d 287372 0 286559 0 508783837699 0 0 10074 0 active+remapped+backfilling 24h 217988'1402174 217988:11685744 [303,307,186,350,161,267]p303 [318,307,186,350,161,267]p318 404.6e3 288110 0 0 0 509047569016 0 0 10049 0 active+remapped+backfilling 24h 217988'1368547 217988:12202278 [166,317,233,144,337,240]p166 [223,317,233,144,337,240]p223 404.6e7 287856 2 2 0 510383394904 0 0 10047 0 active+recovering+degraded+remapped 3h 217988'1356501 217988:13157749 [275,274,243,335,286,283]p275 [223,274,243,290,286,283]p223 404.7d2 287619 0 286026 0 510708533087 0 0 10093 0 active+remapped+backfilling 3d 217988'1397393 217988:12146656 [185,139,299,222,155,149]p185 [223,139,299,222,155,149]p223 412.119 711468 0 0 0 207473602580 0 0 10099 0 active+remapped+backfilling 24h 217988'21613330 217988:87589096 [352,207,292,314,230,262]p352 [318,207,292,314,230,262]p318 412.12f 711529 0 701279 0 208498170310 0 0 10033 0 active+remapped+backfilling 24h 217988'14873593 217988:86198113 [303,305,183,215,130,244]p303 [318,305,183,215,130,244]p318 412.1fb 713044 0 3166 0 207787641403 0 0 10097 0 active+remapped+backfilling 2d 217988'14893270 217988:105346132 [156,137,228,241,262,353]p156 [223,137,228,241,262,353]p223 ceph osd tree out --- ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 3112.43481 root default -67 192.35847 host ceph-hd-001 269 hdd 12.82390 osd.269 up 0 1.00000 -49 192.35847 host ceph-hd-003 223 hdd 12.82390 osd.223 up 0 1.00000 -73 192.35847 host ceph-hd-011 290 hdd 12.82390 osd.290 up 0 1.00000 -79 192.35847 host ceph-hd-014 318 hdd 12.82390 osd.318 up 0 1.00000

3 months, 3 weeks

3
4
0 0

Quite important: How do I restart a small cluster using cephadm at 18.2.1

by Carl J Taylor

Hi, Due to idiotic behaviour on my part I made a mistake while replacing some disks in our data centre and our cluster ended up all powered off! I have been using ceph for many years (since firefly) but only recently upgraded to reef and moved to the cephadm / podman setup. I am trying to figure out how to get it all started up again. I am not very familiar with docker at all. I can see the bootstrap option but no "recover" option. It is a small cluster with 3 nodes and two 3TB/4TB disks in each node. I have had a look at https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#… but wonder if cephadm does this itself automagically? Help please, I don't want to lose my data! Many thanks Carl.

3 months, 3 weeks

2
1
0 0

c-states and OSD performance

by Christopher Durham

Hi, The following article: https://ceph.io/en/news/blog/2024/ceph-a-journey-to-1tibps/ suggests that dsabling C-states on your CPUs (on the OSD nodes) as one method to improve performance. The article seems to indicate that the scenariobeing addressed in the article was with NVMEs as OSDs. Questions: Will disabling C-states and keeping the processors at max power state help performance for the following: 1. NVME OSDs (yes)2. SSD OSDs3. Spinning disk OSDs -Chris

3 months, 3 weeks

2
1
0 0

Odd auto-scaler warnings about too few/many PGs

by Torkil Svensgaard

Hi A few years ago we were really strapped for space so we tweaked pg_num for some pools to ensure all pgs were as to close to the same size as possible while stile observing the power of 2 rule, in order to get the most mileage space wise. We set the auto-scaler to off for the tweaked pools to get rid of the warnings. We now have a lot more free space so I flipped the auto-scaler to warn for all pools and set the bulk flag for the pools expected to be data pools, leading to this: " [WRN] POOL_TOO_FEW_PGS: 4 pools have too few placement groups Pool rbd has 512 placement groups, should have 2048 Pool rbd_internal has 1024 placement groups, should have 2048 Pool cephfs.nvme.data has 32 placement groups, should have 4096 Pool cephfs.ssd.data has 32 placement groups, should have 1024 [WRN] POOL_TOO_MANY_PGS: 4 pools have too many placement groups Pool libvirt has 256 placement groups, should have 32 Pool cephfs.cephfs.data has 512 placement groups, should have 32 Pool rbd_ec_data has 4096 placement groups, should have 1024 Pool cephfs.hdd.data has 2048 placement groups, should have 1024 " That's a lot of warnings *ponder* " # ceph osd pool autoscale-status POOL SIZE TARGET SIZE RATE RAW CAPACITY RATIO TARGET RATIO EFFECTIVE RATIO BIAS PG_NUM NEW PG_NUM AUTOSCALE BULK libvirt 2567G 3.0 3031T 0.0025 1.0 256 warn False .mgr 807.5M 2.0 6520G 0.0002 1.0 1 warn False rbd_ec 9168k 3.0 6520G 0.0000 1.0 32 warn False nvme 31708G 2.0 209.5T 0.2955 1.0 2048 warn False .nfs 36864 3.0 6520G 0.0000 1.0 32 warn False cephfs.cephfs.meta 24914M 3.0 6520G 0.0112 4.0 32 warn False cephfs.cephfs.data 16384 3.0 6520G 0.0000 1.0 512 warn False rbd.ssd.data 798.1G 2.25 6520G 0.2754 1.0 64 warn False rbd_ec_data 609.2T 1.5 3031T 0.3014 1.0 4096 warn True rbd 68170G 3.0 3031T 0.0659 1.0 512 warn True rbd_internal 69553G 3.0 3031T 0.0672 1.0 1024 warn True cephfs.nvme.data 0 2.0 209.5T 0.0000 1.0 32 warn True cephfs.ssd.data 68609M 2.0 6520G 0.0206 1.0 32 warn True cephfs.hdd.data 111.0T 2.25 3031T 0.0824 1.0 2048 warn True " " # ceph df --- RAW STORAGE --- CLASS SIZE AVAIL USED RAW USED %RAW USED hdd 3.0 PiB 1.3 PiB 1.6 PiB 1.6 PiB 54.69 nvme 210 TiB 146 TiB 63 TiB 63 TiB 30.21 ssd 6.4 TiB 4.0 TiB 2.4 TiB 2.4 TiB 37.69 TOTAL 3.2 PiB 1.5 PiB 1.7 PiB 1.7 PiB 53.07 --- POOLS --- POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL rbd 4 512 80 TiB 21.35M 200 TiB 19.31 278 TiB libvirt 5 256 3.0 TiB 810.89k 7.5 TiB 0.89 278 TiB rbd_internal 6 1024 86 TiB 28.22M 204 TiB 19.62 278 TiB .mgr 8 1 4.3 GiB 1.06k 1.6 GiB 0.07 1.0 TiB rbd_ec 10 32 55 MiB 25 27 MiB 0 708 GiB rbd_ec_data 11 4096 683 TiB 180.52M 914 TiB 52.26 556 TiB nvme 23 2048 46 TiB 25.18M 62 TiB 31.62 67 TiB .nfs 25 32 4.6 KiB 10 108 KiB 0 708 GiB cephfs.cephfs.meta 31 32 25 GiB 1.66M 73 GiB 3.32 708 GiB cephfs.cephfs.data 32 679 489 B 40.41M 48 KiB 0 708 GiB cephfs.nvme.data 34 32 0 B 0 0 B 0 67 TiB cephfs.ssd.data 35 32 77 GiB 425.03k 134 GiB 5.94 1.0 TiB cephfs.hdd.data 37 2048 121 TiB 68.42M 250 TiB 23.03 371 TiB rbd.ssd.data 38 64 934 GiB 239.94k 1.8 TiB 45.82 944 GiB " The most weird one: Pool rbd_ec_data stores 683TB in 4096 pgs -> warn should be 1024 Pool rbd_internal stores 86TB in 1024 pgs-> warn should be 2048 That makes no sense to me based on the amount of data stored. Is this a bug or what am I missing? Ceph version is 17.2.7. Mvh. Torkil -- Torkil Svensgaard Systems Administrator Danish Research Centre for Magnetic Resonance DRCMR, Section 714 Copenhagen University Hospital Amager and Hvidovre Kettegaard Allé 30, 2650 Hvidovre, Denmark

3 months, 3 weeks

3
2
0 0

Questions about the CRUSH details

by Henry lol

Hello, I'm new to ceph and sorry in advance for the naive questions. 1. As far as I know, CRUSH utilizes the cluster map consisting of the PG map and others. I don't understand why CRUSH computation is required on client-side, even though PG-to-OSDs mapping can be acquired from the PG map. 2. how does the client get a valid(old) OSD set when the PG is being remapped to a new ODS set which CRUSH returns? thanks.

3 months, 3 weeks

5
9
0 0

podman / docker issues

by Marc

More and more I am annoyed with the 'dumb' design decisions of redhat. Just now I have an issue on an 'air gapped' vm that I am unable to start a docker/podman container because it tries to contact the repository to update the image and instead of using the on disk image it just fails. (Not to mention the %$#$%#$ that design containers to download stuff from the internet on startup) I was wondering if this is also an issue with ceph-admin. Is there an issue with starting containers when container image repositories are not available or when there is no internet connection.

3 months, 3 weeks

3
2
0 0

RGW crashes when rgw_enable_ops_log is enabled

by Marc Singer

Hi Ceph Users I am encountering a problem with the RGW Admin Ops Socket. I am setting up the socket as follows: rgw_enable_ops_log = true rgw_ops_log_socket_path = /tmp/ops/rgw-ops.socket rgw_ops_log_data_backlog = 16Mi Seems like the socket fills up over time and it doesn't seem to get flushed, at some point the process runs out of file space. Do I need to configure something or send something for the socket to flush? See the log here: 0> 2024-01-25T13:10:13.908+0000 7f247b00eb00 -1 *** Caught signal (File size limit exceeded) ** in thread 7f247b00eb00 thread_name:ops_log_file ceph version 18.2.0 (5dd24139a1eada541a3bc16b6941c5dde975e26d) reef (stable) NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. --- logging levels --- 0/ 5 none 0/ 1 lockdep 0/ 1 context 1/ 1 crush 1/ 5 mds 1/ 5 mds_balancer 1/ 5 mds_locker 1/ 5 mds_log 1/ 5 mds_log_expire 1/ 5 mds_migrator 0/ 1 buffer 0/ 1 timer 0/ 1 filer 0/ 1 striper 0/ 1 objecter 0/ 5 rados 0/ 5 rbd 0/ 5 rbd_mirror 0/ 5 rbd_replay 0/ 5 rbd_pwl 0/ 5 journaler 0/ 5 objectcacher 0/ 5 immutable_obj_cache 0/ 5 client 1/ 5 osd 0/ 5 optracker 0/ 5 objclass 1/ 3 filestore 1/ 3 journal 0/ 0 ms 1/ 5 mon 0/10 monc 1/ 5 paxos 0/ 5 tp 1/ 5 auth 1/ 5 crypto 1/ 1 finisher 1/ 1 reserver 1/ 5 heartbeatmap 1/ 5 perfcounter 1/ 5 rgw 1/ 5 rgw_sync 1/ 5 rgw_datacache 1/ 5 rgw_access 1/ 5 rgw_dbstore 1/ 5 rgw_flight 1/ 5 javaclient 1/ 5 asok 1/ 1 throttle 0/ 0 refs 1/ 5 compressor 1/ 5 bluestore 1/ 5 bluefs 1/ 3 bdev 1/ 5 kstore 4/ 5 rocksdb 4/ 5 leveldb 1/ 5 fuse 2/ 5 mgr 1/ 5 mgrc 1/ 5 dpdk 1/ 5 eventtrace 1/ 5 prioritycache 0/ 5 test 0/ 5 cephfs_mirror 0/ 5 cephsqlite 0/ 5 seastore 0/ 5 seastore_onode 0/ 5 seastore_odata 0/ 5 seastore_omap 0/ 5 seastore_tm 0/ 5 seastore_t 0/ 5 seastore_cleaner 0/ 5 seastore_epm 0/ 5 seastore_lba 0/ 5 seastore_fixedkv_tree 0/ 5 seastore_cache 0/ 5 seastore_journal 0/ 5 seastore_device 0/ 5 seastore_backref 0/ 5 alienstore 1/ 5 mclock 0/ 5 cyanstore 1/ 5 ceph_exporter 1/ 5 memstore -2/-2 (syslog threshold) 99/99 (stderr threshold) --- pthread ID / name mapping for recent threads --- 7f2472a89b00 / safe_timer 7f2472cadb00 / radosgw ... log_file /var/lib/ceph/crash/2024-01-25T13:10:13.909546Z_01ee6e6a-e946-4006-9d32-e17ef2f9df74/log --- end dump of recent events --- reraise_fatal: default handler for signal 25 didn't terminate the process? Thank you for your help. Marc

3 months, 3 weeks

2
2
0 0

2024

2023

2022

2021

2020

2019

ceph-users January 2024