Hi,
Is anybody running a cluster with mixed operating systems?
Due to the CentOS 8 change I might try to add Ubuntu OSD nodes to our CentOS cluster and slowly decommission the CentOS nodes, but I'm not sure whether this is possible.
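For context, I assume a mixed cluster would show both distros in the OSD metadata, e.g.:
```
# OSD metadata records each daemon's host distro; with mixed nodes both
# distros should appear here:
ceph osd metadata | grep -E '"distro|"hostname'
```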
Thank you
The docs for permissions are super vague. What does each flag do?
What does 'x' permit?
What's the difference between class-write and write?
And the last question: can we limit a user to reading/writing only
existing objects in the pool?
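For concreteness, this is the kind of cap string I'm asking about (the client and pool names are made up):
```
# Made-up names (client.appuser, mypool); shows where the flags in
# question go in a cap string:
ceph auth get-or-create client.appuser \
    mon 'allow r' \
    osd 'allow class-read class-write pool=mypool'
# Inspect the caps a user currently has:
ceph auth get client.appuser
```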
Thanks!
Hey all,
We will be having a Ceph science/research/big cluster call on Wednesday
January 27th. If anyone wants to discuss something specific they can add
it to the pad linked below. If you have questions or comments you can
contact me.
This is an informal open call of community members mostly from
hpc/htc/research environments where we discuss whatever is on our minds
regarding ceph: updates, outages, features, maintenance, etc. There is
no set presenter, but I do attempt to keep the conversation lively.
https://pad.ceph.com/p/Ceph_Science_User_Group_20210127
We try to keep it to an hour or less.
Ceph calendar event details:
January 27th, 2021
15:00 UTC
4pm Central European
9am Central US
Description: Main pad for discussions:
https://pad.ceph.com/p/Ceph_Science_User_Group_Index
Meetings will be recorded and posted to the Ceph Youtube channel.
To join the meeting on a computer or mobile phone:
https://bluejeans.com/908675367?src=calendarLink
To join from a Red Hat Deskphone or Softphone, dial: 84336.
Connecting directly from a room system?
1.) Dial: 199.48.152.152 or bjn.vc
2.) Enter Meeting ID: 908675367
Just want to dial in on your phone?
1.) Dial one of the following numbers: 408-915-6466 (US)
See all numbers: https://www.redhat.com/en/conference-numbers
2.) Enter Meeting ID: 908675367
3.) Press #
Want to test your video connection? https://bluejeans.com/111
Kevin
--
Kevin Hrpcek
NASA VIIRS Atmosphere SIPS
Space Science & Engineering Center
University of Wisconsin-Madison
Hi,
We have bucket sync enabled, and it seems to be inconsistent ☹
This is the master zone sync status on that specific bucket:
          realm 5fd28798-9195-44ac-b48d-ef3e95caee48 (realm)
      zonegroup 31a5ea05-c87a-436d-9ca0-ccfcbad481e3 (data)
           zone 9213182a-14ba-48ad-bde9-289a1c0c0de8 (hkg)
  metadata sync no sync (zone is master)
      data sync source: 61c9d940-fde4-4bed-9389-edc8d7741817 (sin)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: f20ddd64-924b-4f78-8d2d-dd6c65f98ba9 (ash)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 126 shards
                        behind shards: [0,1,2,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
                        oldest incremental change not applied: 2021-01-25T11:32:57.726042+0700 [62]
                        104 shards are recovering
                        recovering shards: [0,2,3,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,24,25,26,27,28,29,31,32,33,36,37,38,39,40,42,43,44,45,47,50,51,52,53,54,55,57,58,61,63,65,66,67,68,69,70,71,72,73,74,75,76,78,80,81,82,83,84,85,87,88,90,92,93,95,96,97,98,99,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,123,124,125,126,127]
This is the secondary zone where the data has been uploaded:
          realm 5fd28798-9195-44ac-b48d-ef3e95caee48 (realm)
      zonegroup 31a5ea05-c87a-436d-9ca0-ccfcbad481e3 (data)
           zone f20ddd64-924b-4f78-8d2d-dd6c65f98ba9 (ash)
  metadata sync syncing
                full sync: 0/64 shards
                incremental sync: 64/64 shards
                metadata is caught up with master
      data sync source: 61c9d940-fde4-4bed-9389-edc8d7741817 (sin)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is caught up with source
                source: 9213182a-14ba-48ad-bde9-289a1c0c0de8 (hkg)
                        syncing
                        full sync: 0/128 shards
                        incremental sync: 128/128 shards
                        data is behind on 125 shards
                        behind shards: [0,1,2,3,4,5,6,8,9,10,11,12,13,14,15,16,17,18,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
                        oldest incremental change not applied: 2021-01-25T11:29:32.450031+0700 [61]
                        126 shards are recovering
                        recovering shards: [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,104,105,106,107,108,109,110,111,112,114,115,116,117,118,119,120,121,122,123,124,125,126,127]
The pipes are already there:
"id": "seo-2",
"data_flow": {
"symmetrical": [
{
"id": "seo-2-flow",
"zones": [
"9213182a-14ba-48ad-bde9-289a1c0c0de8",
"f20ddd64-924b-4f78-8d2d-dd6c65f98ba9"
]
}
]
},
"pipes": [
{
"id": "seo-2-hkg-ash-pipe",
"source": {
"bucket": "seo..prerender",
"zones": [
"9213182a-14ba-48ad-bde9-289a1c0c0de8"
]
},
"dest": {
"bucket": "seo..prerender",
"zones": [
"f20ddd64-924b-4f78-8d2d-dd6c65f98ba9"
]
},
"params": {
"source": {
"filter": {
"tags": []
}
},
"dest": {},
"priority": 0,
"mode": "system",
"user": ""
}
},
{
"id": "seo-2-ash-hkg-pipe",
"source": {
"bucket": "seo..prerender",
"zones": [
"f20ddd64-924b-4f78-8d2d-dd6c65f98ba9"
]
},
"dest": {
"bucket": "seo..prerender",
"zones": [
"9213182a-14ba-48ad-bde9-289a1c0c0de8"
]
},
"params": {
"source": {
"filter": {
"tags": []
}
},
"dest": {},
"priority": 0,
"mode": "system",
"user": ""
}
}
],
"status": "enabled"
}
Any ideas on how to troubleshoot this?
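So far I have only looked at the zone-level status above; I assume the per-bucket view and the sync error log are the next things to check, something like:
```
# Per-bucket sync status against each source zone:
radosgw-admin bucket sync status --bucket=seo..prerender
# Errors recorded by the data-sync machinery:
radosgw-admin sync error list
```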
Thank you
I'm trying to set up a new ceph cluster with cephadm on a SUSE SES trial
that has Ceph 15.2.8
Each OSD node has 18 rotational SAS disks, 4 NVMe 2TB SSDs for DB, and 2
NVME2 200GB Optane SSDs for WAL.
These servers will eventually have 24 rotational SAS disks that they will
inherit from existing storage servers. So I don't want all the space used
on the DB and WAL SSDs.
Based on the comment "(db_slots is actually to be favoured here, but
it's not implemented yet)" on this page in the docs:
https://docs.ceph.com/en/latest/cephadm/drivegroups/#the-advanced-case
I suspect these parameters are not yet implemented, even though they are
documented under "ADDITIONAL OPTIONS".
My osd_spec.yml:
service_type: osd
service_id: three_tier_osd
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
  model: 'ST14000NM0288'
db_devices:
  rotational: 0
  model: 'INTEL SSDPE2KX020T8'
  limit: 6
wal_devices:
  model: 'INTEL SSDPEL1K200GA'
  limit: 12
db_slots: 6
wal_slots: 12
All available space is consumed on my DB and WAL SSDs with only 18 OSDs,
leaving no room to add additional spindles.
Is this still work in progress, or a bug I should report? It is possibly
related to https://github.com/rook/rook/issues/5026. At a minimum, this
appears to be a documentation bug.
How can I work around this?
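One direction I'm considering (untested; it assumes the block_db_size option from the same docs page is honored in 15.2.8 and that the --dry-run preview is available):
```
# Sketch only: size the DB partitions explicitly instead of relying on
# db_slots (the 300G value here is made up):
cat > osd_spec_sized.yml <<'EOF'
service_type: osd
service_id: three_tier_osd
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
block_db_size: 300G
EOF
# Preview what cephadm would create, without touching any disks:
ceph orch apply osd -i osd_spec_sized.yml --dry-run
```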
-Chip
I have been trying to create two virtual test clusters to learn about the
RGW multisite setting. So far, I have set up two small Nautilus
(v.14.2.16) clusters, designated one of them as the "master zone site" and
followed every step outlined in the doc (
https://docs.ceph.com/en/nautilus/radosgw/multisite/), including creating a
system user, updating the period, and restarting the rgw daemon. (For the
sake of simplicity, there is only one RGW daemon running on each site.)
Once I installed the RGW daemon on the secondary zone site, I tried pulling
the realm from the master zone cluster, but ended up with this:
```
$ radosgw-admin realm pull --url=http://<master zone gateway>:80
--access-key=<system_access_key> --secret=<system_secret_key>
request failed: (13) Permission denied
If the realm has been changed on the master zone, the master zone's gateway
may need to be restarted to recognize this user.
```
I tried adding the --rgw-realm=<realm set up in the primary site>, but the
result was the same. I restarted the rgw daemon on both sides -- that did
not help, either.
As far as I could tell, the output of all of the following on the master
zone side seems correct -- the realm, zonegroup, and zone I created are the
only ones, and they are set as the defaults.
```
radosgw-admin zone/zonegroup/realm list
radosgw-admin zone/zonegroup/realm get
```
On the "master zone" side, the rgw log shows
```
2021-01-22 13:34:48.404 7fb9ca89e700 1 ====== starting new request
req=0x7fb9ca897740 =====
2021-01-22 13:34:48.428 7fb9ca89e700 1 ====== req done req=0x7fb9ca897740
op status=0 http_status=403 latency=0.0240002s ======
2021-01-22 13:34:48.428 7fb9ca89e700 1 civetweb: 0x559d6509a000:
10.33.30.55 - - [22/Jan/2021:13:34:48 -0500] "GET /admin/realm HTTP/1.1"
403 318 - -
```
I am using Ubuntu 18.04, Ceph v.14.2.16, deployed using `ceph-deploy`.
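For completeness, these are the checks I plan to run next on the master zone side (the uid below is the one suggested in the multisite doc; yours may differ):
```
# Verify the system user exists on the master and its keys match the ones
# passed to `realm pull`:
radosgw-admin user info --uid=synchronization-user
# Verify which realm/period the master gateway is actually serving:
radosgw-admin period get
# After any realm change on the master, commit the period (and restart rgw):
radosgw-admin period update --commit
```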
Mami Hayashida
Research Computing Associate
Univ. of Kentucky ITS Research Computing Infrastructure
Hi Dan,
it is possible that the payload reduction also solved, or at least reduced, a really bad problem that looks related (beware, it's a long one): https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/FBGIJZNFG44… . Since reducing the payload size I still observe large peaks in the MON network activity, but the cluster no longer goes down like it did before. During these peaks, I see warnings like these:
2021-01-22 12:00:00.000102 [WRN] overall HEALTH_WARN 1 pools nearfull
2021-01-22 11:04:09.156796 [INF] Health check cleared: SLOW_OPS (was: 5 slow ops, oldest one blocked for 75 sec, mon.ceph-02 has slow ops)
2021-01-22 11:04:07.994416 [WRN] Health check update: 5 slow ops, oldest one blocked for 75 sec, mon.ceph-02 has slow ops (SLOW_OPS)
2021-01-22 11:04:01.469498 [WRN] Health check failed: 124 slow ops, oldest one blocked for 82 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops. (SLOW_OPS)
2021-01-22 11:00:00.000104 [WRN] overall HEALTH_WARN 1 pools nearfull
2021-01-22 10:36:44.576663 [INF] Health check cleared: SLOW_OPS (was: 25 slow ops, oldest one blocked for 42 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops.)
2021-01-22 10:36:38.543763 [WRN] Health check failed: 18 slow ops, oldest one blocked for 38 sec, daemons [mon.ceph-02,mon.ceph-03] have slow ops. (SLOW_OPS)
So, at least stuff is working.
I now lean towards the hypothesis that these outages were caused by some synchronisation process between MONs that became less problematic after reducing the payload size. I might be able to reduce my insanely long beacon time-outs again, but before doing so: do you know of any other communication parameters, similar to mon_sync_max_payload_size, that might be relevant to MON-[MON, MGR, OSD] communication?
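(For reference, this is how I have been listing candidate settings on a running MON; the socket name matches the MON's id:)
```
# Dump sync- and paxos-related settings via the MON's admin socket:
ceph daemon mon.ceph-02 config show | grep -E 'mon_sync|paxos'
```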
In general, I have the impression that, because of little bugs like these, the recommendation for production clusters should be raised to at least 5 MONs, so that one can afford to have 2 MONs temporarily out of quorum. I will upgrade our cluster to 5 MONs as soon as I can.
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Dan van der Ster <dan(a)vanderster.com>
Sent: 06 January 2021 20:53:14
To: Frank Schilder
Subject: Re: [ceph-users] Re: Storage down due to MON sync very slow
Yeah I was going to say -- ignore all of the rsync advice in that
thread, it is unnecessary.
Setting a small mon sync payload works like magic :)
-- dan
On Wed, Jan 6, 2021 at 8:49 PM Frank Schilder <frans(a)dtu.dk> wrote:
>
> OK, sorry for all my questions.
>
> Setting mon_sync_max_payload_size=4096 actually makes the MON sync in no time! Thank you so much :)
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder
> Sent: 06 January 2021 20:40:26
> To: Dan van der Ster
> Subject: Re: [ceph-users] Re: Storage down due to MON sync very slow
>
> OK, thanks a lot! I will try it now. Hope the cluster remains responsive.
>
> I'm wondering about this approach someone brought up in your thread:
>
> Eventually I stopped one MON, tarballed its database and used that to
> bring back the MON which was upgraded to 13.2.8
>
> That worked without any hiccups. The MON joined again within a few seconds.
>
> Stopping one MON for a copy would be a much shorter storage outage than the sync I'm doing. I guess it's the entire mon data directory that gets copied. I always wondered whether it contains data tied to a specific MON. If not, the copy approach could speed things up a lot. What do you think?
>
> Thanks again and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dan(a)vanderster.com>
> Sent: 06 January 2021 20:36:15
> To: Frank Schilder
> Subject: Re: [ceph-users] Re: Storage down due to MON sync very slow
>
> We have used mon_sync_max_payload_size 4096 on our largest most
> important prod cluster since that thread.
> The PR from Sage makes something like that the default anyway. (the PR
> counts keys rather than bytes, but the effect is the same).
>
> mon_sync_max_payload_size 4096 should not impact the speed of syncing
> -- it simply breaks the sync into smaller more manageable pieces.
> (Without this, if you have lots of keys in the mon db, in our case
> caused by lots of rbd snapshots, then syncing will never ever
> complete).
>
> -- dan
>
> On Wed, Jan 6, 2021 at 8:32 PM Frank Schilder <frans(a)dtu.dk> wrote:
> >
> > Hi Dan,
> >
> > thanks for that. Will it slow down or accelerate the syncing (I will read your post after this e-mail), or will it just allow I/O to continue and sync more in the background? The current value is
> >
> > mon_sync_max_payload_size 1048576
> >
> > Related to that, would building a MON store from the OSDs following https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/#… provide a head start? I'm not sure whether this procedure works on an active cluster.
> >
> > Will study your thread now ...
> >
> > Thanks again and best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Dan van der Ster <dan(a)vanderster.com>
> > Sent: 06 January 2021 20:26:46
> > To: Frank Schilder
> > Subject: Re: [ceph-users] Re: Storage down due to MON sync very slow
> >
> > (obviously just put that config in the ceph.conf on the mons if mimic
> > doesn't have ceph config... I don't quite remember.)
> >
> > -- dan
> >
> > On Wed, Jan 6, 2021 at 8:25 PM Dan van der Ster <dan(a)vanderster.com> wrote:
> > >
> > > This sounds a lot like an old thread of mine:
> > > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/M5ZKF7PTEO2…
> > >
> > > See the discussion about mon_sync_max_payload_size, and the PR that
> > > fixed this at some point in nautilus.
> > >
> > > Our workaround was:
> > >
> > > ceph config set mon mon_sync_max_payload_size 4096
> > >
> > > Hope that helps,
> > >
> > > Dan
> > >
> > >
> > > On Wed, Jan 6, 2021 at 8:18 PM Frank Schilder <frans(a)dtu.dk> wrote:
> > > >
> > > > Dear Dan,
> > > >
> > > > thanks for your fast response.
> > > >
> > > > Version: mimic 13.2.10.
> > > >
> > > > Here is the mon_status of the "new" MON during syncing:
> > > >
> > > > [root@ceph-01 ~]# ceph daemon mon.ceph-01 mon_status
> > > > {
> > > > "name": "ceph-01",
> > > > "rank": 0,
> > > > "state": "synchronizing",
> > > > "election_epoch": 0,
> > > > "quorum": [],
> > > > "features": {
> > > > "required_con": "144115188346404864",
> > > > "required_mon": [
> > > > "kraken",
> > > > "luminous",
> > > > "mimic",
> > > > "osdmap-prune"
> > > > ],
> > > > "quorum_con": "0",
> > > > "quorum_mon": []
> > > > },
> > > > "outside_quorum": [
> > > > "ceph-01"
> > > > ],
> > > > "extra_probe_peers": [],
> > > > "sync_provider": [],
> > > > "sync": {
> > > > "sync_provider": "mon.2 192.168.32.67:6789/0",
> > > > "sync_cookie": 33302773774,
> > > > "sync_start_version": 38355711
> > > > },
> > > > "monmap": {
> > > > "epoch": 3,
> > > > "fsid": "e4ece518-f2cb-4708-b00f-b6bf511e91d9",
> > > > "modified": "2019-03-14 23:08:34.717223",
> > > > "created": "2019-03-14 22:18:15.088212",
> > > > "features": {
> > > > "persistent": [
> > > > "kraken",
> > > > "luminous",
> > > > "mimic",
> > > > "osdmap-prune"
> > > > ],
> > > > "optional": []
> > > > },
> > > > "mons": [
> > > > {
> > > > "rank": 0,
> > > > "name": "ceph-01",
> > > > "addr": "192.168.32.65:6789/0",
> > > > "public_addr": "192.168.32.65:6789/0"
> > > > },
> > > > {
> > > > "rank": 1,
> > > > "name": "ceph-02",
> > > > "addr": "192.168.32.66:6789/0",
> > > > "public_addr": "192.168.32.66:6789/0"
> > > > },
> > > > {
> > > > "rank": 2,
> > > > "name": "ceph-03",
> > > > "addr": "192.168.32.67:6789/0",
> > > > "public_addr": "192.168.32.67:6789/0"
> > > > }
> > > > ]
> > > > },
> > > > "feature_map": {
> > > > "mon": [
> > > > {
> > > > "features": "0x3ffddff8ffacfffb",
> > > > "release": "luminous",
> > > > "num": 1
> > > > }
> > > > ],
> > > > "mds": [
> > > > {
> > > > "features": "0x3ffddff8ffacfffb",
> > > > "release": "luminous",
> > > > "num": 2
> > > > }
> > > > ],
> > > > "client": [
> > > > {
> > > > "features": "0x2f018fb86aa42ada",
> > > > "release": "luminous",
> > > > "num": 1
> > > > },
> > > > {
> > > > "features": "0x3ffddff8eeacfffb",
> > > > "release": "luminous",
> > > > "num": 1
> > > > },
> > > > {
> > > > "features": "0x3ffddff8ffacfffb",
> > > > "release": "luminous",
> > > > "num": 17
> > > > }
> > > > ]
> > > > }
> > > > }
> > > >
> > > > I'm a bit surprised that the other 2 MONs don't remain in quorum until this MON has caught up. Is there any way to monitor the syncing progress? Right now I need to interrupt regularly to allow some I/O, but I have no clue how long I need to wait.
> > > >
> > > > Thanks for your help!
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Dan van der Ster <dan(a)vanderster.com>
> > > > Sent: 06 January 2021 20:16:44
> > > > To: Frank Schilder
> > > > Cc: Ceph Users
> > > > Subject: Re: [ceph-users] Re: Storage down due to MON sync very slow
> > > >
> > > > Which version of Ceph are you running?
> > > >
> > > > .. dan
> > > >
> > > >
> > > > On Wed, Jan 6, 2021, 8:14 PM Frank Schilder <frans(a)dtu.dk> wrote:
> > > > In the output of the MON I see slow ops warnings:
> > > >
> > > > debug 2021-01-06 20:12:48.854 7f1a3d29f700 -1 mon.ceph-01@0(synchronizing) e3 get_health_metrics reporting 20 slow ops, oldest is log(1 entries from seq 1 at 2021-01-06 20:00:12.014861)
> > > >
> > > > There appears to be no progress on this operation, it is stuck.
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
> > > >
> > > > ________________________________________
> > > > From: Frank Schilder <frans(a)dtu.dk>
> > > > Sent: 06 January 2021 20:11:25
> > > > To: ceph-users(a)ceph.io
> > > > Subject: [ceph-users] Storage down due to MON sync very slow
> > > >
> > > > Dear all,
> > > >
> > > > I had to restart one out of 3 MONs on an empty MON DB dir. It is in state syncing right now, but I'm not sure if there is any progress. The cluster is completely unresponsive even though I have 2 healthy MONs. Is there any way to sync the DB directory faster and/or without downtime?
> > > >
> > > > Thanks a lot!
> > > >
> > > > Best regards,
> > > > =================
> > > > Frank Schilder
> > > > AIT Risø Campus
> > > > Bygning 109, rum S14
Hello everyone,
I'm trying to add an OSD node to my current cluster. I created an LVM volume on this node to use for the OSD.
My current Ceph version is 14.2.6, running on RHEL 7.
However, I got an error when trying to activate the OSD, and I'm confused by the output. I tried to see what really happened, but I don't know how to pinpoint the issue.
I would much appreciate any help clearing up the confusion.
* Why did the "mon getmap" command send the result to /dev/stderr? I saw the monmap file having value and it looked like it got the monmap, is there any way to check if the monmap having problem?
* What caused "_read_fsid unparsable uuid"?
Sincerely,
Hai
[xxx@xxx.com@xxx430 ~]$ sudo ceph-volume --cluster aap-storage lvm create --data /dev/kubernetes/ceph-osd
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/ceph --cluster aap-storage --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/aap-storage.keyring -i - osd new c84f74b3-9a3e-4cd9-a4ce-4c1819a56f90
Running command: /bin/ceph-authtool --gen-print-key
Running command: /bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/aap-storage-5
Running command: /sbin/restorecon /var/lib/ceph/osd/aap-storage-5
Running command: /bin/chown -h ceph:ceph /dev/kubernetes/ceph-osd
Running command: /bin/chown -R ceph:ceph /dev/dm-9
Running command: /bin/ln -s /dev/kubernetes/ceph-osd /var/lib/ceph/osd/aap-storage-5/block
Running command: /bin/ceph --cluster aap-storage --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/aap-storage.keyring mon getmap -o /var/lib/ceph/osd/aap-storage-5/activate.monmap
stderr: got monmap epoch 8
Running command: /bin/ceph-authtool /var/lib/ceph/osd/aap-storage-5/keyring --create-keyring --name osd.5 --add-key AQB2RAhgnkcLIBAAdcdR5N4YKzSJmfoA6G6XvA==
stdout: creating /var/lib/ceph/osd/aap-storage-5/keyring
added entity osd.5 auth(key=AQB2RAhgnkcLIBAAdcdR5N4YKzSJmfoA6G6XvA==)
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/aap-storage-5/keyring
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/aap-storage-5/
Running command: /bin/ceph-osd --cluster aap-storage --osd-objectstore bluestore --mkfs -i 5 --monmap /var/lib/ceph/osd/aap-storage-5/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/aap-storage-5/ --osd-uuid c84f74b3-9a3e-4cd9-a4ce-4c1819a56f90 --setuser ceph --setgroup ceph
stderr: 2021-01-20 15:55:52.319 7f0970d3ba80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: kubernetes/ceph-osd
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/aap-storage-5
Running command: /bin/ceph-bluestore-tool --cluster=aap-storage prime-osd-dir --dev /dev/kubernetes/ceph-osd --path /var/lib/ceph/osd/aap-storage-5 --no-mon-config
Running command: /bin/ln -snf /dev/kubernetes/ceph-osd /var/lib/ceph/osd/aap-storage-5/block
Running command: /bin/chown -h ceph:ceph /var/lib/ceph/osd/aap-storage-5/block
Running command: /bin/chown -R ceph:ceph /dev/dm-9
Running command: /bin/chown -R ceph:ceph /var/lib/ceph/osd/aap-storage-5
Running command: /bin/systemctl enable ceph-volume@lvm-5-c84f74b3-9a3e-4cd9-a4ce-4c1819a56f90
stderr: Created symlink from /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-5-c84f74b3-9a3e-4cd9-a4ce-4c1819a56f90.service to /usr/lib/systemd/system/ceph-volume@.service.
Running command: /bin/systemctl enable --runtime ceph-osd@5
stderr: Created symlink from /run/systemd/system/ceph-osd.target.wants/ceph-osd@5.service to /usr/lib/systemd/system/ceph-osd@.service.
Running command: /bin/systemctl start ceph-osd@5
stderr: Job for ceph-osd@5.service failed because the control process exited with error code. See "systemctl status ceph-osd@5.service" and "journalctl -xe" for details.
--> Was unable to complete a new OSD, will rollback changes
Running command: /bin/ceph --cluster aap-storage --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/aap-storage.keyring osd purge-new osd.5 --yes-i-really-mean-it
stderr: purged osd.5
--> RuntimeError: command returned non-zero exit status: 1
So I broke the procedure into steps and ran them one by one. I saw the problem happen in BlueStore.
[xxx@xxx.com@xxx430 ~]$ sudo ceph-osd -c /etc/ceph/${CLUSTER_NAME}.conf -k /etc/ceph/${CLUSTER_NAME}.client.admin.keyring -i $ID --mkfs \
> --monmap /var/lib/ceph/osd/${CLUSTER_NAME}-$ID/activate.monmap --osd-uuid $UUID --no-mon-config \
> --osd-data /var/lib/ceph/osd/${CLUSTER_NAME}-$ID/ --setuser ceph --setgroup ceph
2021-01-21 16:59:14.907 7f724e882a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5//block) _read_bdev_label failed to open /var/lib/ceph/osd/aap-storage-5//block: (13) Permission denied
2021-01-21 16:59:14.908 7f724e882a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5//block) _read_bdev_label failed to open /var/lib/ceph/osd/aap-storage-5//block: (13) Permission denied
2021-01-21 16:59:14.908 7f724e882a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5/) _read_fsid unparsable uuid
2021-01-21 16:59:14.908 7f724e882a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5/) _setup_block_symlink_or_file failed to open block file: (13) Permission denied
2021-01-21 16:59:14.908 7f724e882a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5/) mkfs failed, (13) Permission denied
2021-01-21 16:59:14.908 7f724e882a80 -1 OSD::mkfs: ObjectStore::mkfs failed with error (13) Permission denied
2021-01-21 16:59:14.908 7f724e882a80 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/aap-storage-5/: (13) Permission denied
Because of the permission problem, I avoided using the ceph user for the ceph service; however, the problem still persisted.
[xxx@xxx.com@xxx430 ~]$ sudo ceph-osd -c /etc/ceph/${CLUSTER_NAME}.conf -k /etc/ceph/${CLUSTER_NAME}.client.admin.keyring -i $ID --mkfs --monmap /var/lib/ceph/osd/${CLUSTER_NAME}-$ID/activate.monmap --osd-uuid $UUID --no-mon-config --osd-data /var/lib/ceph/osd/${CLUSTER_NAME}-$ID/
2021-01-21 17:00:08.583 7f3acc546a80 -1 bluestore(/var/lib/ceph/osd/aap-storage-5/) _read_fsid unparsable uuid
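In case it helps to narrow this down, a sketch of what I would check next (same paths as in the logs above):
```
# Is the block device behind the symlink readable/writable by ceph?
ls -lL /dev/kubernetes/ceph-osd /dev/dm-9
# Can the ceph user write inside the tmpfs OSD dir at all?
sudo -u ceph touch /var/lib/ceph/osd/aap-storage-5/permtest && echo writable
```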
Hi all,
During rejoin an MDS can sometimes go OOM if the openfiles table is too large.
The workaround has been described by ceph devs as "rados rm -p
cephfs_metadata mds0_openfiles.0".
On our cluster we have several such objects for rank 0:
mds0_openfiles.0 exists with size: 199978
mds0_openfiles.1 exists with size: 153650
mds0_openfiles.2 exists with size: 40987
mds0_openfiles.3 exists with size: 7746
mds0_openfiles.4 exists with size: 413
If we suffer such an OOM, do we need to rm *all* of those objects or
only the `.0` object?
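(For reference, the sizes above can be reproduced along these lines, assuming the default metadata pool name:)
```
# Enumerate and stat every rank-0 openfiles object in the metadata pool:
for obj in $(rados -p cephfs_metadata ls | grep '^mds0_openfiles'); do
    rados -p cephfs_metadata stat "$obj"
done
```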
Best Regards,
Dan
Hi,
What limits are there on the "reasonable size" of an rbd?
E.g. when I try to create a 1 PB rbd with default 4 MiB objects on my
octopus cluster:
$ rbd create --size 1P --data-pool rbd.ec rbd.meta/fs
2021-01-20T18:19:35.799+1100 7f47a99253c0 -1 librbd::image::CreateRequest: validate_layout: image size not compatible with object map
...which comes from:
== src/librbd/image/CreateRequest.cc
bool validate_layout(CephContext *cct, uint64_t size, file_layout_t &layout) {
  if (!librbd::ObjectMap<>::is_compatible(layout, size)) {
    lderr(cct) << "image size not compatible with object map" << dendl;
    return false;
  }

== src/librbd/ObjectMap.cc
template <typename I>
bool ObjectMap<I>::is_compatible(const file_layout_t& layout, uint64_t size) {
  uint64_t object_count = Striper::get_num_objects(layout, size);
  return (object_count <= cls::rbd::MAX_OBJECT_MAP_OBJECT_COUNT);
}

== src/cls/rbd/cls_rbd_types.h
static const uint32_t MAX_OBJECT_MAP_OBJECT_COUNT = 256000000;
For 4 MiB objects that object count equates to just over 976 TiB.
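(Sanity check of that figure:)
```
# 256,000,000 objects x 4 MiB each, expressed in TiB:
echo $(( 256000000 * 4 / 1024 / 1024 ))   # prints 976
```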
Is there any particular reason for that MAX_OBJECT_MAP_OBJECT_COUNT, or is
it just "this is crazy large; if you're trying to go over this you're doing
something wrong, rethink your life..."?
Yes, I realise I can increase the size of the objects to get a larger rbd,
or drop the object-map support (and the fast-diff that goes along with
it).
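For example, something like this should get past the check (a sketch; I haven't run it):
```
# 8 MiB objects double the object-map ceiling to roughly 1.9 PiB:
rbd create --size 1P --object-size 8M --data-pool rbd.ec rbd.meta/fs
```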
I'm SO glad I found this limit now, rather than starting on a smaller rbd
and finding the limit when I tried to grow the rbd underneath a rapidly
filling filesystem.
What else should I know?
Background: I currently have nearly 0.5 PB on XFS (on lvm / raid6) and ZFS
that I'm looking to move over to ceph. XFS is a requirement, for the
reflinking (sadly not yet available in CephFS: https://tracker.ceph.com/issues/1680).
The recommendation for XFS is to start larger, on a thin-provisioned store
(hello rbd!), rather than start smaller and grow as needed - e.g. see the
thread surrounding:
https://www.spinics.net/lists/linux-xfs/msg20099.html
Rather than a single large rbd, should I be looking at multiple smaller
rbds linked together using lvm or somesuch? What are the tradeoffs?
And whilst we're here... for an rbd with the data on an erasure-coded
pool, how do you calculate the amount of rbd metadata required if/when the
rbd data is fully allocated?
Cheers,
Chris